S3

class verta.dataset.S3(paths, enable_mdb_versioning=False)

Captures metadata about S3 objects.

If an S3 object requires additional information to identify it, such as its version ID, describe it with S3.location() and pass the result in paths.

Parameters:
  • paths (list) – List of S3 URLs of the form "s3://<bucket-name>" or "s3://<bucket-name>/<key>", or objects returned by S3.location().

  • enable_mdb_versioning (bool, default False) – Whether to upload the data itself to ModelDB to enable managed data versioning.

Examples

from verta.dataset import S3
dataset1 = S3([
    "s3://verta-starter/census-train.csv",
    "s3://verta-starter/census-test.csv",
])
dataset2 = S3([
    "s3://verta-starter",
])
dataset3 = S3([
    S3.location("s3://verta-starter/census-train.csv",
                version_id="xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"),
])

dataset += other

Updates the dataset, adding paths from other.

dataset + other + ...

Returns a new dataset with paths from the dataset and all others.
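The difference between the two operators can be sketched with a small stand-in class. PathSet below is hypothetical and is not part of verta; it only models the in-place versus new-object behavior described above, and says nothing about how verta handles duplicate paths.

```python
class PathSet:
    """Hypothetical stand-in for a dataset; NOT part of verta."""

    def __init__(self, paths):
        self.paths = list(paths)

    def __iadd__(self, other):
        # `+=` updates this object in place and returns it.
        self.paths.extend(other.paths)
        return self

    def __add__(self, other):
        # `+` leaves both operands untouched and returns a new object.
        return PathSet(self.paths + other.paths)


train = PathSet(["s3://verta-starter/census-train.csv"])
test = PathSet(["s3://verta-starter/census-test.csv"])

both = train + test   # new object; train and test are unchanged
alias = train
train += test         # same object as before, now holding both paths
```

Because `+` returns a new object, it can be chained as in dataset + other + ..., with each step producing an intermediate result.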

add(paths)

Adds paths to this dataset.

Parameters:

paths (list) – List of S3 URLs of the form "s3://<bucket-name>" or "s3://<bucket-name>/<key>", or objects returned by S3.location().

static blob_msg_to_object(blob_msg)

Deserialize a blob protobuf message into an instance.

Parameters:

blob_msg (VersioningService_pb2.Blob) – Blob protobuf message to deserialize.

Returns:

instance of subclass of Blob

download(component_path=None, download_to_path=None)

Downloads component_path from this dataset. Only available if ModelDB-managed versioning was enabled (see enable_mdb_versioning).

Parameters:
  • component_path (str, optional) – Original path of the file or directory in this dataset to download. If not provided, all files will be downloaded.

  • download_to_path (str, optional) – Path to download to. If not provided, the file(s) will be downloaded into a new path in the current directory. If provided and the path already exists, it will be overwritten.

Returns:

downloaded_to_path (str) – Absolute path where file(s) were downloaded to. Matches download_to_path if it was provided as an argument.
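A minimal sketch of calling download() for a single component, using only the signature documented above. The helper name download_component and the target_dir argument are illustrative assumptions, and component_path is assumed to be one of the values returned by list_paths().

```python
import os


def download_component(dataset, component_path, target_dir):
    """Download one component into target_dir, keeping its base file name.

    `dataset` is any object exposing the documented
    download(component_path, download_to_path) method.
    """
    # Hypothetical naming choice: reuse the last path segment as the file name.
    target = os.path.join(target_dir, os.path.basename(component_path))
    # Per the documentation, an existing path at `target` is overwritten,
    # and the absolute path that was written is returned.
    return dataset.download(component_path, download_to_path=target)
```

Passing download_to_path explicitly makes the destination predictable; omitting it lets the client choose a new path in the current directory.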

list_components()

Returns the components in this dataset.

Returns:

components (list of Component) – Components.

list_paths()

Returns the paths of all components in this dataset.

Returns:

component_paths (list of str) – Paths of all components.
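Since list_paths() returns plain strings, the result can be filtered with ordinary string operations. The helper below is a hypothetical convenience, not part of verta, that selects component paths by suffix.

```python
def paths_with_suffix(dataset, suffix):
    """Return the dataset's component paths that end with `suffix`, sorted.

    `dataset` is any object exposing the documented list_paths() method.
    """
    return sorted(p for p in dataset.list_paths() if p.endswith(suffix))
```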

static location(path, version_id=None)

Returns an object describing an S3 location that can be passed into a new S3.

Parameters:
  • path (str) – S3 URL of the form "s3://<bucket-name>" or "s3://<bucket-name>/<key>".

  • version_id (str, optional) – ID of an S3 object version.

Returns:

S3Location – A location in S3.

Raises:

ValueError – If version_id is provided but path represents a bucket rather than a single object.
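The rule behind this ValueError can be illustrated in plain Python. The functions split_s3_url and check_location below are hypothetical and are not verta's implementation; they only show how the two URL forms differ and why a version ID makes sense only when the URL names an object key.

```python
def split_s3_url(url):
    """Split "s3://<bucket-name>[/<key>]" into (bucket, key); key may be None."""
    prefix = "s3://"
    if not url.startswith(prefix):
        raise ValueError("expected an S3 URL starting with s3://, got {!r}".format(url))
    bucket, _, key = url[len(prefix):].partition("/")
    if not bucket:
        raise ValueError("missing bucket name in {!r}".format(url))
    # An empty key means the URL refers to the bucket as a whole.
    return bucket, (key or None)


def check_location(url, version_id=None):
    """Reject a version_id when the URL names a bucket rather than an object."""
    bucket, key = split_s3_url(url)
    if version_id is not None and key is None:
        raise ValueError("version_id requires a single object, not a bucket")
    return bucket, key, version_id
```

This sketch treats any non-empty key as a single object; it does not attempt to distinguish an object from a key prefix.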

classmethod with_spark(sc, paths)

Creates a dataset blob with a SparkContext instance.

Parameters:
  • sc (pyspark.SparkContext) – SparkContext instance.

  • paths (list of str) – List of paths to binary input data file(s).

Returns:

dataset (dataset) – Dataset blob capturing the metadata of the binary files.