API: Data Handles

Overview

A data handle is a core concept in RapidFire AI for managing data files on your cluster in an organized manner. Please review the concepts in API: Data Ingestion and Locators first.

A data handle corresponds to a set of ESFs and object directories (to be) retrieved from remote storage based on a given locator dictionary, optional sampling, and optional subset of column names.

The DataHandle class has the functions and semantics detailed below. We illustrate each with usage from the COCO tutorial notebook.

Data Handle Constructor

Constructor to instantiate a new data handle object. Note that this does not initiate downloading of files automatically.

__init__(self, locators: Dict, download_columns: List | None = None, fraction: float = 1.0, seed: int = 42) -> None
Parameters:
  • locators (Dict[str, str]) – Dictionary with the relevant ingestion locators on S3

  • download_columns (List[str], optional) – Names of a subset of object columns in the Example Structure File to download

  • fraction (float, optional) – Sampling fraction of the dataset (default: 1.0)

  • seed (int, optional) – Seed for random shuffling of the train set (default: 42)

Returns:

None

Return type:

None

Example:

# Obtain 1% sample of all partitions of COCO detection + segmentation dataset
>>> dh_coco = DataHandle(COCODetSegLocators, download_columns=["file_name"], fraction=0.01)
Data Handle dh_coco created

Notes:

Within a session you can instantiate as many data handle objects as you want. They can overlap however you want, including being redundant copies of the same base raw files.

A given data handle can be reused across any of the run_fit() or other functions of an experiment or even across experiments. Note that you must explicitly run download() as explained below to be able to use a data handle for any experiment operation.

The optional parameter download_columns enables you to download only a subset of ESF object columns. So, it is useful for feature subset exploration in your experimentation.

The optional parameters fraction (and seed) enable you to download a uniformly random subset of examples from the base data. As of this writing, this fraction is shared across all data partitions (train, validation, test, predict) listed in the locators. You can, of course, adjust the ESFs themselves to effect other forms of sampling.
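The interplay of fraction and seed can be pictured with a small standalone sketch. Here sample_rows is a hypothetical helper for illustration only, not part of the RapidFire AI API; the point is that a fixed seed makes the uniform sample reproducible across invocations.

```python
import random

def sample_rows(rows, fraction=1.0, seed=42):
    # Hypothetical helper illustrating seeded uniform sampling;
    # not part of the RapidFire AI API.
    rng = random.Random(seed)
    k = max(1, round(len(rows) * fraction))
    return sorted(rng.sample(rows, k))

train = list(range(1000))
s1 = sample_rows(train, fraction=0.01)
s2 = sample_rows(train, fraction=0.01)
assert s1 == s2       # same seed -> identical sample
assert len(s1) == 10  # 1% of 1000 examples
```

Because the same fraction is shared across all partitions, each partition would be sampled this way with its own row set.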

Download

Download all the files and folders given in the locators of this data handle. If sampling and/or a subset of columns was specified, those settings are factored in accordingly.

download(self) -> None
Returns:

None

Return type:

None

Example:

# Download the above COCO data handle
>>> dh_coco.download()
Downloading files ...

Notes:

For datasets that are not in-situ, progress bars for the download of the object files are printed in the notebook. It might take 1-2 minutes for the progress bars to appear because RapidFire AI must first process the downloaded ESFs before downloading objects.

Note that download() is a blocking operation, i.e., you cannot run any other operation (experiment op or another data handle download) on the cluster while this operation is in progress.

All downloaded files will persist on the cluster through any number of cluster stop-restart cycles (see cluster ops <clusterops.rst> for details).

Atomicity and Idempotency:

This operation is atomic, i.e., either all locators get downloaded or the data handle will be marked as not ready for use. So, if something fails during the first download, an error message will be printed. You must rerun download() and ensure it succeeds, or simply create a different data handle.

Likewise, this operation is idempotent after first success, i.e., any future invocations will not do anything further. A class-internal bool is_downloaded is used to track this success status.

If you started a download by mistake and/or prefer to cancel it instead of waiting for it to finish, stop the cell running this function and then in a new cell run cancel_current(). Note that partially downloaded files may then be present on the cluster. We recommend explicitly invoking delete_local() (see below) to clean up wasted storage.
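The success-tracking pattern described above can be sketched in isolation. The class below is an illustrative mock, not the real DataHandle implementation: is_downloaded is set only after all work completes, and later calls become no-ops.

```python
class MockHandle:
    # Illustrative mock of the download() success-tracking pattern;
    # not the real RapidFire AI DataHandle.
    def __init__(self):
        self.is_downloaded = False
        self.transfer_count = 0

    def download(self):
        if self.is_downloaded:
            return  # idempotent after first success: no further work
        self.transfer_count += 1  # stands in for fetching all locators
        # mark success only once every locator has been fetched (atomicity)
        self.is_downloaded = True

h = MockHandle()
h.download()
h.download()  # no-op: already downloaded
assert h.transfer_count == 1
```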

Delete Local

Delete the local copy on the cluster of all the files and folders downloaded as part of this data handle (if any). The original files on remote storage are not affected.

delete_local(self) -> None
Returns:

None

Return type:

None

Example:

# Delete the local sampled files of the COCO dataset
>>> dh_coco.delete_local()
Deleting files ...

Notes:

This operation is not blocking and not atomic. So, if the first deletion does not finish successfully for whatever reason, we recommend running it again explicitly.

The primary use of this function is to save local storage space on your cluster. Running it on data handles you no longer need reduces the chance of hitting your account's cluster storage limit.
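Since delete_local() is neither blocking nor atomic, the "rerun until it succeeds" advice above resembles the standalone sketch below. Here delete_tree is a hypothetical helper, not the library's implementation; it reports whether anything remains so the caller can retry.

```python
import shutil
import tempfile
from pathlib import Path

def delete_tree(path) -> bool:
    # Hypothetical best-effort cleanup: returns True only if nothing remains,
    # mirroring the "run it again if it did not finish" advice above.
    p = Path(path)
    shutil.rmtree(p, ignore_errors=True)
    return not p.exists()

local_copy = Path(tempfile.mkdtemp()) / "dh_files"
local_copy.mkdir()
(local_copy / "part-0.esf").write_text("stub")

while not delete_tree(local_copy):  # rerun until fully gone
    pass
assert not local_copy.exists()
```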

List

List all the data handles created on this cluster. This lets you see the internally created IDs of all data handles, along with their constructor arguments so you can identify them.

list(self) -> pd.DataFrame
Returns:

A pandas DataFrame with the following columns: handle_id, locators, download_columns, fraction, seed, is_downloaded

Return type:

pd.DataFrame

Example:

# List the data handles on this cluster
>>> DataHandle.list()

Notes:

This function is useful both to see the metadata of all data handles on the cluster and to re-obtain a lost in-memory data handle object, say, due to Internet disconnection or Jupyter reload. Pass the ID value listed here to get() (see below) to obtain that data handle object.
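The list()-then-get() recovery flow can be mimicked with a small in-memory registry. The registry and helper functions below are hypothetical stand-ins for the cluster-side bookkeeping, not the RapidFire AI implementation; only the DataFrame column schema follows the one documented above.

```python
import pandas as pd

# Hypothetical in-memory stand-in for the cluster-side handle registry.
_registry = {}

def register_handle(handle_id, meta):
    _registry[handle_id] = meta

def list_handles() -> pd.DataFrame:
    # Same columns as the documented DataHandle.list() schema.
    cols = ["handle_id", "locators", "download_columns",
            "fraction", "seed", "is_downloaded"]
    rows = [{"handle_id": hid, **meta} for hid, meta in _registry.items()]
    return pd.DataFrame(rows, columns=cols)

def get_handle(handle_id):
    return _registry[handle_id]

register_handle("2", {"locators": {"train": "s3://..."},
                      "download_columns": ["file_name"],
                      "fraction": 0.01, "seed": 42, "is_downloaded": True})
df = list_handles()
dh2 = get_handle(df.loc[0, "handle_id"])  # re-obtain a "lost" handle by its listed ID
```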

Get

Given a data handle ID, obtain an instance of the data handle object that was created before on this cluster.

get(self, handle_id: str) -> DataHandle
Parameters:

handle_id (str) – ID of a previously created data handle object

Returns:

The data handle object corresponding to the given ID

Return type:

DataHandle

Example:

# Get the data handle with ID "2"
>>> dh2 = DataHandle.get("2")

Notes:

This function is useful to re-obtain a lost in-memory data handle object, say, due to Internet disconnection or Jupyter reload. Pass an ID value noted from list() above to get that data handle object.