API: Data Handles
=================

Overview
--------

Data Handle is a core concept in RapidFire AI to manage data files on your cluster in an organized manner. Please review the concepts in :doc:`API: Data Ingestion and Locators` first.

A data handle corresponds to a set of ESFs and object directories (to be) retrieved from remote storage based on a given locator dictionary, optional sampling, and an optional subset of column names. The Data Handle class has the functions and semantics detailed below. We illustrate each with usage from the COCO tutorial notebook.

Data Handle Constructor
-----------------------

Constructor to instantiate a new data handle object. Note that this does *not* initiate downloading of files automatically.

.. py:function:: __init__(self, locators: Dict, download_columns: List | None, fraction: float = 1.0, seed: int = 42) -> None

   :param locators: Dictionary with relevant ingestion locators on S3
   :type locators: Dict[str, str]
   :param download_columns: List of (subset of) names of object columns in the Example Structure File
   :type download_columns: List[str], optional
   :param fraction: Sampling fraction of the dataset
   :type fraction: float, optional
   :param seed: Seed for random shuffling of the train set (default: 42)
   :type seed: int, optional
   :return: None
   :rtype: None

**Example:**

.. code-block:: python

   # Obtain a 1% sample of all partitions of the COCO detection + segmentation dataset
   >>> dh_coco = DataHandle(COCODetSegLocators, download_columns=["file_name"], fraction=0.01)
   Data Handle dh_coco created

**Notes:**

Within a session you can instantiate as many data handle objects as you want. They can overlap however you want, including being redundant copies of the same base raw files. A given data handle can be reused across any of the :func:`run_fit()` or other functions of an experiment, or even across experiments. Note that you must explicitly run :func:`download()` as explained below to be able to use a data handle for any experiment operation.
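To build intuition for how :code:`fraction` and :code:`seed` interact, here is a minimal, hypothetical sketch of seeded uniform sampling in plain Python. It is not the library's implementation; the helper :code:`sample_fraction` is invented for illustration only.

.. code-block:: python

   import random

   def sample_fraction(example_ids, fraction=1.0, seed=42):
       """Toy stand-in for seeded fractional sampling: the same
       seed over the same base data always yields the same subset."""
       rng = random.Random(seed)
       k = int(len(example_ids) * fraction)  # e.g., 1% of the examples
       return sorted(rng.sample(example_ids, k))

   ids = list(range(1000))
   subset_a = sample_fraction(ids, fraction=0.01, seed=42)
   subset_b = sample_fraction(ids, fraction=0.01, seed=42)
   assert subset_a == subset_b   # deterministic given the same seed
   assert len(subset_a) == 10    # 1% of 1000 examples

Fixing the seed is what makes a sampled data handle reproducible across sessions and experiments.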
The optional parameter :code:`download_columns` enables you to download only a subset of ESF object columns, which is useful for feature subset exploration in your experimentation. The optional parameters :code:`fraction` (and :code:`seed`) enable you to download a uniformly random subset of examples from the base data. As of this writing, this fraction is shared across all data partitions (train, validation, test, predict) listed in the locators. You can, of course, adjust the ESFs themselves to effect other forms of sampling.

Download
--------

Download all the files and folders given in the locators of this data handle. If sampling and/or a subset of download columns was specified, those will be factored in accordingly.

.. py:function:: download(self) -> None

   :return: None
   :rtype: None

**Example:**

.. code-block:: python

   # Download the above COCO data handle
   >>> dh_coco.download()
   Downloading files ...

**Notes:**

For datasets that are not in-situ, it will print progress bars on the notebook for the download of the object files. It might take 1-2 minutes for the progress bars to appear because RapidFire AI needs to process the downloaded ESFs first before downloading objects.

Note that :func:`download()` is a *blocking operation*, i.e., you cannot run any other operation (experiment op or another data handle download) on the cluster while this operation is in progress. All downloaded files will persist on the cluster through any number of cluster stop-restart cycles (see `cluster ops` for details).

**Atomicity and Idempotency:**

This operation is *atomic*, i.e., either all locators get downloaded or the data handle will be marked as not ready for use. So, if something fails during the first download, an error message will be printed. You must rerun download and ensure it succeeds, or just create a different data handle. Likewise, this operation is *idempotent* after first success, i.e., any future invocations will not do anything further.
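These semantics can be illustrated with a toy class; this is a hypothetical sketch of the atomic/idempotent pattern, not RapidFire AI's internals.

.. code-block:: python

   class ToyHandle:
       """Toy illustration of atomic, idempotent download semantics."""

       def __init__(self, locators):
           self.locators = locators
           self.is_downloaded = False  # not ready until everything succeeds

       def download(self, fetch):
           if self.is_downloaded:
               return  # idempotent: no-op after first success
           try:
               for loc in self.locators:
                   fetch(loc)  # download every locator
           except Exception as err:
               print(f"Download failed: {err}")  # handle stays not ready
               return
           self.is_downloaded = True  # atomic: marked ready only if all succeed

   h = ToyHandle(["s3://bucket/esf.csv", "s3://bucket/images/"])
   calls = []
   h.download(calls.append)  # first call fetches both locators
   h.download(calls.append)  # second call does nothing further
   assert h.is_downloaded and len(calls) == 2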
A class-internal bool :code:`is_downloaded` is used to track this success status.

If you started a download by mistake and/or prefer to cancel it instead of waiting for it to finish, stop the cell running this function and then in a new cell run :func:`cancel_current()`. Note that partially downloaded files may then be present on the cluster. We recommend explicitly invoking :func:`delete_local()` (see below) to clean up wasted storage.

Delete Local
------------

Delete the local copy on the cluster of all the files and folders downloaded as part of this data handle (if any). The original files on remote storage are not affected.

.. py:function:: delete_local(self) -> None

   :return: None
   :rtype: None

**Example:**

.. code-block:: python

   # Delete the local sampled files of the COCO dataset
   >>> dh_coco.delete_local()
   Deleting files ...

**Notes:**

This operation is not blocking and not atomic. So, if the first deletion does not finish successfully for whatever reason, we recommend running it again explicitly. The primary use of this function is to save local storage space on your cluster. Running this on data handles that you no longer need will reduce the chances of hitting your account's cluster storage limit.

List
----

List all the data handles created on this cluster. This enables you to see the internally created IDs of all data handles, as well as their constructor arguments, to be able to identify them.

.. py:function:: list() -> pd.DataFrame

   :return: A pandas DataFrame with the following columns: handle_id, locators, download_columns, fraction, seed, is_downloaded
   :rtype: pd.DataFrame

**Example:**

.. code-block:: python

   # List the data handles on this cluster
   >>> DataHandle.list()

**Notes:**

This function is useful both to see the metadata of all data handles on the cluster and to re-obtain a lost in-memory data handle object, say, due to Internet disconnection or Jupyter reload.
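Because :func:`list()` returns a plain pandas DataFrame, you can filter it like any other frame. In the sketch below the frame is built by hand to mirror the documented columns (the values are made up for illustration):

.. code-block:: python

   import pandas as pd

   # Hand-built stand-in for DataHandle.list() output, using its documented columns
   handles = pd.DataFrame(
       {
           "handle_id": ["1", "2"],
           "locators": ["{...}", "{...}"],
           "download_columns": [["file_name"], None],
           "fraction": [0.01, 1.0],
           "seed": [42, 42],
           "is_downloaded": [True, False],
       }
   )

   # Handles that still need a download() before they can be used
   pending = handles[~handles["is_downloaded"]]
   assert pending["handle_id"].tolist() == ["2"]

The same filtering style helps you spot handles eligible for :func:`delete_local()` cleanup.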
Pass the ID value listed here to :func:`get()` (see below) to obtain that data handle object.

Get
---

Given a data handle ID, obtain an instance of the data handle object that was created before on this cluster.

.. py:function:: get(handle_id: str) -> DataHandle

   :param handle_id: ID of a previously created data handle object
   :type handle_id: str
   :return: The data handle object with the given ID
   :rtype: DataHandle

**Example:**

.. code-block:: python

   # Get the data handle with ID "2"
   >>> dh2 = DataHandle.get("2")

**Notes:**

This function is useful to re-obtain a lost in-memory data handle object, say, due to Internet disconnection or Jupyter reload. Pass an ID value noted from :func:`list()` above to get that data handle object.
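The list/get recovery pattern amounts to a cluster-side registry keyed by handle ID. The following is a minimal, hypothetical sketch of that pattern in plain Python, not the library's implementation:

.. code-block:: python

   class ToyRegistry:
       """Toy registry illustrating the list()/get() recovery pattern."""

       def __init__(self):
           self._handles = {}

       def create(self, locators):
           handle_id = str(len(self._handles) + 1)  # internally assigned ID
           self._handles[handle_id] = {"handle_id": handle_id, "locators": locators}
           return handle_id

       def list(self):
           return list(self._handles.values())  # metadata of all handles

       def get(self, handle_id):
           return self._handles[handle_id]  # re-obtain a previously created handle

   registry = ToyRegistry()
   registry.create({"train": "s3://bucket/train"})
   hid = registry.create({"train": "s3://bucket/other"})
   # After, say, a notebook reload: list the IDs, then fetch the handle by ID
   assert hid == "2"
   assert registry.get("2")["locators"]["train"] == "s3://bucket/other"

Because the registry lives on the cluster rather than in your notebook kernel, the handle survives kernel restarts and reconnects.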