API: Experiment Ops
===================

Overview
--------

The concept of an "Experiment" is core to RapidFire AI's API. User-given :code:`MLSpec` code and training/testing/prediction operations using that code are associated with named Experiment objects.

**Experiment Name:** All experiments must be given a user-set :code:`experiment_name` that is unique on that cluster. The :code:`experiment_name` will also be used when displaying plots on the app ML metrics dashboard and when saving artifacts to S3 at the end of that experiment. If you mistakenly reuse a previous experiment name on the same cluster in the constructor, RapidFire AI will append a suffix to the name you give (akin to what filesystems do).

**Single Tenancy of Experiments:** At any point in time only one experiment can be alive on your cluster. Start the session with an experiment constructor. We also recommend running :func:`end()` on a currently alive experiment (if applicable) before instantiating another. We do not currently support multi-tenant experiments, although you can run multiple experiments one after another on the same cluster. You can run the different experiments in the same notebook back to back or even from different notebooks on the same cluster. Note that even within an experiment, you can launch multiple :func:`run_fit()` and other ops one after another. All runs of a given experiment will appear on the same metrics plots on the dashboard.

**When to Create a New Experiment:** Any time you change your :code:`MLSpec` code, please end the previously alive experiment (and save its artifacts if you'd like) before creating a new one. If you forget to end the previous one explicitly, RapidFire AI will forcibly end it for you but without saving its artifacts.

**Jupyter Kernel State or Internet Disconnection Issues:** If you accidentally close your notebook/laptop, or if you restart your Jupyter kernel, or if your Internet connection temporarily disconnects, you will lose the notebook state and past cell outputs. But any experiment op you had launched will still be running on the cluster as usual in the background. Upon reopening the notebook, you can simply reconnect the kernel and pick up from where you left off. You do NOT need to rerun all the cells from the top. To obtain the :code:`Experiment` object instance you had created that is still alive, just rerun its constructor cell as is. Using that object you can continue working on your experiment as before.

The Experiment class has the functions and semantics detailed below. We illustrate each with usage from the IMDB tutorial notebook.
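
For instance, following the single-tenancy and new-experiment guidance above, switching to edited :code:`MLSpec` code typically looks like the following minimal sketch (the experiment names are illustrative; the constructor and :func:`end()` are detailed below):

.. code-block:: python

    # End the currently alive experiment (optionally saving its artifacts),
    # then create a freshly named experiment for the edited MLSpec code
    myexp.end()
    myexp2 = Experiment("rf_spec_imdb.py", experiment_name="myexp2-imdb")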

Experiment Constructor
----------------------

Constructor to instantiate a new experiment with the given name and :code:`MLSpec` code.

.. py:function:: __init__(self, ml_spec: str, experiment_name: str, epoch_checkpoints: bool = False) -> None

   :param ml_spec: Path to Python file on Jupyter server (relative to the notebook's directory) with your implementation of the class :code:`MLSpec` as described in :doc:`the API: MLSpec: Training and Inference page`
   :type ml_spec: str
   :param experiment_name: Unique name for experiment in this session
   :type experiment_name: str
   :param epoch_checkpoints: Whether to retain all epoch-level checkpoints of all runs trained in this experiment (default: False)
   :type epoch_checkpoints: bool, optional
   :return: None
   :rtype: None

**Example:**

.. code-block:: python

    >>> myexp = Experiment("rf_spec_imdb.py", experiment_name="myexp1-imdb")
    Experiment myexp1-imdb created

**Notes:**

Within a session you can instantiate as many experiment objects as you want. We recommend explicitly ending a previous experiment (see :func:`end()` below) before starting a new one so that you are cognizant of your code changes and their impact on the ML metrics.

The :code:`epoch_checkpoints` parameter allows you to retain and use any non-final model checkpoints from any :func:`run_fit()` in this experiment. This could be useful in case the accuracy metrics oscillate or diverge for some runs.

The :code:`MLSpec` code you give in the separate py file will be automatically sent to all cluster workers. If you change the code inside that file, take care to explicitly end the previous experiment and create a newly named experiment, even if your py code file name is the same. RapidFire AI does not track or diff your py file contents. So, if you give the same file name and experiment name to the constructor after a disconnection but edit the py file in between, it could lead to inconsistent or non-deterministic system behavior.

Run Fit
-------

Main function to launch DL training and validation for the given group of configs in one go, invoking our multidimensional-parallel engine for efficient execution at scale. A config is a dictionary of knobs (hyperparameters, architecture/adapter knobs, other user knobs, etc.) that configures a single model/run. This function can work with a group of configs as described in :doc:`the Configs page`.

.. py:function:: run_fit(self, data_handle: DataHandle, ml_config: Config-group, seed: int=42) -> None

   :param data_handle: Data Handle listing at least the train partition and optionally the validation partition
   :type data_handle: Described in :doc:`the Data Handles page`
   :param ml_config: Single config knob dictionary, a generated config-group, or a :code:`List` of configs or config-groups
   :type ml_config: Config-group or list as described in :doc:`the Configs page`
   :param seed: Seed for any randomness used in the ML code (default: 42)
   :type seed: int, optional
   :return: None
   :rtype: None

**Example:**

.. code-block:: python

    # Launch training and validation for given config group on given data handle
    >>> myexp.run_fit(data_handle=dh_imdb, ml_config=config_group, seed=42)
    Creating RapidFire Workers ...

**Notes:**

It will print a table on the notebook with details of all runs, as well as their status. It will also print progress bars on the notebook for each run at each epoch. It auto-generates the ML metrics files as per user specification and auto-plots them on the app MLflow dashboard. Note that :func:`run_fit()` must be actively running for you to be able to use Interactive Control (IC) ops on the app dashboard.

Within an experiment, you can rerun :func:`run_fit()` as many times as you want, with different data handles and config groups if you wish. The table and the plots will just get appended with the new sets of runs as you go along. If you change the data handle used across :func:`run_fit()` invocations, take care to ensure the runs produced are actually meaningful to compare on the ML metrics. Also note that if you change your :code:`MLSpec` code, you must end the current experiment and start a new one before you can use your new ML code.

The :code:`ml_config` argument is very versatile in allowing you to construct various knob combinations and launch them simultaneously. It can be a single config dictionary, a regular Python :code:`List` of config dictionaries, a config-group generator output (via :func:`GridSearch()`, :func:`RandomSearch()`, or an AutoML heuristic), or even a :code:`List` with a mix of configs or config-group generator outputs as its elements. Please see :doc:`the Configs page` for more details and advanced examples.
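
As an additional illustration of the list form, the following minimal sketch launches two hand-written configs in one :func:`run_fit()`; the knob names (:code:`lr`, :code:`batch_size`) are purely illustrative and must match whatever your :code:`MLSpec` code reads from its config, and config-group generator outputs such as :func:`GridSearch()` can be mixed into the same list as described in :doc:`the Configs page`:

.. code-block:: python

    # Two hand-written configs launched together in one run_fit();
    # the knob names here are illustrative, not prescribed by RapidFire AI
    cfg_a = {"lr": 3e-5, "batch_size": 16}
    cfg_b = {"lr": 1e-4, "batch_size": 32}
    myexp.run_fit(data_handle=dh_imdb, ml_config=[cfg_a, cfg_b], seed=42)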

Run Test
--------

Launch a batch inference testing job with a trained model. This function has two overloaded pathways: one using a :code:`run_id` from a :func:`run_fit()` executed in the same experiment, and the other using an imported model checkpoint.

.. py:function:: run_test(self, data_handle: DataHandle, run_id: int | None, model_tag: str | None, config: Dict[str, Any] | None, epoch_checkpoint: int | None, batch_size: int=64) -> None

   :param data_handle: Data Handle listing at least the test partition
   :type data_handle: Described in :doc:`the Data Handles page`
   :param run_id: Run ID of a model produced by a :func:`run_fit()` executed in the same experiment
   :type run_id: int, optional
   :param model_tag: Absolute path to a model checkpoint on remote storage (S3 for now)
   :type model_tag: str, optional
   :param config: Config dictionary for knobs in ML code when testing with an imported model
   :type config: Dict[str, Any], optional
   :param epoch_checkpoint: Use this epoch's model checkpoint for the given run; only applies to the :code:`run_id` pathway
   :type epoch_checkpoint: int, optional
   :param batch_size: Per-GPU batch size for inference; unrelated to train batch size (default: 64)
   :type batch_size: int, optional
   :return: None
   :rtype: None

**Examples:**

.. code-block:: python

    # Launch testing for run_id 3 from latest run_fit() in the same experiment
    myexp.run_test(data_handle=dh_imdb, run_id=3)

    # Launch testing for run_id 5 from latest run_fit() with a larger per-GPU batch size
    myexp.run_test(data_handle=dh_imdb, run_id=5, batch_size=128)

    # Launch testing with an imported model checkpoint and a config knob dictionary
    myexp.run_test(data_handle=dh_imdb, model_tag="s3://path-to/mycheckpoint.pt", config=test_cfg)

**Notes:**

All arguments needed for exactly one of the two pathways must be given; otherwise, this function will error out. The optional :code:`batch_size` can be adjusted up or down based on your model and your GPU if you'd like to tweak inference throughput.

When using the model tag pathway, the specified model checkpoint will be read and loaded by RapidFire AI automatically. Only the :func:`compute_forward()` and (if provided) the metrics functions in your :code:`MLSpec` will be executed. If those functions have any user-given knobs, say, inside :func:`compute_forward()`, you must provide a single value for each of those knobs in the :code:`config` dictionary. The dictionary can also contain any named metrics you want calculated on the test set. (If the model is large, for now you must also specify the FSDP layer in the :code:`fsdp_layer_cls` knob.)

Both pathways will print a progress bar on the notebook. At the end, they will also print a table with all accuracy metrics specified.

Note that to use the :code:`epoch_checkpoint` for the :code:`run_id` pathway, you must have set :code:`epoch_checkpoints` to True in the experiment constructor; otherwise, it will throw an error. By default, if epoch checkpoints are not saved or if they are saved but not used here, we will use the final model checkpoint.
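
As a sketch of what the :code:`config` dictionary for the model tag pathway might contain: the knob name below is purely illustrative and stands in for whatever user-given knobs your own :func:`compute_forward()` actually reads:

.. code-block:: python

    # Illustrative config for the model tag pathway; "max_seq_len" is a placeholder
    # for whatever user-given knobs your compute_forward() actually reads
    test_cfg = {"max_seq_len": 256}
    myexp.run_test(data_handle=dh_imdb, model_tag="s3://path-to/mycheckpoint.pt", config=test_cfg)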

Run Predict
-----------

Main function for batch inference to obtain predictions (pure inference) with a trained model on the predict data partition. This ensures that the codepaths used for training (say, data preprocessing, model object creation, or the forward pass) are also used for inference, reducing the chances of mismatches or semantic errors between training and inference.

.. py:function:: run_predict(self, data_handle: DataHandle, run_id: int | None, model_tag: str | None, config: Dict[str, Any] | None, epoch_checkpoint: int | None, batch_size: int=64) -> None

   :param data_handle: Data Handle listing at least the predict partition
   :type data_handle: Described in :doc:`the Data Handles page`
   :param run_id: Run ID of a model produced by a :func:`run_fit()` executed in the same experiment
   :type run_id: int, optional
   :param model_tag: Absolute path to a model checkpoint on remote storage (S3 for now)
   :type model_tag: str, optional
   :param config: Config dictionary for knobs in ML code when predicting with an imported model
   :type config: Dict[str, Any], optional
   :param epoch_checkpoint: Use this epoch's model checkpoint for the given run; only applies to the :code:`run_id` pathway
   :type epoch_checkpoint: int, optional
   :param batch_size: Per-GPU batch size for inference; unrelated to train batch size (default: 64)
   :type batch_size: int, optional
   :return: None
   :rtype: None

**Examples:**

.. code-block:: python

    # Predict with run_id 3 from latest run_fit() in the same experiment
    myexp.run_predict(data_handle=dh_imdb, run_id=3)

    # Predict with run_id 5 from latest run_fit() with given per-GPU batch size
    myexp.run_predict(data_handle=dh_imdb, run_id=5, batch_size=256)

    # Predict with my imported model checkpoint and a config knob dictionary
    myexp.run_predict(data_handle=dh_imdb, model_tag="s3://path-to/mycheckpoint.pt", config=predict_cfg)

**Notes:**

This function is almost identical to :func:`run_test()` except that it does not return metrics, since there are no targets/labels to compare model predictions against. Otherwise, the arguments for invoking either pathway are identical to :func:`run_test()`.

Both pathways will print a progress bar on the notebook. In the end, it outputs a file with the predictions themselves along with the example identifiers to your Jupyter home directory. Feel free to download it or post-process it in the notebook however you like.

Note that to use the :code:`epoch_checkpoint` for the :code:`run_id` pathway, you must have set :code:`epoch_checkpoints` to True in the constructor; otherwise, it will throw an error. By default, if epoch checkpoints are not saved or if they are saved but not used here, we will use the final model checkpoint. Also note that the prediction output file is automatically named based on the Run ID (and possibly epoch checkpoint) or the model tag.
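
For example, if the experiment was constructed with :code:`epoch_checkpoints=True`, you can predict with a non-final checkpoint via :code:`epoch_checkpoint` (a minimal sketch; the epoch index is illustrative):

.. code-block:: python

    # Predict with run_id 3 using its epoch-2 checkpoint instead of the final one;
    # requires epoch_checkpoints=True in the experiment constructor
    myexp.run_predict(data_handle=dh_imdb, run_id=3, epoch_checkpoint=2)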

Get Results
-----------

This function returns all the validation metrics for all epochs for all runs from across all :func:`run_fit()` invocations in the current experiment. You can also provide a specific Run ID as a filter condition if you prefer.

.. py:function:: get_results(self, run_id: int | None) -> pd.DataFrame

   :param run_id: Run ID of a model produced by a :func:`run_fit()` executed in the same experiment
   :type run_id: int, optional
   :return: A DataFrame with the following columns: run ID, epoch number, and one per validation metric (both named and custom metrics)
   :rtype: pandas.DataFrame

**Examples:**

.. code-block:: python

    # Get results of all runs from this experiment so far
    all_results = myexp.get_results()
    all_results.head()

    # Print results for just Run ID 5
    print(myexp.get_results(5))

**Notes:**

This function can be useful for programmatic post-processing of the results of your experiments. For instance, you can use it as part of a new custom AutoML procedure if you'd like to adjust your config for a new :func:`run_fit()` based on the results of your last :func:`run_fit()`.

Get Runs Information
--------------------

This function returns metadata about all the runs from across all :func:`run_fit()` invocations in the current experiment.

.. py:function:: get_runs_info(self) -> pd.DataFrame

   :return: A DataFrame with the following columns: run ID, data handle ID, status, source, ended by, completed epochs, MLflow run ID, full configuration dictionary
   :rtype: pandas.DataFrame

**Examples:**

.. code-block:: python

    # Get metadata of all runs from this experiment so far
    all_runs = myexp.get_runs_info()
    all_runs.head()

**Notes:**

This function is also useful for programmatic post-processing and/or pre-processing of runs and their config knobs. For instance, you can use it as part of a new custom AutoML procedure to launch a new :func:`run_fit()` with new config knob values based on :func:`get_results()` from past :func:`run_fit()` invocations. We plan to expand this API in the future to return other details about runs such as total runtime, GPU utilization, etc. based on feedback.

End
---

End the current experiment to clear out relevant system state and allow you to move on to a new experiment. Please do *not* run this when another op is still running; run :func:`cancel_current()` first to cancel that other op.

.. py:function:: end(self, artifacts_bucket: str | None, save_epoch_checkpoints: bool = False) -> None

   :param artifacts_bucket: The output bucket on S3 to write all artifacts of this experiment to; if None, artifacts will not be saved
   :type artifacts_bucket: str, optional
   :param save_epoch_checkpoints: Whether to save all epoch-level checkpoints of all runs trained in this experiment to S3 (default: False)
   :type save_epoch_checkpoints: bool, optional
   :return: None
   :rtype: None

**Examples:**

.. code-block:: python

    # End current experiment and do not persist its artifacts
    myexp.end()

    # End another experiment and save its artifacts to S3 along with all epoch checkpoints
    from os import getenv
    user_bucket = getenv("USER_BUCKET")
    cluster_name = getenv("CLUSTER_NAME")
    outpath = f"s3://{user_bucket}/outputs/{cluster_name}/myexp2-outputs/"
    myexp2.end(artifacts_bucket=outpath, save_epoch_checkpoints=True)

**Notes:**

The saved artifacts include the final (and possibly all epoch-level) model checkpoints from all runs, all metrics files, and the message logs displayed on the UI. Not saving artifacts of ephemeral or throwaway experiments can help prevent file clutter.

If :code:`save_epoch_checkpoints` and :code:`epoch_checkpoints` (in the experiment constructor) are both :code:`True`, it will include all epoch-level model checkpoints of all runs alongside the above artifacts. If :code:`epoch_checkpoints` was :code:`False` in the constructor but you set :code:`save_epoch_checkpoints` to :code:`True`, it will throw an error. Note that if :code:`artifacts_bucket` is not given, it just ignores :code:`save_epoch_checkpoints` altogether.

As noted earlier, you must run :func:`end()` on a current experiment if you want to change your :code:`MLSpec` code.
As noted above, each experiment object is associated with a given :code:`MLSpec` code for traceability purposes.
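
Putting the ops above together, a typical end-to-end flow looks roughly like the following minimal sketch. It reuses the IMDB tutorial names (:code:`dh_imdb` and :code:`config_group`) from the examples above; the run ID and metric column names used to pick a best run from :func:`get_results()` are illustrative and should be matched to the columns of the DataFrame you actually get back.

.. code-block:: python

    # Create an experiment with the MLSpec code and launch a config group
    myexp = Experiment("rf_spec_imdb.py", experiment_name="myexp1-imdb")
    myexp.run_fit(data_handle=dh_imdb, ml_config=config_group, seed=42)

    # Inspect validation metrics and pick a run to test; the column names
    # ("run_id", "val_accuracy") are illustrative, not guaranteed by the API
    results = myexp.get_results()
    best_run_id = int(results.loc[results["val_accuracy"].idxmax(), "run_id"])
    myexp.run_test(data_handle=dh_imdb, run_id=best_run_id)

    # End the experiment (pass artifacts_bucket=... to persist artifacts to S3)
    myexp.end()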