API: Experiment Ops

Overview

The concept of an “Experiment” is core to RapidFire AI’s API. Your MLSpec code, along with the training/testing/prediction operations that use it, is associated with a named Experiment object.

Experiment Name:

All experiments must be given a user-set experiment_name that is unique on that cluster. The experiment_name will also be used when displaying plots on the app ML metrics dashboard and when saving artifacts to S3 at the end of that experiment.

If you mistakenly reuse a previous experiment name on the same cluster in the constructor, RapidFire AI will append a suffix to the name you give (akin to what filesystems do).

Single Tenancy of Experiments:

At any point in time only one experiment can be alive on your cluster.

Start the session with an experiment constructor. We also recommend running end() on a currently alive experiment (if applicable) before instantiating another.

We do not currently support multi-tenant experiments, although you can run multiple experiments one after another on the same cluster. You can run the different experiments in the same notebook back to back or even from different notebooks on the same cluster.

Note that even within an experiment, you can launch multiple run_fit() and other ops one after another. All runs of a given experiment will appear on the same metrics plots on the dashboard.

When to Create a New Experiment:

Any time you change your MLSpec code, please end the previously alive experiment (and save its artifacts if you’d like) before creating a new one. If you forget to end the previous one explicitly, RapidFire AI will forcibly end it for you but without saving its artifacts.

Jupyter Kernel State or Internet Disconnection Issues:

If you accidentally close your notebook/laptop, or if you restart your Jupyter kernel, or if your Internet connection temporarily disconnects, you will lose the notebook state and past cell outputs. But any experiment op you had launched will still be running on the cluster as usual in the background.

Upon reopening the notebook, you can simply reconnect the kernel and pick up from where you left off. You do NOT need to rerun all the cells from the top.

To obtain the Experiment object instance you had created that is still alive, just rerun its constructor cell as is. Using that object you can continue working on your experiment as before.

The Experiment class has the functions and semantics detailed below. We illustrate each with usage from the IMDB tutorial notebook.

Experiment Constructor

Constructor to instantiate a new experiment with the given name and MLSpec code.

__init__(self, ml_spec: str, experiment_name: str, epoch_checkpoints: bool = False) -> None
Parameters:
  • ml_spec (str) – Path to Python file on Jupyter server (relative to the notebook’s directory) with your implementation of the class MLSpec as described in the API: MLSpec: Training and Inference page

  • experiment_name (str) – Unique name for experiment in this session

  • epoch_checkpoints (bool, optional) – Whether to retain all epoch-level checkpoints of all runs trained in this experiment (default: False)

Returns:

None

Return type:

None

Example:

>>> myexp = Experiment("rf_spec_imdb.py", experiment_name="myexp1-imdb")
Experiment myexp1-imdb created

Notes:

Within a session you can instantiate as many experiment objects as you want. We recommend explicitly ending a previous experiment (see end() below) before starting a new one so that you are cognizant of your code changes and their impact on the ML metrics.

The epoch_checkpoints parameter allows you to retain and use any non-final model checkpoints from any run_fit() in this experiment. This can be useful if the accuracy metrics oscillate or diverge for some runs.

The MLSpec code you give in the separate py file is automatically sent to all cluster workers. If you change the code inside that file, be sure to explicitly end the previous experiment and create a newly named one, even if the py file name stays the same. RapidFire AI does not track or diff your py file contents, so giving the same file name and experiment name to the constructor after a disconnection while having edited the py file in between could lead to inconsistent or non-deterministic system behavior.

Run Fit

Main function to launch DL training and validation for the given group of configs in one go, invoking our multidimensional-parallel engine for efficient execution at scale. A config is a dictionary of knobs (hyperparameters, architecture/adapter knobs, other user knobs, etc.) that configures a single model/run. This function can work with a group of configs as described in the Configs page.

run_fit(self, data_handle: DataHandle, ml_config: Config-group, seed: int = 42) -> None
Parameters:
  • data_handle (Described in the Data Handles page) – Data Handle listing at least the train partition and optionally the validation partition

  • ml_config (Config-group or list as described in the Configs page) – Single config knob dictionary, a generated config-group, or a List of configs or config-groups

  • seed (int, optional) – Seed for any randomness used in the ML code (default: 42)

Returns:

None

Return type:

None

Example:

# Launch training and validation for given config group on given data handle
>>> myexp.run_fit(data_handle=dh_imdb, ml_config=config_group, seed=42)
Creating RapidFire Workers ...

Notes:

It prints a table in the notebook with the details and status of all runs, as well as a progress bar for each run at each epoch.

It auto-generates the ML metrics files as per user specification and auto-plots them on the app MLflow dashboard. Note that run_fit() must be actively running for you to be able to use Interactive Control (IC) ops on the app dashboard.

Within an experiment, you can rerun run_fit() as many times as you want, with different data handles and config groups if you wish. The table and the plots will just get appended with the new sets of runs as you go along.

If you change the data handle used across run_fit(), take care to ensure the runs produced are actually meaningful to compare on the ML metrics.

Also note that if you change your MLSpec code, you must end the current experiment and start a new one before you can use your new ML code.

The ml_config argument is very versatile in allowing you to construct various knob combinations and launch them simultaneously. It can be a single config dictionary, a regular Python List of config dictionaries, a config-group generator output (via GridSearch(), RandomSearch(), or an AutoML heuristic), or even a List whose elements mix configs and config-group generator outputs. Please see the Configs page for more details and advanced examples.
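To make those accepted shapes concrete, here is a hedged sketch. toy_grid_search below is a made-up stand-in for a real config-group generator such as GridSearch() (whose actual output type is defined on the Configs page), and the knob names lr and batch_size are illustrative only:

```python
from itertools import product

# Minimal stand-in for a config-group generator such as GridSearch();
# the real generators are described on the Configs page.
def toy_grid_search(space):
    keys = list(space)
    return [dict(zip(keys, vals)) for vals in product(*(space[k] for k in keys))]

single_config = {"lr": 1e-3, "batch_size": 32}            # one run
config_group = toy_grid_search({"lr": [1e-3, 1e-4],
                                "batch_size": [16, 32]})  # four runs
mixed = [single_config] + config_group                    # five runs launched in one go
```

Any of single_config, config_group, or mixed could then be passed as the ml_config argument of run_fit().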

Run Test

Launch a batch inference testing job with a trained model. This function has two overloaded pathways, one using run_id from a run_fit() executed in the same experiment, and the other using an imported model checkpoint.

run_test(self, data_handle: DataHandle, run_id: int | None, model_tag: str | None, config: Dict[str, Any] | None, epoch_checkpoint: int | None, batch_size: int = 64) -> None
Parameters:
  • data_handle (Described in the Data Handles page) – Data Handle listing at least the test partition

  • run_id (int, optional) – Run ID of a model produced by a run_fit() executed in the same experiment

  • model_tag (str, optional) – Absolute path to a model checkpoint on remote storage (S3 for now)

  • config (Dict[str, Any], optional) – Config dictionary for knobs in ML code when testing with an imported model

  • epoch_checkpoint (int, optional) – Use this epoch’s model checkpoint for the given run; only applies to the run_id pathway

  • batch_size (int, optional) – Per-GPU batch size for inference; unrelated to train batch size (default: 64)

Returns:

None

Return type:

None

Examples:

# Launch testing for run_id 3 from latest run_fit() in the same experiment
myexp.run_test(data_handle=dh_imdb, run_id=3)

# Launch testing for run_id 5 from latest run_fit() with a larger per-GPU batch size
myexp.run_test(data_handle=dh_imdb, run_id=5, batch_size=128)

# Launch testing with an imported model checkpoint and a config knob dictionary
myexp.run_test(data_handle=dh_imdb, model_tag="s3://path-to/mycheckpoint.pt", config=test_cfg)

Notes:

All arguments needed for exactly one of the two pathways must be given; otherwise, this function will error out.

The optional batch_size can be adjusted up or down based on your model and your GPU if you’d like to tweak inference throughput.

When using the model tag pathway, the specified model checkpoint will be read and loaded by RapidFire AI automatically.

Only the compute_forward() and (if provided) the metrics functions in your MLSpec will be executed. If those functions have any user-given knobs, say, inside compute_forward(), you must provide a single value for each of those knobs in the config dictionary. The dictionary can also contain any named metrics you want calculated on the test set. (If the model is large, for now you must also specify the FSDP layer in fsdp_layer_cls knob.)
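As a hedged illustration of such a config dictionary for the model-tag pathway: the key names below other than fsdp_layer_cls (which is mentioned above) are hypothetical placeholders, not a guaranteed schema, and the values are examples only:

```python
# Illustrative config dict for testing with an imported model checkpoint.
test_cfg = {
    "max_seq_len": 256,             # hypothetical knob read inside compute_forward()
    "metrics": ["accuracy", "f1"],  # hypothetical key: named metrics on the test set
    # "fsdp_layer_cls": "BertLayer",  # only needed for large models (see note above)
}
```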

Both pathways will print a progress bar on the notebook. At the end, they will also print a table with all accuracy metrics specified.

Note that to use the epoch_checkpoint for the run_id pathway, you must have set epoch_checkpoints to True in the experiment constructor; otherwise, it will throw an error. By default, if epoch checkpoints are not saved or if they are saved but not used here, we will use the final model checkpoint.

Run Predict

Main function for batch inference to obtain predictions (pure inference) with a trained model on the predict data partition. This reuses the same codepaths as training (say, data preprocessing, model object creation, and the forward pass) for inference, reducing the chances of mismatches or semantic errors between training and inference.

run_predict(self, data_handle: DataHandle, run_id: int | None, model_tag: str | None, config: Dict[str, Any] | None, epoch_checkpoint: int | None, batch_size: int = 64) -> None
Parameters:
  • data_handle (Described in the Data Handles page) – Data Handle listing at least the predict partition

  • run_id (int, optional) – Run ID of a model produced by a run_fit() executed in the same experiment

  • model_tag (str, optional) – Absolute path to a model checkpoint on remote storage (S3 for now)

  • config (Dict[str, Any], optional) – Config dictionary for knobs in ML code when predicting with an imported model

  • epoch_checkpoint (int, optional) – Use this epoch’s model checkpoint for the given run; only applies to the run_id pathway

  • batch_size (int, optional) – Per-GPU batch size for inference; unrelated to train batch size (default: 64)

Returns:

None

Return type:

None

Examples:

# Predict with run_id 3 from latest run_fit() in the same experiment
myexp.run_predict(data_handle=dh_imdb, run_id=3)

# Predict with run_id 5 from latest run_fit() with given per-GPU batch size
myexp.run_predict(data_handle=dh_imdb, run_id=5, batch_size=256)

# Predict with my imported model checkpoint and a config knob dictionary
myexp.run_predict(data_handle=dh_imdb, model_tag="s3://path-to/mycheckpoint.pt", config=predict_cfg)

Notes:

This function is almost identical to run_test() except that it does not return metrics, since there are no targets/labels to compare model predictions against. Otherwise, the arguments for invoking either pathway are identical to run_test().

Both pathways will print a progress bar on the notebook. At the end, a file with the predictions and their example identifiers is written to your Jupyter home directory. Feel free to download it or post-process it in the notebook however you like.

Note that to use the epoch_checkpoint for the run_id pathway, you must have set epoch_checkpoints to True in the constructor; otherwise, it will throw an error. By default, if epoch checkpoints are not saved or if they are saved but not used here, we will use the final model checkpoint.

Also note that the prediction output file is automatically named based on the Run ID (and possibly the epoch checkpoint) or the model tag.

Get Results

This function returns all the validation metrics for all epochs for all runs from across all run_fit() invocations in the current experiment. You can also provide a specific Run ID as a filter condition if you prefer.

get_results(self, run_id: int | None) -> pd.DataFrame
Parameters:

run_id (int, optional) – Run ID of a model produced by a run_fit() executed in the same experiment

Returns:

A DataFrame with the following columns: run ID, epoch number, and one per validation metric (both named and custom metrics)

Return type:

pandas.DataFrame

Examples:

# Get results of all runs from this experiment so far
all_results = myexp.get_results()
all_results.head()

# Print results for just Run ID 5
print(myexp.get_results(5))

Notes:

This function can be useful for programmatic post-processing of your experiment results. For instance, you can use it as part of a custom AutoML procedure, adjusting the configs for a new run_fit() based on the results of your last run_fit().
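A hedged sketch of that kind of post-processing is below. It mimics the rows of the returned DataFrame with plain dicts so it is self-contained; the column names (run_id, epoch, val_accuracy) and metric values are illustrative assumptions, so check the actual DataFrame columns from get_results() before using them:

```python
# Stand-in for myexp.get_results(): one row per (run, epoch) with
# validation metrics; column names and values are illustrative.
rows = [
    {"run_id": 1, "epoch": 3, "val_accuracy": 0.81},
    {"run_id": 2, "epoch": 3, "val_accuracy": 0.87},
    {"run_id": 3, "epoch": 3, "val_accuracy": 0.84},
]

# Pick the best run by validation accuracy, e.g., to seed the configs
# of the next run_fit().
best_row = max(rows, key=lambda r: r["val_accuracy"])
```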

Get Runs Information

This function returns metadata about all the runs from across all run_fit() invocations in the current experiment.

get_runs_info(self) -> pd.DataFrame
Returns:

A DataFrame with the following columns: run ID, data handle ID, status, source, ended by, completed epochs, MLflow run ID, full configuration dictionary

Return type:

pandas.DataFrame

Examples:

# Get metadata of all runs from this experiment so far
all_runs = myexp.get_runs_info()
all_runs.head()

Notes:

This function is also useful for programmatic post-processing and/or pre-processing of runs and their config knobs. For instance, you can use it as part of a custom AutoML procedure to launch a new run_fit() with new config knob values based on get_results() from past run_fit() invocations.
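As a hedged sketch of that idea: take the full configuration dictionary of the best past run (one of the columns returned by get_runs_info()) and perturb a knob to build the config group for the next run_fit(). The knob names here are illustrative assumptions:

```python
# Assume this is the "full configuration dictionary" column value of the
# best past run from get_runs_info(); knob names are illustrative.
best_config = {"lr": 1e-4, "batch_size": 32, "num_epochs": 3}

# Sweep around the best learning rate to form the next config group.
next_configs = [dict(best_config, lr=best_config["lr"] * f) for f in (0.5, 1.0, 2.0)]
```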

We plan to expand this API in the future to return other details about runs such as total runtime, GPU utilization, etc. based on feedback.

End

End the current experiment to clear out relevant system state and allow you to move on to a new experiment. Please do not run this when another op is still running; run cancel_current() first to cancel that other op.

end(self, artifacts_bucket: str | None, save_epoch_checkpoints: bool = False) -> None
Parameters:
  • artifacts_bucket (str, optional) – The output bucket on S3 to write all artifacts of this experiment to; if None, artifacts will not be saved

  • save_epoch_checkpoints (bool, optional) – Whether to save all epoch-level checkpoints of all runs trained in this experiment to S3 (default: False)

Returns:

None

Return type:

None

Examples:

# End current experiment and do not persist its artifacts
myexp.end()

# End another experiment and save its artifacts to S3 along with all epoch checkpoints
from os import getenv
user_bucket = getenv("USER_BUCKET")
cluster_name = getenv("CLUSTER_NAME")
outpath = f"s3://{user_bucket}/outputs/{cluster_name}/myexp2-outputs/"
myexp2.end(artifacts_bucket=outpath, save_epoch_checkpoints=True)

Notes:

The saved artifacts include the final (and possibly all epoch-level) model checkpoints from all runs, all metrics files, and the message logs displayed on the UI. Not saving artifacts of ephemeral or throwaway experiments can help prevent file clutter.

If save_epoch_checkpoints and epoch_checkpoints (in the experiment constructor) are both True, it will include all epoch-level model checkpoints of all runs alongside the above artifacts. If epoch_checkpoints was False in the constructor but you set save_epoch_checkpoints to True, it will throw an error. Note that if artifacts_bucket is not given, it just ignores save_epoch_checkpoints altogether.

As noted earlier, you must run end() on the current experiment before changing your MLSpec code; each experiment object is associated with a fixed MLSpec code for traceability purposes.