API: Experiment Ops
Overview
The concept of an “Experiment” is core to RapidFire AI’s API.
User-given MLSpec
code and training/testing/prediction operations using that code are associated
with named Experiment objects.
Experiment Name:
All experiments must be given a user-set experiment_name
that is unique on that cluster.
The experiment_name
will also be used when displaying plots on the app ML metrics dashboard
and when saving artifacts to S3 at the end of that experiment.
If you mistakenly reuse a previous experiment name on the same cluster in the constructor, RapidFire AI will append a suffix to the name you give (akin to what filesystems do).
Single Tenancy of Experiments:
At any point in time only one experiment can be alive on your cluster.
Start the session with an experiment constructor. We also recommend running end()
on a
currently alive experiment (if applicable) before instantiating another.
We do not currently support multi-tenant experiments, although you can run multiple experiments one after another on the same cluster. You can run the different experiments in the same notebook back to back or even from different notebooks on the same cluster.
Note that even within an experiment, you can launch multiple run_fit()
and other ops
one after another. All runs of a given experiment will appear on the same metrics plots on the dashboard.
When to Create a New Experiment:
Any time you change your MLSpec
code, please end the previously alive experiment (and save its
artifacts if you’d like) before creating a new one.
If you forget to end the previous one explicitly, RapidFire AI will forcibly end it for you but without
saving its artifacts.
Jupyter Kernel State or Internet Disconnection Issues:
If you accidentally close your notebook/laptop, or if you restart your Jupyter kernel, or if your Internet connection temporarily disconnects, you will lose the notebook state and past cell outputs. But any experiment op you had launched will still be running on the cluster as usual in the background.
Upon reopening the notebook, you can simply reconnect the kernel and pick up from where you left off. You do NOT need to rerun all the cells from the top.
To obtain the Experiment
object instance you had created that is still alive, just rerun its
constructor cell as is. Using that object you can continue working on your experiment as before.
The Experiment class has the functions and semantics detailed below. We illustrate each with usage from the IMDB tutorial notebook.
Experiment Constructor
Constructor to instantiate a new experiment with the given name and MLSpec
code.
- __init__(self, ml_spec: str, experiment_name: str, epoch_checkpoints: bool = False) → None
- Parameters:
ml_spec (str) – Path to a Python file on the Jupyter server (relative to the notebook’s directory) with your implementation of the class MLSpec, as described in the API: MLSpec: Training and Inference page
experiment_name (str) – Unique name for the experiment in this session
epoch_checkpoints (bool, optional) – Whether to retain all epoch-level checkpoints of all runs trained in this experiment (default: False)
- Returns:
None
- Return type:
None
Example:
>>> myexp = Experiment("rf_spec_imdb.py", experiment_name="myexp1-imdb")
Experiment myexp1-imdb created
Notes:
Within a session you can instantiate as many experiment objects as you want.
We recommend explicitly ending a previous experiment (see end()
below) before
starting a new one so that you are cognizant of your code changes and their impact on the ML metrics.
The epoch_checkpoints parameter allows you to retain and use any non-final model checkpoints from any run_fit() in this experiment.
This could be useful in case the accuracy metrics oscillate or diverge for some runs.
The MLSpec
code you give in the separate py file will be automatically sent to all cluster workers.
Please take care to ensure that if you change the code inside that file, you explicitly end that previous
experiment and create a newly named experiment even if your py code file name is the same.
RapidFire AI does not track or diff your py file contents.
So, if you give the same file name and experiment name to the constructor after a disconnection but
edit the py file in between, it could lead to inconsistent or non-deterministic system behavior.
Run Fit
Main function to launch DL training and validation for the given group of configs in one go, invoking our multidimensional-parallel engine for efficient execution at scale. A config is a dictionary of knobs (hyperparameters, architecture/adapter knobs, other user knobs, etc.) that configures a single model/run. This function can work with a group of configs as described in the Configs page.
- run_fit(self, data_handle: DataHandle, ml_config: Config-group, seed: int = 42) → None
- Parameters:
data_handle (Described in the Data Handles page) – Data Handle listing at least the train partition and optionally the validation partition
ml_config (Config-group or list as described in the Configs page) – Single config knob dictionary, a generated config-group, or a List of configs or config-groups
seed (int, optional) – Seed for any randomness used in the ML code (default: 42)
- Returns:
None
- Return type:
None
Example:
# Launch training and validation for given config group on given data handle
>>> myexp.run_fit(data_handle=dh_imdb, ml_config=config_group, seed=42)
Creating RapidFire Workers ...
Notes:
It will print a table on the notebook with details of all runs, as well as their status. It will also print progress bars on the notebook for each run at each epoch.
It auto-generates the ML metrics files as per user specification and auto-plots them on the
app MLflow dashboard.
Note that run_fit()
must be actively running for you to be able to use Interactive
Control (IC) ops on the app dashboard.
Within an experiment, you can rerun run_fit()
as many times as you want, with
different data handles and config groups if you wish.
The table and the plots will just get appended with the new sets of runs as you go along.
If you change the data handle used across run_fit()
, take care to ensure the runs
produced are actually meaningful to compare on the ML metrics.
Also note that if you change your MLSpec
code, you must end the current experiment
and start a new one before you can use your new ML code.
The ml_config
argument is very versatile in allowing you to construct various knob
combinations and launch them simultaneously.
It can be a single config dictionary, a regular Python List
of config dictionaries, a
config-group generator output (via GridSearch()
, RandomSearch()
, or AutoML heuristic),
or even a List with a mix of configs and config-group generator outputs as its elements.
Please see the Configs page for more details and advanced examples.
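For illustration, here is a minimal sketch (outside of RapidFire AI itself) of what these forms look like as plain Python objects. All knob names below are hypothetical, and the manual itertools.product expansion only mimics what a grid-style config-group conceptually produces:

```python
from itertools import product

# Illustrative knob names only; these are not fixed RapidFire AI keys.
base = {"model_name": "bert-base-uncased", "epochs": 2}

# A single config is just a dictionary of knobs.
single_cfg = {**base, "lr": 3e-5}

# What a grid over knob values conceptually expands to: the cross
# product of all listed values, one config dictionary per combination.
grid = {"lr": [1e-5, 3e-5], "batch_size": [16, 32]}
expanded = [
    {**base, **dict(zip(grid.keys(), combo))}
    for combo in product(*grid.values())
]

# run_fit() also accepts a plain List mixing configs and config-groups:
mixed = [single_cfg] + expanded
print(len(expanded), len(mixed))  # 4 configs from the grid, 5 in total
```

In actual usage you would pass `single_cfg`, a `GridSearch()`/`RandomSearch()` output, or such a mixed list directly as the `ml_config` argument of `run_fit()`.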
Run Test
Launch a batch inference testing job with a trained model.
This function has two overloaded pathways, one using run_id
from a
run_fit()
executed in the same experiment, and the other using an imported model checkpoint.
- run_test(self, data_handle: DataHandle, run_id: int | None, model_tag: str | None, config: Dict[str, Any] | None, epoch_checkpoint: int | None, batch_size: int = 64) → None
- Parameters:
data_handle (Described in the Data Handles page) – Data Handle listing at least the test partition
run_id (int, optional) – Run ID of a model produced by a run_fit() executed in the same experiment
model_tag (str, optional) – Absolute path to a model checkpoint on remote storage (S3 for now)
config (Dict[str, Any], optional) – Config dictionary for knobs in ML code when testing with an imported model
epoch_checkpoint (int, optional) – Use this epoch’s model checkpoint for the given run; only applies to the run_id pathway
batch_size (int, optional) – Per-GPU batch size for inference; unrelated to train batch size (default: 64)
- Returns:
None
- Return type:
None
Examples:
# Launch testing for run_id 3 from latest run_fit() in the same experiment
myexp.run_test(data_handle=dh_imdb, run_id=3)
# Launch testing for run_id 5 from latest run_fit() with a larger per-GPU batch size
myexp.run_test(data_handle=dh_imdb, run_id=5, batch_size=128)
# Launch testing with an imported model checkpoint and a config knob dictionary
myexp.run_test(data_handle=dh_imdb, model_tag="s3://path-to/mycheckpoint.pt", config=test_cfg)
Notes:
All arguments needed for exactly one of the two pathways must be given; otherwise, this function will error out.
The optional batch_size
can be adjusted up or down based on your model and
your GPU if you’d like to tweak inference throughput.
When using the model tag pathway, the specified model checkpoint will be read and loaded by RapidFire AI automatically.
Only the compute_forward()
and (if provided) the metrics functions in your MLSpec
will be executed.
If those functions have any user-given knobs, say, inside compute_forward()
,
you must provide a single value for each of those knobs in the config
dictionary.
The dictionary can also contain any named metrics you want calculated on the test set.
(If the model is large, for now you must also specify the FSDP layer in the fsdp_layer_cls knob.)
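As a hedged illustration, a config dictionary for this pathway might look as follows. Every key below is hypothetical: the knob names must match whatever your own MLSpec code actually reads, and the metrics entry simply shows where named test-set metrics could be listed.

```python
# Hypothetical config for the model-tag pathway of run_test().
# All key names here are illustrative, not fixed RapidFire AI keys.
test_cfg = {
    "max_seq_len": 256,        # user knob read inside compute_forward()
    "num_labels": 2,           # another user knob your MLSpec may expect
    "metrics": ["accuracy"],   # named metrics to compute on the test set
    # "fsdp_layer_cls": "BertLayer",  # only needed for large models (FSDP)
}
```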
Both pathways will print a progress bar on the notebook. At the end, they will also print a table with all accuracy metrics specified.
Note that to use the epoch_checkpoint
for the run_id
pathway, you must have
set epoch_checkpoints
to True in the experiment constructor; otherwise, it will throw an error.
By default, if epoch checkpoints are not saved or if they are saved but not used here, we
will use the final model checkpoint.
Run Predict
Main function for batch inference to obtain predictions (pure inference) with a trained model on the predict data partition. This ensures consistency of codepaths used for training (say, data preprocessing, model object creation, or forward pass) to also be used for inference, reducing chances of mismatches or semantic errors between training and inference.
- run_predict(self, data_handle: DataHandle, run_id: int | None, model_tag: str | None, config: Dict[str, Any] | None, epoch_checkpoint: int | None, batch_size: int = 64) → None
- Parameters:
data_handle (Described in the Data Handles page) – Data Handle listing at least the predict partition
run_id (int, optional) – Run ID of a model produced by a run_fit() executed in the same experiment
model_tag (str, optional) – Absolute path to a model checkpoint on remote storage (S3 for now)
config (Dict[str, Any], optional) – Config dictionary for knobs in ML code when predicting with an imported model
epoch_checkpoint (int, optional) – Use this epoch’s model checkpoint for the given run; only applies to the run_id pathway
batch_size (int, optional) – Per-GPU batch size for inference; unrelated to train batch size (default: 64)
- Returns:
None
- Return type:
None
Examples:
# Predict with run_id 3 from latest run_fit() in the same experiment
myexp.run_predict(data_handle=dh_imdb, run_id=3)
# Predict with run_id 5 from latest run_fit() with given per-GPU batch size
myexp.run_predict(data_handle=dh_imdb, run_id=5, batch_size=256)
# Predict with my imported model checkpoint and a config knob dictionary
myexp.run_predict(data_handle=dh_imdb, model_tag="s3://path-to/mycheckpoint.pt", config=predict_cfg)
Notes:
This function is almost identical to run_test()
except that it does not return metrics,
since there are no targets/labels to compare model predictions against.
Otherwise, the arguments for invoking either pathway are identical to run_test()
.
Both pathways will print a progress bar on the notebook. At the end, each outputs a file with the predictions themselves along with the example identifiers in your Jupyter home directory. Feel free to download it or post-process it in the notebook however you like.
Note that to use the epoch_checkpoint
for the run_id
pathway, you must have
set epoch_checkpoints
to True in the constructor; otherwise, it will throw an error.
By default, if epoch checkpoints are not saved or if they are saved but not used here, we
will use the final model checkpoint.
Also note that the prediction output file is automatically named based on the Run ID (and possibly epoch checkpoint) or the model tag.
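As an illustrative sketch of such post-processing (the real file name and column names come from run_predict() and are merely assumed here), the following parses an in-memory stand-in for the predictions file:

```python
import csv
import io

# Stand-in for the run_predict() output file; the actual file name and
# columns ("example_id", "prediction") are assumptions for illustration.
raw = io.StringIO("example_id,prediction\n0,pos\n1,neg\n2,pos\n")
rows = list(csv.DictReader(raw))

# Example post-processing: count predictions per class.
counts = {}
for row in rows:
    counts[row["prediction"]] = counts.get(row["prediction"], 0) + 1
print(counts)  # {'pos': 2, 'neg': 1}
```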
Get Results
This function returns all the validation metrics for all epochs for all runs from across
all run_fit()
invocations in the current experiment.
You can also provide a specific Run ID as a filter condition if you prefer.
Examples:
# Get results of all runs from this experiment so far
all_results = myexp.get_results()
all_results.head()
# Print results for just Run ID 5
print(myexp.get_results(5))
Notes:
This function can be useful for programmatic post-processing of the results of your experiments.
For instance, you can use it as part of a custom AutoML procedure if you’d like to adjust your
config for a new run_fit()
based on the results of your last run_fit()
.
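As a minimal sketch of that pattern, assuming each results row carries a run ID and a validation metric (the column names run_id and val_accuracy are assumptions here, with plain dictionaries standing in for the returned rows):

```python
# Stand-in rows for what get_results() might return; the column names
# (run_id, epoch, val_accuracy) are assumptions for illustration.
results = [
    {"run_id": 1, "epoch": 2, "val_accuracy": 0.86},
    {"run_id": 2, "epoch": 2, "val_accuracy": 0.91},
    {"run_id": 3, "epoch": 2, "val_accuracy": 0.88},
]

# Pick the best run's id to seed the configs for the next run_fit().
best = max(results, key=lambda r: r["val_accuracy"])
print(best["run_id"])  # 2
```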
Get Runs Information
This function returns metadata about all the runs from across all run_fit()
invocations in the current experiment.
- get_runs_info(self) → pd.DataFrame
- Returns:
A DataFrame with the following columns: run ID, data handle ID, status, source, ended by, completed epochs, MLflow run ID, full configuration dictionary
- Return type:
pandas.DataFrame
Examples:
# Get metadata of all runs from this experiment so far
all_runs = myexp.get_runs_info()
all_runs.head()
Notes:
This function is also useful for programmatic post-processing and/or pre-processing of runs and their config knobs.
For instance, you can use it as part of a custom AutoML procedure to launch a new run_fit()
with new config
knob values based on get_results()
from past run_fit()
invocations.
We plan to expand this API in the future to return other details about runs such as total runtime, GPU utilization, etc. based on feedback.
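A minimal sketch of such post-processing, using plain dictionaries as stand-ins for the DataFrame rows (the exact column names are assumptions based on the column list in the return description):

```python
# Stand-in rows mirroring get_runs_info() columns (run ID, status,
# completed epochs, config); the exact column names are assumptions.
runs = [
    {"run_id": 1, "status": "completed", "completed_epochs": 3,
     "config": {"lr": 1e-5}},
    {"run_id": 2, "status": "stopped", "completed_epochs": 1,
     "config": {"lr": 3e-5}},
]

# Keep only completed runs and collect their learning-rate knobs,
# e.g. to seed the config-group for a follow-up run_fit().
completed_lrs = [r["config"]["lr"] for r in runs if r["status"] == "completed"]
print(completed_lrs)  # [1e-05]
```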
End
End the current experiment to clear out relevant system state and allow you to move on to a new experiment.
Please do not run this when another op is still running; run cancel_current()
first to cancel that other op.
- end(self, artifacts_bucket: str | None, save_epoch_checkpoints: bool = False) → None
- Parameters:
artifacts_bucket (str, optional) – Remote storage (S3 for now) path to save this experiment’s artifacts to; if not given, no artifacts are persisted
save_epoch_checkpoints (bool, optional) – Whether to also save all epoch-level model checkpoints of all runs; requires epoch_checkpoints to be True in the experiment constructor (default: False)
- Returns:
None
- Return type:
None
Examples:
# End current experiment and do not persist its artifacts
myexp.end()
# End another experiment and save its artifacts to S3 along with all epoch checkpoints
from os import getenv
user_bucket = getenv("USER_BUCKET")
cluster_name = getenv("CLUSTER_NAME")
outpath = f"s3://{user_bucket}/outputs/{cluster_name}/myexp2-outputs/"
myexp2.end(artifacts_bucket=outpath, save_epoch_checkpoints=True)
Notes:
The saved artifacts include the final (and possibly all epoch-level) model checkpoints from all runs, all metrics files, and the message logs displayed on the UI. Not saving artifacts of ephemeral or throwaway experiments can help prevent file clutter.
If save_epoch_checkpoints
and epoch_checkpoints
(in the experiment constructor) are both
True
, it will include all epoch-level model checkpoints of all runs alongside the above artifacts.
If epoch_checkpoints
was False
in the constructor but you set save_epoch_checkpoints
to True
, it will throw an error.
Note that if artifacts_bucket
is not given, it just ignores save_epoch_checkpoints
altogether.
As noted earlier, you must run end()
on a current experiment if you want to change your MLSpec
code.
As noted above, each experiment object is associated with a given MLSpec
code for traceability purposes.