Glossary of Key Terms and Concepts
Artifacts
A set of files that can be saved (to S3) at the end of an experiment via end().
Includes the final (and possibly all epoch-level) model checkpoints of all runs across
all run_fit() invocations in that experiment, all test and prediction output files (if applicable),
all associated metrics files, and the message logs displayed on the UI.
Read more here: API: Experiment Ops.
Cancel Current
Function to cancel an ongoing, potentially long-running operation (e.g., run_fit() or download()) on the cluster.
Read more here: API: Utility Functions: Cancel Current.
Cluster Ops
Operations to control clusters in your account: Stop, Restart, Create, and Delete.
Read more here: Dashboard: Cluster Ops.
Config Dictionary
A dictionary of key-value pairs of experimentation knobs spanning training-related knobs,
user-defined knobs, named metrics, and optional knobs.
A knob’s value can be a singleton or set-valued (LIST or RANGE).
A dictionary with set-valued knobs can be fed to a config-group generator method.
A single combination of knob values in this dictionary is called a “leaf” config.
RapidFire AI instantiates one run per leaf config, and its values are injected via the
cfg argument to your MLSpec functions.
Read more here: API: Config Knobs and AutoML.
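As a hedged sketch, a config dictionary might look like the following. The section and knob names here are illustrative placeholders, not the exact reserved keys of the RapidFire AI API:

```python
# Illustrative shape of a config dictionary. Section and knob names below
# ("train", "user_knobs", "lr", "backbone") are hypothetical examples, not
# the exact RapidFire AI reserved keys.
config = {
    "train": {                      # training-related knobs
        "lr": [1e-3, 1e-4],         # set-valued knob (LIST): two candidate values
        "batch_size": 32,           # singleton knob
    },
    "user_knobs": {                 # user-defined knobs
        "backbone": ["resnet18", "resnet50"],  # set-valued knob (LIST)
    },
}

# With two set-valued knobs of 2 values each, a config-group generator such
# as grid search would yield 2 * 2 = 4 "leaf" configs, i.e., 4 runs.
num_leaves = len(config["train"]["lr"]) * len(config["user_knobs"]["backbone"])
```

Each leaf config fixes one concrete value per knob; that leaf is what gets injected as `cfg` into your MLSpec functions.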
Config Group
A set of config dictionary instances, produced by providing a config dictionary with set-valued knobs to a config-group generator method such as grid search, random search, or an AutoML heuristic. It can also be a Python list of individual config dictionaries or config-group generators, recursively.
Read more here: API: Config Knobs and AutoML.
Config Group Generator
A method to generate a group of config dictionaries in one go based on an input config dictionary with set-valued knobs (LIST or RANGE). Currently supported generator methods are grid search, random search, and an AutoML heuristic named Successive Halving Algorithm (SHA). Support for more AutoML heuristics coming soon.
Read more here: API: Config Knobs and AutoML: Config Group Generators.
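To illustrate the idea behind a grid-search generator (this is a minimal sketch of the concept only, not the RapidFire AI implementation), expanding every LIST-valued knob into all combinations looks like:

```python
import itertools

def grid_search(config):
    """Conceptual sketch of a grid-search config-group generator: expand
    every LIST-valued knob into all combinations (leaf configs). Not the
    actual RapidFire AI generator; flat dicts only, for illustration."""
    keys = list(config)
    # Treat lists as set-valued knobs; wrap singletons so product() works.
    values = [v if isinstance(v, list) else [v] for v in config.values()]
    return [dict(zip(keys, combo)) for combo in itertools.product(*values)]

group = grid_search({"lr": [1e-3, 1e-4], "batch_size": 32, "backbone": ["a", "b"]})
# 2 lr values x 2 backbones = 4 leaf configs, each keeping batch_size = 32.
```

Random search would instead sample a fixed number of leaf configs from the same space, and SHA would prune low-performing leaves across rounds.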
Custom Metrics
User-defined evaluation metrics to track learning behaviors of runs based on
the outputs of compute_forward() and targets in the data minibatch.
Mini-batch level metrics are defined in compute_metrics(); the outputs
can be tensors or even non-tensor objects.
How to aggregate them at each epoch’s end is defined in aggregate_metrics();
the outputs must be numbers, and they will be automatically plotted on the app dashboard.
Read more here: API: MLSpec: Training and Inference: Custom Metrics Definition.
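The division of labor between the two functions can be sketched in plain Python (tensor-free for brevity; the function names mirror the glossary, but the exact signatures in RapidFire AI may differ):

```python
# Sketch of the compute_metrics / aggregate_metrics contract described above.
# Plain Python stands in for tensors; actual RapidFire AI signatures may differ.

def compute_metrics(outputs, targets):
    # Per-minibatch metrics: values may be tensors or non-tensor objects.
    correct = sum(1 for o, t in zip(outputs, targets) if o == t)
    return {"correct": correct, "total": len(targets)}

def aggregate_metrics(minibatch_metrics):
    # Epoch-end aggregation: outputs must be numbers so they can be plotted.
    correct = sum(m["correct"] for m in minibatch_metrics)
    total = sum(m["total"] for m in minibatch_metrics)
    return {"accuracy": correct / total}

epoch = [compute_metrics([1, 0], [1, 1]), compute_metrics([1, 1], [1, 1])]
result = aggregate_metrics(epoch)   # accuracy = 3/4
```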
Data Handle
A core class in the RapidFire AI API that defines a dataset collection that is usable on the cluster. It can house separate data partitions for training, validation, testing, and/or prediction, as well as sampling of rows and sub-selection of columns to download.
Read more here: API: Data Handles.
Data Locators Dictionary
A dictionary indicating the ESF and (optionally) object directory for each data
partition separately: training, validation, testing, and/or prediction.
Optionally, miscellaneous files can also be listed for you to use in initialize_run().
This dictionary has 9 reserved keys: four for the partitions’ ESFs, four for the object directories, and one for miscellaneous files.
Read more here: API: Data Ingestion: Locators Dictionary and Semantics.
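A hypothetical shape for this dictionary is sketched below. The key names are illustrative only; the actual reserved key names are defined by the RapidFire AI API docs:

```python
# Hypothetical data locators dictionary. The 9 reserved key names below are
# made-up placeholders for illustration; consult the API docs for the real ones.
locators = {
    "train_esf":   "s3://my-bucket/train.csv",     # 4 partition ESFs
    "valid_esf":   "s3://my-bucket/valid.csv",
    "test_esf":    "s3://my-bucket/test.csv",
    "predict_esf": "s3://my-bucket/predict.csv",
    "train_dir":   "s3://my-bucket/train_objs/",   # 4 object directories
    "valid_dir":   "s3://my-bucket/valid_objs/",
    "test_dir":    "s3://my-bucket/test_objs/",
    "predict_dir": "s3://my-bucket/predict_objs/",
    "misc_files":  ["s3://my-bucket/vocab.json"],  # 1 key: files for initialize_run()
}
assert len(locators) == 9  # 4 ESFs + 4 object directories + 1 misc
```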
Example Structure File (ESF)
A core concept in RapidFire AI for natively multimodal data ingestion. This is a table with metadata capturing the structure of an example for your use case. The columns can be example identifiers, features in place, or relative file paths to object files in the corresponding object directory. Currently supported ESF file format is a CSV. Support for more formats coming soon.
Read more here: API: Data Ingestion: Example Structure File (ESF).
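For instance, an ESF for an image-classification partition could be a CSV like the one below. The column names are hypothetical; `img_path` is relative to that partition's object directory, while `label` is a feature stored in place:

```python
# Hypothetical ESF as a CSV (column names are illustrative, not reserved).
import csv
import io

esf_text = """example_id,img_path,label
0,cats/001.jpg,cat
1,dogs/014.jpg,dog
"""

# Each row describes one example; img_path is relative to the partition's
# object directory, label is an in-place feature.
rows = list(csv.DictReader(io.StringIO(esf_text)))
```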
Experiment
A core concept in the RapidFire AI API that defines a collection of training, testing, and prediction operations performed with a given pair of Data Handle and MLSpec code. Each experiment is assigned a unique name that is useful for both display of plots on the app dashboard and for post-hoc artifact tracking and governance. At any point in time, only one experiment can be alive on a cluster.
Read more here: API: Experiment Ops.
Experiment Ops
AI computation methods associated with the Experiment class: run_fit(), run_test(), run_predict(), end(), and the constructor.
Also includes two information-gathering methods that enable programmatic post-processing
of learning results, say, for custom user-defined automation heuristics:
get_runs_info() and get_results().
Read more here: API: Experiment Ops.
File Handling Utilities
Functions to let you list, read, write, or delete objects on remote storage (S3) or the local Jupyter filesystem.
Read more here: API: Utility Functions: File Handling Utilities.
In-Situ Dataset
A dataset in which all features and targets reside in the Example Structure File itself. That is, there are no separate object files.
Read more here: API: Data Ingestion.
Interactive Control Ops (IC Ops)
Operations to control runs in flight during a run_fit().
RapidFire AI automatically reapportions system resources across runs elastically.
We currently support 4 IC Ops: Stop, Resume, Delete, and Clone-Modify.
Read more here: Dashboard: Interactive Control (IC) Ops.
Knob
A single entry in the config dictionary given for experimentation. There are 4 types of knobs: user-defined knobs, training-related knobs, named metrics, and optional knobs. User-defined knobs handle specifics of data preprocessing and model architecture. A knob’s value can be a singleton or set-valued (list or range).
Read more here: API: Config Knobs and AutoML.
Logs
Files from the cluster with entries from all running RapidFire AI processes to aid
monitoring and debugging of job behaviors.
Experiment Ops-related logs can be seen on the UI under their own pane or downloaded as files via download_logs().
Cluster Ops-related logs can be seen under the Clusters tab of the RapidFire AI web app.
Read more here: API: Utility Functions: Download Logs and Dashboard: MLflow: Message Logs.
ML Metrics Dashboard
RapidFire AI’s proprietary fork of MLflow to display plots of all metrics of all runs and experiments, overlay IC Ops functionality onto the plots, and display informative logs.
Read more here: Dashboard: MLflow.
MLSpec
A core class in the RapidFire AI API to provide your PyTorch code to define the details of a model’s processing logic in a structured manner via a sequence of key functions:
initialize_run(): Initialize read-only in-memory data structures for a run.
create_model(): Create and return an nn.Module object for a run.
row_prep(): Preprocess a single injected row/example from an ESF to get tensors.
collate_fn(): (Optional) Override default PyTorch collate for variable-length examples.
compute_forward(): Define the forward pass and return the loss and outputs on one minibatch.
compute_metrics(): (Optional) Define your own custom eval metrics over compute_forward() outputs.
aggregate_metrics(): (Optional) Aggregate your metrics across compute_metrics() outputs of all minibatches per epoch.
Read more here: API: MLSpec: Training and Inference.
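An illustrative skeleton of such a class is sketched below. The signatures are simplified and the base class is omitted; consult the API docs for the exact interface. Plain Python objects stand in for PyTorch tensors and modules:

```python
# Illustrative MLSpec-style skeleton. Simplified signatures; stand-ins
# replace nn.Module and tensors so the sketch runs without PyTorch.

class MyMLSpec:
    def initialize_run(self, cfg):
        # Build read-only in-memory structures (e.g., a label map) once per run.
        self.label_map = {"cat": 0, "dog": 1}

    def create_model(self, cfg):
        # Real code would construct and return an nn.Module; stubbed here.
        return object()

    def row_prep(self, row, cfg):
        # Turn one injected ESF row into model-ready values (tensors in real code).
        return self.label_map[row["label"]]

    def compute_forward(self, model, batch, cfg):
        # Forward pass on one minibatch: return (loss, outputs).
        return 0.0, batch

spec = MyMLSpec()
spec.initialize_run(cfg={})
prepped = spec.row_prep({"label": "dog"}, cfg={})  # -> 1
```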
Named Metrics
A knob in the config dictionary whose value can be a list of reserved phrases for evaluation metrics
with well-known semantics that require a pre-defined set of outputs in compute_forward().
As of this writing, we support “top1_accuracy” and “top[k]-accuracy” (for any positive integer k).
Support for more named metrics coming soon.
Read more here: API: Config Knobs and AutoML: Named Metrics.
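The semantics behind top-k accuracy can be sketched in plain Python (RapidFire AI computes the named metric internally from compute_forward() outputs; this is only a conceptual illustration):

```python
# Conceptual sketch of top-k accuracy: a prediction counts as correct if the
# target class is among the k highest-scoring classes for that example.
# Not the RapidFire AI implementation; plain lists stand in for tensors.

def top_k_accuracy(scores, targets, k):
    correct = 0
    for row, t in zip(scores, targets):
        # Indices of the k largest scores for this example.
        top_k = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        correct += t in top_k
    return correct / len(targets)

scores = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2]]
acc1 = top_k_accuracy(scores, [1, 1], k=1)  # 0.5: only row 0's top-1 is class 1
acc2 = top_k_accuracy(scores, [1, 1], k=2)  # 1.0: class 1 is in top-2 of both rows
```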
Non-Tensor Metrics
User-defined evaluation metrics returned in compute_metrics() with an output type that
is not a tensor, e.g., a string or a list of any type.
RapidFire AI collects these non-tensor metrics from across GPUs and workers and assembles them
into a list that is fed to your aggregate_metrics().
Read more here: API: MLSpec: Training and Inference: Custom Metrics Definition.
Object Directories and Files
A folder on remote storage (S3 for now) that houses object files, possibly with a sub-directory structure. Object files can be of any file format: JPEG, MP4, TXT, JSON, DOC, etc. To use object files for your use case, your ESF must have a column (or columns) referencing the relevant object files with their path relative to that data partition’s object directory.
Read more here: API: Data Ingestion.
Results
A single DataFrame containing all evaluation metrics values of all runs across all epochs across
all run_fit() invocations so far.
Returned by the get_results() function.
Read more here: API: Experiment Ops.
Requirements
The requirements.txt file listing all the extra Python packages to install on cluster
workers for your use case. These must be installable with a regular pip install command.
Note that there must only be one requirements.txt file on your cluster, and it should
be in the Jupyter home directory.
Read more here: Jupyter Notebook and File Handling.
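For example, a requirements.txt in the Jupyter home directory might look like the following (the package names are illustrative, not required by RapidFire AI):

```text
# Example requirements.txt; package names are illustrative only.
transformers
sentencepiece
evaluate
```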
Run
A central concept in RapidFire AI representing a single combination of configuration knob values
for a model trained with run_fit().
It is the same concept as in ML metrics dashboards such as MLflow and Weights & Biases.
RapidFire AI assigns each run a unique integer run_id within an experiment.
Read more here: API: Experiment Ops and Dashboard: MLflow.
System Metrics Dashboard
A standard Plutono dashboard displaying a pre-configured schema and layout of system metrics plots, at both individual machine level (averaged across its GPUs) and cluster-wide averages.
Read more here: Dashboard: Plutono.
Train Knob
A named section (“train”) of the config dictionary given to experiment operations such as run_fit() that defines all learning-related and optimizer-related hyper-parameter knobs as key-value pairs.
Read more here: API: Config Knobs and AutoML.
User Knob
A named section (“user_knobs”) of the config dictionary given to experiment operations such as run_fit() that defines all user-defined knobs as key-value pairs.
The keys can be any string and the values can be any Python data type (e.g., integer, float, or string).
You use these knobs in your MLSpec code functions to direct which code path a run must take.
Read more here: API: Config Knobs and AutoML.
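A sketch of using a user knob to pick a code path is shown below. The "user_knobs" section name follows this glossary; the knob name and branch logic are hypothetical examples:

```python
# Sketch of branching on a user-defined knob inside MLSpec-style code.
# The "arch" knob and the dict stand-ins for nn.Module are illustrative only.

def create_model(cfg):
    arch = cfg["user_knobs"]["arch"]   # per-run value injected via cfg
    if arch == "small":
        return {"layers": 2}           # stand-in for a small nn.Module
    elif arch == "large":
        return {"layers": 8}           # stand-in for a large nn.Module
    raise ValueError(f"unknown arch knob: {arch}")

model = create_model({"user_knobs": {"arch": "large"}})  # -> {"layers": 8}
```

Making "arch" set-valued in the config dictionary (e.g., ["small", "large"]) would then launch one run per architecture.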