Dashboard: Interactive Control (IC) Ops

Interactive Control Operations (IC Ops) are a powerful and major differentiating aspect of RapidFire AI that enable our industry-first rapid experimentation capability for DL/LLM customization.

Motivation for IC Ops

IC Ops are a set of control operations over runs in flight in an ongoing experiment on the cluster. They are motivated by an often under-appreciated pain point felt by many AI developers:

  • It is impossible to tell upfront how accurate a given configuration of training knobs will be. Experimentation is the name of the game: one needs to try alternative configurations based on intuition about the use case, dataset, and model.

  • Not all configurations are created equal. One must be able to easily try and retry values, zoom into promising regions of values, adjust on the fly, etc. This helps reach a better accuracy quickly. Otherwise, one might squander labeled data and/or leave significant value on the table for the use case.

  • Even for an established AI application with a well-trained prior model, one may need to adapt knobs over time as the data distribution evolves (e.g., concept drift), the application schema evolves (e.g., the data collection process changes), newer/better models emerge (e.g., smaller but more capable LLMs), etc.

Generic MLOps tools treat different model runs as generic monolithic “jobs” and schedule them at a very coarse granularity that is not DL-specific. That leads to a big disconnect between what runs on the cluster and what is needed for customization.

RapidFire AI’s IC Ops alter the status quo by giving you a whole new level of control over runs in flight. In addition, IC Ops operate directly on top of our multidimensional-parallel engine. That means that, for the first time ever, one can perform such advanced control operations over multiple runs regardless of the size of the datasets and/or models. There is no need to juggle disparate data-parallel-only or model-parallel-only tools, wrestle with low-level setup details, copy datasets and models manually, or generally struggle with multi-GPU clusters for your AI work.

Semantics of IC Ops

Our IC Ops are meant to be used after you have launched a run_fit() op in your experiment. You can access the IC Op panel by clicking on any run’s curve on any metrics plot in the Chart view of the ML metrics dashboard under the “Experiments” tab; also see the dashboard overview here.

As of this writing, we support 4 IC Ops: Stop, Resume, Clone-Modify, and Delete. We will shortly explain the semantics of each with screenshots.

Note that all IC Ops are queued by the system and executed at an epoch boundary, for all runs in one go. This helps avoid non-deterministic or otherwise inconsistent behavior due to concurrent execution of runs.
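The queue-then-apply behavior can be pictured with a small toy model. This is only an illustrative sketch and not RapidFire AI's implementation: the class name, the state labels, and the method names are all hypothetical, and Clone-Modify is left out for brevity.

```python
from collections import deque


class EpochBoundaryQueue:
    """Toy model: ops are queued on request and applied together at an epoch boundary."""

    def __init__(self, run_states):
        # run_states maps a run id to "ongoing" or "stopped" (hypothetical labels).
        self.run_states = dict(run_states)
        self.pending = deque()

    def request(self, op, run_id):
        # A request only enqueues the op; nothing changes mid-epoch.
        self.pending.append((op, run_id))

    def end_epoch(self):
        # Drain the queue in submission order once the epoch has finished.
        applied = []
        while self.pending:
            op, run_id = self.pending.popleft()
            if op == "stop":
                self.run_states[run_id] = "stopped"
            elif op == "resume":
                self.run_states[run_id] = "ongoing"
            elif op == "delete":
                self.run_states.pop(run_id, None)
            applied.append((op, run_id))
        return applied


queue = EpochBoundaryQueue({"run1": "ongoing", "run2": "ongoing"})
queue.request("stop", "run1")
mid_epoch = dict(queue.run_states)    # snapshot taken before the boundary
queue.end_epoch()
after_epoch = dict(queue.run_states)  # snapshot taken after the boundary
```

In this sketch the stop request has no visible effect on `mid_epoch`; the state change shows up only in `after_epoch`, mirroring the description above.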

IC Ops can be used as often or as rarely as you like throughout the lifetime of a potentially long-running run_fit() op in your experiment. This means you can, say, launch 20 knob combinations in one go, then check after a couple of epochs and delete half of those runs. You can let the others continue for a couple more epochs, then stop a few runs you believe are not promising, clone and modify the promising ones to add more runs, etc.

Under the hood, RapidFire AI automatically adjusts the apportioning of the GPUs among the ongoing runs to ensure maximal utilization. So, even if you bring it down to just a single run at the end, it will use the whole cluster automatically.
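To make the idea of reapportioning concrete, here is a minimal sketch of one simple policy: an even split of GPUs across active runs, with leftovers handed out one at a time. The policy RapidFire AI actually uses is not described here, so treat this purely as an assumption for illustration; the function name `apportion_gpus` is hypothetical.

```python
def apportion_gpus(num_gpus, run_ids):
    """Split num_gpus as evenly as possible across the given active runs.

    Hypothetical policy for illustration only. If there are more runs than
    GPUs, some runs receive 0 in this toy model.
    """
    if not run_ids:
        return {}
    base, extra = divmod(num_gpus, len(run_ids))
    # The first `extra` runs each receive one additional GPU.
    return {
        run_id: base + (1 if i < extra else 0)
        for i, run_id in enumerate(run_ids)
    }


four_runs = apportion_gpus(8, ["a", "b", "c", "d"])
three_runs = apportion_gpus(8, ["a", "b", "c"])
one_run = apportion_gpus(8, ["a"])
```

Under this assumed policy, a single remaining run ends up with all 8 GPUs, which matches the behavior described above for the case where only one run is left.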

Stop

This IC Op earmarks a run to be stopped at the end of its current epoch. It will still be alive but it will not use any GPU resources from the next epoch. Note that you will still see its minibatch-level plots advancing for the current epoch. You cannot stop an already stopped or deleted run.

_images/08-icops-stop.png

Resume

This IC Op is applicable only to a previously stopped run. It earmarks this run to be resumed from the next epoch onward, when it will be added to the mix of ongoing runs and assigned GPUs from the cluster automatically. You cannot resume an already resumed or deleted run.

Clone-Modify

This is a powerful IC Op that is applicable to any ongoing, stopped, or resumed run. It allows you to inject “clones” of a run, called the “parent” run, while a run_fit() is in progress by clicking on any of the parent’s metrics curves on the “Chart” tab. The clones can tweak any knobs of the parent, e.g., learning rate, batch size, or even user knobs for data preprocessing (e.g., image transforms or resizing, time series window length, text tokenization, etc.) or model architecture definition (e.g., the layer to do transfer learning from or the number of new layers to add).

The IC Op panel displays a text box with the full knob config dictionary of the parent. You can edit that config to alter the knob values as if you were giving a new config-group for a new run_fit() on the fly from the notebook, except that this is all done from the metrics dashboard.

You can inject a single new config dictionary, or a whole config-group via a GridSearch() or RandomSearch(); see this page for details on config group generators. Note that if you change any user_knobs in the config, you are changing the codepath taken in your create_model(), compute_forward(), etc. in MLSpec as well.
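As a rough picture of what a grid-style config-group amounts to, the sketch below expands a parent config into one clone config per combination of overridden knob values. It uses plain Python and deliberately does not call GridSearch(), whose exact signature is not shown here; the helper `expand_grid` and the knob names `lr`, `batch_size`, and `epochs` are hypothetical.

```python
import copy
from itertools import product


def expand_grid(parent_config, overrides):
    """Return one clone config per combination of the override values.

    parent_config: dict of knob name to value (hypothetical knob names).
    overrides: dict of knob name to a list of candidate values.
    """
    keys = list(overrides)
    clones = []
    for values in product(*(overrides[k] for k in keys)):
        # Deep-copy so nested knobs in the parent are never shared or mutated.
        clone = copy.deepcopy(parent_config)
        clone.update(zip(keys, values))
        clones.append(clone)
    return clones


parent = {"lr": 1e-3, "batch_size": 32, "epochs": 10}
clones = expand_grid(parent, {"lr": [1e-4, 5e-4], "batch_size": [32, 64]})
```

Two candidate learning rates and two candidate batch sizes yield four clone configs, each inheriting every other knob from the parent unchanged.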

You can also warm-start the clones using the parent’s weights; just select that option button. Warm-started clones typically build on top of their parent’s accuracy, allowing you to converge to a high accuracy even faster. One caveat for warm-started clones is that their model architecture must be identical to the parent’s; otherwise, the op will error out.
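One way to think about the "identical architecture" requirement is as a comparison of parameter names and shapes between parent and clone. The sketch below does that comparison on plain dictionaries. How RapidFire AI performs its own check is not documented here, so this is an assumption; the helper name and the layer names are hypothetical.

```python
def architecture_mismatches(parent_shapes, clone_shapes):
    """List parameters whose shape differs or that exist on only one side.

    Both arguments map a parameter name to a shape tuple. An empty result
    means the parent's weights could be loaded into the clone as-is.
    """
    problems = []
    for name in sorted(set(parent_shapes) | set(clone_shapes)):
        parent_shape = parent_shapes.get(name)  # None if missing in the parent
        clone_shape = clone_shapes.get(name)    # None if missing in the clone
        if parent_shape != clone_shape:
            problems.append((name, parent_shape, clone_shape))
    return problems


parent_shapes = {"backbone.weight": (512, 256), "fc.weight": (10, 512)}
same_arch = {"backbone.weight": (512, 256), "fc.weight": (10, 512)}
new_head = {"backbone.weight": (512, 256), "fc.weight": (5, 512)}

ok = architecture_mismatches(parent_shapes, same_arch)
bad = architecture_mismatches(parent_shapes, new_head)
```

A clone that only changes knobs such as the learning rate keeps the same shapes and passes this check, while a clone that changes the output layer size does not.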

When you are ready with your clones’ config-group, click “Submit” to execute this IC Op.

_images/08-icops-clone1.png

The clones will automatically appear on the plots from the next epoch onward. Under the hood, RapidFire AI automatically reapportions GPUs across all runs, including clones. So, you never need to worry about manually splitting GPUs across models, placing new jobs on clusters, etc.

_images/08-icops-results1.png _images/08-icops-results2.png

You can submit multiple Clone-Modify ops on the same run or different runs within the same epoch. They will get queued up and all clones will start together next epoch.

Clone-Modify combined with Stop enables you to turbocharge how you inject your intuition about your AI use case, data, and/or models to dramatically cut down the time to higher accuracy, even within a single experiment.

Delete

This IC Op is applicable to a previously stopped run. It earmarks this run to be deleted from the next epoch onward, when it will be permanently removed from the set of runs. On the chart, you will see its curves vanish almost immediately. You cannot perform any further IC Ops on a deleted run because it will no longer be visible.

Note that although a deleted run vanishes from the plots, its model checkpoints are still part of the artifacts of that experiment, so you retain post-hoc auditability.
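The applicability rules stated in the Stop, Resume, Clone-Modify, and Delete sections can be summarized as a small state table. This is a sketch that follows the text as written (for example, Delete applying to a stopped run); the state labels and the function `apply_op` are hypothetical, and a resumed run is modeled simply as "ongoing" again.

```python
# Which run states each IC Op may be applied to, per the descriptions above.
ALLOWED_STATES = {
    "stop": {"ongoing"},
    "resume": {"stopped"},
    "clone_modify": {"ongoing", "stopped"},
    "delete": {"stopped"},
}

# State of the run after the op takes effect at the epoch boundary.
# Clone-Modify adds new runs but leaves the parent's state unchanged.
NEXT_STATE = {"stop": "stopped", "resume": "ongoing", "delete": "deleted"}


def apply_op(state, op):
    """Return the run's new state, or raise ValueError if the op is not applicable."""
    if state not in ALLOWED_STATES[op]:
        raise ValueError(f"cannot apply {op!r} to a run that is {state!r}")
    return NEXT_STATE.get(op, state)


state = "ongoing"
state = apply_op(state, "stop")          # ongoing -> stopped
state = apply_op(state, "clone_modify")  # parent stays stopped
state = apply_op(state, "resume")        # stopped -> ongoing
```

No op lists "deleted" as an allowed state, which reflects the statement that a deleted run accepts no further IC Ops.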

Templated Automation of IC Ops

IC Ops are a powerful capability to dramatically cut down time to accuracy even at full scale. Based on feedback, we also plan to support invoking IC Ops from an automated templated script. This can help ensure a consistent policy across your data science team and DL applications, if you’d like.