Current Limitations and Updates Coming Soon

As of this writing, RapidFire AI has the following limitations. We are actively working on resolving these and welcome feedback on their utility for your use cases to help with prioritization.

Model Size

Any given run’s model, at its chosen batch size, must fit within a single worker instance’s collective GPU memory. Note that the model is NOT required to fit on a single GPU: RapidFire AI automatically shards it for you across the multiple GPUs of a worker.

In the future, we plan to expand support for even larger models that require sharding across workers.

Cap on Multi-Node Parallelism

When a run uses multiple worker instances, RapidFire AI uses PyTorch’s native DistributedDataParallel (DDP) across workers to reduce runtimes. But on commodity networks, communication overhead can quickly lead to sub-linear speedups, or even slowdowns, when a run uses too many workers. So, RapidFire AI caps the maximum number of workers per run with a configurable knob, max_multi_nodes.

By default, max_multi_nodes is set to 2, but you can change its value under a specially recognized “dev_knobs” section in the config dictionary given to run_fit(). We recommend raising max_multi_nodes only if you have very large batch sizes for all runs, have relatively few GPUs per worker, and/or the processing is very GPU-intensive for your model. Note that the SGD batch size you give is global, i.e., it is split across all GPUs of all workers. For example, suppose you have 4 GPUs per worker:

  • For a small CNN, it might be better to set max_multi_nodes to 3 with a batch size of 600.

  • For a larger Transformer, it might be better to set max_multi_nodes to 4 even with a batch size of 64.
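As a minimal sketch of the first bullet, assuming a plain-dictionary config: the “dev_knobs” section and max_multi_nodes are named in this page, but the other keys are placeholders, not the exact RapidFire AI config schema.

```python
# Hypothetical config dictionary passed to run_fit(); only "dev_knobs"
# and max_multi_nodes are documented names, the rest are illustrative.
config = {
    "batch_size": 600,  # global SGD batch size, split across all GPUs of all workers
    "dev_knobs": {
        "max_multi_nodes": 3,  # allow up to 3 workers for this run
    },
}

# With 3 workers of 4 GPUs each, the global batch of 600 yields
# 600 / (3 * 4) = 50 examples per GPU per step.
gpus_per_worker = 4
per_gpu_batch = config["batch_size"] // (
    config["dev_knobs"]["max_multi_nodes"] * gpus_per_worker
)
```

The arithmetic in the comment is why a small per-GPU batch (here 50) can still keep GPUs busy for a compute-heavy model but not for a small CNN.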

In the near future, we plan to automate the selection of max_multi_nodes in a run-specific manner. That will free you to focus even more on accuracy-oriented knobs, while being confident that RapidFire AI rightsizes all parallelism for you to both minimize runtime and maximize GPU utilization.

Script-based Jobs Without Jupyter

Jupyter is popular for exploration and initial development of the training/tuning pipeline. But teams often prefer to execute such long-running jobs as a script scheduled offline, perhaps nightly.

In the near future, we plan to release a fully command-line based utility to submit your RapidFire AI code to run on a RapidFire AI cluster without needing to use Jupyter.

Semi-Automated IC Ops

Triggering IC Ops manually from the dashboard is feasible only if there is a human in the loop. But IC Ops are useful even in offline scripted settings based on application logic, e.g., stop 90% of runs with poor validation metrics and clone-modify the top 10% to drill down into more fine-grained values for their knobs.
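The scripted selection logic in that example can be sketched in plain Python. The run IDs and metric values below are made up, and the eventual MLSpec API calls for stopping or clone-modifying runs are not shown because that API is not yet released.

```python
# Hypothetical: validation accuracy per run after some number of epochs.
val_accuracy = {f"run-{i}": 0.50 + 0.03 * i for i in range(10)}

# Rank runs from best to worst by the validation metric.
ranked = sorted(val_accuracy, key=val_accuracy.get, reverse=True)

# Keep the top 10% (at least one run) for clone-modify; stop the rest.
keep_count = max(1, len(ranked) // 10)
to_clone_modify = ranked[:keep_count]  # drill into finer-grained knob values
to_stop = ranked[keep_count:]          # runs with poor validation metrics
```

With 10 runs, this keeps exactly one run for clone-modify and stops the other nine.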

In the near future, we plan to update the MLSpec API to let you specify such custom semi-automation for IC Ops in code using the runs’ metrics and epoch counts.

Named Metrics Roster

DL is now the norm for hundreds of types of prediction and generation use cases in AI. In the near future, we plan to expand the roster of named metrics supported by RapidFire AI, as well as its API in general, to cover even more of such use cases. This includes categorical metrics such as precision, recall, F1, sensitivity, specificity, and AUROC; NLP-specific metrics such as ROUGE, BLEU, and perplexity; and CV-specific metrics such as the COCO metrics and CLIP score.

LLM-Specific Training, Fine-Tuning, and Inference

Finally, autoregressive promptable LLMs such as the Llama models are increasingly being customized with more specific techniques such as the following:

  • Parameter-Efficient Fine-Tuning (PEFT) methods such as Low-Rank Adaptation (LoRA) to avoid updating the original weight tensors.

  • Supervised Fine-Tuning (SFT) with labeled prompt-generation pairs.

  • Continued pre-training with self-supervision on unlabeled data.

  • Reinforcement learning-based methods such as Direct Preference Optimization (DPO) and Group Relative Policy Optimization (GRPO) involving a frozen reference model, a trainable policy model, and a reward function.

  • Optimized batch inference such as PagedAttention with vLLM.
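The LoRA idea from the first bullet can be sketched in pure Python: the original weight matrix W stays frozen, and only a low-rank update (alpha / r) * B @ A is learned. The shapes and values here are toy assumptions, not tied to any particular model or to the Hugging Face PEFT API.

```python
# Matrix-vector product over plain lists, to keep the sketch dependency-free.
def matvec(M, x):
    return [sum(m * v for m, v in zip(row, x)) for row in M]

W = [[1.0, 0.0], [0.0, 1.0]]  # frozen 2x2 base weight (identity, for illustration)
A = [[1.0, 1.0]]              # r x d = 1x2 down-projection (trainable)
B = [[0.5], [0.5]]            # d x r = 2x1 up-projection (trainable)
alpha, r = 2.0, 1             # LoRA scaling hyperparameters

def lora_forward(x):
    base = matvec(W, x)                 # frozen path: W @ x
    low_rank = matvec(B, matvec(A, x))  # adapter path: B @ (A @ x), rank r
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, low_rank)]

y = lora_forward([1.0, 2.0])
```

Because only A and B (2 x r x d parameters per adapted layer) are trained, the original weight tensors never need gradient updates, which is the point of the first bullet.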

We are actively working on updating the MLSpec API to more easily specify such LLM-specific forms of training and inference in RapidFire AI. It will offer first-class support for Hugging Face libraries such as PEFT and TRL, as well as for vLLM.