API: Config Knobs and AutoML
The concept of a config with knobs is fundamental to the RapidFire AI API. It controls several aspects of a given run/model during its training and/or inference, unifying and generalizing the notions of hyperparameter tuning, optimizer selection, architecture selection, fine-tuning, layerwise transfer learning, data representation changes, and more.
In its most basic form, a config is just a Dict[str, Any] dictionary in which each key is a string identifying a knob and each value has a data type associated with that knob.
We dive into the structure and semantics of a config and its knobs with the following example adapted from the ImageNet tutorial notebook.
from transformers.models.vit.modeling_vit import ViTLayer

config = {
    "user_knobs": {
        "model_type": "vit",
    },
    "train": {
        "epochs": 10,
        "batch_size": 64,
        "optimizer": {
            "name": "Adam",
            "args": {
                "lr": 1e-4,
                "weight_decay": 1e-5,
            },
        },
        "lr_scheduler": {
            "name": "StepLR",
            "args": {
                "step_size": 1000,
            },
        },
    },
    "named_metrics": ["top1_accuracy", "top5_accuracy"],
    "fsdp_layer_cls": {ViTLayer},  # Hint for handling large model with FSDP
}
As illustrated above, the config has 4 (sub)sections, of which train is the only required section.
Let us dive deeper into each section of the config.
Train Knobs
These are core knobs needed when executing the functions in MLSpec.
For readability, these must be listed under a named train section of the config.
At a minimum, you must specify epochs, batch_size, and optimizer in the config.
Note that optimizer and lr_scheduler are nested dictionaries themselves.
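As a rough illustration of the minimum requirements above, the sketch below checks a plain config dictionary for the required train knobs. The helper check_train_section is hypothetical and not part of the RapidFire AI API; it assumes only what this section states, namely that train is required and must contain epochs, batch_size, and optimizer, and that optimizer and lr_scheduler carry a name key.

```python
from typing import Any

# Required keys as stated in this section; the helper itself is hypothetical.
REQUIRED_TRAIN_KNOBS = ("epochs", "batch_size", "optimizer")


def check_train_section(config: dict[str, Any]) -> list[str]:
    """Return a list of human-readable problems; an empty list means none were found."""
    train = config.get("train")
    if not isinstance(train, dict):
        return ["missing required 'train' section"]
    problems = []
    for knob in REQUIRED_TRAIN_KNOBS:
        if knob not in train:
            problems.append(f"train section is missing '{knob}'")
    # optimizer and lr_scheduler are nested dictionaries with a 'name' key.
    for nested in ("optimizer", "lr_scheduler"):
        spec = train.get(nested)
        if spec is None:
            continue
        if not isinstance(spec, dict) or "name" not in spec:
            problems.append(f"'{nested}' must be a dictionary with a 'name' key")
    return problems


example = {
    "train": {
        "epochs": 10,
        "optimizer": {"name": "Adam", "args": {"lr": 1e-4}},
    }
}
print(check_train_section(example))  # ["train section is missing 'batch_size'"]
```

Running such a check before launching a large group of runs is one way to catch a missing knob early; the system may well perform its own validation.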
The optimizer knob indicates which PyTorch-native optimizer to use for gradient-based training.
We support any from the torch.optim module: Adam, AdamW, Adagrad, Adadelta, RMSprop, (plain) SGD, etc.
Indicate the optimizer to use under the name key.
Any arguments relevant for your chosen optimizer must be provided in the args nested dictionary with the same key string as what that PyTorch class expects, e.g., lr for learning rate.
Likewise, the optional lr_scheduler knob indicates which PyTorch-native learning rate scheduler to use during optimization.
We support any from the torch.optim.lr_scheduler module: StepLR, MultiStepLR, ExponentialLR, LinearLR, CosineAnnealingLR, etc.
Indicate the learning rate scheduler to use under the name key.
Any arguments relevant for it must be provided in the args nested dictionary with the same key string as what that PyTorch class expects, e.g., step_size for StepLR.
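To make the name/args convention concrete, here is a minimal sketch of how such a nested dictionary could be resolved to a class and its keyword arguments. This is an assumption about one plausible mechanism, not a description of RapidFire AI internals. The function resolve_by_name is hypothetical, and FakeStepLR is a stand-in that only mirrors the leading arguments of StepLR so the example runs without PyTorch installed.

```python
from types import SimpleNamespace
from typing import Any, Callable


def resolve_by_name(module: Any, spec: dict[str, Any]) -> Callable[[Any], Any]:
    """Look up spec['name'] in `module` and pre-bind spec['args'] as keyword arguments."""
    try:
        cls = getattr(module, spec["name"])
    except AttributeError:
        raise ValueError(f"unknown name {spec['name']!r}") from None
    kwargs = dict(spec.get("args", {}))
    # The caller supplies the remaining positional argument later
    # (model parameters for an optimizer, the optimizer for a scheduler).
    return lambda first_arg: cls(first_arg, **kwargs)


class FakeStepLR:
    """Stand-in with the same leading arguments as torch.optim.lr_scheduler.StepLR."""

    def __init__(self, optimizer: Any, step_size: int, gamma: float = 0.1) -> None:
        self.optimizer, self.step_size, self.gamma = optimizer, step_size, gamma


fake_module = SimpleNamespace(StepLR=FakeStepLR)
make_scheduler = resolve_by_name(fake_module, {"name": "StepLR", "args": {"step_size": 1000}})
scheduler = make_scheduler("optimizer-placeholder")
print(scheduler.step_size, scheduler.gamma)  # 1000 0.1
```

With PyTorch installed, passing torch.optim or torch.optim.lr_scheduler as the module would resolve the real classes, and a misspelled key in args would surface as a TypeError from the class constructor. That is why the key strings must match what the PyTorch class expects.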
Named Metrics
Also optional but useful is the named_metrics knob within the train section.
Named-metrics in RapidFire AI are well-known ML metrics that our system can automatically calculate across epochs based on just the outputs of your compute_forward() function in your MLSpec.
Named-metrics are a syntactic convenience inspired by torchmetrics so that you do not need to write these as custom metrics.
As of this writing, we support only the following 2 named-metrics:
top1_accuracy: A common metric used for binary or multi-class classification tasks.
top[k]_accuracy: A common metric used for multi-class classification tasks. Here, [k] is a placeholder for any integer above 1, e.g., top5_accuracy or top3_accuracy.
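For reference, the standard definition of top-k accuracy can be sketched in plain Python as follows. This only illustrates what the metric measures; how RapidFire AI derives it from the outputs of compute_forward() is not specified here, and the helper names parse_k and topk_accuracy are hypothetical.

```python
import re
from typing import Sequence


def parse_k(metric_name: str) -> int:
    """Extract k from a name such as 'top5_accuracy'."""
    match = re.fullmatch(r"top(\d+)_accuracy", metric_name)
    if match is None or int(match.group(1)) < 1:
        raise ValueError(f"not a top-k accuracy name: {metric_name!r}")
    return int(match.group(1))


def topk_accuracy(scores: Sequence[Sequence[float]], labels: Sequence[int], k: int) -> float:
    """Fraction of examples whose true label is among the k highest-scoring classes."""
    hits = 0
    for row, label in zip(scores, labels):
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        hits += label in ranked[:k]
    return hits / len(labels)


scores = [
    [0.1, 0.7, 0.2],  # best class 1, true class 1
    [0.5, 0.3, 0.2],  # best class 0, true class 1
    [0.2, 0.3, 0.5],  # best class 2, true class 0
    [0.6, 0.3, 0.1],  # best class 0, true class 0
]
labels = [1, 1, 0, 0]
print(topk_accuracy(scores, labels, parse_k("top1_accuracy")))  # 0.5
print(topk_accuracy(scores, labels, parse_k("top2_accuracy")))  # 0.75
```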
We plan to continue expanding this API and add more functionality based on feedback.
User Knobs
These are optional knobs that you define for use in any function of MLSpec that has a cfg argument injected by the system.
For readability, these must be listed under a named user_knobs section of the config.
User knobs help you compare, e.g., different base model architectures, different layer specifics to add to a model’s head for transfer learning, different loss function choices, different LoRA adapter specifics for LLM fine-tuning, and different data input/output tensorization specifics (e.g., image transforms or resizing, time series windowing, or text tokenization).
In the above example, model_type is a user knob that gets used in the create_model() and compute_forward() functions. Please refer to the full ImageNet tutorial notebook for more details.
As another example, you can define image_size and transform_type as user knobs for your row_prep() to decide how much to resize your image and what augmentation transforms to apply.
In general, user knobs are a powerful way to amplify how many configurations you can launch together in one run_fit(), as well as what you can modify from the IC Ops panel.
This offers you maximum flexibility for hyperparallel exploration of variables that affect your AI accuracy and other metrics.
You have full flexibility to define whatever knob key-value pairs you want, as long as their key strings do not conflict with the reserved strings above.
User knob values can be of any Python data type, including int, float, str, List, Dict, etc. Just use them appropriately in your MLSpec functions.
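The following sketch shows one way user knobs such as image_size and transform_type might be consumed. Several things here are assumptions made for illustration: that user knobs are reachable under cfg["user_knobs"], the helper name plan_image_prep, the default values, and the knob values "flip" and "flip_and_crop". The actual row_prep() signature is not shown in this section, so the helper only returns the names of the steps it would apply.

```python
from typing import Any


def plan_image_prep(cfg: dict[str, Any]) -> list[str]:
    """Turn two user knobs into an ordered list of preprocessing step names."""
    # Assumption: user knobs are reachable under cfg["user_knobs"].
    knobs = cfg.get("user_knobs", {})
    image_size = knobs.get("image_size", 224)
    transform_type = knobs.get("transform_type", "none")

    steps = [f"resize_to_{image_size}x{image_size}"]
    if transform_type == "flip":
        steps.append("random_horizontal_flip")
    elif transform_type == "flip_and_crop":
        steps += ["random_horizontal_flip", "random_crop"]
    elif transform_type != "none":
        raise ValueError(f"unknown transform_type: {transform_type!r}")
    return steps


cfg = {"user_knobs": {"image_size": 128, "transform_type": "flip"}}
print(plan_image_prep(cfg))  # ['resize_to_128x128', 'random_horizontal_flip']
```

Because the behavior is driven entirely by knob values, the same function body can serve every configuration in a group without modification.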
Other Knobs
As of this writing, we provide two other optional knobs: one related to demo mode for the tutorial use case notebooks provided and a hint knob related to large models that do not fit on a single GPU for the given batch size.
For the demo mode, we have a dev_knob section with demo_mode as a single key.
Its values can be “imagenet” or “imdb”, representing some pre-defined choices in how models and data are partitioned across GPUs.
Enabling demo mode means your run_fit() will start a couple of minutes sooner, with the reduction depending on the number of runs given.
If you change the model architecture in those notebooks or if you are running your own use case with your own models, we recommend not using the demo mode knobs.
The FSDP hint knob is needed for large models that are executed today with a lower-level library such as FSDP.
As of this writing, we do not automatically identify layer wrapping with FSDP in the most efficient manner. So, we suggest you provide a hint via the fsdp_layer_cls knob to indicate which layer(s) might need to be wrapped with FSDP wrap policies for cross-GPU model sharding.
In the above example, since we use a large ViT model from HF transformers, we indicate ViTLayer as the hint value. Note that to be able to use that class name, you also need to import it from the relevant library, which is transformers.models.vit.modeling_vit here.
We plan to continue refining this API and add more automation on this front for large models based on feedback.
Value Set Generators
It is common practice in ML to launch and compare multiple models/runs in one go for hyperparameter tuning by using grid search, random search, AutoML heuristics, etc. to generate combinations of knob values.
RapidFire AI generalizes that notion manyfold to enable you to launch and compare combinations of any knob values in your config, not just hyperparameters but also model architecture knobs, optimizer knobs, or any other user-provided knobs.
To enable such group-level generation of configs, we provide two set generators for values of knobs and a series of whole config group generators that take a config with set generators in some values.
We currently support two common set generators: LIST() for a discrete set of values and RANGE() for sampling from a continuous value interval.
- LIST(values: List[Any])
- Parameters:
values – List of discrete values for a knob; all values must be the same Python data type.
- RANGE(start: int | float, end: int | float, dtype: str [= "int" or "float"])
Notes:
As of this writing, RANGE() performs uniform sampling within the given interval.
We plan to continue refining this API and add more functionality on this front based on feedback.
Note that the return types of the set generators are internal to RapidFire AI and they are usable only within the context of our config-group generators.
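The semantics of the two set generators can be pictured with the stand-in classes below. DiscreteSet and Interval are hypothetical illustrations, not the RapidFire AI classes, whose return types are internal as noted above. The sketch assumes uniform sampling in both cases and, for integer intervals, treats both bounds as inclusive; the real boundary behavior is not specified in this section.

```python
import random
from dataclasses import dataclass
from typing import Any, Union


@dataclass(frozen=True)
class DiscreteSet:
    """Stand-in for LIST(): a finite set of same-typed candidate values."""
    values: tuple

    def __post_init__(self) -> None:
        if len({type(v) for v in self.values}) > 1:
            raise TypeError("all values must share one Python data type")

    def sample(self, rng: random.Random) -> Any:
        return rng.choice(self.values)


@dataclass(frozen=True)
class Interval:
    """Stand-in for RANGE(): uniform sampling between start and end."""
    start: Union[int, float]
    end: Union[int, float]
    dtype: str = "float"

    def sample(self, rng: random.Random) -> Union[int, float]:
        if self.dtype == "int":
            # Assumption: integer bounds are inclusive on both ends.
            return rng.randint(int(self.start), int(self.end))
        return rng.uniform(self.start, self.end)


rng = random.Random(0)
batch_sizes = DiscreteSet((16, 32))
learning_rates = Interval(1e-4, 9e-4, dtype="float")
print(batch_sizes.sample(rng), learning_rates.sample(rng))
```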
Config Group Generators
We currently support two common config group generators: GridSearch() for grid search and RandomSearch() for random search.
Notes:
For GridSearch(), each knob can have either a single value or a LIST() of values, but no knob should have a RANGE() of values; otherwise, it will error out.
For RandomSearch(), each knob can have a single value, a LIST() of values, or a RANGE() of values. The sampling semantics are independent and identically distributed (IID), i.e., we pick a value uniformly at random from each discrete set and from each continuous set to construct the knob combination for one run.
Then we repeat that sampling process in an IID way to accumulate num_runs distinct combinations.
Note that the return types of the config group generators are internal to RapidFire AI and they are usable only within the context of run_fit() in the Experiment class.
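The two expansion strategies described in the notes above can be sketched over a nested config as follows. This is an illustrative reading of those semantics under stated assumptions, not the RapidFire AI implementation. Choices is a hypothetical stand-in for a discrete value set, grid and random_distinct are hypothetical helpers, and rejecting duplicates until num_runs distinct combinations exist is just one way to obtain distinct samples. Continuous ranges are left out to keep the example short.

```python
import itertools
import math
import random
from typing import Any, Iterator


class Choices:
    """Stand-in for a discrete value set; not the RapidFire AI class."""

    def __init__(self, values: list) -> None:
        self.values = list(values)


def find_sets(node: Any, path: tuple = ()) -> Iterator[tuple]:
    """Yield (path, Choices) for every value set nested inside the config."""
    if isinstance(node, Choices):
        yield path, node
    elif isinstance(node, dict):
        for key, child in node.items():
            yield from find_sets(child, path + (key,))


def substitute(node: Any, picks: dict, path: tuple = ()) -> Any:
    """Copy the config, replacing each value set with the value picked for its path."""
    if isinstance(node, Choices):
        return picks[path]
    if isinstance(node, dict):
        return {k: substitute(v, picks, path + (k,)) for k, v in node.items()}
    return node


def grid(config: dict) -> list[dict]:
    """Cartesian product over all value sets in the config."""
    found = list(find_sets(config))
    paths = [p for p, _ in found]
    combos = itertools.product(*(s.values for _, s in found))
    return [substitute(config, dict(zip(paths, combo))) for combo in combos]


def random_distinct(config: dict, num_runs: int, seed: int = 0) -> list[dict]:
    """Draw each knob independently and uniformly until num_runs distinct combinations exist."""
    found = list(find_sets(config))
    paths = [p for p, _ in found]
    if num_runs > math.prod(len(s.values) for _, s in found):
        raise ValueError("num_runs exceeds the number of distinct combinations")
    rng = random.Random(seed)
    seen, runs = set(), []
    while len(runs) < num_runs:
        combo = tuple(rng.choice(s.values) for _, s in found)
        if combo not in seen:
            seen.add(combo)
            runs.append(substitute(config, dict(zip(paths, combo))))
    return runs


config = {
    "user_knobs": {"model_type": Choices(["resnet", "vgg"])},
    "train": {
        "epochs": 3,
        "batch_size": Choices([128, 256]),
        "optimizer": {"name": "Adam", "args": {"lr": Choices([1e-4, 1e-6]), "weight_decay": 1e-6}},
    },
}
print(len(grid(config)))                # 8
print(len(random_distinct(config, 5)))  # 5
```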
Examples:
# Based on ImageNet tutorial notebook
from rapidfire.automl import GridSearch, List

# Grid search over 2 model types x 2 hyperparameters with 2 values each = 8 configs in group
config_group = GridSearch({
    'user_knobs': {
        'model_type': List(["resnet", "vgg"]),
    },
    'train': {
        'epochs': 3,
        'batch_size': List([128, 256]),
        'optimizer': {
            'name': "Adam",
            'args': {
                'lr': List([1e-4, 1e-6]),
                'weight_decay': 1e-6
            },
        },
    },
    'named_metrics': ['top1_accuracy', 'top5_accuracy']
})
# Based on IMDB tutorial notebook
from rapidfire.automl import RandomSearch, List, Range

# Random search over 3 knob value sets to produce 7 configs in group
config_group = RandomSearch({
        'train': {
            'epochs': 5,
            'batch_size': List([16, 32]),
            'optimizer': {
                'name': "Adam",
                'args': {
                    'lr': Range(1e-4, 9e-4, dtype='float'),
                    'weight_decay': List([1e-4, 1e-5]),
                },
            },
        },
        'named_metrics': ['top1_accuracy']
    },
    num_runs=7
)
AutoML Heuristics
While grid and random searches are simple, intuitive, and popular in practice, they could also be wasteful in that they produce all combinations upfront without any performance data being factored in across epochs. This is where so-called Automated ML, or AutoML, heuristics help.
There is much literature on AutoML heuristics, but at their core they use the metrics of different runs over time to stop low-accuracy runs early, promote high-accuracy runs, and/or generate new runs on the fly. However, a common criticism of such heuristics in practice is that their decisions are too opaque, they may waste resources unintentionally, and/or their own meta-hyperparameters are too unintuitive to set or tune.
RapidFire AI’s powerful Interactive Control panel largely obviates the need to be stuck with a particular AutoML heuristic’s decisions. But to give users full flexibility, we also support some AutoML heuristics on top of our elastic engine for ease of use.
As of this writing, we support one well-known AutoML heuristic: SuccessiveHalving (SHA), explained in this paper.
- SuccessiveHalving(cfg: Dict[str, Any], num_runs: int, min_epochs: int, max_epochs: int, reduction_factor: int, metric: str, direction: str [= "min" or "max"], min_early_stopping_rate: int=0)
- Parameters:
cfg (Dict[str, Any]) – A config dictionary with RANGE() for at least one knob value; no knob should have LIST()
num_runs (int) – Number of runs/combinations of knob values to sample in total that start together
min_epochs (int) – Number of epochs after which early stopping check is applied
max_epochs (int) – Largest number of epochs for any run; overrides epochs knob in config (if any)
reduction_factor (int) – Factor of runs that continue after each early stopping check
metric (str) – Any validation metric defined for MLSpec; this can be the loss, any of the named-metrics given in config, or a user-defined custom metric’s key string
direction (str [= "min" or "max"]) – Direction of optimization; must be one of two fixed strings (“min” or “max”)
min_early_stopping_rate (int, optional) – Please see the SHA paper for details (default: 0)
We plan to continue expanding this API and add support for more AutoML heuristics based on feedback.
Example:
# Based on ImageNet tutorial notebook
from rapidfire.automl import SuccessiveHalving, Range

# SHA with 8 configs in group
config_group = SuccessiveHalving({
        'train': {
            'epochs': 10,
            'batch_size': 128,
            'optimizer': {
                'name': 'Adam',
                'args': {
                    'lr': Range(1e-5, 9e-5, dtype='float'),
                    'weight_decay': Range(1e-4, 1e-3, dtype='float'),
                },
            },
        },
        'named_metrics': ['top1_accuracy', 'top5_accuracy']
    },
    num_runs=8,
    min_epochs=1,
    max_epochs=8,
    reduction_factor=2,
    metric='top5_accuracy',
    direction='max'
)
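To build intuition for how the SHA parameters interact, the sketch below computes a pruning schedule for num_runs=8, min_epochs=1, max_epochs=8, and reduction_factor=2. It assumes the textbook geometric schedule with min_early_stopping_rate left at 0: checks happen at min_epochs multiplied by successive powers of reduction_factor, and roughly one in reduction_factor runs survives each check. The helper sha_schedule is hypothetical, and the exact check points used by RapidFire AI may differ. Which runs survive depends on metric and direction, which the sketch does not model.

```python
import math


def sha_schedule(num_runs: int, min_epochs: int, max_epochs: int,
                 reduction_factor: int) -> list[tuple[int, int, int]]:
    """Return (check_epoch, runs_before, runs_after) for each early stopping check."""
    if min_epochs < 1 or reduction_factor < 2:
        raise ValueError("min_epochs must be >= 1 and reduction_factor >= 2")
    schedule = []
    alive, epoch = num_runs, min_epochs
    # Stop checking once max_epochs is reached or only one run is left.
    while epoch < max_epochs and alive > 1:
        survivors = max(1, math.ceil(alive / reduction_factor))
        schedule.append((epoch, alive, survivors))
        alive = survivors
        epoch *= reduction_factor
    return schedule


for check_epoch, before, after in sha_schedule(8, 1, 8, 2):
    print(f"after epoch {check_epoch}: keep {after} of {before} runs")
# after epoch 1: keep 4 of 8 runs
# after epoch 2: keep 2 of 4 runs
# after epoch 4: keep 1 of 2 runs
```

Under these assumptions only one run trains for the full 8 epochs, which is where the resource savings over running all 8 configs to completion come from.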
Advanced: Lists of Configs or Config Group Generators
The ml_config argument in run_fit() is very versatile in allowing you to construct various knob combinations and launch them simultaneously.
It can be a single config dictionary, a regular Python List of config dictionaries, a config-group generator output (via GridSearch(), RandomSearch(), or AutoML heuristic), or even a List with a mix of configs or config-group generator outputs as its elements.
So, you are not limited to launching just a single grid/random search at a time as shown in the tutorial use case notebooks. You have full flexibility to specify what sets of runs to launch together, as shown in the more advanced usage example below.
from transformers.models.vit.modeling_vit import ViTLayer
from rapidfire.automl import GridSearch, RandomSearch, List, Range

# Config-group list with a single dictionary, a grid search, and a random search together
# This list yields 1 + 2 * 2 + 6 = 11 runs in total
config_group = [
    {
        'user_knobs': {
            'model_type': "vit",
        },
        'train': {
            'epochs': 1,
            'batch_size': 256,
            'optimizer': {
                'name': "Adam",
                'args': {
                    'lr': 1e-4,
                    'weight_decay': 1e-6,
                },
            },
        },
        "named_metrics": ["top1_accuracy", "top5_accuracy"],
        "fsdp_layer_cls": {ViTLayer},  # Hint for large model handling with FSDP
    },
    GridSearch({
        'user_knobs': {
            'model_type': "vit",
        },
        'train': {
            'epochs': 1,
            'batch_size': 128,
            'optimizer': {
                'name': "Adam",
                'args': {
                    'lr': List([1e-4, 1e-5]),
                    'weight_decay': List([1e-6, 1e-5]),
                },
            },
        },
        "named_metrics": ["top1_accuracy", "top5_accuracy"],
        "fsdp_layer_cls": {ViTLayer},  # Hint for large model handling with FSDP
    }),
    RandomSearch({
            'user_knobs': {
                'model_type': "resnet",
            },
            'train': {
                'epochs': 1,
                'batch_size': List([64, 128, 256]),
                'optimizer': {
                    'name': "Adam",
                    'args': {
                        'lr': Range(1e-4, 9e-4, dtype='float'),
                        # Interval given as (start, end) with start <= end
                        'weight_decay': Range(1e-5, 1e-4, dtype="float"),
                    },
                },
            },
            "named_metrics": ["top1_accuracy", "top5_accuracy"],
        },
        num_runs=6
    )
]
# Launch training and validation for given hybrid mix of configs
>>> myexp.run_fit(ml_config=config_group, seed=42)
Creating RapidFire Workers ...
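The run count in the example above (1 + 4 + 6 = 11) can be checked with a small sketch of how a mixed ml_config might be flattened into individual configs. FakeGroup and flatten are hypothetical stand-ins written for this illustration; the real config-group generator outputs are internal to RapidFire AI and may be expanded differently.

```python
from typing import Union


class FakeGroup:
    """Stand-in for a config-group generator output that already knows its configs."""

    def __init__(self, configs: list[dict]) -> None:
        self.configs = configs


def flatten(ml_config: Union[dict, FakeGroup, list]) -> list[dict]:
    """Expand a dict, a group, or a list mixing both into one flat list of configs."""
    if isinstance(ml_config, dict):
        return [ml_config]
    if isinstance(ml_config, FakeGroup):
        return list(ml_config.configs)
    if isinstance(ml_config, list):
        flat = []
        for item in ml_config:
            flat.extend(flatten(item))
        return flat
    raise TypeError(f"unsupported ml_config element: {type(ml_config).__name__}")


single = {"train": {"epochs": 1}}
grid_like = FakeGroup([{"train": {"epochs": 1, "id": i}} for i in range(4)])
random_like = FakeGroup([{"train": {"epochs": 1, "id": i}} for i in range(6)])
print(len(flatten([single, grid_like, random_like])))  # 11
```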