API: Config Knobs and AutoML

The concept of a config with knobs is fundamental to the RapidFire AI API. It controls several aspects of a given run/model during its training and/or inference, unifying and generalizing the notions of hyperparameter tuning, optimizer selection, architecture selection, fine-tuning, layerwise transfer learning, data representation changes, and more.

In its most basic form, a config is just a Dict[str, Any] dictionary in which each key is a string identifying a knob and each value is that knob's setting, of whatever data type the knob expects.

We dive into the structure and semantics of a config and its knobs with the following example adapted from the ImageNet tutorial notebook.

from transformers.models.vit.modeling_vit import ViTLayer

config = {
        "user_knobs": {
                "model_type": "vit",
        },
        "train": {
                "epochs": 10,
                "batch_size": 64,
                "optimizer": {
                        "name": "Adam",
                        "args": {
                                "lr": 1e-4,
                                "weight_decay": 1e-5,
                        }
                },
                "lr_scheduler": {
                        "name": "StepLR",
                        "args": {
                                "step_size": 1000,
                        }
                }
        },
        "named_metrics": ["top1_accuracy", "top5_accuracy"],
        "fsdp_layer_cls": {ViTLayer} # Hint for handling large model with FSDP
}

As illustrated above, the config has four (sub)sections, of which train is the only required one. Let us dive deeper into each section of the config.

Train Knobs

These are core knobs needed when executing the functions in MLSpec. For readability, these must be listed under a named train section of the config. At a minimum, you must specify epochs, batch_size, and optimizer. Note that optimizer and lr_scheduler are nested dictionaries themselves.

The optimizer knob indicates which PyTorch-native optimizer to use for gradient-based training. We support any from the torch.optim module: Adam, AdamW, Adagrad, Adadelta, RMSprop, (plain) SGD, etc.

Indicate the optimizer to use under the name key. Any arguments relevant to your chosen optimizer must be provided in the args nested dictionary, using the same key strings that the PyTorch class expects, e.g., lr for learning rate.
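For instance, to switch the example config above to plain SGD (whose torch.optim.SGD constructor, unlike Adam's, does accept a momentum argument), a minimal sketch would be:

# Illustrative swap to plain SGD; lr, momentum, and weight_decay are all
# arguments that torch.optim.SGD accepts
config["train"]["optimizer"] = {
        "name": "SGD",
        "args": {
                "lr": 1e-2,
                "momentum": 0.9,
                "weight_decay": 1e-5,
        }
}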

Likewise, the optional lr_scheduler knob indicates which PyTorch-native learning rate scheduler to use during optimization. We support any from the torch.optim.lr_scheduler module: StepLR, MultiStepLR, ExponentialLR, LinearLR, CosineAnnealingLR, etc.

Indicate the learning rate scheduler to use under the name key. Any arguments relevant to it must be provided in the args nested dictionary, using the same key strings that the PyTorch class expects, e.g., step_size for StepLR.
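Likewise, a sketch of switching the example config above to MultiStepLR, with the milestones and gamma arguments that torch.optim.lr_scheduler.MultiStepLR expects:

# Illustrative swap to MultiStepLR, which decays the learning rate by
# gamma at each listed milestone
config["train"]["lr_scheduler"] = {
        "name": "MultiStepLR",
        "args": {
                "milestones": [500, 1000],
                "gamma": 0.1,
        }
}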

Named Metrics

Also optional but useful is the named_metrics section, a top-level sibling of the train section in the config. Named-metrics in RapidFire AI are well-known ML metrics that our system can automatically calculate across epochs based on just the outputs of your compute_forward() function in your MLSpec.

Named-metrics are a syntactic convenience inspired by torchmetrics so that you do not need to write these as custom metrics.

As of this writing, we support only the following two named-metrics:

  • top1_accuracy: A common metric used for binary or multi-class classification tasks.

  • top[k]_accuracy: A common metric used for multi-class classification tasks. Here, [k] is a placeholder for any integer above 1, e.g., top5_accuracy or top3_accuracy.
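For instance, to track top-1 and top-3 accuracy together in the example config above:

# top3_accuracy instantiates the top[k]_accuracy form with k=3
config["named_metrics"] = ["top1_accuracy", "top3_accuracy"]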

We plan to continue expanding this API and add more functionality based on feedback.

User Knobs

These are optional knobs that you define for use in any function of MLSpec that has a cfg argument injected by the system. For readability these must be listed under a named user_knobs section of the config.

User knobs help you compare, e.g., different base model architectures, different layer specifics to add to a model’s head for transfer learning, different loss function choices, different LoRA adapter specifics for LLM fine-tuning, and different data input/output tensorization specifics (e.g., image transforms or resizing, time series windowing, or text tokenization).

In the above example, model_type is a user knob that gets used in the create_model() and compute_forward() functions. Please refer to the full ImageNet tutorial notebook for more details.

As another example, you can define image_size and transform_type as user knobs for your row_prep() to decide how much to resize your image and what augmentation transforms to apply.
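As a minimal sketch of that idea (the row_prep() signature and the exact way cfg exposes user knobs are assumptions here for illustration, not part of this reference):

# Illustrative user knobs (key names from the text above)
config["user_knobs"] = {
        "image_size": 224,
        "transform_type": "flip",
}

# Hypothetical row_prep() sketch using torchvision transforms; assumes
# cfg exposes the config dictionary and each row carries a PIL image
import torchvision.transforms as T

def row_prep(row, cfg):
        knobs = cfg["user_knobs"]
        size = knobs["image_size"]
        tfms = [T.Resize((size, size))]
        if knobs["transform_type"] == "flip":
                tfms.append(T.RandomHorizontalFlip())
        return T.Compose(tfms)(row["image"])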

In general, user knobs are a powerful way to amplify how many configurations you can launch together in one run_fit(), as well as what you can modify from the IC Ops panel. This offers you maximum flexibility for hyperparallel exploration of variables that affect your AI accuracy and other metrics.

You have full flexibility to define whatever knob key-value pairs you want, as long as their key strings do not conflict with the reserved strings above. User knob values can be of any Python data type, including int, float, str, List, Dict, etc. Just use them appropriately in your MLSpec functions.

Other Knobs

As of this writing, we provide two other optional knobs: one that enables demo mode for the provided tutorial use case notebooks, and a hint knob for large models that do not fit on a single GPU at the given batch size.

For demo mode, we have a dev_knob section with demo_mode as its single key. Its value can be "imagenet" or "imdb", representing pre-defined choices in how models and data are partitioned across GPUs. Enabling demo mode lets your run_fit() start a couple of minutes sooner, with the exact reduction depending on the number of runs given.
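Concretely, per the description above, the section looks like this:

# Demo mode knob; applies only to the provided tutorial notebooks
config["dev_knob"] = {
        "demo_mode": "imagenet", # or "imdb"
}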

If you change the model architecture in those notebooks or if you are running your own use case with your own models, we recommend not using the demo mode knobs.

The FSDP hint knob is needed for large models that are executed today with a lower-level library such as FSDP. As of this writing, we do not automatically identify the most efficient layer wrapping for FSDP. So, we suggest you provide a hint via the fsdp_layer_cls knob to indicate which layer class(es) need to be wrapped with FSDP wrap policies for cross-GPU model sharding.

In the above example, since we use a large ViT model from HF transformers, we indicate ViTLayer as the hint value. Note that to be able to use that class name, you also need to import it from the relevant library, which is transformers.models.vit.modeling_vit here.
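As an illustrative sketch (not from the tutorials), if your large model were instead a BERT-style model from HF transformers, the analogous hint would name its transformer block class:

# Hypothetical analogue for a large BERT-style model from HF transformers
from transformers.models.bert.modeling_bert import BertLayer

config["fsdp_layer_cls"] = {BertLayer}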

We plan to continue refining this API and add more automation on this front for large models based on feedback.

Value Set Generators

It is common practice in ML to launch and compare multiple models/runs in one go for hyperparameter tuning by using grid search, random search, AutoML heuristics, etc. to generate combinations of knob values.

RapidFire AI generalizes that notion manyfold, enabling you to launch and compare combinations of any knob values in your config: not just hyperparameters but also model architecture knobs, optimizer knobs, or any other user-provided knobs.

To enable such group-level generation of configs, we provide two set generators for knob values, along with whole-config group generators that take a config containing set generators in some of its values.

We currently support two common set generators: List() for a discrete set of values and Range() for sampling from a continuous value interval.

List(values: List[Any])
Parameters:

values (List[Any]) – List of discrete values for a knob; all values must be of the same Python data type.

Range(start: int | float, end: int | float, dtype: str [= "int" or "float"])
Parameters:
  • start (int | float) – Lower bound of range interval

  • end (int | float) – Upper bound of range interval

  • dtype (str) – Data type of the values to be sampled; must be "int" or "float"

Notes:

As of this writing, Range() performs uniform sampling within the given interval. We plan to continue refining this API and add more functionality on this front based on feedback.

Note that the return types of the set generators are internal to RapidFire AI; they are usable only within the context of our config-group generators.
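For instance, using the import paths shown in the examples below:

from rapidfire.automl import List, Range

# A discrete set of three batch size choices
batch_size_set = List([64, 128, 256])

# Uniform sampling over the continuous interval [1e-5, 1e-3]
lr_set = Range(1e-5, 1e-3, dtype='float')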

Config Group Generators

We currently support two common config group generators: GridSearch() for grid search and RandomSearch() for random search.

GridSearch(cfg: Dict[str, Any])
Parameters:

cfg (Dict[str, Any]) – A config dictionary with List() for at least one knob value

RandomSearch(cfg: Dict[str, Any], num_runs: int)
Parameters:
  • cfg (Dict[str, Any]) – A config dictionary with List() or Range() for at least one knob value

  • num_runs (int) – Number of runs/combinations of knob values to sample in total

Notes:

For GridSearch(), each knob can have either a single value or a List() of values, but no knob should have a Range() of values; otherwise, it will error out.

For RandomSearch(), each knob can have a single value, a List() of values, or a Range() of values. The sampling semantics are independent and identically distributed (IID): we uniformly randomly pick a value from each discrete set and each continuous interval to construct the knob combination for one run, and then repeat that sampling process in an IID way to accumulate num_runs distinct combinations. The random search example below illustrates this with one List() knob and one Range() knob.

Note that the return types of the config group generators are internal to RapidFire AI; they are usable only within the context of run_fit() in the Experiment class.

Examples:

# Based on ImageNet tutorial notebook
from rapidfire.automl import GridSearch, List

# Grid search over 2 model types x 2 hyperparameters with 2 values each = 8 configs in group
config_group = GridSearch({
        'user_knobs': {
                'model_type': List(["resnet", "vgg"]),
        },
        'train': {
                'epochs': 3,
                'batch_size': List([128, 256]),
                'optimizer': {
                        'name': "Adam",
                        'args': {
                                'lr': List([1e-4, 1e-6]),
                                'weight_decay': 1e-6
                        },
                },
        },
        'named_metrics': ['top1_accuracy', 'top5_accuracy']
})
# Based on IMDB tutorial notebook
from rapidfire.automl.datatypes import List, Range
from rapidfire.automl import RandomSearch

# Random search over 3 knob value sets to produce 7 configs in group
config_group = RandomSearch({
        'train': {
                'epochs': 5,
                'batch_size': List([16, 32]),
                'optimizer': {
                        'name': "Adam",
                        'args': {
                                'lr': Range(1e-4, 9e-4, dtype='float'),
                                'weight_decay': List([1e-4, 1e-5]),
                        },
                },
        },
        'named_metrics': ['top1_accuracy']
},
num_runs=7
)

AutoML Heuristics

While grid and random searches are simple, intuitive, and popular in practice, they can also be wasteful: they produce all combinations upfront without factoring in any performance data across epochs. This is where so-called Automated ML, or AutoML, heuristics help.

There is much literature on AutoML heuristics, but at their core they use the metrics of different runs over time to stop low-accuracy runs early, promote high-accuracy runs, and/or generate new runs on the fly. However, a common criticism of such heuristics in practice is that their decisions are too opaque, they may waste resources unintentionally, and/or their own meta-hyperparameters are too unintuitive to set or tune.

RapidFire AI’s powerful Interactive Control panel largely frees you from being locked into a particular AutoML heuristic’s decisions. But to give users full flexibility, we also support some AutoML heuristics on top of our elastic engine for ease of use.

As of this writing, we support one well-known AutoML heuristic: SuccessiveHalving (SHA), explained in this paper.

SuccessiveHalving(cfg: Dict[str, Any], num_runs: int, min_epochs: int, max_epochs: int, reduction_factor: int, metric: str, direction: str [= "min" or "max"], min_early_stopping_rate: int=0)
Parameters:
  • cfg (Dict[str, Any]) – A config dictionary with Range() for at least one knob value; no knob should have List()

  • num_runs (int) – Number of runs/combinations of knob values to sample in total that start together

  • min_epochs (int) – Number of epochs after which the early stopping check is applied

  • max_epochs (int) – Largest number of epochs for any run; overrides the epochs knob in the config (if any)

  • reduction_factor (int) – Factor by which the number of surviving runs shrinks at each early stopping check; only the top 1/reduction_factor fraction of runs continues

  • metric (str) – Any validation metric defined for MLSpec; this can be the loss, any of the named-metrics given in config, or a user-defined custom metric’s key string

  • direction (str [= "min" or "max"]) – Direction of optimization; must be one of two fixed strings (“min” or “max”)

  • min_early_stopping_rate (int, optional) – Please see the SHA paper for details (default: 0)

We plan to continue expanding this API and add support for more AutoML heuristics based on feedback.

Example:

# Based on ImageNet tutorial notebook
from rapidfire.automl import SuccessiveHalving, Range

# SHA with 8 configs in group
config_group = SuccessiveHalving({
        'train': {
                'epochs': 10,
                'batch_size': 128,
                'optimizer': {
                        'name': 'Adam',
                        'args': {
                                'lr': Range(1e-5, 9e-5, dtype='float'),
                                'weight_decay': Range(1e-4, 1e-3, dtype='float'),
                        },
                },
        },
        'named_metrics': ['top1_accuracy', 'top5_accuracy']
},
num_runs=8,
min_epochs=1,
max_epochs=8,
reduction_factor=2,
metric='top5_accuracy',
direction='max'
)

Advanced: Lists of Configs or Config Group Generators

The ml_config argument of run_fit() is very versatile in letting you construct various knob combinations and launch them simultaneously. It can be a single config dictionary, a regular Python List of config dictionaries, a config-group generator output (via GridSearch(), RandomSearch(), or an AutoML heuristic), or even a List mixing config dictionaries and config-group generator outputs as its elements.

So, you are not limited to launching just a single grid/random search at a time as shown in the tutorial use case notebooks. You have full flexibility to specify what sets of runs to launch together, as shown in the more advanced usage example below.

# Config-group list with a single config, a grid search, and a random search together
# (assumes the imports shown in the earlier examples)
# This list yields a total of 1 + 2 * 2 + 6 = 11 runs

config_group = [
        {
                'user_knobs': {
                        'model_type': "vit",
                },
                'train': {
                        'epochs': 1,
                        'batch_size': 256,
                        'optimizer': {
                                'name': "Adam",
                                'args': {
                                        'lr': 1e-4,
                                        'weight_decay': 1e-6,
                                },
                        },
                },
                "named_metrics": ["top1_accuracy", "top5_accuracy"],
                "fsdp_layer_cls": {ViTLayer}, #Hint for large model handling with FSDP
        },
        GridSearch({
                'user_knobs': {
                        'model_type': "vit",
                },
                'train': {
                        'epochs': 1,
                        'batch_size': 128,
                        'optimizer': {
                                'name': "Adam",
                                'args': {
                                        'lr': List([1e-4, 1e-5]),
                                        'weight_decay': List([1e-6, 1e-5]),
                                },
                        },
                },
                "named_metrics": ["top1_accuracy", "top5_accuracy"],
                "fsdp_layer_cls": {ViTLayer}, #Hint for large model handling with FSDP
        }),
        RandomSearch({
                'user_knobs': {
                        'model_type': "resnet",
                },
                'train': {
                        'epochs': 1,
                        'batch_size': List([64, 128, 256]),
                        'optimizer': {
                                'name': "Adam",
                                'args': {
                                        'lr': Range(1e-4, 9e-4, dtype='float'),
                                        'weight_decay': Range(1e-5, 1e-4, dtype='float'),
                                },
                        },
                },
                "named_metrics": ["top1_accuracy", "top5_accuracy"],
        },
        num_runs=6
        )
]

# Launch training and validation for given hybrid mix of configs
>>> myexp.run_fit(ml_config=config_group, seed=42)
Creating RapidFire Workers ...