API: Config Knobs and AutoML
============================

The concept of a config with knobs is fundamental to the RapidFire AI API. It controls several aspects of a given run/model during its training and/or inference, unifying and generalizing the notions of hyperparameter tuning, optimizer selection, architecture selection, fine-tuning, layerwise transfer learning, data representation changes, and more.

In its most basic form, a config is just a :code:`Dict[str, Any]` dictionary in which a key is a string identifying a knob and its value is a data type associated with that knob. We dive into the structure and semantics of a config and its knobs with the following example adapted from the ImageNet tutorial notebook.

.. code-block:: python

    from transformers.models.vit.modeling_vit import ViTLayer

    config = {
        "user_knobs": {
            "model_type": "vit",
        },
        "train": {
            "epochs": 10,
            "batch_size": 64,
            "optimizer": {
                "name": "SGD",  # momentum below is an SGD argument (torch.optim.Adam does not accept it)
                "args": {
                    "lr": 1e-4,
                    "momentum": 0.9,
                    "weight_decay": 1e-5,
                },
            },
            "lr_scheduler": {
                "name": "StepLR",
                "args": {
                    "step_size": 1000,
                },
            },
        },
        "named_metrics": ["top1_accuracy", "top5_accuracy"],
        "fsdp_layer_cls": {ViTLayer},  # Hint for handling large model with FSDP
    }

As illustrated above, the config has 4 (sub)sections, of which :code:`train` is the only required one. Let us dive deeper into each section of the config.

Train Knobs
-----------

These are core knobs needed when executing the functions in :code:`MLSpec`. For readability, these must be listed under a named :code:`train` section of the config. At a minimum, we require that :code:`epochs`, :code:`batch_size`, and :code:`optimizer` be specified by you in the config. Note that :code:`optimizer` and :code:`lr_scheduler` are themselves nested dictionaries.

The :code:`optimizer` knob indicates which PyTorch-native optimizer to use for gradient-based training. We support any from the :code:`torch.optim` module: Adam, AdamW, Adagrad, Adadelta, RMSprop, (plain) SGD, etc. Indicate the optimizer to use under the :code:`name` key. Any arguments relevant for your chosen optimizer must be provided in the :code:`args` nested dictionary with the same key string as what that PyTorch class expects, e.g., :code:`lr` for learning rate.

Likewise, the optional :code:`lr_scheduler` knob indicates which PyTorch-native learning rate scheduler to use during optimization. We support any from the :code:`torch.optim.lr_scheduler` module: StepLR, MultiStepLR, ExponentialLR, LinearLR, CosineAnnealingLR, etc. Indicate the learning rate scheduler to use under the :code:`name` key. Any arguments relevant for it must be provided in the :code:`args` nested dictionary with the same key string as what that PyTorch class expects, e.g., :code:`step_size` for StepLR.

Named Metrics
-------------

Also optional but useful is the top-level :code:`named_metrics` knob, shown in the config example above. **Named-metrics** in RapidFire AI are well-known ML metrics that our system can automatically calculate across epochs based on just the outputs of the :func:`compute_forward()` function in your :code:`MLSpec`. Named-metrics are a syntactic convenience inspired by :code:`torchmetrics` so that you do not need to write these as custom metrics. As of this writing, we support only the following 2 named-metrics:

* **top1_accuracy**: A common metric used for binary or multi-class classification tasks.
* **top[k]_accuracy**: A common metric used for multi-class classification tasks. Here, [k] is a placeholder for any integer above 1, e.g., :code:`top5_accuracy` or :code:`top3_accuracy`.
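To make the semantics of these named-metrics concrete, here is a minimal plain-PyTorch sketch of how top-k accuracy is conventionally computed from logits and integer labels. This is for illustration only and is not RapidFire AI's internal implementation:

.. code-block:: python

    import torch

    def topk_accuracy(logits: torch.Tensor, labels: torch.Tensor, k: int = 1) -> float:
        """Fraction of rows whose true label is among the k highest-scoring classes."""
        # logits: (batch, num_classes); labels: (batch,)
        _, topk_preds = logits.topk(k, dim=1)                     # (batch, k) class indices
        correct = (topk_preds == labels.unsqueeze(1)).any(dim=1)  # true label in top-k?
        return correct.float().mean().item()

    # Tiny usage example
    logits = torch.tensor([[0.1, 0.7, 0.2], [0.5, 0.3, 0.2]])
    labels = torch.tensor([1, 1])
    print(topk_accuracy(logits, labels, k=1))  # 0.5 -> analogous to top1_accuracy
    print(topk_accuracy(logits, labels, k=2))  # 1.0 -> analogous to top2_accuracy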
We plan to continue expanding this API and add more functionality based on feedback.

User Knobs
----------

These are optional knobs that you define for use in any function of :code:`MLSpec` that has a :code:`cfg` argument injected by the system. For readability, these must be listed under a named :code:`user_knobs` section of the config. User knobs help you compare, e.g., different base model architectures, different layer specifics to add to a model's head for transfer learning, different loss function choices, different LoRA adapter specifics for LLM fine-tuning, and different data input/output tensorization specifics (e.g., image transforms or resizing, time series windowing, or text tokenization).

In the above example, :code:`model_type` is a user knob that gets used in the :func:`create_model()` and :func:`compute_forward()` functions. Please refer to the full ImageNet tutorial notebook for more details. As another example, you can define :code:`image_size` and :code:`transform_type` as user knobs for your :func:`row_prep()` to decide how much to resize your image and which augmentation transforms to apply.

In general, user knobs are a powerful way to amplify how many configurations you can launch together in one :func:`run_fit()`, as well as what you can modify from the IC Ops panel. This offers you maximum flexibility for hyperparallel exploration of the variables that affect your AI accuracy and other metrics. You have full flexibility to define whatever knob key-value pairs you want, as long as their key strings do not conflict with the reserved strings above. User knob values can be of any Python data type, including :code:`int`, :code:`float`, :code:`str`, :code:`List`, :code:`Dict`, etc. Just use them appropriately in your :code:`MLSpec` functions.

Other Knobs
-----------

As of this writing, we provide two other optional knobs: one related to demo mode for the provided tutorial use case notebooks, and a hint knob for large models that do not fit on a single GPU at the given batch size. A combined sketch appears after this section.

For the demo mode, we have a :code:`dev_knob` section with :code:`demo_mode` as its single key. Its value can be "imagenet" or "imdb", representing pre-defined choices of how models and data are partitioned across GPUs. Enabling demo mode means your :func:`run_fit()` will start a couple of minutes sooner, with the reduction depending on the number of runs given. If you change the model architecture in those notebooks, or if you are running your own use case with your own models, we recommend not using the demo mode knob.

The FSDP hint knob is needed for large models that are executed today with a lower-level library such as FSDP. As of this writing, we do not automatically identify the most efficient layer wrapping for FSDP. So, we suggest you provide a hint via the :code:`fsdp_layer_cls` knob to indicate which layer(s) might need to be wrapped with FSDP wrap policies for cross-GPU model sharding. In the above example, since we use a large ViT model from HF :code:`transformers`, we indicate :code:`ViTLayer` as the hint value. Note that to be able to use that class name, you also need to import it from the relevant library, which is :code:`transformers.models.vit.modeling_vit` here. We plan to continue refining this API and add more automation on this front for large models based on feedback.
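Putting the two optional knobs together, a config enabling demo mode and providing the FSDP hint might look like the following sketch. This is our reading of the descriptions above, assuming :code:`dev_knob` sits at the top level of the config like the other sections:

.. code-block:: python

    from transformers.models.vit.modeling_vit import ViTLayer

    # Illustrative sketch only, based on the knob descriptions above
    config = {
        "train": {
            "epochs": 3,
            "batch_size": 64,
            "optimizer": {"name": "Adam", "args": {"lr": 1e-4}},
        },
        "dev_knob": {"demo_mode": "imagenet"},  # or "imdb"; omit for your own models/data
        "fsdp_layer_cls": {ViTLayer},           # hint for cross-GPU sharding of a large model
    }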
Value Set Generators
--------------------

It is common practice in ML to launch and compare multiple models/runs in one go for hyperparameter tuning, using grid search, random search, AutoML heuristics, etc. to generate combinations of knob values. RapidFire AI generalizes that notion manyfold to enable you to launch and compare combinations of any knob values in your config: not just hyperparameters but also model architecture knobs, optimizer knobs, or any other user-provided knobs.

To enable such group-level generation of configs, we provide two **set generators** for knob values and a set of whole-**config group generators** that take a config with set generators in some of its values. We currently support two common set generators: :func:`LIST()` for a discrete set of values and :func:`RANGE()` for sampling from a continuous value interval.

.. py:function:: LIST(values: List[Any])

   :param values: List of discrete values for a knob; all values must be of the same Python data type.
   :type values: List[Any]

.. py:function:: RANGE(start: int | float, end: int | float, dtype: str [= "int" or "float"])

   :param start: Lower bound of range interval
   :type start: int | float
   :param end: Upper bound of range interval
   :type end: int | float
   :param dtype: Data type of the value to be sampled
   :type dtype: str [= "int" or "float"]

**Notes:** As of this writing, :func:`RANGE()` performs uniform sampling within the given interval. We plan to continue refining this API and add more functionality on this front based on feedback. Note that the return types of the set generators are internal to RapidFire AI; they are usable only within the context of our config-group generators.

Config Group Generators
-----------------------

We currently support two common config group generators: :func:`GridSearch()` for grid search and :func:`RandomSearch()` for random search.

.. py:function:: GridSearch(cfg: Dict[str, Any])

   :param cfg: A config dictionary with :func:`LIST()` for at least one knob value
   :type cfg: Dict[str, Any]

.. py:function:: RandomSearch(cfg: Dict[str, Any], num_runs: int)

   :param cfg: A config dictionary with :func:`LIST()` or :func:`RANGE()` for at least one knob value
   :type cfg: Dict[str, Any]
   :param num_runs: Number of runs/combinations of knob values to sample in total
   :type num_runs: int

**Notes:** For :func:`GridSearch()`, each knob can have either a single value or a :func:`LIST()` of values, but no knob should have a :func:`RANGE()` of values; otherwise, it will error out. For :func:`RandomSearch()`, each knob can have a single value, a :func:`LIST()` of values, or a :func:`RANGE()` of values. The semantics of sampling are independent and identically distributed (IID): we pick a value uniformly at random from each discrete set and each continuous interval to construct the knob combination for one run, and then repeat that sampling process in an IID manner to accumulate :code:`num_runs` distinct combinations. Note that the return types of the config group generators are internal to RapidFire AI; they are usable only within the context of :func:`run_fit()` in the :code:`Experiment` class.

**Examples:**

.. code-block:: python

    # Based on ImageNet tutorial notebook
    from rapidfire.automl import GridSearch, List

    # Grid search over 2 model types x 2 hyperparameters with 2 values each = 8 configs in group
    config_group = GridSearch({
        'user_knobs': {
            'model_type': List(["resnet", "vgg"]),
        },
        'train': {
            'epochs': 3,
            'batch_size': List([128, 256]),
            'optimizer': {
                'name': "Adam",
                'args': {
                    'lr': List([1e-4, 1e-6]),
                    'weight_decay': 1e-6,
                },
            },
        },
        'named_metrics': ['top1_accuracy', 'top5_accuracy'],
    })
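To see why the grid above yields 8 runs, note that grid search takes the cross product of all :func:`LIST()` values while holding single-valued knobs fixed. The following plain-Python sketch (not the RapidFire API) enumerates the equivalent combinations:

.. code-block:: python

    from itertools import product

    model_types = ["resnet", "vgg"]
    batch_sizes = [128, 256]
    lrs = [1e-4, 1e-6]

    # 2 x 2 x 2 = 8 knob combinations, one per run in the group
    for i, (m, b, lr) in enumerate(product(model_types, batch_sizes, lrs), start=1):
        print(f"run {i}: model_type={m}, batch_size={b}, lr={lr}")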
.. code-block:: python

    # Based on IMDB tutorial notebook
    from rapidfire.automl.datatypes import List, Range
    from rapidfire.automl import RandomSearch

    # Random search over 3 knob value sets to produce 7 configs in group
    config_group = RandomSearch({
        'train': {
            'epochs': 5,
            'batch_size': List([16, 32]),
            'optimizer': {
                'name': "Adam",
                'args': {
                    'lr': Range(1e-4, 9e-4, dtype='float'),
                    'weight_decay': List([1e-4, 1e-5]),
                },
            },
        },
        'named_metrics': ['top1_accuracy'],
    }, num_runs=7)

AutoML Heuristics
-----------------

While grid and random searches are simple, intuitive, and popular in practice, they can also be wasteful in that they produce all combinations upfront without factoring in any performance data across epochs. This is where so-called Automated ML, or AutoML, heuristics help. There is much literature on AutoML heuristics, but at their core they use the metrics of different runs over time to stop low-accuracy runs early, promote high-accuracy runs, and/or generate new runs on the fly. However, a common criticism of such heuristics in practice is that their decisions are too opaque, they may waste resources unintentionally, and/or their own meta-hyperparameters are too unintuitive to set or tune.

RapidFire AI's powerful Interactive Control panel largely obviates being stuck with a particular AutoML heuristic's decisions. But to give users full flexibility, we also support some AutoML heuristics on top of our elastic engine for ease of use. As of this writing, we support one well-known AutoML heuristic: :code:`SuccessiveHalving` (SHA), explained in `this paper `_.

.. py:function:: SuccessiveHalving(cfg: Dict[str, Any], num_runs: int, min_epochs: int, max_epochs: int, reduction_factor: int, metric: str, direction: str [= "min" or "max"], min_early_stopping_rate: int = 0)

   :param cfg: A config dictionary with :func:`RANGE()` for at least one knob value; no knob should have :func:`LIST()`
   :type cfg: Dict[str, Any]
   :param num_runs: Number of runs/combinations of knob values to sample in total; they all start together
   :type num_runs: int
   :param min_epochs: Number of epochs after which the early stopping check is applied
   :type min_epochs: int
   :param max_epochs: Largest number of epochs for any run; overrides the :code:`epochs` knob in the config (if any)
   :type max_epochs: int
   :param reduction_factor: Factor by which the number of surviving runs is reduced at each early stopping check (roughly 1/reduction_factor of the runs continue)
   :type reduction_factor: int
   :param metric: Any validation metric defined for :code:`MLSpec`; this can be the :code:`loss`, any of the named-metrics given in the config, or a user-defined custom metric's key string
   :type metric: str
   :param direction: Direction of optimization; must be one of two fixed strings ("min" or "max")
   :type direction: str [= "min" or "max"]
   :param min_early_stopping_rate: Please see the SHA paper for details (default: 0)
   :type min_early_stopping_rate: int, optional

We plan to continue expanding this API and add support for more AutoML heuristics based on feedback.
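To build intuition for :code:`reduction_factor`, here is a rough plain-Python sketch of the halving arithmetic implied by the parameters of the concrete example below. The engine determines the actual rung schedule; this sketch only assumes, as in the SHA paper, that checks are spaced by powers of the reduction factor:

.. code-block:: python

    # Illustrative arithmetic only, not the RapidFire engine's scheduling logic
    num_runs, reduction_factor = 8, 2
    min_epochs, max_epochs = 1, 8

    epoch, survivors = min_epochs, num_runs
    while epoch <= max_epochs:
        print(f"rung at epoch {epoch}: {survivors} run(s) still training")
        survivors = max(1, survivors // reduction_factor)  # keep top 1/reduction_factor
        epoch *= reduction_factor

    # rung at epoch 1: 8 run(s) still training
    # rung at epoch 2: 4 run(s) still training
    # rung at epoch 4: 2 run(s) still training
    # rung at epoch 8: 1 run(s) still training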
**Example:**

.. code-block:: python

    # Based on ImageNet tutorial notebook
    from rapidfire.automl import SuccessiveHalving, Range

    # SHA with 8 configs in group
    config_group = SuccessiveHalving({
        'train': {
            'epochs': 10,
            'batch_size': 128,
            'optimizer': {
                'name': 'Adam',
                'args': {
                    'lr': Range(1e-5, 9e-5, dtype='float'),
                    'weight_decay': Range(1e-4, 1e-3, dtype='float'),
                },
            },
        },
        'named_metrics': ['top1_accuracy', 'top5_accuracy'],
    },
        num_runs=8,
        min_epochs=1,
        max_epochs=8,
        reduction_factor=2,
        metric='top5_accuracy',
        direction='max',
    )

Advanced: Lists of Configs or Config Group Generators
-----------------------------------------------------

The :code:`ml_config` argument of :func:`run_fit()` is versatile: it lets you construct various knob combinations and launch them simultaneously. It can be a single config dictionary, a regular Python :code:`List` of config dictionaries, a config-group generator output (via :func:`GridSearch()`, :func:`RandomSearch()`, or an AutoML heuristic), or even a :code:`List` that mixes configs and config-group generator outputs as its elements. So, you are not limited to launching just a single grid/random search at a time as shown in the tutorial use case notebooks. You have full flexibility to specify which sets of runs to launch together, as shown in the more advanced usage example below.

.. code-block:: python

    from transformers.models.vit.modeling_vit import ViTLayer
    from rapidfire.automl import GridSearch, RandomSearch, List, Range

    # Config-group list with a single dictionary, a grid search, and a random search together
    # This list yields 1 + 2 * 2 + 6 = 11 runs in total
    config_group = [
        {
            'user_knobs': {
                'model_type': "vit",
            },
            'train': {
                'epochs': 1,
                'batch_size': 256,
                'optimizer': {
                    'name': "Adam",
                    'args': {
                        'lr': 1e-4,
                        'weight_decay': 1e-6,
                    },
                },
            },
            "named_metrics": ["top1_accuracy", "top5_accuracy"],
            "fsdp_layer_cls": {ViTLayer},  # Hint for large model handling with FSDP
        },
        GridSearch({
            'user_knobs': {
                'model_type': "vit",
            },
            'train': {
                'epochs': 1,
                'batch_size': 128,
                'optimizer': {
                    'name': "Adam",
                    'args': {
                        'lr': List([1e-4, 1e-5]),
                        'weight_decay': List([1e-6, 1e-5]),
                    },
                },
            },
            "named_metrics": ["top1_accuracy", "top5_accuracy"],
            "fsdp_layer_cls": {ViTLayer},  # Hint for large model handling with FSDP
        }),
        RandomSearch({
            'user_knobs': {
                'model_type': "resnet",
            },
            'train': {
                'epochs': 1,
                'batch_size': List([64, 128, 256]),
                'optimizer': {
                    'name': "Adam",
                    'args': {
                        'lr': Range(1e-4, 9e-4, dtype='float'),
                        'weight_decay': Range(1e-5, 1e-4, dtype='float'),  # bounds ordered as (lower, upper)
                    },
                },
            },
            "named_metrics": ["top1_accuracy", "top5_accuracy"],
        }, num_runs=6),
    ]

    # Launch training and validation for the given hybrid mix of configs
    >>> myexp.run_fit(ml_config=config_group, seed=42)
    Creating RapidFire Workers ...