Tutorial Use Case: IMDb
Dataset
The IMDb dataset is a popular NLP dataset for sentiment classification. The input is a user review text string from the IMDb website. The target is a binary class label (0 or 1).
The raw dataset files are sourced from this HF datasets page. There are just 3 raw files in total: imdb-train.csv, imdb-test.csv, and imdb-pred.csv.
This is an in-situ dataset, i.e., there are no separate object files. All features and targets are present in the Example Structure File (ESF) itself. Please read API: Data Ingestion and Locators for more ESF-related details.
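For a quick sanity check, the sketch below peeks at the training ESF with pandas. The column names ("text" and "label") follow the row_prep() usage later in this tutorial, while the local file path is an assumption; adjust it to wherever the raw files live on your cluster.

# A minimal sketch for inspecting the training ESF; the path is an assumption.
import pandas as pd

df = pd.read_csv("imdb-train.csv")
print(df.columns.tolist())        # expected to include 'text' and 'label'
print(df.iloc[0]["text"][:80])    # first 80 characters of one review
print(df.iloc[0]["label"])        # binary target: 0 or 1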
Model(s)
This tutorial notebook illustrates simple hyperparameter tuning with a single Transformer architecture based on distilbert. The exact model checkpoint and tokenizer used are sourced from this HF models page. This model was already fine-tuned on this dataset, so you should see good accuracy right from the start. If you'd like to retrain this model from scratch, you can reinitialize the weights in your create_model() function in MLSpec.
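As a rough illustration, the sketch below builds the DistilBERT classifier from its config alone, so the weights start out randomly initialized instead of being loaded from the fine-tuned checkpoint. The base checkpoint name and the exact create_model() signature expected by MLSpec are assumptions here; defer to the notebook for the real definition.

# A hedged sketch of reinitializing the weights from scratch;
# the checkpoint name and function signature are assumptions.
from transformers import AutoConfig, AutoModelForSequenceClassification

def create_model():
    # Building from the config alone gives randomly initialized weights.
    config = AutoConfig.from_pretrained("distilbert-base-uncased", num_labels=2)
    model = AutoModelForSequenceClassification.from_config(config)
    return model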
Config Knobs
As mentioned above, we perform simple hyperparameter tuning with GridSearch(). We pick two values each for batch_size and lr (learning rate). We also indicate the named metric top1_accuracy to be plotted out of the box, which is relevant for binary classification.
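To make the search space concrete, the snippet below enumerates an equivalent 2x2 grid in plain Python. The particular batch sizes and learning rates are placeholders rather than the notebook's actual values, and the real GridSearch() call lives in the notebook itself.

# A plain-Python illustration of what a 2x2 grid over batch_size and lr expands to;
# the values are placeholders, not the notebook's actual knob settings.
from itertools import product

batch_sizes = [16, 32]     # two candidate batch sizes (placeholders)
lrs = [2e-5, 5e-5]         # two candidate learning rates (placeholders)

for batch_size, lr in product(batch_sizes, lrs):
    print(f"run config: batch_size={batch_size}, lr={lr}")
# 4 configs in total; one training run per config, each plotting top1_accuracy.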
Step by Step Code in our API
Please check out the notebook named rf-tutorial-imdb.ipynb in the Jupyter home directory on your RapidFire AI cluster.
Extension: Custom Collate
PyTorch's default collate function stacks examples into a minibatch tensor that is sent to the GPU. To do so, it requires all examples to be truncated and/or padded to the same dimensions as tensors.
But it is common in NLP to have variable-length strings that are far apart in terms of number of tokens. Padding them all to the largest length can lead to substantially wasteful computation on the GPU.
So, it is common practice with sequence models such as Transformers to use custom collate functions that truncate and/or pad examples only up to the longest example in their own minibatch.
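As a back-of-the-envelope illustration of that waste, the snippet below compares padding a toy two-example batch to the model max length versus only to the batch max; the token counts are made up for illustration.

# A toy illustration of wasted padding: two reviews of very different token
# lengths padded to the model max vs. only to the longest in the minibatch.
lengths = [24, 180]      # hypothetical token counts for two reviews
model_max = 512          # DistilBERT's usual max sequence length

pad_to_model_max = sum(model_max - n for n in lengths)      # 820 padded tokens
pad_to_batch_max = sum(max(lengths) - n for n in lengths)   # 156 padded tokens
print(pad_to_model_max, pad_to_batch_max)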
In RapidFire AI, we support such custom collate with similar semantics under collate_fn() in MLSpec.
Please check out this extended version of the IMDb tutorial notebook. Upload it to your Jupyter home directory and run it just like the other tutorial notebooks.
The only difference for the custom collate version is the inclusion of collate_fn() and a change to row_prep() in MLSpec, as shown below.
# From the IMDb tutorial notebook extension for custom collate
def row_prep(self, row, is_predict: bool) -> Dict[str, torch.Tensor]:
    # Carry forward the raw strings so collate_fn() can tokenize per minibatch
    out = {"text": row["text"]}
    if not is_predict:
        out["labels"] = row["label"]
    return out

def collate_fn(self, batch, is_predict):
    import torch
    import transformers
    # Tokenize without padding (but still truncate to the model max length)
    # to find the longest example in this minibatch
    texts = [row["text"] for row in batch]
    texttok1 = self.tokenizer(texts, return_tensors=None, truncation=True, padding=False)
    batch_max = max(len(ex) for ex in texttok1["input_ids"])
    # Retokenize with padding up to batch_max and return tensors
    texttok2 = self.tokenizer(texts, padding="max_length", truncation=True, max_length=batch_max, return_tensors="pt")
    out = {"input_ids": texttok2["input_ids"],
           "attention_mask": texttok2["attention_mask"]}
    if not is_predict:
        labs = [row["labels"] for row in batch]
        out["labels"] = torch.tensor(labs)
    return out
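Note that collate_fn() tokenizes the batch twice: the first unpadded pass only measures the longest tokenized example in the minibatch, and the second pass pads every example to exactly that length (while still truncating anything over the model max) before returning tensors. This keeps each minibatch tensor as small as the data allows.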