Tutorial Use Case: IMDb
Dataset
The IMDb dataset is a popular NLP dataset for sentiment classification. The input is a user review text string from the IMDb website. The target is a binary class label (0 or 1).
The raw dataset files are sourced from this HF datasets page. There are just 3 raw files in total: imdb-train.csv, imdb-test.csv, and imdb-pred.csv.
This is an in-situ dataset, i.e., there are no separate object files. All features and targets are present in the Example Structure File (ESF) itself. Please read API: Data Ingestion and Locators for more ESF-related details.
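For a quick sanity check, the sketch below peeks at the training ESF with pandas. The column names ("text" and "label") follow the row_prep() usage later in this tutorial, while the local file path is an assumption; adjust it to wherever the raw files live on your cluster.

# A minimal sketch for inspecting the training ESF; the path is an assumption.
import pandas as pd

df = pd.read_csv("imdb-train.csv")
print(df.columns.tolist())        # expected to include 'text' and 'label'
print(df.iloc[0]["text"][:80])    # first 80 characters of one review
print(df.iloc[0]["label"])        # binary target: 0 or 1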
Model(s)
This tutorial notebook illustrates simple hyperparameter tuning with a single Transformer architecture based on distilbert. The exact model checkpoint and tokenizer used are sourced from this HF models page. This model was already fine-tuned on this dataset, so you should see good accuracy right from the start. If you'd like to retrain this model from scratch, you can reinitialize the weights in your create_model() function in MLSpec.
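As a rough illustration, the sketch below builds the DistilBERT classifier from its config alone, so the weights start out randomly initialized instead of being loaded from the fine-tuned checkpoint. The base checkpoint name and the exact create_model() signature expected by MLSpec are assumptions here; defer to the notebook for the real definition.

# A hedged sketch of reinitializing the weights from scratch;
# the checkpoint name and function signature are assumptions.
from transformers import AutoConfig, AutoModelForSequenceClassification

def create_model():
    # Building from the config alone gives randomly initialized weights.
    config = AutoConfig.from_pretrained("distilbert-base-uncased", num_labels=2)
    model = AutoModelForSequenceClassification.from_config(config)
    return model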
Config Knobs
As mentioned above, we perform simple hyperparameter tuning with GridSearch(). We pick two values each for batch_size and lr (learning rate). We also indicate the named metric top1_accuracy to be plotted out of the box, which is relevant for binary classification.
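To make the search space concrete, the snippet below enumerates an equivalent 2x2 grid in plain Python. The particular batch sizes and learning rates are placeholders rather than the notebook's actual values, and the real GridSearch() call lives in the notebook itself.

# A plain-Python illustration of what a 2x2 grid over batch_size and lr expands to;
# the values are placeholders, not the notebook's actual knob settings.
from itertools import product

batch_sizes = [16, 32]     # two candidate batch sizes (placeholders)
lrs = [2e-5, 5e-5]         # two candidate learning rates (placeholders)

for batch_size, lr in product(batch_sizes, lrs):
    print(f"run config: batch_size={batch_size}, lr={lr}")
# 4 configs in total; one training run per config, each plotting top1_accuracy.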
Step by Step Code in our API
Please check out the notebook named rf-tutorial-imdb.ipynb in the Jupyter home directory on your RapidFire AI cluster.
Extension: Custom Collate
PyTorch's default collate function stacks examples into a minibatch tensor that is sent to the GPU. To do so, it requires all examples to be truncated and/or padded to the same dimensions as tensors.
But it is common in NLP to have variable-length strings that are far apart in terms of number of tokens. Padding them all to the largest length can lead to substantially wasteful computation on the GPU.
So, it is common practice with sequence models such as Transformers to use custom collate functions that truncate and/or pad examples only up to the longest example in their own minibatch.
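As a back-of-the-envelope illustration of that waste, the snippet below compares padding a toy two-example batch to the model max length versus only to the batch max; the token counts are made up for illustration.

# A toy illustration of wasted padding: two reviews of very different token
# lengths padded to the model max vs. only to the longest in the minibatch.
lengths = [24, 180]      # hypothetical token counts for two reviews
model_max = 512          # DistilBERT's usual max sequence length

pad_to_model_max = sum(model_max - n for n in lengths)      # 820 padded tokens
pad_to_batch_max = sum(max(lengths) - n for n in lengths)   # 156 padded tokens
print(pad_to_model_max, pad_to_batch_max)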
In RapidFire AI, we support such custom collate with similar semantics under collate_fn() in MLSpec.
Please check out this extended version of the IMDb tutorial notebook. Upload it to your Jupyter home directory and run it just like the other tutorial notebooks.
The only difference for the custom collate version is the inclusion of collate_fn() and a change to row_prep() in MLSpec, as shown below.
# From the IMDb tutorial notebook extension for custom collate
def row_prep(self, row, is_predict: bool) -> Dict[str, torch.Tensor]:
    # Carry forward the raw strings so collate_fn() can tokenize per minibatch
    out = {"text": row["text"]}
    if not is_predict:
        out["labels"] = row["label"]
    return out

def collate_fn(self, batch, is_predict):
    import torch
    import transformers
    # Tokenize without padding (but still truncate to the model max length)
    # to find the longest example in this minibatch
    texts = [row["text"] for row in batch]
    texttok1 = self.tokenizer(texts, return_tensors=None, truncation=True, padding=False)
    batch_max = max(len(ex) for ex in texttok1["input_ids"])
    # Retokenize with padding up to batch_max and return tensors
    texttok2 = self.tokenizer(texts, padding="max_length", truncation=True, max_length=batch_max, return_tensors="pt")
    out = {"input_ids": texttok2["input_ids"],
           "attention_mask": texttok2["attention_mask"]}
    if not is_predict:
        labs = [row["labels"] for row in batch]
        out["labels"] = torch.tensor(labs)
    return out
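Note that collate_fn() tokenizes the batch twice: the first unpadded pass only measures the longest tokenized example in the minibatch, and the second pass pads every example to exactly that length (while still truncating anything over the model max) before returning tensors. This keeps each minibatch tensor as small as the data allows.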