Tutorial Use Case: IMDb
=======================

Dataset
-------

The IMDb dataset is a popular NLP dataset for sentiment classification. The input is a user review text string from the IMDb website. The target is a binary class label (0 or 1).

The raw dataset files are sourced from `this HF datasets page `_. There are just 3 raw files in total: :code:`imdb-train.csv`, :code:`imdb-test.csv`, and :code:`imdb-pred.csv`.

This is an *in-situ dataset*, i.e., there are no separate object files. All features and targets are present in the Example Structure File (ESF) itself. Please read :doc:`API: Data Ingestion and Locators ` for more ESF-related details.

Model(s)
--------

This tutorial notebook illustrates simple hyperparameter tuning with a single Transformer architecture based on :code:`distilbert`. The exact model checkpoint and tokenizer used are sourced from `this HF models page `_.

This model was already fine-tuned on this dataset, so you should see good accuracy right from the start. If you'd like to retrain this model from scratch, you can reinitialize the weights in your :func:`create_model()` function in :code:`MLSpec`.

Config Knobs
------------

As mentioned above, we perform simple hyperparameter tuning with :func:`GridSearch()`. We pick two values each for :code:`batch_size` and :code:`lr` (learning rate). We also indicate the named metric :code:`top1_accuracy` to be plotted out of the box, which is relevant for binary classification.

Step by Step Code in our API
----------------------------

Please check out the notebook named :code:`rf-tutorial-imdb.ipynb` in the Jupyter home directory on your RapidFire AI cluster.

Extension: Custom Collate
-------------------------

PyTorch's default collate function stacks examples into a minibatch tensor that is sent to the GPU. To do so, it requires all examples to be truncated and/or padded to the same dimensions as tensors. But it is common in NLP to have variable-length strings that are far apart in terms of number of tokens. Padding them all to the largest length can lead to substantial wasted computation on the GPU.

So, it is common practice with sequence models such as Transformers to use custom collate functions that truncate and/or pad examples only up to the length of the longest example in their own minibatch. In RapidFire AI, we support such custom collate with similar semantics under :func:`collate_fn()` in :code:`MLSpec`.

Please check out :download:`this extended version ` of the IMDb tutorial notebook. Upload it to your Jupyter home directory and run it just like the other tutorial notebooks. The only difference in the custom collate version is the inclusion of :func:`collate_fn()` and a change to :func:`row_prep()` in :code:`MLSpec`, as shown below.
.. code-block:: python

   # From the IMDb tutorial notebook extension for custom collate
   def row_prep(self, row, is_predict: bool) -> Dict[str, torch.Tensor]:
       # Carry forward the strings to collate
       out = {"text": row['text']}
       if not is_predict:
           out["labels"] = row['label']
       return out

   def collate_fn(self, batch, is_predict):
       import torch
       import transformers
       # Get unpadded dict for collate but still truncate all to model max length
       texts = [row["text"] for row in batch]
       texttok1 = self.tokenizer(texts, return_tensors=None, truncation=True, padding=False)
       batch_max = max(len(ex) for ex in texttok1["input_ids"])
       # Retokenize with padding to batch_max and get tensors
       texttok2 = self.tokenizer(texts, padding='max_length', truncation=True,
                                 max_length=batch_max, return_tensors="pt")
       out = {"input_ids": texttok2["input_ids"],
              "attention_mask": texttok2["attention_mask"]}
       if not is_predict:
           labs = [row["labels"] for row in batch]
           out["labels"] = torch.tensor(labs)
       return out
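
If you want to experiment with the dynamic-padding idea outside of RapidFire AI, the short sketch below shows the same per-batch padding behavior with a plain PyTorch :code:`DataLoader` and a Hugging Face tokenizer. It is only an illustrative sketch: the checkpoint name, the toy rows, and the :func:`dynamic_collate()` helper are assumptions for this example and are not part of the RapidFire AI API. Note that Hugging Face tokenizers also accept :code:`padding="longest"`, which pads to the longest example in the batch in a single call and is equivalent in effect to the two-pass pattern in the listing above.

.. code-block:: python

   # Illustrative standalone sketch only: per-batch ("longest") padding with a
   # plain PyTorch DataLoader, outside of RapidFire AI. The checkpoint name,
   # toy rows, and dynamic_collate() helper are assumptions for this example.
   from typing import Dict, List

   import torch
   from torch.utils.data import DataLoader
   from transformers import AutoTokenizer

   tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

   # Toy in-memory rows with very different text lengths
   rows = [
       {"text": "Great movie, loved every minute.", "label": 1},
       {"text": "Terrible pacing and wooden acting. " * 20, "label": 0},
   ]

   def dynamic_collate(batch: List[Dict]) -> Dict[str, torch.Tensor]:
       # Pad only to the longest example in this minibatch, not the model max length
       texts = [row["text"] for row in batch]
       enc = tokenizer(texts, truncation=True, padding="longest", return_tensors="pt")
       enc["labels"] = torch.tensor([row["label"] for row in batch])
       return enc

   loader = DataLoader(rows, batch_size=2, collate_fn=dynamic_collate)
   for batch in loader:
       # Sequence length matches the longest review in the batch, not 512
       print(batch["input_ids"].shape, batch["labels"])

Within RapidFire AI itself, only the :func:`collate_fn()` and :func:`row_prep()` from the listing above are needed in :code:`MLSpec`; the surrounding :code:`DataLoader` boilerplate here is just for standalone illustration.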