Tutorial Use Case: News

Dataset

The News dataset is a popular NLP dataset for text summarization. Each input is a news document, given as a text string, drawn from two public news websites. The target is a summary of that document, also given as a text string.

The raw dataset files are sourced from this GitHub repo. There are only three raw files in total: news-train.csv, news-val.csv, and news-test.csv.

This is an in-situ dataset, i.e., there are no separate object files. All features and targets are present in the Example Structure File (ESF) itself. Please read API: Data Ingestion and Locators for more ESF-related details.

Model(s)

This tutorial notebook illustrates simple hyperparameter tuning with a single Transformer architecture based on bart-large. The exact model checkpoint and its tokenizer are both sourced from this HF models page, so they share the same HF model string.

This model was already fine-tuned on this dataset, so you should see good accuracy right from the start. If you'd like to retrain this model from scratch, you can reinitialize the weights in your create_model() function in MLSpec.

Config Knobs

As mentioned above, we perform simple hyperparameter tuning with GridSearch(). We pick two values each for lr (learning rate) and weight_decay (regularization), which gives four configurations in total.
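To make the size of the grid concrete, here is a minimal sketch in plain Python that enumerates every combination of two knobs. It does not use the GridSearch() API itself; the helper name expand_grid and the knob values are placeholders chosen for illustration, not the values used in the notebook.

```python
from itertools import product
from typing import Any, Dict, List


def expand_grid(knobs: Dict[str, List[Any]]) -> List[Dict[str, Any]]:
    """Return one config dict per combination of knob values (hypothetical helper)."""
    names = list(knobs)
    value_lists = [knobs[name] for name in names]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]


# Placeholder values for illustration only; the notebook may use different ones.
grid = expand_grid({"lr": [1e-5, 1e-4], "weight_decay": [0.0, 0.01]})
for cfg in grid:
    print(cfg)
```

Two values for each of two knobs produce four configurations, so the grid grows multiplicatively with each knob you add.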

Unlike the IMDB example, this use case highlights the custom user-defined metrics functions in our API, which we use to define the rouge-1 score for evaluating text summarization (more on rouge-1 here).
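For intuition, rouge-1 measures unigram overlap between a predicted summary and a reference summary. The sketch below is a simplified stand-in, assuming lowercase whitespace tokenization and the F1 variant; the function name rouge1_f1 is hypothetical, and the HF evaluate implementation applies its own tokenization, so its numbers can differ.

```python
from collections import Counter


def rouge1_f1(prediction: str, reference: str) -> float:
    """Simplified rouge-1 F1: unigram overlap with whitespace tokenization."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Five of six unigrams match in each direction.
score = rouge1_f1("the cat sat on the mat", "the cat lay on the mat")
print(round(score, 4))
```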

Step by Step Code in our API

Please check out the notebook named rf-tutorial-news.ipynb in the Jupyter home directory of your RapidFire AI cluster.

Custom User-Defined Metrics

To compute the rouge-1 score, we use the HF evaluate library.

# From the News tutorial notebook
def compute_metrics(self, loss: torch.Tensor, outputs: Any, minibatch, cfg: Dict[str, Any]) -> Dict[str, torch.Tensor]:
    """
    Function to compute train/val/test metrics using outputs from compute_forward() and corresponding minibatch.
    """
    import torch

    # use_aggregator=False returns one rouge score per example instead of a single aggregate.
    scores = self.rouge.compute(predictions=outputs["predicted-summary"], references=outputs["label-summary"], use_aggregator=False)
    # Return the sum and the example count so that aggregate_metrics() can compute an exact per-example average.
    sum_rouge1 = torch.tensor(sum(scores["rouge1"]))
    numex = torch.tensor(len(outputs["predicted-summary"]))
    return {"sum_rouge1": sum_rouge1, "numex": numex}


def aggregate_metrics(self, metrics: pd.DataFrame, cfg: Dict[str, Any]) -> Dict[str, Any]:
    """
    Function to aggregate metrics returned by compute_metrics() across all minibatches in an epoch.
    """
    # Each row of metrics holds the values returned by compute_metrics() for one minibatch.
    sumall = sum(metrics["sum_rouge1"])
    total = sum(metrics["numex"])
    return {"avg_rouge1": sumall / total}
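The reason for returning a sum and a count per minibatch, rather than a per-minibatch average, shows up when minibatches have unequal sizes (for example, a smaller last minibatch). The sketch below uses plain Python floats in place of tensors and a list of dicts in place of the DataFrame; the scores are made-up values for illustration.

```python
from typing import Dict, List

# Hypothetical per-example rouge-1 scores for two minibatches of unequal size.
minibatches: List[List[float]] = [[0.5, 0.5, 0.5, 0.5], [1.0]]

# What compute_metrics() would report for each minibatch: a sum and a count.
per_batch: List[Dict[str, float]] = [
    {"sum_rouge1": sum(scores), "numex": float(len(scores))}
    for scores in minibatches
]

# Aggregation as in aggregate_metrics(): total sum divided by total count.
sumall = sum(m["sum_rouge1"] for m in per_batch)
total = sum(m["numex"] for m in per_batch)
avg_rouge1 = sumall / total

# Averaging the per-minibatch averages over-weights the small minibatch.
mean_of_means = sum(m["sum_rouge1"] / m["numex"] for m in per_batch) / len(per_batch)

print(avg_rouge1, mean_of_means)
```

Here the sum-and-count aggregation matches the true per-example mean over all five examples, while the mean of minibatch means does not.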

Extension: Custom Transfer Learning with a Llama model

Coming soon