Tutorial Use Case: News
Dataset
The News dataset is a popular NLP dataset for text summarization. The input is a news document, given as a text string, drawn from two public news websites. The target is a summary of that document, also given as a text string.
The raw dataset files are sourced from this GitHub repo.
There are just 3 raw files in total: news-train.csv, news-val.csv, and news-test.csv.
This is an in-situ dataset, i.e., there are no separate object files. All features and targets are present in the Example Structure File (ESF) itself. Please read API: Data Ingestion and Locators for more ESF-related details.
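As a rough illustration of what "in-situ" means here, the sketch below reads a tiny ESF-style CSV using only Python's standard library. The column names document and summary and the two sample rows are hypothetical placeholders; check the header row of news-train.csv for the real names.

```python
import csv
import io

# Hypothetical stand-in for a few rows of news-train.csv.
# The real column names and contents may differ.
SAMPLE_ESF = """document,summary
"Officials opened a new bridge on Monday after two years of work.","New bridge opens."
"The city council voted to expand the bus network next year.","Council expands buses."
"""

def read_in_situ_examples(csv_text):
    """Return (input, target) string pairs stored directly in the ESF rows."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [(row["document"], row["summary"]) for row in reader]

examples = read_in_situ_examples(SAMPLE_ESF)
for document, summary in examples:
    print(f"{len(document.split())} input words -> {summary!r}")
```

Because both the feature and the target are plain strings inside each row, there is nothing further to resolve from separate object files.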
Model(s)
This tutorial notebook illustrates simple hyperparameter tuning with a single Transformer architecture based on bart-large.
The exact model checkpoint is sourced from this HF models page, and the HF tokenizer used is the one associated with that same HF model string.
This model was already fine-tuned on this dataset, so you should see good accuracy right from the start.
If you'd like to retrain this model from scratch, you can reinitialize the weights in your create_model() function in MLSpec.
Config Knobs
As mentioned above, we perform simple hyperparameter tuning with GridSearch().
We pick two values each for lr (learning rate) and weight_decay (regularization), which gives four configurations in total.
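To make the size of the search concrete, here is a minimal sketch of how a grid over two knobs expands into individual configurations. It uses plain Python and is not RapidFire AI's GridSearch() implementation; the candidate values and the expand_grid helper are hypothetical.

```python
from itertools import product

# Hypothetical candidate values; the notebook's actual values may differ.
search_space = {
    "lr": [1e-5, 1e-4],
    "weight_decay": [0.0, 0.01],
}

def expand_grid(space):
    """Return one config dict per combination of knob values."""
    names = list(space)
    value_lists = [space[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]

configs = expand_grid(search_space)
for cfg in configs:
    print(cfg)
```

Two values for each of two knobs yields 2 x 2 = 4 configurations, each of which is trained and evaluated separately.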
Unlike the IMDB example, this use case highlights the custom user-defined metrics functions in our API, which we use to define the rouge-1 score for evaluating text summarization (more on rouge-1 here).
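For intuition, rouge-1 measures unigram (single-word) overlap between a predicted summary and a reference summary. The sketch below computes a simplified rouge-1 F1 score by hand. It assumes lowercasing and whitespace tokenization, so its numbers can differ from those of the HF evaluate implementation, which applies its own tokenization rules; the rouge1_f1 helper is hypothetical.

```python
from collections import Counter

def rouge1_f1(prediction, reference):
    """Unigram-overlap F1 between two lowercased, whitespace-tokenized strings."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count per word (clipped matches).
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

score = rouge1_f1("the cat sat on the mat", "the cat lay on the mat")
print(round(score, 3))
```

In this example 5 of the 6 words match in both directions, so precision, recall, and F1 are all 5/6.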
Step by Step Code in our API
Please check out the notebook named rf-tutorial-news.ipynb in the Jupyter home directory in your RapidFire AI cluster.
Custom User-Defined Metrics
To compute the rouge-1 score, we use the HF evaluate library.
# From the News tutorial notebook
from typing import Any, Dict

import pandas as pd
import torch

def compute_metrics(self, loss: torch.Tensor, outputs: Any, minibatch, cfg: Dict[str, Any]) -> Dict[str, torch.Tensor]:
    """
    Function to compute train/val/test metrics using outputs from compute_forward() and corresponding minibatch.
    """
    import torch
    # self.rouge is assumed to be the HF evaluate ROUGE metric created elsewhere in the notebook.
    # use_aggregator = False returns one score per example instead of a single aggregate.
    scores = self.rouge.compute(predictions = outputs["predicted-summary"], references = outputs["label-summary"], use_aggregator = False)
    # Return the sum and the example count so that the epoch-level average weights every example equally.
    sum_rouge1 = torch.tensor(sum(scores["rouge1"]))
    numex = torch.tensor(len(outputs["predicted-summary"]))
    return {"sum_rouge1": sum_rouge1, "numex": numex}

def aggregate_metrics(self, metrics: pd.DataFrame, cfg: Dict[str, Any]):
    """
    Function to aggregate metrics returned by compute_metrics() across all minibatches in an epoch.
    """
    # Total rouge-1 over all examples divided by the total number of examples.
    sumall = sum(metrics["sum_rouge1"])
    total = sum(metrics["numex"])
    return {"avg_rouge1": sumall / total}
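The reason compute_metrics() returns a sum and a count, rather than a per-minibatch mean, is that minibatches can differ in size (typically the last one in an epoch is smaller). The sketch below contrasts the two aggregation strategies in plain Python; the three minibatch rows and both helper functions are hypothetical and only stand in for the values that would arrive in the metrics DataFrame.

```python
# Hypothetical per-minibatch outputs of compute_metrics(); the last minibatch is smaller.
minibatch_metrics = [
    {"sum_rouge1": 3.2, "numex": 8},
    {"sum_rouge1": 2.4, "numex": 8},
    {"sum_rouge1": 0.9, "numex": 2},
]

def per_example_average(rows):
    """Sum-then-divide, mirroring aggregate_metrics(): every example counts equally."""
    total_sum = sum(row["sum_rouge1"] for row in rows)
    total_count = sum(row["numex"] for row in rows)
    return total_sum / total_count

def mean_of_minibatch_means(rows):
    """Naive alternative: every minibatch counts equally, whatever its size."""
    return sum(row["sum_rouge1"] / row["numex"] for row in rows) / len(rows)

avg = per_example_average(minibatch_metrics)
naive = mean_of_minibatch_means(minibatch_metrics)
print(f"per-example average: {avg:.4f}")
print(f"mean of minibatch means: {naive:.4f}")
```

The two numbers disagree because the naive version gives the small final minibatch the same weight as the full ones; carrying numex through to the aggregation step avoids that bias.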
Extension: Custom Transfer Learning with a Llama model
Coming soon