Tutorial Use Case: News
=======================

Dataset
-------

The News dataset is a popular NLP dataset for text summarization. The input is a news document as a text string from two public news websites. The target is a summarized document as a text string. The raw dataset files are sourced from `this GitHub repo `_. There are just 3 raw files in total: :code:`news-train.csv`, :code:`news-val.csv`, and :code:`news-test.csv`.

This is an *in-situ dataset*, i.e., there are no separate object files. All features and targets are present in the Example Structure File (ESF) itself. Please read :doc:`API: Data Ingestion and Locators ` for more ESF-related details.

Model(s)
--------

This tutorial notebook illustrates simple hyperparameter tuning with a single Transformer architecture based on :code:`bart-large`. The exact model checkpoint and tokenizer used are sourced from `this HF models page `_. The HF tokenizer used is associated with that same HF model string.

This model was already fine-tuned on this dataset, so you should see good accuracy right from the start. If you would like to retrain this model from scratch, you can reinitialize the weights in your :func:`create_model()` function in :code:`MLSpec`.
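For intuition, here is a minimal sketch of loading a BART checkpoint and tokenizer with the HF :code:`transformers` library, and of one way to reinitialize weights for training from scratch. The checkpoint string :code:`facebook/bart-large` is an illustrative assumption; the tutorial itself uses the exact fine-tuned checkpoint from the HF models page linked above.

.. code-block:: python

    # Illustrative sketch only: "facebook/bart-large" is a stand-in
    # assumption; the tutorial uses the fine-tuned checkpoint from the
    # HF models page linked above.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    checkpoint = "facebook/bart-large"
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

    # To retrain from scratch inside create_model(), one option is to
    # build the model from its config alone, which gives fresh weights:
    untrained_model = AutoModelForSeq2SeqLM.from_config(model.config)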
Config Knobs
------------

As mentioned above, we perform simple hyperparameter tuning with :func:`GridSearch()`. We pick two values each for :code:`lr` (learning rate) and :code:`weight_decay` (regularization), as illustrated by the sketch below.
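For intuition, here is a plain-Python sketch of what such a grid enumerates. The knob values are illustrative assumptions, and the actual run uses :func:`GridSearch()` from our API rather than this loop.

.. code-block:: python

    # Plain-Python illustration of a 2 x 2 grid; the values below are
    # illustrative assumptions, not the tutorial's actual settings.
    from itertools import product

    lr_values = [2e-5, 5e-5]           # two candidate learning rates
    weight_decay_values = [0.0, 0.01]  # two candidate regularization strengths

    # GridSearch() trains one config per point in the cross product,
    # i.e., 2 x 2 = 4 configs in total here.
    for lr, weight_decay in product(lr_values, weight_decay_values):
        print(f"config: lr={lr}, weight_decay={weight_decay}")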
Unlike the IMDB example, this use case highlights the custom user-defined metrics functions in our API, which we use to define the :code:`rouge-1` score for evaluating text summarization (more on rouge-1 `here `_).

Step by Step Code in our API
----------------------------

Please check out the notebook named :code:`rf-tutorial-news.ipynb` in the Jupyter home directory on your RapidFire AI cluster.

Custom User-Defined Metrics
---------------------------

To compute the :code:`rouge-1` score, we use the `HF evaluate library `_.

.. code-block:: python

    # From the News tutorial notebook
    from typing import Any, Dict

    import pandas as pd
    import torch

    def compute_metrics(self, loss: torch.Tensor, outputs: Any, minibatch,
                        cfg: Dict[str, Any]) -> Dict[str, torch.Tensor]:
        """
        Compute train/val/test metrics using the outputs of
        compute_forward() and the corresponding minibatch.
        """
        # self.rouge is the HF evaluate rouge metric, loaded beforehand;
        # use_aggregator=False returns a per-example list of rouge1 scores.
        scores = self.rouge.compute(predictions=outputs["predicted-summary"],
                                    references=outputs["label-summary"],
                                    use_aggregator=False)
        sum_rouge1 = torch.tensor(sum(scores["rouge1"]))
        numex = torch.tensor(len(outputs["predicted-summary"]))
        return {"sum_rouge1": sum_rouge1, "numex": numex}

    def aggregate_metrics(self, metrics: pd.DataFrame,
                          cfg: Dict[str, Any]) -> Dict[str, Any]:
        """
        Aggregate the metrics returned by compute_metrics() across all
        minibatches in an epoch.
        """
        sumall = sum(metrics["sum_rouge1"])
        total = sum(metrics["numex"])
        return {"avg_rouge1": sumall / total}
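The :code:`self.rouge` object above is assumed to be the rouge metric from the HF evaluate library, loaded once up front. A standalone sketch of that loading and of the per-example output shape follows; the sample strings are illustrative only.

.. code-block:: python

    # Standalone sketch: how the metric used as self.rouge above is
    # expected to be loaded; the sample strings are illustrative only.
    import evaluate

    rouge = evaluate.load("rouge")

    predictions = ["the cat sat on the mat"]
    references = ["a cat was sitting on the mat"]

    # With use_aggregator=False, compute() returns one rouge1 score per
    # example as a list, which compute_metrics() then sums and counts.
    scores = rouge.compute(predictions=predictions,
                           references=references,
                           use_aggregator=False)
    print(scores["rouge1"])  # a list with one float per example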
Extension: Custom Transfer Learning with a Llama model
------------------------------------------------------

Coming soon.