
Source: AIML.com Research
About ROUGE
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a popular metric to evaluate the quality of automatically generated text. It originates from the paper ROUGE: A Package for Automatic Evaluation of Summaries. The term “gist” in its name emphasizes the overall understanding or essence of the content rather than focusing on exact word matches or specific details. In the context of ROUGE, it highlights the metric’s aim to measure the overlap of significant textual elements (like n-grams, longest common subsequences, or word pairs) between the generated summary and the ground truth summary. Before diving into the details of ROUGE, let’s introduce some key concepts that underpin this metric.
N-grams in ROUGE
Like BLEU, ROUGE uses n-grams, which are contiguous sequences of n words (a short extraction sketch follows this list). For example:
- 1-gram (unigram): Single words. Example: For the sentence “The dog barked,” the 1-grams are {the, dog, barked}
- 2-gram (bigram): Sequences of two words. Example: For the same sentence, the 2-grams are {the dog, dog barked}
- 3-gram (trigram): Sequences of three words. Example: The 3-gram for the sentence is {the dog barked}
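As a quick illustration, here is a minimal sketch of n-gram extraction (a hypothetical helper, assuming simple whitespace tokenization, not part of any ROUGE library):
def ngrams(text, n):
    # Slide a window of size n over the whitespace-tokenized, lowercased sentence.
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams("The dog barked", 1))  # [('the',), ('dog',), ('barked',)]
print(ngrams("The dog barked", 2))  # [('the', 'dog'), ('dog', 'barked')]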
Variants of ROUGE
ROUGE includes several variants, such as ROUGE-N, and ROUGE-L, each designed to measure text overlap in different ways. The most commonly used variant, ROUGE-N, evaluates the recall of overlapping n-grams between the generated text and ground truth texts. ROUGE-L focuses on the length of the Longest Common Subsequence (LCS) between the generated text and ground truth text, effectively capturing sentence-level structure and word order.
ROUGE-N: N-gram
ROUGE-N evaluates recall for n-grams. The formula is:
$ROUGE\text{-}N = \frac{\text{Number of overlapping n-grams between the generated text and the ground truth text}}{\text{Total number of n-grams in the ground truth text}}$
For example:
Generated text: “the dog jumped over the fence”
Ground truth: “the dog leaped over the fence”
For $ROUGE\text{-}2$ (bigram recall),
- Ground truth bigrams: {the dog, dog leaped, leaped over, over the, the fence}
- Generated text bigrams: {the dog, dog jumped, jumped over, over the, the fence}
- Overlapping bigrams: {the dog, over the, the fence}
Thus:
$ROUGE\text{-}2 = \frac{3}{5} = 0.6$
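The same ROUGE-2 recall can be reproduced with a short, illustrative sketch (hypothetical helper names, whitespace tokenization, a single reference); the Counter intersection implements the clipped match count from the formula above:
from collections import Counter

def ngram_counts(text, n):
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(candidate, reference, n):
    cand = ngram_counts(candidate, n)
    ref = ngram_counts(reference, n)
    overlap = sum((cand & ref).values())  # clipped count of matching n-grams
    return overlap / sum(ref.values())    # denominator: n-grams in the ground truth

print(rouge_n_recall("the dog jumped over the fence",
                     "the dog leaped over the fence", 2))  # 0.6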
ROUGE-L: Longest Common Subsequence
ROUGE-L evaluates the longest common subsequence (LCS) between the generated text and ground truth texts, effectively capturing sentence-level structure and word order. The final score is calculated as an F1 value, balancing precision and recall. It is worth noting that ROUGE-L allows skipping words in the middle of sequences, enabling it to focus on overall structural alignment rather than strict word-by-word matching. This flexibility makes it particularly suitable for evaluating text with slight variations in phrasing.
$R_{LCS} = \frac{LCS(C, T)}{\text{len}(T)}$
$P_{LCS} = \frac{LCS(C, T)}{\text{len}(C)}$
$F_{LCS} = \frac{(1 + \beta^2) R_{LCS} P_{LCS}}{\beta^2 P_{LCS} + R_{LCS}}$
Here, LCS(C, T) represents the length of the longest common subsequence between the generated text C and the ground truth text T, while $ \text{len}(T)$ denotes the length of the ground truth text. The parameter $\beta$ is a hyperparameter that controls the balance between recall and precision in ROUGE-L. A large $\beta$ value emphasizes recall, while a small value near zero shifts the focus to precision. Typically, $\beta$ is set to a relatively large value to prioritize recall in ROUGE-L evaluations.
For example:
Generated text: “the dog jumped over the fence”
Ground truth: “the dog leaped over the fence”
Longest Common Subsequence (LCS):
{the, dog, over, the, fence}
(length = 5, since words in the middle may be skipped)
Recall using Longest Common Subsequence $R_{LCS}$: $R_{LCS} = \frac{5}{6} \approx 0.83$
Precision using Longest Common Subsequence $P_{LCS}$: $P_{LCS} = \frac{5}{6} \approx 0.83$
F-value using Longest Common Subsequence $F_{LCS}$: since $R_{LCS} = P_{LCS} = \frac{5}{6}$, we get $F_{LCS} = \frac{(1 + \beta^2) \cdot \frac{5}{6} \cdot \frac{5}{6}}{\beta^2 \cdot \frac{5}{6} + \frac{5}{6}} = \frac{5}{6} \approx 0.83$ for any value of $\beta$.
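To make the LCS-based computation concrete, here is a minimal sketch (hypothetical helper names, whitespace tokenization, and $\beta$ set to 1 so the F-value is the plain harmonic mean):
def lcs_length(a, b):
    # Classic dynamic-programming computation of the LCS length over token lists.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

candidate = "the dog jumped over the fence".split()
reference = "the dog leaped over the fence".split()
lcs = lcs_length(candidate, reference)        # 5
r_lcs = lcs / len(reference)                  # 5/6 ≈ 0.83
p_lcs = lcs / len(candidate)                  # 5/6 ≈ 0.83
f_lcs = 2 * p_lcs * r_lcs / (p_lcs + r_lcs)   # ≈ 0.83 (beta = 1)
print(r_lcs, p_lcs, f_lcs)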
Code Snippet:
To calculate ROUGE scores, we can use the rouge_score library in Python as follows:
from rouge_score import rouge_scorer
# Define the ground truth (reference) and the generated (candidate) text
reference = "the dog leaped over the fence"
candidate = "the dog jumped over the fence"
# Initialize the ROUGE scorer
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)  # A stemmer reduces all variants of a word to their root form, to improve robustness.
# Calculate ROUGE scores
scores = scorer.score(reference, candidate)
print(scores)
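Each entry in the returned dictionary is a Score tuple with precision, recall, and fmeasure fields; for this sentence pair, scores['rouge2'].recall should come out to 0.6, matching the manual ROUGE-2 calculation above (assuming the library's default tokenization).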
The source code for the rouge_scorer implementation is available in the open-source google-research repository on GitHub.
The difference with BLEU
BLEU primarily focuses on precision, using the total number of n-grams in the generated text as the denominator. ROUGE, by contrast, emphasizes recall by using the total number of n-grams in the ground truth text as the denominator. Consequently, ROUGE reflects how much of the reference content is successfully captured by the generated text. This distinction makes ROUGE particularly suitable for summarization tasks, where the goal is not just fluency or brevity but faithful coverage of the essential content of the original text. In this context, recall is often more valuable than precision.
In general, it is useful to examine both BLEU and ROUGE scores when working with text generation tasks (summarization, translation, etc.).
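To see the denominator difference concretely, the short sketch below (illustrative only: unigrams, whitespace tokenization, and ignoring BLEU's brevity penalty) computes a precision-style and a recall-style score for a pair where the two diverge:
from collections import Counter

reference = "the quick brown fox jumps over the lazy dog"
candidate = "the fox jumps"
ref_counts = Counter(reference.split())
cand_counts = Counter(candidate.split())
overlap = sum((cand_counts & ref_counts).values())  # 3 matching unigrams
print(overlap / sum(cand_counts.values()))  # precision (generated-text denominator): 3/3 = 1.0
print(overlap / sum(ref_counts.values()))   # recall (ground-truth denominator):      3/9 ≈ 0.33
A very short generated text can thus achieve perfect precision while covering only a third of the reference content, which is exactly the failure mode a recall-oriented metric is designed to expose.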
Limitation
Insensitive to Meaning:
ROUGE focuses solely on word overlap and does not consider semantic similarity or the overall context.
Example:
Ground truth: “The weather is pleasant today.”
Generated text: “It’s a beautiful day.”
Even though the generated text conveys the same meaning, ROUGE may still assign a low score due to the lack of overlapping words. This highlights a key limitation of n-gram-based metrics—they often fail to capture semantic equivalence when paraphrasing occurs.
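Running the rouge_score scorer from the snippet above on this pair makes the limitation visible:
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL'], use_stemmer=True)
# Ground truth first, generated text second (same argument order as above)
scores = scorer.score("The weather is pleasant today.", "It's a beautiful day.")
print(scores)  # with no shared words, rouge1 and rougeL should all come out to 0.0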
Ignores Synonyms and Paraphrasing:
ROUGE does not recognize valid synonyms or paraphrases, penalizing generated texts that use alternative expressions.
Example:
Ground truth: “The boy was running quickly.”
Generated text: “The child sprinted fast.”
Here, “boy” and “child,” or “running quickly” and “sprinted fast,” are semantically similar. However, ROUGE still assigns a low score to the generated text in such cases, since it relies on exact word or phrase matches rather than capturing meaning.
Use Cases
Owing to its recall-oriented design, ROUGE is widely adopted in text summarization tasks to evaluate how closely a generated summary matches the human-written reference. It is also used in other text generation applications, such as headline generation and question answering, where recall of important information is more crucial than exact phrasing.
Video Explanations
- The video by Lewis provides clear and vivid examples demonstrating the exact computation of different ROUGE variants.