
Source: AIML.com Research
About ROUGE
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a popular metric to evaluate the quality of automatically generated text. It originates from the paper ROUGE: A Package for Automatic Evaluation of Summaries. The term “gist” in its name emphasizes the overall understanding or essence of the content rather than focusing on exact word matches or specific details. In the context of ROUGE, it highlights the metric’s aim to measure the overlap of significant textual elements (like n-grams, longest common subsequences, or word pairs) between the generated summary and the ground truth summary. Before diving into the details of ROUGE, let’s introduce some key concepts that underpin this metric.
N-grams in ROUGE
Like BLEU, ROUGE uses n-grams, which are contiguous sequences of n words (a short extraction sketch follows this list). For example:
- 1-gram (unigram): Single words. Example: For the sentence “The dog barked,” the 1-grams are {the, dog, barked}
- 2-gram (bigram): Sequences of two words. Example: For the same sentence, the 2-grams are {the dog, dog barked}
- 3-gram (trigram): Sequences of three words. Example: The 3-gram for the sentence is {the dog barked}
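As a quick illustration, here is a minimal sketch of n-gram extraction (a hypothetical helper, assuming simple whitespace tokenization, not part of any ROUGE library):
def ngrams(text, n):
    # Slide a window of size n over the whitespace-tokenized, lowercased sentence.
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams("The dog barked", 1))  # [('the',), ('dog',), ('barked',)]
print(ngrams("The dog barked", 2))  # [('the', 'dog'), ('dog', 'barked')]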
Variants of ROUGE
ROUGE includes several variants, such as ROUGE-N, and ROUGE-L, each designed to measure text overlap in different ways. The most commonly used variant, ROUGE-N, evaluates the recall of overlapping n-grams between the generated text and ground truth texts. ROUGE-L focuses on the length of the Longest Common Subsequence (LCS) between the generated text and ground truth text, effectively capturing sentence-level structure and word order.
ROUGE-N: N-gram
ROUGE-N evaluates recall for n-grams. The formula is:
$ROUGE\text{-}N = \frac{\text{Number of overlapping n-grams between the generated text and the ground truth text}}{\text{Total number of n-grams in the ground truth text}}$
For example:
Generated text: “the dog jumped over the fence”
Ground truth: “the dog leaped over the fence”
For $ROUGE\text{-}2$ (bigram recall),
- Ground truth bigrams: {the dog, dog leaped, leaped over, over the, the fence}
- Generated text bigrams: {the dog, dog jumped, jumped over, over the, the fence}
- Overlapping bigrams: {the dog, over the, the fence}
Thus:
$ROUGE\text{-}2 = \frac{3}{5} = 0.6$
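The same ROUGE-2 recall can be reproduced with a short, illustrative sketch (hypothetical helper names, whitespace tokenization, a single reference); the Counter intersection implements the clipped match count from the formula above:
from collections import Counter

def ngram_counts(text, n):
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(candidate, reference, n):
    cand = ngram_counts(candidate, n)
    ref = ngram_counts(reference, n)
    overlap = sum((cand & ref).values())  # clipped count of matching n-grams
    return overlap / sum(ref.values())    # denominator: n-grams in the ground truth

print(rouge_n_recall("the dog jumped over the fence",
                     "the dog leaped over the fence", 2))  # 0.6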
ROUGE-L: Longest Common Subsequence
ROUGE-L evaluates the longest common subsequence (LCS) between the generated text and ground truth texts, effectively capturing sentence-level structure and word order. The final score is calculated as an F1 value, balancing precision and recall. It is worth noting that ROUGE-L allows skipping words in the middle of sequences, enabling it to focus on overall structural alignment rather than strict word-by-word matching. This flexibility makes it particularly suitable for evaluating text with slight variations in phrasing.
$R_{LCS} = \frac{LCS(C, T)}{\text{len}(T)}$
$P_{LCS} = \frac{LCS(C, T)}{\text{len}(C)}$
$F_{LCS} = \frac{(1 + \beta^2) R_{LCS} P_{LCS}}{\beta^2 P_{LCS} + R_{LCS}}$
Here, LCS(C, T) represents the length of the longest common subsequence between the generated text C and the ground truth text T, while $ \text{len}(T)$ denotes the length of the ground truth text. The parameter $\beta$ is a hyperparameter that controls the balance between recall and precision in ROUGE-L. A large $\beta$ value emphasizes recall, while a small value near zero shifts the focus to precision. Typically, $\beta$ is set to a relatively large value to prioritize recall in ROUGE-L evaluations.
For example:
Generated text: “the dog jumped over the fence”
Ground truth: “the dog leaped over the fence”
Longest Common Subsequence (LCS):
{the, dog, over, the, fence}
(length = 5, since words in the middle may be skipped)
Recall using Longest Common Subsequence $R_{LCS}$: $R_{LCS} = \frac{5}{6} \approx 0.83$
Precision using Longest Common Subsequence $P_{LCS}$: $P_{LCS} = \frac{5}{6} \approx 0.83$
F-value using Longest Common Subsequence $F_{LCS}$: since $R_{LCS} = P_{LCS} = \frac{5}{6}$, we get $F_{LCS} = \frac{(1 + \beta^2) \cdot \frac{5}{6} \cdot \frac{5}{6}}{\beta^2 \cdot \frac{5}{6} + \frac{5}{6}} = \frac{5}{6} \approx 0.83$ for any value of $\beta$.
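To make the LCS-based computation concrete, here is a minimal sketch (hypothetical helper names, whitespace tokenization, and $\beta$ set to 1 so the F-value is the plain harmonic mean):
def lcs_length(a, b):
    # Classic dynamic-programming computation of the LCS length over token lists.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

candidate = "the dog jumped over the fence".split()
reference = "the dog leaped over the fence".split()
lcs = lcs_length(candidate, reference)        # 5
r_lcs = lcs / len(reference)                  # 5/6 ≈ 0.83
p_lcs = lcs / len(candidate)                  # 5/6 ≈ 0.83
f_lcs = 2 * p_lcs * r_lcs / (p_lcs + r_lcs)   # ≈ 0.83 (beta = 1)
print(r_lcs, p_lcs, f_lcs)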
Code Snippet:
To calculate ROUGE scores, we can use the rouge_score library in Python as follows:
from rouge_score import rouge_scorer
# Define the ground truth (reference) and the generated (candidate) text
reference = "the dog leaped over the fence"
candidate = "the dog jumped over the fence"
# Initialize the ROUGE scorer
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)  # A stemmer reduces all variants of a word to their root form, to improve robustness.
# Calculate ROUGE scores
scores = scorer.score(reference, candidate)
print(scores)
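Each entry in the returned dictionary is a Score tuple with precision, recall, and fmeasure fields; for this sentence pair, scores['rouge2'].recall should come out to 0.6, matching the manual ROUGE-2 calculation above (assuming the library's default tokenization).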
The source code for the rouge_scorer implementation is available in the open-source google-research repository on GitHub.
The difference with BLEU
BLEU primarily focuses on precision, using the total number of n-grams in the generated text as the denominator. ROUGE, by contrast, emphasizes recall by using the total number of n-grams in the ground truth text as the denominator. Consequently, ROUGE reflects how much of the reference content is successfully captured by the generated text. This distinction makes ROUGE particularly suitable for summarization tasks, where the goal is not just fluency or brevity but faithful coverage of the essential content of the original text. In this context, recall is often more valuable than precision.
In general, it is useful to examine both BLEU and ROUGE scores when working with text generation tasks (summarization, translation, etc.).
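To see the denominator difference concretely, the short sketch below (illustrative only: unigrams, whitespace tokenization, and ignoring BLEU's brevity penalty) computes a precision-style and a recall-style score for a pair where the two diverge:
from collections import Counter

reference = "the quick brown fox jumps over the lazy dog"
candidate = "the fox jumps"
ref_counts = Counter(reference.split())
cand_counts = Counter(candidate.split())
overlap = sum((cand_counts & ref_counts).values())  # 3 matching unigrams
print(overlap / sum(cand_counts.values()))  # precision (generated-text denominator): 3/3 = 1.0
print(overlap / sum(ref_counts.values()))   # recall (ground-truth denominator):      3/9 ≈ 0.33
A very short generated text can thus achieve perfect precision while covering only a third of the reference content, which is exactly the failure mode a recall-oriented metric is designed to expose.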
Limitation
Insensitive to Meaning:
ROUGE focuses solely on word overlap and does not consider semantic similarity or the overall context.
Example:
Ground truth: “The weather is pleasant today.”
Generated text: “It’s a beautiful day.”
Even though the generated text conveys the same meaning, ROUGE may still assign a low score due to the lack of overlapping words. This highlights a key limitation of n-gram-based metrics—they often fail to capture semantic equivalence when paraphrasing occurs.
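Running the rouge_score scorer from the snippet above on this pair makes the limitation visible:
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL'], use_stemmer=True)
# Ground truth first, generated text second (same argument order as above)
scores = scorer.score("The weather is pleasant today.", "It's a beautiful day.")
print(scores)  # with no shared words, rouge1 and rougeL should all come out to 0.0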
Ignores Synonyms and Paraphrasing:
ROUGE does not recognize valid synonyms or paraphrases, penalizing generated texts that use alternative expressions.
Example:
Ground truth: “The boy was running quickly.”
Generated text: “The child sprinted fast.”
Here, “boy” and “child,” or “running quickly” and “sprinted fast,” are semantically similar. However, ROUGE still assigns a low score to the generated text in such cases, since it relies on exact word or phrase matches rather than capturing meaning.
Use Cases
Owing to its recall-oriented design, ROUGE is widely adopted in text summarization tasks to evaluate how closely a generated summary matches the human-written reference. It is also used in other text generation applications, such as headline generation and question answering, where recall of important information is more crucial than exact phrasing.
Video Explanations
- The video by Lewis provides clear and vivid examples demonstrating the exact computation of different ROUGE variants.