
About BLEU
Researchers at IBM introduced BLEU (Bilingual Evaluation Understudy) in 2002, and it has since become a widely used metric for evaluating machine-generated translations by comparing them against one or more human-written references. It measures the precision of n-grams (contiguous sequences of n words) in the generated text relative to the reference(s).
Before diving into the definition of BLEU, it is important to first understand some key concepts related to it.
N-grams
In NLP, an n-gram is a contiguous sequence of n words. For example:
- 1-gram (unigram): Single words. Example: For the sentence “The dog barked,” the 1-grams are {the, dog, barked}.
- 2-gram (bigram): Sequences of two words. Example: For the same sentence, the 2-grams are {the dog, dog barked}.
- 3-gram (trigram): Sequences of three words. Example: The only 3-gram in the sentence is {the dog barked}.
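In code, n-grams can be extracted with a simple sliding window. Here is a minimal sketch, assuming whitespace tokenization:

```python
def ngrams(tokens, n):
    # Slide a window of size n over the token list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the dog barked".split()
print(ngrams(tokens, 1))  # [('the',), ('dog',), ('barked',)]
print(ngrams(tokens, 2))  # [('the', 'dog'), ('dog', 'barked')]
print(ngrams(tokens, 3))  # [('the', 'dog', 'barked')]
```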
Definition of BLEU
BLEU is defined as follows:
$$\text{BLEU} = BP \cdot \exp\left(\sum_{n=1}^{N} w_n \log P_n\right)$$
where $N$ is the maximum n-gram length (typically 4). To fully understand its meaning, we break the formula down into its components.
$P_n$: n-gram precision
$P_n$ is the proportion of n-grams in the candidate translation that also appear in the reference(s). For example:
Candidate → “the dog jumped over the fence”
Reference → “the dog leaped over the fence”
For $P_2$, we consider all the bigrams (sequences of two words) in the candidate sentence:
Candidate bigrams: {the dog, dog jumped, jumped over, over the, the fence}
Reference bigrams: {the dog, dog leaped, leaped over, over the, the fence}
We then check how many bigrams from the candidate appear in the reference. In this case, the overlapping bigrams are: {the dog, over the, the fence}.
There are 3 overlapping bigrams out of the 5 bigrams in the candidate, so: $P_2 = \frac{3}{5} = 0.6$
Now, let’s calculate $P_1, P_2, P_3, P_4$:
- $P_1$ = 5/6 = 0.833.
- $P_2$ = 3/5 = 0.6.
- $P_3$ = 1/4 = 0.25.
- $P_4$ = 0/3 = 0.
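The values above can be reproduced with a few lines of Python. Here is a minimal sketch, assuming whitespace tokenization and a single reference; note that BLEU uses clipped counts, so each candidate n-gram is credited at most as many times as it occurs in the reference:

```python
from collections import Counter

def ngrams(tokens, n):
    # Same sliding-window helper as in the n-gram sketch above.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_precision(candidate, reference, n):
    # Clipped precision: matched n-gram count / total n-grams in the candidate.
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
    return overlap / sum(cand_counts.values())

candidate = "the dog jumped over the fence".split()
reference = "the dog leaped over the fence".split()
for n in range(1, 5):
    print(f"P_{n} = {ngram_precision(candidate, reference, n):.3f}")
# P_1 = 0.833, P_2 = 0.600, P_3 = 0.250, P_4 = 0.000
```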
To account for the importance of n-grams of different lengths, BLEU assigns each $P_n$ a weight $w_n$ that controls its contribution to the final score. Typically, the weights are distributed evenly (e.g., $w_1 = w_2 = w_3 = w_4 = 0.25$). However, when certain n-gram lengths carry more significance, for instance emphasizing longer phrases to reward fluency, uneven weights can be assigned to highlight their impact on the overall score.
BP: Brevity Penalty
Occasionally, overly short translations can achieve artificially high n-gram precision by leaving out key parts of the sentence. To mitigate this issue, BLEU introduces a brevity penalty (BP), calculated as:
$$BP = \begin{cases} 1 & \text{if } c > r \\ e^{\,1 - r/c} & \text{if } c \le r \end{cases}$$
where $c$ is the length of the candidate translation and $r$ is the (effective) reference length. When the candidate is shorter than the reference, the exponent $1 - r/c$ is negative, so BP falls below 1. This penalizes overly short translations and prevents them from being unfairly favored.
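As a quick illustration, here is a minimal sketch of the brevity penalty, with $c$ the candidate length and $r$ the reference length:

```python
import math

def brevity_penalty(cand_len, ref_len):
    # BP = 1 when the candidate is longer than the reference;
    # otherwise exp(1 - r/c), which is less than 1.
    if cand_len > ref_len:
        return 1.0
    return math.exp(1.0 - ref_len / cand_len)

print(brevity_penalty(6, 6))  # 1.0   -- same length, no penalty
print(brevity_penalty(4, 6))  # ~0.61 -- short candidate is penalized
```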
Why use log and exp?
Multiplying the $P_n$ values directly can produce very small numbers. Taking the logarithm of each $P_n$ turns the product into a sum, which avoids numerical underflow; applying $\exp$ to the weighted sum of logarithms then maps this geometric mean back to the original precision scale.
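Putting the pieces together for the running example, here is a minimal sketch of the final combination; BP = 1 because the candidate and reference have the same length, and the hypothetical `bleu` helper returns 0 when any precision is 0 (the log would otherwise be undefined):

```python
import math

def bleu(p_n, weights, bp):
    # Weighted geometric mean of the precisions, computed in log space,
    # then scaled by the brevity penalty.
    if any(p == 0 for p in p_n):
        return 0.0
    return bp * math.exp(sum(w * math.log(p) for w, p in zip(weights, p_n)))

# With the standard 4-gram weights, P_4 = 0 drives the unsmoothed score to 0.
print(bleu([5/6, 3/5, 1/4, 0], [0.25] * 4, bp=1.0))  # 0.0
# Restricting to unigrams and bigrams gives a more informative value here.
print(bleu([5/6, 3/5], [0.5, 0.5], bp=1.0))          # ~0.707
```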
Code snippet
To calculate the BLEU score, we can simply call `sentence_bleu` from `nltk.translate.bleu_score` as follows:
```python
from nltk.translate.bleu_score import sentence_bleu

# Each reference is a list of tokens; sentence_bleu expects a list of references.
reference = [["the", "dog", "leaped", "over", "the", "fence"]]
candidate = ["the", "dog", "jumped", "over", "the", "fence"]

# Default weights are (0.25, 0.25, 0.25, 0.25), i.e. n-grams up to length 4.
score = sentence_bleu(reference, candidate)
print(f"BLEU Score: {score}")
```
The source code of `nltk.translate.bleu_score` is available in the NLTK repository.
Limitations
1. Insensitive to Meaning: BLEU focuses solely on n-gram overlap, ignoring the semantic meaning of words or the grammatical structure. For example,
Candidate → “The cat ate the food.”
Reference → “The food was eaten by the kitty.”
Although both sentences convey the same meaning, the BLEU score may be low because the two share few n-grams. Additionally, BLEU does not explicitly check grammar: an ungrammatical candidate such as “The cat food ate” is penalized only insofar as its n-gram overlap drops, not because it is ill-formed.
2. Fixed References: BLEU depends heavily on the quality and diversity of reference translations. For example,
Candidate → “He enjoys swimming in the sea.”
Reference 1 → “He loves to swim in the ocean.”
Reference 2 (missing) → “He enjoys swimming in seawater.”
If only Reference 1 is provided, the BLEU score for the candidate translation will likely be low, even though it aligns semantically with the reference. Including Reference 2 would significantly improve the score.
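A quick check with NLTK illustrates the effect of the second reference (smoothing is applied so the single-reference score does not collapse to zero; exact numbers may vary by NLTK version):

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

candidate = "he enjoys swimming in the sea".split()
ref1 = "he loves to swim in the ocean".split()
ref2 = "he enjoys swimming in seawater".split()

smooth = SmoothingFunction().method1
print(sentence_bleu([ref1], candidate, smoothing_function=smooth))        # one reference: low
print(sentence_bleu([ref1, ref2], candidate, smoothing_function=smooth))  # both references: noticeably higher
```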
3. Sentence-Level Weakness: BLEU is better suited for corpus-level evaluation and can be unreliable for sentence-level scoring. For instance,
Candidate → “A man is walking his dog in the park.”
Reference → “A person strolls through the park with a dog.”
Although “a man” and “a person” or “walking” and “strolls” convey the same meaning, BLEU fails to recognize such semantic similarities. Likewise, phrases like “in the park” and “through the park” are valid paraphrases, yet BLEU treats them as mismatches. As a result, these differences reduce n-gram overlap and can lead to lower BLEU scores for individual sentences. Nevertheless, when applied to large corpora, these minor mismatches tend to average out. Therefore, BLEU becomes more stable and reliable as an evaluation metric at the corpus level.
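For corpus-level scoring, NLTK provides `corpus_bleu`, which pools n-gram counts across all sentence pairs before computing the precisions, rather than averaging per-sentence scores. A minimal sketch, reusing the example sentences from this article:

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# One list of references per hypothesis.
hypotheses = [
    "the dog jumped over the fence".split(),
    "a man is walking his dog in the park".split(),
]
list_of_references = [
    ["the dog leaped over the fence".split()],
    ["a person strolls through the park with a dog".split()],
]

smooth = SmoothingFunction().method1
print(corpus_bleu(list_of_references, hypotheses, smoothing_function=smooth))
```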
Use Cases
Initially, researchers developed BLEU to evaluate the quality of machine-generated translations. Since then, it has evolved into a widely adopted standard for various NLP tasks. For example, in text generation applications such as summarization and dialogue generation, BLEU is commonly used to measure the similarity between a model’s output and one or more reference texts. Consequently, it remains a core metric in benchmarking language models.
Video Explanation
- This video by Andrew Ng provides a clear and engaging explanation of the BLEU score with vivid examples.