
About BLEU
Researchers at IBM introduced BLEU (Bilingual Evaluation Understudy) in 2002, and it has since become a widely used metric for evaluating machine-generated translations by comparing them against one or more human-written references. It measures the precision of n-grams (contiguous sequences of n words) in the generated text relative to the reference(s).
Before diving into the definition of BLEU, it is important to first understand some key concepts related to it.
N-grams
In NLP, an n-gram is a contiguous sequence of n words. For example:
- 1-gram (unigram): Single words. Example: For the sentence “The dog barked,” the 1-grams are {the, dog, barked}.
- 2-gram (bigram): Sequences of two words. Example: For the same sentence, the 2-grams are {the dog, dog barked}.
- 3-gram (trigram): Sequences of three words. Example: The only 3-gram in the sentence is {the dog barked}.
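In code, n-grams can be extracted with a simple sliding window. Here is a minimal sketch, assuming whitespace tokenization:

```python
def ngrams(tokens, n):
    # Slide a window of size n over the token list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the dog barked".split()
print(ngrams(tokens, 1))  # [('the',), ('dog',), ('barked',)]
print(ngrams(tokens, 2))  # [('the', 'dog'), ('dog', 'barked')]
print(ngrams(tokens, 3))  # [('the', 'dog', 'barked')]
```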
Definition of BLEU
BLEU is defined as follows:
$$\text{BLEU} = BP \cdot \exp\left(\sum_{n=1}^{N} w_n \log P_n\right)$$
where $N$ is the maximum n-gram length (typically 4). To fully understand its meaning, we break the formula down into its components.
$P_n$: n-gram precision
$P_n$ is the proportion of n-grams in the candidate translation that also appear in the reference(s). For example:
Candidate → “the dog jumped over the fence”
Reference → “the dog leaped over the fence”
For $P_2$, we consider all the bigrams (sequences of two words) in the candidate sentence:
Candidate bigrams: {the dog, dog jumped, jumped over, over the, the fence}
Reference bigrams: {the dog, dog leaped, leaped over, over the, the fence}
We then check how many bigrams from the candidate appear in the reference. In this case, the overlapping bigrams are: {the dog, over the, the fence}.
There are 3 overlapping bigrams out of the 5 bigrams in the candidate, so: $P_2 = \frac{3}{5} = 0.6$
Now, let’s calculate $P_1, P_2, P_3, P_4$:
- $P_1$ = 5/6 = 0.833.
- $P_2$ = 3/5 = 0.6.
- $P_3$ = 1/4 = 0.25.
- $P_4$ = 0/3 = 0.
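The values above can be reproduced with a few lines of Python. Here is a minimal sketch, assuming whitespace tokenization and a single reference; note that BLEU uses clipped counts, so each candidate n-gram is credited at most as many times as it occurs in the reference:

```python
from collections import Counter

def ngrams(tokens, n):
    # Same sliding-window helper as in the n-gram sketch above.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_precision(candidate, reference, n):
    # Clipped precision: matched n-gram count / total n-grams in the candidate.
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
    return overlap / sum(cand_counts.values())

candidate = "the dog jumped over the fence".split()
reference = "the dog leaped over the fence".split()
for n in range(1, 5):
    print(f"P_{n} = {ngram_precision(candidate, reference, n):.3f}")
# P_1 = 0.833, P_2 = 0.600, P_3 = 0.250, P_4 = 0.000
```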
To account for the importance of n-grams of different lengths, BLEU assigns each $P_n$ a weight $w_n$ that controls its contribution to the final score. Typically, the weights are distributed evenly (e.g., $w_1 = w_2 = w_3 = w_4 = 0.25$). However, when certain n-gram lengths carry more significance, for instance emphasizing longer phrases to reward fluency, uneven weights can be assigned to highlight their impact on the overall score.
BP: Brevity Penalty
Occasionally, overly short translations can achieve artificially high n-gram precision by leaving out key parts of the sentence. To mitigate this issue, BLEU introduces a brevity penalty (BP), calculated as:
$$BP = \begin{cases} 1 & \text{if } c > r \\ e^{\,1 - r/c} & \text{if } c \le r \end{cases}$$
where $c$ is the length of the candidate translation and $r$ is the (effective) reference length. When the candidate is shorter than the reference, the exponent $1 - r/c$ is negative, so BP falls below 1. This penalizes overly short translations and prevents them from being unfairly favored.
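As a quick illustration, here is a minimal sketch of the brevity penalty, with $c$ the candidate length and $r$ the reference length:

```python
import math

def brevity_penalty(cand_len, ref_len):
    # BP = 1 when the candidate is longer than the reference;
    # otherwise exp(1 - r/c), which is less than 1.
    if cand_len > ref_len:
        return 1.0
    return math.exp(1.0 - ref_len / cand_len)

print(brevity_penalty(6, 6))  # 1.0   -- same length, no penalty
print(brevity_penalty(4, 6))  # ~0.61 -- short candidate is penalized
```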
Why use log and exp?
Multiplying the $P_n$ values directly can produce very small numbers. Taking the logarithm of each $P_n$ turns the product into a sum, which avoids numerical underflow; applying $\exp$ to the weighted sum of logarithms then maps this geometric mean back to the original precision scale.
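Putting the pieces together for the running example, here is a minimal sketch of the final combination; BP = 1 because the candidate and reference have the same length, and the hypothetical `bleu` helper returns 0 when any precision is 0 (the log would otherwise be undefined):

```python
import math

def bleu(p_n, weights, bp):
    # Weighted geometric mean of the precisions, computed in log space,
    # then scaled by the brevity penalty.
    if any(p == 0 for p in p_n):
        return 0.0
    return bp * math.exp(sum(w * math.log(p) for w, p in zip(weights, p_n)))

# With the standard 4-gram weights, P_4 = 0 drives the unsmoothed score to 0.
print(bleu([5/6, 3/5, 1/4, 0], [0.25] * 4, bp=1.0))  # 0.0
# Restricting to unigrams and bigrams gives a more informative value here.
print(bleu([5/6, 3/5], [0.5, 0.5], bp=1.0))          # ~0.707
```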
Code snippet
To calculate the BLEU score, we can simply call `sentence_bleu` from `nltk.translate.bleu_score` as follows:
```python
from nltk.translate.bleu_score import sentence_bleu

# Each reference is a list of tokens; sentence_bleu expects a list of references.
reference = [["the", "dog", "leaped", "over", "the", "fence"]]
candidate = ["the", "dog", "jumped", "over", "the", "fence"]

# Default weights are (0.25, 0.25, 0.25, 0.25), i.e. n-grams up to length 4.
score = sentence_bleu(reference, candidate)
print(f"BLEU Score: {score}")
```
The source code of `nltk.translate.bleu_score` is available in the NLTK repository.
Limitations
1. Insensitive to Meaning: BLEU focuses solely on n-gram overlap, ignoring the semantic meaning of words or the grammatical structure. For example,
Candidate → “The cat ate the food.”
Reference → “The food was eaten by the kitty.”
Although both sentences convey the same meaning, the BLEU score may be low because the two share few n-grams. Additionally, BLEU does not explicitly check grammar: an ungrammatical candidate such as “The cat food ate” is penalized only insofar as its n-gram overlap drops, not because it is ill-formed.
2. Fixed References: BLEU depends heavily on the quality and diversity of reference translations. For example,
Candidate → “He enjoys swimming in the sea.”
Reference 1 → “He loves to swim in the ocean.”
Reference 2 (missing) → “He enjoys swimming in seawater.”
If only Reference 1 is provided, the BLEU score for the candidate translation will likely be low, even though it aligns semantically with the reference. Including Reference 2 would significantly improve the score.
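A quick check with NLTK illustrates the effect of the second reference (smoothing is applied so the single-reference score does not collapse to zero; exact numbers may vary by NLTK version):

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

candidate = "he enjoys swimming in the sea".split()
ref1 = "he loves to swim in the ocean".split()
ref2 = "he enjoys swimming in seawater".split()

smooth = SmoothingFunction().method1
print(sentence_bleu([ref1], candidate, smoothing_function=smooth))        # one reference: low
print(sentence_bleu([ref1, ref2], candidate, smoothing_function=smooth))  # both references: noticeably higher
```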
3. Sentence-Level Weakness: BLEU is better suited for corpus-level evaluation and can be unreliable for sentence-level scoring. For instance,
Candidate → “A man is walking his dog in the park.”
Reference → “A person strolls through the park with a dog.”
Although “a man” and “a person” or “walking” and “strolls” convey the same meaning, BLEU fails to recognize such semantic similarities. Likewise, phrases like “in the park” and “through the park” are valid paraphrases, yet BLEU treats them as mismatches. As a result, these differences reduce n-gram overlap and can lead to lower BLEU scores for individual sentences. Nevertheless, when applied to large corpora, these minor mismatches tend to average out. Therefore, BLEU becomes more stable and reliable as an evaluation metric at the corpus level.
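For corpus-level scoring, NLTK provides `corpus_bleu`, which pools n-gram counts across all sentence pairs before computing the precisions, rather than averaging per-sentence scores. A minimal sketch, reusing the example sentences from this article:

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# One list of references per hypothesis.
hypotheses = [
    "the dog jumped over the fence".split(),
    "a man is walking his dog in the park".split(),
]
list_of_references = [
    ["the dog leaped over the fence".split()],
    ["a person strolls through the park with a dog".split()],
]

smooth = SmoothingFunction().method1
print(corpus_bleu(list_of_references, hypotheses, smoothing_function=smooth))
```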
Use Cases
Initially, researchers developed BLEU to evaluate the quality of machine-generated translations. Since then, it has evolved into a widely adopted standard for various NLP tasks. For example, in text generation applications such as summarization and dialogue generation, BLEU is commonly used to measure the similarity between a model’s output and one or more reference texts. Consequently, it remains a core metric in benchmarking language models.
Video Explanation
- This video by Andrew Ng provides a clear and engaging explanation of the BLEU score with vivid examples.