What are some of the approaches for decoding the next word in LLMs?

Figure: Predicting/Decoding the next token of a large language model (Source: Understanding LLM Decoding Strategies)

Introduction

Large Language Models (LLMs) are auto-regressive models: given a sequence of previous words, they compute a probability distribution over candidate next words (or tokens) based on patterns learned from their training data. These probabilities are then decoded to generate text. Various decoding strategies exist for next-word prediction. This post explores these approaches, compares their behavior, and demonstrates how LLMs allow customization through parameters like temperature. But before getting to that, let’s understand some important concepts first.

Auto-regressive model

Auto-regressive (AR) models are a class of models that generate output sequentially by predicting the next element in a sequence based on the previous elements. In the context of language models, an AR model generates text one token at a time, using all the tokens generated so far as input for the next prediction.

Decoding of Token Probabilities

Language models predict the next token by processing input text through a series of steps. First, the input is tokenized and mapped to embeddings, which are then passed through the transformer decoder. The decoder uses self-attention and feed-forward layers to generate a contextualized sequence representation. This representation is projected through a linear layer into the vocabulary space, producing logits for each token. Applying a softmax activation converts these logits into a probability distribution, with each value representing the likelihood of a token being the next in the sequence. Finally, the decoding process determines which token to select based on the probability distribution.
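The sketch below illustrates the last two steps of this pipeline: projecting a contextualized hidden state into the vocabulary space to obtain logits, then applying softmax to get a probability distribution. The vocabulary, dimensions, and weights are made up purely for illustration.

```python
import numpy as np

np.random.seed(0)

# Toy setup: a 5-token vocabulary and a contextualized hidden state
# produced by the transformer decoder for the last position.
vocab = ["I", "really", "love", "cats", "dogs"]
hidden_state = np.random.randn(8)           # d_model = 8 (illustrative)
W_out = np.random.randn(8, len(vocab))      # linear projection into vocabulary space

logits = hidden_state @ W_out               # one raw score (logit) per vocabulary token

def softmax(z):
    z = z - z.max()                         # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

probs = softmax(logits)                     # probability of each token being next
for token, p in zip(vocab, probs):
    print(f"{token:>6}: {p:.3f}")
```

The decoding strategies described next all start from a distribution like `probs` and differ only in how they pick a token from it.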

Common Approaches for Next-Word Prediction

Greedy Search is a decoding method where, at each step, the model selects the token with the highest probability. This straightforward approach ensures that the output is deterministic and coherent in many cases. However, it often struggles with generating diverse or creative responses, as it follows the most likely path without exploring alternatives. This can lead to repetitive loops or suboptimal outputs, especially in tasks requiring nuanced or imaginative answers.

As depicted in the figure below, greedy search selects the highest-probability token at each step, resulting in the sequence “I really love cats.” Note that even though the final word choices “cats” and “dogs” have similar token probabilities, greedy search will always generate “I really love cats.”

Figure: Example of Decoding using Greedy Search (Source: AIML.com Research)
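A minimal greedy decoding loop might look like the sketch below. The `next_token_probs` function is a purely hypothetical stand-in for a real model’s forward pass, hard-coded to mimic the figure above.

```python
import numpy as np

def next_token_probs(context):
    """Hypothetical stand-in for a model forward pass: returns a probability
    distribution over a toy vocabulary, given the tokens generated so far."""
    vocab = ["really", "love", "cats", "dogs", "<eos>"]
    table = {
        ("I",): [0.80, 0.10, 0.05, 0.04, 0.01],
        ("I", "really"): [0.05, 0.80, 0.07, 0.07, 0.01],
        ("I", "really", "love"): [0.01, 0.01, 0.49, 0.47, 0.02],
    }
    probs = table.get(tuple(context), [0.0, 0.0, 0.0, 0.0, 1.0])
    return vocab, np.array(probs)

def greedy_decode(context, max_new_tokens=10):
    tokens = list(context)
    for _ in range(max_new_tokens):
        vocab, probs = next_token_probs(tokens)
        best = vocab[int(np.argmax(probs))]   # always take the most probable token
        if best == "<eos>":
            break
        tokens.append(best)
    return " ".join(tokens)

print(greedy_decode(["I"]))  # -> "I really love cats" (never "dogs")
```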

Beam Search is a decoding strategy that simultaneously explores multiple candidate sequences (referred to as beams) rather than committing to a single token at each step like greedy search. At each step, it keeps only a fixed number of sequences (the beam width) with the highest cumulative probabilities, ultimately choosing the one with the highest overall probability. This method improves coherence and diversity compared to greedy search, as it evaluates a broader range of possibilities. However, beam search can be computationally expensive and may still generate repetitive or overly generic outputs, particularly in tasks requiring more creativity or specificity.

As shown in the figure below, a beam search with a beam width of 2 tracks two candidate sequences and selects the one with the highest cumulative probability. In this example, the chosen sequence is “I like it.”

Figure: Example of Decoding using Beam Search with beam=2 (Source: AIML.com Research)
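The sketch below implements a small beam search. It scores partial sequences by their cumulative log-probability and keeps only the `beam_width` best at every step. The `toy_probs` function is a hypothetical, hard-coded model chosen so that “I like it” wins, as in the figure.

```python
import numpy as np

def beam_search(next_token_probs, start, beam_width=2, max_new_tokens=10, eos="<eos>"):
    """Keep the `beam_width` partial sequences with the highest cumulative
    log-probability at each step; return the best one found."""
    beams = [(list(start), 0.0)]                       # (tokens, cumulative log-prob)
    for _ in range(max_new_tokens):
        candidates = []
        for tokens, score in beams:
            if tokens[-1] == eos:                      # finished beams still compete
                candidates.append((tokens, score))
                continue
            vocab, probs = next_token_probs(tokens)
            for tok, p in zip(vocab, probs):
                if p > 0:
                    candidates.append((tokens + [tok], score + np.log(p)))
        # Prune to the top `beam_width` candidates by cumulative log-probability.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if all(t[-1] == eos for t, _ in beams):
            break
    return beams[0]

# Tiny hypothetical model: "I like it" has the highest cumulative probability.
def toy_probs(tokens):
    table = {
        ("I",): (["like", "really"], [0.55, 0.45]),
        ("I", "like"): (["it", "<eos>"], [0.90, 0.10]),
        ("I", "really"): (["love", "<eos>"], [0.60, 0.40]),
        ("I", "like", "it"): (["<eos>"], [1.0]),
        ("I", "really", "love"): (["<eos>"], [1.0]),
    }
    vocab, probs = table[tuple(tokens)]
    return vocab, np.array(probs)

tokens, score = beam_search(toy_probs, ["I"], beam_width=2)
print(" ".join(t for t in tokens if t != "<eos>"), f"(p = {np.exp(score):.3f})")
```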

Sampling

Sampling is a decoding method that introduces randomness by selecting the next token based on its probability distribution rather than always choosing the most likely one. This approach allows for more creative and diverse responses, making it particularly suitable for tasks like storytelling or creative writing. However, the randomness in sampling can lead to incoherent or less relevant outputs, especially if the probabilities for lower-quality tokens are not sufficiently suppressed.

Example:

Given the probability distribution for the next word:

{“happy”: 0.5, “excited”: 0.3, “sad”: 0.2}

Sampling might choose “excited” or “sad” instead of the most likely token “happy”, introducing variability to the response.
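A minimal sketch of plain sampling over this toy distribution is shown below; running it repeatedly will return different tokens, weighted by their probabilities.

```python
import numpy as np

rng = np.random.default_rng()

probs = {"happy": 0.5, "excited": 0.3, "sad": 0.2}

# Draw the next token at random, weighted by its probability.
tokens = list(probs)
choice = rng.choice(tokens, p=list(probs.values()))
print(choice)  # "happy" about half the time, otherwise "excited" or "sad"
```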

Top-K Sampling

Top-k Sampling is a decoding strategy that restricts the selection pool for the next token to the top k highest-probability candidates, effectively filtering out less likely words. Sampling is then performed within this reduced set, introducing controlled randomness while maintaining coherence. This method balances creativity and reliability, making it well-suited for tasks like dialogue generation or creative writing. However, the choice of the k value is critical; a small k can limit diversity, while a large k may reintroduce incoherence.

Example:

For the probability distribution:

{“happy”: 0.45, “excited”: 0.30, “sad”: 0.18, “angry”: 0.05, “confused”: 0.02}

If _k = 3_, the pool is reduced to:

{“happy”: 0.45, “excited”: 0.30, “sad”: 0.18}

Sampling is then performed within this set (after renormalizing the probabilities so they sum to 1), ensuring the model avoids unlikely tokens like “angry” or “confused” while still allowing variation in its output.
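A short sketch of top-k sampling over this distribution, keeping only the three most probable tokens and renormalizing before sampling:

```python
import numpy as np

rng = np.random.default_rng()

def top_k_sample(probs, k):
    """Keep only the k most probable tokens, renormalize, then sample."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    tokens = [t for t, _ in top]
    weights = np.array([p for _, p in top])
    weights = weights / weights.sum()          # renormalize within the reduced pool
    return rng.choice(tokens, p=weights)

probs = {"happy": 0.45, "excited": 0.30, "sad": 0.18, "angry": 0.05, "confused": 0.02}
print(top_k_sample(probs, k=3))                # never "angry" or "confused"
```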

Top-P (Nucleus) Sampling

Top-p (Nucleus) Sampling is a decoding method that dynamically selects the smallest set of tokens whose cumulative probability meets or exceeds a specified threshold (p). Unlike top-k sampling, which fixes the number of candidates, top-p adapts the pool size based on the distribution, ensuring that only the most probable tokens are considered while still allowing for diversity. This approach provides flexibility and helps balance coherence and creativity. However, careful tuning of the _p_ value is necessary to avoid overly random or overly generic outputs.

Example:

For the probability distribution:

{“happy”: 0.4, “excited”: 0.3, “sad”: 0.2, “angry”: 0.05, “confused”: 0.05}

If p = 0.7, the pool is:

{“happy”: 0.4, “excited”: 0.3}

The cumulative probability reaches the 0.7 threshold with these two tokens, so sampling occurs only within this set (after renormalization), excluding less likely options like “sad” or “angry.”
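A minimal sketch of top-p sampling on this distribution; the pool is grown token by token, from most to least probable, until the cumulative probability reaches p:

```python
import numpy as np

rng = np.random.default_rng()

def top_p_sample(probs, p):
    """Keep the smallest set of most-probable tokens whose cumulative
    probability reaches p, renormalize, then sample."""
    items = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    pool, cumulative = [], 0.0
    for token, prob in items:
        pool.append((token, prob))
        cumulative += prob
        if cumulative >= p:                    # threshold reached: stop growing the pool
            break
    tokens = [t for t, _ in pool]
    weights = np.array([w for _, w in pool])
    return rng.choice(tokens, p=weights / weights.sum())

probs = {"happy": 0.4, "excited": 0.3, "sad": 0.2, "angry": 0.05, "confused": 0.05}
print(top_p_sample(probs, p=0.7))              # pool is only {"happy", "excited"}
```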

Strategy for Controlling Next-Word Prediction in LLMs

One of the most common and effective strategies for controlling next-word prediction in LLMs is temperature.

What is Temperature?

Temperature is a parameter that adjusts the randomness of a model’s predictions by scaling the logits before applying the softmax function. A higher temperature (e.g., 1.0 or above) flattens the probability distribution, making the model more likely to sample less probable tokens and producing more random, diverse outputs. Conversely, a lower temperature (e.g., 0.2) sharpens the distribution, increasing the likelihood of selecting the most probable tokens and leading to more deterministic, focused outputs.

The temperature typically ranges from 0 to 1, though it can be set higher for experimental scenarios. Commonly used values are between 0.7 and 1.0 for creative tasks (e.g., story generation) and around 0.2 to 0.5 for tasks requiring deterministic outputs (e.g., summarization or code generation).

The formula for applying temperature in a model’s predictions can be written as:

$$P_i = \frac{\exp\left(\frac{z_i}{T}\right)}{\sum_{j=1}^{N} \exp\left(\frac{z_j}{T}\right)}$$

where:

  • $P_i$: Probability of selecting token $i$.
  • $z_i$: Logit (raw score) for token $i$.
  • $T$: Temperature parameter.
  • $N$: Total number of tokens in the vocabulary.

Explaining the concept of Temperature using an example:

For the logits {happy: 2.0, excited: 1.5, sad: 1.0}, applying different temperature values results in varying distributions:

  • Temperature = 1.0:

Probability distribution (approx.): {happy: 0.51, excited: 0.31, sad: 0.19}

The distribution is relatively flat, so the output is more diverse and may select “excited” or “sad” fairly often.

  • Temperature = 0.2:

Probability distribution (approx.): {happy: 0.92, excited: 0.08, sad: 0.01}

The output is highly focused, almost always selecting “happy.”
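The short script below reproduces these distributions by applying the temperature-scaled softmax from the formula above to the toy logits:

```python
import numpy as np

def softmax_with_temperature(logits, T):
    z = np.array(list(logits.values())) / T     # scale logits by 1/T before softmax
    z = z - z.max()                             # numerical stability
    e = np.exp(z)
    return dict(zip(logits, e / e.sum()))

logits = {"happy": 2.0, "excited": 1.5, "sad": 1.0}

print(softmax_with_temperature(logits, T=1.0))
# ~{'happy': 0.51, 'excited': 0.31, 'sad': 0.19}  -> relatively flat, more diverse samples
print(softmax_with_temperature(logits, T=0.2))
# ~{'happy': 0.92, 'excited': 0.08, 'sad': 0.01}  -> sharply peaked, almost always "happy"
```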

Use Cases of Different Decoding Methods

Different decoding strategies are suited to different scenarios. The table below outlines some examples of scenarios where each decoding strategy is most appropriate.

Table: Scenarios and examples for each decoding strategy (Source: AIML.com Research)

Other Novel Decoding Methods

In addition to the commonly used decoding methods mentioned above, there are some novel approaches designed specifically to help reduce hallucinations. Below, we introduce a few of these methods.

DoLa: Decoding by Contrasting Layers

Interpretability studies of Transformer-based language models have shown that earlier layers encode “low-level” information (such as part-of-speech tags), while later layers capture more “high-level” semantic knowledge. The primary inspiration behind DoLa’s approach to reducing hallucinations is to emphasize the knowledge in the higher layers while de-emphasizing information from the lower layers.

As illustrated in the figure below, when predicting the next token, the most appropriate choice is “Olympia.” However, if we only consider the final layer, the model might still assign a high probability to “Seattle.” In such cases, “Seattle” maintains a relatively high probability across all layers, while “Olympia” gains probability mainly in the later layers, so contrasting the predictions of different layers can significantly increase the probability of the correct answer, “Olympia.”

Figure: An illustration of how an LLM gradually integrates factual information across its layers (Source: DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models by Chuang et al.)

The paper also provides specific formulas on how to implement contrastive analysis of token probabilities across different transformer layers, enabling the effective identification of the most preferred answer. For more details, please refer to the original paper.
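The layer-contrast idea can be sketched very roughly as below. This is a deliberately simplified illustration, not the paper’s full method (DoLa also selects the contrasted “premature” layer dynamically and restricts the contrast to a set of plausible tokens); the layer-wise probabilities here are made up.

```python
import numpy as np

# Hypothetical next-token probabilities for the capital-of-Washington example,
# read out from an early ("premature") layer and the final ("mature") layer.
premature = {"Seattle": 0.60, "Olympia": 0.30, "Tacoma": 0.10}
mature    = {"Seattle": 0.45, "Olympia": 0.50, "Tacoma": 0.05}

# Contrast the two distributions: tokens whose probability grows between the
# early and final layers ("Olympia") are rewarded, while tokens that mostly
# reflect shallow early-layer knowledge ("Seattle") are penalized.
contrast = {tok: np.log(mature[tok]) - np.log(premature[tok]) for tok in mature}

scores = np.array(list(contrast.values()))
probs = np.exp(scores - scores.max())
probs /= probs.sum()
for tok, p in zip(contrast, probs):
    print(f"{tok:>8}: {p:.2f}")   # "Olympia" now gets the highest probability
```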

VCD: Visual Contrastive Decoding

VCD is a simple, training-free method designed to reduce hallucinations in vision-language models. As shown in the figure below, it works by contrasting the token distributions the model produces for the original visual input versus a distorted version of that input. This contrast helps identify and filter out tokens that are likely to be hallucinations. For more details, refer to the original paper.

Figure: An example of Visual Contrastive Decoding, where the hallucinated object “Surfboards” is highlighted in red and subsequently removed (Source: Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding by Leng et al.)

Video Explanation

  • The video by Andrew Ng provides a detailed explanation of the beam search algorithm.
  • Decoding techniques for language models by Deeplearning AI


