
Introduction
The key difference between cross-attention and self-attention lies in the inputs they operate on and the purpose they serve: self-attention captures relationships within a single input sequence, while cross-attention captures relationships between elements of two different input sequences, which lets the model condition what it generates on a separate source of context and produce coherent, contextually relevant outputs.

[Figure. Source: “Attention Is All You Need” by Vaswani et al., enriched by AIML.com]
Comparison between Self-Attention and Cross-Attention
The following table summarizes the key differences between the two:
| | Self-Attention | Cross-Attention |
|---|---|---|
| Input type | Operates on a single input sequence. Typically used within the encoder layers of a Transformer model, where the input sequence is the source text. | Operates on two different input sequences: a source sequence and a target sequence. Typically used within the decoder layers of a Transformer model, where the source sequence provides the context and the target sequence is the one being generated. |
| Purpose | Captures relationships within the same input sequence. Helps the model learn context and long-range dependencies by weighing the importance of each element in the sequence. | Allows the model to focus on different parts of the source sequence when generating each element of the target sequence. Captures how elements of the source sequence relate to elements of the target sequence, which helps produce contextually relevant outputs. |
| Usage | In the encoder of a Transformer, each word or token attends to all other words in the same sentence, learning contextual information about the entire sentence. | In machine translation, cross-attention in the decoder lets the model look at the source sentence while generating each word of the target sentence, helping ensure the translation is coherent and contextually accurate. |
| Formulation | Computes attention scores from Query (Q), Key (K), and Value (V) vectors that are all derived from the same input sequence. | Also computes attention scores from Q, K, and V vectors, but they are derived from different sequences: Q comes from the target sequence (decoder input), while K and V come from the source sequence (encoder output). See the code sketch below the table. |
Source: AIML.com Research
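
To make the formulation above concrete, here is a minimal sketch using PyTorch's `torch.nn.MultiheadAttention` (the module, tensor names, and dimensions are illustrative assumptions, not part of the original article). The only difference between the two calls is which sequences supply the query versus the key/value inputs.

```python
import torch
import torch.nn as nn

embed_dim, num_heads = 512, 8
attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

# Hypothetical inputs: a source (encoder) sequence of 10 tokens
# and a target (decoder) sequence of 7 tokens.
encoder_out = torch.randn(1, 10, embed_dim)  # source sequence representations
decoder_in = torch.randn(1, 7, embed_dim)    # target sequence representations

# Self-attention: Q, K, and V all come from the same sequence.
self_out, self_weights = attn(encoder_out, encoder_out, encoder_out)

# Cross-attention: Q comes from the target (decoder) sequence,
# K and V come from the source sequence (encoder output).
cross_out, cross_weights = attn(decoder_in, encoder_out, encoder_out)

print(self_weights.shape)   # torch.Size([1, 10, 10]): each source token attends to every source token
print(cross_weights.shape)  # torch.Size([1, 7, 10]): each target token attends to every source token
```

In a real Transformer decoder, the self-attention and cross-attention layers are separate modules with their own learned weights; a single module is reused here only to keep the sketch short.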
Illustrative Example explaining the difference
| | Self-Attention | Cross-Attention |
|---|---|---|
| Example | In machine translation, self-attention in the encoder lets the model understand how each word in the source sentence relates to the other words in the same sentence, which is crucial for accurate translation. | In image captioning, cross-attention enables the model to attend to different regions of an image (the source sequence) while generating each word of the caption (the target sequence), ensuring that the caption describes the image appropriately (see the sketch below). |
Source: AIML.com Research
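
For the image-captioning example, the same mechanism applies with image region features standing in as the source sequence. The sketch below implements plain scaled dot-product cross-attention, softmax(QK^T / sqrt(d)) V as in Vaswani et al.; the number of regions, the number of caption tokens, and the randomly initialized projection matrices are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

d_model = 256
num_regions, num_caption_tokens = 49, 12  # e.g. a 7x7 grid of image regions (illustrative)

image_features = torch.randn(num_regions, d_model)         # source: image region features
caption_states = torch.randn(num_caption_tokens, d_model)  # target: caption tokens generated so far

# Learned projections (randomly initialized here for the sketch) map inputs to Q, K, V.
W_q = torch.randn(d_model, d_model)
W_k = torch.randn(d_model, d_model)
W_v = torch.randn(d_model, d_model)

Q = caption_states @ W_q   # queries from the target (caption) sequence
K = image_features @ W_k   # keys from the source (image) sequence
V = image_features @ W_v   # values from the source (image) sequence

scores = Q @ K.T / (d_model ** 0.5)  # (12, 49): one row of scores per caption token
weights = F.softmax(scores, dim=-1)  # each caption token's attention over image regions
context = weights @ V                # (12, 256): image context used to predict each caption word
```

Each row of `weights` shows which image regions the model focuses on while producing the corresponding caption word, which is exactly the behaviour described in the table above.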