Cross-Attention vs Self-Attention Explained

What is the difference between cross-attention and self-attention?

Introduction

The key difference between cross-attention and self-attention lies in the type of input sequences they operate on and their respective purposes. While self-attention captures relationships within a single input sequence, cross-attention captures relationships between elements of two different input sequences, allowing the model to generate coherent and contextually relevant outputs.

Key components of the Transformer Architecture
Title: The encoder layers on the left use self-attention to encode the input-language text, while the decoder layers on the right use cross-attention to attend to the encoded input text as they generate the target-language text
Source: “Attention is All You Need” by Vaswani et al. Enriched by AIML.com
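
To make this encoder/decoder split concrete, here is a minimal sketch of a decoder layer (an illustration using PyTorch's nn.MultiheadAttention, not the exact layer definition from the paper; residual connections, layer normalization, dropout, and masking are omitted for brevity). The same attention module performs self-attention when query, key, and value all come from the decoder's own sequence, and cross-attention when the keys and values come from the encoder output:

```python
import torch
import torch.nn as nn

class MiniDecoderLayer(nn.Module):
    """Minimal Transformer decoder layer showing where each attention type is used.
    Residual connections, layer norms, dropout, and masking are omitted for clarity."""

    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, tgt, memory):
        # Self-attention: Q, K, and V all come from the target (decoder) sequence.
        x, _ = self.self_attn(query=tgt, key=tgt, value=tgt)
        # Cross-attention: Q comes from the decoder, K and V from the encoder output.
        x, _ = self.cross_attn(query=x, key=memory, value=memory)
        return self.ffn(x)

enc_out = torch.randn(2, 10, 512)   # encoder output: (batch, source length, d_model)
tgt_emb = torch.randn(2, 7, 512)    # decoder input:  (batch, target length, d_model)
out = MiniDecoderLayer()(tgt_emb, enc_out)
print(out.shape)                    # torch.Size([2, 7, 512])
```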

Comparison between Self-Attention and Cross-Attention

The following comparison summarizes the key differences between the two:

Input type
– Self-attention operates on a single input sequence. It is typically used within the encoder layers of a Transformer model, where the input sequence is the source or input text.
– Cross-attention operates on two different input sequences: a source sequence and a target sequence. It is typically used within the decoder layers of a Transformer model, where the source sequence provides the context and the target sequence is the sequence being generated.

Purpose
– Self-attention captures relationships within the same input sequence. It helps the model learn context and long-range dependencies by weighing the importance of each element within the sequence.
– Cross-attention allows the model to focus on different parts of the source sequence when generating each element of the target sequence. It captures how elements in the source sequence relate to elements in the target sequence, which helps in generating contextually relevant outputs.

Usage
– In the encoder of a Transformer, self-attention lets each word or token attend to all other words in the same sentence, learning contextual information about the entire sentence.
– In machine translation, cross-attention in the decoder allows the model to look at the source sentence while generating each word of the target sentence, which helps ensure that the generated translation is coherent and contextually accurate.

Formulation
– Self-attention computes attention scores from Query (Q), Key (K), and Value (V) vectors that are all derived from the same input sequence.
– Cross-attention also computes attention scores from Q, K, and V vectors, but these vectors come from different sequences: Q is derived from the target sequence (the decoder input), while K and V are derived from the source sequence (the encoder output). A code sketch of this difference follows the table below.
Title: Difference between Self-Attention and Cross-Attention
Source: AIML.com Research
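
The formulation difference is easiest to see in a plain scaled dot-product attention function. The sketch below is a minimal illustration: the tensor names (encoder_out, decoder_in) are hypothetical, and the learned linear projections that normally produce Q, K, and V are omitted. The attention computation itself is identical in both cases; only where Q, K, and V come from changes.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5
    return F.softmax(scores, dim=-1) @ V

d_model = 64
encoder_out = torch.randn(10, d_model)  # source sequence representations (illustrative)
decoder_in  = torch.randn(7, d_model)   # target sequence representations (illustrative)

# Self-attention: Q, K, and V are all taken from the SAME sequence.
self_attended = scaled_dot_product_attention(decoder_in, decoder_in, decoder_in)

# Cross-attention: Q comes from the target sequence; K and V come from the source sequence.
cross_attended = scaled_dot_product_attention(decoder_in, encoder_out, encoder_out)

print(self_attended.shape)   # torch.Size([7, 64])
print(cross_attended.shape)  # torch.Size([7, 64]) -- one output vector per target position
```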

Illustrative Examples explaining the difference

– Self-attention example: In machine translation, self-attention in the encoder allows the model to understand how each word in the source sentence relates to the other words in the same sentence, which is crucial for accurate translation.
– Cross-attention example: In image captioning, cross-attention enables the model to attend to different regions of an image (represented as the source sequence) while generating each word of the caption (the target sequence), ensuring that the caption describes the image appropriately. A sketch of this setup follows below.
Title: Example explaining the difference between Self-Attention and Cross-Attention
Source: AIML.com Research
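
As a rough illustration of the image-captioning case (a hypothetical sketch: the region-grid size, feature dimension, and use of PyTorch's nn.MultiheadAttention are assumptions, not a particular captioning model), the caption tokens supply the queries while the image-region features supply the keys and values:

```python
import torch
import torch.nn as nn

d_model, n_heads = 256, 8
cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

image_regions  = torch.randn(1, 49, d_model)  # e.g. a 7x7 grid of image-region features (source)
caption_tokens = torch.randn(1, 12, d_model)  # embeddings of the caption generated so far (target)

# Each caption token queries the image regions; K and V come from the image.
attended, weights = cross_attn(query=caption_tokens, key=image_regions, value=image_regions)

print(attended.shape)  # torch.Size([1, 12, 256]) -- an image-conditioned vector per caption token
print(weights.shape)   # torch.Size([1, 12, 49])  -- how much each token attends to each region
```

The attention weights make the intuition explicit: for every word being generated, the model produces a distribution over image regions, so each word of the caption is grounded in the parts of the image it describes.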


