Multi-Head Attention: Why It Outperforms Single-Head Models

Multi-Head Attention

Multi-head attention extends single-head attention by running several attention heads in parallel on the same input sequence. Each head can learn a different set of relationships and patterns in the data, which substantially increases the model’s expressive power compared to a single attention head.

Related Question: Explain Attention, and Masked Self-Attention as used in Transformers

Figure: Self-Attention vs. Multi-Head Attention. (left) A single attention head (also known as scaled dot-product attention); (right) multi-head attention, consisting of several attention layers running in parallel. Source: Attention Is All You Need (Vaswani et al., 2017).

Implementation of Multi-Head Attention

  1. Instead of a single set of learnable Q, K, and V projection matrices, multiple sets are initialized, one per attention head.
  2. Each attention head independently computes attention scores and produces its own attention-weighted output.
  3. The outputs of all heads are concatenated and passed through a final linear projection to form the multi-head attention output (see the code sketch below).
  4. Because each head can attend to different parts of the input, the heads collectively capture diverse patterns and relationships in the data.
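
Below is a minimal PyTorch sketch of these four steps. The class name, tensor shapes, and the use of one fused Q, K, and V projection per role (split into per-head slices by reshaping) are illustrative assumptions, not a specific library’s implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    """Minimal multi-head self-attention (no masking, no dropout)."""

    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        # One learnable projection per role; reshaped into per-head slices in forward().
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)  # final linear after concatenation

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        batch, seq_len, d_model = x.shape

        # Step 1: project and split into heads -> (batch, num_heads, seq_len, d_head)
        def split(t):
            return t.view(batch, seq_len, self.num_heads, self.d_head).transpose(1, 2)
        q, k, v = split(self.q_proj(x)), split(self.k_proj(x)), split(self.v_proj(x))

        # Step 2: scaled dot-product attention, computed independently per head
        scores = q @ k.transpose(-2, -1) / (self.d_head ** 0.5)
        weights = F.softmax(scores, dim=-1)
        per_head = weights @ v  # (batch, num_heads, seq_len, d_head)

        # Step 3: concatenate heads and apply the final linear projection
        concat = per_head.transpose(1, 2).contiguous().view(batch, seq_len, d_model)
        return self.out_proj(concat)

# Usage: a batch of 2 sequences of 8 tokens, model dimension 64, 4 heads
x = torch.randn(2, 8, 64)
mha = MultiHeadAttention(d_model=64, num_heads=4)
print(mha(x).shape)  # torch.Size([2, 8, 64])
```

Keeping one fused projection per role and splitting it by reshaping is mathematically equivalent to maintaining a separate K, Q, and V matrix per head; it simply organizes the same per-head parameters into larger matrices.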

Benefits and Limitations of Multi-Head Attention over Single-Head Attention

   – Increased Expressiveness: Different heads can capture different dependencies and patterns in the same sequence simultaneously, which is essential for modeling complex relationships in the data.

   – Improved Generalization: By learning multiple sets of attention parameters, the model becomes more robust and adaptable to different tasks and datasets.

   – Increased Computational Complexity: The extra heads improve the model’s capabilities, but they also raise its computational cost, since every head computes its own attention scores. One way to mitigate this at inference time is head pruning, which discards heads that contribute little to the output (a sketch follows below).
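
As a rough illustration of head pruning, the sketch below zeroes out the per-head outputs whose importance score falls outside the top-k. The per-head importance scores and the keep ratio are assumptions for illustration; how the scores are estimated in practice is outside this sketch.

```python
import torch

def prune_heads(per_head_outputs: torch.Tensor,
                head_scores: torch.Tensor,
                keep_ratio: float = 0.5) -> torch.Tensor:
    """Zero out the outputs of low-importance heads at inference time.

    per_head_outputs: (batch, num_heads, seq_len, d_head) per-head attention outputs.
    head_scores: (num_heads,) importance scores, assumed to be given.
    keep_ratio: fraction of heads to keep (hypothetical choice).
    """
    num_heads = head_scores.shape[0]
    k = max(1, int(num_heads * keep_ratio))
    keep = torch.topk(head_scores, k).indices          # indices of the most useful heads
    mask = torch.zeros(num_heads, device=per_head_outputs.device)
    mask[keep] = 1.0
    # Broadcast the head mask over batch, sequence, and feature dimensions.
    return per_head_outputs * mask.view(1, num_heads, 1, 1)

# Example: 8 heads, keep the 4 with the highest (hypothetical) importance scores.
outputs = torch.randn(2, 8, 10, 16)   # (batch, heads, seq_len, d_head)
scores = torch.rand(8)                # stand-in importance scores
pruned = prune_heads(outputs, scores, keep_ratio=0.5)
```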

Related Questions:
– Explain Self-Attention, and Masked Self-Attention as used in Transformers
– What are transformers? Discuss the major breakthroughs in transformer models
– Explain the Transformer Architecture

