
About PEFT
Parameter-Efficient Fine-Tuning (PEFT) is a set of techniques for adapting large pre-trained language models (PLMs) to new tasks by fine-tuning only a small portion of the model’s parameters, which greatly reduces both computational and storage costs. In some cases, only about 0.1% of the total parameters are updated, making PEFT especially useful in environments with limited resources.
Why PEFT
As pre-trained language models (PLMs) like GPT-3, BERT-large, and T5-11B continue to grow in size and complexity, fine-tuning them for specific tasks becomes increasingly resource-intensive. Parameter-Efficient Fine-Tuning (PEFT) has emerged as a powerful solution to address these challenges. PEFT techniques allow users to leverage the vast pre-trained knowledge of these models while making task-specific adaptations with significantly reduced computational and storage overhead. This approach balances efficiency and effectiveness, enabling fine-tuning of large-scale models on resource-constrained hardware without compromising performance.
PEFT Methods
Figure 1 illustrates the developmental trajectory of PEFT techniques over time. Current PEFT methods can be broadly categorized into five primary types: Additive Fine-tuning, Unified Fine-tuning, Reparameterized Fine-tuning, Hybrid Fine-tuning and Selective (partial) Fine-tuning. Below is a brief introduction to the underlying concepts of each type.
Title: Evolutionary Tree Diagram of PEFT Techniques
Source: “Parameter-Efficient Fine-Tuning Methods for Pretrained Language Models: A Critical Review and Assessment” by Xu et al.
Additive Fine-tuning
Additive fine-tuning introduces additional structures and parameters into the model to adapt it for specific tasks. The left side of Figure 3 shows the mechanism of additive fine-tuning. Two representative approaches in this category are:
Adapter-Based Fine-Tuning: lightweight adapter modules are inserted into the pre-trained model’s architecture. Figure 2 (c) and (d) illustrate the structure of adapters, which can be integrated into the model in a parallel or sequential configuration. These modules are trained while the original model’s parameters remain unchanged, significantly reducing computational overhead (a minimal code sketch of a bottleneck adapter follows the figure below).
Prefix Tuning: prefix tuning introduces learnable embeddings (prefixes) that serve as task-specific “instructions” for the model. These prefixes are prepended to the input sequence and guide the model’s behavior during inference without modifying the underlying model weights. Figure 2 (a) shows the model structure using prefix tuning.
Title: More detailed structure of some PEFT techniques, including LoRA, Prefix-tuning and Adapters.
Source: “LLM-Adapters: An Adapter Family for Parameter-Efficient Fine-Tuning of Large Language Models” by Hu et al.
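To make the adapter idea concrete, here is a minimal sketch of a sequential bottleneck adapter in PyTorch. The hidden size and bottleneck dimension are illustrative assumptions rather than values from a specific paper: the adapter down-projects the hidden state, applies a nonlinearity, up-projects back, and adds a residual connection, so only its small weights need to be trained.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Sequential bottleneck adapter: down-project, nonlinearity, up-project, residual."""
    def __init__(self, hidden_size: int = 768, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_size)
        self.act = nn.ReLU()

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # The residual connection keeps the frozen sub-layer's output intact;
        # only the small adapter weights are trained.
        return h + self.up(self.act(self.down(h)))

# Usage: wrap the output of a frozen Transformer sub-layer with the adapter.
adapter = BottleneckAdapter()
h = torch.randn(2, 16, 768)   # (batch, seq_len, hidden)
out = adapter(h)              # adapted representation, same shape as h
```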
Selective Fine-tuning
Selective fine-tuning involves updating only a subset of the model’s parameters while keeping the majority of the model frozen, as shown in the middle of Figure 3. Examples include:
- Bias Terms: Updating only the bias terms in the model layers to adapt them for new tasks; a minimal sketch of this idea follows the list. (Paper: BitFit)
- Weight Masking: This method applies a fixed sparse mask to the model’s parameters, selectively updating only the masked subset during training. One approach to constructing the mask is to select the k parameters with the largest Fisher information, which serves as a simple yet effective approximation of the parameters most important for the task at hand. (Paper: Training Neural Networks with Fixed Sparse Masks)
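As a minimal sketch of the bias-only (BitFit-style) idea, the snippet below freezes every parameter of a PyTorch model and then re-enables gradients only for parameters whose names mark them as biases; the toy model and the naming convention are illustrative assumptions.

```python
import torch.nn as nn

def freeze_all_but_biases(model: nn.Module) -> None:
    """BitFit-style selective fine-tuning: keep gradients only for bias terms."""
    for name, param in model.named_parameters():
        param.requires_grad = name.endswith("bias")

# Usage sketch with a toy model.
model = nn.Sequential(nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, 10))
freeze_all_but_biases(model)
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable}")  # only the bias vectors remain trainable
```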
Reparameterized Fine-tuning
Reparameterized fine-tuning re-expresses existing model weights in a more compact, trainable form, typically a low-rank decomposition, and updates only that compact representation. The most prominent method in this category is LoRA (Low-Rank Adaptation).
LoRA (Low-Rank Adaptation)
LoRA (Low-Rank Adaptation) has emerged as one of the most popular techniques in parameter-efficient fine-tuning (PEFT) for adapting large pre-trained models to downstream tasks. Its efficiency and simplicity make it an ideal choice for fine-tuning resource-intensive models like GPT-3, T5, and CLIP. By significantly reducing the number of trainable parameters, LoRA lowers computational overhead and enables organizations with limited resources to harness the power of large models.
Why LoRA is Efficient and Popular
- Efficiency: LoRA updates only a small fraction of the model’s parameters by injecting low-rank matrices into specific layers, reducing memory, storage, and computational requirements.
- Versatility: LoRA scales well to diverse tasks and domains, supporting large-scale models across NLP, vision-language, and domain-specific applications.
- Accessibility: Its lightweight approach makes state-of-the-art models accessible to a broader range of researchers and industries.
Key Components of LoRA
LoRA modifies the weight matrices of specific layers by adding a low-rank adaptation:
- $W$: Original weight matrix (frozen during fine-tuning).
- $\Delta W$: Rank-$r$ update matrix, representing the adjustment applied to $W$.
- $A$ and $B$: Low-rank matrices (trainable) with dimensions $d \times r$ and $r \times d$, respectively, where $r \ll d$.
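Putting these components together, the adapted weight can be written as
$$\Delta W = A B, \qquad W' = W + \Delta W = W + A B$$
which is the same reparameterized form used in the comparison section below.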
This low-rank decomposition drastically reduces the number of trainable parameters while maintaining the model’s capacity to adapt to new tasks.
In summary, by freezing the original model parameters, LoRA ensures that the pre-trained knowledge, such as general language understanding or vision features, remains intact. This approach prevents catastrophic forgetting, a common issue in standard fine-tuning where the model may lose previously learned capabilities. In addition, LoRA updates only the newly introduced low-rank matrices (A and B), significantly reducing resource requirements. This allows fine-tuning even on hardware with limited memory, such as standard GPUs, democratizing access to large pre-trained models.
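To make these mechanics concrete, below is a minimal sketch (assuming PyTorch; the dimensions, rank, and scaling factor are illustrative rather than canonical values) of a linear layer augmented with a LoRA update, where the base weight $W$ stays frozen and only the low-rank matrices $A$ and $B$ receive gradients.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRALinear(nn.Module):
    """Frozen linear layer with a trainable low-rank update: y = x W^T + scale * x (AB)^T."""
    def __init__(self, d_in: int = 768, d_out: int = 768, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = nn.Linear(d_in, d_out, bias=False)
        self.base.weight.requires_grad = False               # W is frozen
        self.A = nn.Parameter(torch.randn(d_out, r) * 0.01)  # trainable, d_out x r
        self.B = nn.Parameter(torch.zeros(r, d_in))          # trainable, r x d_in (zero init => ΔW = 0 at start)
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        delta_w = self.A @ self.B                            # ΔW = A B, rank at most r
        return self.base(x) + F.linear(x, delta_w) * self.scale

# Usage: behaves like a normal linear layer, but only A and B are trainable.
layer = LoRALinear()
x = torch.randn(4, 768)
y = layer(x)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(y.shape, trainable)  # torch.Size([4, 768]) 12288
```

Because the update is just the product $AB$, it can be merged into $W$ after training, so the adapted layer keeps exactly the original shape at inference time.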
Hybrid Fine-tuning
Hybrid fine-tuning combines multiple fine-tuning techniques to leverage their strengths. Examples include:
- MAM Adapters: A combination of parallel adapters and soft prompts, enabling the model to leverage the benefits of both approaches simultaneously.
- UniPELT: Unifies multiple PEFT techniques, such as LoRA, Prefix-Tuning, and Adapters, into a single framework. By combining these methods, UniPELT maximizes flexibility and performance across diverse tasks.
Unified Fine-tuning
Unified fine-tuning involves constructing a comprehensive framework for PEFT, focusing on integrating techniques within a specific category or approach. Unlike hybrid fine-tuning, which combines various techniques, unified fine-tuning seeks to streamline and optimize a single methodology.
For example:
- AdaMix: Creates a cohesive fine-tuning framework by utilizing a mixture of adaptation modules.
Comparison of Different PEFT Techniques
Title: Comparison of Different PEFT techniques
Source: “Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey” by Han et al.
The additive method strategically introduces a minimal number of parameters into the original model’s architecture while keeping the other parameters frozen. For example, in Transformer models, this involves inserting small adapter layers or soft prompts within the Transformer blocks. Mathematically, for a Transformer layer output $h_{\text{out}}$, with original parameters $\theta$, this can be expressed as:
$$h_{\text{out}} = f(h_{\text{in}}, \theta) + g(h_{\text{in}}, \phi)$$
Here, $f(h_{\text{in}}, \theta)$ represents the original layer computation (with frozen parameters $\theta$), and $g(h_{\text{in}}, \phi)$ corresponds to the new trainable components (e.g., adapters) with parameters $\phi$.

Selective PEFT, on the other hand, focuses on updating only a subset of the model’s original parameters, keeping the number of updated parameters small.

In contrast, reparameterization PEFT modifies the representation of existing parameters without introducing new components. For instance, in LoRA, the weight matrix $W$ is reconfigured as $W' = W + AB$, where $W$ is the original frozen weight matrix, $A$ is a low-rank trainable matrix of size $d \times r$, and $B$ is another low-rank trainable matrix of size $r \times d$, with $r \ll d$. This approach enables efficient fine-tuning by leveraging low-rank approximations while maintaining the original model’s architecture.
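As a quick numerical illustration of this reparameterized form (a sketch with arbitrary example dimensions), the trained low-rank update can be merged back into the frozen weight, yielding a matrix with exactly the original shape:

```python
import torch

d, r = 512, 8
W = torch.randn(d, d)             # frozen pre-trained weight
A = torch.randn(d, r) * 0.01      # trainable, d x r
B = torch.randn(r, d) * 0.01      # trainable, r x d

W_prime = W + A @ B               # W' = W + AB, same d x d shape as W
print(W_prime.shape)              # torch.Size([512, 512])
print(torch.linalg.matrix_rank(A @ B))  # the update itself has rank at most r
```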
Comparing PEFT with Supervised Fine-Tuning (SFT)
PEFT methods differ from SFT in several key aspects:
- Parameter Updates: PEFT updates only a small fraction of the model’s parameters (e.g., ~0.1%), while SFT modifies all parameters and consumes significantly more resources.
- Resource Usage: PEFT uses less memory and storage, enabling efficient fine-tuning of large models. In contrast, SFT consumes more resources and demands greater computational power.
- Performance: PEFT achieves performance comparable to or slightly below SFT, though SFT often excels in complex tasks.
- Task Adaptation: PEFT lets models handle many tasks more easily by reusing a small set of tuned parameters, while SFT needs a different model for each task.
- Scalability: PEFT works well with big models, so it’s a good fit for today’s systems. SFT, on the other hand, is harder to scale because it uses more resources.
Limitations of PEFT
While PEFT offers several advantages, it also comes with limitations:
- Performance Ceiling: PEFT methods may struggle to match the performance of fully fine-tuned models in highly complex or niche tasks.
- Compatibility: Some PEFT techniques, such as Adapters and LoRA, require architectural modifications, which might not be supported by all pre-trained models.
- Task-Specific Components: Requiring task-specific modules can increase model management complexity in large-scale deployments.
- Sensitivity to Hyperparameters: PEFT methods are often sensitive to hyperparameter settings, such as learning rates and rank sizes in LoRA, requiring careful tuning.
Using PEFT in Production
LoRA is a popular PEFT method known for its versatility and efficiency in real-world use. It allows fine-tuning of large models across many domains, including language, vision, and even diffusion-based creative tasks.
For example, Hugging Face offers an easy-to-use library (peft) that makes it straightforward to apply LoRA to many large language models. In training high-performance multimodal models such as LLaVA, LoRA has been used to achieve remarkable results; on many benchmarks, models fine-tuned with LoRA even surpass the performance of fully fine-tuned models, as highlighted in LLaVA’s Model Zoo.
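As an illustration, the following is a minimal sketch of wrapping a causal language model with LoRA using the Hugging Face peft library; the checkpoint name and the target_modules list are assumptions and depend on the architecture being fine-tuned.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model, TaskType

# Load a base model (the checkpoint name here is just an example).
model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")

# Configure LoRA: rank, scaling, dropout, and which modules to adapt.
# target_modules varies by architecture (here, the attention query/value projections).
config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
)

model = get_peft_model(model, config)
model.print_trainable_parameters()  # reports the small fraction of trainable parameters
```

The printed summary typically shows that well under 1% of the parameters are trainable, which is the efficiency gain discussed above; the wrapped model can then be trained with a standard Trainer or training loop.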
Beyond language models, LoRA is also widely adopted in diffusion models to enable rapid fine-tuning with lower GPU requirements and smaller output sizes. This makes it an ideal choice for tasks like training models to generate specific artistic styles, characters, or objects efficiently. LoRA’s versatility extends to various applications, including quality enhancement (e.g., fine detail adjustments), style or aesthetic customization, and domain-specific tuning. Its advantages of faster training, reduced computational overhead, and adaptability make it a popular choice in production environments for optimizing both text and image generation tasks.
Video Explanation
- The video by Sam Witteveen discusses why PEFT is needed and walks through code for applying PEFT techniques during fine-tuning.
- The video by CodeEmporium explains PEFT, detailing what it is and why the technique is necessary. It uses examples such as adapters and includes comparisons of parameter changes to illustrate its concepts effectively.
Related Questions: