What is Supervised Fine-Tuning?

supervised finetuning
Supervised Finetuning
Source: neo4j

Introduction

Supervised Fine-Tuning (SFT) refers to the process of adapting a pre-trained base model to specific tasks using labeled data. By leveraging supervised fine-tuning, the fine-tuned model achieves enhanced performance on the target task, demonstrating a deeper understanding of task-specific nuances and requirements. To understand supervised fine-tuning, it’s helpful to first learn about pre-trained base models and transfer learning. 

Pre-trained Base Models?

Pre-training typically involves training base models on large datasets to help them learn general representations. Researchers train language models like BERT and GPT on massive corpora to capture general language patterns. BERT uses masked language modeling (MLM) and next sentence prediction (NSP) objectives during pre-training, while GPT learns through autoregressive next-token prediction. Through pre-training on large-scale data, these models develop a sophisticated understanding of general language patterns. However, they often struggle with specialized domains where relevant content is scarce in their pre-training data.

Transfer Learning

One of the primary challenges in deploying large models is the scarcity of labeled data for specific tasks. Transfer learning offers a promising solution to this challenge by adapting pre-trained models. Researchers train these models on vast datasets, such as millions of Wikipedia articles and books, and then fine-tune them for specific downstream tasks using relatively small amounts of task-specific data, often just a few thousand examples. Transfer learning encompasses various adaptation approaches, including both supervised and unsupervised methods, making it a broader concept than supervised fine-tuning.

Supervised Fine-tuning

The term “supervised” reflects the use of labeled data, where the dataset creator pairs each input with a corresponding ground truth output or label. During fine-tuning, the model calculates the loss between its predictions and the ground truth. The optimization algorithm then uses this loss to update the model’s parameters through backpropagation, refining its performance on the task.

The following diagram shows the supervised fine-tuning procedure:

How Supervised Fine-Tuning works: a sentiment analysis example.
Title: How Supervised Fine-Tuning works: a sentiment analysis example. A pre-trained base model is fine-tuned using labeled data, updating its parameters to improve its performance in sentiment classification.
Source: AIML.com Research

Examples of the Labeled Dataset

The format of labeled datasets can vary depending on the specific task requirements. Here are some examples of labeled datasets:

  • Text classification: {“text”: “I love this movie!”, “label”: “positive”}
  • Image classification: {“image”: “dog.jpg”, “label”: “dog”}
  • Sequence-to-sequence tasks: {“input”: “Translate this to French”, “output”: “Traduisez cela en français”}

Fine-Tuning Techniques

Be Cautious of Overfitting and Catastrophic Forgetting

Two significant challenges arise during model fine-tuning: overfitting and catastrophic forgetting. Overfitting occurs when the model becomes too specialized to the fine-tuning dataset, learning specific patterns that don’t generalize well to unseen data. The issue of catastrophic forgetting is that the model loses much of its pre-trained general knowledge while adapting to the new task. Forgetting occurs when the model loses much of its pre-trained general knowledge while adapting to the new task. During fine-tuning, the optimization process drastically modifies the model parameters, causing it to “forget” valuable representations it learned during pre-training.

Finetuning Strategies

Several techniques have been used in supervised fine-tuning:

  • Finetuning the added layers: The most conservative approach is to freeze all base model weights and train only the added task-specific layers. This preserves the pre-trained knowledge while allowing adaptation to new tasks through the added layers.
  • Partial Fine-tuning: freezing the initial layers of the base model while fine-tuning the later layers that has more association with specific tasks.
  • Full Fine-tuning: finetuning all layers of the parameters. While this approach offers maximum flexibility, using too many training epochs can lead to catastrophic forgetting. Thus this strategy typically requires careful learning rate selection and early stopping to maintain a balance between adaptation and preservation of general knowledge.

Applications

Supervised Fine-Tuning is commonly used in different forms of tasks. Here are some scenarios of its usage.

Application scenarios of supervised fine-tuning
Title: Application scenarios of supervised fine-tuning
Source: AIML.com Research

How is Supervised Fine-tuning different from Instruction Fine-tuning

The instruction fine-tuning (IFT) also uses labeled data. However, the difference between supervised fine tuning and instruction fine-tuning is that the latter focuses more on a variety of instruction-output pairs (e.g., “Classify the sentiment of ‘Great movie!'” → “The sentiment is positive”), while the former one typically uses simple input-output pairs (e.g., sentiment classification: “Great movie!” → “Positive”). As a result, models after SFT learn to perform specific tasks but may not understand the broader context, while models after IFT learn to interpret instructions and respond appropriately, more like a conversational agent.

Video Explanation

  • In this video, Oren Sultan discusses the concept of fine-tuning and compares it with prompt engineering and the RAG method.
YouTube video
  • The video by SuperDataScienceWithJonKrohn gives a brief introduction to what supervised fine-tuning is:
YouTube video

  • In Stanford’s CS229N course, Building Large Language Models (LLMs), the professor discusses supervised fine-tuning in large language models starting at 1:02:30 in the video. It covers the impact of large-scale data on the performance of language models:https://www.youtube.com/watch?v=9vM4p9NN0Ts

Notebook Tutorial

A notebook tutorial showcasing a simple example of supervised fine-tuning of a large language model for the task of summary generation.

Related Questions:

Author

  • Brown University CS

    Machine Learning Content Writer

Help us improve this post by suggesting in comments below:

– modifications to the text, and infographics
– video resources that offer clear explanations for this question
– code snippets and case studies relevant to this concept
– online blogs, and research publications that are a “must read” on this topic

Leave the first comment

Partner Ad
Find out all the ways that you can
Contribute