
Fine-Tuning and Alignment: Making Models Follow Instructions

Intuition: generalist first, specialist second


Pretraining makes the model a “language generalist,” but it does not automatically answer questions in the way humans expect. Fine-tuning continues training on high-quality instruction-response pairs, teaching the model “conversation format.” Alignment goes further, making outputs align with human values: helpful, honest, and harmless.

An analogy: pretraining is reading all textbooks, fine-tuning is practicing mock interviews, and alignment is learning professional ethics and behavioral norms. All three are necessary.

Engineering view: from SFT to preference optimization


The typical pipeline has multiple stages:

  1. SFT (Supervised Fine-Tuning): Train on human-written or distilled high-quality instruction data so the model learns to follow format and style.
  2. RLHF (Reinforcement Learning from Human Feedback): First train a reward model (RM) to learn human preference rankings, then optimize the policy model with PPO or similar algorithms to maximize reward scores.
  3. DPO (Direct Preference Optimization): An alternative to step 2 rather than a further stage. It skips the explicit reward model and optimizes the policy directly from preference data, simplifying the pipeline while often achieving comparable results.
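The two preference objectives in the list above can be sketched for a single preference pair. This is a minimal illustration, not a training loop: it assumes the reward scores and the summed sequence log-probabilities have already been computed elsewhere, and the function names and the default `beta=0.1` are illustrative choices rather than values from any particular library.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def reward_model_loss(r_chosen: float, r_rejected: float) -> float:
    """Pairwise (Bradley-Terry) loss used to train a reward model.

    The loss shrinks as the chosen response is scored above the rejected one.
    """
    return -log_sigmoid(r_chosen - r_rejected)


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value; controls how far the policy may drift
) -> float:
    """DPO loss for one pair, given summed sequence log-probabilities.

    The log-ratio against a frozen reference model plays the role of an
    implicit reward, so no separate reward model is trained.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    return -log_sigmoid(beta * (chosen_margin - rejected_margin))


# Hypothetical numbers, only to show the call shape.
print(reward_model_loss(1.5, 0.2))
print(dpo_loss(-12.0, -15.0, -13.0, -14.0))
```

Both losses share the same log-sigmoid form. The difference is what goes inside it: a learned scalar reward in the RLHF case, and a scaled log-probability ratio between policy and reference in the DPO case.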

Engineering trade-offs include: SFT data quality matters more than quantity; RLHF is sensitive to hyperparameters and can be unstable; DPO is simpler but may underperform on long responses or complex distributions. Constitutional AI and RLAIF attempt to generate preferences with AI rather than humans, reducing cost and improving scalability.

In addition, parameter-efficient fine-tuning methods such as LoRA and QLoRA let small teams fine-tune large models on consumer GPUs, greatly lowering the barrier to application.
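To make the savings concrete, here is a rough sketch of the low-rank update behind LoRA, written in plain Python so it runs without dependencies. The helper names and the example dimensions are assumptions for illustration. Real implementations operate on tensors, and QLoRA additionally stores the frozen base weights in quantized form, which this sketch does not model.

```python
def lora_param_counts(d_out: int, d_in: int, rank: int) -> tuple[int, int]:
    """Return (full fine-tuning params, LoRA trainable params) for one weight matrix."""
    full = d_out * d_in
    lora = rank * (d_out + d_in)  # B is d_out x rank, A is rank x d_in
    return full, lora


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, b, a, alpha: float, rank: int):
    """Compute W + (alpha / rank) * B @ A; W stays frozen, only A and B are trained."""
    scale = alpha / rank
    delta = matmul(b, a)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


# Hypothetical layer size: one 4096 x 4096 projection adapted with rank 8.
full, lora = lora_param_counts(4096, 4096, 8)
print(f"full: {full}, lora: {lora}, ratio: {lora / full:.4%}")
```

Because B is typically initialized to zero, the adapted layer starts out identical to the pretrained one and only drifts as A and B are trained.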

Research view: the nature and limits of alignment

Section titled “Research view: the nature and limits of alignment”

A fundamental open question is whether “alignment” truly changes the model’s internal objectives, or merely suppresses surface behavior. Evidence suggests that adversarial prompts (“jailbreaks”) can sometimes bypass safety training, indicating that alignment may not be deeply internalized.

Key research directions include: detecting and defending against reward hacking; addressing artifacts in preference modeling such as length and position bias; maintaining context consistency in multi-turn conversations; and achieving more robust alignment with less human annotation.
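As one small illustration of the length-bias concern, a preference dataset can be checked for how often the chosen response is simply the longer one. The sketch below assumes pairs of `(chosen, rejected)` strings and uses whitespace-separated words as a crude stand-in for tokens. The function name is hypothetical, and a high rate is only a hint that length may be acting as a shortcut, not proof of it.

```python
def longer_chosen_rate(pairs):
    """Fraction of preference pairs in which the chosen response is longer.

    Pairs with equal length are ignored. Returns 0.5 (no signal) when no
    pair differs in length.
    """
    decided = [
        (chosen, rejected)
        for chosen, rejected in pairs
        if len(chosen.split()) != len(rejected.split())
    ]
    if not decided:
        return 0.5
    longer = sum(
        1 for chosen, rejected in decided
        if len(chosen.split()) > len(rejected.split())
    )
    return longer / len(decided)


# Hypothetical toy data: (chosen, rejected) response texts.
toy_pairs = [
    ("a detailed and careful answer", "short answer"),
    ("ok", "a long but unhelpful reply"),
    ("step by step explanation", "no"),
]
print(longer_chosen_rate(toy_pairs))
```

A rate well above 0.5 suggests that a reward model trained on the data could learn "longer is better" instead of the intended quality signal, which is one reason length-controlled evaluation is often used alongside raw preference accuracy.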

References

  • Training language models to follow instructions with human feedback — Long Ouyang et al. (2022)

    InstructGPT introduces the three-stage RLHF pipeline (SFT → reward model → PPO) that transforms language models from "predict next token" to "follow human intent." This is the direct blueprint for ChatGPT and established the dominant approach to LLM alignment.

  • Deep Reinforcement Learning from Human Preferences — Paul Christiano et al. (2017)

    The foundational RLHF paper. The authors show that training a reward model from human pairwise preferences, then using it to guide reinforcement learning, enables agents to learn complex behaviors that are difficult to specify with explicit reward functions. This framework was directly adopted by InstructGPT/ChatGPT.

  • Direct Preference Optimization: Your Language Model is Secretly a Reward Model — Rafael Rafailov et al. (2023)

    DPO (Direct Preference Optimization) shows that the two-step reward model + RL procedure in RLHF can be collapsed into a single supervised learning step: the language model is optimized directly on preference data with a classification-style loss whose optimum coincides with the optimal policy of the KL-constrained RLHF objective. DPO has become the dominant alternative to PPO-based RLHF in alignment research and the open-source community.

  • Constitutional AI: Harmlessness from AI Feedback — Yuntao Bai et al. (2022)

    Anthropic's Constitutional AI (CAI) uses a set of explicit "constitution" principles to let the model critique and revise its own outputs (SL-CAI phase), then replaces human feedback with AI feedback in the reinforcement learning stage (RL-CAI phase, a form of RLAIF). This reduces reliance on human annotation and is the core alignment technique behind the Claude model family.

  • RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback — Harrison Lee et al. (2023)

    Researchers at Google report that RLAIF can match RLHF on tasks such as summarization and dialogue generation, providing empirical evidence that AI feedback can stand in for human feedback as a scalable alignment approach.