Fine-Tuning and Alignment: Making Models Follow Instructions

Intuition: generalist first, specialist second

Pretraining makes the model a “language generalist,” but a pretrained model does not automatically answer questions the way humans expect. Fine-tuning continues training on high-quality instruction-response pairs, teaching the model the conversation format. Alignment goes further, steering outputs toward human values: helpful, honest, and harmless.

An analogy: pretraining is reading all textbooks, fine-tuning is practicing mock interviews, and alignment is learning professional ethics and behavioral norms. All three are necessary.

Engineering view: from SFT to preference optimization

The typical pipeline has multiple stages:

SFT (Supervised Fine-Tuning): Train on human-written or distilled high-quality instruction data so the model learns to follow format and style.
RLHF (Reinforcement Learning from Human Feedback): First train a reward model (RM) to learn human preference rankings, then optimize the policy model with PPO or similar algorithms to maximize reward scores.
DPO (Direct Preference Optimization): Skip the explicit reward model and optimize the policy directly from preference data, simplifying the pipeline while often achieving comparable results.
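
The two preference objectives named above can be sketched in a few lines. This is a minimal illustration, not a training implementation: it assumes sequence-level log-probabilities and scalar rewards have already been computed, and the function names and the default beta = 0.1 are illustrative choices rather than values from this article.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def reward_model_pairwise_loss(r_chosen, r_rejected):
    """Pairwise ranking loss for a reward model: -log sigmoid(r_chosen - r_rejected)."""
    return -log_sigmoid(r_chosen - r_rejected)

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair.

    Each argument is the summed log-probability of a full response under
    the trainable policy (pi_*) or the frozen reference model (ref_*).
    """
    chosen_margin = pi_chosen - ref_chosen        # how much the policy upweights the chosen response
    rejected_margin = pi_rejected - ref_rejected  # how much it upweights the rejected response
    return -log_sigmoid(beta * (chosen_margin - rejected_margin))

# When the policy equals the reference, the loss is log(2): no preference learned yet.
start = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Raising the chosen response and lowering the rejected one reduces the loss.
better = dpo_loss(-9.0, -13.0, -10.0, -12.0)
```

The DPO function never calls a reward model: the log-probability margins against the reference model play that role, which is what makes the pipeline shorter than RLHF.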

Engineering trade-offs include: SFT data quality matters more than quantity; RLHF is sensitive to hyperparameters and can be unstable; DPO is simpler but may underperform on long responses or complex distributions. Constitutional AI and RLAIF attempt to generate preferences with AI rather than humans, reducing cost and improving scalability.

In addition, parameter-efficient fine-tuning methods such as LoRA and QLoRA let small teams fine-tune large models on consumer GPUs, greatly lowering the barrier to application.
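
To make the LoRA idea concrete, here is a small pure-Python sketch of the low-rank update y = W x + (alpha / r) · B (A x), where W stays frozen and only A and B are trained. The matrix shapes, the 4096 × 4096 layer size, and the rank 8 are assumptions chosen for illustration; QLoRA's quantization of W is not shown.

```python
def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

def lora_forward(W, A, B, x, alpha, r):
    """Frozen base projection plus a scaled low-rank correction.

    W: d_out x d_in (frozen), A: r x d_in, B: d_out x r (both trainable).
    """
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

def lora_param_counts(d_out, d_in, r):
    """Return (full fine-tuning params, LoRA trainable params) for one weight matrix."""
    return d_out * d_in, r * (d_in + d_out)

W = [[1.0, 2.0], [3.0, 4.0]]
A = [[0.5, -0.5]]          # r = 1
B_zero = [[0.0], [0.0]]    # B starts at zero, so training begins from the base model
B_trained = [[1.0], [2.0]]
x = [1.0, 0.0]

y_init = lora_forward(W, A, B_zero, x, alpha=2.0, r=1)
y_tuned = lora_forward(W, A, B_trained, x, alpha=2.0, r=1)
full, lora = lora_param_counts(4096, 4096, 8)
```

For the assumed 4096 × 4096 matrix at rank 8, the trainable parameter count drops from 16,777,216 to 65,536, a 256x reduction for that one matrix.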

Research view: the nature and limits of alignment

A fundamental open question is whether “alignment” truly changes the model’s internal objectives, or merely suppresses surface behavior. Evidence suggests that “jailbreak” prompts can sometimes get around safety training, indicating that alignment may not be deeply internalized.

Key research directions include: detecting and defending against reward hacking; addressing artifacts in preference modeling such as length and position bias; maintaining context consistency in multi-turn conversations; and achieving more robust alignment with less human annotation.

🔬 Open Research Questions

Key questions and research directions in this area:

How can reward hacking in RLHF be fundamentally addressed? Does DPO completely circumvent this issue?
How can the "capability degeneration" phenomenon after alignment be quantified and mitigated? How can the optimal balance between alignment and capability be found?
Is there a unified alignment framework that can simultaneously handle helpfulness, harmlessness, and honesty?

Training Data Flow: From Rollout to Gradient Backprop (Engineering View)

The core loop of RL alignment training can be summarized in one sentence: sample a batch of responses, recalculate per-token probabilities for both old and new policies, clip the ratio to control update magnitude, then backpropagate loss through all parameters. The seven figures below walk through every step of this data flow.

Fig 1 · response_mask: which tokens contribute to loss

RL training computes loss only on the model-generated response tokens. Prompt token losses are masked to zero; otherwise the reward signal would also push gradients onto the input side, which the model did not generate.

Prompt tokens (mask = 0, excluded from loss): <bos> User : Summarize this article . <sep> Assistant :

Response tokens (mask = 1, included in loss): The article discusses key topics . <eos>

loss = Σ_t mask[t] · CE(logit[t], label[t]) / Σ mask[t]
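
The masked average above can be sketched directly. This is a minimal illustration that assumes per-token losses are already available as a list; the helper names are hypothetical, and the token counts (10 prompt tokens, 7 response tokens) follow the example in Fig 1.

```python
def build_response_mask(prompt_len, total_len):
    """0 for prompt positions, 1 for response positions."""
    return [0] * prompt_len + [1] * (total_len - prompt_len)

def masked_mean_loss(token_losses, mask):
    """Sum of mask[t] * loss[t] divided by the number of unmasked tokens."""
    total = sum(loss * m for loss, m in zip(token_losses, mask))
    count = sum(mask)
    return total / count if count else 0.0

# 10 prompt tokens followed by 7 response tokens, as in Fig 1.
mask = build_response_mask(prompt_len=10, total_len=17)

# Large prompt losses are ignored; only the response losses are averaged.
token_losses = [5.0] * 10 + [1.0] * 7
loss = masked_mean_loss(token_losses, mask)
```

Dividing by the sum of the mask rather than the sequence length keeps the loss scale independent of how long the prompt is.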

Fig 2 · rollout: sampling responses (vLLM)

At the start of each training step, the current policy samples multiple response completions via vLLM (rollout). These responses are scored by the reward model and used for subsequent log_prob calculation.

Prompt (fixed): <bos> User: Explain RLHF. <sep> Assistant:

vLLM rollout (temperature sampling), sampled responses (n = 4):

RLHF uses human feedback ... <eos>
Reinforcement Learning from Human ... <eos>
It trains a reward model <eos>
RLHF aligns LLMs with ... <eos>

Each sampled response feeds into log_prob recalculation and reward scoring.
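
Temperature sampling, the mechanism behind the rollout, can be illustrated at the level of a single decoding step. The sketch below is a toy in plain Python and does not use or imitate vLLM's API; the function names, the fixed seed, and the three-entry logit vector are assumptions for illustration.

```python
import math
import random

def temperature_softmax(logits, temperature):
    """Softmax over logits / temperature; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def sample_tokens(logits, temperature=1.0, n=4, seed=0):
    """Draw n independent next-token samples from the same distribution."""
    rng = random.Random(seed)
    probs = temperature_softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=n)

logits = [0.0, 10.0, 0.0]
near_greedy = sample_tokens(logits, temperature=0.01, n=4)   # almost always the argmax
diverse = sample_tokens([0.0, 0.1, 0.0], temperature=1.0, n=4)  # close to uniform
```

A real rollout repeats this step token by token until <eos>, once per sampled response, which is why the n responses for one prompt can differ in both content and length.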

Fig 3 · teacher-forcing forward: recalculating log_prob (Megatron)

With the sampled responses in hand, the full “prompt + response” sequence is fed back into the policy model using teacher-forcing, computing log P for every response token in a single forward pass. This gives log_prob under the current policy π_θ.

Input sequence (teacher-forcing), fed to the Megatron policy model π_θ (235B), with the output log P(x_{t+1} | x_{≤t}) kept only for response positions:

t    token      output
0    User       (input only)
1    :          prompt, not scored
2    Explain    prompt, not scored
3    RLHF       prompt, not scored
4    .          prompt, not scored
5    <sep>      prompt, not scored
6    RLHF       log p
7    uses       log p
8    human      log p
9    feedback   log p
10   <eos>      log p

Teacher-forcing feeds the ground-truth token at each step (rather than the previous step's prediction), enabling efficient parallel log_prob recalculation over the whole sequence.
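
The gather step in this forward pass can be sketched as follows. This is a toy under stated assumptions: `logits[t]` stands for the model output after reading tokens 0..t (so it scores token t+1), the inputs are plain Python lists, and `response_log_probs` is a hypothetical helper name, not a Megatron function.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized vector."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_z for x in logits]

def response_log_probs(logits, token_ids, prompt_len):
    """Log-probability of each response token under teacher forcing.

    The token at position t is scored by logits[t - 1], so prompt_len must be >= 1.
    Only positions t >= prompt_len (the response) are returned.
    """
    out = []
    for t in range(prompt_len, len(token_ids)):
        step = log_softmax(logits[t - 1])
        out.append(step[token_ids[t]])
    return out

# Toy setup: vocabulary of 4, sequence of 5 tokens, first 2 are prompt.
vocab = 4
token_ids = [0, 1, 2, 3, 0]
uniform_logits = [[0.0] * vocab for _ in token_ids]
uniform_lp = response_log_probs(uniform_logits, token_ids, prompt_len=2)

# Make the model very confident that token 2 follows position 1.
peaked_logits = [row[:] for row in uniform_logits]
peaked_logits[1] = [0.0, 0.0, 100.0, 0.0]
peaked_lp = response_log_probs(peaked_logits, token_ids, prompt_len=2)
```

Because every position is scored from the known token sequence, all response log-probs come out of one forward pass instead of one pass per generated token.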

Fig 4 · log_prob vs old_log_prob: comparing new and old policy

The log π_θ from the previous step is compared token-by-token against the log π_old recorded during sampling. The difference determines how much the policy has drifted.

log π_θ(a_t|s_t): new policy

log π_old(a_t|s_t): old policy (the one that generated the rollout)

The difference Δ = log π_θ − log π_old determines the importance ratio r_t = exp(Δ), which PPO/GRPO clips to prevent policy drift.

Fig 5 · ratio clipping: PPO / GRPO

The importance ratio r_t = exp(log π_θ − log π_old) measures the per-token policy update magnitude. PPO clips r_t to [1−ε, 1+ε] to prevent excessively large updates that could destabilize training.


r_t = π_θ(a_t|s_t) / π_old(a_t|s_t) = exp(log π_θ − log π_old)


L^CLIP = E_t[ min(r_t·Â_t, clip(r_t, 1−ε, 1+ε)·Â_t) ]
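
The clipped objective can be checked on a few hand-picked tokens. The sketch below is a minimal per-token version that assumes advantages are already computed and returns the negated surrogate as a loss to minimize; the function name and ε = 0.2 are illustrative choices.

```python
import math

def ppo_clip_losses(logp_new, logp_old, advantages, eps=0.2):
    """Per-token PPO loss: -min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    losses = []
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)
        clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
        surrogate = min(ratio * adv, clipped * adv)
        losses.append(-surrogate)  # the objective is maximized, so the loss is its negative
    return losses

# Three tokens: unchanged policy, ratio above 1 + eps with A > 0, same ratio with A < 0.
logp_old = [-1.0, -1.0, -1.0]
logp_new = [-1.0, -0.5, -0.5]
advantages = [2.0, 1.0, -1.0]
losses = ppo_clip_losses(logp_new, logp_old, advantages)
```

With a positive advantage the large ratio is capped at 1 + ε, so the token stops contributing gradient once the policy has moved far enough. With a negative advantage the min keeps the unclipped term, so a move in the wrong direction is still penalized in full.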

Fig 6 · seq-mean-token-mean: loss aggregation

Loss is first averaged over response tokens within each sequence, so that long responses do not dominate the batch, then averaged across the mini-batch to produce a scalar loss.

Step 1 · token-level loss per sequence

seq 1: RLHF 0.32, uses 0.18, human 0.55, feedback 0.41, <eos> 0.08 → seq mean 0.308
seq 2: Reinforcement 0.27, Learning 0.44, from 0.19, Human 0.38, Feedback 0.52, (unlabeled token) 0.07, <eos> 0.06 → seq mean 0.276
seq 3: It 0.14, trains 0.36, a 0.09, reward 0.48, model 0.31, <eos> 0.05 → seq mean 0.238

Step 2 · average seq-means across batch

batch loss = 0.2740 = (0.308 + 0.276 + 0.238) / 3

Seq-mean-token-mean: normalize by token count within each sequence first, then average across sequences, so every sequence carries equal weight regardless of length.
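
The aggregation can be reproduced with the per-token losses from this figure. The sketch also computes a flat token-level mean for contrast; that second function is added here only as a comparison and is not part of the article's pipeline.

```python
def seq_mean_token_mean(batch_token_losses):
    """Average within each sequence first, then average the sequence means."""
    seq_means = [sum(seq) / len(seq) for seq in batch_token_losses]
    return sum(seq_means) / len(seq_means)

def token_mean(batch_token_losses):
    """Flat average over every token in the batch (longer sequences weigh more)."""
    flat = [loss for seq in batch_token_losses for loss in seq]
    return sum(flat) / len(flat)

# Per-token losses for the three sequences shown in Fig 6.
batch = [
    [0.32, 0.18, 0.55, 0.41, 0.08],
    [0.27, 0.44, 0.19, 0.38, 0.52, 0.07, 0.06],
    [0.14, 0.36, 0.09, 0.48, 0.31, 0.05],
]

per_sequence = seq_mean_token_mean(batch)  # approximately 0.274
per_token = token_mean(batch)              # approximately 0.272
```

The two numbers differ because the 7-token sequence contributes 7 of the 18 terms in the flat mean but exactly one third of the sequence-level mean.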

Fig 7 · softmax Jacobian → gradient backprop → Megatron pipeline

Starting from a scalar loss, gradients flow back through the softmax Jacobian to the logits, then back through 96 Transformer layers to all 235B parameters. Megatron-LM’s 3D parallelism (PP/TP/DP) distributes the gradient updates across thousands of GPUs.

Backward data flow, top to bottom:

scalar loss ∈ ℝ
▼
∂L/∂logit ∈ ℝ^V, through the softmax Jacobian J = diag(p) − p·p^⊤ ∈ ℝ^(V×V). With V ≈ 128k vocab the full J is too large to materialize, so implementations use a vector product instead.
▼
∂L/∂h_last ∈ ℝ^d (d = 8192)
▼
Transformer backprop, Layer 96 → … → Layer 1: 96 layers × (Attn + FFN) = 235B params total. Each layer accumulates ∂L/∂W via the chain rule.
▼
Megatron-LM 3D parallel partitioning, with gradient tensors distributed across devices:

PP stage 1: layers 1–24
PP stage 2: layers 25–48
PP stage 3: layers 49–72
PP stage 4: layers 73–96

Tensor Parallel (TP) and Data Parallel (DP) form the other two axes. Gradient sync uses AllReduce (DP) and send/recv (PP).
▼
Adam optimizer step: θ ← θ − α · m̂ / (√v̂ + ε), 235B params updated

The softmax Jacobian dimension explosion (V×V ≈ 128k²) is avoided in practice by using the simplified gradient (p − y), an O(V) operation. Megatron's 3D parallelism (PP/TP/DP) distributes the 235B parameter update across thousands of GPUs.
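
The shortcut can be verified numerically on a tiny vocabulary. The sketch below assumes a cross-entropy loss L = −log p[label] on a single position, builds the full Jacobian J = diag(p) − p·p^⊤ explicitly, and compares the result with p − y; the function names and the 4-entry logit vector are illustrative.

```python
import math

def softmax(logits):
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def grad_via_full_jacobian(logits, label):
    """O(V^2) route: materialize J, then apply it to dL/dp for L = -log p[label]."""
    p = softmax(logits)
    size = len(p)
    jac = [[(p[i] if i == j else 0.0) - p[i] * p[j] for j in range(size)]
           for i in range(size)]
    dl_dp = [(-1.0 / p[i]) if i == label else 0.0 for i in range(size)]
    # J is symmetric, so multiplying by J equals multiplying by its transpose.
    return [sum(jac[i][j] * dl_dp[j] for j in range(size)) for i in range(size)]

def grad_shortcut(logits, label):
    """O(V) route: dL/dlogit = p - y, with y the one-hot label."""
    p = softmax(logits)
    return [p_i - (1.0 if i == label else 0.0) for i, p_i in enumerate(p)]

logits = [2.0, -1.0, 0.5, 0.0]
full = grad_via_full_jacobian(logits, label=2)
fast = grad_shortcut(logits, label=2)
max_diff = max(abs(a - b) for a, b in zip(full, fast))
```

Both routes give the same gradient, but the first stores V² numbers and the second only V, which is the difference that matters at V ≈ 128k.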

References

Training language models to follow instructions with human feedback — Long Ouyang et al. (2022)
InstructGPT introduces the three-stage RLHF pipeline (SFT → reward model → PPO) that transforms language models from "predict next token" to "follow human intent." This is the direct blueprint for ChatGPT and established the dominant approach to LLM alignment.
Deep Reinforcement Learning from Human Preferences — Paul Christiano et al. (2017)
The foundational RLHF paper. The authors show that training a reward model from human pairwise preferences, then using it to guide reinforcement learning, enables agents to learn complex behaviors that are difficult to specify with explicit reward functions. This framework was directly adopted by InstructGPT/ChatGPT.
Direct Preference Optimization: Your Language Model is Secretly a Reward Model — Rafael Rafailov et al. (2023)
DPO (Direct Preference Optimization) shows that the reward model + RL two-step in RLHF can be collapsed into a single supervised learning step: directly optimizing language model parameters on preference data, with an objective whose optimum matches the optimal RLHF policy. DPO has become the dominant RLHF alternative in alignment research and the open-source community.
Constitutional AI: Harmlessness from AI Feedback — Yuntao Bai et al. (2022)
Anthropic's Constitutional AI (CAI): use a set of explicit "constitution" principles to let the model self-critique and revise (SL-CAI phase), then use AI feedback instead of human feedback for RLHF (RLAIF phase). This reduces reliance on human annotation and is the core alignment technique behind the Claude model family.
RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback — Harrison Lee et al. (2023)
Google researchers show empirically that RLAIF can match RLHF on several tasks, providing engineering evidence that AI feedback can replace human feedback as a scalable alignment solution.
LoRA: Low-Rank Adaptation of Large Language Models — Edward J. Hu et al. (2021)
LoRA freezes pretrained weights and only trains the product of two low-rank matrices (rank r much smaller than original dimensions), reducing trainable parameters by up to 10,000x. This makes fine-tuning large models on consumer GPUs feasible and has become the dominant parameter-efficient fine-tuning (PEFT) method.

