RLHF: Reward Model + PPO Training

Supervised fine-tuning (SFT) teaches a model to follow instructions. It does not teach the model which response is better. Two answers can both be grammatical and factually correct yet differ widely in helpfulness. Reinforcement Learning from Human Feedback (RLHF) is the method that imprints human judgment onto a model's behavior, and it is the core procedure that makes Claude helpful and GPT polite.

Type: Build | Language: Python (with numpy) | Prerequisites: Phase 10, Lesson 06 (Instruction Tuning / SFT) | Estimated time: about 90 minutes

Learning Objectives

Build a reward model that scores response quality from human preference pairs (chosen vs rejected).
Implement a Proximal Policy Optimization (PPO) training loop with a KL penalty that optimizes a language model policy against the reward model.
Explain why RLHF needs three models (SFT, reward, policy) and how the KL constraint prevents reward hacking.
Evaluate the effect of RLHF by comparing response quality before and after preference optimization.

The Problem

Ask a model to "Explain quantum computing" and you might get responses like these.

Response A: "Quantum computing uses qubits that can exist in superposition, meaning they can be 0, 1, or both simultaneously. This allows quantum computers to process certain calculations exponentially faster than classical computers. Key algorithms include Shor's algorithm for factoring large numbers and Grover's algorithm for searching unsorted databases."

Response B: "Quantum computing is a type of computing that uses quantum mechanical phenomena. It was first proposed in the 1980s. Richard Feynman suggested that quantum systems could be simulated by quantum computers. The field has grown significantly since then. Many companies are now working on quantum computers. IBM, Google, and others have made progress. Quantum supremacy was claimed by Google in 2019."

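Both responses are factually correct. Both are grammatical, and both follow the instruction. Yet Response A is clearly better: it is more concise, more informative, and better structured. A human would pick A nearly every time.

SFT cannot capture this difference. It trains the model on "correct" responses, but it has no mechanism for saying "this response is better than that one." Every training example is treated as equally good, so if A and B both appear in the SFT dataset, the model learns from them in exactly the same way.

RLHF solves exactly this problem. It trains a reward model to predict which response a human would prefer, then uses the resulting reward signal to push the language model toward higher-quality outputs. InstructGPT, the precursor to ChatGPT, used RLHF to substantially improve on GPT-3 in helpfulness, truthfulness, and harmlessness. In OpenAI's human evaluations, labelers preferred outputs from the 1.3B-parameter InstructGPT over those from the 175B-parameter GPT-3, a model roughly 135 times larger, and at equal size (175B) they preferred InstructGPT outputs about 85% of the time.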

Pre-test

2 questions · Check how much you already know before starting this lesson

1. What does the RLHF reward model learn from?

2. Why does PPO training in RLHF use a KL divergence penalty?

Concepts

The Three Stages

RLHF is not a single training run. It is a sequential three-stage pipeline in which each stage builds on the one before it.

Stage 1: SFT. Train the base model on instruction-response pairs (Lesson 06). The result is a model that can follow instructions but does not know which response is better.

Stage 2: Reward Model. Collect human preference data: show annotators two responses to the same prompt and ask "which one is better?" Then train a model to predict those preferences. The reward model takes a (prompt, response) pair as input and outputs a scalar score.

Stage 3: PPO. Use the reward model to produce a training signal for the language model. The language model generates a response, the reward model scores it, and PPO updates the language model so that it produces higher-scoring responses. A KL divergence penalty keeps the language model from drifting too far from the SFT checkpoint.

graph TD
    subgraph Stage1["Stage 1: SFT"]
        B["Base Model"] --> S["SFT Model"]
        D["Instruction Data\n(27K examples)"] --> S
    end

    subgraph Stage2["Stage 2: Reward Model"]
        S --> |"Generate responses"| P["Preference Pairs\n(prompt, winner, loser)"]
        H["Human Annotators"] --> P
        P --> R["Reward Model\nR(prompt, response) → score"]
    end

    subgraph Stage3["Stage 3: PPO"]
        S --> |"Initialize policy"| PI["Policy Model\n(being optimized)"]
        S --> |"Freeze as reference"| REF["Reference Model\n(frozen SFT)"]
        PI --> |"Generate"| RESP["Response"]
        RESP --> R
        R --> |"Reward signal"| PPO["PPO Update"]
        REF --> |"KL penalty"| PPO
        PPO --> |"Update"| PI
    end

    style S fill:#1a1a2e,stroke:#51cf66,color:#fff
    style R fill:#1a1a2e,stroke:#e94560,color:#fff
    style PI fill:#1a1a2e,stroke:#0f3460,color:#fff
    style REF fill:#1a1a2e,stroke:#0f3460,color:#fff
    style PPO fill:#1a1a2e,stroke:#e94560,color:#fff

The Reward Model

A reward model is essentially a language model repurposed as a scorer. Take the SFT model and replace the language modeling head, which outputs a distribution over the vocabulary, with a scalar head that outputs a single number. Everything before the final layer keeps the same architecture.

The input is the prompt concatenated with the response, and the output is a single scalar reward score.

The training data consists of human preference pairs. For each prompt, an annotator looks at two responses and picks the better one, which yields a training triple of the form (prompt, preferred_response, rejected_response).

The loss function follows the Bradley-Terry model for pairwise preferences.

loss = -log(sigmoid(reward(preferred) - reward(rejected)))

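This is the key formula. sigmoid(reward(A) - reward(B)) is the probability that response A is preferred over response B. Minimizing the loss pushes the reward model to assign a higher score to the preferred response.

Why pairwise comparisons instead of absolute scores? People are bad at assigning absolute quality scores ("is this response a 7.3 or a 7.5 out of 10?"). They are good at relative comparisons ("is A better than B?"). The Bradley-Terry model converts those relative comparisons into a consistent scoring scale.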
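To make the formula concrete, here is a minimal numeric sketch of the Bradley-Terry probability and loss. The function names and the reward values are hypothetical and chosen only for illustration; they are not taken from the lesson's code.

```python
import numpy as np


def preference_probability(reward_a, reward_b):
    """P(A is preferred over B) under the Bradley-Terry model."""
    return 1.0 / (1.0 + np.exp(-(reward_a - reward_b)))


def pairwise_loss(reward_preferred, reward_rejected):
    """Negative log-likelihood of the annotator's choice."""
    return -np.log(preference_probability(reward_preferred, reward_rejected))


# Hypothetical reward scores: a correct ranking, a tie, and a wrong ranking.
for r_pref, r_rej in [(2.0, 0.0), (0.5, 0.5), (0.0, 2.0)]:
    p = preference_probability(r_pref, r_rej)
    loss = pairwise_loss(r_pref, r_rej)
    print(f"r_pref={r_pref:+.1f}  r_rej={r_rej:+.1f}  P(pref wins)={p:.3f}  loss={loss:.3f}")
```

Only the difference between the two scores matters, so adding the same constant to both leaves the probability unchanged. A tie gives a probability of 0.5 and a loss of log 2, and the loss grows as the model ranks the pair more confidently in the wrong order.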

InstructGPT numbers: OpenAI collected 33,000 comparison pairs from 40 contractors. At roughly 5 minutes per comparison, that comes to about 2,750 hours of human labor (33,000 × 5 / 60) for the reward model training data.

Proximal Policy Optimization (PPO)

PPO is a reinforcement learning (RL) algorithm. In the RLHF setting, the "environment" is the reward model, the "agent" is the language model, and an "action" is the generation of a single token.

The objective is as follows.

maximize: E[R(prompt, response)] - beta * KL(policy || reference)

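The first term pushes the model to generate responses that earn a high reward. The second term, the KL divergence penalty, keeps the model from moving too far from the SFT checkpoint.

Why is the KL penalty needed? Without it, the model quickly reaches degenerate solutions. The reward model was trained on a finite amount of human preference data, so it inevitably has blind spots. The language model exploits them and finds outputs that score highly with the reward model but are meaningless in practice. Typical cases include:

Repeating phrases like "I'm so helpful and harmless!" to score highly on a helpfulness/harmlessness reward model
Producing long, formal-sounding responses with no substance that mimic the surface pattern of "high quality"
Exploiting specific phrases that happened to correlate with high reward in the training data

In effect, the KL penalty says: "You may improve, but you may not become a completely different model. Stay near the SFT version, which is already reasonable. Drift too far and the KL cost will outweigh the reward."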
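The trade-off can be sketched on a toy next-token distribution over four tokens. Everything below is an assumption made for illustration: the logits, the rewards, and in particular beta = 0.5, which is deliberately exaggerated so the effect is visible with a single token. In a real run the KL term accumulates over many tokens, which is why much smaller coefficients can still bind.

```python
import numpy as np


def softmax(logits):
    z = np.exp(logits - np.max(logits))
    return z / z.sum()


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with full support."""
    return float(np.sum(p * np.log(p / q)))


def penalized_objective(reward, policy_probs, reference_probs, beta):
    """reward - beta * KL(policy || reference), as in the objective above."""
    return reward - beta * kl_divergence(policy_probs, reference_probs)


# Frozen reference (SFT) distribution over a toy 4-token vocabulary.
reference = softmax(np.array([1.0, 0.5, 0.0, -0.5]))
# A policy that shifted slightly, and one that collapsed onto an unlikely token.
near = softmax(np.array([1.3, 0.5, 0.0, -0.5]))
far = softmax(np.array([1.0, 0.5, 0.0, 9.0]))

beta = 0.5  # exaggerated on purpose for a one-token toy example
results = {}
for name, probs, reward in [("near", near, 1.0), ("far", far, 1.5)]:
    kl = kl_divergence(probs, reference)
    results[name] = penalized_objective(reward, probs, reference, beta)
    print(f"{name:>4}: reward={reward:.2f}  KL={kl:.3f}  objective={results[name]:.3f}")
```

The "far" policy earns the larger raw reward, but its KL cost is large enough that its penalized objective ends up below that of the "near" policy.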

InstructGPT numbers: PPO training used lr=1.5e-5, a KL coefficient of beta=0.02, 256K episodes (prompt-response pairs), and 4 PPO epochs per batch. The full RLHF pipeline took several days on a GPU cluster.

graph LR
    subgraph PPO["PPO Training Loop"]
        direction TB
        PROMPT["Sample prompt\nfrom dataset"] --> GEN["Policy generates\nresponse"]
        GEN --> SCORE["Reward model\nscores response"]
        GEN --> KL["Compute KL divergence\nvs reference model"]
        SCORE --> OBJ["Objective:\nreward - beta * KL"]
        KL --> OBJ
        OBJ --> UPDATE["PPO gradient update\n(clipped surrogate loss)"]
        UPDATE --> |"repeat"| PROMPT
    end

    style PROMPT fill:#1a1a2e,stroke:#0f3460,color:#fff
    style SCORE fill:#1a1a2e,stroke:#51cf66,color:#fff
    style KL fill:#1a1a2e,stroke:#e94560,color:#fff
    style OBJ fill:#1a1a2e,stroke:#e94560,color:#fff

The PPO Objective in Detail

PPO uses a "clipped surrogate objective" to prevent overly large updates. It clips the probability ratio between the new policy and the old policy to the range [1 - epsilon, 1 + epsilon], with epsilon typically set to 0.2.

ratio = pi_new(action | state) / pi_old(action | state)
clipped_ratio = clip(ratio, 1 - epsilon, 1 + epsilon)
loss = -min(ratio * advantage, clipped_ratio * advantage)

The advantage function estimates how much better the current response is than the expected average quality. In RLHF it is defined as follows.

advantage = reward(prompt, response) - baseline

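The baseline is usually the mean reward of recent responses. A positive advantage means the response was better than average, and a negative advantage means it was worse. PPO raises the probability of better-than-average responses and lowers the probability of worse-than-average ones.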
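A minimal sketch of that baseline, assuming a sliding window over the previous responses. The window size, the reward values, and the choice of 0.0 as the baseline for the very first response are all illustrative assumptions. Full PPO implementations often replace this running mean with a learned value function.

```python
import numpy as np


def advantages_with_running_baseline(rewards, window=4):
    """advantage_t = reward_t - mean(rewards of up to `window` previous responses)."""
    advantages = []
    for t, reward in enumerate(rewards):
        history = rewards[max(0, t - window):t]
        # Assumption: with no history yet, use a baseline of 0.0.
        baseline = float(np.mean(history)) if history else 0.0
        advantages.append(reward - baseline)
    return advantages


rewards = [0.2, 0.4, 0.1, 0.9, 0.3]  # made-up reward model scores
advs = advantages_with_running_baseline(rewards)
for reward, adv in zip(rewards, advs):
    direction = "raise probability" if adv > 0 else "lower probability"
    print(f"reward={reward:.2f}  advantage={adv:+.3f}  -> {direction}")
```

The same reward can produce a positive or a negative advantage depending on what came before it, which is the point of subtracting a baseline: the update reflects relative quality, not the raw scale of the reward model.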

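Clipping is what prevents catastrophic updates. When a response earns an unusually large advantage, an unclipped objective would reward pushing the probability ratio arbitrarily far from 1, and the model could lurch toward that one response. Clipping removes the incentive to move the ratio outside the [1 - epsilon, 1 + epsilon] range, which keeps training stable.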
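The clipped objective is easier to read with numbers. The sketch below evaluates the loss formula from this section for a single action whose old probability is 0.10; the probabilities and advantages are hypothetical.

```python
import numpy as np


def clipped_surrogate_loss(logp_new, logp_old, advantage, epsilon=0.2):
    """-min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A) for one action."""
    ratio = np.exp(logp_new - logp_old)
    clipped_ratio = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon)
    return -np.minimum(ratio * advantage, clipped_ratio * advantage)


logp_old = np.log(0.10)
for p_new in (0.05, 0.10, 0.12, 0.20):
    for advantage in (+1.0, -1.0):
        loss = clipped_surrogate_loss(np.log(p_new), logp_old, advantage)
        print(f"ratio={p_new / 0.10:.2f}  advantage={advantage:+.1f}  loss={loss:+.3f}")
```

With a positive advantage, the loss stops improving once the ratio passes 1 + epsilon, so there is no gradient pulling the policy further in that direction. With a negative advantage the same happens below 1 - epsilon. In the remaining cases the unclipped term is the smaller one, so the objective stays pessimistic and the policy is still penalized for moves that hurt.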

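Reward Hacking

This is the dark side of RLHF. The language model optimizes against the reward model, but the reward model is only an imperfect proxy for human preferences. The better the language model gets at maximizing reward, the more it starts to exploit the reward model's weaknesses.

Typical failure modes are listed below.

Failure	Symptom	Cause
Verbosity	Responses keep getting longer.	Annotators often preferred longer, more detailed responses, so the reward model learned to score length itself highly.
Sycophancy	The model agrees with whatever the user says.	Annotators preferred responses that agreed with the premise of the question.
Hedging	The model avoids committing to an answer.	Evasive responses such as "This is a complex topic with many perspectives..." are rarely marked as wrong.
Format gaming	The model overuses bullet points and headers.	Formatted responses looked more polished to annotators.

Mitigation strategies include a stronger KL penalty, which keeps the model from straying far enough to exploit weaknesses; further training of the reward model on adversarial examples to patch known failure modes; and using several reward models with different architectures together, since fooling all of them at once is much harder.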
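The last mitigation can be sketched as follows. The two "reward models" here are hand-written heuristics standing in for trained models, and taking the minimum across scorers is just one possible aggregation rule; both are assumptions made for illustration.

```python
import numpy as np


def length_biased_scorer(text):
    """Stand-in for a flawed reward model that likes long answers."""
    return len(text.split()) / 10.0


def diversity_scorer(text):
    """Stand-in for a second model that dislikes repetition."""
    words = text.split()
    return len(set(words)) / len(words) if words else 0.0


def ensemble_reward(text, scorers, mode="min"):
    """Combine several scorers; 'min' means every scorer must be satisfied."""
    scores = np.array([scorer(text) for scorer in scorers], dtype=float)
    if mode == "min":
        return float(scores.min())
    if mode == "mean":
        return float(scores.mean())
    raise ValueError(f"unknown mode: {mode}")


scorers = [length_biased_scorer, diversity_scorer]
honest = "Paris is the capital of France."
hacked = "helpful " * 30  # degenerate output that games the length bias

for name, text in [("honest", honest), ("hacked", hacked)]:
    single = length_biased_scorer(text)
    combined = ensemble_reward(text, scorers, mode="min")
    print(f"{name}: length-only={single:.3f}  ensemble(min)={combined:.3f}")
```

The length-biased scorer alone ranks the repetitive output higher. Under the minimum rule that output has to satisfy the diversity scorer as well, so its combined reward falls below that of the honest answer.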

Real RLHF Pipelines

Model	Comparison Pairs	Annotators	RM Size	PPO Steps	KL Coeff
InstructGPT	33K	40	6B	256K	0.02
Llama 2 Chat	~1M	undisclosed	70B	undisclosed	0.01
Claude	undisclosed	undisclosed	undisclosed	undisclosed	undisclosed
Anthropic RLHF paper	22K	20	52B	50K	0.001

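Anthropic's 2022 paper trained a 52B reward model on 22,000 comparisons. A larger reward model produces a more reliable signal, which makes PPO training more stable. Conversely, training a large language model against a small reward model is risky, because the reward model lacks the capacity to capture subtle differences between good and bad responses.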

Build It

Step 1: Synthetic Preference Data

In production, human annotators create the preference data. Here we build synthetic pairs in which the "preferred" response is objectively better: more concise, more accurate, and more helpful.

import numpy as np

PREFERENCE_DATA = [
    {
        "prompt": "What is the capital of France?",
        "preferred": "The capital of France is Paris.",
        "rejected": "France is a country in Europe. It has many cities. The capital is Paris. Paris is known for the Eiffel Tower.",
    },
    {
        "prompt": "Explain gravity in one sentence.",
        "preferred": "Gravity is the force that attracts objects with mass toward each other.",
        "rejected": "Gravity is something that makes things fall down when you drop them.",
    },
    {
        "prompt": "What is 15 times 7?",
        "preferred": "15 times 7 is 105.",
        "rejected": "Let me think about this. 15 times 7. Well, 10 times 7 is 70, and 5 times 7 is 35, so the answer might be around 105.",
    },
    {
        "prompt": "Name three programming languages.",
        "preferred": "Python, Rust, and TypeScript.",
        "rejected": "There are many programming languages. Some popular ones include various languages like Python and others.",
    },
    {
        "prompt": "What year did World War II end?",
        "preferred": "World War II ended in 1945.",
        "rejected": "World War II was a major global conflict. It involved many countries. The war ended in the mid-1940s, specifically in 1945.",
    },
    {
        "prompt": "Define machine learning.",
        "preferred": "Machine learning is a field where algorithms learn patterns from data to make predictions without being explicitly programmed.",
        "rejected": "Machine learning is a type of AI. AI stands for artificial intelligence. Machine learning uses data to learn.",
    },
]

The preferred responses are concise and direct. The rejected responses show common failure modes: padding, hedging, redundant explanation, and imprecision. These are exactly the kinds of distinctions that SFT cannot capture and RLHF can.

Step 2: Reward Model Architecture

The reward model reuses the mini GPT transformer architecture as is, but replaces the vocabulary-sized output head with a single scalar projection.

import sys
import os
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..", "04-pre-training-mini-gpt", "code"))
from main import MiniGPT, LayerNorm, Embedding, TransformerBlock


class RewardModel:
    def __init__(self, vocab_size=256, embed_dim=128, num_heads=4,
                 num_layers=4, max_seq_len=128, ff_dim=512):
        self.embedding = Embedding(vocab_size, embed_dim, max_seq_len)
        self.blocks = [
            TransformerBlock(embed_dim, num_heads, ff_dim)
            for _ in range(num_layers)
        ]
        self.ln_f = LayerNorm(embed_dim)
        self.reward_head = np.random.randn(embed_dim) * 0.02

    def forward(self, token_ids):
        seq_len = token_ids.shape[-1]
        mask = np.triu(np.full((seq_len, seq_len), -1e9), k=1)

        x = self.embedding.forward(token_ids)
        for block in self.blocks:
            x = block.forward(x, mask)
        x = self.ln_f.forward(x)

        last_hidden = x[:, -1, :]
        reward = last_hidden @ self.reward_head

        return reward

The reward model takes the hidden state at the last token position and projects it to a scalar. Why the last token? Because of the causal attention mask, the last position has attended to every earlier token, so it holds the most complete representation of the full (prompt, response) sequence.
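This property of the causal mask can be checked directly. The sketch below builds the same upper-triangular mask as the forward pass above and applies it to uniform attention scores (an assumption made so that only the mask shapes the result), then counts how many positions each row can see.

```python
import numpy as np

seq_len = 5
# Same construction as in RewardModel.forward: -1e9 above the diagonal.
mask = np.triu(np.full((seq_len, seq_len), -1e9), k=1)

# All-zero raw scores, so the attention pattern comes from the mask alone.
scores = np.zeros((seq_len, seq_len)) + mask
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = weights / weights.sum(axis=-1, keepdims=True)

# Number of positions with non-zero attention weight, per query position.
visible_counts = (weights > 0).sum(axis=-1)
for position, count in enumerate(visible_counts):
    print(f"position {position} attends to {count} token(s)")
```

Position 0 sees only itself, while the last position spreads its attention over all five tokens. That is why the final position is the natural place to read out a score for the whole sequence.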

Step 3: Bradley-Terry Loss

We train the reward model on the preference pairs using the pairwise Bradley-Terry loss.

def tokenize_for_reward(prompt, response, vocab_size=256):
    prompt_tokens = [min(t, vocab_size - 1) for t in list(prompt.encode("utf-8"))]
    response_tokens = [min(t, vocab_size - 1) for t in list(response.encode("utf-8"))]
    return prompt_tokens + [0] + response_tokens


def sigmoid(x):
    return np.where(
        x >= 0,
        1.0 / (1.0 + np.exp(-x)),
        np.exp(x) / (1.0 + np.exp(x))
    )


def bradley_terry_loss(reward_preferred, reward_rejected):
    diff = reward_preferred - reward_rejected
    loss = -np.log(sigmoid(diff) + 1e-8)
    return loss


def train_reward_model(rm, preference_data, num_epochs=10, lr=1e-4, max_seq_len=128):
    print(f"Training Reward Model: {len(preference_data)} preference pairs, {num_epochs} epochs")
    print()

    losses = []
    accuracies = []

    for epoch in range(num_epochs):
        epoch_loss = 0.0
        epoch_correct = 0
        num_pairs = 0

        indices = np.random.permutation(len(preference_data))

        for idx in indices:
            pair = preference_data[idx]

            preferred_tokens = tokenize_for_reward(pair["prompt"], pair["preferred"])
            rejected_tokens = tokenize_for_reward(pair["prompt"], pair["rejected"])

            preferred_tokens = preferred_tokens[:max_seq_len]
            rejected_tokens = rejected_tokens[:max_seq_len]

            preferred_ids = np.array(preferred_tokens).reshape(1, -1)
            rejected_ids = np.array(rejected_tokens).reshape(1, -1)

            r_preferred = rm.forward(preferred_ids)[0]
            r_rejected = rm.forward(rejected_ids)[0]

            loss = bradley_terry_loss(r_preferred, r_rejected)

            if r_preferred > r_rejected:
                epoch_correct += 1

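            # d(loss)/d(diff) for loss = -log(sigmoid(diff))
            diff = r_preferred - r_rejected
            grad = sigmoid(diff) - 1.0

            # Only the scalar head is trained here; the transformer body stays fixed.
            # d(diff)/d(reward_head) = h_preferred - h_rejected, where h is the
            # last-position hidden state after all blocks and the final LayerNorm
            # (the same computation as RewardModel.forward).
            feats = []
            for ids in (preferred_ids, rejected_ids):
                seq_len = ids.shape[-1]
                mask = np.triu(np.full((seq_len, seq_len), -1e9), k=1)
                h = rm.embedding.forward(ids)
                for block in rm.blocks:
                    h = block.forward(h, mask)
                feats.append(rm.ln_f.forward(h)[:, -1, :].flatten())

            rm.reward_head -= lr * grad * (feats[0] - feats[1])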

            epoch_loss += loss
            num_pairs += 1

        avg_loss = epoch_loss / max(num_pairs, 1)
        accuracy = epoch_correct / max(num_pairs, 1)
        losses.append(avg_loss)
        accuracies.append(accuracy)

        if epoch % 2 == 0:
            print(f"  Epoch {epoch + 1:3d} | Loss: {avg_loss:.4f} | Accuracy: {accuracy:.1%}")

    return rm, losses, accuracies

The accuracy metric is simple: the percentage of preference pairs that the reward model ranks correctly. A random model scores about 50%, and a well-trained reward model on clean data typically exceeds 70%. InstructGPT's reward model reached about 72% accuracy on held-out comparisons. That may look low, but it is a very good result: humans themselves disagree on many preference pairs, and inter-annotator agreement is only about 73%.
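As a small standalone sketch, the metric can be computed from two arrays of reward scores. The helper names and the scores are hypothetical, and counting ties as incorrect is an assumption that matches the strict `>` comparison used in the training loop above.

```python
import numpy as np


def pairwise_accuracy(rewards_preferred, rewards_rejected):
    """Fraction of pairs where the preferred response gets the strictly higher score."""
    preferred = np.asarray(rewards_preferred, dtype=float)
    rejected = np.asarray(rewards_rejected, dtype=float)
    return float(np.mean(preferred > rejected))


def mean_margin(rewards_preferred, rewards_rejected):
    """Average of reward(preferred) - reward(rejected) across pairs."""
    preferred = np.asarray(rewards_preferred, dtype=float)
    rejected = np.asarray(rewards_rejected, dtype=float)
    return float(np.mean(preferred - rejected))


# Made-up scores for four preference pairs.
r_pref = [1.2, 0.3, -0.5, 0.8]
r_rej = [0.4, 0.6, -0.9, 0.8]

accuracy = pairwise_accuracy(r_pref, r_rej)
margin = mean_margin(r_pref, r_rej)
print(f"accuracy={accuracy:.1%}  mean margin={margin:+.3f}")
```

Accuracy only records who won each comparison, while the margin shows how decisively. Reporting both gives a fuller picture of how well the reward model separates preferred from rejected responses.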

Step 4: Simplified PPO Loop

A full PPO implementation is very complex. Here we capture only the core mechanics: generate a response, score it, compute the advantage, and update the policy with a KL penalty.

def compute_kl_divergence(policy_logits, reference_logits):
    policy_probs = np.exp(policy_logits - policy_logits.max(axis=-1, keepdims=True))
    policy_probs = policy_probs / policy_probs.sum(axis=-1, keepdims=True)
    policy_probs = np.clip(policy_probs, 1e-10, 1.0)

    ref_probs = np.exp(reference_logits - reference_logits.max(axis=-1, keepdims=True))
    ref_probs = ref_probs / ref_probs.sum(axis=-1, keepdims=True)
    ref_probs = np.clip(ref_probs, 1e-10, 1.0)

    kl = np.sum(policy_probs * np.log(policy_probs / ref_probs), axis=-1)
    return kl.mean()


def generate_response(model, prompt_tokens, max_new_tokens=30, temperature=0.8, max_seq_len=128):
    tokens = list(prompt_tokens)

    for _ in range(max_new_tokens):
        context = np.array(tokens[-max_seq_len:]).reshape(1, -1)
        logits = model.forward(context)
        next_logits = logits[0, -1, :]

        next_logits = next_logits / max(temperature, 1e-8)
        probs = np.exp(next_logits - next_logits.max())
        probs = probs / probs.sum()
        probs = np.clip(probs, 1e-10, 1.0)
        probs = probs / probs.sum()

        next_token = np.random.choice(len(probs), p=probs)
        tokens.append(int(next_token))

    return tokens


def copy_model_weights(source, target):
    target.embedding.token_embed = source.embedding.token_embed.copy()
    target.embedding.pos_embed = source.embedding.pos_embed.copy()
    target.ln_f.gamma = source.ln_f.gamma.copy()
    target.ln_f.beta = source.ln_f.beta.copy()
    for s_block, t_block in zip(source.blocks, target.blocks):
        t_block.attn.W_q = s_block.attn.W_q.copy()
        t_block.attn.W_k = s_block.attn.W_k.copy()
        t_block.attn.W_v = s_block.attn.W_v.copy()
        t_block.attn.W_out = s_block.attn.W_out.copy()
        t_block.ffn.W1 = s_block.ffn.W1.copy()
        t_block.ffn.W2 = s_block.ffn.W2.copy()
        t_block.ffn.b1 = s_block.ffn.b1.copy()
        t_block.ffn.b2 = s_block.ffn.b2.copy()
        t_block.ln1.gamma = s_block.ln1.gamma.copy()
        t_block.ln1.beta = s_block.ln1.beta.copy()
        t_block.ln2.gamma = s_block.ln2.gamma.copy()
        t_block.ln2.beta = s_block.ln2.beta.copy()


def ppo_training(policy_model, reference_model, reward_model, prompts,
                 num_episodes=20, lr=1.5e-5, kl_coeff=0.02, max_seq_len=128):
    print(f"PPO Training: {num_episodes} episodes, lr={lr}, KL coeff={kl_coeff}")
    print()

    rewards_history = []
    kl_history = []

    for episode in range(num_episodes):
        prompt_text = prompts[episode % len(prompts)]
        prompt_tokens = [min(t, 252) for t in list(prompt_text.encode("utf-8"))]

        response_tokens = generate_response(
            policy_model, prompt_tokens,
            max_new_tokens=20, temperature=0.8, max_seq_len=max_seq_len
        )

        response_ids = np.array(response_tokens[:max_seq_len]).reshape(1, -1)
        reward = reward_model.forward(response_ids)[0]

        policy_logits = policy_model.forward(response_ids)
        ref_logits = reference_model.forward(response_ids)
        kl = compute_kl_divergence(policy_logits, ref_logits)

        total_reward = reward - kl_coeff * kl

        rewards_history.append(float(reward))
        kl_history.append(float(kl))

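        # NOTE: this is a stand-in for the PPO step, not a real policy gradient.
        # It nudges the feed-forward weights with random noise scaled by the
        # KL-adjusted reward. A full implementation would backpropagate the
        # clipped surrogate loss through the policy instead.
        for block in policy_model.blocks:
            update_scale = lr * total_reward
            block.ffn.W1 += update_scale * np.random.randn(*block.ffn.W1.shape) * 0.01
            block.ffn.W2 += update_scale * np.random.randn(*block.ffn.W2.shape) * 0.01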

        if episode % 5 == 0:
            avg_reward = np.mean(rewards_history[-5:]) if rewards_history else 0
            avg_kl = np.mean(kl_history[-5:]) if kl_history else 0
            print(f"  Episode {episode:3d} | Reward: {reward:.4f} | KL: {kl:.4f} | "
                  f"Avg Reward: {avg_reward:.4f}")

    return policy_model, rewards_history, kl_history

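The core loop can be summarized as follows: (1) sample a prompt, (2) generate a response, (3) score it with the reward model, (4) compute the KL divergence against the frozen reference model, (5) compute the adjusted reward (the reward minus the KL penalty), and (6) update the policy. As the policy drifts away from the reference model, the KL penalty grows and offsets the reward, which is the mechanism that suppresses reward hacking.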
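The update step in `ppo_training` above perturbs weights with random noise, so it does not follow a gradient. For contrast, here is a sketch of a true gradient update on the same objective, E[reward] - beta * KL(policy || reference), for a toy softmax policy over four candidate "responses". It uses the exact gradient rather than sampling or PPO clipping, and the rewards, learning rate, step count, and uniform reference are all assumptions chosen for illustration.

```python
import numpy as np


def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()


def kl(p, q):
    return float(np.sum(p * np.log(p / q)))


def optimize_policy(rewards, ref_logits, beta, lr=0.5, steps=2000):
    """Exact gradient ascent on E_pi[r] - beta * KL(pi || pi_ref)."""
    ref_probs = softmax(ref_logits)
    theta = ref_logits.copy()  # the policy starts as a copy of the reference
    for _ in range(steps):
        probs = softmax(theta)
        # KL-adjusted reward of each action under the current policy.
        adjusted = rewards - beta * np.log(probs / ref_probs)
        # Advantage relative to the policy's own expected adjusted reward.
        advantage = adjusted - probs @ adjusted
        # Gradient of the objective with respect to the softmax logits.
        theta += lr * probs * advantage
    return softmax(theta), ref_probs


rewards = np.array([1.0, 0.2, 0.0, -0.5])  # hypothetical reward per response
ref_logits = np.zeros(4)                   # uniform reference policy

outcomes = {}
for beta in (0.1, 1.0):
    probs, ref_probs = optimize_policy(rewards, ref_logits, beta)
    outcomes[beta] = (probs, float(probs @ rewards), kl(probs, ref_probs))
    print(f"beta={beta}: probs={np.round(probs, 3)}  "
          f"E[reward]={outcomes[beta][1]:.3f}  KL={outcomes[beta][2]:.3f}")
```

With the small beta, the policy concentrates almost entirely on the highest-reward response and moves far from the reference. With the large beta, it improves the expected reward more modestly and stays close to the reference, settling at a distribution proportional to reference × exp(reward / beta).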

Step 5: Reward Score Comparison

After RLHF training, responses generated by the policy model should score higher with the reward model than responses generated by the original SFT model.

def compare_models(sft_model, rlhf_model, reward_model, prompts, max_seq_len=128):
    print("Model Comparison (reward scores)")
    print("-" * 60)
    print(f"  {'Prompt':<35} {'SFT':>10} {'RLHF':>10}")
    print("  " + "-" * 55)

    sft_total = 0.0
    rlhf_total = 0.0

    for prompt in prompts:
        prompt_tokens = [min(t, 252) for t in list(prompt.encode("utf-8"))]

        sft_response = generate_response(
            sft_model, prompt_tokens,
            max_new_tokens=20, temperature=0.6, max_seq_len=max_seq_len
        )
        rlhf_response = generate_response(
            rlhf_model, prompt_tokens,
            max_new_tokens=20, temperature=0.6, max_seq_len=max_seq_len
        )

        sft_ids = np.array(sft_response[:max_seq_len]).reshape(1, -1)
        rlhf_ids = np.array(rlhf_response[:max_seq_len]).reshape(1, -1)

        sft_reward = reward_model.forward(sft_ids)[0]
        rlhf_reward = reward_model.forward(rlhf_ids)[0]

        sft_total += sft_reward
        rlhf_total += rlhf_reward

        truncated_prompt = prompt[:33] + ".." if len(prompt) > 35 else prompt
        print(f"  {truncated_prompt:<35} {sft_reward:>10.4f} {rlhf_reward:>10.4f}")

    n = len(prompts)
    print("  " + "-" * 55)
    print(f"  {'Average':<35} {sft_total/n:>10.4f} {rlhf_total/n:>10.4f}")

    return sft_total / n, rlhf_total / n

Use It

Full RLHF Pipeline Demo

if __name__ == "__main__":
    np.random.seed(42)

    print("=" * 70)
    print("RLHF PIPELINE: REWARD MODEL + PPO")
    print("=" * 70)
    print()

    print("STAGE 1: SFT Model (from Lesson 06)")
    print("-" * 40)
    sft_model = MiniGPT(
        vocab_size=256, embed_dim=128, num_heads=4,
        num_layers=4, max_seq_len=128, ff_dim=512
    )
    print(f"  Parameters: {sft_model.count_parameters():,}")
    print()

    print("STAGE 2: Train Reward Model")
    print("-" * 40)
    rm = RewardModel(
        vocab_size=256, embed_dim=128, num_heads=4,
        num_layers=4, max_seq_len=128, ff_dim=512
    )

    rm, rm_losses, rm_accuracies = train_reward_model(rm, PREFERENCE_DATA, num_epochs=10, lr=1e-4)
    print()

    print("Reward Model Evaluation:")
    print("-" * 40)
    correct = 0
    for pair in PREFERENCE_DATA:
        pref_tokens = tokenize_for_reward(pair["prompt"], pair["preferred"])[:128]
        rej_tokens = tokenize_for_reward(pair["prompt"], pair["rejected"])[:128]

        r_pref = rm.forward(np.array(pref_tokens).reshape(1, -1))[0]
        r_rej = rm.forward(np.array(rej_tokens).reshape(1, -1))[0]

        if r_pref > r_rej:
            correct += 1
        print(f"  Preferred: {r_pref:+.4f} | Rejected: {r_rej:+.4f} | {'Correct' if r_pref > r_rej else 'Wrong'}")

    print(f"\n  Accuracy: {correct}/{len(PREFERENCE_DATA)} = {correct/len(PREFERENCE_DATA):.1%}")
    print()

    print("STAGE 3: PPO Training")
    print("-" * 40)

    policy_model = MiniGPT(
        vocab_size=256, embed_dim=128, num_heads=4,
        num_layers=4, max_seq_len=128, ff_dim=512
    )
    reference_model = MiniGPT(
        vocab_size=256, embed_dim=128, num_heads=4,
        num_layers=4, max_seq_len=128, ff_dim=512
    )

    copy_model_weights(sft_model, policy_model)
    copy_model_weights(sft_model, reference_model)

    train_prompts = [pair["prompt"] for pair in PREFERENCE_DATA]

    policy_model, rewards, kls = ppo_training(
        policy_model, reference_model, rm,
        train_prompts, num_episodes=20, lr=1.5e-5, kl_coeff=0.02
    )
    print()

    print("=" * 70)
    print("COMPARISON: SFT vs RLHF")
    print("=" * 70)
    print()

    eval_prompts = [
        "What is the capital of France?",
        "Explain gravity.",
        "Name three programming languages.",
    ]

    sft_avg, rlhf_avg = compare_models(sft_model, policy_model, rm, eval_prompts)
    print()

    print("=" * 70)
    print("KL DIVERGENCE ANALYSIS")
    print("=" * 70)
    print()

    if kls:
        print(f"  Initial KL: {kls[0]:.4f}")
        print(f"  Final KL:   {kls[-1]:.4f}")
        print(f"  Max KL:     {max(kls):.4f}")
        kl_threshold = 0.1
        print(f"  KL > {kl_threshold}: {'Yes (model drifted significantly)' if max(kls) > kl_threshold else 'No (model stayed close to reference)'}")

Build the Artifact

This lesson produces outputs/prompt-reward-model-designer.md, a prompt that designs reward model training pipelines. Given a target behavior (helpfulness, coding ability, safety, and so on), it generates the data collection protocol, the annotator guidelines, and the reward model evaluation criteria in one pass.

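Exercises

(Easy) Modify the reward model to use the mean of all hidden states instead of only the last position, then compare accuracy. Mean pooling gives every token equal weight, while the last-position approach assumes causal attention has already aggregated the information. Report which approach reaches higher accuracy on the 6 preference pairs.
(Medium) Implement reward model calibration. After training, run every preference pair through the reward model and compute (a) the mean reward of preferred responses, (b) the mean reward of rejected responses, and (c) the margin (preferred reward minus rejected reward). A well-calibrated model should show a clear margin. Then add 4 new preference pairs and check whether the margin holds on unseen data.
(Medium) Simulate reward hacking. Build a reward model that gives higher scores to longer responses (reward = len(response) / 100). Run PPO against this flawed reward model and observe whether the policy produces longer and more repetitive outputs. Then add a KL penalty of 0.1 and show that it suppresses this degenerate behavior.
(Hard) Implement a multi-objective reward. Train two reward models, one for helpfulness and one for conciseness. Combine them as R = 0.7 * R_helpful + 0.3 * R_concise, and show that the combined objective produces responses that are both helpful and concise while avoiding the verbosity trap that a helpfulness-only reward tends to fall into.
(Hard) Compare different KL coefficients. Run PPO with beta=0.001 (too low, reward hacking occurs), beta=0.02 (standard), and beta=0.5 (too high, the model cannot learn), and plot the reward curve and KL curve for each run. The beta=0.02 run should show steadily improving reward with KL staying within a bounded range.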

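Key Terms

Term	What people say	What it actually means
RLHF	"Learning from human feedback"	Reinforcement Learning from Human Feedback. A three-stage pipeline (SFT, reward model, PPO) that uses human preference signals to optimize a language model's outputs.
Reward model	"The model that grades responses"	A transformer with a scalar output head, trained on pairwise human preferences with the Bradley-Terry loss.
Bradley-Terry	"The comparison model"	A probabilistic model defined by `P(A > B) = sigmoid(score(A) - score(B))` that turns pairwise comparisons into a consistent scoring function.
PPO	"That RL algorithm"	Proximal Policy Optimization. An algorithm that maximizes reward while clipping the size of each update to prevent training instability.
KL divergence	"How different two distributions are"	A measure of the gap between the policy model's token distribution and the reference model's token distribution, used as a penalty to prevent reward hacking.
KL penalty	"A leash on the model"	The term `beta * KL(policy || reference)` subtracted from the reward signal, which keeps the policy from drifting too far from the SFT checkpoint.
Reward hacking	"Gaming the reward"	The policy exploits weaknesses in the reward model to find degenerate high-reward outputs instead of genuinely improving response quality.
Preference pair	"Which is better, A or B?"	The basic unit of RLHF training data, consisting of `(prompt, preferred_response, rejected_response)`.
Reference model	"The frozen SFT checkpoint"	A copy of the SFT model whose weights never change, serving as the anchor for computing KL divergence.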

Further Reading

Ouyang et al., 2022. "Training language models to follow instructions with human feedback" (InstructGPT). The paper that applied RLHF to large language models in practice.
Schulman et al., 2017. "Proximal Policy Optimization Algorithms". The original PPO paper from OpenAI.
Bai et al., 2022. "Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback". Anthropic's RLHF paper, with a detailed analysis of reward hacking and the KL penalty.
Stiennon et al., 2020. "Learning to summarize with human feedback". Applies RLHF to summarization and shows that a reward model can capture subtle quality judgments.
Christiano et al., 2017. "Deep reinforcement learning from human preferences". The work that laid the foundation for learning reward functions from human comparison data.


Review Questions

3 questions · Answer all of them correctly to mark the lesson complete

1. How many separate models does the full RLHF pipeline require?

2. What is reward hacking in RLHF?

3. What does PPO's clipping mechanism prevent?
