Debugging Neural Networks

Your neural network compiled, ran, and printed numbers. The numbers are wrong, and nothing crashed. Welcome to the hardest kind of debugging: the kind with no error messages.

Type: Practice
Language: Python, PyTorch
Prerequisites: Phase 03 Lessons 01-10, especially backpropagation, loss functions, and optimizers
Time: about 90 minutes

Learning Objectives

Diagnose common neural network failures (NaN loss, flat loss curves, overfitting, oscillation) with a systematic debugging strategy.
Apply the "overfit one batch" technique to verify that the model architecture and training loop are correct.
Inspect gradient magnitudes, activation distributions, and weight norms to identify vanishing or exploding gradients.
Build a debugging checklist covering the data pipeline, model architecture, loss function, optimizer, and learning rate.

The Problem

Traditional software crashes when it breaks. A null pointer throws an exception, a type mismatch fails at compile time, and an off-by-one error produces visibly wrong output.

Neural networks offer no such courtesy.

A broken network still runs to completion, prints a loss, and produces predictions. The loss may even go down, and the predictions may look plausible. Yet the model can be silently wrong: it may have learned a shortcut, memorized noise, or converged to a useless local minimum. Google researchers have estimated that 60 to 70% of ML debugging time goes to "silent" bugs that raise no error but degrade model quality.

The difference between a working model and a broken one is often a single line: a missing zero_grad(), a transposed dimension, a learning rate that is off by 10x. Andrej Karpathy's widely cited post "A Recipe for Training Neural Networks" (2019) makes the same point: the most common neural network mistakes are bugs that do not crash.

This lesson teaches you how to find those bugs.

Pre-Test

Two questions to check what you already know before starting this lesson.

1. Why is debugging a neural network harder than debugging traditional software?

2. What is the "overfit one batch" debugging technique?

Concepts

The Debugging Mindset

Forget "print-and-pray" debugging. Neural network debugging needs a systematic approach because the feedback loop is slow (a training run takes minutes to hours) and the symptoms are ambiguous: a single bad loss value can have 20 different causes.

The golden rule: start simple, add complexity one piece at a time, and verify each piece independently.

flowchart TD
    A["Loss not decreasing"] --> B{"Check learning rate"}
    B -->|"Too high"| C["Loss oscillates or explodes"]
    B -->|"Too low"| D["Loss barely moves"]
    B -->|"Reasonable"| E{"Check gradients"}
    E -->|"All zeros"| F["Dead ReLUs or vanishing gradients"]
    E -->|"NaN/Inf"| G["Exploding gradients"]
    E -->|"Normal"| H{"Check data pipeline"}
    H -->|"Labels shuffled"| I["Random-chance accuracy"]
    H -->|"Preprocessing bug"| J["Model learns noise"]
    H -->|"Data is fine"| K{"Check architecture"}
    K -->|"Too small"| L["Underfitting"]
    K -->|"Too deep"| M["Optimization difficulty"]

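Symptom 1: Loss Is Not Decreasing

This is the most common complaint. The training loop runs and epochs go by, but the loss stays put or oscillates wildly.

Wrong learning rate. Too high, and the loss oscillates or jumps to NaN. Too low, and the loss falls so slowly that it looks flat. Reasonable starting points are 1e-3 for Adam and 1e-1 or 1e-2 for SGD. Before concluding that something else is wrong, always try three learning rates spaced 10x apart, for example 1e-2, 1e-3, and 1e-4.

Dead ReLU. When a ReLU neuron receives a large negative input, it outputs 0 and its gradient is also 0, so it may never activate again. If enough neurons die, the network cannot learn. To check, print the fraction of activations that are exactly 0 after each ReLU layer. If the dead fraction exceeds 50%, switch to LeakyReLU or lower the learning rate.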
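The zero-fraction check described above can be sketched as follows. This is a minimal illustration, not part of the lesson's toolkit: `dead_fraction` is a hypothetical helper, and the bias of -100 is chosen only to force every unit to zero.

```python
import torch
import torch.nn as nn


def dead_fraction(activations: torch.Tensor) -> float:
    """Share of activation values that are exactly zero."""
    return (activations == 0).float().mean().item()


torch.manual_seed(0)
layer = nn.Sequential(nn.Linear(10, 32), nn.ReLU())
x = torch.randn(64, 10)

with torch.no_grad():
    healthy = dead_fraction(layer(x))
    # Push every pre-activation far below zero to simulate dead units.
    layer[0].bias.fill_(-100.0)
    broken = dead_fraction(layer(x))

print(f"zero fraction, default init: {healthy:.0%}")
print(f"zero fraction, bias=-100:    {broken:.0%}")
```

Roughly half of the activations are zero even at default initialization, so a per-batch fraction is a coarse signal. Counting units that are zero for every sample in the batch is a stricter test.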

Vanishing gradients. In deep networks with sigmoid or tanh activations, gradients shrink exponentially as they propagate backward. By the time they reach the first layers they are nearly 0, and those early layers stop learning. Fixes: use ReLU or GELU, add residual connections, or use batch normalization.

Exploding gradients. The opposite problem: gradients grow exponentially. This is common in RNNs and very deep networks, and the loss jumps to NaN. Use gradient clipping (torch.nn.utils.clip_grad_norm_), a lower learning rate, or normalization.
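As a small illustration of gradient clipping, the sketch below creates an artificially large gradient and clips it with `torch.nn.utils.clip_grad_norm_`. The scaled loss and the `max_norm=1.0` value are assumptions chosen for the demo, not recommended settings.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 1)

# An artificially scaled loss produces a very large gradient.
loss = (model(torch.ones(8, 4)) * 1000.0).sum()
loss.backward()

# clip_grad_norm_ rescales gradients in place and returns the total norm
# measured before clipping.
norm_before = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
norm_after = torch.sqrt(sum(p.grad.pow(2).sum() for p in model.parameters()))

print(f"before: {norm_before.item():.1f}, after: {norm_after.item():.4f}")
```

In a real training loop the clipping call sits between `loss.backward()` and `optimizer.step()`.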

Symptom 2: Loss Decreases but the Model Is Bad

The loss goes down and training accuracy reaches 99%, but test accuracy is 55%. Or the model produces nonsensical outputs on real data.

Overfitting. The model memorizes the training data instead of learning patterns, and the gap between training loss and validation loss widens over time. Mitigate it with more data, dropout, weight decay, early stopping, or data augmentation.

Data leakage. Test data leaks into training, and accuracy looks suspiciously high. Common causes are random shuffling before the split when samples are correlated (for example time series), preprocessing with statistics computed on the full dataset, and duplicate samples across splits. Split first, then preprocess, and check for duplicates.
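The "split first, then preprocess, then check for duplicates" order can be sketched in plain Python. The synthetic data, the 800/200 split, and the `standardize` helper are assumptions made for illustration.

```python
import random
import statistics

random.seed(0)
data = [random.gauss(50.0, 10.0) for _ in range(1000)]

# 1) Split first.
train, test = data[:800], data[800:]

# 2) Fit preprocessing statistics on the training split only.
mean = statistics.fmean(train)
std = statistics.pstdev(train)


def standardize(values):
    """Apply the training-set mean and std to any split."""
    return [(v - mean) / std for v in values]


train_z = standardize(train)
test_z = standardize(test)

# 3) Look for exact duplicates shared by the two splits.
overlap = set(train) & set(test)
print(f"train mean after scaling: {statistics.fmean(train_z):.3f}")
print(f"duplicates across splits: {len(overlap)}")
```

The test split is scaled with the training statistics, so no information about the test distribution reaches the model during training.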

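Label errors. Real-world datasets routinely contain mislabeled examples: Northcutt et al. (2021, "Pervasive Label Errors in Test Sets") estimated an average of at least 3.3% label errors across the test sets of ten widely used benchmarks. The model then fits that noise. You can use confident learning to find and fix mislabeled examples, or loss truncation, which ignores the highest-loss samples.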

Symptom 3: NaN or Inf in the Loss

Once the loss becomes nan or inf, training is effectively over.

Learning rate too high. Gradient updates overshoot and the weights blow up. Lower the learning rate by 10x.

log(0) or log(negative). Cross-entropy loss computes log(p). If the model outputs a probability of exactly 0, or a negative value, the logarithm returns -inf or NaN. Clamp predictions to [eps, 1-eps] with eps=1e-7.

Division by zero. Batch normalization divides by the standard deviation, and a batch of constant values has std=0. Add an epsilon to the denominator. PyTorch does this by default, but custom implementations can miss it.

Numerical overflow. Large activations fed into exp() produce Inf. Softmax is especially vulnerable. Use the log-sum-exp trick, which subtracts the maximum value before exponentiating.
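Two of the fixes above, clamping before the logarithm and the log-sum-exp trick, can be sketched in plain Python. The function names are hypothetical. Python's `math.exp` raises `OverflowError` where a tensor library would return `inf`.

```python
import math


def naive_softmax(logits):
    exps = [math.exp(z) for z in logits]  # overflows for large logits
    total = sum(exps)
    return [e / total for e in exps]


def stable_log_softmax(logits):
    # Subtracting the maximum keeps every exponent at or below zero.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum for z in logits]


def safe_log(p, eps=1e-7):
    # Clamp to [eps, 1 - eps] so that log never sees 0 or a negative value.
    return math.log(min(max(p, eps), 1.0 - eps))


logits = [1000.0, 999.0, 998.0]
try:
    naive_softmax(logits)
    naive_failed = False
except OverflowError:
    naive_failed = True

log_probs = stable_log_softmax(logits)
print("naive softmax overflowed:", naive_failed)
print("stable log-probabilities:", [round(lp, 4) for lp in log_probs])
print("safe_log(0.0):", round(safe_log(0.0), 3))
```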

Technique 1: Gradient Checking

Compare the analytical gradient (from backpropagation) with the numerical gradient (from finite differences). If the two disagree, the backward pass has a bug.

The numerical gradient for a parameter w is:

grad_numerical = (loss(w + eps) - loss(w - eps)) / (2 * eps)

Agreement is measured by the relative difference:

rel_diff = |grad_analytical - grad_numerical| / max(|grad_analytical|, |grad_numerical|, 1e-8)

If rel_diff < 1e-5, the gradient is correct. If rel_diff > 1e-3, there is almost certainly a bug. Values in between deserve a closer look.
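The two formulas can be tried on a toy scalar loss before applying them to a network. In this sketch the loss w^3 + 2w and the deliberately wrong `buggy_grad` are assumptions made up for illustration.

```python
def loss(w):
    return w ** 3 + 2.0 * w  # toy scalar loss


def analytical_grad(w):
    return 3.0 * w ** 2 + 2.0  # correct derivative


def buggy_grad(w):
    return 3.0 * w ** 2  # forgets the +2 term


def numerical_grad(f, w, eps=1e-4):
    # Central finite difference.
    return (f(w + eps) - f(w - eps)) / (2.0 * eps)


def rel_diff(a, b):
    return abs(a - b) / max(abs(a), abs(b), 1e-8)


w = 1.5
num = numerical_grad(loss, w)
ok = rel_diff(analytical_grad(w), num)
bad = rel_diff(buggy_grad(w), num)

print(f"correct gradient: rel_diff={ok:.2e}")
print(f"buggy gradient:   rel_diff={bad:.2e}")
```

The correct derivative lands far below the 1e-5 threshold, while the buggy one is well above 1e-3.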

flowchart LR
    A["Parameter w"] --> B["w + eps"]
    A --> C["w - eps"]
    B --> D["Forward pass"]
    C --> E["Forward pass"]
    D --> F["loss+"]
    E --> G["loss-"]
    F --> H["(loss+ - loss-) / 2eps"]
    G --> H
    H --> I["Compare to backprop gradient"]

Technique 2: Activation Statistics

Monitor the mean and standard deviation of the activations after each layer during training. In a healthy network, activations keep a mean near 0 and a standard deviation near 1 (after normalization), or at least stay within a bounded range.

Health indicator	Mean	Std	Diagnosis
Healthy	~0	~1	Network is learning normally
Saturated	>>0 or <<0	~0	Activations pinned at extreme values
Dead	0	0	Neurons are dead; all outputs are 0
Exploding	>>10	>>10	Activations grow without bound

Technique 3: Gradient Flow Visualization

Plot the mean gradient magnitude of each layer. In a healthy network, gradient magnitudes are roughly similar across layers. If the gradients in the early layers are 1000x smaller than in the later layers, you have vanishing gradients.

graph LR
    subgraph "Healthy Gradient Flow"
        L1["Layer 1<br/>grad: 0.05"] --- L2["Layer 2<br/>grad: 0.04"] --- L3["Layer 3<br/>grad: 0.06"] --- L4["Layer 4<br/>grad: 0.05"]
    end

graph LR
    subgraph "Vanishing Gradient Flow"
        V1["Layer 1<br/>grad: 0.0001"] --- V2["Layer 2<br/>grad: 0.003"] --- V3["Layer 3<br/>grad: 0.02"] --- V4["Layer 4<br/>grad: 0.08"]
    end
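The per-layer magnitudes in the diagrams above can be collected directly from `param.grad`. The sketch below assumes a stack of eight sigmoid layers without normalization, a setup chosen because it is prone to vanishing gradients; the layer sizes and random data are illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Eight sigmoid layers without normalization.
layers = []
for _ in range(8):
    layers += [nn.Linear(16, 16), nn.Sigmoid()]
model = nn.Sequential(*layers, nn.Linear(16, 2))

x = torch.randn(32, 16)
y = torch.randint(0, 2, (32,))
nn.CrossEntropyLoss()(model(x), y).backward()

# Mean absolute weight gradient per Linear layer, input side first.
grad_by_layer = [
    (name, p.grad.abs().mean().item())
    for name, p in model.named_parameters()
    if name.endswith("weight")
]
for name, g in grad_by_layer:
    print(f"{name:>10}: {g:.2e}")

ratio = grad_by_layer[-1][1] / grad_by_layer[0][1]
print(f"last/first ratio: {ratio:.0f}x")
```

Swapping `nn.Sigmoid()` for `nn.ReLU()` in the same script is a quick way to see how the ratio changes.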

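Technique 4: Overfit-One-Batch Test

This is the most important debugging technique in deep learning.

Pick one small batch (8 to 32 samples) and train on that batch alone for 100 or more steps. The loss should go to nearly 0 and training accuracy should reach 100%. If it does not, the model or the training loop has a fundamental bug, and you should not move on to full training.

The test catches:

A broken loss function
A broken backward pass
An architecture too small to represent the data
An optimizer that is not connected to the model parameters
Misalignment between data and labels

It takes about 30 seconds to run and saves hours of debugging a full training run.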

Technique 5: Learning Rate Finder

Leslie Smith (2017) proposed sweeping the learning rate from very small (1e-7) to very large (10) over one epoch while recording the loss. Plot loss against learning rate. A good learning rate is roughly 10x smaller than the point where the loss is decreasing fastest.

graph TD
    subgraph "LR Finder Plot"
        direction LR
        A["1e-7: loss=2.3"] --> B["1e-5: loss=2.3"]
        B --> C["1e-3: loss=1.8"]
        C --> D["1e-2: loss=0.9 -- steepest"]
        D --> E["1e-1: loss=0.5"]
        E --> F["1.0: loss=NaN -- too high"]
    end

In this example the best learning rate is about 1e-3, one order of magnitude below the steepest point.
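The reading of the example plot can be reproduced with a few lines of plain Python. This is a sketch under two assumptions: `suggest_lr` is a hypothetical helper, and the "steepest point" is taken to be the right end of the sweep segment with the most negative slope on a log10 learning-rate axis.

```python
import math

# (learning rate, loss) pairs copied from the example plot; NaN marks divergence.
sweep = [
    (1e-7, 2.3), (1e-5, 2.3), (1e-3, 1.8),
    (1e-2, 0.9), (1e-1, 0.5), (1.0, float("nan")),
]


def suggest_lr(points):
    """Return one tenth of the learning rate at the steepest loss decrease."""
    finite = [(lr, loss) for lr, loss in points if math.isfinite(loss)]
    best_slope, steepest_lr = 0.0, None
    for (lr0, l0), (lr1, l1) in zip(finite, finite[1:]):
        slope = (l1 - l0) / (math.log10(lr1) - math.log10(lr0))
        if slope < best_slope:
            best_slope, steepest_lr = slope, lr1
    if steepest_lr is None:
        return None  # the loss never decreased during the sweep
    return steepest_lr / 10.0


suggested = suggest_lr(sweep)
print(f"suggested learning rate: {suggested:.0e}")
```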

Common PyTorch Bugs

These are the bugs that waste the most time in the PyTorch community.

Bug	Symptom	Fix
Forgetting `optimizer.zero_grad()`	Gradients accumulate across batches and the loss oscillates	Call `optimizer.zero_grad()` before `loss.backward()`
Forgetting `model.eval()` at evaluation time	Dropout and batch normalization behave differently, and test accuracy changes from run to run	Use `model.eval()` together with `torch.no_grad()`
Wrong tensor shape	Silent broadcasting produces wrong results without any error	Print shapes after every operation while debugging
CPU/GPU mismatch	`RuntimeError: Expected all tensors to be on the same device`	Call `.to(device)` on both the model and the data
Missing tensor detach	The computation graph keeps growing until memory runs out (OOM)	Use `.detach()` or `with torch.no_grad()`
In-place operation that breaks autograd	`RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation`	Use `x = x + 1` instead of `x += 1`
Unnormalized data	Loss stays stuck near the random-chance level	Normalize inputs to mean=0, std=1
Wrong label dtype	Cross-entropy expects `Long` but receives `Float`	Cast labels with `labels.long()`
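The silent-broadcasting row is easy to reproduce. In the sketch below the predictions are perfect, yet the mismatched shapes produce a nonzero loss. The tensors are made-up values for illustration.

```python
import torch

pred = torch.tensor([[0.0], [1.0], [2.0], [3.0]])  # model output, shape (4, 1)
target = torch.tensor([0.0, 1.0, 2.0, 3.0])        # labels, shape (4,)

# Silent bug: (4, 1) - (4,) broadcasts to (4, 4) instead of raising an error.
wrong = (pred - target) ** 2

# Fix: align the shapes explicitly before subtracting.
right = (pred.squeeze(1) - target) ** 2

print("wrong:", tuple(wrong.shape), "mean =", wrong.mean().item())
print("right:", tuple(right.shape), "mean =", right.mean().item())
```

A shape assertion such as `assert pred.squeeze(1).shape == target.shape` placed just before the loss turns this silent failure into a loud one.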

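Master Debugging Table

Symptom	Likely cause	First thing to try
Loss stuck at -log(1/num_classes)	Model predicts a uniform distribution	Check the data pipeline; verify that labels match inputs
Loss becomes NaN after a few steps	Learning rate too high	Lower the learning rate by 10x
Loss is NaN immediately	`log(0)` or division by zero	Add an epsilon to log and division operations
Loss oscillates wildly	Learning rate too high or batch size too small	Lower the learning rate; increase the batch size
Loss decreases, then plateaus	Learning rate too high for the late, fine-grained phase of training	Add a learning rate schedule (cosine or step decay)
High training accuracy, low test accuracy	Overfitting	Add dropout, weight decay, or more data
Training accuracy = test accuracy = chance level	Model is learning nothing	Run the overfit-one-batch test
Training accuracy = test accuracy, but both are low	Underfitting	Use a bigger model, more layers, or more features
Gradients are all zero	Dead ReLUs or a detached computation graph	Switch to LeakyReLU; check `.requires_grad`
Out of memory during training	Batch too large or the graph is never freed	Reduce the batch size; use `torch.no_grad()` for evaluation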
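The first row of the table is worth keeping as a number to compare against. This sketch computes the cross-entropy of a uniform prediction. The helper names and the tolerance of 0.05 are assumptions chosen for illustration.

```python
import math


def chance_level_loss(num_classes: int) -> float:
    """Cross-entropy of a uniform prediction over num_classes classes."""
    return -math.log(1.0 / num_classes)


def looks_stuck_at_chance(loss: float, num_classes: int, tol: float = 0.05) -> bool:
    return abs(loss - chance_level_loss(num_classes)) < tol


for k in (2, 10, 1000):
    print(f"{k:>4} classes: chance-level loss = {chance_level_loss(k):.3f}")
```

For 10 classes the value is about 2.303, so a loss that hovers around 2.3 on a 10-class problem suggests the model is still predicting a uniform distribution.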

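Build It

Build a diagnostic toolkit that monitors activations, gradients, and loss curves. Then break networks on purpose and use the toolkit to diagnose each problem.

Step 1: The `NetworkDebugger` Class

Attach hooks to a PyTorch model to record per-layer activation and gradient statistics.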

import torch
import torch.nn as nn
import math


class NetworkDebugger:
    def __init__(self, model):
        self.model = model
        self.activation_stats = {}
        self.gradient_stats = {}
        self.loss_history = []
        self.lr_losses = []
        self.hooks = []
        self._register_hooks()

    def _register_hooks(self):
        for name, module in self.model.named_modules():
            if isinstance(module, (nn.Linear, nn.Conv2d, nn.ReLU, nn.LeakyReLU)):
                hook = module.register_forward_hook(self._make_activation_hook(name))
                self.hooks.append(hook)
                hook = module.register_full_backward_hook(self._make_gradient_hook(name))
                self.hooks.append(hook)

    def _make_activation_hook(self, name):
        def hook(module, input, output):
            with torch.no_grad():
                out = output.detach().float()
                self.activation_stats[name] = {
                    "mean": out.mean().item(),
                    "std": out.std().item(),
                    "fraction_zero": (out == 0).float().mean().item(),
                    "min": out.min().item(),
                    "max": out.max().item(),
                }
        return hook

    def _make_gradient_hook(self, name):
        def hook(module, grad_input, grad_output):
            if grad_output[0] is not None:
                with torch.no_grad():
                    grad = grad_output[0].detach().float()
                    self.gradient_stats[name] = {
                        "mean": grad.mean().item(),
                        "std": grad.std().item(),
                        "abs_mean": grad.abs().mean().item(),
                        "max": grad.abs().max().item(),
                    }
        return hook

    def record_loss(self, loss_value):
        self.loss_history.append(loss_value)

    def check_loss_health(self):
        if len(self.loss_history) < 2:
            return "NOT_ENOUGH_DATA"
        recent = self.loss_history[-10:]
        if any(math.isnan(v) or math.isinf(v) for v in recent):
            return "NAN_OR_INF"
        if len(self.loss_history) >= 20:
            first_half = sum(self.loss_history[:10]) / 10
            second_half = sum(self.loss_history[-10:]) / 10
            if second_half >= first_half * 0.99:
                return "NOT_DECREASING"
        if len(recent) >= 5:
            diffs = [recent[i+1] - recent[i] for i in range(len(recent)-1)]
            if max(diffs) - min(diffs) > 2 * abs(sum(diffs) / len(diffs)):
                return "OSCILLATING"
        return "HEALTHY"

    def check_activations(self):
        issues = []
        for name, stats in self.activation_stats.items():
            if stats["fraction_zero"] > 0.5:
                issues.append(f"DEAD_NEURONS: {name} has {stats['fraction_zero']:.0%} zero activations")
            if abs(stats["mean"]) > 10:
                issues.append(f"EXPLODING_ACTIVATIONS: {name} mean={stats['mean']:.2f}")
            if stats["std"] < 1e-6:
                issues.append(f"COLLAPSED_ACTIVATIONS: {name} std={stats['std']:.2e}")
        return issues if issues else ["HEALTHY"]

    def check_gradients(self):
        issues = []
        grad_magnitudes = []
        for name, stats in self.gradient_stats.items():
            grad_magnitudes.append((name, stats["abs_mean"]))
            if stats["abs_mean"] < 1e-7:
                issues.append(f"VANISHING_GRADIENT: {name} abs_mean={stats['abs_mean']:.2e}")
            if stats["abs_mean"] > 100:
                issues.append(f"EXPLODING_GRADIENT: {name} abs_mean={stats['abs_mean']:.2e}")
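        if len(grad_magnitudes) >= 2:
            # Backward hooks fire from the output side first, so the first
            # recorded entry is the layer closest to the loss and the last
            # entry is the layer closest to the input.
            output_side_mag = grad_magnitudes[0][1]
            input_side_mag = grad_magnitudes[-1][1]
            if input_side_mag > 0 and output_side_mag / input_side_mag > 100:
                issues.append(f"GRADIENT_RATIO: output-side/input-side = {output_side_mag/input_side_mag:.0f}x (vanishing)")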
        return issues if issues else ["HEALTHY"]

    def print_report(self):
        print("\n=== Network Debugger Report ===")
        print(f"\nLoss status: {self.check_loss_health()}")
        if self.loss_history:
            print(f"  Last 5 losses: {[f'{v:.4f}' for v in self.loss_history[-5:]]}")
        print("\nActivation diagnostics:")
        for item in self.check_activations():
            print(f"  {item}")
        print("\nGradient diagnostics:")
        for item in self.check_gradients():
            print(f"  {item}")
        print("\nPer-layer activation statistics:")
        for name, stats in self.activation_stats.items():
            print(f"  {name}: mean={stats['mean']:.4f} std={stats['std']:.4f} zero={stats['fraction_zero']:.1%}")
        print("\nPer-layer gradient statistics:")
        for name, stats in self.gradient_stats.items():
            print(f"  {name}: abs_mean={stats['abs_mean']:.2e} max={stats['max']:.2e}")

    def remove_hooks(self):
        for hook in self.hooks:
            hook.remove()
        self.hooks.clear()

Step 2: Overfit-One-Batch Test

def overfit_one_batch(model, x_batch, y_batch, criterion, lr=0.01, steps=200):
    """Train on a single batch; return True if the final loss is at most 0.1."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    model.train()
    print("\n=== Overfit-One-Batch Test ===")
    print(f"Batch size: {x_batch.shape[0]}, steps: {steps}")

    for step in range(steps):
        optimizer.zero_grad()
        output = model(x_batch)
        loss = criterion(output, y_batch)
        loss.backward()
        optimizer.step()

        if step % 50 == 0 or step == steps - 1:
            with torch.no_grad():
                # Single-logit outputs are thresholded at 0; otherwise take the argmax class.
                preds = (output > 0).float() if output.shape[-1] == 1 else output.argmax(dim=1)
                targets = y_batch if y_batch.dim() == 1 else y_batch.squeeze()
                acc = (preds.squeeze() == targets).float().mean().item()
            print(f"  Step {step:3d} | loss: {loss.item():.6f} | accuracy: {acc:.1%}")

    final_loss = loss.item()
    if final_loss > 0.1:
        print(f"\n  FAIL: loss did not converge ({final_loss:.4f}). The model or training loop is broken.")
        return False
    print(f"\n  PASS: loss converged to {final_loss:.6f}")
    return True

Step 3: Learning Rate Finder

import copy


def find_learning_rate(model, x_data, y_data, criterion, start_lr=1e-7, end_lr=10, steps=100):
    # Save the weights so the sweep leaves the model unchanged.
    original_state = copy.deepcopy(model.state_dict())
    optimizer = torch.optim.SGD(model.parameters(), lr=start_lr)
    # Multiply the learning rate by a constant factor each step (geometric sweep).
    lr_mult = (end_lr / start_lr) ** (1 / steps)

    model.train()
    results = []
    best_loss = float("inf")
    current_lr = start_lr

    print("\n=== Learning Rate Finder ===")

    for step in range(steps):
        optimizer.zero_grad()
        output = model(x_data)
        loss = criterion(output, y_data)

        # Stop once the loss is NaN or has grown to 10x the best value seen.
        if math.isnan(loss.item()) or loss.item() > best_loss * 10:
            break

        best_loss = min(best_loss, loss.item())
        results.append((current_lr, loss.item()))

        loss.backward()
        optimizer.step()

        current_lr *= lr_mult
        for param_group in optimizer.param_groups:
            param_group["lr"] = current_lr

    model.load_state_dict(original_state)

    if len(results) < 10:
        print("  Learning rate sweep did not complete: the loss diverged too quickly")
        return results

    # Heuristic: step back 10 sweep steps from the minimum-loss learning rate.
    min_loss_idx = min(range(len(results)), key=lambda i: results[i][1])
    suggested_lr = results[max(0, min_loss_idx - 10)][0]

    print(f"  Swept {len(results)} steps from {start_lr:.0e} to {results[-1][0]:.0e}")
    print(f"  Minimum loss {results[min_loss_idx][1]:.4f} at lr={results[min_loss_idx][0]:.2e}")
    print(f"  Suggested learning rate: {suggested_lr:.2e}")

    return results

Step 4: Gradient Checker

def _flat_to_multi_index(flat_idx, shape):
    multi_idx = []
    remaining = flat_idx
    for dim in reversed(shape):
        multi_idx.insert(0, remaining % dim)
        remaining //= dim
    return tuple(multi_idx)


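def gradient_check(model, x, y, criterion, eps=1e-4):
    model.train()
    x_double = x.double()
    # Integer class-index targets (e.g. for CrossEntropyLoss) must stay integer;
    # only floating-point targets are promoted to float64.
    y_double = y.double() if y.is_floating_point() else y
    # nn.Module.double() converts in place and returns the same module.
    model_double = model.double()

    print("\n=== Gradient Check ===")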
    overall_max_diff = 0
    checked = 0

    for name, param in model_double.named_parameters():
        if not param.requires_grad:
            continue

        layer_max_diff = 0

        model_double.zero_grad()
        output = model_double(x_double)
        loss = criterion(output, y_double)
        loss.backward()
        analytical_grad = param.grad.clone()

        num_checks = min(5, param.numel())
        for i in range(num_checks):
            idx = _flat_to_multi_index(i, param.shape)
            original = param.data[idx].item()

            param.data[idx] = original + eps
            with torch.no_grad():
                loss_plus = criterion(model_double(x_double), y_double).item()

            param.data[idx] = original - eps
            with torch.no_grad():
                loss_minus = criterion(model_double(x_double), y_double).item()

            param.data[idx] = original

            numerical = (loss_plus - loss_minus) / (2 * eps)
            analytical = analytical_grad[idx].item()

            denom = max(abs(numerical), abs(analytical), 1e-8)
            rel_diff = abs(numerical - analytical) / denom

            layer_max_diff = max(layer_max_diff, rel_diff)
            checked += 1

        overall_max_diff = max(overall_max_diff, layer_max_diff)
        status = "OK" if layer_max_diff < 1e-5 else "MISMATCH"
        print(f"  {name}: max_rel_diff={layer_max_diff:.2e} [{status}]")

    # Restore float32 (assumes the model was float32 before the check).
    model.float()

    print(f"\n  Checked {checked} parameter entries")
    if overall_max_diff < 1e-5:
        print("  PASS: gradients match (rel_diff < 1e-5)")
    elif overall_max_diff < 1e-3:
        print("  WARNING: small discrepancy (1e-5 < rel_diff < 1e-3)")
    else:
        print("  FAIL: gradient mismatch detected (rel_diff > 1e-3)")
    return overall_max_diff

Step 5: Deliberately Broken Networks

Now apply the tools to broken networks and diagnose each one.

def demo_broken_networks():
    torch.manual_seed(42)
    x = torch.randn(64, 10)
    y = (x[:, 0] > 0).long()

    print("\n" + "=" * 60)
    print("Bug 1: learning rate too high (lr=10)")
    print("=" * 60)
    model1 = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
    debugger1 = NetworkDebugger(model1)
    optimizer1 = torch.optim.SGD(model1.parameters(), lr=10.0)
    criterion = nn.CrossEntropyLoss()
    for step in range(20):
        optimizer1.zero_grad()
        out = model1(x)
        loss = criterion(out, y)
        debugger1.record_loss(loss.item())
        loss.backward()
        optimizer1.step()
    debugger1.print_report()
    debugger1.remove_hooks()

    print("\n" + "=" * 60)
    print("Bug 2: dead ReLUs caused by bad initialization")
    print("=" * 60)
    model2 = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 32), nn.ReLU(), nn.Linear(32, 2))
    with torch.no_grad():
        for m in model2.modules():
            if isinstance(m, nn.Linear):
                m.weight.fill_(-1.0)
                m.bias.fill_(-5.0)
    debugger2 = NetworkDebugger(model2)
    optimizer2 = torch.optim.Adam(model2.parameters(), lr=1e-3)
    for step in range(50):
        optimizer2.zero_grad()
        out = model2(x)
        loss = criterion(out, y)
        debugger2.record_loss(loss.item())
        loss.backward()
        optimizer2.step()
    debugger2.print_report()
    debugger2.remove_hooks()

    print("\n" + "=" * 60)
    print("Bug 3: missing zero_grad (gradients accumulate)")
    print("=" * 60)
    model3 = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
    debugger3 = NetworkDebugger(model3)
    optimizer3 = torch.optim.SGD(model3.parameters(), lr=0.01)
    for step in range(50):
        out = model3(x)
        loss = criterion(out, y)
        debugger3.record_loss(loss.item())
        loss.backward()
        optimizer3.step()
    debugger3.print_report()
    debugger3.remove_hooks()

    print("\n" + "=" * 60)
    print("Healthy network: correct setup for comparison")
    print("=" * 60)
    model_good = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
    debugger_good = NetworkDebugger(model_good)
    optimizer_good = torch.optim.Adam(model_good.parameters(), lr=1e-3)
    for step in range(50):
        optimizer_good.zero_grad()
        out = model_good(x)
        loss = criterion(out, y)
        debugger_good.record_loss(loss.item())
        loss.backward()
        optimizer_good.step()
    debugger_good.print_report()
    debugger_good.remove_hooks()

    print("\n" + "=" * 60)
    print("Overfit-one-batch test (healthy model)")
    print("=" * 60)
    model_test = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
    overfit_one_batch(model_test, x[:8], y[:8], criterion)

    print("\n" + "=" * 60)
    print("Learning rate finder")
    print("=" * 60)
    model_lr = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
    find_learning_rate(model_lr, x, y, criterion)

    print("\n" + "=" * 60)
    print("Gradient check")
    print("=" * 60)
    model_grad = nn.Sequential(nn.Linear(10, 8), nn.ReLU(), nn.Linear(8, 2))
    gradient_check(model_grad, x[:4], y[:4], criterion)


if __name__ == "__main__":
    demo_broken_networks()

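Use It

PyTorch Built-in Tools

import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(768, 256),
    nn.ReLU(),
    nn.Linear(256, 10),
)

# Placeholder batch and loss for illustration; substitute your own.
input_tensor = torch.randn(32, 768)
target = torch.randint(0, 10, (32,))
criterion = nn.CrossEntropyLoss()

# detect_anomaly raises an error at the backward operation that produced NaN.
with torch.autograd.detect_anomaly():
    output = model(input_tensor)
    loss = criterion(output, target)
    loss.backward()

# Print the mean absolute gradient of each parameter.
for name, param in model.named_parameters():
    if param.grad is not None:
        print(f"{name}: grad_mean={param.grad.abs().mean():.2e}")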

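Weights & Biases Integration

import torch
import wandb

# Assumes `model`, `optimizer`, and `train_one_epoch()` are defined elsewhere,
# and that gradients from the last batch are still present after each epoch.
wandb.init(project="debug-training")

for epoch in range(100):
    loss = train_one_epoch()
    metrics = {
        "loss": loss,
        "lr": optimizer.param_groups[0]["lr"],
        # With max_norm=inf nothing is clipped; the call only returns the total norm.
        "grad_norm": torch.nn.utils.clip_grad_norm_(model.parameters(), float("inf")).item(),
    }
    for name, param in model.named_parameters():
        if param.grad is not None:
            metrics[f"grad/{name}"] = wandb.Histogram(param.grad.cpu().numpy())
    # One log call per epoch keeps all metrics on the same step.
    wandb.log(metrics)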

TensorBoard

from torch.utils.tensorboard import SummaryWriter

# Assumes `model` and `train_one_epoch()` are defined elsewhere.
writer = SummaryWriter("runs/debug_experiment")

for epoch in range(100):
    loss = train_one_epoch()
    writer.add_scalar("Loss/train", loss, epoch)

    # Log weight and gradient histograms for every parameter.
    for name, param in model.named_parameters():
        writer.add_histogram(f"weights/{name}", param, epoch)
        if param.grad is not None:
            writer.add_histogram(f"gradients/{name}", param.grad, epoch)

writer.close()

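Pre-Training Debug Checklist

Run the overfit-one-batch test. If it fails, stop.
Print a model summary and check that the parameter count is reasonable.
Run a single forward pass on random data and check the output shape.
Train for just 5 epochs and confirm that the loss decreases.
Check activation statistics: there should be no dead layers and no explosion.
Check gradient flow: there should be neither vanishing nor exploding gradients.
Validate the data pipeline: print 5 random samples with their labels.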

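Ship It

The final deliverables for this lesson are:

outputs/prompt-nn-debugger.md: a prompt that diagnoses neural network training failures from symptoms such as loss curves, gradient statistics, and activation patterns
outputs/skill-debug-checklist.md: a decision-tree checklist for debugging training issues

Deployment patterns for debugging:

Add monitoring hooks to production training scripts.
Log activation and gradient statistics to W&B or TensorBoard every N steps.
Implement automatic alerts for NaN loss, dead neurons (zero-activation fraction >80%), and exploding gradients.
Always run the overfit-one-batch test when you change the architecture or the data pipeline.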
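The automatic alerts in the list above could start as a single function called every N steps. This is a minimal sketch: `training_alerts` is a hypothetical helper, the 80% dead-neuron threshold comes from the list, and the gradient-norm threshold of 100 is an assumed default that you would tune per model.

```python
import math


def training_alerts(loss, zero_fraction, grad_norm,
                    dead_threshold=0.8, grad_threshold=100.0):
    """Return a list of alert names for one monitoring snapshot."""
    alerts = []
    if math.isnan(loss) or math.isinf(loss):
        alerts.append("NAN_OR_INF_LOSS")
    if zero_fraction > dead_threshold:
        alerts.append("DEAD_NEURONS")
    # A NaN gradient norm is treated as exploding as well.
    if math.isnan(grad_norm) or grad_norm > grad_threshold:
        alerts.append("EXPLODING_GRADIENT")
    return alerts


print(training_alerts(loss=0.42, zero_fraction=0.35, grad_norm=1.7))
print(training_alerts(loss=float("nan"), zero_fraction=0.92, grad_norm=5e3))
```

The returned names can be forwarded to whatever alerting channel the training job already uses.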

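Exercises

(Easy) Add an exploding-gradient detector. Modify NetworkDebugger to detect when gradients exceed a threshold and automatically suggest a gradient clipping value. Test it on a 20-layer network without normalization.
(Medium) Build a dead-neuron resurrector. Write a function that identifies dead ReLU neurons that always output 0 and reinitializes their incoming weights with Kaiming initialization. Show that a network with more than 70% dead neurons recovers.
(Medium) Add plotting to the learning rate finder. Extend find_learning_rate to save its results to CSV, and write a separate script that reads the CSV and plots the loss-versus-learning-rate curve with matplotlib. Identify the best learning rate for ResNet-18 on CIFAR-10.
(Hard) Build a data pipeline validator. Write a function that checks for duplicate samples across the train/test split, label imbalance (ratio >10:1), input normalization (mean near 0, std near 1), and NaN/Inf values in the data. Run it on a deliberately corrupted dataset.
(Hard) Debug a real failure. Plant a subtle bug in the mini-framework from Lesson 10, for example by transposing a weight matrix in the backward pass. Use gradient checking to pinpoint which parameter has the wrong gradient, and document your debugging process.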

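Key Terms

Term	What people say	What it actually means
Silent bug	"It runs but the results are bad"	A bug that raises no error but degrades model quality; the dominant failure mode in machine learning
Dead ReLU	"The neuron died"	A ReLU neuron whose input is always negative, so it outputs 0 and its gradient is permanently 0
Vanishing gradients	"Early layers stop learning"	Gradients shrink exponentially through the layers until the early-layer weights are effectively frozen
Exploding gradients	"Loss goes NaN"	Gradients grow exponentially through the layers until weight updates overflow
Gradient checking	"Verify that backprop is right"	Comparing analytical gradients from backpropagation with numerical gradients from finite differences
Overfit-one-batch	"The most important debug test"	Training on one small batch to confirm that the model can learn at all; if it cannot, something is fundamentally broken
LR finder	"Find the right learning rate"	Increasing the learning rate exponentially over one epoch and choosing a value about 10x below the point of steepest loss decrease, before the loss diverges
Data leakage	"Test data leaked into training"	Information from the test set contaminating training and artificially inflating accuracy
Activation statistics	"Monitor layer health"	Tracking the mean, standard deviation, and zero fraction of each layer's output to detect dead, saturated, or exploding neurons
Gradient clipping	"Cap the gradient size"	Scaling gradients down when their norm exceeds a threshold, which prevents explosive updates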

Further Reading

Smith, "Cyclical Learning Rates for Training Neural Networks" (2017): introduces the learning rate range test (the LR finder).
Northcutt et al., "Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks" (2021): estimates an average of at least 3.3% label errors across the test sets of ten widely used benchmarks, including about 6% of the ImageNet validation set.
Zhang et al., "Understanding Deep Learning Requires Rethinking Generalization" (2017): shows that neural networks can memorize even random labels, which helps explain why the overfit-one-batch test is valid.
PyTorch documentation for torch.autograd.detect_anomaly and torch.autograd.set_detect_anomaly: reference for the built-in NaN/Inf detection.

Review Questions

1. The loss becomes NaN after a few training steps. What is the most likely cause?

2. Training loss decreases, but validation loss is flat from the start. What does that mean?

3. The loss curve is completely flat and the loss never decreases. What should you check first?
