Debugging and Profiling

The worst AI bugs don't crash. They train quietly on garbage data and show you a beautiful loss curve.

Type: Build
Language: Python
Prerequisites: Lesson 1 (Dev Environment), basic PyTorch experience
Estimated time: about 60 minutes

Learning objectives

Use conditional breakpoint() calls and debug_print to inspect tensor shapes, dtypes, and NaN values during training.
Profile a training loop with cProfile, line_profiler, and tracemalloc to find bottlenecks.
Detect AI-specific bugs such as shape mismatches, NaN loss, data leakage, and wrong-device tensors.
Visualize loss curves, weight histograms, and gradient distributions with TensorBoard.

The problem

AI code fails differently from ordinary code. A web app crashes with a stack trace. A misconfigured training loop can run for eight hours, burn through GPU budget, and still produce a model that predicts the mean of every input. The code raises no error. The cause might be a tensor on the wrong device, a forgotten .detach(), or labels that leaked into the features.

You need debugging tools that catch these silent failures before they waste time and compute.

Pre-test

2 questions · Check how much you already know before starting this lesson

1. What fundamentally distinguishes debugging AI/ML code from debugging a typical web application?

2. What does a profiler measure?

Concepts

AI debugging happens at three levels.

graph TD
    L3["3. Training Dynamics<br>loss curves, gradient norms, activations"] --> L2
    L2["2. Tensor Operations<br>shape, dtype, device, NaN/Inf values"] --> L1
    L1["1. Standard Python<br>breakpoints, logging, profiling, memory"]

Many people jump straight to level 3 and look only at TensorBoard. But a large share of AI bugs live at levels 1 and 2.

Build it

Part 1: Print debugging

Print debugging looks crude, but it is very effective for tensor code because a single line shows shape, dtype, device, and value range at once.

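def debug_print(name, tensor):
    # Compute statistics on a float copy so integer tensors (e.g. class labels)
    # do not make mean() raise a dtype error.
    stats = tensor.detach().float()
    print(f"{name}: shape={tensor.shape}, dtype={tensor.dtype}, "
          f"device={tensor.device}, "
          f"min={stats.min().item():.4f}, max={stats.max().item():.4f}, "
          f"mean={stats.mean().item():.4f}, "
          f"has_nan={stats.isnan().any().item()}")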

Call it after any suspicious operation. Once you have found the bug, remove the print statements.

Part 2: Python debugger (`pdb` and `breakpoint`)

The built-in debugger is just as useful for AI work. Put breakpoint() inside the training loop and inspect tensors interactively.

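import torch

def training_step(model, batch, criterion, optimizer):
    inputs, labels = batch
    outputs = model(inputs)
    loss = criterion(outputs, labels)

    # Pause only when the loss looks wrong, so normal steps run uninterrupted
    if loss.item() > 100 or torch.isnan(loss):
        breakpoint()

    optimizer.zero_grad()  # clear gradients left over from the previous step
    loss.backward()
    optimizer.step()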

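Commands you will use most at the debugger prompt:

p outputs.shape: check the shape
p loss.item(): check the loss value
p torch.isnan(outputs).sum(): count NaN values
p model.fc1.weight.grad: inspect a gradient (left over from earlier steps, because the breakpoint fires before loss.backward(); fc1 stands for whatever layer your model has)
c: continue
q: quit

Conditional debugging lets a 10,000-step training run stop only at the moment something goes wrong.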

Part 3: Python logging

Once you need more than a quick check, use logging instead of print statements.

import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(message)s",
    handlers=[
        logging.FileHandler("training.log"),
        logging.StreamHandler(),
    ],
)
logger = logging.getLogger(__name__)

logger.info("Training started: lr=%.4f, batch_size=%d", lr, batch_size)
logger.warning("Loss spike detected: %.4f at step %d", loss.item(), step)
logger.error("NaN loss at step %d, stopping", step)

Logging gives you timestamps, severity levels, and file output. When training fails in the middle of the night, a log file beats terminal scrollback.
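
The loss-spike warning in the snippet above needs some rule for what counts as a spike. Here is a minimal sketch, assuming a spike means the loss exceeds a multiple of its recent moving average. The class name `SpikeDetector` and the `window` and `factor` defaults are illustrative assumptions, not part of this lesson's code.

```python
import logging
import math
from collections import deque

logger = logging.getLogger("training")


class SpikeDetector:
    """Flags a loss that is NaN/Inf or far above its recent moving average."""

    def __init__(self, window=50, factor=3.0):
        self.history = deque(maxlen=window)  # most recent non-spike losses
        self.factor = factor                 # assumed threshold multiplier

    def check(self, loss_value, step):
        """Return True if this step looks abnormal, and log the reason."""
        if not math.isfinite(loss_value):
            logger.error("Non-finite loss %r at step %d", loss_value, step)
            return True
        spike = False
        # Only judge spikes once the window is full, to avoid noisy early steps
        if len(self.history) == self.history.maxlen:
            baseline = sum(self.history) / len(self.history)
            if loss_value > self.factor * baseline:
                logger.warning("Loss spike: %.4f vs. average %.4f at step %d",
                               loss_value, baseline, step)
                spike = True
        if not spike:
            self.history.append(loss_value)  # keep spikes out of the baseline
        return spike
```

In a training loop you would call something like `detector.check(loss.item(), step)` once per step, and optionally drop into `breakpoint()` when it returns True.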

Part 4: Timing code sections

You can only optimize once you know where the time goes.

import time

class Timer:
    def __init__(self, name=""):
        self.name = name

    def __enter__(self):
        self.start = time.perf_counter()
        return self

    def __exit__(self, *args):
        elapsed = time.perf_counter() - self.start
        print(f"[{self.name}] {elapsed:.4f}s")

with Timer("data loading"):
    batch = next(dataloader_iter)

with Timer("forward pass"):
    outputs = model(batch)

with Timer("backward pass"):
    loss.backward()

A common finding is that data loading dominates, for example taking 60% of training time. In that case the fix may be setting num_workers > 0 on the DataLoader, not buying a faster GPU.
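
One caveat when timing GPU code: CUDA kernels are launched asynchronously, so a plain wall-clock timer around the forward pass may measure little more than the launch overhead. The sketch below is one way to handle that, assuming it is acceptable to synchronize the device at the start and end of each timed region. The name `SyncTimer` is hypothetical, and the optional import is only there so the example also runs on machines without PyTorch.

```python
import time

try:
    import torch  # optional: the timer still works without PyTorch
except ImportError:
    torch = None


def _cuda_sync():
    """Wait for queued GPU work to finish, if a CUDA device is available."""
    if torch is not None and torch.cuda.is_available():
        torch.cuda.synchronize()


class SyncTimer:
    """Context manager that synchronizes CUDA before reading the clock."""

    def __init__(self, name=""):
        self.name = name
        self.elapsed = None

    def __enter__(self):
        _cuda_sync()
        self.start = time.perf_counter()
        return self

    def __exit__(self, *exc):
        _cuda_sync()
        self.elapsed = time.perf_counter() - self.start
        print(f"[{self.name}] {self.elapsed:.4f}s")
        return False  # never swallow exceptions from the timed block
```

Synchronizing slows the run slightly, so this is meant for measurement sessions rather than for every production step.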

Part 5: `cProfile` and `line_profiler`

Use these when you need more detail than manual timers provide.

python -m cProfile -s cumtime train.py
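
Profiling the whole script also counts imports and setup. If only one function is of interest, cProfile can be driven from code instead. A minimal sketch, where `profile_call` is a hypothetical helper and `slow_sum` is a stand-in workload rather than a real training step:

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=10, **kwargs):
    """Run func under cProfile; return (result, report sorted by cumulative time)."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        result = func(*args, **kwargs)
    finally:
        profiler.disable()
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)  # keep only the top entries
    return result, buffer.getvalue()


def slow_sum(n):
    """Stand-in for a training step (hypothetical workload)."""
    return sum(i * i for i in range(n))


if __name__ == "__main__":
    value, report = profile_call(slow_sum, 200_000)
    print(report)
```

The report has the same columns as the command-line output, restricted to the calls made inside the profiled function.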

For line-by-line profiling, use line_profiler.

uv pip install line_profiler

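import torch.nn.functional as F

# kernprof injects the @profile decorator at runtime; running this file
# with plain `python` raises NameError unless the decorator is removed.
@profile
def train_step(model, data, target):
    output = model(data)
    loss = F.cross_entropy(output, target)
    loss.backward()
    return loss

# Run with: kernprof -l -v train.py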

Part 6: Memory profiling

CPU memory with `tracemalloc`

import tracemalloc

tracemalloc.start()

model = build_model()
data = load_dataset()

snapshot = tracemalloc.take_snapshot()
top_stats = snapshot.statistics("lineno")
for stat in top_stats[:10]:
    print(stat)
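
A single snapshot shows everything allocated since tracing began. To see what one specific step added, two snapshots can be compared. A minimal sketch, assuming the step of interest can be wrapped in a function; `top_memory_growth` and `load_fake_dataset` are hypothetical names used only for illustration:

```python
import tracemalloc


def top_memory_growth(func, limit=5):
    """Run func and return (its result, largest allocation diffs by source line)."""
    tracemalloc.start()
    try:
        before = tracemalloc.take_snapshot()
        result = func()  # keep the result alive so its memory is still counted
        after = tracemalloc.take_snapshot()
    finally:
        tracemalloc.stop()
    diffs = after.compare_to(before, "lineno")  # sorted by size of the change
    return result, diffs[:limit]


def load_fake_dataset():
    """Hypothetical stand-in for load_dataset(): about 1 MB of bytes objects."""
    return [bytes(1024) for _ in range(1000)]


if __name__ == "__main__":
    data, growth = top_memory_growth(load_fake_dataset)
    for stat in growth:
        print(stat)
```

Each entry reports the file and line responsible, plus how many bytes and blocks were added between the two snapshots.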

CPU memory with `memory_profiler`

uv pip install memory_profiler

from memory_profiler import profile

@profile
def load_data():
    raw = read_csv("data.csv")
    processed = preprocess(raw)
    return processed

The following command shows line-by-line memory usage.

python -m memory_profiler your_script.py

GPU memory with PyTorch

import torch

if torch.cuda.is_available():
    print(torch.cuda.memory_summary())

    print(f"Allocated memory: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
    print(f"Reserved (cached) memory: {torch.cuda.memory_reserved() / 1e9:.2f} GB")

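When you hit an out-of-memory (OOM) error, try the following in order.

1. Reduce the batch size.
2. Release cached memory with torch.cuda.empty_cache(). This returns unused cached blocks to the GPU driver; it does not free memory held by live tensors.
3. For large intermediate results, use del tensor followed by torch.cuda.empty_cache().
4. Reduce memory usage with mixed precision (torch.cuda.amp, or torch.amp in newer PyTorch releases).
5. For very deep models, use gradient checkpointing.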
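
The first step in the list above, reducing the batch size, can be automated. A minimal sketch, assuming the caller supplies a `try_step(batch_size)` function (hypothetical) that runs one forward and backward pass, and assuming OOM surfaces as a RuntimeError whose message contains "out of memory", which is how PyTorch CUDA OOM errors have appeared in the versions I am aware of:

```python
def find_batch_size(try_step, start=256, minimum=1):
    """Halve the batch size until try_step(batch_size) stops raising OOM.

    try_step is a caller-supplied function (hypothetical) that runs one
    forward/backward pass at the given batch size.
    """
    batch_size = start
    while batch_size >= minimum:
        try:
            try_step(batch_size)
            return batch_size
        except RuntimeError as err:
            # Assumption: OOM errors mention "out of memory" in their message.
            # Any other RuntimeError is a real bug and is re-raised.
            if "out of memory" not in str(err).lower():
                raise
            batch_size //= 2
    raise RuntimeError(f"Even batch size {minimum} does not fit in memory")
```

In real use it is probably worth calling torch.cuda.empty_cache() between attempts so that a failed allocation does not leave fragmented cached blocks behind.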

Part 7: Common AI bugs and how to catch them

Shape mismatch

This is the most common bug. The model expects [batch, channels, height, width], but the tensor arrives as [batch, features].

import torch

def check_shapes(model, sample_input):
    print(f"Input: {sample_input.shape}")
    hooks = []

    def make_hook(name):
        def hook(module, inp, out):
            # Forward hooks receive the positional inputs as a tuple
            in_shape = inp[0].shape if inp and hasattr(inp[0], "shape") else type(inp)
            out_shape = out.shape if hasattr(out, "shape") else type(out)
            print(f"  {name or '(root)'}: {in_shape} -> {out_shape}")
        return hook

    for name, module in model.named_modules():
        hooks.append(module.register_forward_hook(make_hook(name)))

    try:
        with torch.no_grad():
            model(sample_input)
    finally:
        # Remove the hooks even if the forward pass raises on a shape mismatch
        for h in hooks:
            h.remove()

Run it once with a sample batch to see how shapes change inside the model.

NaN loss

A NaN loss means something blew up. Common causes:

Learning rate too high
Division by zero in a custom loss
Taking the log of zero or a negative number
Exploding gradients in RNNs

import torch

def detect_nan(model, loss, step):
    if torch.isnan(loss):
        print(f"NaN loss detected at step {step}")
        # Report which parameters carry the bad gradients
        for name, param in model.named_parameters():
            if param.grad is not None:
                if torch.isnan(param.grad).any():
                    print(f"  NaN gradient in {name}")
                if torch.isinf(param.grad).any():
                    print(f"  Inf gradient in {name}")
        return True
    return False
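
It helps to see how NaN behaves in plain floating-point arithmetic, since tensors follow the same IEEE 754 rules. The sketch below uses only the standard library; `safe_log` and its `eps` floor are illustrative assumptions, not a recommended constant. Note that Python's math.log raises on zero, whereas torch.log returns -inf for zero and NaN for negative inputs.

```python
import math

inf = float("inf")

# Once a value overflows to inf, further arithmetic can turn it into NaN.
nan_from_sub = inf - inf
nan_from_mul = 0.0 * inf

# NaN compares unequal to everything, including itself, so equality
# checks never fire. Use an isnan-style predicate instead.
assert nan_from_sub != nan_from_sub
assert math.isnan(nan_from_sub) and math.isnan(nan_from_mul)

# NaN is contagious: one bad element poisons every reduction over it.
losses = [0.7, 0.6, nan_from_sub, 0.5]
assert math.isnan(sum(losses) / len(losses))


def safe_log(x, eps=1e-12):
    """Clamp the input away from zero before taking the log (eps is an assumed floor)."""
    return math.log(max(x, eps))
```

The same clamping idea applies to tensors in a custom loss, where the input to the log is bounded below by a small positive value before the call.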

Data leakage

Test set accuracy is 99%. That looks great, but it may be a bug.

def check_data_leakage(train_set, test_set, id_column="id"):
    train_ids = set(train_set[id_column].tolist())
    test_ids = set(test_set[id_column].tolist())
    overlap = train_ids & test_ids
    if overlap:
        print(f"DATA LEAKAGE: {len(overlap)} samples appear in both train and test")
        return True
    return False

Check for temporal leakage as well. To avoid predicting the past from future data, you may need to split by timestamp.
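
A timestamp-based split and a check for temporal leakage can be sketched as follows. This assumes each record is a dict with a comparable `timestamp` field; the function names and the `time_key` parameter are hypothetical.

```python
from datetime import datetime


def temporal_split(records, cutoff, time_key="timestamp"):
    """Records strictly before `cutoff` go to train; the rest go to test."""
    train = [r for r in records if r[time_key] < cutoff]
    test = [r for r in records if r[time_key] >= cutoff]
    return train, test


def has_temporal_leak(train, test, time_key="timestamp"):
    """True if any training record is as late as the earliest test record."""
    if not train or not test:
        return False
    latest_train = max(r[time_key] for r in train)
    earliest_test = min(r[time_key] for r in test)
    return latest_train >= earliest_test


if __name__ == "__main__":
    rows = [{"timestamp": datetime(2024, m, 1), "y": m} for m in range(1, 7)]
    train, test = temporal_split(rows, cutoff=datetime(2024, 5, 1))
    print(len(train), len(test), has_temporal_leak(train, test))
```

A random split of the same rows would typically fail the check, which is the situation the paragraph above warns about.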

Wrong device

Mixing CPU and GPU tensors raises a runtime error. Sometimes, though, some tensors simply stay on the CPU and training quietly gets slower.

def check_devices(model, *tensors):
    model_device = next(model.parameters()).device
    print(f"Model device: {model_device}")
    for i, t in enumerate(tensors):
        if t.device != model_device:
            print(f"  WARNING: tensor {i} is on {t.device}, model is on {model_device}")
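
Once a mismatch is found, the usual fix is to move the whole batch to the model's device in one place. A minimal sketch, assuming batches are nested dicts, lists, or tuples and that anything with a `.to()` method (as torch.Tensor has) should be moved. `move_to_device` is a hypothetical helper, written with duck typing so it does not import PyTorch itself.

```python
def move_to_device(batch, device):
    """Recursively move every tensor-like object in a nested batch to `device`.

    Anything exposing a .to() method is moved; other leaves such as
    strings or ints are returned unchanged. Named tuples come back as
    plain tuples in this simplified version.
    """
    if hasattr(batch, "to"):
        return batch.to(device)
    if isinstance(batch, dict):
        return {key: move_to_device(value, device) for key, value in batch.items()}
    if isinstance(batch, list):
        return [move_to_device(value, device) for value in batch]
    if isinstance(batch, tuple):
        return tuple(move_to_device(value, device) for value in batch)
    return batch
```

A typical call would look like `batch = move_to_device(batch, next(model.parameters()).device)` at the top of each training step.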

Part 8: TensorBoard basics

TensorBoard shows what happens during training over time.

uv pip install tensorboard

from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter("runs/experiment_1")

for step in range(num_steps):
    loss = train_step(model, batch)

    writer.add_scalar("loss/train", loss.item(), step)
    writer.add_scalar("lr", optimizer.param_groups[0]["lr"], step)

    if step % 100 == 0:
        for name, param in model.named_parameters():
            writer.add_histogram(f"weights/{name}", param, step)
            if param.grad is not None:
                writer.add_histogram(f"grads/{name}", param.grad, step)

writer.close()

Launch TensorBoard:

tensorboard --logdir=runs

What to look for:

Loss not decreasing: the learning rate may be too low, or the model architecture may be the problem.
Loss oscillating wildly: the learning rate may be too high.
Loss goes to NaN: numerical instability.
Train loss decreasing, val loss increasing: overfitting.
Weight histograms collapsing to zero: can point to vanishing gradients.
Gradient histograms exploding: gradient clipping may be needed.
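
The overfitting pattern in the list above (training loss falling while validation loss rises) can also be checked programmatically from logged values. The sketch below is a simple heuristic, not a standard algorithm: it assumes one loss value per epoch and treats `patience` consecutive epochs of diverging curves as the signal. The function name and default are hypothetical.

```python
def find_overfitting_onset(train_losses, val_losses, patience=3):
    """Return the index of the last epoch before validation loss turned upward.

    Overfitting is flagged when, for `patience` consecutive epochs,
    validation loss rises while training loss keeps falling.
    Returns None if the pattern never appears.
    """
    n = min(len(train_losses), len(val_losses))
    for end in range(patience, n):
        window = range(end - patience + 1, end + 1)
        val_rising = all(val_losses[i] > val_losses[i - 1] for i in window)
        train_falling = all(train_losses[i] < train_losses[i - 1] for i in window)
        if val_rising and train_falling:
            return end - patience
    return None
```

The returned index is a reasonable candidate for which checkpoint to keep, though noisy curves may need smoothing before a check like this is reliable.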

Part 9: VS Code debugger

For interactive debugging, you can use a VS Code launch.json.

{
    "version": "0.2.0",
    "configurations": [
        {
            "name": "Training debug",
            "type": "debugpy",
            "request": "launch",
            "program": "${file}",
            "console": "integratedTerminal",
            "justMyCode": false
        }
    ]
}

Click the gutter to set a breakpoint, then inspect tensor properties in the Variables pane. In the Debug Console, you can evaluate Python expressions at the current point of execution.

This is useful for work where you need to step through transformations one at a time, such as a data preprocessing pipeline.

Use it

A debugging workflow that catches most AI bugs:

Before training: run check_shapes on a sample batch to confirm that input and output dimensions match.
First 10 steps: use debug_print on the loss, outputs, and gradients. Confirm there are no NaNs and that values fall in a reasonable range.
During training: log the loss, learning rate, and gradient norm. Visualize them with TensorBoard.
When something goes wrong: put breakpoint() at the failure point and inspect tensors interactively.
Performance problems: time data loading, the forward pass, and the backward pass separately. If you are close to OOM, profile memory as well.
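
The workflow above mentions logging the gradient norm, which none of the earlier snippets compute. A minimal sketch, assuming the global L2 norm over all parameter gradients is the quantity of interest and that all parameters live on one device. `global_grad_norm` is a hypothetical helper, and the toy `nn.Linear` model only stands in for a real network.

```python
import torch
from torch import nn


def global_grad_norm(model):
    """L2 norm over all parameter gradients; call after loss.backward()."""
    norms = [p.grad.detach().norm(2) for p in model.parameters()
             if p.grad is not None]
    if not norms:
        return 0.0  # backward() has not been called yet
    return torch.norm(torch.stack(norms), 2).item()


if __name__ == "__main__":
    torch.manual_seed(0)
    model = nn.Linear(4, 2)  # toy model standing in for a real network
    x, y = torch.randn(8, 4), torch.randn(8, 2)
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    print(f"loss={loss.item():.4f} grad_norm={global_grad_norm(model):.4f}")
```

If gradient clipping is already in use, torch.nn.utils.clip_grad_norm_ returns the total norm it computed, so that return value can be logged instead of computing the norm separately.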

Ship it

Run the debugging toolkit script:

python phases/00-setup-and-tooling/12-debugging-and-profiling/code/debug_tools.py

A prompt for diagnosing AI-specific bugs is in outputs/prompt-debug-ai-code.md.

Exercises

Run debug_tools.py and read the output of each section. Deliberately inject a NaN into the dummy model and confirm that the detector catches it.
Profile a training loop with cProfile and find the slowest function.
Use tracemalloc to find the line in a data loading pipeline that allocates the most memory.
Attach TensorBoard to a simple training run and check whether it overfits.
Put breakpoint() inside a training loop and practice inspecting tensor shapes, devices, and gradient values at the debugger prompt.

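Key terms

Term	What people say	What it actually means
Silent failure	It broke without an error	The code runs, but the model learns the wrong thing or produces meaningless results
Profiler	A speed meter	A tool that measures time, memory, and resource usage per function or per line
NaN	Not a number	A special floating-point value produced by undefined operations such as inf - inf; in training it signals numerical instability
Data leakage	The test set leaked in	Information about the evaluation answers ends up in the training data or features
TensorBoard	The loss graph	A visualization tool for viewing training metrics, histograms, and graphs over time
OOM	Out of memory	A state where an allocation fails because CPU or GPU memory has run out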

Further reading

Python pdb Documentation: covers breakpoint() and the interactive debugger commands.
PyTorch Profiler: covers how to profile PyTorch workloads.
TensorBoard with PyTorch: covers how to log PyTorch training metrics to TensorBoard.
Python tracemalloc: the reference for tracing Python memory allocations.


Review questions

3 questions

1. A model reaches 99% accuracy on the test set. Which AI-specific bug should you suspect first?

2. When you profile the time breakdown of a training loop, what is the most common finding?

3. You see a NaN loss at step 500. Which approach helps you find the root cause?
