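Calculus for Machine Learning: Derivatives and Gradients

A derivative tells you which direction is downhill. In the end, that is what a neural network needs in order to learn.

Type: Learn | Language: Python | Prerequisites: Phase 1, Lessons 01-03 | Estimated time: about 60 minutes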

Learning objectives

Compute numerical and analytical derivatives of functions that appear often in ML (x^2, sigmoid, cross-entropy).
Implement gradient descent by hand to minimize a loss function in 1D and 2D.
Derive the gradients of a linear regression model and train it with manual weight updates.
Explain the Hessian matrix, the Taylor series approximation, and how both relate to optimization methods.

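The problem

Suppose you have a neural network with millions of weights. Each weight is a knob. To make the model even slightly less wrong, you have to work out which way to turn every knob. Calculus gives you that direction.

Without calculus, training a neural network would mean trying random changes and hoping for the best. With derivatives, you know how each weight affects the error, at least for small changes, so every update can turn the knobs in a direction that reduces it.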

Pre-test

2 questions. Check how much you already know before starting this lesson.

1. What does the derivative of a function at a point tell you?

2. In the context of machine learning, what is a gradient?


Concepts

What is a derivative?

A derivative measures a rate of change. For a function y = f(x), the derivative f'(x) tells you how much y changes when you move x by a tiny amount.

Geometrically, the derivative is the slope of the tangent line at a point.

f(x) = x^2:

x	f(x)	f'(x) (slope)
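0	0	0 (flat at the bottom)
1	1	2
2	4	4 (slope of the tangent line at this point)
3	9	6

At x=2 the slope is 4: nudge x slightly to the right and y increases by about 4 times that nudge. At x=0 the slope is 0. You are at the bottom of the bowl.

The formal definition is: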

f'(x) = lim   f(x + h) - f(x)
        h->0  -----------------
                     h

Code does not take the limit directly. It uses a very small h instead. This is the numerical derivative.
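To make the limit concrete, the sketch below evaluates the difference quotient from the definition for f(x) = x^2 at x = 2 with progressively smaller h. The helper name `difference_quotient` and the chosen values of h are illustrative assumptions, not part of the lesson code.

```python
def difference_quotient(f, x, h):
    """Forward difference quotient (f(x + h) - f(x)) / h from the limit definition."""
    return (f(x + h) - f(x)) / h

def square(x):
    return x ** 2

# As h shrinks, the quotient approaches the true slope f'(2) = 4.
quotients = {}
for h in [1.0, 0.1, 0.01, 0.001]:
    quotients[h] = difference_quotient(square, 2.0, h)
    print(f"h={h:<6}  quotient={quotients[h]:.6f}")
```

For this particular function the quotient works out to 4 + h, so the error shrinks in direct proportion to h.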

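Partial derivatives: one variable at a time

Real functions have many inputs. A neural network's loss depends on thousands of weights, often millions. A partial derivative takes the derivative with respect to one variable while holding all the others fixed.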

f(x, y) = x^2 + 3xy + y^2

df/dx = 2x + 3y     (treat y as a constant)
df/dy = 3x + 2y     (treat x as a constant)

Each partial derivative answers the question: "If I nudge only this one weight, how does the loss change?"

The gradient: a vector of all partial derivatives

The gradient collects all the partial derivatives into a single vector. For a function f(x, y, z), the gradient is:

grad f = [ df/dx, df/dy, df/dz ]

The gradient points in the direction of steepest ascent. To minimize a function, move in the opposite direction.

Contour plot of f(x,y) = x^2 + y^2:

This function forms a bowl whose contour lines are concentric circles. The minimum is at (0, 0).

Point	grad f	-grad f (descent direction)
(1, 1)	[2, 2] (uphill, away from the minimum)	[-2, -2] (downhill, toward the minimum)
(0, 0)	[0, 0] (flat at the minimum)	[0, 0]

This is gradient descent in picture form: compute the gradient, flip its sign, and take a step.
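The row for the point (1, 1) can be checked with a few lines of code. This is a minimal sketch: the function names `bowl` and `bowl_gradient` and the step size 0.1 are assumptions chosen for illustration.

```python
def bowl(x, y):
    return x ** 2 + y ** 2

def bowl_gradient(x, y):
    # Analytical gradient of x^2 + y^2.
    return [2 * x, 2 * y]

x, y = 1.0, 1.0
step = 0.1
gx, gy = bowl_gradient(x, y)

uphill = bowl(x + step * gx, y + step * gy)      # move along the gradient
downhill = bowl(x - step * gx, y - step * gy)    # move against the gradient

print(f"start={bowl(x, y):.2f}  uphill={uphill:.2f}  downhill={downhill:.2f}")
```

Stepping along the gradient raises the function value, and stepping against it lowers the value, which is the whole idea behind the descent direction.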

The connection to optimization

Training a neural network is an optimization problem. A loss function L(w1, w2, ..., wn) measures how wrong the model is, and the goal is to minimize it.

Gradient descent update rule:

  w_new = w_old - learning_rate * dL/dw

For every weight:
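  1. Compute the partial derivative of the loss with respect to that weight.
  2. Subtract a small multiple of that value from the weight.
  3. Repeat.

The learning rate controls the step size. Too large and the update overshoots the minimum or diverges; too small and training crawls.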
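The effect of the learning rate is easy to see on f(x) = x^2, where each update multiplies x by (1 - 2 * lr). The sketch below is illustrative only: the helper `run_gradient_descent` and the three learning rates are assumptions, not values from the lesson.

```python
def run_gradient_descent(lr, steps=20, start=5.0):
    """Minimize f(x) = x^2 with a fixed learning rate and return the final x."""
    x = start
    for _ in range(steps):
        x = x - lr * 2 * x   # f'(x) = 2x
    return x

# Too small, reasonable, and too large.
results = {lr: run_gradient_descent(lr) for lr in [0.001, 0.1, 1.1]}
for lr, final_x in results.items():
    print(f"lr={lr:<6}  final x={final_x:.4f}")
```

With lr = 0.001 the iterate barely moves from 5 in 20 steps, with lr = 0.1 it gets close to 0, and with lr = 1.1 every step multiplies x by -1.2, so the iterates grow without bound.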

Loss landscape (1D slice):

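The loss function L(w) traces a curve with peaks and valleys as the weight w varies.

Feature	Description
Global minimum	The lowest point on the entire curve; the best solution
Local minimum	A valley that is lower than its surroundings but not the lowest point overall
Slope	The incline that gradient descent follows downhill from whatever point it starts at

Gradient descent follows the slope downhill. It can get stuck in a local minimum, but in high-dimensional spaces with millions of weights this turns out to be a serious problem less often than you might expect.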

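Numerical vs. analytical derivatives

There are two basic ways to compute a derivative.

Analytical: apply the rules of calculus by hand. The derivative of f(x) = x^2 is f'(x) = 2x. This is exact and fast.

Numerical: approximate using the definition. Evaluate f(x+h) and f(x-h) for a very small h, then take the difference.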

Numerical (central difference):

f'(x) ~= f(x + h) - f(x - h)
          -----------------------
                  2h

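h = 0.0001 works well in practice.

Numerical derivatives are slow, because every input needs its own extra function evaluations, but they work for any function you can evaluate. Analytical derivatives are fast, but you have to derive the formula. Neural network frameworks use a third approach, automatic differentiation, which computes derivatives mechanically and exactly up to floating-point precision. Phase 3 returns to it.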
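The central difference is used here rather than the one-sided quotient from the definition because its error shrinks faster as h gets smaller. The sketch below compares the two on sin(x) at x = 1 with h = 0.0001; the helper names `forward_difference` and `central_difference` are assumptions for illustration.

```python
import math

def forward_difference(f, x, h):
    return (f(x + h) - f(x)) / h

def central_difference(f, x, h):
    return (f(x + h) - f(x - h)) / (2 * h)

x, h = 1.0, 1e-4
exact = math.cos(x)  # analytical derivative of sin(x)

forward_error = abs(forward_difference(math.sin, x, h) - exact)
central_error = abs(central_difference(math.sin, x, h) - exact)
print(f"forward error={forward_error:.2e}  central error={central_error:.2e}")
```

The forward difference has an error proportional to h, while the central difference has an error proportional to h^2, so at the same h the central version is several orders of magnitude more accurate.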

Derivatives of simple functions

These derivatives come up again and again in ML.

Function        Derivative       Used in
--------        ----------       -------
f(x) = x^2     f'(x) = 2x      Loss functions (MSE)
f(x) = wx + b  f'(w) = x        Linear layer (gradient w.r.t. weight)
                f'(b) = 1        Linear layer (gradient w.r.t. bias)
                f'(x) = w        Linear layer (gradient w.r.t. input)
f(x) = e^x     f'(x) = e^x     Softmax, attention
f(x) = ln(x)   f'(x) = 1/x     Cross-entropy loss
f(x) = 1/(1+e^-x)  f'(x) = f(x)(1-f(x))   Sigmoid activation
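The sigmoid row in the table is the least obvious one, so here is a quick numerical check of f'(x) = f(x)(1 - f(x)). This is a sketch: the function names and the test points are assumptions chosen for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_derivative(x):
    # Analytical form from the table: f'(x) = f(x) * (1 - f(x)).
    s = sigmoid(x)
    return s * (1.0 - s)

def central_difference(f, x, h=1e-5):
    return (f(x + h) - f(x - h)) / (2 * h)

for x in [-2.0, 0.0, 2.0]:
    analytical = sigmoid_derivative(x)
    numerical = central_difference(sigmoid, x)
    print(f"x={x:5.1f}  analytical={analytical:.6f}  numerical={numerical:.6f}")
```

The derivative peaks at x = 0, where it equals 0.25, and shrinks toward 0 as |x| grows.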

For f(x) = x^2:

f(x) = x^2    f'(x) = 2x

  x    f(x)   f'(x)   meaning
  -2    4      -4      slope tilts left (decreasing)
  -1    1      -2      slope tilts left (decreasing)
   0    0       0      flat (minimum!)
   1    1       2      slope tilts right (increasing)
   2    4       4      slope tilts right (increasing)

For f(w) = wx + b with x=3 and b=1:

f(w) = 3w + 1    f'(w) = 3

The derivative with respect to w is x.
When x is large, a small change in w produces a large change in the output.

The chain rule

When functions are composed, differentiate them with the chain rule.

If y = f(g(x)), then dy/dx = f'(g(x)) * g'(x)

Example: y = (3x + 1)^2
  outer: f(u) = u^2       f'(u) = 2u
  inner: g(x) = 3x + 1    g'(x) = 3
  dy/dx = 2(3x + 1) * 3 = 6(3x + 1)

A neural network is a chain of functions: input -> linear -> activation -> linear -> activation -> loss. Backpropagation applies the chain rule repeatedly from the output back toward the input. That is the entire algorithm.
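The worked example y = (3x + 1)^2 can be verified numerically. The sketch below builds the derivative from the outer and inner pieces and compares it with a central difference; the function names and the evaluation point x = 2 are assumptions for illustration.

```python
def outer(u):
    return u ** 2          # f(u) = u^2

def outer_prime(u):
    return 2 * u           # f'(u) = 2u

def inner(x):
    return 3 * x + 1       # g(x) = 3x + 1

def inner_prime(x):
    return 3.0             # g'(x) = 3

def chain_rule_derivative(x):
    # dy/dx = f'(g(x)) * g'(x)
    return outer_prime(inner(x)) * inner_prime(x)

x = 2.0
h = 1e-6
numerical = (outer(inner(x + h)) - outer(inner(x - h))) / (2 * h)
print(f"chain rule={chain_rule_derivative(x):.4f}  numerical={numerical:.4f}")
```

At x = 2 the formula 6(3x + 1) gives 42, and the finite difference agrees to several decimal places.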

The Hessian matrix

The gradient tells you the slope. The Hessian tells you the curvature.

The Hessian is the matrix of second-order partial derivatives. For a function f(x1, x2, ..., xn), the (i, j) entry of the Hessian is:

H[i][j] = d^2f / (dx_i * dx_j)

For a two-variable function f(x, y):

H = | d^2f/dx^2    d^2f/dxdy |
    | d^2f/dydx    d^2f/dy^2 |

What the Hessian tells you at a critical point (gradient = 0):

Hessian property	Meaning	Example surface
Positive definite (all eigenvalues > 0)	Local minimum	Bowl opening upward
Negative definite (all eigenvalues < 0)	Local maximum	Bowl opening downward
Indefinite (mixed eigenvalues)	Saddle point	Saddle shape

Example: f(x, y) = x^2 - y^2 (saddle function)

df/dx = 2x       df/dy = -2y
d^2f/dx^2 = 2    d^2f/dy^2 = -2    d^2f/dxdy = 0

H = | 2   0 |
    | 0  -2 |

Eigenvalues: 2 and -2 (one positive, one negative)
--> Saddle point at (0, 0)

Compare with f(x, y) = x^2 + y^2:

H = | 2  0 |
    | 0  2 |

Eigenvalues: 2 and 2 (both positive)
--> Local minimum at (0, 0)

Why the Hessian matters in ML:

Newton's method uses the Hessian to choose a better optimization step than plain gradient descent. Instead of following the slope alone, it also accounts for curvature.

Newton's update:    w_new = w_old - H^(-1) * gradient
Gradient descent:   w_new = w_old - lr * gradient

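Newton's method typically converges in fewer steps because the inverse Hessian rescales the gradient: it shortens the step along directions of high curvature and lengthens it along directions of low curvature.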
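The two update rules can be compared on a quadratic whose curvature differs by direction. The sketch below assumes f(x, y) = x^2 + 10*y^2, a start point of (4, 3), and a learning rate of 0.04; all three are illustrative choices, and the Hessian inverse is written out by hand because the matrix is diagonal.

```python
def grad(x, y):
    # Gradient of f(x, y) = x^2 + 10*y^2.
    return 2 * x, 20 * y

# The Hessian of this quadratic is constant and diagonal: [[2, 0], [0, 20]],
# so its inverse is [[1/2, 0], [0, 1/20]].
HESSIAN_INVERSE_DIAGONAL = (1 / 2, 1 / 20)

def newton_step(x, y):
    gx, gy = grad(x, y)
    return x - HESSIAN_INVERSE_DIAGONAL[0] * gx, y - HESSIAN_INVERSE_DIAGONAL[1] * gy

def gradient_descent_step(x, y, lr=0.04):
    gx, gy = grad(x, y)
    return x - lr * gx, y - lr * gy

newton_point = newton_step(4.0, 3.0)

gd_point = (4.0, 3.0)
for _ in range(10):
    gd_point = gradient_descent_step(*gd_point)

print(f"Newton after 1 step:             {newton_point}")
print(f"Gradient descent after 10 steps: ({gd_point[0]:.4f}, {gd_point[1]:.4f})")
```

On an exactly quadratic function, one Newton step lands on the minimum. Gradient descent has to use a learning rate small enough for the high-curvature y direction, which leaves it slow along the low-curvature x direction.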

The problem is cost. For a neural network with N parameters, the Hessian is N x N. With one million parameters, that is a matrix with one trillion entries. So in practice people use approximations.

Method	What it uses	Cost	Convergence
Gradient descent	First derivative only	O(N) per step	Slow (linear)
Newton's method	Full Hessian	O(N^3) per step	Fast (quadratic)
L-BFGS	Approximate Hessian built from gradient history	O(N) per step	Medium (superlinear)
Adam	Per-parameter adaptive rate (diagonal Hessian approx)	O(N) per step	Medium
Natural gradient	Fisher information matrix (statistical Hessian)	O(N^2) per step	Fast

In practical deep learning, Adam is a common default optimizer. It tracks running averages of each parameter's gradient and squared gradient, which gives a cheap per-parameter scaling that loosely stands in for second-order information.

Taylor series approximation

A smooth function can be approximated by a polynomial near any given point.

f(x + h) = f(x) + f'(x)*h + (1/2)*f''(x)*h^2 + (1/6)*f'''(x)*h^3 + ...

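Including more terms improves the approximation, but it is only accurate near the point x.

Why Taylor series matter in ML:

First-order Taylor = gradient descent. Using f(x + h) ~ f(x) + f'(x)*h is a linear approximation. Gradient descent picks h = -lr * f'(x), which lowers this linear model by lr * f'(x)^2. The linear model has no minimum of its own, so the learning rate is what keeps the step inside the region where the approximation holds.
Second-order Taylor = Newton's method. Using f(x + h) ~ f(x) + f'(x)*h + (1/2)*f''(x)*h^2 gives a quadratic model. When f''(x) > 0, minimizing it gives h = -f'(x)/f''(x).
Loss function design. MSE and cross-entropy are smooth, so their Taylor expansions are well behaved. This is no accident: smooth losses make optimization predictable.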

Approximation order    What it captures    Optimization method
-------------------    -----------------   -------------------
0th order (constant)   Just the value      Random search
1st order (linear)     Slope               Gradient descent
2nd order (quadratic)  Curvature           Newton's method
Higher orders          Finer structure     Rarely used in ML

The key idea: gradient-based optimization approximates the loss function locally, then takes a step that lowers that approximation.
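As a sketch of the second-order case, the loop below repeatedly jumps to the minimum of the local quadratic model using h = -f'(x)/f''(x). It assumes the function f(x) = x^4 - 3x^2 from the exercises and a start point of x = 2, where f''(x) stays positive along the way; both choices are illustrative.

```python
import math

def f_prime(x):
    return 4 * x ** 3 - 6 * x       # derivative of f(x) = x^4 - 3x^2

def f_double_prime(x):
    return 12 * x ** 2 - 6          # second derivative

target = math.sqrt(1.5)             # positive minimizer, where 4x^3 = 6x

x = 2.0
for step in range(8):
    # Minimizer of the local quadratic model: h = -f'(x) / f''(x).
    h = -f_prime(x) / f_double_prime(x)
    x = x + h
    print(f"step {step}  x={x:.8f}  distance to minimum={abs(x - target):.2e}")
```

The distance to the minimum shrinks slowly at first and then collapses within a couple of steps, which is the quadratic convergence listed for Newton's method.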

Integrals in ML

A derivative tells you a rate of change. An integral computes an accumulated quantity: the area under a curve.

You rarely compute integrals by hand in ML, but the concept shows up everywhere.

Probability. For a continuous random variable with density p(x):

P(a < X < b) = integral from a to b of p(x) dx

The area under the probability density curve between a and b is the probability of landing in that range.

Expected value. The average outcome, weighted by probability:

E[f(X)] = integral of f(x) * p(x) dx

The expected loss over the data distribution is an integral. Training minimizes an empirical approximation of it: the average loss over the training samples.
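The idea of replacing an integral with an average over samples can be sketched in a few lines. The example assumes p(x) is a standard normal and f(x) = x^2, chosen because the exact value of E[X^2] is the variance, 1; the seed and sample count are arbitrary.

```python
import random

random.seed(0)

def squared(x):
    return x ** 2

# Draw samples from p(x) = standard normal and average f(x) = x^2.
# The exact integral E[X^2] equals the variance of the distribution, which is 1.
samples = [random.gauss(0.0, 1.0) for _ in range(100_000)]
estimate = sum(squared(s) for s in samples) / len(samples)
print(f"Monte Carlo estimate of E[X^2]: {estimate:.3f}")
```

Averaging a loss over a training set has the same structure: the samples come from the data rather than from a random number generator.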

KL divergence. Measures how different two distributions are.

KL(p || q) = integral of p(x) * log(p(x) / q(x)) dx

It is used in VAEs, knowledge distillation, and Bayesian inference.
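The KL integral can be approximated numerically when both densities are known. The sketch below assumes p = N(0, 1) and q = N(1, 1), for which the closed form (mean difference squared over twice the variance) is 0.5, and uses a midpoint rule on [-10, 10]; the helper names and grid size are illustrative assumptions.

```python
import math

def normal_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))

def p(x):
    return normal_pdf(x, 0.0, 1.0)

def q(x):
    return normal_pdf(x, 1.0, 1.0)

def kl_divergence_numeric(p, q, lo=-10.0, hi=10.0, n=20_000):
    """Midpoint-rule approximation of the integral of p(x) * log(p(x) / q(x))."""
    dx = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * dx
        px, qx = p(x), q(x)
        total += px * math.log(px / qx) * dx
    return total

kl_pq = kl_divergence_numeric(p, q)
print(f"KL(p || q) ~= {kl_pq:.4f}")
```

In most ML settings the densities are not available in closed form like this, which is why the integral is usually estimated from samples instead.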

Normalization constants. In Bayesian inference:

p(w | data) = p(data | w) * p(w) / integral of p(data | w) * p(w) dw

The denominator is an integral over all possible parameter values. It is usually intractable, so approximations such as MCMC or variational inference are used.

Integral concept	Where it appears in ML
Area under curve	Computing probabilities from a density function
Expected value	Loss functions, risk minimization
KL divergence	VAEs, policy optimization, distillation
Normalization	Bayesian posterior, softmax denominator
Marginal likelihood	Model comparison, evidence lower bound (ELBO)

The multivariable chain rule on a computation graph

The chain rule is not limited to scalar functions lined up in a row. In a neural network, variables branch and merge. Here is how derivatives flow through a simple forward pass.

graph LR
    x["x (input)"] -->|"*w"| z1["z1 = w*x"]
    z1 -->|"+b"| z2["z2 = w*x + b"]
    z2 -->|"sigmoid"| a["a = sigmoid(z2)"]
    a -->|"loss fn"| L["L = -(y*log(a) + (1-y)*log(1-a))"]

The backward pass computes gradients from right to left.

graph RL
    dL["dL/dL = 1"] -->|"dL/da"| da["dL/da = -y/a + (1-y)/(1-a)"]
    da -->|"da/dz2 = a(1-a)"| dz2["dL/dz2 = dL/da * a(1-a)"]
    dz2 -->|"dz2/dw = x"| dw["dL/dw = dL/dz2 * x"]
    dz2 -->|"dz2/db = 1"| db["dL/db = dL/dz2 * 1"]

Each arrow multiplies by a local derivative. The gradient for a parameter is the product of all the local derivatives along the path from the loss to that parameter. When paths branch and then merge, their contributions are summed. This is the multivariable chain rule.

This is all there is to backpropagation: applying the chain rule systematically along the computation graph, from the output toward the input.
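The backward pass in the graph above can be written out by hand for a single example and checked against finite differences. This is a minimal sketch: the function names `forward` and `backward` and the values of w, b, x, and y are assumptions for illustration.

```python
import math

def forward(w, b, x, y):
    z = w * x + b
    a = 1.0 / (1.0 + math.exp(-z))                         # sigmoid
    loss = -(y * math.log(a) + (1 - y) * math.log(1 - a))  # binary cross-entropy
    return a, loss

def backward(w, b, x, y):
    a, _ = forward(w, b, x, y)
    dL_da = -y / a + (1 - y) / (1 - a)
    dL_dz = dL_da * a * (1 - a)      # simplifies algebraically to a - y
    return dL_dz * x, dL_dz * 1.0    # dL/dw, dL/db

w, b, x, y = 0.5, -0.2, 2.0, 1.0
dw, db = backward(w, b, x, y)

# Check against central differences on the loss.
h = 1e-6
dw_numeric = (forward(w + h, b, x, y)[1] - forward(w - h, b, x, y)[1]) / (2 * h)
db_numeric = (forward(w, b + h, x, y)[1] - forward(w, b - h, x, y)[1]) / (2 * h)
print(f"dL/dw  backprop={dw:.6f}  numeric={dw_numeric:.6f}")
print(f"dL/db  backprop={db:.6f}  numeric={db_numeric:.6f}")
```

Multiplying out the local derivatives collapses dL/dz2 to a - y, which is why the sigmoid and cross-entropy pairing has such a simple gradient.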

The Jacobian matrix

When a function maps a vector to a vector, its derivative is a matrix. The Jacobian holds the partial derivative of every output with respect to every input.

For f: R^n -> R^m, the Jacobian J is an m x n matrix.

	x1	x2	...	xn
f1	df1/dx1	df1/dx2	...	df1/dxn
f2	df2/dx1	df2/dx2	...	df2/dxn
...	...	...	...	...
fm	dfm/dx1	dfm/dx2	...	dfm/dxn

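You will rarely compute a neural network's Jacobian by hand; frameworks such as PyTorch handle it. Still, knowing that the Jacobian exists makes the shapes in backpropagation easier to follow. If a layer maps R^n to R^m, its Jacobian is m x n, and the gradient flows backward through the transpose of that matrix.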
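To make the shapes concrete, the sketch below builds a Jacobian with central differences and pushes an upstream gradient through its transpose. The function `vector_function` (mapping R^2 to R^3) and the helper `numerical_jacobian` are hypothetical examples, not part of any library.

```python
def vector_function(v):
    # Hypothetical example f: R^2 -> R^3.
    x, y = v
    return [x * y, x + y, x ** 2]

def numerical_jacobian(f, v, h=1e-6):
    """Return the m x n Jacobian of f at v using central differences."""
    m = len(f(v))
    n = len(v)
    jacobian = [[0.0] * n for _ in range(m)]
    for j in range(n):
        plus, minus = list(v), list(v)
        plus[j] += h
        minus[j] -= h
        f_plus, f_minus = f(plus), f(minus)
        for i in range(m):
            jacobian[i][j] = (f_plus[i] - f_minus[i]) / (2 * h)
    return jacobian

J = numerical_jacobian(vector_function, [2.0, 3.0])   # 3 x 2 matrix

# Backward pass: an upstream gradient of length m = 3 is multiplied by J^T
# to give a gradient of length n = 2 with respect to the inputs.
upstream = [1.0, 1.0, 1.0]
input_gradient = [sum(J[i][j] * upstream[i] for i in range(3)) for j in range(2)]
print(J)
print(input_gradient)
```

The analytical Jacobian at (2, 3) is [[3, 2], [1, 1], [4, 0]], so the input gradient for an all-ones upstream gradient is the column sums, [8, 3].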

Why this matters for neural networks

Every weight in a neural network receives a gradient. The gradient says how to adjust that weight to reduce the loss.

graph LR
    subgraph Forward["Forward Pass"]
        I["input"] --> W1["W1"] --> R["relu"] --> W2["W2"] --> S["softmax"] --> L["loss"]
    end

graph RL
    subgraph Backward["Backward Pass"]
        dL["dL/dloss"] --> dW2["dL/dW2"] --> d2["..."] --> dW1["dL/dW1"]
    end

Each weight update is:

W1 = W1 - lr * dL/dW1
W2 = W2 - lr * dL/dW2

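The forward pass computes the prediction and the loss. The backward pass computes the gradient of the loss with respect to every weight. Then every weight takes a small step downhill. Repeat this for many steps, often millions. That is deep learning.

Build it

Step 1: Implement the numerical derivative from scratch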

def numerical_derivative(f, x, h=1e-7):
    # Central difference: slope between the points h to either side of x.
    return (f(x + h) - f(x - h)) / (2 * h)

def f(x):
    return x ** 2

for x in [-2, -1, 0, 1, 2]:
    numerical = numerical_derivative(f, x)
    analytical = 2 * x
    print(f"x={x:2d}  f'(x) numerical={numerical:.6f}  analytical={analytical:.1f}")

The numerical derivative matches the analytical derivative to several decimal places.

Step 2: Partial derivatives and the gradient

def numerical_gradient(f, point, h=1e-7):
    gradient = []
    for i in range(len(point)):
        point_plus = list(point)
        point_minus = list(point)
        point_plus[i] += h
        point_minus[i] -= h
        partial = (f(point_plus) - f(point_minus)) / (2 * h)
        gradient.append(partial)
    return gradient

def f_multi(point):
    x, y = point
    return x**2 + 3*x*y + y**2

grad = numerical_gradient(f_multi, [1.0, 2.0])
print(f"Numerical gradient at (1,2):  {[f'{g:.4f}' for g in grad]}")
print(f"Analytical gradient at (1,2): [2*1+3*2, 3*1+2*2] = [{2*1+3*2}, {3*1+2*2}]")

Step 3: Find the minimum of f(x) = x^2

x = 5.0
lr = 0.1
for step in range(20):
    grad = 2 * x
    x = x - lr * grad
    print(f"step {step:2d}  x={x:8.4f}  f(x)={x**2:10.6f}")

Starting from x=5, each step moves closer to the minimum at x=0. With lr = 0.1, every update multiplies x by 1 - 2*lr = 0.8.

Step 4: Gradient descent on a 2D function

def f_2d(point):
    x, y = point
    return x**2 + y**2

point = [4.0, 3.0]
lr = 0.1
for step in range(30):
    grad = numerical_gradient(f_2d, point)
    point = [p - lr * g for p, g in zip(point, grad)]
    loss = f_2d(point)
    if step % 5 == 0 or step == 29:
        print(f"step {step:2d}  point=({point[0]:7.4f}, {point[1]:7.4f})  f={loss:.6f}")

Step 5: Compare numerical and analytical derivatives

import math

test_functions = [
    ("x^2",      lambda x: x**2,          lambda x: 2*x),
    ("x^3",      lambda x: x**3,          lambda x: 3*x**2),
    ("sin(x)",   lambda x: math.sin(x),   lambda x: math.cos(x)),
    ("e^x",      lambda x: math.exp(x),   lambda x: math.exp(x)),
    ("1/x",      lambda x: 1/x,           lambda x: -1/x**2),
]

x = 2.0
print(f"{'Function':<12} {'Numerical':>12} {'Analytical':>12} {'Error':>12}")
print("-" * 50)
for name, f, df in test_functions:
    num = numerical_derivative(f, x)
    ana = df(x)
    err = abs(num - ana)
    print(f"{name:<12} {num:12.6f} {ana:12.6f} {err:12.2e}")

Step 6: Compute the Hessian numerically

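def hessian_2d(f, x, y, h=1e-5):
    # Second-order central differences for the pure second derivatives.
    fxx = (f(x + h, y) - 2 * f(x, y) + f(x - h, y)) / (h ** 2)
    fyy = (f(x, y + h) - 2 * f(x, y) + f(x, y - h)) / (h ** 2)
    # Mixed partial from the four diagonal neighbors; fxy equals fyx for smooth f.
    fxy = (f(x + h, y + h) - f(x + h, y - h) - f(x - h, y + h) + f(x - h, y - h)) / (4 * h ** 2)
    return [[fxx, fxy], [fxy, fyy]]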

def saddle(x, y):
    return x ** 2 - y ** 2

def bowl(x, y):
    return x ** 2 + y ** 2

H_saddle = hessian_2d(saddle, 0.0, 0.0)
H_bowl = hessian_2d(bowl, 0.0, 0.0)
print(f"Saddle Hessian: {H_saddle}")  # approximately [[2, 0], [0, -2]]: mixed signs
print(f"Bowl Hessian:   {H_bowl}")    # approximately [[2, 0], [0, 2]]: both positive

The Hessian of the saddle function has eigenvalues 2 and -2. The signs are mixed, so it is a saddle point. The bowl has both eigenvalues equal to 2, so it is a minimum.

Step 7: Run a Taylor approximation

import math

def taylor_approx(f, f_prime, f_double_prime, x0, h, order=2):
    result = f(x0)
    if order >= 1:
        result += f_prime(x0) * h
    if order >= 2:
        result += 0.5 * f_double_prime(x0) * h ** 2
    return result

x0 = 0.0
for h in [0.1, 0.5, 1.0, 2.0]:
    true_val = math.sin(h)
    t1 = taylor_approx(math.sin, math.cos, lambda x: -math.sin(x), x0, h, order=1)
    t2 = taylor_approx(math.sin, math.cos, lambda x: -math.sin(x), x0, h, order=2)
    print(f"h={h:.1f}  sin(h)={true_val:.4f}  order1={t1:.4f}  order2={t2:.4f}")

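Near x0=0, sin(x) ~ x. Because sin''(0) = 0, the second-order term vanishes and both orders print the same value, h. The approximation is excellent for small h and breaks down for large h. This is also why gradient descent works well with a small learning rate: each step assumes the linear approximation is accurate.

Step 8: See why this matters for neural networks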

import random

random.seed(42)

w = random.gauss(0, 1)
b = random.gauss(0, 1)
lr = 0.01

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]

for epoch in range(200):
    total_loss = 0
    dw = 0
    db = 0
    for x, y in zip(xs, ys):
        pred = w * x + b
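        error = pred - y
        total_loss += error ** 2
        dw += 2 * error * x   # d(error^2)/dw = 2 * error * d(pred)/dw = 2 * error * x
        db += 2 * error       # d(error^2)/db = 2 * error * d(pred)/db = 2 * error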
    dw /= len(xs)
    db /= len(xs)
    total_loss /= len(xs)
    w -= lr * dw
    b -= lr * db
    if epoch % 40 == 0 or epoch == 199:
        print(f"epoch {epoch:3d}  w={w:.4f}  b={b:.4f}  loss={total_loss:.6f}")

print(f"\nLearned:  y = {w:.2f}x + {b:.2f}")
print("Actual:   y = 2x + 1")

Every gradient-based training loop follows this pattern: predict, compute the loss, compute the gradients, update the weights.
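Hand-derived gradients like the ones in this loop are worth checking against finite differences before trusting them. The sketch below does that for the same data and the same MSE formulas; the helper names `mse_loss` and `mse_gradients` and the test point (w, b) = (0.3, -0.5) are assumptions for illustration.

```python
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]

def mse_loss(w, b):
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def mse_gradients(w, b):
    # Same formulas as the training loop: mean of 2*error*x and mean of 2*error.
    errors = [w * x + b - y for x, y in zip(xs, ys)]
    dw = sum(2 * e * x for e, x in zip(errors, xs)) / len(xs)
    db = sum(2 * e for e in errors) / len(xs)
    return dw, db

w, b, h = 0.3, -0.5, 1e-5
dw, db = mse_gradients(w, b)
dw_numeric = (mse_loss(w + h, b) - mse_loss(w - h, b)) / (2 * h)
db_numeric = (mse_loss(w, b + h) - mse_loss(w, b - h)) / (2 * h)
print(f"dw: analytical={dw:.6f}  numerical={dw_numeric:.6f}")
print(f"db: analytical={db:.6f}  numerical={db_numeric:.6f}")
```

If the two columns disagree beyond a few decimal places, the analytical formula is the first thing to suspect.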

Use it

With NumPy, the same job is faster and more concise.

import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([3, 5, 7, 9, 11], dtype=float)

w, b = np.random.randn(), np.random.randn()
lr = 0.01

for epoch in range(200):
    pred = w * x + b
    error = pred - y
    loss = np.mean(error ** 2)
    dw = np.mean(2 * error * x)
    db = np.mean(2 * error)
    w -= lr * dw
    b -= lr * db

print(f"Learned: y = {w:.2f}x + {b:.2f}")

You just built gradient descent from scratch. PyTorch automates the gradient computation, but the update loop is the same.

Ship it

The deliverable for this lesson is a skill that documents the criteria for computing gradients.

outputs/skill-gradient-computation.md

When reviewing it, check that the skill clearly distinguishes the following:

Analytical derivative
Numerical derivative
Automatic differentiation
Gradient checking
Where the Hessian and Taylor series connect to optimization

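Exercises

Implement numerical_second_derivative(f, x) by calling numerical_derivative twice. Verify that the second derivative of x^3 at x=2 is approximately 12. Use a larger step such as h=1e-4 here, because nesting finite differences amplifies round-off error.
Use gradient descent to find the minimum of f(x, y) = (x - 3)^2 + (y + 1)^2. Start from (0, 0). The result should converge to (3, -1).
Add momentum to the gradient descent loop. Maintain a velocity vector that accumulates past gradients. On f(x) = x^4 - 3x^2, compare the convergence speed with and without momentum.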

Key terms

Term	Common shorthand	What it actually means
Derivative	Slope	The rate of change of a function at a point. It tells you how much the output changes per unit change in the input, for small changes.
Partial derivative	Derivative for one variable	The derivative with respect to one variable, with all other variables held fixed.
Gradient	Direction of steepest ascent	The vector of all partial derivatives. It points in the direction in which the function increases fastest.
Gradient descent	Going downhill	Subtract the gradient times the learning rate from the parameters to reduce the loss. The core of neural network training.
Learning rate	Step size	The scalar that controls the size of each gradient descent step. Too large and training diverges; too small and it converges slowly.
Chain rule	Multiplying derivatives	The rule for differentiating composite functions: `df/dx = df/dg * dg/dx`. The mathematical foundation of backpropagation.
Jacobian	Derivative matrix	For a vector-to-vector function, the matrix of all partial derivatives of outputs with respect to inputs.
Numerical derivative	Finite differences	Approximates the derivative by evaluating the function at two nearby points and taking the slope between them.
Backpropagation	Reverse-mode autodiff	Computes gradients layer by layer from the output toward the input using the chain rule. This is how neural networks learn.
Hessian	Second derivative matrix	The matrix of all second-order partial derivatives. It describes the curvature of a function. A positive definite Hessian at a critical point means a local minimum.
Taylor series	Polynomial approximation	Approximates a function near a point using its derivatives: `f(x+h) ~ f(x) + f'(x)h + (1/2)f''(x)h^2 + ...`. The basis for understanding why gradient descent and Newton's method work.
Integral	Area under a curve	The accumulation of a quantity over a range. In ML it defines probabilities, expected values, and KL divergence.

Further reading

3Blue1Brown: Essence of Calculus. A good visual refresher on derivatives, integrals, and the chain rule.
Stanford CS231n: Backpropagation. Shows how gradients flow through neural network layers.


Check your understanding

3 questions. Answer all of them correctly to mark the lesson complete.

1. In gradient descent, what does the update rule `w = w - lr * dL/dw` do?

2. Why is it hard to apply Newton's method, which uses the Hessian matrix, directly to a neural network with millions of parameters?

3. What is the central difference approximation of `f'(x)`?

