Backpropagation

Computation graph

Mathematical expressions can be represented as a directed graph. In the context of deep learning, the computation graph is the sequence of calculations needed to compute the output of a neural network. A logistic regression classifier can be seen as a single-layer neural network with labels 0 or 1. The prediction is computed in the forward pass, and the gradients during backpropagation (reverse-mode automatic differentiation). Using the chain rule, it is possible to compute the gradients of the loss with respect to every parameter.
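As a minimal sketch of this idea (the node names and dict-based structure are illustrative, not from any library), an expression can be stored as an ordered list of nodes and evaluated left to right, using the same values as the numerical example later in these notes:

```python
import math

# Each node: (name, function of the values computed so far).
graph = [
    ("u", lambda v: v["w1"] * v["x1"]),             # multiplication node
    ("z", lambda v: v["u"] + v["b"]),               # addition node
    ("a", lambda v: 1 / (1 + math.exp(-v["z"]))),   # sigmoid node (the prediction)
]

values = {"w1": 2.0, "x1": 0.5, "b": -1.0}  # leaf nodes (parameters and input)
for name, op in graph:   # forward pass: evaluate nodes in topological order
    values[name] = op(values)

print(values["a"])  # 0.5
```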

Chain rule

The chain rule lets us compute the partial derivative of the loss with respect to the weight w1 (or any earlier parameter) as a product of all the intermediate partial derivatives along the path from the loss back to w1.

∂L/∂w1 = ∂u/∂w1 × ∂z/∂u × ∂a/∂z × ∂L/∂a

Step-by-step breakdown of the chain:

  1. Loss depends on the activation: L = L(a, y) → ∂L/∂a
  2. Activation depends on z: a = σ(z) → ∂a/∂z = σ′(z) (the derivative of the sigmoid)
  3. z depends on u: z = u + b → ∂z/∂u = 1
  4. u depends on w1: u = w1 × x1 → ∂u/∂w1 = x1

Multiplying them all together gives exactly the gradient we need to update w1.
Chain Rule Diagram

Numerical Example

Chosen values: w1 = 2.0, x1 = 0.5, b = −1.0, true label y = 1.0

Activation function (sigmoid): σ(z) = 1 / (1 + e^(−z))

Loss function (mean squared error): L(a, y) = ½ (a − y)²

Derivative of loss w.r.t. activation: ∂L/∂a = a − y
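A quick sanity check on these derivatives (the point z = 0.3 and step size are arbitrary choices for illustration): the analytic sigmoid derivative σ(z)(1 − σ(z)) should agree with a central finite difference.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

z, eps = 0.3, 1e-6  # arbitrary test point and step size
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)  # finite difference
analytic = sigmoid(z) * (1 - sigmoid(z))                     # closed form
print(abs(numeric - analytic) < 1e-8)  # True
```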

Forward pass (compute everything from left to right)

  1. Compute u: u = w1 × x1 = 2.0 × 0.5 = 1.0

  2. Compute z: z = u + b = 1.0 + (−1.0) = 0.0

  3. Compute activation a: a = σ(z) = 1 / (1 + e^0) = 1 / (1 + 1) = 0.5

  4. Compute loss: L = ½ (0.5 − 1.0)² = ½ × 0.25 = 0.125

The network predicts a = 0.5 but the true label is y = 1.0 → the loss is 0.125 (we want to reduce it).
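The forward pass above can be reproduced in a few lines of plain Python (only the standard math module):

```python
import math

w1, x1, b, y = 2.0, 0.5, -1.0, 1.0  # the chosen values from the example

u = w1 * x1                   # 1.0
z = u + b                     # 0.0
a = 1 / (1 + math.exp(-z))    # sigmoid -> 0.5
L = 0.5 * (a - y) ** 2        # MSE with the 1/2 factor -> 0.125
print(u, z, a, L)  # 1.0 0.0 0.5 0.125
```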

Backward pass: compute each local partial derivative

Starting at the loss and moving backwards (following the colored arrows in the diagram):

  1. Gradient of loss w.r.t. activation: ∂L/∂a = a − y = 0.5 − 1.0 = −0.5

  2. Gradient of activation w.r.t. z: ∂a/∂z = σ′(z) = a(1 − a) = 0.5 × 0.5 = 0.25

  3. Gradient of z w.r.t. u: ∂z/∂u = 1 (because z = u + b, the derivative w.r.t. u is just 1)

  4. Gradient of u w.r.t. w1: ∂u/∂w1 = x1 = 0.5 (because u = w1 × x1)

Chain rule: multiply them all together

Apply the chain rule exactly as shown in the diagram: ∂L/∂w1 = ∂u/∂w1 × ∂z/∂u × ∂a/∂z × ∂L/∂a

Plug in the numbers: ∂L/∂w1 = 0.5 × 1 × 0.25 × (−0.5) = −0.0625

Interpretation: the gradient of the loss w.r.t. weight w1 is −0.0625.

The negative sign tells us: if we increase w1 slightly, the loss will go down.
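The backward pass can be checked by multiplying the four local derivatives directly (values carried over from the forward pass above):

```python
a, y, x1 = 0.5, 1.0, 0.5   # activation, label, and input from the forward pass

dL_da = a - y              # -0.5
da_dz = a * (1 - a)        # 0.25 (sigmoid derivative)
dz_du = 1.0                # z = u + b
du_dw1 = x1                # 0.5

dL_dw1 = du_dw1 * dz_du * da_dz * dL_da  # chain rule product
print(dL_dw1)  # -0.0625
```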

Gradient descent update

Update the weight using gradient descent with learning rate η=0.1:

w1_new = w1 − η × ∂L/∂w1 = 2.0 − 0.1 × (−0.0625) = 2.0 + 0.00625 = 2.00625

After this tiny update, the network would output a slightly higher activation (closer to the target 1.0) and the loss would decrease.
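The update step in code, with the same learning rate:

```python
w1, eta, dL_dw1 = 2.0, 0.1, -0.0625

# Gradient descent: step against the gradient direction
w1_new = w1 - eta * dL_dw1
print(w1_new)  # ≈ 2.00625
```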

Bonus: gradient for the bias

You can also compute the gradient for the bias b the same way:

∂L/∂b = ∂z/∂b × ∂a/∂z × ∂L/∂a = 1 × 0.25 × (−0.5) = −0.125
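In code, only the first factor of the chain changes (∂z/∂b = 1 instead of the two factors leading back to w1):

```python
da_dz, dL_da = 0.25, -0.5  # local derivatives from the backward pass above

dz_db = 1.0                # z = u + b, so the derivative w.r.t. b is 1
dL_db = dz_db * da_dz * dL_da
print(dL_db)  # -0.125
```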

PyTorch: automatic differentiation engine (autograd)

The autograd engine computes gradients in dynamic computational graphs automatically. Notice the requires_grad=True parameter below: it tells autograd to track operations on that tensor so gradients can flow back to it.

PyTorch example

import torch
import torch.nn.functional as F
from torch.autograd import grad

y = torch.tensor([1.0])                       # true label
x1 = torch.tensor([0.5])                      # input feature
w1 = torch.tensor([2.0], requires_grad=True)  # weight (gradients tracked)
b = torch.tensor([-1.0], requires_grad=True)  # bias (gradients tracked)

# Forward pass
z = x1 * w1 + b
a = torch.sigmoid(z)
loss = F.mse_loss(a, y)  # note: PyTorch's MSE has no 1/2 factor

# Backward pass: grad() returns a tuple with one gradient per input tensor.
# retain_graph=True keeps the graph alive for the later calls.
grad_L_w1 = grad(loss, w1, retain_graph=True)
grad_L_b = grad(loss, b, retain_graph=True)
print("Gradient of loss with respect to w1:", grad_L_w1)
print("Gradient of loss with respect to b:", grad_L_b)
print("Loss value:", loss.item())

# Note: gradients scale proportionally with the loss value. The book's loss definition
# includes 1/2, so the loss and all its gradients are half of PyTorch's default. When
# we divide the loss by 2, the chain rule propagates this scaling backward.
loss_book_def = F.mse_loss(a, y) / 2
loss_book_def.backward()  # accumulates gradients into the .grad attributes
print("Gradient of loss with respect to w1 (book definition):", w1.grad)
print("Gradient of loss with respect to b (book definition):", b.grad)
print("Loss value (book definition):", loss_book_def.item())

Output:

Gradient of loss with respect to w1: (tensor([-0.1250]),)
Gradient of loss with respect to b: (tensor([-0.2500]),)
Loss value: 0.25
Gradient of loss with respect to w1 (book definition): tensor([-0.0625])
Gradient of loss with respect to b (book definition): tensor([-0.1250])
Loss value (book definition): 0.125
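The autograd result can be cross-checked without PyTorch at all, by nudging w1 and measuring the change in the book-definition loss (a central finite difference, pure Python):

```python
import math

def loss(w1, x1=0.5, b=-1.0, y=1.0):
    """Full forward pass to the 1/2-scaled MSE loss, as a function of w1."""
    a = 1 / (1 + math.exp(-(w1 * x1 + b)))
    return 0.5 * (a - y) ** 2

eps = 1e-6
numeric = (loss(2.0 + eps) - loss(2.0 - eps)) / (2 * eps)
print(numeric)  # ≈ -0.0625, matching w1.grad above
```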

#tech