Backpropagation
Computation graph
Mathematical expressions can be represented as a directed graph. In the context of deep learning, a computation graph is the sequence of calculations needed to compute the output of a neural network. A logistic regression classifier, for example, can be seen as a single-layer neural network with labels 0 or 1. The prediction is computed in the forward pass, and the gradients are computed during backpropagation (reverse-mode automatic differentiation). Using the chain rule, we can compute the gradient of the loss with respect to each parameter.
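As an illustration, here is a minimal sketch of logistic regression written as a single-layer network in plain Python (the input and weight values here are made up for the example):

```python
import math

# Illustrative sketch (values are made up): logistic regression as a
# single-layer neural network with a sigmoid output.
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w, b):
    # Weighted sum of inputs plus bias, then sigmoid activation
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sigmoid(z)  # interpreted as P(label = 1)

p = forward([0.5, -1.2], [2.0, 0.3], -1.0)
label = 1 if p >= 0.5 else 0
print(p, label)
```

Each operation in `forward` (multiply, sum, add bias, sigmoid) is one node of the computation graph; the forward pass simply evaluates the graph left to right.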
Chain rule
The chain rule lets us compute the partial derivative of the loss with respect to a weight (or any earlier parameter) as a product of all the intermediate partial derivatives along the path from the loss back to that parameter (here, the weight $w_1$).
Step-by-step breakdown of the chain:
- Loss depends on activation: $\frac{\partial L}{\partial a}$
- Activation depends on $z = u + b$: $\frac{\partial a}{\partial z}$ (the derivative of the sigmoid)
- $z$ depends on $u = w_1 \cdot x_1$: $\frac{\partial z}{\partial u}$
- $u$ depends on $w_1$: $\frac{\partial u}{\partial w_1}$
Multiplying them all together gives exactly the gradient we need to update $w_1$.
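This product of local derivatives can be checked directly with PyTorch by asking autograd for each factor separately and comparing the product with the end-to-end gradient (a sketch using the values from the numerical example below):

```python
import torch

# Values from the worked example: x1 = 0.5, w1 = 2.0, b = -1.0, y = 1.0
x1 = torch.tensor(0.5)
y = torch.tensor(1.0)
w1 = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(-1.0, requires_grad=True)

u = w1 * x1                 # intermediate product
z = u + b                   # pre-activation
a = torch.sigmoid(z)        # activation
loss = 0.5 * (a - y) ** 2   # MSE with the 1/2 factor

# Each local derivative along the path from the loss back to w1
dL_da = torch.autograd.grad(loss, a, retain_graph=True)[0]   # -0.5
da_dz = torch.autograd.grad(a, z, retain_graph=True)[0]      # 0.25
dz_du = torch.autograd.grad(z, u, retain_graph=True)[0]      # 1.0
du_dw1 = torch.autograd.grad(u, w1, retain_graph=True)[0]    # 0.5

chain = dL_da * da_dz * dz_du * du_dw1
direct = torch.autograd.grad(loss, w1)[0]
print(chain.item(), direct.item())  # both -0.0625
```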
Numerical Example
Chosen values
- Input: $x_1 = 0.5$
- Trainable weight: $w_1 = 2.0$
- Bias: $b = -1.0$
- Target label: $y = 1.0$
Activation function (sigmoid): $a = \sigma(z) = \frac{1}{1 + e^{-z}}$
Loss function (mean squared error): $L = \frac{1}{2}(a - y)^2$
Derivative of loss w.r.t. activation: $\frac{\partial L}{\partial a} = a - y$
Forward pass (compute everything from left to right)
Compute $u$: $u = w_1 \cdot x_1 = 2.0 \cdot 0.5 = 1.0$
Compute $z$: $z = u + b = 1.0 + (-1.0) = 0.0$
Compute activation $a$: $a = \sigma(0.0) = 0.5$
Compute loss: $L = \frac{1}{2}(0.5 - 1.0)^2 = 0.125$
The network is predicting $a = 0.5$ but the true label is $y = 1.0$, so the loss is $0.125$ (we want to reduce it).
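The forward pass above can be reproduced in a few lines of plain Python (same values as the example):

```python
import math

# Forward pass with the example values: x1=0.5, w1=2.0, b=-1.0, y=1.0
x1, w1, b, y = 0.5, 2.0, -1.0, 1.0

u = w1 * x1                      # u = 1.0
z = u + b                        # z = 0.0
a = 1.0 / (1.0 + math.exp(-z))   # sigmoid(0) = 0.5
loss = 0.5 * (a - y) ** 2        # 0.125
print(u, z, a, loss)
```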
Backward pass: compute each local partial derivative
Starting at the loss and moving backwards (following the colored arrows in the diagram):
Gradient of loss w.r.t. activation: $\frac{\partial L}{\partial a} = a - y = 0.5 - 1.0 = -0.5$
Gradient of activation w.r.t. $z$: $\frac{\partial a}{\partial z} = \sigma(z)\,(1 - \sigma(z)) = 0.5 \cdot 0.5 = 0.25$
Gradient of $z$ w.r.t. $u$: $\frac{\partial z}{\partial u} = 1$ (because $z = u + b$, the derivative w.r.t. $u$ is just 1)
Gradient of $u$ w.r.t. $w_1$: $\frac{\partial u}{\partial w_1} = x_1 = 0.5$ (because $u = w_1 \cdot x_1$)
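In code, the four local derivatives are one line each (a plain-Python sketch continuing the forward pass above):

```python
import math

# Example values: x1=0.5, w1=2.0, b=-1.0, y=1.0
x1, w1, b, y = 0.5, 2.0, -1.0, 1.0
z = w1 * x1 + b
a = 1.0 / (1.0 + math.exp(-z))

dL_da = a - y          # -0.5
da_dz = a * (1.0 - a)  # 0.25 (sigmoid derivative)
dz_du = 1.0            # because z = u + b
du_dw1 = x1            # 0.5, because u = w1 * x1
print(dL_da, da_dz, dz_du, du_dw1)
```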
Chain rule: multiply them all together
Apply the chain rule exactly as shown in the diagram: $\frac{\partial L}{\partial w_1} = \frac{\partial L}{\partial a} \cdot \frac{\partial a}{\partial z} \cdot \frac{\partial z}{\partial u} \cdot \frac{\partial u}{\partial w_1}$
Plug in the numbers: $\frac{\partial L}{\partial w_1} = (-0.5) \cdot 0.25 \cdot 1 \cdot 0.5 = -0.0625$
Interpretation: The gradient of the loss w.r.t. weight $w_1$ is $-0.0625$.
The negative sign tells us: if we increase $w_1$ slightly, the loss will go down.
Gradient descent update
Update the weight using gradient descent with learning rate $\eta$: $w_1^{\text{new}} = w_1 - \eta \cdot \frac{\partial L}{\partial w_1}$
After this tiny update (for a small $\eta$), the network would output a slightly higher activation (closer to the target $y = 1.0$) and the loss would decrease.
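One update step can be sketched in plain Python; the learning rate of 0.1 here is an illustrative choice, not a value from the text:

```python
import math

# One gradient-descent step; learning rate 0.1 is an illustrative choice.
def loss_at(w1, x1=0.5, b=-1.0, y=1.0):
    a = 1.0 / (1.0 + math.exp(-(w1 * x1 + b)))
    return 0.5 * (a - y) ** 2

lr, grad_w1 = 0.1, -0.0625
w1_new = 2.0 - lr * grad_w1          # 2.0 - 0.1 * (-0.0625) = 2.00625
print(w1_new, loss_at(2.0), loss_at(w1_new))  # new loss is below 0.125
```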
Bonus: gradient for the bias
You can also compute the gradient for the bias $b$ the same way; since $z = u + b$, the local derivative is $\frac{\partial z}{\partial b} = 1$, so $\frac{\partial L}{\partial b} = \frac{\partial L}{\partial a} \cdot \frac{\partial a}{\partial z} \cdot \frac{\partial z}{\partial b} = (-0.5) \cdot 0.25 \cdot 1 = -0.125$
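As a quick sanity check, the bias gradient can be verified numerically with a central finite difference (the step size $h$ below is an illustrative choice):

```python
import math

# Example values: x1=0.5, w1=2.0, b=-1.0, y=1.0
x1, w1, b, y = 0.5, 2.0, -1.0, 1.0

def loss_at(b_val):
    a = 1.0 / (1.0 + math.exp(-(w1 * x1 + b_val)))
    return 0.5 * (a - y) ** 2

a = 1.0 / (1.0 + math.exp(-(w1 * x1 + b)))
grad_b = (a - y) * a * (1.0 - a) * 1.0       # chain rule: -0.125
h = 1e-6                                      # finite-difference step
numeric = (loss_at(b + h) - loss_at(b - h)) / (2 * h)
print(grad_b, numeric)                        # the two should agree closely
```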
PyTorch: automatic differentiation engine (autograd)
The engine provides functions to compute gradients in dynamic computational graphs automatically.
(Note the requires_grad=True parameter, which tells autograd to track operations on those tensors.)
PyTorch example
```python
import torch
import torch.nn.functional as F
from torch.autograd import grad

y = torch.tensor([1.0])
x1 = torch.tensor([0.5])
w1 = torch.tensor([2.0], requires_grad=True)
b = torch.tensor([-1.0], requires_grad=True)

z = x1 * w1 + b
a = torch.sigmoid(z)
loss = F.mse_loss(a, y)

grad_L_w1 = grad(loss, w1, retain_graph=True)
grad_L_b = grad(loss, b, retain_graph=True)
print("Gradient of loss with respect to w1:", grad_L_w1)
print("Gradient of loss with respect to b:", grad_L_b)
print("Loss value:", loss.item())

# Note: gradients scale proportionally with the loss value. The book's loss definition includes 1/2,
# so the loss and all its gradients are half of PyTorch's default. When we divide the loss by 2,
# all gradients are also divided by 2 (chain rule propagates this scaling backward).
loss_book_def = F.mse_loss(a, y) / 2
loss_book_def.backward()
print("Gradient of loss with respect to w1 (book definition):", w1.grad)
print("Gradient of loss with respect to b (book definition):", b.grad)
print("Loss value (book definition):", loss_book_def.item())
```
Output:
```
Gradient of loss with respect to w1: (tensor([-0.1250]),)
Gradient of loss with respect to b: (tensor([-0.2500]),)
Loss value: 0.25
Gradient of loss with respect to w1 (book definition): tensor([-0.0625])
Gradient of loss with respect to b (book definition): tensor([-0.1250])
Loss value (book definition): 0.125
```