Backpropagation
Computation graph
Mathematical expressions can be represented as a directed graph. In the context of deep learning, a computation graph is the sequence of calculations needed to compute the output of a neural network. A logistic regression classifier, for example, can be seen as a single-layer neural network with labels 0 or 1. The prediction is computed in the forward pass, and the gradients are computed during backpropagation (reverse-mode automatic differentiation). Using the chain rule, we can compute the gradient of the loss with respect to each parameter.
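As an illustration, here is a minimal sketch of logistic regression written as a single-layer network in plain Python (the input and weight values here are made up for the example):

```python
import math

# Illustrative sketch (values are made up): logistic regression as a
# single-layer neural network with a sigmoid output.
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w, b):
    # Weighted sum of inputs plus bias, then sigmoid activation
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sigmoid(z)  # interpreted as P(label = 1)

p = forward([0.5, -1.2], [2.0, 0.3], -1.0)
label = 1 if p >= 0.5 else 0
print(p, label)
```

Each operation in `forward` (multiply, sum, add bias, sigmoid) is one node of the computation graph; the forward pass simply evaluates the graph left to right.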
Chain rule
The chain rule lets us compute the partial derivative of the loss with respect to a weight (or any earlier parameter) as a product of all the intermediate partial derivatives along the path from the loss back to that parameter (here, the weight $w_1$).
Step-by-step breakdown of the chain:
- Loss depends on activation: $\frac{\partial L}{\partial a}$
- Activation depends on $z = u + b$: $\frac{\partial a}{\partial z}$ (the derivative of the sigmoid)
- $z$ depends on $u = w_1 \cdot x_1$: $\frac{\partial z}{\partial u}$
- $u$ depends on $w_1$: $\frac{\partial u}{\partial w_1}$
Multiplying them all together gives exactly the gradient we need to update $w_1$.
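This product of local derivatives can be checked directly with PyTorch by asking autograd for each factor separately and comparing the product with the end-to-end gradient (a sketch using the values from the numerical example below):

```python
import torch

# Values from the worked example: x1 = 0.5, w1 = 2.0, b = -1.0, y = 1.0
x1 = torch.tensor(0.5)
y = torch.tensor(1.0)
w1 = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(-1.0, requires_grad=True)

u = w1 * x1                 # intermediate product
z = u + b                   # pre-activation
a = torch.sigmoid(z)        # activation
loss = 0.5 * (a - y) ** 2   # MSE with the 1/2 factor

# Each local derivative along the path from the loss back to w1
dL_da = torch.autograd.grad(loss, a, retain_graph=True)[0]   # -0.5
da_dz = torch.autograd.grad(a, z, retain_graph=True)[0]      # 0.25
dz_du = torch.autograd.grad(z, u, retain_graph=True)[0]      # 1.0
du_dw1 = torch.autograd.grad(u, w1, retain_graph=True)[0]    # 0.5

chain = dL_da * da_dz * dz_du * du_dw1
direct = torch.autograd.grad(loss, w1)[0]
print(chain.item(), direct.item())  # both -0.0625
```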
Numerical Example
Chosen values
- Input: $x_1 = 0.5$
- Trainable weight: $w_1 = 2.0$
- Bias: $b = -1.0$
- Target label: $y = 1.0$
Activation function (sigmoid): $a = \sigma(z) = \frac{1}{1 + e^{-z}}$
Loss function (mean squared error): $L = \frac{1}{2}(a - y)^2$
Derivative of loss w.r.t. activation: $\frac{\partial L}{\partial a} = a - y$
Forward pass (compute everything from left to right)
Compute $u$: $u = w_1 \cdot x_1 = 2.0 \cdot 0.5 = 1.0$
Compute $z$: $z = u + b = 1.0 + (-1.0) = 0.0$
Compute activation $a$: $a = \sigma(0.0) = 0.5$
Compute loss: $L = \frac{1}{2}(0.5 - 1.0)^2 = 0.125$
The network is predicting $a = 0.5$ but the true label is $y = 1.0$, so the loss is $0.125$ (we want to reduce it).
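The forward pass above can be reproduced in a few lines of plain Python (same values as the example):

```python
import math

# Forward pass with the example values: x1=0.5, w1=2.0, b=-1.0, y=1.0
x1, w1, b, y = 0.5, 2.0, -1.0, 1.0

u = w1 * x1                      # u = 1.0
z = u + b                        # z = 0.0
a = 1.0 / (1.0 + math.exp(-z))   # sigmoid(0) = 0.5
loss = 0.5 * (a - y) ** 2        # 0.125
print(u, z, a, loss)
```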
Backward pass: compute each local partial derivative
Starting at the loss and moving backwards (following the colored arrows in the diagram):
Gradient of loss w.r.t. activation: $\frac{\partial L}{\partial a} = a - y = 0.5 - 1.0 = -0.5$
Gradient of activation w.r.t. $z$: $\frac{\partial a}{\partial z} = \sigma(z)\,(1 - \sigma(z)) = 0.5 \cdot 0.5 = 0.25$
Gradient of $z$ w.r.t. $u$: $\frac{\partial z}{\partial u} = 1$ (because $z = u + b$, the derivative w.r.t. $u$ is just 1)
Gradient of $u$ w.r.t. $w_1$: $\frac{\partial u}{\partial w_1} = x_1 = 0.5$ (because $u = w_1 \cdot x_1$)
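In code, the four local derivatives are one line each (a plain-Python sketch continuing the forward pass above):

```python
import math

# Example values: x1=0.5, w1=2.0, b=-1.0, y=1.0
x1, w1, b, y = 0.5, 2.0, -1.0, 1.0
z = w1 * x1 + b
a = 1.0 / (1.0 + math.exp(-z))

dL_da = a - y          # -0.5
da_dz = a * (1.0 - a)  # 0.25 (sigmoid derivative)
dz_du = 1.0            # because z = u + b
du_dw1 = x1            # 0.5, because u = w1 * x1
print(dL_da, da_dz, dz_du, du_dw1)
```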
Chain rule: multiply them all together
Apply the chain rule exactly as shown in the diagram: $\frac{\partial L}{\partial w_1} = \frac{\partial L}{\partial a} \cdot \frac{\partial a}{\partial z} \cdot \frac{\partial z}{\partial u} \cdot \frac{\partial u}{\partial w_1}$
Plug in the numbers: $\frac{\partial L}{\partial w_1} = (-0.5) \cdot 0.25 \cdot 1 \cdot 0.5 = -0.0625$
Interpretation: The gradient of the loss w.r.t. weight $w_1$ is $-0.0625$.
The negative sign tells us: if we increase $w_1$ slightly, the loss will go down.
Gradient descent update
Update the weight using gradient descent with learning rate $\eta$: $w_1^{\text{new}} = w_1 - \eta \cdot \frac{\partial L}{\partial w_1}$
After this tiny update (for a small $\eta$), the network would output a slightly higher activation (closer to the target $y = 1.0$) and the loss would decrease.
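One update step can be sketched in plain Python; the learning rate of 0.1 here is an illustrative choice, not a value from the text:

```python
import math

# One gradient-descent step; learning rate 0.1 is an illustrative choice.
def loss_at(w1, x1=0.5, b=-1.0, y=1.0):
    a = 1.0 / (1.0 + math.exp(-(w1 * x1 + b)))
    return 0.5 * (a - y) ** 2

lr, grad_w1 = 0.1, -0.0625
w1_new = 2.0 - lr * grad_w1          # 2.0 - 0.1 * (-0.0625) = 2.00625
print(w1_new, loss_at(2.0), loss_at(w1_new))  # new loss is below 0.125
```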
Bonus: gradient for the bias
You can also compute the gradient for the bias $b$ the same way; since $z = u + b$, the local derivative is $\frac{\partial z}{\partial b} = 1$, so $\frac{\partial L}{\partial b} = \frac{\partial L}{\partial a} \cdot \frac{\partial a}{\partial z} \cdot \frac{\partial z}{\partial b} = (-0.5) \cdot 0.25 \cdot 1 = -0.125$
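As a quick sanity check, the bias gradient can be verified numerically with a central finite difference (the step size $h$ below is an illustrative choice):

```python
import math

# Example values: x1=0.5, w1=2.0, b=-1.0, y=1.0
x1, w1, b, y = 0.5, 2.0, -1.0, 1.0

def loss_at(b_val):
    a = 1.0 / (1.0 + math.exp(-(w1 * x1 + b_val)))
    return 0.5 * (a - y) ** 2

a = 1.0 / (1.0 + math.exp(-(w1 * x1 + b)))
grad_b = (a - y) * a * (1.0 - a) * 1.0       # chain rule: -0.125
h = 1e-6                                      # finite-difference step
numeric = (loss_at(b + h) - loss_at(b - h)) / (2 * h)
print(grad_b, numeric)                        # the two should agree closely
```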
PyTorch: automatic differentiation engine (autograd)
The engine provides functions to compute gradients in dynamic computational graphs automatically.
(Note the requires_grad=True parameter, which tells autograd to track operations on those tensors.)
PyTorch example
```python
import torch
import torch.nn.functional as F
from torch.autograd import grad

y = torch.tensor([1.0])
x1 = torch.tensor([0.5])
w1 = torch.tensor([2.0], requires_grad=True)
b = torch.tensor([-1.0], requires_grad=True)

z = x1 * w1 + b
a = torch.sigmoid(z)
loss = F.mse_loss(a, y)

grad_L_w1 = grad(loss, w1, retain_graph=True)
grad_L_b = grad(loss, b, retain_graph=True)
print("Gradient of loss with respect to w1:", grad_L_w1)
print("Gradient of loss with respect to b:", grad_L_b)
print("Loss value:", loss.item())

# Note: gradients scale proportionally with the loss value. The book's loss definition includes 1/2,
# so the loss and all its gradients are half of PyTorch's default. When we divide the loss by 2,
# all gradients are also divided by 2 (chain rule propagates this scaling backward).
loss_book_def = F.mse_loss(a, y) / 2
loss_book_def.backward()
print("Gradient of loss with respect to w1 (book definition):", w1.grad)
print("Gradient of loss with respect to b (book definition):", b.grad)
print("Loss value (book definition):", loss_book_def.item())
```
Output:
```
Gradient of loss with respect to w1: (tensor([-0.1250]),)
Gradient of loss with respect to b: (tensor([-0.2500]),)
Loss value: 0.25
Gradient of loss with respect to w1 (book definition): tensor([-0.0625])
Gradient of loss with respect to b (book definition): tensor([-0.1250])
Loss value (book definition): 0.125
```