Neural Networks & Back-propagation

1. A single neuron (from the linear model)

Start with the simplest model: a weighted sum of inputs plus a constant, exactly the score from the Linear-Algebra primer’s dot product.

\[ z \;=\; w_1 x_1 + w_2 x_2 + \dots + w_n x_n + b \;=\; \mathbf{w}\cdot\mathbf{x} + b. \]

\(\mathbf{x}\) — the inputs (features).
\(\mathbf{w}\) — the weights (how much each input matters; learned).
\(b\) — the bias (a constant offset, shifting the threshold; learned).
\(z\) — the pre-activation (the raw score, also called the logit, Probability primer §3).

A neuron then applies a function \(\phi\) (the activation, §2) to that score: \(a = \phi(z)\).

Why “neuron,” “weights,” “bias.” The model is loosely inspired by a biological neuron that fires when its inputs exceed a threshold. The weights weight each incoming signal; the bias is the firing threshold moved to the other side of the equation. The single unit with a step activation is the historic perceptron (Rosenblatt, 1958) — perceive + -tron.

Worked example. Inputs \(\mathbf{x}=[1,2]\), weights \(\mathbf{w}=[0.5,-0.5]\), bias \(b=0\):

\[ z = (0.5)(1)+(-0.5)(2)+0 = 0.5 - 1.0 = -0.5. \]

That single number is the neuron’s score; §2 squashes it into an output.

2. Activation functions — and why a nonlinearity is essential

The activation function \(\phi\) takes the score \(z\) and produces the neuron’s output. The three you meet first:

The three canonical activation functions — **sigmoid** for probability outputs, **tanh** for zero-centred hidden layers, and **ReLU** as the modern default that avoids the vanishing-gradient problem for \(z>0\).
Name	Formula	Range	Note
Sigmoid	\(\sigma(z)=\dfrac{1}{1+e^{-z}}\)	\((0,1)\)	“probability-like”; from the Probability primer §3
Tanh	\(\tanh(z)\)	\((-1,1)\)	zero-centred sigmoid cousin
ReLU	\(\max(0,z)\)	\([0,\infty)\)	the modern default — cheap, no vanishing slope for \(z>0\)

Why “activation” and “ReLU.” Activation = how strongly the neuron “fires.” ReLU = Rectified Linear Unit: “rectified” (as in a rectifier that clips negatives to zero) + “linear” (it is the identity for \(z>0\)).

Worked example. \(\text{ReLU}(-2)=0\) and \(\text{ReLU}(3)=3\). Applying sigmoid to §1’s score: \(\sigma(-0.5)=\dfrac{1}{1+e^{0.5}}=0.3775\).

Figure 5.1: The three activations: **sigmoid** squashes to \((0,1)\), **tanh** to \((-1,1)\) (zero-centred), **ReLU** passes positives and zeros negatives (note its kink at \(0\) and flat slope for \(z<0\)). The **bend** each one adds is the **nonlinearity** — without it, stacked layers collapse to a single line (below).

Why ReLU is the default. Sigmoid and tanh saturate — far from \(0\) their curves go flat, so the slope \(\to 0\) and gradients vanish, stalling the early layers of a deep net. ReLU’s slope is exactly \(1\) for \(z>0\), so the gradient flows back undiminished — which is why it largely replaced the squashing activations in hidden layers. (Its flip side: a unit stuck at \(z<0\) gets zero gradient and can stay “dead” — which Leaky-ReLU and GELU fix by giving negatives a small slope.)

Why the nonlinearity matters — the crucial point. Stack two linear layers and you get \(W_2(W_1\mathbf{x}) = (W_2 W_1)\mathbf{x}\) — still just one linear map (matrix product, Linear-Algebra primer). No amount of stacking adds power. The nonlinearity \(\phi\) between layers is what lets a deep network bend space and represent functions a single line cannot — the classic example being XOR (exclusive-or: output \(1\) iff exactly one input is \(1\)), which no single linear boundary can separate but a 2-layer net can.

Figure 5.2: **XOR is not linearly separable.** Class 0 (blue: \((0,0),(1,1)\)) and class 1 (red: \((0,1),(1,0)\)) sit on opposite diagonals — no single straight line (the dashed one, or any other) can split them. A 2-layer net with a hidden nonlinearity *can* carve the boundary — which is why the nonlinearity is essential.

3. The multi-layer perceptron (MLP)

Put many neurons side by side to form a layer, and stack layers front to back. A layer takes a vector in and puts a vector out — a matrix–vector product (Linear-Algebra primer) followed by an elementwise activation:

\[ \mathbf{h} = \phi\big(W\mathbf{x} + \mathbf{b}\big). \]

\(W\) — the layer’s weight matrix (one row per output neuron, one column per input).
\(\mathbf{b}\) — the layer’s bias vector.
\(\phi\) — applied elementwise.
\(\mathbf{h}\) — the layer’s output, the hidden representation, fed to the next layer.

Stack \(L\) such layers: \(\mathbf{x}\to\mathbf{h}^{(1)}\to\dots\to\mathbf{h}^{(L)}=\hat{\mathbf{y}}\).

Why “MLP,” “hidden,” “feedforward,” “deep.” A multi-layer perceptron is several perceptron-style layers stacked. The middle layers are hidden (not directly observed — neither input nor output). Signal flows one way, input → output, so it is feedforward. Many layers = deep learning. The width is neurons-per-layer; the depth is the number of layers.

Figure 5.3: A **multi-layer perceptron**: every edge is a learned weight, every node a neuron. Signal flows **input \(\to\) hidden \(\to\) output** (feed-forward); **width** = neurons per layer, **depth** = number of layers. The middle layers are **hidden** — neither input nor output.

A tiny forward pass. Take a \(2\!\to\!2\!\to\!1\) network: input \(\mathbf{x}=[1,2]\); a hidden layer with weight rows \([1,-1]\) and \([0,1]\), bias \(\mathbf{b}=[0,-1]\), and ReLU; then an output weight \(\mathbf{w}_2=[1,2]\), \(b_2=0.5\).

\[ \mathbf{h}=\text{ReLU}(W\mathbf{x}+\mathbf{b})=\text{ReLU}\big([\,1{-}2,\ 2{-}1\,]\big)=\text{ReLU}([-1,\,1])=[0,\,1], \qquad \hat y=\mathbf{w}_2\!\cdot\!\mathbf{h}+b_2=2+0.5=\boxed{2.5}. \]

The ReLU clipped the \(-1\) to \(0\) — that bend is the nonlinearity at work. Each layer is just a matrix–vector product plus an elementwise activation; §5 then traces the gradient back through such a network end to end.

Embedding layers — the input side for discrete data

An MLP eats a vector, but a recommender’s raw input is a discrete id (user #4072, movie #88) — not a number you can multiply. The fix is an embedding layer: keep a big learnable lookup table \(E\) (one row per id) and encode id \(k\) as its row \(E_k\). Formally that is a matrix–vector product with a one-hot input: the one-hot \(\mathbf e_k\) (a column of \(0\)s with a single \(1\) in slot \(k\)) picks out the matching row, \(E_k=\mathbf e_k^\top E\) — the one-hot on the left here, since we want a row, not the \(W\mathbf x\) column convention of §3 (the same vector is \(E^\top\mathbf e_k\) if you prefer to keep the matrix on the left). Either way it is a plain linear map, so its rows train by the same back-prop (§5) as any other weights. This is the bridge to Tier 3: the user/item embeddings of matrix factorization (Traditional RecSys) and LightGCN (From Graphs to LightGCN) are exactly these lookup rows — learned vectors, scored by a dot product (§1).

Figure 5.4: An **embedding layer** is a learnable lookup table \(E\). Multiplying a one-hot row vector \(\mathbf{e}_k^\top\) (all zeros except a \(1\) at position \(k\)) by the matrix \(E\) selects exactly **row \(k\)** of \(E\) — no arithmetic needed beyond that selection. The highlighted row is the **embedding vector** for id \(k\); its entries are learned by back-propagation exactly like any other weights. This is what `nn.Embedding` does: it stores the table and returns the row for the given id.

By the numbers. Take a \(4\times3\) table (4 ids, 3-dim): \(E=\begin{bmatrix}0.9&0.1&0.0\\ 0.2&0.8&0.1\\ 0.5&-0.2&0.1\\ -0.3&0.4&0.7\end{bmatrix}\). Looking up id \(k{=}2\) (the third row, 0-indexed) returns \(E_2=[0.5,-0.2,0.1]\) — no arithmetic, just a row read; the one-hot product confirms it, \(\mathbf e_2^{\top}E=0\cdot\text{row}_0+0\cdot\text{row}_1+1\cdot\text{row}_2+0\cdot\text{row}_3=[0.5,-0.2,0.1]\). The quiet consequence is in back-prop: that same one-hot zeroes the gradient for every other row, so an id’s embedding updates only on training examples that use it. A user seen once barely moves; an id never seen gets no gradient at all — precisely the cold-start gap From Graphs to LightGCN §14 and LLM × RecSys exist to fill.

4. The forward pass and the loss

The forward pass runs an input through the network to a prediction \(\hat{y}\), then scores it against the target \(y\) with a loss (Losses & Regularizers): regression → MSE, classification → BCE / cross-entropy. Training = adjust \(W,\mathbf{b}\) to make that loss small over the data.

For multi-class output the last layer is a softmax (Probability §3; Losses & Regularizers §2.2): it turns the \(K\) raw scores into a probability distribution over the classes, paired with cross-entropy. So the output head is a sigmoid (one probability) or a softmax (\(K\) probabilities); everything before it is the linear-maps-plus-nonlinearity stack.

Worked example — a softmax + cross-entropy head. Three classes, raw scores (logits) \(z=[2,1,0]\), true class \(0\). Softmax exponentiates and normalizes: \(e^{2},e^{1},e^{0}=7.39,2.72,1.00\) (sum \(11.11\)), giving probabilities \(\hat{\mathbf p}=[0.665,\,0.245,\,0.090]\). Cross-entropy then reads off the negative log-probability of the true class: \(\mathcal L=-\ln\hat p_0=-\ln 0.665=0.41\). Decoded: the loss is small only when the model concentrates mass on the right class (\(\hat p_0\to 1\Rightarrow\mathcal L\to 0\)) and grows without bound as \(\hat p_0\to 0\) — exactly the pressure that trains a classifier. (Numbers checked in code.)

Why “forward pass.” You pass the data forward through the layers, input to output. (Back-propagation, §5, then passes information backward.)

So a network is a long composition \(\;\mathbf{x}\to\text{layer}_1\to\dots\to\text{layer}_L\to\text{loss}\), i.e. a chain of functions — which is exactly the setting the chain rule was built for.

5. Back-propagation = the chain rule (the centrepiece)

To train with gradient descent we need \(\dfrac{\partial L}{\partial w}\) for every weight: “if I nudge this weight, how does the loss change?” Back-propagation computes all of them efficiently by applying the chain rule (Calculus §3) backward through the chain, reusing each layer’s result.

Why “back-propagation.” The error signal is propagated backward from the loss to each weight — the reverse of the forward pass. It is not a new kind of math: it is the chain rule (\(\frac{dL}{dw}=\frac{dL}{da}\cdot\frac{da}{dz}\cdot\frac{dz}{dw}\)), bookkept layer by layer.

Worked example — one neuron, trained one step, entirely by hand. A fresh, single-input neuron (so the algebra stays one-dimensional — not §1’s two-input one). Network:

\[ x \xrightarrow{\;z=wx+b\;} z \xrightarrow{\;a=\sigma(z)\;} a \xrightarrow{\;L=\tfrac12(a-y)^2\;} L, \qquad x=1,\; w=0.5,\; b=0,\; y=0. \]

Forward pass. \[ z = (0.5)(1)+0 = 0.5,\quad a=\sigma(0.5)=0.6225,\quad L=\tfrac12(0.6225-0)^2 = 0.1937. \]

Backward pass — for each link, identify the local derivative, then accumulate it into the running gradient by multiplying with what came before:

link	local derivative	running gradient so far
loss vs. output	\(\dfrac{\partial L}{\partial a}=a-y=0.6225\)	\(0.6225\)
output vs. score	\(\dfrac{\partial a}{\partial z}=\sigma(z)(1{-}\sigma(z))=0.2350\)	\(\dfrac{\partial L}{\partial a}\cdot\dfrac{\partial a}{\partial z}=0.6225\times0.2350=0.1463\)
score vs. weight	\(\dfrac{\partial z}{\partial w}=x=1\)	\(0.1463\times1=0.1463\)

The local derivative at each node is just that node’s own slope (how \(a\) moves when \(z\) is nudged, for instance). The running gradient accumulates those local slopes multiplicatively as the chain rule propagates the error back toward the weight — so the second column is \(\partial L/\partial a\), the third is \(\partial L/\partial z\), and the final entry is \(\partial L/\partial w\).

Where each local derivative comes from: \(\frac{\partial L}{\partial a}=a-y\) is the derivative of \(L=\tfrac12(a-y)^2\); \(\frac{\partial a}{\partial z}=\sigma(z)(1-\sigma(z))\) is the sigmoid derivative — by the quotient rule on \(\sigma=\tfrac{1}{1+e^{-z}}\), \(\sigma'=\tfrac{e^{-z}}{(1+e^{-z})^2}=\sigma(1-\sigma)\) (the full derivation is in Calculus §3), so here \(0.6225\,(1-0.6225)=0.2350\); \(\frac{\partial z}{\partial w}=x\) because \(z=wx+b\) is linear in \(w\). Chain them: \[ \frac{\partial L}{\partial w}=\frac{\partial L}{\partial a}\cdot\frac{\partial a}{\partial z}\cdot\frac{\partial z}{\partial w} = 0.6225\times0.2350\times1 = \boxed{0.1463}, \] and since \(\frac{\partial z}{\partial b}=1\), also \(\frac{\partial L}{\partial b}=0.6225\times0.2350=0.1463\) (it equals \(\partial L/\partial w\) here only because \(x=1\) makes \(\partial z/\partial w=\partial z/\partial b\)).

One gradient-descent step (Calculus §5), learning rate \(\eta=0.1\): \[ w \leftarrow w - \eta\frac{\partial L}{\partial w} = 0.5 - 0.1(0.1463) = 0.4854. \]

Figure 5.5: The 5 neuron as a **computational graph**. **Forward** (blue, left\(\to\)right): each value, ending at the loss \(0.1937\). **Backward** (red, right\(\to\)left): multiply the local derivatives down the chain to get \(\partial L/\partial w=0.1463\). Back-propagation is exactly the chain rule, bookkept node by node.

The weight moved down (toward \(0\)), which lowers the score \(z\), pushes \(a\) toward the target \(y=0\), and shrinks the loss — exactly what we want. That is the whole training algorithm.

A real network is more than a single chain: one neuron’s output fans out to many next-layer units. The chain rule then sums over the paths. Concretely, if a hidden unit \(h\) feeds two outputs \(a_1,a_2\) (through weights \(w_1,w_2\)), then \(\frac{\partial L}{\partial h}=\frac{\partial L}{\partial a_1}w_1+\frac{\partial L}{\partial a_2}w_2\) — the two paths’ contributions simply add (Calculus §3’s sum-over-paths: “gradients add where paths meet”).

From that sum to “Jacobian, multiply right-to-left.” That two-term sum is exactly one row of a matrix–vector product. Stack the upstream gradients into a vector \(\mathbf g_a=[\,\partial L/\partial a_1,\ \partial L/\partial a_2\,]\) and the layer’s weights into the column \(\mathbf{v}=[w_1,\,w_2]^\top\) — a 2×1 Jacobian column for this single-input example. (The Jacobian of a vector function \(f:\mathbb{R}^n\!\to\!\mathbb{R}^m\) is the \(m{\times}n\) matrix of all partials \([\partial f_i/\partial x_j]\); for a linear layer with weight matrix \(W\), the Jacobian of the output \(\mathbf{a}=W\mathbf{h}\) with respect to \(\mathbf{h}\) is \(W\) itself.) With a single hidden unit \(h\) fanning into two outputs, \(\mathbf{v}\) is that \(2\times1\) special case. Then

\[ \frac{\partial L}{\partial h}\;=\;w_1\frac{\partial L}{\partial a_1}+w_2\frac{\partial L}{\partial a_2}\;=\;\mathbf{v}^\top\mathbf g_a . \]

So one back-prop step through a layer is multiply the upstream gradient vector by the layer’s Jacobian(-transpose) — and a whole network is just doing this layer after layer, starting from the loss and moving right-to-left (loss \(\to\) last layer \(\to\dots\to\) first layer). “Sum over paths” and “multiply Jacobians right-to-left” are the same operation, scalar form vs. matrix form. (Tiny check: with \(w_1{=}1,w_2{=}2\) and upstream grads \(0.6225,\,0.30\), both give \(1(0.6225)+2(0.30)=1.2225\).)

Figure 5.6: **Fan-out.** Because \(h\) feeds both \(a_1\) and \(a_2\), there are **two paths** from \(h\) to the loss \(L\). Back-propagation sends a gradient back along each and **adds them** at \(h\): \(\frac{\partial L}{\partial h}=\frac{\partial L}{\partial a_1}w_1+\frac{\partial L}{\partial a_2}w_2 = \mathbf{v}^\top\mathbf{g}_a\) where \(\mathbf{v}=[w_1,w_2]^\top\) is the \(2{\times}1\) Jacobian column. Stacking every node’s fan-out into one matrix gives the full layer **Jacobian**.

6. The training loop — SGD, epochs, batches

Repeat forward + backward + update over the dataset:

repeat (each epoch):
  for each mini-batch of examples:
     forward pass  -> predictions, loss
     back-prop     -> gradients of loss wrt all weights
     update        -> w <- w - eta * gradient        (Calculus §5)

Why the names. SGD = Stochastic Gradient Descent: “stochastic” because each step uses a random subset (a mini-batch) instead of the whole dataset, so the gradient is a noisy estimate — far cheaper and, in practice, helps escape bad spots. An epoch is one full sweep through the data (Greek epokhē, “a fixed point in time”). The batch size is how many examples per step; the learning rate \(\eta\) is the step size.

Beyond vanilla SGD — momentum and Adam. Plain SGD takes a fixed step \(\eta\nabla\); two upgrades are what you actually reach for:

Momentum keeps a running velocity (a decaying average of past gradients) and steps with that — so it rolls through small bumps and accelerates down a consistent slope, like a ball gaining speed.
Adam (Adaptive Moment estimation; Kingma & Ba, 2015) keeps running averages of both the gradient and its square, giving each parameter its own adaptive step size. It is the default optimizer for most deep nets — when you write optim.Adam(...), this is it.

Adam in one update line. Write \(g_t\) for the current gradient \(\partial L/\partial w\). Adam keeps two running averages and combines them:

\[ m_t = \beta_1 m_{t-1} + (1{-}\beta_1)\,g_t,\quad v_t = \beta_2 v_{t-1} + (1{-}\beta_2)\,g_t^2,\qquad w \leftarrow w \;-\; \eta\,\frac{\hat m_t}{\sqrt{\hat v_t}+\varepsilon}, \quad \hat m_t=\frac{m_t}{1-\beta_1^{\,t}},\ \ \hat v_t=\frac{v_t}{1-\beta_2^{\,t}}. \]

\(m_t\) — the first moment: a decaying average of the gradient (the “momentum” direction).
\(v_t\) — the second moment: a decaying average of the gradient squared (its typical size).
\(\beta_1,\beta_2\) — the decay rates (defaults \(0.9\), \(0.999\)); \(\varepsilon\) — a tiny constant (e.g. \(10^{-8}\)) that keeps the denominator from being \(0\).
\(\hat m_t,\hat v_t\) — bias-corrected versions: because \(m_0=v_0=0\), the raw averages start too small, and dividing by \(1-\beta^{\,t}\) rescales them back up.
The division by \(\sqrt{\hat v_t}\) is the adaptivity: a parameter with consistently large gradients gets a smaller effective step, a rarely-updated one a larger step.

Worked example — one Adam step from rest. Reuse §5’s gradient \(g_1=0.1463\), with \(m_0=v_0=0\), \(\eta=0.1\). Then \(m_1=0.1(0.1463)=0.01463\) and \(v_1=0.001(0.1463)^2=2.14\times10^{-5}\); the bias-correction at \(t=1\) divides by \(1-0.9=0.1\) and \(1-0.999=0.001\), giving \(\hat m_1=g_1=0.1463\) and \(\hat v_1=g_1^2=0.02140\). The update is \(\eta\,\hat m_1/\sqrt{\hat v_1}=0.1\times0.1463/0.1463= \boxed{0.1}\). Notice what happened: at the first step the ratio is exactly \(\eta\cdot\operatorname{sign}(g_1)\), so the step size is the learning rate itself — independent of the gradient’s magnitude. That is Adam’s signature: it normalizes each parameter’s step by its own gradient scale. (Numbers checked in code.)

Figure 5.7: **Adam bias correction** (\(\beta_1=0.9\), \(\beta_2=0.999\), impulse gradient \(g_1=1\), then \(g_t=0\)). **(a)** Without bias correction, \(m_t\) starts near \(0\) at \(t{=}1\) (because the EMA is initialised at \(0\)). The corrected \(\hat{m}_t = m_t/(1-\beta_1^t)\) rescales it back to the true gradient magnitude at \(t{=}1\), then converges to \(m_t\) as the correction factor \(1-\beta_1^t \to 1\). **(b)** The effective step \(\eta\hat{m}_t/\sqrt{\hat{v}_t}\) equals \(\eta\) at \(t{=}1\) (the correction ensures the first step is not under-sized) and decays as the gradient signal fades. Without bias correction the first step would be only \(0.1\times0.1=0.01\) — ten times too small. Numbers verified in code; \(\beta_1=0.9\), \(\beta_2=0.999\), \(\eta=0.1\), \(\varepsilon=10^{-8}\).

These change how far and which way each step goes; the forward \(\to\) backward \(\to\) update loop is unchanged.

In practice. Before you run a single training step, initialize weights carefully. Use He / Kaiming init for ReLU networks and Xavier / Glorot for tanh or sigmoid — both keep activations from exploding or vanishing across layers. Concretely, He initialization draws each weight from \(w\sim\mathcal{N}(0,\,2/\text{fan\_in})\) — i.e. standard deviation \(\sqrt{2/\text{fan\_in}}\), where \(\text{fan\_in}\) is the number of incoming connections to the layer — so for a layer with \(\text{fan\_in}=4\) inputs the weights are drawn with std \(\sqrt{2/4}=\sqrt{0.5}\approx0.707\), scaled to the layer’s width so the forward signal neither blows up nor dies. (Xavier / Glorot uses \(1/\text{fan\_in}\) instead of \(2/\text{fan\_in}\), appropriate for tanh/sigmoid layers whose gradients are roughly half the size.) All-zeros init is silent death: every neuron computes the same gradient, so nothing ever differentiates. When training misbehaves, use this quick map: loss not decreasing → lower the learning rate or fix the init; loss goes NaN → clip gradients first, then lower LR; many dead ReLUs (units stuck at zero, never recovering) → switch to He init or LeakyReLU; train loss low but validation loss high → add regularization (the bias–variance U-curve is in Losses & Regularizers §4).

7. Overfitting and regularization in neural networks

A big network has enough capacity to memorize the training data — fitting noise instead of signal, so it does great on training data and poorly on new data. This is overfitting, and the principled way to think about it is the bias–variance trade-off:

See Losses & Regularizers §4 — The bias–variance trade-off is defined there (it is what regularization manages). In one line: too-simple models underfit (high bias); too-flexible models overfit (high variance); regularization trades a little bias for a big drop in variance.

Neural nets use the regularizers from Losses & Regularizers §5, plus two of their own:

Weight decay (L2 on the weights) — Losses & Regularizers’ \(\lambda\lVert W\rVert^2\).
Dropout — randomly zero a fraction of neurons each training step, so no unit can rely on any other; “drops out” units. (This is the node-dropout idea that reappears as edge dropout in SGL, SSL & Contrastive Learning.)
Early stopping — halt when validation loss stops improving (see the curve below).
Batch / Layer Normalization — re-center and re-scale activations. The primary purpose is training stability (a smoother loss surface), with mild regularization as a bonus.

BatchNorm (BN) normalizes each feature across the mini-batch. Given a batch of activations \(x_1,\dots,x_B\) for one feature, it computes the batch mean \(\mu_B\) and standard deviation \(\sigma_B\), normalizes each value, then applies a learned scale \(\gamma\) and shift \(\delta\): \[ \hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \varepsilon}}, \qquad y_i = \gamma\,\hat{x}_i + \delta. \] Hand example. Activations \([2,\,4,\,6]\): \(\mu_B=4\), \(\sigma_B=\sqrt{\tfrac{(2{-}4)^2+(4{-}4)^2+(6{-}4)^2}{3}}=\sqrt{8/3}\approx1.63\). Centred values: \(\hat{x}=[-1.22,\,0,\,1.22]\). (Checked in code.) After BN the activations are zero-mean, unit-variance (before the \(\gamma,\delta\) rescaling) — which keeps the downstream layers in a well-conditioned input range.

LayerNorm (LN) applies the same transform, but normalizes across a single token’s own features (all \(d\) dimensions of one position) rather than across the batch dimension — so it is batch-size-independent and works on sequences of varying length. LayerNorm is the normalization of choice inside the Transformer (next note), where the batch dimension is ill-defined for autoregressive generation.

Figure 5.8: **Overfitting, the picture.** As training proceeds (or capacity grows), **training loss** (blue) keeps falling, but **validation loss** (red) bottoms out and then *rises* as the net starts memorizing noise. The gap between the curves is overfitting; **early stopping** halts at the validation minimum (dashed line), and a regularizer lowers and flattens the red curve.

8. Depth and representation learning (why “deep”)

Stacking layers lets a network build features hierarchically: early layers learn simple patterns, later layers compose them into abstract ones (edges → shapes → objects in vision; characters → words → meaning in language). This learned, layered feature-building is representation learning — the network discovers useful features instead of being handed them. The universal approximation theorem (Cybenko, 1989; Hornik, 1991) says even a single sufficiently-wide hidden layer can approximate any continuous function; depth makes that vastly more parameter-efficient.

Specialized architectures just constrain how layers connect: CNNs (convolutional — share weights across image patches; scoped out, as this book does not need vision), and RNNs / Transformers (for sequences; the Transformer’s attention powers modern LLMs). All are trained by the same §5–§6 engine. The next note — Representation Learning & the Transformer — builds the RNN, LSTM, and Transformer from zero and shows how each becomes a recommender.

9. Exercises

Work these by hand — the numbers are kept tiny on purpose. Full worked solutions are in the Solutions appendix at the back of the book.

(compute) A single neuron (§1) has weights \(\mathbf{w}=[2,-1]\), bias \(b=0.5\), and input \(\mathbf{x}=[1,3]\). Compute the pre-activation \(z=\mathbf{w}\cdot\mathbf{x}+b\), then its output under ReLU, \(a=\max(0,z)\) (§2). Which value did ReLU clip, and to what?
(compute) Evaluate all three canonical activations (§2) at \(z=1\): the sigmoid \(\sigma(z)=\tfrac{1}{1+e^{-z}}\), tanh, and ReLU \(=\max(0,z)\). Then redo just the sigmoid and ReLU at \(z=-2\). Which activation changes sign with \(z\), and which one is flat (returns \(0\)) for the negative input?
(concept) §2 claims a network needs a nonlinearity between layers. Show the algebra: stack two linear layers \(W_2(W_1\mathbf{x})\) and explain in one line why the result is still a single linear map — so no number of purely linear layers can ever separate XOR. What is the smallest fix that lets a net carve the XOR boundary?
(extend) Run the tiny \(2\!\to\!2\!\to\!1\) forward pass (§3) on input \(\mathbf{x}=[2,1]\). Hidden layer: weight rows \([1,1]\) and \([-1,-1]\), bias \(\mathbf{b}=[-1,1]\), ReLU; output layer: \(\mathbf{w}_2=[2,1]\), \(b_2=1\). Compute the hidden vector \(\mathbf{h}\) and the output \(\hat y\). Point to the entry ReLU clipped to \(0\) — the nonlinearity at work.
(compute) Reuse §5’s worked neuron — forward pass \(z=0.5\), \(a=\sigma(0.5)=0.6225\), and the back-prop chain giving \(\partial L/\partial w=0.1463\). (a) Re-multiply the three local derivatives \(\tfrac{\partial L}{\partial a}=0.6225\), \(\tfrac{\partial a}{\partial z}=0.2350\), \(\tfrac{\partial z}{\partial w}=1\) to confirm \(\partial L/\partial w=0.1463\). (b) Take one gradient-descent step with a larger learning rate \(\eta=0.5\) (instead of §5’s \(0.1\)): \(w\leftarrow w-\eta\,\partial L/\partial w\). Did the weight move toward or away from \(0\)?
(compute) Reuse §4’s softmax + cross-entropy head: three classes, logits \(z=[2,1,0]\), exponentials \(e^{2},e^{1},e^{0}=7.39,2.72,1.00\) (sum \(11.11\)), true class \(0\). (a) Compute the softmax probability \(\hat p_0\) of the true class and the cross-entropy loss \(\mathcal L=-\ln\hat p_0\). (b) What would \(\mathcal L\) be if the model were maximally unsure — a uniform \(\hat p_0=\tfrac13\)? Which loss is larger, and why is that the right direction?
(compute) One Adam update line (§6) from rest. Reuse §5’s gradient \(g_1=0.1463\) with \(m_0=v_0=0\), \(\beta_1=0.9\), \(\beta_2=0.999\), \(\eta=0.1\). Compute \(m_1=(1-\beta_1)g_1\), \(v_1=(1-\beta_2)g_1^2\), then the bias-corrected \(\hat m_1=m_1/(1-\beta_1)\) and \(\hat v_1=v_1/(1-\beta_2)\), and finally the step \(\eta\,\hat m_1/\sqrt{\hat v_1}\). Why does the step come out to exactly \(\eta\), regardless of how big \(g_1\) was?
(concept) §6 says to initialize weights carefully: use He / Kaiming init for some networks and Xavier / Glorot for others. Which init pairs with ReLU layers and which with tanh / sigmoid layers? And in one line, why is initializing every weight to zero a silent failure?
(concept) §7 lists two regularizers nets add of their own: dropout and early stopping. In one sentence each: what does dropout do during a training step (and why does that discourage a unit from over-relying on any other), and what signal does early stopping watch to decide when to halt (point to the curve in §7)?
(apply) §1 says a neuron computes \(\mathbf{w}\cdot\mathbf{x}+b\) — the very dot product the recommender chapters score with. A recommender stores a user as \(\mathbf{p}=[2,1,3]\) and an item as \(\mathbf{q}=[1,2,1]\). (a) Write the affinity score \(\mathbf{p}\cdot\mathbf{q}\) as a single linear neuron — say what plays the role of the weights \(\mathbf{w}\), the input \(\mathbf{x}\), and the bias \(b\) — and compute it. (b) A second item \(\mathbf{q}_2=[3,0,1]\) scores higher; verify by computing \(\mathbf{p}\cdot\mathbf{q}_2\). (This dot-product head is matrix factorization / LightGCN; the embeddings \(\mathbf{p},\mathbf{q}\) are the lookup rows of §3’s embedding layer.)
(compute) — embedding lookup and the sparse gradient. A 3-id, 3-dim embedding table is \(E=\begin{bmatrix}1&0&2\\ 0&3&1\\ 2&1&0\end{bmatrix}\) (§3). (a) What vector does id \(1\) (the middle row, 0-indexed) return? (b) In one training step that uses only id \(1\), which rows of \(E\) receive a nonzero gradient — and why?
(compute) — the same neuron, flipped target. Re-run §5’s one-neuron example (\(x=1,\,w=0.5,\,b=0\), sigmoid, so \(a=0.6225\)) but with target \(y=1\) instead of \(0\). (a) Compute \(\partial L/\partial a=a-y\), then \(\partial L/\partial w\) (multiply by \(\partial a/\partial z=0.2350\) and \(\partial z/\partial w=1\)). (b) Take one step (\(\eta=0.1\)): does \(w\) move up or down now, and why does that make sense given the new target?
(concept) — embeddings and cold start. A pure-ID embedding layer learns one row per id by back-prop (§3). (a) Why can such a model give no meaningful score for a user or item that never appeared in training? (b) Tie this to the word From Graphs to LightGCN §14 uses for a model that “only represents ids seen in training,” and name one fix a later chapter provides.

10. Where this fits in the book

Idea (here)	Where it returns
Neuron = \(\mathbf{w}\cdot\mathbf{x}+b\) (§1)	the scoring rule everywhere; a recsys score is a dot product (Traditional RecSys; From Graphs to LightGCN)
Back-prop = chain rule (§5)	how every learned model in the book is trained, incl. LightGCN & RLMRec
SGD / mini-batches (§6)	the optimizer behind all the embedding training
Dropout / weight decay (§7)	Losses & Regularizers §5 regularizers; edge dropout in SGL (SSL & Contrastive Learning)
Representation learning (§8)	embeddings are learned representations; the next note builds word2vec → Transformer → LLMs
Activation/MLP (§2–3)	the “MLP” scoring head and projection layers in many recommenders

Carry this forward: a neural network is stacked linear maps + nonlinearities, trained by gradient descent on a loss, with gradients from the chain rule (back-prop). Everything fancy is a variation on those parts.

11. Glossary

Term	Plain meaning
Neuron / unit	A weighted sum of inputs + bias, passed through an activation.
Weight / bias	Learned multiplier per input / learned constant offset.
Pre-activation \(z\) (logit)	The raw score \(\mathbf{w}\cdot\mathbf{x}+b\) before the activation.
Activation \(\phi\)	Nonlinear function applied to \(z\) (sigmoid, tanh, ReLU).
ReLU	\(\max(0,z)\) — Rectified Linear Unit; the default activation.
Layer / width / depth	A bank of neurons / neurons-per-layer / number of layers.
Hidden layer	A layer that is neither input nor output.
MLP	Multi-layer perceptron — a feedforward stack of fully-connected layers.
Embedding layer	A learnable lookup table; id → its row (= one-hot × matrix); the input side for discrete data.
Softmax output	Multi-class output head: \(K\) scores → a probability distribution, paired with cross-entropy.
Forward pass	Running an input through the net to a prediction + loss.
Back-propagation	Computing all weight-gradients by the chain rule, backward through the net.
SGD	Stochastic Gradient Descent — gradient steps on random mini-batches.
Momentum / Adam	SGD upgrades: a velocity term / per-parameter adaptive step (Adam = the default optimizer).
Epoch / batch	One full pass over the data / the examples used per update.
Overfitting	Fitting noise; great on training data, poor on new data.
Dropout	Regularizer: randomly zero units during training.
Batch / Layer Norm	Re-center/re-scale activations for training stability (LayerNorm powers the Transformer).
Representation learning	The network learning useful features by itself.
Universal approximation	A wide enough net can approximate any continuous function.

12. References

Deisenroth, M. P., Faisal, A. A., & Ong, C. S. (2020). Mathematics for machine learning. Cambridge University Press. https://mml-book.github.io/
Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. MIT Press. https://www.deeplearningbook.org/
Kingma, D. P., & Ba, J. (2015). Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations (ICLR). arXiv:1412.6980
MIT (2024). 6.390 Introduction to machine learning — neural networks & back-propagation lecture notes. Online course. https://introml.mit.edu/
Nielsen, M. (2015). Neural networks and deep learning. Online book. http://neuralnetworksanddeeplearning.com/
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323, 533–536. https://doi.org/10.1038/323533a0
Srivastava, N., Hinton, G. E., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15, 1929–1958. https://jmlr.org/papers/v15/srivastava14a.html

Online sources verified June 2026.