Appendix F — Solutions to Exercises

Linear Algebra for ML

E1. Dot product (§4): multiply entry-by-entry and sum, $\mathbf{u}\cdot\mathbf{s}=3\cdot2+4\cdot1=6+4=\boxed{10}$. It is positive, so the two arrows broadly point the same way (the dot product is positive when the angle between them is under $90^\circ$, zero at a right angle, negative when they oppose).

E2. L2 norm (§1): $\lVert\mathbf{g}\rVert=\sqrt{6^2+8^2}=\sqrt{36+64}=\sqrt{100}=\boxed{10}$. Normalizing divides by that length: $\hat{\mathbf{g}}=\tfrac{1}{10}[6,8]=\boxed{[0.6,\,0.8]}$. Check it is a unit vector: $\lVert\hat{\mathbf{g}}\rVert=\sqrt{0.6^2+0.8^2}=\sqrt{0.36+0.64}=\sqrt{1}=1$. ✓

E3. Cosine similarity (§4). Dot product $\mathbf{u}\cdot\mathbf{t}=3\cdot0+4\cdot5=0+20=20$; lengths $\lVert\mathbf{u}\rVert=\sqrt{9+16}=5$ and $\lVert\mathbf{t}\rVert=\sqrt{0+25}=5$. So $\cos\theta=\dfrac{20}{5\cdot5}=\dfrac{20}{25}=\boxed{0.8}$ (an angle of about $36.9^\circ$). At $0.8$ it sits between the two reference pairs but is clearly on the near-aligned side — closer to the chapter’s $0.96$ than to its orthogonal $0$.

E4. Matrix–vector product (§3): each output entry is a row of $A$ dotted with $\mathbf{y}=[2,1,3]$. \[ A\mathbf{y}=\begin{bmatrix}2\cdot2+0\cdot1+1\cdot3\\ 1\cdot2+3\cdot1+0\cdot3\end{bmatrix} =\begin{bmatrix}4+0+3\\ 2+3+0\end{bmatrix}=\boxed{[7,\,5]}. \] Column blend (picture (b), §3): $2\,\mathrm{col}_1+1\,\mathrm{col}_2+3\,\mathrm{col}_3 =2[2,1]+1[0,3]+3[1,0]=[4,2]+[0,3]+[3,0]=[7,5]$ — the same vector. ✓ (A $2\times3$ matrix took a 3-vector to a 2-vector: “3 in, 2 out.”)

E5. Characteristic equation (§5): \[ \det(B-\lambda I)=\det\begin{bmatrix}3-\lambda&1\\1&3-\lambda\end{bmatrix}=(3-\lambda)^2-1 =\lambda^2-6\lambda+8=(\lambda-2)(\lambda-4)=0\;\Rightarrow\;\boxed{\lambda=2,\ 4}. \] Trace check: $2+4=6=3+3$, the diagonal sum. ✓ For the larger eigenvalue $\lambda=4$, solve $(B-4I)\mathbf{v}=\begin{bmatrix}-1&1\\1&-1\end{bmatrix}\mathbf{v}=\mathbf{0}$, i.e. $-v_1+v_2=0\Rightarrow v_1=v_2$, so the eigenvector is $\boxed{[1,1]}$. (Indeed $B[1,1]=[4,4]=4[1,1]$; the leftover $[1,-1]$ goes with $\lambda=2$.)

E6. A square matrix acts as $A\mathbf{v}=\lambda\mathbf{v}$ along its invariant directions (eigenvalues), whereas singular values measure the stretch of the output ellipse and are defined through $A^{\top}A$ via $\sigma_i=\sqrt{\lambda_i(A^{\top}A)}$ — a different computation, so the two sets generally differ (for $\begin{bmatrix}3&4\\0&0\end{bmatrix}$, $A^{\top}A=\begin{bmatrix}9&12\\12&16\end{bmatrix}$ has eigenvalues $25,0$, giving $\sigma=5,0$, not $3,0$). They coincide only for a symmetric positive-semidefinite matrix, where $A^{\top}A=A^2$ forces $\sigma_i=\sqrt{\lambda_i^2}=|\lambda_i|=\lambda_i$ (every $\lambda_i\ge0$).

E7. The reconstruction $\mathbf{u}=\sum_i(\mathbf{u}\cdot\mathbf{e}_i)\,\mathbf{e}_i$ relies on both properties of an orthonormal basis. Orthogonality guarantees that the coordinate of $\mathbf{u}$ along $\mathbf{e}_i$ feels only that axis (the other terms contribute $0$ to $\mathbf{u}\cdot\mathbf{e}_i$, since $\mathbf{e}_j\cdot\mathbf{e}_i=0$ for $j\neq i$). Unit length is what makes the projection coefficient itself equal the bare dot product: the general projection is $\frac{\mathbf{u}\cdot\mathbf{e}_i}{\lVert\mathbf{e}_i\rVert^2}\mathbf{e}_i$, and only when $\lVert\mathbf{e}_i\rVert^2=1$ does the dividing factor disappear. If the $\mathbf{e}_i$ were merely orthogonal (say length $2$), each true coordinate would be $\mathbf{u}\cdot\mathbf{e}_i$ divided by $4$, so reading the coordinate as the plain dot product would overshoot by that factor.

E8. Apply §6’s recipe to $G=\begin{bmatrix}2&2\\2&2\end{bmatrix}$. Form \[ G^{\top}G=\begin{bmatrix}2&2\\2&2\end{bmatrix}\begin{bmatrix}2&2\\2&2\end{bmatrix}=\begin{bmatrix}8&8\\8&8\end{bmatrix}, \] a symmetric matrix with trace $16$ and determinant $8\cdot8-8\cdot8=0$, so its eigenvalues are $\boxed{16\ \text{and}\ 0}$. The singular values are their square roots: $\sigma_1=\sqrt{16}=\boxed{4}$ and $\sigma_2=\sqrt{0}=0$. One nonzero singular value means $G$ has $\boxed{\text{rank }1}$ — as expected, since its two rows are identical and lie on one line (its single direction is $\tfrac{1}{\sqrt2}[1,1]$, and $4\cdot\tfrac{1}{\sqrt2}[1,1]\cdot\tfrac{1}{\sqrt2}[1\ 1]=2\begin{bmatrix}1&1\\1&1\end{bmatrix}=G$). ✓

E9. The quadratic form (§6) is $\mathbf{x}^{\top}D\mathbf{x}$ with $D=\begin{bmatrix}1&2\\2&1\end{bmatrix}$ and $\mathbf{x}=[1,-1]$. Compute $D\mathbf{x}=\begin{bmatrix}1\cdot1+2(-1)\\2\cdot1+1(-1)\end{bmatrix}=[-1,1]$, then $\mathbf{x}^{\top}D\mathbf{x}=[1,-1]\cdot[-1,1]=-1-1=\boxed{-2}$. Because this is negative, $D$ is not PSD — and that single $\mathbf{x}=[1,-1]$ is the witness that proves it (PSD requires $\mathbf{x}^{\top}D\mathbf{x}\ge0$ for every $\mathbf{x}$; one negative value is enough to disqualify it; here it exposes $D$’s eigenvalue $-1$). Contrast the healthy $S=\begin{bmatrix}2&1\\1&2\end{bmatrix}$ at the same vector: $S[1,-1]=[1,-1]$, so $\mathbf{x}^{\top}S\mathbf{x}=[1,-1]\cdot[1,-1]=1+1=+2\ge0$ — consistent with $S$ being PSD (its eigenvalues $1,3$ are both $\ge0$).

E10. Scoring by dot product (§4), with $\mathbf{p}=[2,1,3]$. (a) $\hat r_1=\mathbf{p}\cdot\mathbf{q}_1=2\cdot1+1\cdot3+3\cdot0=2+3+0=5$ and $\hat r_2=\mathbf{p}\cdot\mathbf{q}_2=2\cdot0+1\cdot1+3\cdot2=0+1+6=7$. Since $7>5$, dot product ranks $\mathbf{q}_2$ above $\mathbf{q}_1$ ($\boxed{\mathbf{q}_2\succ\mathbf{q}_1}$). (b) Cosine needs the norms: $\lVert\mathbf{p}\rVert=\sqrt{4+1+9}=\sqrt{14}\approx3.742$, $\lVert\mathbf{q}_1\rVert=\sqrt{1+9+0}=\sqrt{10}\approx3.162$, $\lVert\mathbf{q}_2\rVert=\sqrt{0+1+4}=\sqrt5\approx2.236$. Then $\cos(\mathbf{p},\mathbf{q}_1)=\dfrac{5}{\sqrt{14}\sqrt{10}}\approx\boxed{0.423}$ and $\cos(\mathbf{p},\mathbf{q}_2)=\dfrac{7}{\sqrt{14}\sqrt5}\approx\boxed{0.837}$. Cosine also ranks $\mathbf{q}_2$ first, so the two rankings agree here. But notice the gap is much wider under cosine ($0.837$ vs $0.423$) than the dot product’s $7$ vs $5$ implies relative to the items’ sizes: $\mathbf{q}_1$ is the larger-norm vector ($\sqrt{10}>\sqrt5$), so raw dot product flatters $\mathbf{q}_1$ partly for its length, not its direction — exactly the popularity/length bias §4’s “serving trick” (L2-normalize first, turning the dot product into cosine) is designed to remove.

E11. Rank-1 collapse (§2). (a) $\det A=2\cdot2-4\cdot1=0$ — singular. (b) $A[1,0]=[2,1]$, while $A[2,-1]=[2\cdot2+4(-1),\,1\cdot2+2(-1)]=[0,0]$: the direction $[2,-1]$ is killed (mapped to the origin — the null space). (c) Both columns, $[2,1]$ and $[4,2]$, are multiples of $[2,1]$, so every output is too — the image is the line $\{c\,[2,1]\}$.

E12. Spot the singular matrix (§2). Determinants: $\big[\begin{smallmatrix}1&0\\0&3\end{smallmatrix}\big]\to3$; $\big[\begin{smallmatrix}2&1\\4&2\end{smallmatrix}\big]\to2\cdot2-1\cdot4=0$; $\big[\begin{smallmatrix}1&1\\1&2\end{smallmatrix}\big]\to2-1=1$. Only the middle matrix has $\det=0$, so it is the singular one (its rows $[2,1],[4,2]$ are dependent — one is twice the other).

E13. Eigenvalues from trace and determinant (§5). We need $\lambda_1+\lambda_2=5$ and $\lambda_1\lambda_2=6$, i.e. the roots of $\lambda^2-5\lambda+6=(\lambda-2)(\lambda-3)=0$. So $\lambda=\boxed{2\text{ and }3}$ — recoverable from the trace and determinant alone, no matrix needed.

Calculus for ML

E1. Derivative from the limit (§1). Expand and cancel the $h$ in the denominator: \[ \frac{(2+h)^2-2^2}{h}=\frac{4+4h+h^2-4}{h}=\frac{4h+h^2}{h}=4+h\xrightarrow[h\to0]{}\boxed{4}. \] So $f'(2)=4$: near $x=2$ the output of $x^2$ grows about $4$ units per unit of input. The power rule agrees — $f'(x)=2x$ gives $2\cdot2=4$. ✓

E2. Product rule (§2) on $f=x^2(x+1)$, with $f_1=x^2$ ($f_1'=2x$) and $f_2=x+1$ ($f_2'=1$): \[ f'(x)=f_1'f_2+f_1f_2'=2x(x+1)+x^2\cdot1=2x^2+2x+x^2=3x^2+2x. \] At $x=2$: $f'(2)=3(4)+2(2)=12+4=\boxed{16}$. Check by expanding first: $f(x)=x^3+x^2$, so $f'(x)=3x^2+2x$ and $f'(2)=12+4=16$ ✓ — the two routes agree.

E3. Chain rule (§3) on $f=(2x+1)^3$: outer $(\cdot)^3$ has derivative $3(\cdot)^2$ evaluated at the inner value, times the inner derivative $g'(x)=2$: \[ f'(x)=\underbrace{3(2x+1)^2}_{\text{outer}'\text{ at }g}\times\underbrace{2}_{g'}=6(2x+1)^2. \] At $x=1$: $f'(1)=6(2\cdot1+1)^2=6\cdot3^2=6\cdot9=\boxed{54}$. Check by expanding: $f(x)=8x^3+12x^2+6x+1$, so $f'(x)=24x^2+24x+6$ and $f'(1)=24+24+6=54$ ✓.

E4. Sigmoid and ReLU derivatives (§3). At $z=0$, $\sigma(0)=\tfrac{1}{1+e^{0}}=\tfrac12$, so \[ \sigma'(0)=\sigma(0)\big(1-\sigma(0)\big)=\tfrac12\cdot\tfrac12=\boxed{0.25}. \] The product $\sigma(1-\sigma)$ is largest exactly when $\sigma=\tfrac12$ (the peak of the downward parabola in $\sigma$), giving the maximum $\sigma'=\tfrac12(1-\tfrac12)=\boxed{0.25}$ — a sigmoid never passes more than a quarter of the gradient through. The ReLU derivative is the slope of each straight piece: $\text{ReLU}'(z)=\boxed{1}$ for $z>0$ and $\boxed{0}$ for $z<0$ (at $z=0$ the kink leaves it undefined; any value in $[0,1]$ is a valid subgradient).

E5. Partial derivatives and the gradient (§4) of $f(x,y)=x^2+xy$. Holding the other variable fixed each time: \[ \frac{\partial f}{\partial x}=2x+y\quad(\text{treat }y\text{ constant}),\qquad \frac{\partial f}{\partial y}=x\quad(\text{treat }x\text{ constant}). \] At $(x,y)=(2,3)$: $\nabla f=\big[\,2(2)+3,\ 2\,\big]=\boxed{[7,\ 2]}$. (At the chapter’s own point $(1,2)$ the same formulas give $[4,1]$, matching §4. ✓)

E6. Hessians and their eigenvalues (§6). \[ f=x^2+3y^2:\ H_f=\begin{bmatrix}2&0\\0&6\end{bmatrix},\quad\text{eigenvalues }\boxed{2,\ 6}; \qquad g=xy:\ H_g=\begin{bmatrix}0&1\\1&0\end{bmatrix},\quad\text{eigenvalues }\boxed{+1,\ -1}. \] (For the diagonal $H_f$ the eigenvalues are the diagonal entries $2,6$; for $H_g$ the characteristic equation is $\det\!\left[\begin{smallmatrix}-\lambda&1\\1&-\lambda\end{smallmatrix}\right]=\lambda^2-1=0\Rightarrow\lambda=\pm1$.) $H_f$ has both eigenvalues positive → PSD (here positive definite) → $f$ is a convex bowl with one minimum. $H_g$ has mixed signs ($+1,-1$) → indefinite → the critical point of $g=xy$ is a saddle, so $g$ is not convex (it curves up along one direction and down along another — the horse-saddle shape).

E7. Vanishing gradients (§3). Multiplying the per-layer ceilings of $0.25$ over $4$ stacked sigmoid layers gives at most \[ 0.25^{4}=\Big(\tfrac14\Big)^{4}=\frac{1}{256}\approx0.0039,\qquad\text{so }\boxed{1/256}. \] Because back-propagation multiplies these sub-$1$ rates link by link, the gradient reaching the earliest layer shrinks geometrically toward $0$ — it has all but vanished, so those early layers barely update; this is the vanishing-gradient problem. A ReLU on its live half has slope $1$, so it multiplies by $1$ rather than $0.25$ and leaves the back-propagated gradient intact, which is why ReLU largely replaced the sigmoid in deep hidden layers.

E8. Gradient descent on $f(x)=x^2$ (so $f'(x)=2x$), $x_0=2$, $\eta=0.1$, via $x\leftarrow x-\eta f'(x)$ (§5): \[ x_1=2-0.1(2\cdot2)=2-0.4=\boxed{1.6},\qquad x_2=1.6-0.1(2\cdot1.6)=1.6-0.32=\boxed{1.28}, \] and $f(x_2)=1.28^2=\boxed{1.6384}$ — the value marched downhill $4\to2.56\to1.6384$ toward the minimum at $0$. Taylor check of the first step (§7): the step is $\delta=-\eta f'(x_0)=-0.4$, and the first-order predicted change is \[ f'(x_0)\cdot\delta=4\times(-0.4)=-1.6<0, \] negative, so the step provably lowers $f$ (predicted $f\approx4-1.6=2.4$ vs the actual $f(1.6)=2.56$ — close, and both below the starting $4$). This is the §7 guarantee $f(\theta-\eta\nabla f)\approx f(\theta)-\eta\lVert\nabla f\rVert^2\le f(\theta)$ in one number.

E9. Jacobian (§8) of $F(x,y)=[\,xy,\ x+y^2\,]$. With $F_1=xy$ and $F_2=x+y^2$, fill $J_{ij}=\partial F_i/\partial x_j$ (row = output, column = input): \[ J=\begin{bmatrix}\partial_x(xy)&\partial_y(xy)\\[2pt]\partial_x(x+y^2)&\partial_y(x+y^2)\end{bmatrix} =\begin{bmatrix}y&x\\1&2y\end{bmatrix} \;\xrightarrow{(2,3)}\;\boxed{\begin{bmatrix}3&2\\1&6\end{bmatrix}}. \] Row $1$ is $[\,y,\ x\,]=[\,3,\ 2\,]$ at $(2,3)$, which is exactly $\nabla(xy)=[\partial_x(xy),\partial_y(xy)]=[y,x]$ — the gradient of the first output, just as §8 says “row $i$ is the gradient of output $i$.” ✓

E10. One gradient-descent step on a squared-error loss (§3 chain rule + §5 step). With $L(w)=(2w-1)^2$, differentiate by the chain rule (outer $(\cdot)^2\to2(\cdot)$, inner $\tfrac{d}{dw}(2w-1)=2$): \[ \frac{dL}{dw}=2(2w-1)\cdot2=4(2w-1)=8w-4. \] At $w_0=0$: $\dfrac{dL}{dw}=8(0)-4=\boxed{-4}$. The step ($\eta=0.1$) is \[ w\leftarrow w_0-\eta\frac{dL}{dw}=0-0.1(-4)=\boxed{0.4}. \] Loss before: $L(0)=(2\cdot0-1)^2=(-1)^2=\boxed{1}$; loss after: $L(0.4)=(2\cdot0.4-1)^2=(-0.2)^2=\boxed{0.04}$. The loss dropped $1\to0.04$ in a single step — one literal turn of the training loop $\theta\leftarrow\theta-\eta\nabla f$ that every model in the series runs (here on one weight; in practice on a whole parameter vector at once).

E11. Linear-approximation error (§7). With $L(x)=4x-4$: at $x=2.5$, $L=6$ and $f=6.25$, error $0.25$; at $x=3$, $L=8$ and $f=9$, error $1.0$. Both equal $(x-2)^2$ ($0.5^2=0.25$, $1^2=1$) — the error is the dropped curvature term, growing with the square of the step away from $x_0=2$.

E12. Convexity (§6). (a) $f''=12x^2\ge0$ everywhere → convex. (b) $f''=6x$ changes sign at $x=0$ → not convex (an inflection there). (c) $f''=e^{x}>0$ everywhere → convex. (d) $f''=-2<0$ everywhere → concave (an upside-down bowl), not convex.

E13. Convergent learning rate (§5). (a) $|1-2\eta|<1$ means $-1<1-2\eta<1$, i.e. $0<\eta<1$ — any step in that range shrinks the iterate toward $0$. (b) $\eta=1$ gives multiplier $1-2=-1$, so the iterate oscillates $x\to-x\to x$ forever (never settles); $\eta=1.5$ gives $-2$, so $|x|$ doubles every step — it diverges. (c) $\eta=0.5$ gives multiplier $0$: the step lands exactly on the minimum in one move (here $\eta=1/f''$ for the parabola), the special case §5 flagged.

Probability & Statistics for ML

E1. Expectation and variance (§1) of the fair four-sided die, $p(x)=\tfrac14$ for $x\in\{1,2,3,4\}$. Mean: $\mathbb{E}[X]=\tfrac14(1+2+3+4)=\tfrac{10}{4}=\boxed{2.5}$ — the balance point of the four bars (a value you never roll). For the variance use the shortcut $\operatorname{Var}[X]=\mathbb{E}[X^2]-\mu^2$: $\mathbb{E}[X^2]=\tfrac14(1+4+9+16)=\tfrac{30}{4}=7.5$, so $\operatorname{Var}[X]=7.5-2.5^2=7.5-6.25=\boxed{1.25}$ (exactly $5/4$). (Smaller than the six-sided die’s $35/12\approx2.92$, as it should be — four faces spread less than six.)

E2. Naming the distributions (§2). (a) One yes/no outcome is a Bernoulli, parameter $p$. (b) One pick out of $5$ unordered labels is a Categorical, parameters $p_1,\dots,p_5$ (summing to $1$). (c) A belief about a probability is a Beta; the pseudo-count reading “$\alpha-1$ heads, $\beta-1$ tails already seen” with $4$ heads and $1$ tail gives $\alpha-1=4,\ \beta-1=1$, i.e. $\boxed{\text{Beta}(5,2)}$, whose mean is $\mathbb{E}[\theta]=\dfrac{\alpha}{\alpha+\beta}=\dfrac{5}{7}\approx\boxed{0.714}$ (leans toward heads, as “4 heads vs 1 tail” should). Finally, the Bernoulli of (a) with $p=0.8$: $\mathbb{E}[X]=p=\boxed{0.8}$ and $\operatorname{Var}[X]=p(1-p)=0.8\cdot0.2=\boxed{0.16}$.

E3. Odds → logit → sigmoid (§3) for the $p=0.8$ coin. (a) Odds $=\dfrac{p}{1-p}=\dfrac{0.8}{0.2}=\boxed{4}$ (4× as likely to land heads as not). (b) Logit $=\ln\!\big(\tfrac{p}{1-p}\big)=\ln 4\approx\boxed{1.386}$. (c) Feed that score $z=\ln 4$ back through the sigmoid: $e^{-z}=e^{-\ln4}=\tfrac14=0.25$, so $\sigma(z)=\dfrac{1}{1+0.25}=\dfrac{1}{1.25}=\boxed{0.8}$ ✓ — exactly the $p$ we started from, because the sigmoid is the logit’s inverse (and the odds are exactly $e^{z}=4$). (d) $\sigma(0)=\dfrac{1}{1+e^{0}}=\dfrac{1}{2}=\boxed{0.5}$.

E4. Softmax (§3) of $z=[1,0,0]$. Exponentiate: $e^{1},e^{0},e^{0}=2.718,\,1,\,1$, summing to $4.718$. Divide each by the total: \[ \operatorname{softmax}(z)=\Big[\tfrac{2.718}{4.718},\,\tfrac{1}{4.718},\,\tfrac{1}{4.718}\Big] =\boxed{[0.576,\,0.212,\,0.212]}, \] and $0.576+0.212+0.212=1.000$ ✓. Class $A$ wins (it had the largest score), and $B$ and $C$ are equal because they had identical scores ($z_B=z_C=0$) — softmax is a deterministic function of the scores, so equal scores give equal probabilities. The top score takes the lion’s share but the others keep nonzero probability (a soft max, not a hard one).

E5. MLE for the coin (§4): 6 heads, 2 tails, $L(p)=p^{6}(1-p)^{2}$, $\log L(p)=6\ln p+2\ln(1-p)$. Set the derivative to zero: \[ \frac{d\log L}{dp}=\frac{6}{p}-\frac{2}{1-p}=0\;\Rightarrow\;6(1-p)=2p\;\Rightarrow\;6=8p \;\Rightarrow\;\hat p_{\text{MLE}}=\frac{6}{8}=\boxed{0.75}. \] So the MLE is just the observed frequency $6/8$. Hand-check at $p=0.75$: $\tfrac{6}{0.75}=8$ and $\tfrac{2}{0.25}=8$ — the two terms cancel ✓.

E6. Bayes’ rule, the medical test (§5), exact chapter numbers. (a) Evidence by the law of total probability: \[ \Pr(+)=\Pr(+\mid S)\Pr(S)+\Pr(+\mid\neg S)\Pr(\neg S)=0.99(0.01)+0.05(0.99)=0.0099+0.0495=\boxed{0.0594}. \] Then the posterior: \[ \Pr(S\mid+)=\frac{\Pr(+\mid S)\Pr(S)}{\Pr(+)}=\frac{0.0099}{0.0594}\approx\boxed{0.167}. \] (b) Counting over $10{,}000$ people: the $1\%$ prior makes 100 sick, of whom the test catches $0.99\times100=$ 99 (true positives); of the 9,900 healthy, $5\%=$ 495 test positive anyway (false positives). Total positives $=99+495=594$, of which only $99$ are truly sick: $99/594\approx\boxed{0.167}$ — the same $\sim17\%$ ✓. It is far below the “$99\%$ accurate” headline because the disease is rare (low prior), so the $495$ false positives from the huge healthy majority swamp the $99$ true ones — the base-rate effect.

E7. MLE vs MAP — the prior as a regularizer (§6). Same coin as E5 (6 heads, 2 tails) with a Beta$(2,2)$ prior. (a) Beta is conjugate to Bernoulli, so the posterior just adds the data counts to the prior exponents: $\theta^{6}(1-\theta)^{2}\cdot\theta^{1}(1-\theta)^{1}=\theta^{7}(1-\theta)^{3}$, i.e. $\boxed{\text{Beta}(8,4)}$ ($\alpha'=2+6=8,\ \beta'=2+2=4$). (b) The MAP estimate is the mode of Beta$(8,4)$: $\dfrac{\alpha'-1}{\alpha'+\beta'-2}=\dfrac{7}{10}=\boxed{0.7}$. (c) The three numbers — MLE $=0.75$, prior mean $=0.5$, MAP $=0.7$ — show the prior pulled the estimate down from $0.75$ toward $0.5$, landing at $0.7$. That is exactly the analogous effect of an L2 regularizer, which pulls a weight from its data-driven value toward $0$ (here the “toward $0.5$” prior plays the role of “toward $0$”). With far more data the likelihood would dominate and MAP would slide back toward the MLE — a prior is a finite nudge, washed out by enough evidence.

E8. Entropy, cross-entropy, KL (§7) for truth $p=[0.5,0.5]$, model $q=[0.8,0.2]$. (a) Entropy in bits: $H(p)=-0.5\log_2 0.5-0.5\log_2 0.5=-\log_2 0.5=\boxed{1}$ bit — the maximal uncertainty of a fair coin. (b) Cross-entropy in nats: \[ H(p,q)=-\big(0.5\ln 0.8+0.5\ln 0.2\big)=-\big(0.5(-0.223)+0.5(-1.609)\big)=0.1115+0.8045=\boxed{0.916}\ \text{nats}. \] (c) KL divergence in nats: \[ D_{\mathrm{KL}}(p\Vert q)=0.5\ln\frac{0.5}{0.8}+0.5\ln\frac{0.5}{0.2} =0.5(-0.470)+0.5(0.916)=-0.235+0.458=\boxed{0.223}\ \text{nats}. \] (d) Identity check: $H(p,q)-H(p)=0.916-0.693=0.223=D_{\mathrm{KL}}(p\Vert q)$ ✓ (using $H(p)=0.693$ nats, which is the same as the $1$ bit from (a)). KL is the extra surprise from coding with the wrong $q$ instead of the truth $p$.

E9. Paired $t$-test (§8) on $d=[3,5,4,6,2]$, $n=5$. (a) $\bar d=\dfrac{3+5+4+6+2}{5}=\dfrac{20}{5}=\boxed{4}$. (b) Deviations from the mean: $[-1,1,0,2,-2]$; sum of squares $=1+1+0+4+4=10$, so $s_d^2=\dfrac{10}{5-1}=2.5$ and $s_d=\sqrt{2.5}\approx\boxed{1.581}$. (c) Standard error $=\dfrac{s_d}{\sqrt n}=\dfrac{1.581}{\sqrt5}\approx0.707$, so $t=\dfrac{\bar d}{s_d/\sqrt n}=\dfrac{4}{0.707}\approx\boxed{5.657}$ with $df=n-1=4$. Since $5.657>2.776$, we reject $H_0$ — the gain is statistically significant (two-sided $p\approx0.0048$). (d) 95% CI: $\bar d\pm 2.776\,(s_d/\sqrt n)=4\pm 2.776(0.707)=4\pm1.963=\boxed{[2.04,\,5.96]}$, which excludes $0$ ✓ — the same significant verdict, now with the effect size and its uncertainty in one object.

E10. The CLT and the standard error (§8). (a) The standard error is $s_d/\sqrt n$, so going from $n=5$ to $n=20$ (a $4\times$ increase in pairs, $s_d$ unchanged) shrinks it by $\sqrt{20/5}=\sqrt4=\boxed{2}$ — it roughly halves (from $\approx0.707$ to $\approx0.354$). Because $t=\bar d/(s_d/\sqrt n)$ has that shrinking quantity in the denominator, $t$ roughly doubles (here $\approx5.66\to\approx11.3$), so a real gain sits more standard errors above $0$ and is easier to detect — i.e. more seeds give the test more power. (b) The theorem is the Central Limit Theorem: the average of many independent, finite-variance noisy pieces is approximately Gaussian (bell-shaped) around the true mean, with variance $\sigma^2/n$. That $\sigma^2/n$ — equivalently a standard deviation of $\sigma/\sqrt n$ — is precisely where the $\sqrt n$ in the standard error comes from: averaging $n$ pieces cancels noise at a rate of $\sqrt n$, which is why a tighter $\bar d$ (and the bell shape of the $t$-curve itself) follows from simply collecting more pairs.

E11. Wilcoxon by hand (§8.7). (a) $|d|=\{3,1,5,2,4,6\}$ sorted is $1,2,3,4,5,6$, so matched back to the data, $d=+3$ gets rank $3$, $+1$ rank $1$, $+5$ rank $5$, $-2$ rank $2$, $+4$ rank $4$, $+6$ rank $6$. (b) The only negative is $-2$ (rank $2$), so $W^{-}=2$ (and $W^{+}=1+3+4+5+6=19$). (c) $W=\min(19,2)=2$. Because one sign disagrees, $W\neq0$, so we are above the $n=6$ floor: the exact two-sided $p$ is $0.094$ — not significant at $\alpha=0.05$. (Only an all-positive set, $W=0$, reaches $p=0.031$.) Five wins were promising, but six seeds with one loss cannot clear the bar — collect more seeds.

E12. Ship-or-not with multiplicity (§8.7). (a) No — the $95\%$ CI $[-0.001,+0.009]$ includes $0$, so the $+0.004$ gain is not distinguishable from no gain (a single test would sit right at the $p<0.05$ edge). (b) With $m=4$ tests, Bonferroni compares each $p$ to $\alpha/m=0.05/4=0.0125$; since $0.04>0.0125$, it does not survive. (c) Do not ship. Both lenses agree: the effect’s interval straddles $0$, and once you account for having tried four models, a raw $p=0.04$ is exactly the false positive multiplicity manufactures. Re-run the single most promising model on fresh seeds before believing it.

Loss Functions & Regularizers

E1. MAE/MSE/Huber on the residual set $r=\{0.5,-2,1\}$ (§2.1), with $N=3$ and $\delta=1$. The MAE is the mean of $|r|$: $\tfrac13(0.5+2+1)=\tfrac{3.5}{3}=\boxed{1.167}$. The MSE is the mean of $r^2$: $\tfrac13(0.25+4+1)=\tfrac{5.25}{3}=\boxed{1.75}$. For Huber ($\delta=1$), the two small residuals $|r|\le1$ use the quadratic $\tfrac12 r^2$ ($0.5\mapsto0.125$, $1\mapsto0.5$), while the outlier $r=-2$ has $|r|>1$ so it uses the linear arm $\delta|r|-\tfrac12\delta^2=1\cdot2-0.5=1.5$; the mean is $\tfrac13(0.125+1.5+0.5)=\tfrac{2.125}{3}=\boxed{0.708}$. The single point $r=-2$ is what blows MSE up — squaring turns its miss into $4$ (more than the other two combined) — and it is exactly the point Huber treats as linear, capping its influence (which is why mean-Huber $0.708<$ MSE $1.75$).

E2. Binary cross-entropy (§2.2), $\hat p=0.7$. For a positive ($y=1$) only the first term lives: $-[1\cdot\ln 0.7+0]=-\ln 0.7=\boxed{0.357}$ (the chapter’s running number). For a negative ($y=0$) only the second term lives: $-[0+1\cdot\ln(1-0.7)]=-\ln 0.3=\boxed{1.204}$. The negative costs more ($1.204>0.357$): predicting $\hat p=0.7$ is a good guess when the truth is class 1 (small surprise $-\ln 0.7$) but a bad guess when the truth is class 0 (large surprise $-\ln 0.3$). BCE is precisely $-\log$ of the probability the model assigned to the right answer — $\hat p$ for a positive, $1-\hat p$ for a negative — so the worse-matched label is punished more.

E3. One gradient step (§3), then with L2. Unregularized: $\hat y=wx=1\cdot2=2$; loss $(\hat y-y)^2=(2-6)^2=16$; gradient $\tfrac{\partial\ell}{\partial w}=2(\hat y-y)x=2(-4)(2)=-16$; update $w\leftarrow 1-0.1(-16)=\boxed{2.6}$ (matching §3). Now add $\lambda w^2$ with $\lambda=0.5$ (§5.1): the penalty contributes gradient $2\lambda w=2(0.5)(1)=1$, so the total gradient is $-16+1=-15$ and $w\leftarrow 1-0.1(-15)=\boxed{2.5}$. The regularized step lands $0.1$ short of the unregularized $2.6$ — the penalty pulled $w$ back toward $0$ by exactly $0.1$ on this single step. That nudge-toward- zero, repeated every step, is weight decay.

E4. BPR loss and gradient (§9), positive $\hat r_{ui}=2.0$, negative $\hat r_{uj}=0.5$. Gap $d=2.0-0.5=1.5$; sigmoid $\sigma(1.5)=\tfrac{1}{1+e^{-1.5}}=\tfrac{1}{1.2231}=\boxed{0.8176}$; per-triple loss $-\ln\sigma(1.5)=-\ln 0.8176=\boxed{0.2014}$; gradient $-(1-\sigma(d))=-(1-0.8176)=\boxed{-0.1824}$. Swapping the scores (a wrong triple, $\hat r_{ui}=0.5,\hat r_{uj}=2.0$) gives $d=-1.5$, $\sigma(-1.5)=0.1824$, loss $-\ln 0.1824=\boxed{1.7014}$, gradient $-(1-0.1824)=\boxed{-0.8176}$. The loss grows by $\tfrac{1.7014}{0.2014}\approx8.4\times$, but the gradient grows by only $\tfrac{0.8176}{0.1824}\approx4.5\times$. The two factors differ because the loss scales with $-\ln\sigma(d)$ while the gradient scales with $1-\sigma(d)$ — two different functions of $d$, so their ratios at $d=\pm1.5$ need not coincide. (Either way, the more-wrong triple gets both the bigger penalty and the bigger corrective push.)

E5. InfoNCE (§2.4), similarities $0.8$ (positive), $0.4,0.0$ (negatives), $\tau=0.2$. Divide by $\tau$: $\tfrac{0.8}{0.2}=4.0$, $\tfrac{0.4}{0.2}=2.0$, $\tfrac{0.0}{0.2}=0.0$. Exponentiate: $e^{4.0}=54.60$, $e^{2.0}=7.389$, $e^{0}=1$. The positive’s softmax share is $\tfrac{54.60}{54.60+7.389+1}=\tfrac{54.60}{62.99}=0.8668$, so $\ell_{\text{InfoNCE}}=-\ln 0.8668=\boxed{0.143}$ (the chapter’s number). If the positive’s similarity collapses to $0.0$, its scaled score is $0$, its exponential is $1$, and its share falls to $\tfrac{1}{1+e^{2.0}+1}=\tfrac{1}{9.389}=0.1065$, so the loss jumps to $-\ln 0.1065=\boxed{2.24}$. The loss jumps because the positive is no longer the most-similar item — it now ties the crowd, owns only a small fraction of the softmax mass, and the gradient pulls the anchor hard back toward its positive.

E6. L1 (lasso) is the one that drives weights to exactly zero; L2 (ridge) only shrinks them smoothly toward zero but never to zero (§5.1). The reason is geometric (the chapter’s diamond-vs-circle figure): the penalized solution is where the growing loss contour (an ellipse around the unregularized optimum) first touches the penalty’s constraint region. L1’s region is a diamond $|w_1|+|w_2|\le t$, whose corners sit on the axes — and a generic ellipse expanding outward almost always meets a corner first, and at a corner one coordinate is exactly $0$ (e.g. $w_1=0$). L2’s region is a smooth circle $w_1^2+w_2^2\le t$ with no corners, so the contour touches it at a generic off-axis point where both coordinates are nonzero. So the sharp corners of the absolute-value penalty are what manufacture exact zeros (sparsity); the roundness of the squared penalty cannot, giving only shrinkage.

E7. From §6’s MAP table: Gaussian noise on $y$ gives the MSE loss (its $-\log$ is $\propto(\hat y-y)^2$), and a Gaussian prior on $\theta$ gives the L2 regularizer (its $-\log$ is $\propto\|\theta\|_2^2$). The MAP objective maximizes the posterior $p(\theta\mid\text{data})\propto p(\text{data}\mid\theta)\,p(\theta)$ — a product of likelihood and prior. Taking $-\log$ does two things at once: because $\log$ of a product is the sum of the logs, the product splits into $-\log p(\text{data}\mid\theta)-\log p(\theta)$; and because $-\log$ is monotone decreasing, the $\theta$ that maximizes the posterior is the same $\theta$ that minimizes this sum. The first piece is the loss (negative log-likelihood) and the second is the regularizer (negative log-prior) — so “loss $+$ regularizer” is just the $-\log$ of “likelihood $\times$ prior.”

E8. Numerically stable forms (§2.4 “in practice”). For $-\ln\sigma(x)$, use the softplus identity $-\ln\sigma(x)=\operatorname{softplus}(-x)=\ln(1+e^{-x})$ (in PyTorch, F.logsigmoid(x) rather than log(sigmoid(x))); for a softmax/InfoNCE denominator, use log-sum-exp with the max subtracted, $\ln\sum_k e^{z_k}=m+\ln\sum_k e^{z_k-m}$ with $m=\max_k z_k$ (in PyTorch, F.cross_entropy / F.log_softmax). Both identities are exact — e.g. at $x=1.5$, $-\ln\sigma(1.5)=\ln(1+e^{-1.5})=0.2014$; and for the §2.4 scores the log-sum-exp is $4.143$ computed either way. What they prevent: coding log(sigmoid(x)) first squashes $\sigma(x)$, which underflows to $0$ for very negative $x$ (then $\log 0=-\infty$); and a raw $\sum e^{z_k}$ overflows to $\infty$ for large $z_k$ (here $e^{z}$ for a big logit). Subtracting the max forces the largest exponent to $e^0=1$ so nothing overflows, and softplus evaluates $-\ln\sigma$ without ever forming the tiny $\sigma$ — same value, no blow-up.

E9. Sampled-softmax log-$Q$ correction (§10), raw score $s=2$ for both negatives. The corrected contribution to the denominator is $e^{s}/Q=e^{\,s-\ln Q}$. Popular item, $Q=0.5$: $e^{2}/0.5=\tfrac{7.389}{0.5}=\boxed{14.78}$ (equivalently $e^{\,2-\ln 0.5}=e^{2.693}=14.78$). Rare item, $Q=0.05$: $e^{2}/0.05=\tfrac{7.389}{0.05}=\boxed{147.8}$ (i.e. $e^{\,2-\ln 0.05}=e^{4.996}=147.8$). The rare item is up-weighted by $\tfrac{147.8}{14.78}=\boxed{10\times}$, exactly the inverse ratio of the two sampling probabilities ($0.5/0.05=10$). That is why it is unbiased: dividing each sampled term by its draw-probability $Q$ is importance weighting — the rare negative, which stands in for the $\approx10$ similar items that were not drawn, is counted $10\times$ as heavily, so the expected value of the sampled denominator equals the full-catalog softmax denominator. Without the correction, frequently-drawn popular items would appear too often and get over-penalized.

E10. Regularized-MSE for matrix factorization (§8), $\mathbf{p}=[2,1]$, $\mathbf{q}=[1,2]$, $r=5$, $\lambda=0.1$. Score $\hat r=\mathbf{p}^{\top}\mathbf{q}=2\cdot1+1\cdot2=\boxed{4}$. Squared-error term $(r-\hat r)^2=(5-4)^2=\boxed{1}$. L2 term: $\|\mathbf{p}\|^2=2^2+1^2=5$ and $\|\mathbf{q}\|^2=1^2+2^2=5$, so $\lambda(\|\mathbf{p}\|^2+\|\mathbf{q}\|^2)=0.1\cdot(5+5)=\boxed{1.0}$. Total objective $\mathcal{L}=1+1.0=\boxed{2.0}$. For implicit feedback where only the top-$K$ order matters, switch the squared-error loss to BPR (§9) — or sampled-softmax (§10) for more negatives per step — since those optimize order rather than calibrated scores; and keep L2 on the embeddings as the regularizer (the universal default, $\lambda(\|\mathbf{p}\|^2+\|\mathbf{q}\|^2)$, which survives unchanged all the way to LightGCN).

E11. BCE rewards correct confidence (§2.2). (a) At $\hat p=0.5$: $y{=}1\to-\log0.5=0.693$; $y{=}0\to-\log0.5=0.693$ — a coin-flip prediction is equally (un)surprised either way. (b) At $\hat p=0.88$: $y{=}1\to-\log0.88=0.127$ (confident and right — tiny loss); $y{=}0\to-\log0.12=2.13$ (confident and wrong — large loss). (c) Least surprised: $\hat p=0.88,\,y{=}1$ ($0.127$); most: $\hat p=0.88,\,y{=}0$ ($2.13$). Confidence is rewarded only when it is correct.

E12. Weight decay step by step (§5.1). (a) Factor $=1-2(0.05)(1)=0.9$. (b) Multiplying by $0.9$ each step: $w: 2\to1.8\to1.62\to1.458$. (c) No — geometric decay by $0.9$ approaches $0$ but never reaches it; only data pressure or an L1 penalty can pin a weight to exactly $0$.

E13. L1 vs L2 (§5.1). (a) L1 drives weights to exactly zero; its constraint region is a diamond whose sharp corners sit on the axes, so the optimum tends to land on an axis (a zero coordinate). (b) L1 — the “selection” in lasso — zeroes the irrelevant features and keeps a sparse handful. (c) §5.1 calls the L2 per-step shrink weight decay.

Neural Networks & Back-propagation

E1. Single neuron (§1–§2). The pre-activation is the dot product plus bias: $z=\mathbf{w}\cdot\mathbf{x}+b=2\cdot1+(-1)\cdot3+0.5=2-3+0.5=\boxed{-0.5}$. ReLU then clips it: $a=\max(0,-0.5)=\boxed{0}$. ReLU clipped the negative score $-0.5$ down to $0$ — that flat “off for $z<0$” behaviour is exactly the bend §2 calls the nonlinearity. (Had we used sigmoid instead, $\sigma(-0.5)=0.3775$, the same value §2 computes for §1’s score.)

E2. Activations at a point (§2). At $z=1$: $\sigma(1)=\dfrac{1}{1+e^{-1}}=\boxed{0.7311}$, $\tanh(1)=\boxed{0.7616}$, $\mathrm{ReLU}(1)=\boxed{1}$. At $z=-2$: $\sigma(-2)=\dfrac{1}{1+e^{2}}=\boxed{0.1192}$ and $\mathrm{ReLU}(-2)=\boxed{0}$. Tanh is the one that changes sign with $z$ (it lives in $(-1,1)$, so $\tanh(-2)=-0.9640$ is negative); ReLU is the one that goes flat at $0$ for the negative input. Sigmoid never reaches $0$ or changes sign — it only squashes toward $0$ (here $0.1192$).

E3. Why a nonlinearity (§2). Stacking two linear layers gives $W_2(W_1\mathbf{x})=(W_2W_1)\mathbf{x}$, and $W_2W_1$ is itself just one matrix (matrix multiplication is associative, Linear-Algebra primer) — so the two layers collapse to a single linear map, no more expressive than one layer: a single straight decision boundary. XOR (output $1$ iff exactly one input is $1$) puts its two classes on opposite diagonals, $\{(0,0),(1,1)\}$ vs. $\{(0,1),(1,0)\}$, which no single straight line can separate — so no stack of purely linear layers ever can either. The smallest fix is to insert a nonlinear activation $\phi$ between the layers (e.g. a 2-layer net with one hidden layer and a ReLU/sigmoid): the bend lets the net fold space and carve the XOR boundary.

E4. Tiny $2\!\to\!2\!\to\!1$ forward pass (§3), input $\mathbf{x}=[2,1]$. Hidden pre-activations (each row of $W$ dotted with $\mathbf{x}$, plus the bias): \[ W\mathbf{x}+\mathbf{b}=\begin{bmatrix}1\cdot2+1\cdot1\\ -1\cdot2+(-1)\cdot1\end{bmatrix}+\begin{bmatrix}-1\\1\end{bmatrix} =\begin{bmatrix}3\\-3\end{bmatrix}+\begin{bmatrix}-1\\1\end{bmatrix}=\begin{bmatrix}2\\-2\end{bmatrix}. \] Apply ReLU elementwise: $\mathbf{h}=\mathrm{ReLU}([2,-2])=\boxed{[2,\,0]}$ — the second entry, the negative $-2$, is the one ReLU clipped to $0$. Output layer: $\hat y=\mathbf{w}_2\cdot\mathbf{h}+b_2=2\cdot2+1\cdot0+1=\boxed{5}$. Each layer is a matrix–vector product plus an elementwise activation, exactly as §3 states; the clipped $-2$ is the nonlinearity at work (without it the net would be one linear map, E3).

E5. Back-propagation step (§5), reusing the worked neuron ($x=1,w=0.5,b=0,y=0$, so $z=0.5$, $a=\sigma(0.5)=0.6225$, $L=\tfrac12(0.6225)^2=0.1937$). (a) Chain the three local derivatives: $\dfrac{\partial L}{\partial w}=\underbrace{(a-y)}_{0.6225}\cdot\underbrace{\sigma(z)(1-\sigma(z))}_{0.6225\times0.3775=0.2350}\cdot\underbrace{x}_{1}=0.6225\times0.2350\times1=\boxed{0.1463}$ — confirming §5’s value. (b) One gradient-descent step with $\eta=0.5$: $w\leftarrow w-\eta\dfrac{\partial L}{\partial w}=0.5-0.5(0.1463)=0.5-0.0732=\boxed{0.4269}$. The weight moved toward $0$ (down from $0.5$), just as in §5’s $\eta=0.1$ step ($0.4854$) — only farther, because the larger learning rate takes a bigger step in the same downhill direction. Lowering $w$ lowers the score $z$, pushes $a$ toward the target $y=0$, and shrinks the loss.

E6. Softmax + cross-entropy (§4), logits $z=[2,1,0]$, exponentials $7.39,2.72,1.00$ (sum $11.11$), true class $0$. (a) The true-class probability is $\hat p_0=\dfrac{7.39}{11.11}=\boxed{0.665}$, and the cross-entropy loss is $\mathcal L=-\ln\hat p_0=-\ln 0.665=\boxed{0.41}$ — matching §4’s worked value. (b) If the model were maximally unsure, $\hat p_0=\tfrac13$, the loss would be $-\ln\tfrac13=\boxed{1.0986}\approx1.10$. The uncertain case has the larger loss ($1.10>0.41$), which is the right direction: cross-entropy rewards putting probability mass on the correct class ($\hat p_0\to1\Rightarrow\mathcal L\to0$) and penalizes spreading it thin — exactly the pressure that trains a classifier (§4).

E7. One Adam step from rest (§6), reusing $g_1=0.1463$, $m_0=v_0=0$, $\beta_1=0.9$, $\beta_2=0.999$, $\eta=0.1$. \[ m_1=(1-0.9)\,g_1=0.1\times0.1463=\boxed{0.01463},\qquad v_1=(1-0.999)\,g_1^2=0.001\times0.1463^2=\boxed{2.14\times10^{-5}}. \] Bias-correct (divide by $1-\beta^{\,1}$, i.e. by $0.1$ and $0.001$): $\hat m_1=\dfrac{0.01463}{0.1}=\boxed{0.1463}=g_1$ and $\hat v_1=\dfrac{2.14\times10^{-5}}{0.001}=\boxed{0.02140}=g_1^2$. The step is \[ \eta\,\frac{\hat m_1}{\sqrt{\hat v_1}}=0.1\times\frac{0.1463}{\sqrt{0.02140}}=0.1\times\frac{0.1463}{0.1463}=\boxed{0.1}. \] It comes out to exactly $\eta$ because at the first step $\hat m_1=g_1$ and $\sqrt{\hat v_1}=\sqrt{g_1^2}=|g_1|$, so the ratio is $g_1/|g_1|=\operatorname{sign}(g_1)=1$ — the gradient’s magnitude cancels. That is Adam’s signature: it normalizes each parameter’s step by that parameter’s own gradient scale, so the size of the move is set by the learning rate, not by how big the gradient happens to be (§6).

E8. Weight initialization (§6). He / Kaiming init pairs with ReLU layers; Xavier / Glorot init pairs with tanh / sigmoid layers. (Both scale the initial weights so activations neither explode nor vanish layer to layer; He uses a larger variance to compensate for ReLU zeroing out half of its inputs, while Xavier is tuned for the symmetric, saturating tanh/sigmoid curves.) Initializing every weight to zero is a silent failure because then every neuron in a layer computes the identical output and therefore the identical gradient, so they all update in lockstep and never differentiate — the layer can never learn distinct features (§6’s “all-zeros init is silent death”).

E9. Net-specific regularizers (§7). Dropout randomly zeros a fraction of the neurons on each training step, so a different sub-network is trained every time; because any given unit may vanish at any step, no unit can rely on the presence of any particular other unit, which forces redundant, robust features instead of brittle co-adaptations. Early stopping watches the validation loss (the red curve in §7’s overfitting figure) and halts training at its minimum — the point where the validation loss stops improving and begins to rise even as the training loss keeps falling (the onset of memorizing noise). Both trade a little fitting power for better generalization, the bias–variance idea of Losses & Regularizers §4.

E10. Neuron as the recommender’s dot-product score (§1), $\mathbf{p}=[2,1,3]$, $\mathbf{q}=[1,2,1]$. (a) The affinity $\mathbf{p}\cdot\mathbf{q}$ is a single linear neuron $\mathbf{w}\cdot\mathbf{x}+b$ with the item vector as the weights ($\mathbf{w}=\mathbf{q}$), the user vector as the input ($\mathbf{x}=\mathbf{p}$), and no bias ($b=0$). Its value: $\hat r=\mathbf{p}\cdot\mathbf{q}+0=2\cdot1+1\cdot2+3\cdot1=2+2+3=\boxed{7}$. (b) The second item $\mathbf{q}_2=[3,0,1]$ scores $\hat r_2=\mathbf{p}\cdot\mathbf{q}_2=2\cdot3+1\cdot0+3\cdot1=6+0+3=\boxed{9}$, and $9>7$, so it indeed ranks above the first — verified. This bare dot-product head is exactly matrix factorization / LightGCN’s scoring rule, and the embeddings $\mathbf{p},\mathbf{q}$ are the learned lookup rows of §3’s embedding layer — trained by the same back-prop (§5) as any other weight.

E11. Embedding lookup and the sparse gradient (§3). (a) Id $1$ returns the middle row, $E_1=[0,3,1]$ — a row read, no arithmetic. (b) Only row $1$ gets a nonzero gradient: the one-hot $\mathbf e_1$ is zero in every other slot, so $\partial L/\partial E_j=\mathbf 0$ for $j\neq1$. An id’s embedding updates only on examples that use it — the sparse-gradient fact behind cold start.

E12. The same neuron, flipped target (§5). (a) With $y=1$: $\partial L/\partial a=a-y=0.6225-1=-0.3775$, so $\partial L/\partial w=-0.3775\times0.2350\times1=-0.0887$. (b) Step: $w\leftarrow0.5-0.1(-0.0887)=0.509$ — $w$ moves up. That makes sense: the target rose from $0$ to $1$, so we now want a larger output $a$, hence a larger score $z=wx$, hence a larger $w$. (The error flipped sign, so the gradient — and the step — flipped with it: with $y=0$ the weight had moved down.)

E13. Embeddings and cold start (§3). (a) A pure-ID model has no row for an unseen id — there is nothing to look up, and back-prop never trained one, so its score is meaningless (random or undefined). (b) §14 calls such a model transductive (it only represents ids present during training); the fix is to give a new id a vector from its content — an LLM-generated profile embedding (LLM × RecSys), or any side-feature — so it arrives with a usable vector instead of a blank.

Representation Learning & the Transformer

E1. One-hot orthogonality (§1.1). With $\mathbf{e}_2=[0,1,0,0,0]$ and $\mathbf{e}_4=[0,0,0,1,0]$, the dot product multiplies entry-by-entry and sums: every product is $0$ because the single $1$’s sit in different slots, so $\mathbf{e}_2\cdot\mathbf{e}_4=\boxed{0}$. Each one-hot has length $\lVert\mathbf{e}_k\rVert=\sqrt{1}=1$, so the cosine is $\cos=\dfrac{0}{1\cdot1}=\boxed{0}$ — the two vectors are orthogonal. This happens for any pair of distinct one-hots: two different symbols never share the on-slot, so the dot product is always $0$. Under one-hot, then, Titanic and The Notebook (both romances) are exactly as dissimilar as Titanic and a chainsaw documentary — one-hot encodes identity but zero relationship, which is the flaw §1.2’s dense embedding fixes.

E2. Cosine of two small word vectors (§1.1 / Linear Algebra note). Dot product $\mathbf{v}_{\text{coffee}}\cdot\mathbf{v}_{\text{tea}}=4\cdot3+3\cdot4=12+12=24$; lengths $\lVert\mathbf{v}_{\text{coffee}}\rVert=\sqrt{16+9}=5$ and $\lVert\mathbf{v}_{\text{tea}}\rVert=\sqrt{9+16}=5$. So $\cos=\dfrac{24}{5\cdot5}=\dfrac{24}{25}=\boxed{0.96}$ (an angle of about $16.3^\circ$) — very near $+1$, i.e. the two arrows point almost the same way. The distributional hypothesis (§2.1) reads this geometrically: coffee and tea keep similar company (drink, cup, morning), so a predict-based model pushes their vectors close together, and a near-$1$ cosine is exactly the “these two words mean similar things” signal.

E3. Skip-gram + negative sampling (§2.2). Skip-gram slides a fixed-width window over the text; for each center word $w$ it tries to predict every context word $c$ in that window. It scores a (center, context) pair by the dot product $\mathbf{u}_c\cdot\mathbf{v}_w$ of the context vector and the center vector — high when the two point the same way — then turns that score into a probability $P(c\mid w)=\dfrac{\exp(\mathbf{u}_c\cdot\mathbf{v}_w)}{\sum_{c'}\exp(\mathbf{u}_{c'}\cdot\mathbf{v}_w)}$ with a softmax over the vocabulary, and trains by cross-entropy. Negative sampling replaces that softmax denominator: instead of summing $\exp(\cdot)$ over all $V$ words (which costs $O(V)$ per step — far too expensive for a real vocabulary), it draws a handful of random “negative” words and only trains the model to say yes to the true pair and no to those few negatives. It is needed precisely to avoid normalizing over the whole vocabulary — the same “contrast the positive against a few negatives” trick as BPR / sampled-softmax (Losses & Regularizers) and InfoNCE (SSL & Contrastive Learning).

E4. Self-attention, first row (§4.2). The value rows are $V=X$ and $\sqrt{d_k}=\sqrt2\approx1.414$. (a) The $t_1$ query is $[1,0]$; its scaled scores against the three keys are \[ \frac{[1,0]\cdot[1,0]}{\sqrt2}=\frac{1}{\sqrt2}=0.707,\quad \frac{[1,0]\cdot[1,1]}{\sqrt2}=\frac{1}{\sqrt2}=0.707,\quad \frac{[1,0]\cdot[0,1]}{\sqrt2}=\frac{0}{\sqrt2}=0.000, \] i.e. $[0.707,\,0.707,\,0.000]$ — exactly the top row of the chapter’s score table. (b) Softmax: $\exp([0.707,0.707,0])=[2.028,2.028,1.000]$, summing to $5.056$, so the weights are $\boxed{[0.401,\,0.401,\,0.198]}$ (matching the chapter’s $0.401,0.401,0.198$). (c) The output is those weights times the value rows: \[ 0.401[1,0]+0.401[1,1]+0.198[0,1]=[\,0.401+0.401,\ \ 0.401+0.198\,]=\boxed{[0.802,\,0.599]}, \] reproducing the chapter’s first output row. Titanic ($[1,0]$, pure romance) attends equally to itself and the “between” Notebook and barely to the pure-musical La La Land, so its context-aware vector tilts toward romance.

E5. Scaled dot-product (§4.1). (a) The raw self-score is $[1,1]\cdot[1,1]=2$, so the scaled self-score is $\dfrac{2}{\sqrt2}=\sqrt2\approx\boxed{1.414}$ — exactly the highlighted middle entry of §4.2’s score table. (b) Dot products grow with the dimension $d_k$, so at large $d_k$ the raw scores become large in magnitude; a softmax of large-magnitude scores saturates into a near one-hot spike (almost all weight on one token) whose gradient is nearly zero, stalling learning. Dividing by $\sqrt{d_k}$ keeps the scores in a sane range so the softmax stays smooth and trainable — which is why it is called scaled dot-product attention.

E6. Multi-head attention (§4.3). A single head computes one set of similarity weights and one blend, so it can express only one notion of “what should attend to what.” With identity projections (as in §4.2) that notion is just raw vector similarity, and one head can track only a single kind of relation at a time. Running several heads in parallel, each with its own learned $Q/K/V$ projections, lets the model attend along several different relations simultaneously — one head can follow romance, another recency, another genre — and the concatenation (mixed by $W_O$) fuses them. The separate learned projections are also what make each head learnable rather than a fixed similarity match: a token can advertise one thing (its key), search for another (its query), and pass on a third (its value).

E7. BPE on an unseen word (§5.3). (a) Replay the two learned rules in order on s l o w _: \[ \texttt{s l o w \_}\ \xrightarrow{\ \texttt{o+w}\ }\ \texttt{s l ow \_}\ \xrightarrow{\ \texttt{l+ow}\ }\ \texttt{s low \_} \quad\Rightarrow\quad \text{tokens } \boxed{[\,\texttt{s},\ \texttt{low},\ \texttt{\_}\,]}. \] The stem low is the same piece slow shares with low and lowest, so the model reuses what it already learned about that stem — and there is no <UNK>, exactly §5.3’s payoff. (b) In the corpus low ×5, lowest ×2, the pair e,s appears only inside lowest, so its count is $\boxed{2}$. BPE merges greedily by frequency, and the pair o,w occurs in both word-types ($5+2=7$), so o+w (count $7$) is merged before e,s (count $2$).

E8. KV-cache cost (§5.4). (a) Without the cache each step re-encodes the whole prefix, so the total for $L=5$ is $1+2+3+4+5=\boxed{15}$ token-passes ($=\tfrac{L(L+1)}{2}=\tfrac{5\cdot6}{2}=15$, the $O(L^2)$ row). (b) With the cache each step pushes only the one new token, so the total is $1+1+1+1+1=\boxed{5}$ passes ($=L$, the $O(L)$ row). (c) The ratio is $15/5=\boxed{3\times}$ at $L=5$ — larger than the $10/4=2.5\times$ the chapter shows at $L=4$, and growing like $L$: at $L=1{,}000$ it is $\tfrac{1{,}000\cdot1{,}001/2}{1{,}000}=\tfrac{1{,}001}{2}\approx500\times$, exactly the chapter’s figure.

E9. Positional encoding (§4.3) with $d=4$, $\text{pos}=2$. Using $10000^{0}=1$ and $10000^{1/2}=100$: \[ \begin{aligned} PE(2,0)&=\sin(2/1)=\sin 2=\boxed{0.909}, & PE(2,1)&=\cos(2/1)=\cos 2=\boxed{-0.416},\\ PE(2,2)&=\sin(2/100)=\sin 0.02=\boxed{0.020}, & PE(2,3)&=\cos(2/100)=\cos 0.02=\boxed{1.000}, \end{aligned} \] so $PE(\text{pos}=2,\,d=4)\approx[\,0.909,\,-0.416,\,0.020,\,1.000\,]$ — a different fingerprint from $\text{pos}=1$’s $[0.841,0.540,0.010,1.000]$, with the low-frequency last two dimensions barely moving between adjacent positions while the high-frequency first two swing a lot. Self-attention is a set operation — shuffle the tokens and the outputs just shuffle too (§4.3) — so without a per-position signal it could not tell “not good” from “good not.” Adding each position’s own $PE$ vector to its token embedding is what lets the order-blind attention layer see order.

E10. SASRec / item2vec scoring (§4.4 / §2.4). The intent vector is $h=[1,2]$ (musical-leaning), and the score of an item $e$ is the dot product $h\cdot e$. (a) Raw dot products: \[ h\cdot[1,0]=1,\qquad h\cdot[1,1]=1+2=3,\qquad h\cdot[0,1]=2, \] so the dot-product ranking is Notebook $(3)$ $\succ$ La La Land $(2)$ $\succ$ Titanic $(1)$. (b) Cosine needs the norms: $\lVert h\rVert=\sqrt{1+4}=\sqrt5\approx2.236$, and the items have norms $\lVert[1,0]\rVert=1$, $\lVert[1,1]\rVert=\sqrt2\approx1.414$, $\lVert[0,1]\rVert=1$. Then \[ \cos(h,[1,0])=\frac{1}{\sqrt5}\approx\boxed{0.447},\quad \cos(h,[1,1])=\frac{3}{\sqrt5\sqrt2}\approx\boxed{0.949},\quad \cos(h,[0,1])=\frac{2}{\sqrt5}\approx\boxed{0.894}. \] The cosine ranking is Notebook $(0.949)$ $\succ$ La La Land $(0.894)$ $\succ$ Titanic $(0.447)$ — the two rankings agree. But notice the close call between Notebook and La La Land: the raw dot product ranks Notebook clearly ahead ($3$ vs $2$) partly because Notebook is the larger-norm vector ($\sqrt2>1$), not purely because its direction matches the intent better — under cosine, which divides the length out, the gap narrows to $0.949$ vs $0.894$. That length sensitivity is exactly the popularity/length bias §4’s serving discussion (L2-normalize first, turning the dot product into cosine; Linear Algebra note) is designed to remove.

E11. LSTM cell, one step (§3.3). With $c_{t-1}=0.50$ and gates $f_t=0.40$, $i_t=0.90$, $\tilde c_t=0.80$, $o_t=0.50$: (a) the new cell state is $c_t=f_t c_{t-1}+i_t\tilde c_t=0.40(0.50)+0.90(0.80)=0.20+0.72=\boxed{0.92}$. (b) the readout is $h_t=o_t\tanh(c_t)=0.50\tanh(0.92)=0.50(0.726)=\boxed{0.363}$. (c) This step is the LSTM overwriting: a low forget gate ($0.40$) with a high input gate ($0.90$) discards most of the old memory ($0.50$) and lets the new candidate dominate ($c_t=0.92$). §3.3’s worked step was the opposite — $f_t=0.95$, $i_t=0.10$ held the memory essentially fixed ($0.80\to0.80$), the LSTM remembering. The forget gate is the single knob: near $1$ remembers, near $0$ overwrites.

E12. Skip-gram probabilities (§2.2). (a) Dot products with $\mathbf v_w=[1,1]$: $\mathbf u_A\cdot\mathbf v_w=2+1=3$, $\mathbf u_B\cdot\mathbf v_w=0+1=1$, $\mathbf u_C\cdot\mathbf v_w=-1+1=0$. (b) Softmax of $[3,1,0]$: $e^3,e^1,e^0=20.09,2.72,1.00$ sum to $23.81$, so $P(A)=20.09/23.81=\boxed{0.844}$, $P(B)=0.114$, $P(C)=0.042$. (c) Most sure of A ($0.844$) — and $\mathbf u_A=[2,1]$ is the vector best aligned with $\mathbf v_w=[1,1]$ (largest dot product). “Predict your neighbours” rewards alignment.

E13. $Q,K,V$ as learned views (§4.2). (a) $q=tW_Q=[2\cdot2+1\cdot0,\ 0]=[4,0]$; $k=tW_K=[0,\ 2\cdot0+1\cdot2]=[0,2]$; $v=tW_V=[2,1]$. (b) Yes — $[4,0]$, $[0,2]$, $[2,1]$ are three different vectors built from the one token $t=[2,1]$. (c) $q\cdot k=4\cdot0+0\cdot2=0$. Raw similarity would give $t\cdot t=5$ (a token equals itself), but the learned $W_Q,W_K$ make this token’s query and key orthogonal — attention matches on a relationship the model learned (a romance-axis query vs a musical-axis key), not on raw sameness.

E14. The KV-cache payoff (§5.4). (a) Without a cache, $1+2+3+4+5=\boxed{15}$ token-passes. (b) With the cache, one new token per step: $\boxed{5}$ passes. (c) The cache stores each past token’s key and value vectors (they never change once computed), so each step pushes only the new token through; at $L=1000$ that turns $\sim500{,}000$ passes into $1{,}000$ — about a $500\times$ speedup.

Traditional Recommender Systems

E1. Content cosine (§2). The numerator is the dot product — it counts shared genres: $\text{Avatar}\cdot\text{Matrix}=(1)(1)+(1)(1)+(1)(0)=2$ (they share action and sci-fi; only romance differs). The lengths are $\lVert\text{Avatar}\rVert=\sqrt{1+1+1}=\sqrt3$ and $\lVert\text{Matrix}\rVert=\sqrt{1+1+0}=\sqrt2$, so \[ \cos(\text{Avatar},\text{Matrix})=\frac{2}{\sqrt3\,\sqrt2}=\frac{2}{\sqrt6}\approx\boxed{0.82}. \] At $0.82$ it is far closer to §2’s near-aligned $0.58$ (Titanic–Avatar) than to its orthogonal $0$ (Titanic–Matrix) — in fact higher, because Avatar and Matrix overlap on two of three genres (action and sci-fi) where Titanic and Avatar shared only one (romance). An angle of about $35.3^\circ$.

E2. Inverse document frequency (§2), base $10$. The rarer word appears in $\mathrm{df}=1$ of $N=4$ documents, the common word in $\mathrm{df}=2$: \[ \operatorname{idf}(\textit{spaceship})=\log_{10}\!\tfrac{4}{1}=\log_{10}4\approx\boxed{0.60}, \qquad \operatorname{idf}(\textit{love})=\log_{10}\!\tfrac{4}{2}=\log_{10}2\approx\boxed{0.30}. \] At equal raw counts the rarer word carries $\dfrac{0.60}{0.30}=\boxed{2\times}$ the weight — TF-IDF up-weights the distinctive term and damps the ubiquitous one, exactly as §2’s alien vs romance ($2.7\times$) showed with a different $N$.

E3. Item-based $k$-NN (§4): a similarity-weighted average of Cara’s ratings of Titanic’s neighbours. Cara rated Avatar $=5$ and Matrix $=4$, with weights $s_{\text{Ti,Av}}=0.3$ and $s_{\text{Ti,Ma}}=0.6$: \[ \hat r_{\text{Cara,Ti}}=\frac{s_{\text{Ti,Av}}\,r_{\text{Cara,Av}}+s_{\text{Ti,Ma}}\,r_{\text{Cara,Ma}}}{s_{\text{Ti,Av}}+s_{\text{Ti,Ma}}} =\frac{0.3(5)+0.6(4)}{0.3+0.6}=\frac{1.5+2.4}{0.9}=\frac{3.9}{0.9}=\boxed{4.33}. \] Each rating is weighted by how similar that item is to Titanic, and dividing by the weight-sum $0.9$ renormalizes the votes onto the $1$–$5$ scale (same mechanics as §4’s $4.27$). Here Matrix is the closer neighbour ($0.6>0.3$), so its $4$ pulls harder, landing the prediction near $4.3$.

E4. Matrix-factorization dot product (§5): \[ \hat r_{\text{Bob,Avatar}}=\mathbf p_{\text{Bob}}\cdot\mathbf q_{\text{Avatar}} =\underbrace{(1)(1)}_{\text{romance}}+\underbrace{(2)(1)}_{\text{action}}=1+2=\boxed{3}. \] That is higher than the $2.5$ §5 found for Ann–Avatar, and the action factor carries the difference: Bob’s action taste is $2$ (vs Ann’s $0.5$), and Avatar expresses action ($q=1$), so the action term contributes $2$ for Bob against $0.5$ for Ann. The romance terms are equal ($1$ each). The score is “taste meets traits”: a factor fires only when both the user wants it and the item has it.

E5. Confidence weighting (§5), $c_{ui}=1+\alpha r_{ui}$ with $\alpha=40$: \[ c(r{=}1)=1+40(1)=\boxed{41},\qquad c(r{=}3)=1+40(3)=\boxed{121}. \] A never-played (blank) cell has $r=0$, so it enters at the floor confidence $c=1+40(0)=1$ — the lowest possible. Because the loss multiplies each squared error by $c$, the played items (weights $41,121$) are fit much harder than the blanks (weight $1$): the blank is treated as a soft “probably not,” not a hard zero, so the model leans toward reproducing the observed positives and only gently toward the unobserved cells — the §5 cure for implicit data’s positives-only nature.

E6. Factorization machine (§8). The three pairwise interactions depend only on the factor vectors, which are unchanged from §8 — so all three values are identical to §8’s $5.9$ computation: \[ \langle\text{Ann},\text{Avatar}\rangle=(2)(1)+(0.5)(1)=\boxed{2.5},\quad \langle\text{Ann},\text{action}\rangle=(2)(0.5)+(0.5)(-1)=\boxed{0.5},\quad \langle\text{Avatar},\text{action}\rangle=(1)(0.5)+(1)(-1)=\boxed{-0.5}. \] The only thing that moved is the action solo weight $w_{\text{action}}$, from $-0.1$ to $+0.1$ — a swing of $+0.2$ in the linear part. Reassembling: \[ \hat y=\underbrace{3.0}_{w_0}+\underbrace{(0.2+0.3+0.1)}_{\text{linear}\,=\,0.6}+\underbrace{(2.5+0.5-0.5)}_{\text{interactions}\,=\,2.5}=3.0+0.6+2.5=\boxed{6.1}. \] So $\hat y$ rose from §8’s $5.9$ to $6.1$, exactly the $+0.2$ the solo-weight change injected — the factorized interaction part ($2.5$) is untouched because it never depends on the linear weights.

E7. Implicit feedback records only what a user did (a click, a play $\to 1$); a non-click is ambiguous — the user may dislike the item, but far more often they simply never saw it (the catalog has millions of items). So an unobserved cell is unknown, not a confirmed $0$, and there are no true negatives to fit. BPR (§5) sidesteps this by never asserting any blank is a $0$: instead of fitting absolute targets, it only requires that an observed (positive) item be ranked above a randomly sampled unobserved one — a relative, pairwise comparison that stays valid even when the “negative” is really just unseen.

E8. Both routes replace the missing interaction history with information the new item already carries. Content-based filtering (§2) scores the item from its features (genre vector / TF-IDF of its description) against the user’s profile, so it can recommend a film the moment it has attributes — no ratings needed (this is the no-item-cold-start strength noted in §2 and §7). A factorization machine (§8) does the analogous thing inside one model: even with no trustworthy id factor for the brand-new item, its side features (genre, release year, …) already have well-trained factor vectors $\mathbf v_j$ learned from every other event those features appeared in, so the FM still produces a sensible score through the feature interactions. In both, features carry the signal where the id has none — the §7 cure for cold start.

E9. User-based $k$-NN with mean-centring (§4). First each user’s own mean over what they rated: $\bar r_{\text{Cara}}=\tfrac{5+4}{2}=4.5$, $\bar r_{\text{Ann}}=\tfrac{5+1}{2}=3$, $\bar r_{\text{Bob}}=\tfrac{4+5}{2}=4.5$. Their Titanic deviations are $r_{\text{Ann,Ti}}-\bar r_{\text{Ann}}=5-3=+2$ and $r_{\text{Bob,Ti}}-\bar r_{\text{Bob}}=4-4.5=-0.5$. With $s_{\text{Cara,Ann}}=0.2$, $s_{\text{Cara,Bob}}=0.8$: \[ \hat r_{\text{Cara,Ti}}=\bar r_{\text{Cara}}+\frac{s_{\text{Cara,Ann}}(+2)+s_{\text{Cara,Bob}}(-0.5)}{s_{\text{Cara,Ann}}+s_{\text{Cara,Bob}}} =4.5+\frac{0.2(2)+0.8(-0.5)}{0.2+0.8}=4.5+\frac{0.4-0.4}{1}=4.5+0=\boxed{4.5}. \] The answer lands exactly on Cara’s own mean because the weighted deviations cancel: Ann sat $+2$ above her baseline but is weighted only $0.2$ ($0.4$), while Bob sat $0.5$ below his and is weighted $0.8$ ($-0.4$) — equal and opposite. The neighbours’ “above-average” and “below-average” votes net to zero, so the model predicts Cara is simply average for her on Titanic. (Centring is what makes Ann’s stingy scale and Bob’s generous one comparable in the first place — §4’s Pearson move.)

E10. Scoring by dot product (§5), with $\mathbf p=[2,1,3]$. (a) $\hat r_1=\mathbf p\cdot\mathbf q_1=(2)(4)+(1)(0)+(3)(0)=8$ and $\hat r_2=\mathbf p\cdot\mathbf q_2=(2)(0)+(1)(0)+(3)(2)=6$. Since $8>6$, dot product ranks $\mathbf q_1$ first ($\boxed{\mathbf q_1\succ\mathbf q_2}$). (b) Cosine needs the norms: $\lVert\mathbf p\rVert=\sqrt{4+1+9}=\sqrt{14}\approx3.742$, $\lVert\mathbf q_1\rVert=\sqrt{16}=4$, $\lVert\mathbf q_2\rVert=\sqrt{4}=2$. Then \[ \cos(\mathbf p,\mathbf q_1)=\frac{8}{\sqrt{14}\cdot4}=\frac{2}{\sqrt{14}}\approx\boxed{0.535}, \qquad \cos(\mathbf p,\mathbf q_2)=\frac{6}{\sqrt{14}\cdot2}=\frac{3}{\sqrt{14}}\approx\boxed{0.802}. \] Now $\mathbf q_2$ wins ($0.802>0.535$), so the two rankings disagree ($\boxed{\text{no}}$). Raw dot product favoured $\mathbf q_1$ largely because of its bigger norm ($\lVert\mathbf q_1\rVert=4$ vs $\lVert\mathbf q_2\rVert=2$): $\mathbf q_1$ is twice as long, so its raw dot product is inflated by length, not by better direction agreement. Cosine divides that length out and reveals $\mathbf q_2$ is the better directional match — exactly the popularity/length bias that §5’s “In practice” box flags and that L2-normalizing-before-scoring (the Linear-Algebra primer’s serving trick) removes.

E11. FM with a genre feature (§8). (a) Pairwise interactions: $\langle\text{Bob},\text{Titanic}\rangle=(1)(2)+(2)(0)=2$; $\langle\text{Bob},\text{romance}\rangle=(1)(1)+(2)(1)=3$; $\langle\text{Titanic},\text{romance}\rangle=(2)(1)+(0)(1)=2$. (b) Bias $w_0=3$; linear $0.1+0.2+0.1=0.4$; interactions $2+3+2=7$; total $\hat y=3+0.4+7=\boxed{10.4}$.

E12. Cold start with metadata (§5/§7). (a) MF has no learned factor vector for an item with zero ratings — there is nothing to take a dot product with, so it cannot score the new movie at all. (b) Content-based filtering (§2) can: it scores from the item’s metadata (a genre/cast vector against the user’s taste profile), needing no interaction history. (c) Once ratings accrue, move to a hybrid (§7) — content for the cold part, collaborative (MF/LightGCN) for the warm part — leaning more on the collaborative signal as it grows.

E13. Jaccard for binary data (§4). (a) $A\cap B=\{3,4\}$ (size $2$), $A\cup B=\{1,2,3,4,5\}$ (size $5$), so Jaccard $=2/5=\boxed{0.4}$. (b) Binary watched/not data has no rating magnitudes for cosine to weigh; Jaccard works directly on set overlap and ignores the huge number of jointly-unwatched items (shared zeros) that would otherwise make almost every pair look similar.

Evaluation Metrics

E1. Rating error (§2). The errors are $[3{-}4,\,5{-}5,\,2{-}4,\,4{-}3]=[-1,0,-2,1]$. \[ \text{MAE}=\frac{|{-}1|+|0|+|{-}2|+|1|}{4}=\frac{4}{4}=\boxed{1.0}, \qquad \text{RMSE}=\sqrt{\frac{1+0+4+1}{4}}=\sqrt{1.5}\approx\boxed{1.225}. \] RMSE ($1.225$) is the larger, because squaring punishes big misses harder, and the single error of $-2$ on the third movie (which becomes $4$ inside the sum, two-thirds of the squared total) is what opens the gap above MAE — exactly the “RMSE vs MAE” lesson of §2.

E2. Top-$K$ set metrics (§4). With $3$ hits among the $K=5$ shown and $|\mathcal{R}|=4$ relevant items in all: \[ \text{Precision@5}=\frac{3}{5}=\boxed{0.6},\qquad \text{Recall@5}=\frac{3}{4}=\boxed{0.75},\qquad \text{F1@5}=\frac{2(0.6)(0.75)}{0.6+0.75}=\frac{0.9}{1.35}\approx\boxed{0.667}. \] Precision = of the 5 slots I spent, $60\%$ were good (“don’t waste the user’s attention”); Recall = I found $75\%$ of everything relevant (“don’t miss the good stuff” — item $G$ is the one missed); F1 = the harmonic mean balancing the two into a single $0.667$.

E3. Reciprocal rank / MRR (§5). For the running list the first relevant item ($B$) is at rank 2, so its reciprocal rank is \[ \text{RR}=\frac{1}{\text{rank of first hit}}=\frac{1}{2}=\boxed{0.5}. \] A second user whose first relevant item sits at rank 1 has $\text{RR}=\tfrac11=1$. Averaging the two reciprocal ranks gives \[ \text{MRR}=\frac{0.5+1}{2}=\boxed{0.75}. \] MRR cares only about how high the first hit is — everything after it is ignored.

E4. Average Precision (§6). The hits are at ranks 2, 3, 5, so the precision at each hit position is \[ P@2=\frac{1}{2}=0.5,\qquad P@3=\frac{2}{3}\approx0.667,\qquad P@5=\frac{3}{5}=0.6. \] Averaging these over the $|\mathcal{R}|=4$ relevant items (the convention that charges you for the missed $G$): \[ \text{AP}=\frac{P@2+P@3+P@5}{4}=\frac{0.5+0.667+0.6}{4}=\frac{1.767}{4}\approx\boxed{0.442}. \] Dividing by $4$ (not by the $3$ hits) is what makes the un-found $G$ cost you — exactly the convention note in §6.

E5. NDCG@5 (§7), rebuilt from scratch. The list’s gains down the ranks are $[0,1,1,0,1]$, so only the hit positions 2, 3, 5 contribute to the discounted sum (with $\log_2 3\approx1.585$, $\log_2 6\approx2.585$): \[ \text{DCG@5}=\frac{1}{\log_2 3}+\frac{1}{\log_2 4}+\frac{1}{\log_2 6} =0.631+0.500+0.387\approx1.518. \] The ideal ordering puts all four relevant items in the top four slots (gains $1,1,1,1$ at ranks 1–4), giving (with $\log_2 5\approx2.322$) \[ \text{IDCG@5}=\frac{1}{\log_2 2}+\frac{1}{\log_2 3}+\frac{1}{\log_2 4}+\frac{1}{\log_2 5} =1+0.631+0.500+0.431\approx2.562. \] \[ \text{NDCG@5}=\frac{\text{DCG@5}}{\text{IDCG@5}}=\frac{1.518}{2.562}\approx\boxed{0.593}, \] matching the chapter exactly. The normalization by IDCG is what pins the score into $[0,1]$ and makes it comparable across users with different numbers of relevant items.

E6. AUC as a pairwise probability (§8). With relevant scores $\{0.8,0.5\}$ and irrelevant scores $\{0.6,0.3\}$ there are $2\times2=4$ (relevant, irrelevant) pairs: \[ 0.8\!>\!0.6\ ✓,\qquad 0.8\!>\!0.3\ ✓,\qquad 0.5\!<\!0.6\ ✗,\qquad 0.5\!>\!0.3\ ✓. \] The model orders 3 of the 4 pairs correctly, so \[ \text{AUC}=\frac{\#\text{correctly-ordered pairs}}{\#\text{pairs}}=\frac{3}{4}=\boxed{0.75}. \] Since $0.75>0.5$ it beats random guessing — the one mistake ($0.5$ scored below the irrelevant $0.6$) is the only pair pulling it below a perfect $1.0$.

E7. IPS debiasing weights (§12.4). (a) Each observed click is weighted by the inverse of its propensity: \[ \text{click on }A\;\to\;\frac{1}{p_A}=\frac{1}{0.8}=\boxed{1.25},\qquad \text{click on }B\;\to\;\frac{1}{p_B}=\frac{1}{0.2}=\boxed{5}. \] (b) With two clicks on $A$ and one on $B$, the IPS-weighted totals are \[ A:\ 2\times1.25=\boxed{2.5},\qquad B:\ 1\times5=\boxed{5}. \] So the less-clicked $B$ outweighs $A$ ($5>2.5$): $B$ was shown only a fifth as often ($p_B=0.2$ vs $p_A=0.8$), so each hard-won $B$-click is up-weighted to count as much as $1/0.2=5$ ordinary observations, undoing the exposure bias the raw click counts carry. (Caveat from §12.4: a tiny propensity like $0.05$ would give a weight of $20$, which is why extreme weights are clipped.)

E8. Beyond-accuracy (§10). (a) Gini for counts $[6,3,1,0,0]$ ($n=5$, total $10$, mean $\mu=2$). The sum of all pairwise absolute gaps is $\sum_i\sum_j|x_i-x_j|=60$, so \[ G=\frac{\sum_i\sum_j|x_i-x_j|}{2n^2\mu}=\frac{60}{2\cdot25\cdot2}=\frac{60}{100}=\boxed{0.6}. \] At $0.6$ this catalog is less concentrated than the chapter’s $[8,2,0,0,0]$ (which gives $0.72$): its recommendations are spread a little more evenly, with the head item taking $6$ of $10$ rather than $8$. (b) Of the 3 relevant hits, only 2 are non-obvious finds, so \[ \text{serendipity@5}=\frac{2}{5}=\boxed{0.4}. \] It sits below the Precision@5 of $0.6$ (E2) because precision credited all $3$ hits, whereas serendipity discards the one obvious blockbuster a popularity baseline would have surfaced anyway — it rewards only the surprising good picks.

E9. Choosing a metric (§13). (a) MRR — when only the first suggestion matters (“one good answer, fast”), the reciprocal rank of the first hit is exactly the right score, and it ignores everything after it. (b) Recall@20 — the goal is “find all the relevant items” somewhere in the top-20, which is precisely the fraction of the relevant set recovered; precision would penalize the empty slots you do not care about here. (c) Gini coefficient (or coverage) — a beyond-accuracy metric is the only kind that detects the recommender collapsing onto a few blockbusters; a high Gini flags the “rich get richer” concentration that Recall/NDCG would happily hide.

E10. Honest evaluation (§11). (a) A $0.004$ NDCG@10 edge is not yet evidence because (i) it may be within run-to-run noise — §11d demands the model be run over several random seeds with mean $\pm$ std and a paired $t$-test ($p<0.05$) showing the gap exceeds the seed wobble; and (ii) the comparison must be on a sound protocol — §11b’s full-catalog (not sampled) ranking and a leak-free split. Without both, the headline number is a single lucky draw, not a result. (b) Ranking each held-out item against only $100$ sampled negatives is a sampled metric, and Krichene & Rendle (KDD 2020) showed that sampling can reverse which model looks better versus full-catalog ranking — @K over $100$ items measures something genuinely different from @K over the whole catalog, which is why §11 insists on full-ranking metrics (or at least a disclosed sampling protocol).

E11. Looks better, but ship? (§11d). (a) $\bar d=(0.012-0.006+0.011-0.004+0.002)/5=0.015/5=+0.003$ — positive, matching the headline. (b) The $95\%$ CI $[-0.007,+0.013]$ includes $0$ (equivalently the paired $t=0.81$, $p=0.47$). (c) Do not ship. A positive mean is not enough: because the per-seed gains swing in sign, the gain is statistically indistinguishable from $0$ — the $+0.003$ is within run-to-run noise, so on this evidence B is not better than A.

Sequential & Session-Based Recommendation

E1. New transition $\text{M}\!\to\!\text{T}$ (§2). With the fifth session added, M now appears once as a “from,” and the only item that follows it is T, so $P(\text{T}\mid\text{M})=\tfrac{1}{1}=\boxed{1.0}$. What was special before: in the original four sessions M was only ever a terminal item (it ended every session it appeared in), so it had no outgoing transitions at all — its “from-M” row was empty and a first-order chain had nothing to predict after Matrix.

E2. After a session ending in Titanic, the chain compares $P(\text{A}\mid\text{T})=0.667$ against $P(\text{M}\mid\text{T})=0.333$ and recommends the larger — Avatar ($0.667$). It cannot use an earlier Avatar view because a first-order Markov chain is memoryless beyond one step: its entire input is the last item, so any item before the last is invisible to it. (That blindness is exactly what GRU4Rec’s hidden state and SASRec’s attention fix.)

E3. Reversed session $(\text{Avatar},\text{Titanic})$, so the query is Titanic $=[1,0]$ (§4). Raw scores against the two keys, scaled by $\sqrt2$: against Avatar $[0,1]$, $\tfrac{[1,0]\cdot[0,1]}{1.414}=0$; against Titanic $[1,0]$, $\tfrac{[1,0]\cdot[1,0]}{1.414}=0.707$. Softmax of $[0,0.707]$ gives weights $\alpha_{\text{Av}}=0.330,\ \alpha_{\text{Ti}}=0.670$. Context $=0.330[0,1]+0.670[1,0]=[0.670,0.330]$. Candidate scores: \[ \text{Titanic}=0.670,\qquad \text{Avatar}=0.330,\qquad \text{Matrix}=[0.670,0.330]\cdot[1,1]=\boxed{1.000}. \] Matrix is still recommended ($1.000$ — it shares a feature with both), but the ranking of the other two has flipped: now Titanic ($0.670$) beats Avatar ($0.330$), the reverse of §4. Why: self-attention weights the most recent position most, and Titanic is now last — recency, the whole point of modelling order.

E4. With $e^{0}=1$ and $e^{0.707}\approx2.028$: \[ \alpha_{\text{Av}}=\frac{e^{0.707}}{e^{0}+e^{0.707}}=\frac{2.028}{1+2.028}=\frac{2.028}{3.028}\approx\boxed{0.670}, \] and the other weight is $1-0.670=0.330$ — matching §4.

E5. Score Inception $[2,0]$ against $\mathbf c=[0.330,0.670]$: $\mathbf c\cdot[2,0]=0.330(2)+0.670(0)=\boxed{0.660}$. Ranking: Matrix $1.000 >$ Avatar $0.670 >$ Inception $0.660 >$ Titanic $0.330$ — Inception slips in just below Avatar. Its score is large mostly because of its norm ($\lVert[2,0]\rVert=2$, twice the others): raw dot product rewards long vectors regardless of direction. That is the same length / popularity bias flagged in Traditional Recommender Systems §5’s “In practice” box (and exercise 10 there) — the reason serving often scores by cosine (normalized) rather than the bare dot product.

E6. The hidden state $\mathbf h_t$ is meant to be a fixed-length summary of the entire session up to step $t$ — everything seen so far, compressed into one vector the model reads to predict the next item. Its structural weakness versus self-attention: it must funnel the whole past through that single vector and update it strictly left-to-right, so long-range signals fade (the early items get diluted) and the step-by-step recurrence cannot be parallelized — which is exactly why SASRec lets each position attend directly to every earlier one.

E7. The two training tasks differ. SASRec is trained to predict the next item from the past only — so at training time it must not see positions $\ge t$, or it would “cheat” by reading the very item it is predicting; the causal mask enforces that. BERT4Rec is trained on a different task — reconstruct a masked item from its surroundings — where the answer has been hidden by the mask, so it is safe (and helpful) to read both sides; nothing is leaked because the target position is blanked out. Causal masking is about not seeing the future you must predict; cloze masking removes the target instead, freeing both-side attention.

E8. SASRec adds positional embeddings — a learned vector per position, added to each item’s embedding — so the model can tell “1st click” from “3rd click.” A sequence model needs this because raw self-attention is order-blind (it computes a weighted average, which is unchanged if you permute the inputs); a bag-of-items model does not, because it never claims to use order in the first place — for it, “Titanic then Avatar” and “Avatar then Titanic” are meant to be the same.

E9. A random split can place a user’s later interactions in the training set and an earlier one in the test set — so the model trains on the future to predict the past. For sequence data, where the next item genuinely depends on what came before, this leaks the future and inflates NDCG. The fix (§6) is the leave-one-last (temporal) split: hold out each user’s chronologically last interaction for test, the second-last for validation, and train only on what came before — grading the model the way it will actually be used.

E10. FPMC score $=\underbrace{\mathbf p_u\cdot\mathbf q_i}_{\text{MF / user taste}}+\underbrace{\mathbf q'_l\cdot\mathbf q''_i}_{\text{Markov / last item}}$. (a) A brand-new, anonymous session has no user history, so $\mathbf p_u$ is uninformative and the Markov term dominates (the model leans on the strong “what follows the last item” signal). (b) For a well-known user whose last click was an accidental mis-tap, the MF term dominates (their learned taste $\mathbf p_u$ is reliable) while the Markov term is misled by the bad last item — so we want MF to outweigh it. The model that keeps only the Markov term (no $\mathbf p_u$ at all) is the plain first-order Markov chain of §2.

From Graphs to LightGCN

E1. From the edge list ($A\!-\!T$; $B\!-\!T,B\!-\!V$; $C\!-\!V,C\!-\!M$) the interaction matrix (rows $A,B,C$; columns $T,V,M$) is \[ R=\begin{bmatrix}1&0&0\\1&1&0\\0&1&1\end{bmatrix}. \] Stacking it into $A=\left[\begin{smallmatrix}0&R\\R^{\top}&0\end{smallmatrix}\right]$ with node order $A,B,C,T,V,M$ gives the symmetric $6\times6$ grid \[ A=\begin{bmatrix} 0&0&0&1&0&0\\ 0&0&0&1&1&0\\ 0&0&0&0&1&1\\ 1&1&0&0&0&0\\ 0&1&1&0&0&0\\ 0&0&1&0&0&0 \end{bmatrix} \quad(\text{top-left and bottom-right blocks are }0\text{ — bipartite, §5}). \] There are $\boxed{10}$ ones. The number is even because the graph is undirected: every one of the $5$ edges is written twice — once in the top-right block $R$ (user→movie) and once in the mirror block $R^{\top}$ (movie→user) — so $\#\text{ones}=2\times\#\text{edges}=2\cdot5=10$. Reading degrees as the row sums: $\deg(T)=$ row-$T$ sum $=1+1+0=\boxed{2}$ (watched by $A,B$) and $\deg(C)=0+0+0+0+1+1=\boxed{2}$ (watched $V,M$). ✓ (Matches §8’s degree table.)

E2. For the line $X\!-\!Y\!-\!Z$ the adjacency and degree matrices are \[ A=\begin{bmatrix}0&1&0\\1&0&1\\0&1&0\end{bmatrix},\qquad D=\begin{bmatrix}1&0&0\\0&2&0\\0&0&1\end{bmatrix},\qquad D^{-1/2}=\begin{bmatrix}1&0&0\\0&\tfrac1{\sqrt2}&0\\0&0&1\end{bmatrix}. \] Symmetric normalization scales entry $(i,j)$ by $\tfrac1{\sqrt{\deg i}}\cdot\tfrac1{\sqrt{\deg j}}$: \[ \tilde A=D^{-1/2}AD^{-1/2}= \begin{bmatrix}0&\tfrac1{\sqrt{1}\sqrt{2}}&0\\[2pt]\tfrac1{\sqrt{2}\sqrt{1}}&0&\tfrac1{\sqrt{2}\sqrt{1}}\\[2pt]0&\tfrac1{\sqrt{1}\sqrt{2}}&0\end{bmatrix} =\begin{bmatrix}0&0.707&0\\0.707&0&0.707\\0&0.707&0\end{bmatrix}. \] The two non-zero entries (each appearing twice, since $\tilde A$ is symmetric) are both $\tfrac1{\sqrt2}\approx\boxed{0.707}$; the diagonal is $0$ (no self-loops). They are equal even though $\deg X=1\neq\deg Y=2$ because each edge’s weight uses the degrees of both its endpoints: edge $X\!-\!Y$ gets $\tfrac{1}{\sqrt{1}\sqrt{2}}$ and edge $Y\!-\!Z$ gets $\tfrac{1}{\sqrt{2}\sqrt{1}}$ — the same product $\tfrac1{\sqrt2}$. The square root split symmetrically across the two endpoints (§7) is exactly what makes the weight depend on the pair, not on either node alone.

E3. One propagation step is “new embedding $=$ weighted sum of neighbours’ old embeddings,” weight $w=\tfrac1{\sqrt{\deg(\text{me})}\sqrt{\deg(\text{nbr})}}$.

Movie $T$ — neighbours $\{A,B\}$, with $w_{T\to A}=\tfrac1{\sqrt2\sqrt1}=0.707$ and $w_{T\to B}=\tfrac1{\sqrt2\sqrt2}=0.5$: \[ e_T^{(1)}=0.707\,e_A^{(0)}+0.5\,e_B^{(0)}=0.707\,[1,0]+0.5\,[0,1]=\boxed{[0.707,\;0.5]}. \] User $C$ — neighbours $\{V,M\}$, with $w_{C\to V}=\tfrac1{\sqrt2\sqrt2}=0.5$ and $w_{C\to M}=\tfrac1{\sqrt2\sqrt1}=0.707$: \[ e_C^{(1)}=0.5\,e_V^{(0)}+0.707\,e_M^{(0)}=0.5\,[0,2]+0.707\,[1,1]=[0+0.707,\;1+0.707]=\boxed{[0.707,\;1.707]}. \] Both match §9’s table. ✓ (Note $C$ uses its neighbours’ vectors $e_V^{(0)},e_M^{(0)}$, never its own $e_C^{(0)}=[1,1]$ — no self-loop.)

E4. Substitute $e_T^{(1)}=0.707\,e_A^{(0)}+0.5\,e_B^{(0)}$ into $e_A^{(2)}=0.707\,e_T^{(1)}$: \[ e_A^{(2)}=0.707\big(0.707\,e_A^{(0)}+0.5\,e_B^{(0)}\big) =\underbrace{0.707^2}_{c_A}\,e_A^{(0)}+\underbrace{0.707\cdot0.5}_{c_B}\,e_B^{(0)}, \] so $c_A=0.707^2\approx\boxed{0.5}$ and $c_B=0.707\cdot0.5\approx\boxed{0.354}$. The term $c_B\,e_B^{(0)}$ travels the 2-hop path $A\to T\to B$ (the only way from $A$ to $B$ in this graph): $A$’s own neighbour $T$ was itself built partly from $B$, so $B$ “leaks into” $A$ even though $A$ and $B$ share no direct edge. Numerically, \[ e_A^{(2)}=0.5\,[1,0]+0.354\,[0,1]=\boxed{[0.5,\;0.354]}, \] matching §10’s table and its “B leaked in!” annotation. ✓

E5. Cosine similarity is $\cos(A,B)=\dfrac{e_A\cdot e_B}{\lVert e_A\rVert\,\lVert e_B\rVert}$ (§6, §12).

Layer 0: $e_A^{(0)}\cdot e_B^{(0)}=[1,0]\cdot[0,1]=0$, so $\cos(A,B)=\dfrac{0}{1\cdot1}=\boxed{0}$ — orthogonal, i.e. a $90^\circ$ angle, totally unrelated.

Layer 2: $e_A^{(2)}\cdot e_B^{(2)}=[0.5,0.354]\cdot[0.604,0.75]=0.5\cdot0.604+0.354\cdot0.75 =0.302+0.2655=0.5675$; lengths $\lVert e_A^{(2)}\rVert=\sqrt{0.5^2+0.354^2}=\sqrt{0.3754}=0.6127$ and $\lVert e_B^{(2)}\rVert=\sqrt{0.604^2+0.75^2}=\sqrt{0.9273}=0.9630$. So \[ \cos(A,B)=\frac{0.5675}{0.6127\cdot0.9630}=\frac{0.5675}{0.5900}\approx\boxed{0.962}. \] The angle closed from $\cos^{-1}0=90^\circ$ to $\cos^{-1}0.962\approx15.8^\circ$ — a swing of about $\boxed{74^\circ}$ in just two hops. Two users who started perpendicular are now nearly aligned; that is over-smoothing beginning (§12). Push the rule further and every pairwise cosine keeps creeping toward $1$ (by layer 6 all three users sit within $\approx6^\circ$), at which point the model can no longer tell the users apart — the reason to stop at $2$–$4$ layers and combine them.

E6. Every edge in a bipartite graph crosses sides (user↔︎movie), so each hop flips a token from the user side to the movie side or back. Starting at user $A$: after $1$ hop you are on the movie side, after $2$ hops back on the user side, after $3$ on the movie side, and so on. To arrive at another user you must end on the user side, which happens only after an even number of hops — hence relating two same-type nodes needs $2,4,6,\dots$ hops. A single step $e\leftarrow\tilde A e$ moves information exactly one hop (§7), landing it on the movie side, so it can give user $A$ a non-zero overlap only with movies ($\tilde A$’s user→user block is the all-zero top-left block of E1), never with user $B$ directly — $A$ first “sees” $B$ at layer $2$ (E4). This parity is exactly why §12 reads even layers only when tracking the over-smoothing cosines: on this bipartite graph odd and even layers alternate sides, and the even-layer (user-to-user) trend is the clean one.

E7. The path is $A\xrightarrow{\text{(1) watched}}T\xrightarrow{\text{(2) also watched by}}B \xrightarrow{\text{(3) watched}}V$: (1) $A\to T$ — $A$ watched Titanic; (2) $T\to B$ — Titanic was also watched by $B$, so this edge exists because $A$ and $B$ share a movie; (3) $B\to V$ — $B$ watched Avatar. The path is 3 edges long, and each LightGCN layer propagates information exactly one hop (§7, §11), so $V$’s embedding cannot reach $A$ until the 3rd layer — which is precisely when $A$’s predicted affinity $e_A\cdot e_V$ first becomes non-zero (it climbs $0\to0.707\to0.905\to1.027$ across layers $0$–$3$ in §11). As for $C$: the shortest path is $A\to T\to B\to V\to C$, distance $\boxed{4}$, so user $A$ first “sees” user $C$ only at layer 4 — consistent with the even-hop rule of E6 ($A,C$ are same-type, so the distance is even) and with $C$ being “further” from $A$ than $B$ is.

E8. Layer combination averages the per-layer embeddings. With $\alpha_k=\tfrac13$ for $k=0,1,2$: \[ e_A=\tfrac13\big(e_A^{(0)}+e_A^{(1)}+e_A^{(2)}\big) =\tfrac13\big([1,0]+[1.414,0]+[0.5,0.354]\big) =\tfrac13\,[2.914,\;0.354]=\boxed{[0.971,\;0.118]}. \] The two roles this weighted sum plays (both named in §12): it acts as a built-in residual / skip connection — the $k{=}0$ term keeps each node’s own original embedding in the mix, which is why LightGCN needs no self-loops — and it mixes the sharp, identity-preserving early layers with the long-range later layers, so the model keeps multi-hop signal without letting any single deep layer wash everything out, defusing over-smoothing. (Here the combined $A$ stays closer to its own $x$-axis identity, $[0.971,0.118]$, than the pure layer-2 vector $[0.5,0.354]$ already drifting toward the crowd.)

E9. With the symmetric weight $w=\tfrac1{\sqrt{\deg(\text{me})}\sqrt{\deg(\text{nbr})}}$ and $\deg U=2$: \[ w_{U\to P}=\frac{1}{\sqrt2\,\sqrt{100}}=\frac{1}{\sqrt2\cdot10}\approx\boxed{0.0707}, \qquad w_{U\to Q}=\frac{1}{\sqrt2\,\sqrt{1}}=\frac{1}{\sqrt2}\approx\boxed{0.707}. \] Their ratio (the user’s own $\tfrac1{\sqrt2}$ cancels): \[ \frac{w_{U\to Q}}{w_{U\to P}}=\frac{1/(\sqrt2\sqrt1)}{1/(\sqrt2\sqrt{100})} =\frac{\sqrt{100}}{\sqrt{1}}=\sqrt{\frac{\deg P}{\deg Q}}=\sqrt{100}=\boxed{10}. \] So the niche film $Q$ pulls $U$’s embedding $10\times$ harder than the blockbuster $P$. This is the right behaviour because a blockbuster watched by everyone is a weak signal of any one person’s taste, so the $1/\sqrt{\deg}$ normalization rightly discounts it (§13, Effect 1) — without it, high-degree items would swamp every embedding and collapse the space toward the popular region.

E10. A vanilla GCN layer applies $E^{(k+1)}=\sigma\!\big(\tilde D^{-1/2}\tilde A\tilde D^{-1/2} E^{(k)}W^{(k)}\big)$ — a learnable feature-transform $W$ and a ReLU $\sigma$ on top of the same neighbour aggregation (§14). The catch is the graph class (§4): the recommendation graph is featureless — a node carries only an ID, so its row of $E^{(0)}$ is a free, fully learnable vector with no fixed meaning. $W$ would be “transforming” those free embeddings themselves. But multiplying a vector you are already free to learn by another learnable matrix just re-parameterizes it: it adds no new capacity (you could have learned the product directly), only extra parameters to optimize, a harder non-convex landscape, and more room to overfit — and $\sigma$ further distorts the clean linear “average your neighbours” signal. On the GCN paper’s citation graph $W$ is useful because there each node’s input is a real, fixed bag-of-words feature, and $W$ genuinely reshapes those given features into a better space. No such given features exist here, so the exact ablation LightGCN ran (He et al., 2020) found $W$ and $\sigma$ hurt recommendation — which is why LightGCN removes both and keeps only normalized aggregation. “More parameters” is not “more power” when the parameters merely rename a vector the model already learns end-to-end. So the colleague’s upgrade is expected to lower accuracy and make training harder, not help.

E11. Layer combination (§12). $e_B=\tfrac13\big([0,1]+[1,1]+[0.604,0.75]\big)=\tfrac13[1.604,\,2.75]=\boxed{[0.535,\,0.917]}$. The $k{=}0$ term $[0,1]$ is $B$’s own un-propagated embedding, so keeping it in the sum preserves $B$’s individual identity; using only the last layer instead would let repeated propagation wash that identity out — the over-smoothing §12 warns about.

E12. Which model to reach for (§15). (a) On sparse, long-tail data reach for SimGCL / LightGCL — the self-supervised variants whose uniformity term spreads the starved tail out instead of letting it collapse (exactly the regime where SSL earns its extra training cost). (b) Run a well-tuned plain MF/BPR (and a training-free filter such as GF-CF) first: §15 notes they often land within about a point of LightGCN, so any graph model must clear that floor before its complexity is justified.

SSL & Contrastive Learning

E1. InfoNCE at a milder temperature (§3.2). With $\tau=0.4$ the three similarities $0.8,\,0.6,\,-0.6$ give $\operatorname{sim}/\tau=2.0,\,1.5,\,-1.5$, so $\exp(\operatorname{sim}/\tau)=e^{2.0}=7.389,\ e^{1.5}=4.482,\ e^{-1.5}=0.223$. The denominator is $7.389+4.482+0.223=\boxed{12.094}$, the positive’s softmax share is $7.389/12.094=0.611$, and \[\mathcal{L}_{\text{cl},i}=-\log(0.611)=\boxed{0.49}.\] This is larger than the chapter’s $0.31$ at $\tau=0.2$. A milder (bigger) $\tau$ flattens the softmax — it divides every similarity by more, so the gap between the positive ($0.8$) and the close negative ($0.6$) shrinks from $4.0-3.0=1.0$ down to $2.0-1.5=0.5$ in pre-exponent terms. The positive therefore keeps a smaller share ($0.611$ vs $0.731$), so the loss rises. (Symmetrically, a smaller $\tau$ sharpens the contrast and lowers the loss when the positive is already the closest — see E3 for the danger when it is not.)

E2. Representation collapse (§3.3). If the encoder sends every node to one vector, then every similarity in the InfoNCE row equals $1$ — the positive and all negatives are indistinguishable — so each of the $N$ terms $\exp(1/\tau)$ is identical and the positive’s softmax share is exactly $\dfrac{\exp(1/\tau)}{N\,\exp(1/\tau)}=\dfrac{1}{N}$. The loss is then \[\mathcal{L}_{\text{cl},i}=-\log\!\frac1N=\boxed{\log N},\] its worst (largest) possible value, independent of $\tau$. For $N=4$ that is $\log 4=\boxed{1.386}$ — well below the chapter’s $\log 256\approx5.55$ for a 256-node batch (more nodes ⇒ a higher collapse ceiling). The term that punishes collapse is uniformity (the denominator): a collapsed space has zero spread, so it pays the maximum penalty and the gradient pushes the nodes apart. (Alignment, the numerator, is actually happy under collapse — the positive pair is identical — which is exactly why you need the uniformity half too.)

E3. A too-small $\tau$ is brittle (§3.3). With a positive at $0.6$ and a hard negative at $0.7$ (a two-term softmax), the node’s loss is $-\log\!\dfrac{e^{0.6/\tau}}{e^{0.6/\tau}+e^{0.7/\tau}}$. At the in-between $\tau=0.1$: $e^{6.0}=403.4$ and $e^{7.0}=1096.6$, so the loss is $-\log\!\dfrac{403.4}{403.4+1096.6}=-\log(0.269)=\boxed{1.31}$. This sits between the chapter’s endpoints $\approx0.97$ at $\tau=0.2$ and $2.13$ at $\tau=0.05$ — as $\tau$ shrinks the loss climbs. That is what “$\tau$ is a sensitivity dial” means: shrinking $\tau$ sharpens the softmax onto the single closest pair, so when a hard negative is closer than the positive, that one wrong pair comes to dominate the entire gradient and training gets jumpy. Smaller is not always better.

E4. Two views (§3.1). A view of a node is one augmented version of its embedding — the encoder’s output after the input has been perturbed in some way. GCL needs two per node because its whole signal is relational: the loss says “these two should agree” and “these two should not,” so it can only compare a node’s view against another view, never against an absolute target. Two views of the same node form a positive pair (pull together); a view of node $i$ paired with a view of a different node $j$ is a negative pair (push apart). Two concrete ways the chapter builds the second view: SimGCL adds small uniform random noise to the embeddings (§4), and LightGCL propagates on a truncated-SVD reconstruction of the adjacency (§5) — a principled, global view rather than a random one. (SGL’s edge/node dropping and XSimGCL’s cross-layer noise are two more.)

E5. Alignment numerically (§3.3). Alignment raises the positive pair’s similarity. Starting from §3.2 (positive $0.8$, negatives $0.6$ and $-0.6$, $\tau=0.2$, loss $0.31$), push the positive to a perfect $1.0$ with the negatives unchanged. The exponentials become $e^{1.0/0.2}=e^{5}=148.41$ for the positive and (unchanged) $e^{3}=20.09$ and $e^{-3}=0.05$ for the negatives, so the denominator is $148.41+20.09+0.05=168.55$. The positive’s share rises to $148.41/168.55=0.881$ (from $0.731$), and \[\mathcal{L}_{\text{cl},i}=-\log(0.881)=\boxed{0.13},\] down from $0.31$ — better alignment shrinks the loss, and pushing the positive’s similarity all the way to $1$ would drive its share toward $1$ and the loss toward $0$. (The complementary move is uniformity: pushing the close negative from $0.6$ down to $0.0$, positive fixed at $0.8$, would instead raise the positive’s share to $0.981$ and drop the loss to $\approx0.02$ — either half of the objective lowers the loss.)

E6. Where $\mathcal{L}_{\text{cl}}$ runs (§3.4). It does not run on the raw lookup table $E^{(0)}$. It runs on the layer-combined final embeddings $e_u=\sum_k\alpha_k e_u^{(k)}$ that LightGCN already produces — exactly the same vectors BPR scores with. The two views are two forward passes through one shared encoder (same weights): one pass on, say, a noised/edge-dropped graph, one on the SVD graph (or a second noise draw). One training step is \[ E_1=\text{propagate(view 1)},\quad E_2=\text{propagate(view 2)},\quad \mathcal{L}=\mathcal{L}_{\text{BPR}}(E_{\text{main}})+\lambda\,\mathcal{L}_{\text{cl}}(E_1,E_2), \] then loss.backward(). Because both views come from the same embedding table, the contrastive gradient flows back into the very table BPR is training — which is how “shaping the geometry” and “ranking clicks” become one update. Contrasting $E^{(0)}$ instead would regularize a different object than the one BPR ranks, defeating the purpose.

E7. LightGCL’s factored cost (§5). The dense layer would cost $\text{users}\cdot\text{items}\cdot d=1000\cdot1000\cdot10=\boxed{1.0\times10^{7}}$ multiply-adds; the factored layer $U_q(\Sigma_q(V_q^\top E))$ costs $(\text{users}+\text{items})\,q\,d=(1000+1000)\cdot5\cdot10=\boxed{1.0\times10^{5}}$. The ratio is $\dfrac{10^{7}}{10^{5}}=\boxed{100\times}$. The symbolic shortcut checks out: with $\text{users}=\text{items}=n$ the ratio is $\dfrac{n^{2}d}{2nqd}=\dfrac{n}{2q}=\dfrac{1000}{2\cdot5}=100$ — independent of $d$. (The chapter’s $n=10^{5}$, $q=5$ setting gives $n/(2q)=10^{5}/10=10{,}000\times$, matching its $3.2\times10^{7}$ vs $3.2\times10^{11}$.) This is why the SVD view is lightweight: $q$ is tiny ($=5$) and the SVD is computed once.

E8. Placing methods on the contrastive/generative split (§6.4). (a) LightGCL — contrastive: two views (SVD vs. main), InfoNCE pulls the positive pair together and pushes negatives apart; needs negatives; loss “organizes the space” (uniformity). (b) GraphMAE — generative: masks node features and reconstructs them; no negatives; loss makes the code “carry enough information to rebuild the signal.” (c) Mult-VAE — generative: reconstructs the user’s whole click vector; no negatives; reconstruction loss, same “rebuild the signal” goal. (d) SimGCL — contrastive: two noised views, InfoNCE; needs negatives; organizes the space. The “needs negatives?” column is the quickest diagnostic — yes ⇒ contrastive, no ⇒ generative (or predictive). Failure modes each family must guard against: contrastive is sensitive to $\tau$, $\lambda$, and the choice of negatives/augmentation (and can collapse, E2); generative can learn a trivial identity map if the masking/corruption is too weak (the mirror-image of collapse).

E9. RLMRec, the LLM as second view (§7). The artificially-augmented view of §3–§5 (noise in SimGCL, the SVD graph in LightGCL) is replaced by the LLM’s semantic embedding of a text profile RLMRec writes for each user/item — a “second view” that carries real-world semantics instead of noise or pure structure. The two variants: RLMRec-Con is contrastive — it treats (user’s GNN embedding, that user’s LLM embedding) as a positive pair, other users as negatives, and applies an InfoNCE alignment loss that pulls the collaborative representation toward the semantic one (needs negatives). RLMRec-Gen is generative — it makes the GNN embedding reconstruct the LLM semantic embedding via a reconstruction objective (no negatives). LightGCN is an especially good backbone because its defining property is pure ID embeddings with no semantic features (backbone note §4/§14) — that is precisely the gap the LLM’s semantics fill; a backbone that already consumed rich features would gain far less from the LLM signal.

E10. Tuning against collapse (§3.4). The symptom — every embedding drifting to one point with $\mathcal{L}_{\text{cl}}$ creeping toward its ceiling $\log N$ (here $\log 4\approx1.39$ for $N=4$) and Recall falling — is representation collapse, the failure that the uniformity half of InfoNCE exists to fight (E2). To restore spread: - Move $\tau$ down (toward the lower end of the $\tau\approx0.1$–$0.2$ range). A smaller temperature sharpens the contrast, making the denominator push harder on the closest negatives — i.e. it strengthens uniformity and pries the collapsed points apart. (Go too far and you hit the E3 brittleness, so stay within the range.) - Move $\lambda$ up (toward the upper end of $\lambda\approx0.05$–$1.0$). $\lambda$ is the weight on $\mathcal{L}_{\text{cl}}$ in $\mathcal{L}=\mathcal{L}_{\text{BPR}}+\lambda\,\mathcal{L}_{\text{cl}}$; raising it gives the spreading (uniformity) term more say in the update, directly countering collapse.

Crucially, take those settings from a noise/edge-drop method’s defaults — not from LightGCL, whose paper deliberately uses a much larger $\tau\in\{0.3,0.5,1,3,10\}$ and a much smaller $\lambda_1\in\{10^{-7},10^{-6},10^{-5}\}$. Porting LightGCL’s $\tau$/$\lambda$ onto a SimGCL/SGL-style model would push $\tau$ the wrong way (larger, flattening the contrast) and collapse the SSL weight to near zero — making the collapse worse, not better. Defaults are method-specific.

E11. DirectAU alignment (§3.3). $z'-z''=(1-0.6,\,0-0.8)=(0.4,-0.8)$, so $\lVert z'-z''\rVert^2=0.4^2+(-0.8)^2=0.16+0.64=\boxed{0.80}$. Not near $0$, so the pair is only moderately aligned — minimizing this term pulls the two views together (the same job InfoNCE’s numerator does, but computed directly, with no negatives).

E12. Should you add SSL? (§4). (a) Contrastive SSL clearly helps on sparse data with a heavy long tail / many cold items (its uniformity term spreads the starved tail out); it barely moves the metric on dense data with a strong collaborative signal or a feature-rich model (little collapsed geometry to fix). (b) Try SimGCL first — embedding noise, no graph surgery, nearly free ($\varepsilon\approx0.1$, small $\lambda$) — most of the gain for least of the cost.

The Spectral / Graph-Filter View

E1. Build the item graph from $R=\begin{bmatrix}1&1&0\\0&1&1\end{bmatrix}$ (§2 recipe). The co-occurrence Gram matrix is \[ G=R^{\top}R=\begin{bmatrix}1&0\\1&1\\0&1\end{bmatrix}\begin{bmatrix}1&1&0\\0&1&1\end{bmatrix} =\begin{bmatrix}1&1&0\\1&2&1\\0&1&1\end{bmatrix}. \] Entry $G_{22}=2$ is item ②’s popularity (both users touched it); $G_{12}=G_{23}=1$ are the two co-watch counts; $G_{13}=0$ since ① and ③ were never seen together. Zero the diagonal to get the pure adjacency $A=\begin{bmatrix}0&1&0\\1&0&1\\0&1&0\end{bmatrix}$, whose row-sums are the degrees $(1,2,1)$. Symmetric-normalizing with $D^{-1/2}=\mathrm{diag}(1,\tfrac{1}{\sqrt2},1)$ scales each edge $A_{ij}$ by $\tfrac{1}{\sqrt{d_i}}\tfrac{1}{\sqrt{d_j}}$: the ①–② and ②–③ edges become $\tfrac{1}{\sqrt1}\tfrac{1}{\sqrt2}=0.707$, giving \[ \tilde A=D^{-1/2}AD^{-1/2}=\begin{bmatrix}0&0.707&0\\0.707&0&0.707\\0&0.707&0\end{bmatrix}, \] exactly the path-graph $\tilde A$ of §3.1. ✓ The interaction data is where that matrix comes from.

E2. Apply $\tilde A$ to $v_{+1}=[0.5,\,0.707,\,0.5]$ row by row (using $0.707\times0.707=0.5$): \[ \tilde A v_{+1}=\begin{bmatrix}0(0.5)+0.707(0.707)+0(0.5)\\ 0.707(0.5)+0(0.707)+0.707(0.5)\\ 0(0.5)+0.707(0.707)+0(0.5)\end{bmatrix} =\begin{bmatrix}0.5\\ 0.707\\ 0.5\end{bmatrix}=1\cdot v_{+1}, \] so $v_{+1}$ is an eigenvector with $\boxed{\lambda=+1}$ — middle row $0.707(0.5)+0.707(0.5)=0.707$, the two ends $0.707\times0.707=0.5$. ✓ This is the lowest frequency because $\tilde A$ replaces each node’s value with a (normalized) average of its neighbours’, and $v_{+1}$ is left unchanged by that averaging: neighbouring nodes already hold matching values (all the same sign, varying as little as the degrees allow), so it is the maximally smooth pattern — exactly what $\lambda$ at the top of $[-1,1]$ labels.

E3. The averaging map $\tilde A$ pulls every node toward its neighbours’ values. A pattern survives that averaging unchanged ($\tilde A v=\lambda v$ with $\lambda$ near $+1$) only if adjacent nodes already agree — i.e. the pattern barely varies edge-to-edge, which is the definition of smooth / low frequency. (At the other end, $\lambda\approx-1$ requires the sign to flip across every edge, the jaggedest possible pattern.) A real user’s clicks are sparse and noisy, so they spread across all frequencies; but genuine taste is smooth over a co-watch graph (“if you like an item you tend to like its neighbours”). Keeping only the $\lambda\approx+1$ component therefore throws away the high-frequency idiosyncratic noise and keeps the broad, shareable (often popular) preference — that is the “denoising” a low-pass does.

E4. GFT round-trip on $r=[1,0,0]$, no filter. (a) The three coordinates are dot products with the eigenvectors: \[ c_{+1}=v_{+1}\cdot r=0.5,\qquad c_{0}=v_{0}\cdot r=0.707,\qquad c_{-1}=v_{-1}\cdot r=0.5. \] (So a single click is not purely low-frequency — its largest piece, $0.707$, is the mid mode.) (b) Rebuild $\sum_k c_k v_k$: \[ 0.5\,[0.5,0.707,0.5]+0.707\,[0.707,0,-0.707]+0.5\,[0.5,-0.707,0.5] =[0.25,0.354,0.25]+[0.5,0,-0.5]+[0.25,-0.354,0.25]=\boxed{[1,0,0]}=r. \ \checkmark \] Transform out, transform back, change nothing — the identity every filter is wrapped inside.

E5. Frequency-domain low-pass with $h(\lambda)=\tfrac{\lambda+1}{2}$, so $h(+1)=1,\ h(0)=0.5,\ h(-1)=0$. Scale the Exercise-4 coordinates and rebuild: \[ h(\lambda_k)\,c_k=\big(1\cdot0.5,\ 0.5\cdot0.707,\ 0\cdot0.5\big)=(0.5,\ 0.354,\ 0), \] \[ \hat r=0.5\,[0.5,0.707,0.5]+0.354\,[0.707,0,-0.707]+0 =[0.25,0.354,0.25]+[0.25,0,-0.25]=\boxed{[0.5,\,0.354,\,0]}. \] Item ③ scores $\boxed{0}$ here. (Cross-check: this filter is the matrix $H=\tfrac{\tilde A+I}{2}$, and $H r=[0.5,0.354,0]$ too. ✓) Compared with §3.1’s $h=\lambda+\lambda^2$ — which zeroed the mid mode ($h(0)=0$) and so put $0.5$ on item ③ from the two-hop chain — this linear filter only halves the mid frequency ($h(0)=0.5$), so it keeps more of the raw click’s local structure (the hub ② still lights up via the surviving mid component) but does not yet reach the 2-hop neighbour ③. Lesson: how you shape $h$ in the mid band decides how far a click propagates.

E6. Over-smoothing as geometric decay of $\lambda^{k}$. (a) The top mode has $\lambda=+1$, and $1^{k}=1$ for every $k$ — the smooth collaborative signal is kept at full weight no matter how many layers. (b) For a generic second mode at $\lambda=0.8$: \[ 0.8^{1}=0.8,\quad 0.8^{4}=0.4096,\quad 0.8^{8}\approx0.1678,\quad 0.8^{16}\approx0.0281. \] (c) Each extra layer raises the filter degree, so every non-top mode is multiplied down geometrically; by $\approx16$ hops only the $\lambda\approx+1$ direction has appreciable weight and all embeddings collapse onto it — that is over-smoothing, and it says “too many layers = too aggressive a low-pass.” The $\lambda=-1$ mode of this path graph is the exception: $|-1|^{k}=1$, so it never shrinks (its sign just flips each hop). That happens only because the bare path is bipartite; add self-loops or any odd cycle — as real co-occurrence graphs have — and the bottom eigenvalue moves inside $(-1,0]$, so it decays like every other non-top mode.

E7. Let $P=v v^{\top}$ with $v=v_{+1}$, a unit vector ($\lVert v\rVert^2=0.5^2+0.707^2+0.5^2=0.25+0.5+0.25=1$). Then for any $w$, $Pw=v(v^{\top}w)=(v\cdot w)\,v$ — $P$ outputs a multiple of $v$, the multiple being $w$’s overlap with $v$. From that one identity: (i) $Pv=(v\cdot v)v=1\cdot v=v$ — it fixes $v$; (ii) if $w\perp v$ then $v\cdot w=0$, so $Pw=0$ — it kills everything orthogonal to $v$ (e.g. $v_{-1}$, since the eigenvectors are orthogonal); (iii) $P^2 w=P(Pw)=P((v\cdot w)v)=(v\cdot w)(Pv)=(v\cdot w)v=Pw$, so $P^2=P$ (idempotent — projecting twice is the same as once); (iv) $\operatorname{trace}(vv^{\top})=\sum_i v_i^2=\lVert v\rVert^2=1$. The single job of $P$: project any signal onto the line of $v$ — extract its $v$-component and discard the rest. (Numerically $P=\begin{bmatrix}0.25&0.354&0.25\\0.354&0.5&0.354\\0.25&0.354&0.25\end{bmatrix}$, and $P r=[0.25,0.354,0.25]=c_{+1}v_{+1}$ — exactly the low-frequency part of the click.) These projectors are the building blocks of $H=\sum_k h(\lambda_k)\,v_k v_k^{\top}$.

E8. A degree-$K$ polynomial filter is $K$ sparse matrix–vector products, cost $\approx K\cdot\mathrm{nnz}(\tilde A)$, and it never forms a dense matrix. An ideal/spectral filter that uses the $v_k v_k^{\top}$ explicitly needs a full eigendecomposition: $O(n^{3})$ time, $O(n^{2})$ memory. At $n=40{,}000$: \[ n^{3}=40{,}000^{3}=6.4\times10^{13}\ \text{flops},\qquad n^{2}\times 8\ \text{bytes}=1.6\times10^{9}\times8=1.28\times10^{10}\ \text{bytes}\approx\boxed{12.8\ \text{GB}} \] for the dense matrix — both infeasible on a single machine. The rule the box draws: “spectral is fast” only if you use a low-order polynomial in $\tilde A$, or a truncated top-$q$ decomposition on a sparse $\tilde A$ — never a full dense eigendecomposition. The cheaper middle option is the truncated top-$q$ SVD / eigendecomposition (Lanczos or randomized), costing $\approx\mathrm{nnz}\cdot q$ and returning exactly the few smooth components a low-pass keeps.

E9. Spectral decomposition $\tilde A=\sum_k\lambda_k v_k v_k^{\top}$, evaluated at entry $(1,2)$. Entry $(1,2)$ of an outer product $v_k v_k^{\top}$ is $(v_k)_1(v_k)_2$, so \[ \tilde A_{12}=\underbrace{(+1)}_{\lambda_{+1}}(0.5)(0.707)+\underbrace{0}_{\lambda_0}(0.707)(0)+\underbrace{(-1)}_{\lambda_{-1}}(0.5)(-0.707). \] The first term is $0.5\times0.707=0.3536$; the mid term is $0$; the third is $-(0.5)(-0.707)=+0.3536$. Summing: $0.3536+0+0.3536=0.7071=\boxed{0.707}=\tilde A_{12}$. ✓ The $0\cdot v_0 v_0^{\top}$ term contributes nothing — consistent with $\lambda=0$ being a frequency $\tilde A$ maps to zero (it “passes through to nothing”): the mid pattern $v_0$ is in $\tilde A$’s null space, $\tilde A v_0=0$, so it carries no weight when the matrix is rebuilt from its frequencies. The matrix is its frequencies: $(+1)v_{+1}v_{+1}^{\top}+(-1)v_{-1}v_{-1}^{\top}$ already reproduces all of $\tilde A$.

E10. Connecting the filter to LightGCN propagation, on $r=[1,0,0]$. (a) The 3-layer mean combination $\hat r=\tfrac13\big(r+\tilde A r+\tilde A^{2}r\big)$ with $\tilde A r=[0,0.707,0]$ and $\tilde A^{2}r=[0.5,0,0.5]$ gives \[ \hat r=\tfrac13\big([1,0,0]+[0,0.707,0]+[0.5,0,0.5]\big)=\tfrac13[1.5,\,0.707,\,0.5]=\boxed{[0.5,\,0.236,\,0.167]}. \] (b) In the frequency domain this is the polynomial response $h(\lambda)=\tfrac13(1+\lambda+\lambda^2)$, so \[ h(+1)=\tfrac13(1+1+1)=1,\quad h(0)=\tfrac13(1+0+0)=\tfrac13,\quad h(-1)=\tfrac13(1-1+1)=\tfrac13. \] Reweighting the Exercise-4 coordinates $c=(0.5,0.707,0.5)$ by these and rebuilding, $\sum_k h(\lambda_k)c_k v_k = 0.5\,v_{+1}+\tfrac13(0.707)\,v_0+\tfrac13(0.5)\,v_{-1}=[0.5,0.236,0.167]$ — identical to the spatial average in (a). So “average your first three propagation hops” is a low-pass filter: it keeps the smooth $\lambda=+1$ mode at full strength ($h=1$) and damps the mid and high modes to a third — the spatial face (LightGCN layers) and the spectral face (a filter response $h(\lambda)$) of the one operation. Picking the number of layers and the layer weights $\alpha_k$ is choosing $h(\lambda)$.

E11. Low-pass filter on eigenvalues (§3). $h(\lambda)=\lambda+\lambda^2$: $h(+1)=1+1=\boxed{2}$, $h(0)=0+0=\boxed{0}$, $h(-1)=-1+1=\boxed{0}$. (a) The smooth mode $\lambda=+1$ is amplified most (by $2$). (b) Both $\lambda=0$ and $\lambda=-1$ are zeroed, so only the low-frequency (smooth, $\lambda$ near $+1$) component survives — keeping low frequencies and discarding high ones is exactly what low-pass means; on a graph the smooth mode is the shared collaborative signal.

E12. Why a training-free baseline (§4). (a) GF-CF has nothing to learn: it applies a fixed, closed-form filter (a normalized-adjacency smoother plus an ideal low-pass from a top-$q$ SVD) straight to the interaction matrix — no embeddings, no epochs, no gradient descent. (b) Because it is nearly free yet often competitive with or better than a trained GCN (here about $70\%$ over LightGCN on Amazon-Book), it is the floor any trained model must clear — report it first so a GCN’s extra training cost is justified by a real margin over the closed-form baseline, not over a weak one.

LLM × RecSys: the Landscape

E1. Classify by what job the LLM does (§2), with the tell in each: (a) Recommender (§2.1) — the LLM is the ranker: it reads the history-as-prompt and directly returns the pick. Tell: the model outputs the recommendation itself, from a candidate list. (b) Enhancer (§2.3 / the RLMRec line of §3) — the LLM runs offline to manufacture a semantic profile that is aligned to the CF embedding; the live request is a plain dot product. Tell: “offline profile + alignment, serve = dot product, LLM never called per query.” (c) Generator (§2.2) — recommendation as generation: the model autoregressively emits the item’s Semantic ID token-by-token, and there is no candidate list. Tell: an item identifier is generated, no retrieval set. (d) Agent (§2.4) — it plans, holds memory across turns, and calls tools (retriever, DB) over a multi-turn conversation. Tell: multi-turn + tool calls + an open-ended, multi-constraint request. (e) Enhancer (§2.3, the knowledge-augmentation / KAR flavour) — the LLM emits factual knowledge offline that is encoded and routed through an MoE adapter into a backbone that does the serving. Tell: LLM features feed a classical model that stays on the serving path. (The split inside (b)/(e): RLMRec aligns a profile embedding, KAR injects knowledge features — both are enhancers because the classical model keeps serving.)

E2. The single axis (§1’s “one-line map”, restated by the §2 table and Fig. 1) is how much LLM sits on the serving path — equivalently, power/cost on one end vs. keeping the cheap collaborative backbone on the other. Ordered cheapest→most LLM at serve time: \[\textbf{enhancer} \;<\; \textbf{generator} \;\lesssim\; \textbf{recommender} \;<\; \textbf{agent}.\] The enhancer puts nothing LLM-shaped on the live path (the LLM ran offline); the generator runs one compact sequence model per query; the recommender runs a large forward pass per query (§4.1’s cost table groups these two as the single top-cost tier — both put a large model on the live path; the $\lesssim$ is only the finer §2.1-vs-§2.2 model-size nuance, which §4.1 itself calls worth far less than the per-query yes/no); the agent runs several LLM steps per request, so its cost compounds. The one fact that dominates a recommender’s serving cost is therefore whether an LLM runs per query at all (§4.1) — that yes/no is worth far more than any model-size tweak.

E3. Order-of-magnitude reasoning on §4.1’s anchors (pure CF $\$0.006/1\text{k}$, general-LLM reranker $\$25/1\text{k}$, specialist cross-encoder $\$2/1\text{k}$): (a) Factor $= 25 / 0.006 \approx \boxed{4{,}200\times}$, i.e. $\log_{10}(4200)\approx 3.6$ — between three and four orders of magnitude (consistent with the chapter’s “spans roughly three to four orders of magnitude,” reaching four once you include the cheaper ANN-only floor and the costlier generator row). (b) At $1{,}000{,}000$ queries/day $= 1000$ thousand-query units: \[\text{pure CF} = 1000\times\$0.006 = \boxed{\$6/\text{day}}, \qquad \text{general LLM} = 1000\times\$25 = \boxed{\$25{,}000/\text{day}}.\] A $6 vs. $25k/day gap is the cliff §4.1 calls “the big step.” (c) Specialist cross-encoder vs. general LLM: $25/2 \approx \boxed{10\times \text{ cheaper}}$ (matches the chapter’s “another $\sim\!10\times$”). Cross-encoder vs. pure CF: $2/0.006 \approx \boxed{300\times \text{ more expensive}}$ ($\approx 2.5$ OOM — the “two-plus orders of magnitude” of adding any per-query model). The lesson: the first per-query model is the expensive jump; swapping a cheap reranker for a general LLM is “only” another order of magnitude on top.

E4. Residual quantization, §2.2’s recipe (snap → subtract → snap the residual): (a) Distances of $(0.7,0.4)$ to the level-1 codebook: \[d_{\textsf{A}}=\lVert(0.7,0.4)-(1,0)\rVert=\sqrt{(-0.3)^2+0.4^2}=\sqrt{0.25}=0.5,\quad d_{\textsf{B}}=\sqrt{0.7^2+0.6^2}\approx0.922,\] $d_{\textsf{C}}\approx1.746,\ d_{\textsf{D}}\approx1.565$. The nearest is $\boxed{\textsf{A}=(1,0)}$, and the residual is $(0.7,0.4)-(1,0)=\boxed{(-0.3,\,0.4)}$. (b) Distances of the residual $(-0.3,0.4)$ to the level-2 codebook: $d_{\textsf{p}}=\lVert(-0.3,0.4)-(-0.3,0.4)\rVert=\boxed{0}$ (exact hit), $d_{\textsf{q}}=0.6$, $d_{\textsf{r}}=0.8$, $d_{\textsf{s}}=1.0$. So the second code is $\textsf{p}$ and the Semantic ID is $\boxed{(\textsf{A},\,\textsf{p})}$. (Reconstruction $\textsf{A}+\textsf{p}=(1,0)+(-0.3,0.4)=(0.7,0.4)$ recovers the embedding exactly — here the two-level codebook represents it with zero residual.) (c) The second item $(0.6,0.5)$: $d_{\textsf{A}}=\sqrt{(-0.4)^2+0.5^2}=\sqrt{0.41}\approx0.640$ vs. $d_{\textsf{B}}=\sqrt{0.6^2+0.5^2}\approx0.781$ (and $\textsf{C},\textsf{D}$ farther), so it also snaps to $\textsf{A}$ first — same level-1 prefix $\textsf{A}$. Because similar items share codeword prefixes, a sequence model that has learned the “$\textsf{A}\dots$” continuations for one item can generalize those statistics to the other, even one it barely saw — exactly the long-tail/cold-start win §2.2 credits to Semantic IDs.

E5. In $\mathcal{L}_{\text{align}}$ the two views of node $i$ are its collaborative embedding $e_i$ (from the LightGCN backbone) and its semantic embedding $s_i$ (a frozen text encoder applied to the LLM-written profile of $i$). The numerator $\exp(\mathrm{sim}(e_i,s_i)/\tau)$ rewards a node’s own two views for agreeing (“this item’s collaborative identity should match what the LLM says it is”); the denominator sums over other nodes $j$, pushing $e_i$ away from other items’ semantics $s_j$. These are precisely SSL & Contrastive Learning’s alignment (matched pairs pulled close) and uniformity (everything else spread out). Minimizing it drags the collaborative embedding space toward the LLM’s semantic space, injecting world knowledge that pure interactions never carried — without re-deriving the InfoNCE softmax, which that chapter owns.

E6. The paradox resolves on when the LLM runs. RLMRec calls the billion-parameter LLM only offline, once, to write each user/item profile; that profile is embedded and used during training as the alignment target $s_i$. At a live request, the only thing that executes is the LightGCN dot product $e_u\cdot e_i$ over an ANN index — no LLM call per query — so the serving cost is the cheap CF cost (§4.1’s sub-cent corner), not the LLM cost. Versus an LLM-as-recommender that ranks at query time, the offline-enhancer design buys you the LLM’s semantics and cold-start robustness baked into the collaborative geometry while keeping serving latency and $/query at CF levels; the LLM’s price is paid once (RLMRec reports only $\approx10$–20% extra training time), never per request. That is exactly why §4.1 files the enhancer in the cheap, production-dominant corner.

E7. Two situations from this note where an LLM on the serving path is the wrong call: (1) Low serving latency / huge catalogue (§4.1) — a per-query LLM forward pass turns single-digit-ms serving into hundreds of ms–seconds and sub-cent cost into dollars; the cheap CF backbone (LightGCN) with a dot product wins, and any LLM should stay offline (enhancer). (2) Ranking against millions of candidates (§2.1’s disadvantage) — an LLM cannot score a whole industrial catalogue per query; a cheap retriever / CF model does the fetch, and an LLM is at most a reranker over a short top-$K$. The single condition under which an LLM’s semantic view adds the most over pure CF (§3, §5): cold-start / long-tail items, where collaborative signal is absent or thin but text is plentiful — the regime where aligning a semantic view buys the biggest, most defensible lift.

E8. Two-tower LLM-embedding retrieval (§4): offline, an LLM-grade text embedder encodes each item — and each user’s history — into a vector (the two “towers”). At serve time only an approximate nearest-neighbour (ANN) search over those precomputed vectors runs; there is no LLM call per query, so it is enhancer-class on cost (an ANN lookup, single-digit–tens of ms, sub-cent/1k) yet still carries the LLM’s semantics directly into retrieval because the vectors were produced by the LLM. The property that lets it surface a brand-new item pure CF cannot: a fresh item gets a meaningful embedding from its text alone, with zero interactions, so it is findable by semantic proximity. The natural full stack (§4): two-tower LLM-embedding retrieval → cheap CF/heuristic candidates → optional LLM-as-reranker on the top-$K$ — semantics enter at the cheap retrieval stage, and the expensive LLM is confined to a short shortlist (or skipped), which is why this hybrid sits ahead of “an LLM reranks everything.”

E9. The leakage mechanism (§5): an LLM may have seen the test items (or their reviews/titles) during pretraining, so a “zero-shot” benchmark score reflects memorized test content, not genuine generalization to unseen interactions. It is a measurement problem, not a modelling one — the model is not better at the task, the number is contaminated because train/test independence (which the metric assumes) has been silently broken by the pretraining corpus. A guarding habit (§5 / Probability primer §8): treat the headline number with suspicion and instead report per-bucket / cold-start-only results with significance testing (and prefer items provably outside pretraining, or time-split data), so you are measuring the collaborative lift rather than recall of leaked text.

E10. A streaming startup: 2M items, sub-50 ms budget, an already-good LightGCN, painful cold start. (a) Best single role: Enhancer (§2.3 / §3 — align an LLM semantic view onto the existing LightGCN), with two-tower LLM-embedding retrieval (§4) as the close hybrid cousin if cold-start retrieval is the binding pain. Against the four constraints: 2M items → only an offline-LLM / ANN path scales (an LLM cannot rank millions per query); sub-50 ms → serving stays a dot product / ANN lookup, single-digit–tens of ms, comfortably under budget; already-good LightGCN → the enhancer keeps everything you have and aligns a semantic view to it rather than replacing it; cold start → the LLM’s offline semantics inject exactly the signal new titles lack. (Recommender/generator/agent all put an LLM on the per-query path and blow the latency budget.) (b) From E3, a general LLM on the per-query path costs $\sim\!4{,}000\times$ pure CF and turns single-digit-ms serving into hundreds of ms–seconds per request — two-plus orders of magnitude over a 50 ms budget, so it is infeasible by inspection. (c) Split: offline — LLM writes profiles / embeds items once, profiles aligned to LightGCN during training (or item vectors precomputed for ANN); serve — LightGCN dot product (or ANN over the offline vectors), no LLM call. The thing to still measure per cold-start bucket rather than trusting top-line NDCG: the recommendation quality restricted to newly added / long-tail items (e.g. Recall/NDCG on the cold-item slice), since a top-line metric dominated by popular items can rise even while the cold-start problem you care about does not — the per-bucket evaluation §5 insists on.

All numerics here were checked in code: E3’s factors ($25/0.006\approx4{,}167$; $25/2\approx12$; $2/0.006\approx333$) and daily costs ($6 vs. $25,000 at 1M queries/day), and E4’s Euclidean distances and exact reconstruction ($\textsf{A}=0.5$ unique-min at level 1; level-2 $\textsf{p}=0$ exact hit; second item $d_{\textsf{A}}\approx0.640<d_{\textsf{B}}\approx0.781$).

Training and Serving a Recommender

E1. Ann $=[2,0.5]$, positive Titanic $\mathbf q^{+}=[2,0]$, negative Rambo $\mathbf q^{-}=[0,2]$. $\hat r^{+}=(2)(2)+(0.5)(0)=4.0$; $\hat r^{-}=(2)(0)+(0.5)(2)=1.0$; gap $d=4.0-1.0=3.0$. Since $d=3.0>0$, the order is already correct — Ann ranks the romance film Titanic well above the action film Rambo, and by a wide margin.

E2. With $d=3.0$: loss $=-\ln\sigma(3.0)=-\ln(0.953)=0.048$, and the gradient factor $1-\sigma(3.0)=0.047$. Both are much smaller than §4’s triple (loss $0.201$, factor $0.182$ at $d=1.5$). That is correct and desirable: this triple is ranked correctly by a wider margin ($3.0$ vs $1.5$), so the model is already happy with it and barely needs to move — the loss and the update it drives are both tiny.

E3. Gradient $\dfrac{\partial\mathcal L}{\partial\mathbf p}=-(1-\sigma(d))(\mathbf q^{+}-\mathbf q^{-})=-0.047\,([2,0]-[0,2])=-0.047\,[2,-2]=[-0.094,\,0.094]$. Update with $\eta=0.1$: $\mathbf p\leftarrow[2,0.5]-0.1\,[-0.094,0.094]=[2.009,\,0.491]$. Ann’s romance factor moved up ($2\to2.009$) and her action factor down — consistent with the positive Titanic being pure romance ($\mathbf q^{+}=[2,0]$). The step is tiny because the triple was already very correct (small factor) — exactly the behaviour Exercise 4 explains.

E4. The factor $-(1-\sigma(d))$ is $\approx 0$ when $d$ is large and positive (then $\sigma(d)\approx 1$, so $1-\sigma(d)\approx 0$) and is large (up to $\approx 1$) when $d<0$ (then $\sigma(d)<0.5$). So a triple the model already ranks correctly by a wide margin produces an almost-zero update, while a triple it gets wrong produces a big one. For free, this gives training a built-in curriculum: it automatically concentrates effort on the pairs it currently gets wrong and stops fiddling with the ones it already has right.

E5. (a) Uniform: each of the three candidates has probability $\tfrac13=0.333$. (b) $\text{pop}^{0.75}$ weights $21.56, 12.82, 5.62$ sum to $40.0$, so the probabilities are $21.56/40=0.539$ (Rambo), $12.82/40=0.320$ (Die Hard), $5.62/40=0.141$ (Notebook). (c) Rambo: $0.539/0.333\approx \mathbf{1.62\times}$ more likely to be the sampled negative under popularity-weighting than under uniform — the loop spends more effort pushing the over-exposed blockbuster down.

E6. (a) Each user gets the other $256-1=\mathbf{255}$ positives in the batch as free negatives. (b) False negative: one of those other users’ positives might actually be an item this user would like — but in-batch sampling treats it as a negative and pushes it away. It is worse for very popular items because a popular item appears as a positive in many users’ rows, so it is far more likely to land in the batch as someone else’s positive and be wrongly pushed apart.

E7. (a) Exact scan $\approx Md=10^{7}\times 64=6.4\times10^{8}$ multiply-adds — per user, per request. (b) $\log_2(10^{7})\approx 23.3$. So an HNSW query touches on the order of a couple dozen “steps” instead of ten million items — a speed-up of roughly six orders of magnitude, which is what makes serving a $10^{7}$-item catalogue in milliseconds feasible.

E8. $\ell_2$-normalize every vector (divide each by its length) before indexing and querying. Why it works: for two unit vectors $\hat{\mathbf a},\hat{\mathbf b}$ the dot product equals the cosine, because $\cos(\mathbf a,\mathbf b)=\dfrac{\mathbf a\cdot\mathbf b}{\lVert\mathbf a\rVert\,\lVert\mathbf b\rVert}$ and the denominator is $1$. So “cosine top-$K$” becomes “inner-product top-$K$” — a plain MIPS query the ANN index already supports.

E9. (a) Final ranker scores: A $=3.0-0.5=2.5$; B $=2.6+0.1=2.7$; C $=2.2+1.4=3.6$. Final ranking: C $>$ B $>$ A. (b) C moved from retrieval-rank 3 to ranking-rank 1. Retrieval scores with the collaborative embedding dot product alone, which has no notion of “it’s Friday night”; only the ranking model carries the $\langle\text{item},\text{Friday}\rangle$ feature, so only it could lift the action film C to the top.

E10. (a) Latency/cost: running the heavy FM ranker over all $10^{7}$ items on every request — for millions of users — is orders of magnitude too slow and expensive; the heavy model can only afford to score a few hundred candidates. Recall: the retrieval stage exists to guarantee that the few hundred candidates it forwards contain the genuinely good items, so the ranker has good material to sort. (b) The error retrieval cannot afford is dropping a genuinely good item: once retrieval discards an item it never reaches ranking, so a false negative at the retrieval stage is unrecoverable — retrieval must keep recall high (cast a wide net), even at the cost of precision, because ranking can only reorder what retrieval kept.

Click-Through Rate Prediction & Feature Interactions

E1. Clicked impression, so the surviving term is $-\ln\hat p$. At $\hat p=0.5$: $-\ln 0.5 = 0.6931$. At $\hat p=0.9$: $-\ln 0.9 = 0.1054$. The $\hat p=0.9$ loss is smaller — the model put high probability on the event that actually happened, which is exactly what “lower LogLoss is better” rewards: confidence in the correct outcome.

E2. Replace Bob/Matrix’s term $\ln 0.4$ with $\ln 0.7$. The four label-selected logs are now $\ln 0.8,\ \ln 0.5,\ \ln 0.7,\ \ln 0.7 = -0.2231,\,-0.6931,\,-0.3567,\,-0.3567$, summing to $-1.6296$. LogLoss $= -\tfrac14(-1.6296) = \mathbf{0.4074}$ — down from $0.5473$, since the model got more confident on a true click.

E3. Now the clicks score $\{0.8,\,0.7\}$ and the non-clicks $\{0.5,\,0.3\}$. All four (click, non-click) pairs are ordered correctly: $0.8>0.5,\ 0.8>0.3,\ 0.7>0.5,\ 0.7>0.3$. So AUC $= 4/4 = \mathbf{1.0}$. Yes — lifting Bob’s clicked Matrix above Ann’s un-clicked Titanic fixed the one misranked pair from §3.

E4. Every prediction is $0.5$, so all four (click, non-click) pairs are ties. Counting a tie as half a win: $4 \times 0.5 = 2$ out of $4$ pairs, so AUC $= 2/4 = \mathbf{0.5}$. A constant predictor has no ranking power — AUC $0.5$ is the “random” floor, the point of the half-credit convention.

E5. Second cross layer with $\mathbf x_1=[0.5,\,1.0]$, $\mathbf x_0=[1,\,2]$, $\mathbf w_1=[1,\,0]$, $\mathbf b_1=[0,0]$. Scalar summary $\mathbf x_1^{\top}\mathbf w_1 = (0.5)(1)+(1.0)(0)=0.5$. Cross onto $\mathbf x_0$: $\mathbf x_0\cdot 0.5 = [0.5,\,1.0]$. Add bias and residual $\mathbf x_1$: $[0.5,1.0]+[0,0]+[0.5,1.0]=\mathbf{[1.0,\,2.0]}$. (The entries now carry degree-3 crosses of the original features.)

E6. Logistic regression forms $z = w_0 + \sum_i w_i x_i$ and adds the active weights. For an impression that is both sci-fi and mobile, the two indicators are $1$, contributing $+0.3 + 0.1 = +0.4$ to the log-odds — and this sum is forced regardless of any synergy between the two. The model has no parameter attached to the conjunction sci-fi ∧ mobile, so it cannot represent “this combination clicks more (or less) than the parts predict.” Capturing that requires a cross feature or a learned interaction (FM / deep arm).

E7. DeepFM removes the hand-crafted cross features $\phi(\mathbf x)$ that Wide & Deep’s wide arm needs a human to design — the FM arm learns all second-order crosses automatically. It adds embedding sharing: the FM arm and the DNN arm read the same field embeddings, so each vector is trained by both the explicit second-order signal and the deep signal.

E8. The residual $+\,\mathbf x_l$ carries every lower-degree term forward, so after $L$ layers the network holds all interaction degrees $1,2,\dots,L+1$ at once. Without it, $\mathbf x_{l+1}=\mathbf x_0\mathbf x_l^{\top}\mathbf w_l+\mathbf b_l$ would keep only the newly created (highest) degree each step, discarding the intermediate degrees $1\dots L$ — you would end up with essentially the single top-order cross and lose the rich low-order signal that usually carries most of the CTR information.

E9. In FM, $\mathbf v_{\text{sci-fi}}$ receives a gradient from every training impression in which sci-fi co-occurs with some other active feature (sci-fi with action, sci-fi with evening, …); likewise $\mathbf v_{\text{mobile}}$ is shaped by all of mobile’s co-occurrences. Both vectors are therefore well-determined from data that never involved the specific pair, and their dot product $\langle\mathbf v_{\text{sci-fi}},\mathbf v_{\text{mobile}}\rangle$ yields a sensible interaction for the unseen (sci-fi, mobile) combination. A free per-pair weight $w_{ij}$ is updated only by impressions where $i$ and $j$ co-occur; with zero such rows it gets zero gradient and stays at its prior — it cannot generalize. Factorizing the interaction is exactly what buys generalization on sparse data.

E10. (1) Build a feature row per candidate. For each of the 200 items, assemble the fields user_id=Ann, item_id=<candidate>, genre=<that item's>, device=mobile, daypart=evening (plus any others), each one-hot. (2) Embed and score. Map every field to its dense embedding (shared across arms); feed the embeddings to the FM arm (all second-order crosses) and the DNN arm (high-order), sum, and apply the sigmoid to get $\hat p$ for each of the 200 rows. (3) Sort and log. Order the candidates by $\hat p$ descending — that ranked list is the page; offline, score the model with AUC (ranking quality) and LogLoss (calibration) against held-out click labels.

Bandits & Online Recommendation

E1. Each pull of the sub-optimal arm adds its gap to the regret, so $8 \times 0.25 = \mathbf{2.0}$ expected clicks given up.

E2. $K=4$, $\varepsilon=0.2$. The best arm is pulled on exploit (prob $0.8$) or when explore lands on it (prob $0.2/4=0.05$): $P(\text{best}) = 1-\varepsilon+\varepsilon/K = 0.8+0.05 = \mathbf{0.85}$. Each specific other arm is pulled only via explore: $\varepsilon/K = \mathbf{0.05}$. (Check: $0.85 + 3\times0.05 = 1.00$.)

E3. With $\ln 9 = 2.197$: $\text{UCB}_A = 0.667 + \sqrt{2(2.197)/3} = 0.667 + 1.210 = \mathbf{1.877}$; $\text{UCB}_B = 0 + \sqrt{2(2.197)/1} = \mathbf{2.096}$. UCB still pulls B — one extra pull was not enough to shrink B’s bonus below A, so the under-explored arm is still favoured.

E4. $\ln 52 = 3.951$. $\text{UCB}_A = 0.60 + \sqrt{2(3.951)/50} = 0.60 + 0.398 = \mathbf{0.998}$; $\text{UCB}_B = 0.50 + \sqrt{2(3.951)/2} = 0.50 + 1.988 = \mathbf{2.488}$. UCB pulls B: A is well-sampled so its bonus has nearly vanished (its score sits just above its true mean), while B’s two pulls leave a huge bonus — optimism still demands B be explored, even though A’s mean is higher. (This is the mechanism that keeps UCB from prematurely locking onto a well-sampled arm.)

E5. Starting from $\text{Beta}(1,1)$ and adding $s=4$ clicks, $f=6$ non-clicks: $\text{Beta}(1+4,\,1+6) = \mathbf{\text{Beta}(5,7)}$, posterior mean $\dfrac{5}{5+7} = \dfrac{5}{12} = \mathbf{0.417}$.

E6. ε-greedy (i) explores forever at a fixed rate $\varepsilon$, even once an arm is clearly best; UCB’s bonus term $\sqrt{2\ln t/n_i}$ shrinks as an arm is pulled, so exploration of a settled arm self-limits. ε-greedy (ii) explores uniformly at random, wasting pulls on arms already known to be bad; UCB’s bonus is largest for the least-pulled (most uncertain) arms, so exploration is targeted. The two UCB terms — the mean $\hat\mu_i$ (exploit) and the bonus (shrinking, targeted explore) — fix exactly those two flaws.

E7. $n_i$ in the denominator: the more times arm $i$ has been pulled, the smaller its bonus — more data means less uncertainty, so the arm’s score settles toward its true mean. $\ln t$ in the numerator: as total rounds $t$ grow, every arm’s bonus creeps up, so an arm ignored for a long stretch eventually looks “worth re-checking” again — but only logarithmically, so the total amount of exploration over a run stays bounded (the source of UCB’s logarithmic regret).

E8. Exploration comes from the width of each arm’s posterior. A barely-tried arm has a wide Beta, so a single sample from it is sometimes high and the arm gets pulled (explored); a well-tried arm has a narrow Beta, so its samples cluster near its true mean and it is pulled only when it really is best. As clicks accumulate, every posterior narrows, high samples from genuinely-bad arms become rare, and exploration fades on its own — no $\varepsilon$ to tune.

E9. A brand-new item has $n_i=0$ and no clicks. ε-greedy: it can be reached only through the $\varepsilon/K$ random-explore slice — a weak, undifferentiated push (and it competes with every other arm for that slice). UCB: its bonus $\sqrt{2\ln t/n_i}$ is infinite at $n_i=0$, so UCB pulls it immediately (UCB plays every arm once before anything else) — the strongest initial push. Thompson: its prior $\text{Beta}(1,1)$ is uniform on $[0,1]$, so it draws a sample above $0.5$ half the time and gets explored with moderate, decaying probability. Strongest first push: UCB.

E10. The contextual bandit builds a context $\mathbf x$ from Ann’s features (sci-fi fan) $+$ each candidate article $+$ the situation (mobile, evening). For each candidate it estimates the expected reward $\boldsymbol\theta^{\top}\mathbf x$ (ridge regression on the rounds seen so far) and adds a confidence bonus that is large for contexts unlike any it has served; it shows the article with the highest score — exploiting its best estimate, but the bonus lets it occasionally explore an article whose context is novel. After the impression it observes only Ann’s click / no-click on the shown article and updates $\boldsymbol\theta$ and the confidence with that single observation. The defining difference from the supervised CTR model: the bandit chooses the action and learns only from the feedback on its own choice (a closed loop, with exploration), rather than training once on a fixed log of all impressions.