
Matrix Calculus

In this article, I do not use any special notation to distinguish vectors and matrices, for the sake of generality. Some of the final parts, from Tensors onwards, as well as the Hessian section, were written by Gemini with content guidance from me. I put off finishing the article for 2 years and enough was enough…

Introduction to Differentials and Linearisation

We all know the formula for the derivative,

\[f'(x) = \frac{\mathrm{d}f}{\mathrm{d}x},\]

and we are taught that $f'(x)$ represents the local gradient at a point on a curve. However, this idea is more difficult to generalise to higher-dimensional $x$ such as vectors, matrices and tensors. Instead, we should consider that the derivative is really a linearisation of a function, which we can show using the Taylor series expansion,

\[f(x + \delta x) = f(x) + f'(x)\delta x + \frac{f''(x)}{2}\delta x^2 + \dots\]

Linearising the curve means we discard the $\mathcal{O}(\lvert \delta x \rvert^2)$ terms, and in the limit of small perturbations, $\delta x \rightarrow 0$, this is a logical move. Therefore, we are left with the resulting approximation,

\[f(x + \delta x) \approx f(x) + f'(x)\delta x.\]

In essence, the reason we refer to this as linearisation is that we have effectively computed the function which is tangent to $f$ at $x$. This idea is key, since we can generalise it to more complicated objects, e.g. a surface $f$ is linearised to a plane. The approximation above becomes an equality in the limit $\delta x \rightarrow 0$, in which case the perturbation $\delta x$ becomes a differential $\mathrm{d}x$,

\[f(x + \mathrm{d}x) = f(x) + f'(x)\mathrm{d}x.\]

We can rearrange the equation to give,

\[f(x + \mathrm{d}x) - f(x) = f'(x)\mathrm{d}x,\]

and we can represent $f(x + \mathrm{d}x)$ as $f + \mathrm{d}f$, which gives,

\[\mathrm{d}f = f'(x)\mathrm{d}x.\]

We can recognise that this is effectively “multiplying up” the $\mathrm{d}x$ on both sides of the standard single-variable derivative formula. Now, it is immediately apparent that this form of the derivative is far more general, because $\mathrm{d}x$ is not limited to $\mathbb{R}$. The formula also carries genuine meaning: $\mathrm{d}f$ represents a small output perturbation and $\mathrm{d}x$ a small input perturbation.
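As a quick sanity check of this perturbation viewpoint, here is a minimal Python sketch (the cubic $f$ and the step size are my own illustrative choices) comparing the true output perturbation against the linearised prediction $f'(x)\,\mathrm{d}x$:

```python
import numpy as np

# Illustrative scalar example: f(x) = x^3, so f'(x) = 3x^2.
f = lambda x: x**3
fprime = lambda x: 3 * x**2

x, dx = 1.5, 1e-6
actual = f(x + dx) - f(x)        # true output perturbation df
linear = fprime(x) * dx          # linearised prediction f'(x) dx

print(actual, linear)            # agree to ~O(dx^2)
```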

The Jacobian

We call $f'(x)$ the Jacobian. The Jacobian gives the linear relationship between input and output perturbations, which becomes more accurate the smaller the perturbations are, since the $\mathcal{O}(\lvert \delta x \rvert^2)$ terms become increasingly negligible with respect to the first-order term.

Take, for example, some function $f$ such that,

\[f : \mathbb{R}^n \rightarrow \mathbb{R}^m,\]

i.e. $f(x)$ is an $m$-vector and $x$ an $n$-vector. In this case, for the derivative formula to be satisfied, we require

\[f'(x) \in \mathbb{R}^{m \times n},\]

and so in this example, $f'(x)$ is a matrix. But how can we construct a Jacobian matrix? We shall consult the total differential relationship below, using the same $f$ as above,

\[\mathrm{d}f_i = \frac{\partial f_i}{\partial x^j} \mathrm{d}x_j.\]

Here, we have used Einstein summation notation. Converting this expression into a matrix-vector multiplication, we get

\[\begin{pmatrix} \mathrm{d}f_1 \\ \vdots \\ \mathrm{d}f_m \end{pmatrix} = \begin{pmatrix} \frac{\partial f_1}{\partial x^1} & \dots & \frac{\partial f_1}{\partial x^n} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x^1} & \dots & \frac{\partial f_m}{\partial x^n} \end{pmatrix} \begin{pmatrix} \mathrm{d}x_1 \\ \vdots \\ \mathrm{d}x_n \end{pmatrix}.\]

And so we can define the Jacobian matrix,

\[f'(x) = \begin{pmatrix} \frac{\partial f_1}{\partial x^1} & \dots & \frac{\partial f_1}{\partial x^n} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x^1} & \dots & \frac{\partial f_m}{\partial x^n} \end{pmatrix}.\]
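To make the construction concrete, here is a rough numerical sketch (the helper name `numerical_jacobian` and the test function are my own illustrative choices) that builds the Jacobian column by column from finite differences:

```python
import numpy as np

def numerical_jacobian(f, x, eps=1e-7):
    """Approximate the m x n Jacobian of f at x by perturbing one input at a time."""
    fx = f(x)
    J = np.zeros((fx.size, x.size))
    for j in range(x.size):
        dx = np.zeros_like(x)
        dx[j] = eps
        J[:, j] = (f(x + dx) - fx) / eps   # j-th column holds df_i / dx_j
    return J

# Illustrative f : R^2 -> R^3
f = lambda x: np.array([x[0]**2, x[0] * x[1], np.sin(x[1])])
x = np.array([1.0, 2.0])
print(numerical_jacobian(f, x))
# Analytic Jacobian for comparison: [[2x_1, 0], [x_2, x_1], [0, cos(x_2)]]
```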

To reinforce this idea, we will observe a very trivial example. This should also briefly introduce our approach to taking derivatives in matrix calculus. We define $f : \mathbb{R}^n \rightarrow \mathbb{R}^m$ such that,

\[f(x) = Ax,\]

where $A$ is a constant matrix. Starting from first principles,

\[f'(x) \mathrm{d}x = \mathrm{d}f = f(x + \mathrm{d}x) - f(x).\]

Next we substitute the function $f$ into the equation,

\[f'(x) \mathrm{d}x = A(x + \mathrm{d}x) - Ax.\]

Since matrix multiplication is distributive,

\[f'(x) \mathrm{d}x = A \mathrm{d}x,\] \[f'(x) = A.\]

We have found the Jacobian matrix, but let us check it with our partial derivative construction,

\[\mathrm{d}f_i = \frac{\partial f_i}{\partial x^j} \mathrm{d}x_j.\]

From the definition of the function $f$, we know that,

\[f_i = A_{ij}x_j,\]

and so,

\[\frac{\partial f_i}{\partial x^j} = A_{ij},\] \[\therefore f'(x)_{ij} = A_{ij} \implies f'(x) = A,\]

which agrees with the above.
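We can also confirm this numerically. In the sketch below (with an arbitrary random $A$ and perturbation), the output perturbation equals $A\,\mathrm{d}x$ up to floating-point error, because the function is already linear:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))        # constant matrix, m = 3, n = 4
x = rng.standard_normal(4)
dx = 1e-6 * rng.standard_normal(4)     # small input perturbation

df = A @ (x + dx) - A @ x              # f(x + dx) - f(x)
print(np.allclose(df, A @ dx))         # True: the Jacobian is exactly A
```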

Loss Functions and Grad

Let us consider another example where $f : \mathbb{R}^n \rightarrow \mathbb{R}$. We call this kind of scalar function a loss/cost function in machine learning, since it takes in many parameters and returns a scalar that quantifies some sort of loss or cost. The relation \(\mathrm{d}f = f'(x)\mathrm{d}x\) tells us that $f'(x)$ must be a row vector (covector). Using vector notation we can say,

\[\mathrm{d}f = f'(x)^T \cdot \mathrm{d}x.\]

From the equation for the total derivative, we can recall that for some scalar field $f : \mathbb{R}^n \rightarrow \mathbb{R}$ (using Einstein summation convention),

\[\mathrm{d}f = \frac{\partial f}{\partial x^i} \mathrm{d}x_i = \nabla f \cdot \mathrm{d}x.\]

Comparing the two equations above,

\[\nabla f = f'(x)^T .\]

Establishing the grad operator in this way allows us to extend it to virtually all Hilbert spaces (inner product spaces, which we define more carefully later).
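As a small illustration (the toy loss below is my own choice), for $f(x) = x \cdot x$ the $1 \times n$ Jacobian is the row vector $2x^T$ and the gradient is its transpose $2x$, which we can check against a small perturbation:

```python
import numpy as np

f = lambda x: x @ x                    # toy loss: f(x) = sum of squares
grad = lambda x: 2 * x                 # analytic gradient, nabla f = 2x

x = np.array([1.0, -2.0, 0.5])
dx = 1e-6 * np.array([0.3, 0.1, -0.7])

df = f(x + dx) - f(x)
print(df, grad(x) @ dx)                # df ~ (nabla f) . dx = f'(x) dx
```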

Second Derivatives and the Hessian

We have seen that taking the first derivative of a scalar-valued function $f: \mathbb{R}^n \rightarrow \mathbb{R}$ gives us the gradient $\nabla f$, which is an $n$-vector field. If we take the derivative of the gradient, we arrive at the second derivative of the original scalar function, known as the Hessian matrix, usually denoted as $H$ or $\nabla^2 f$.

\[\mathrm{d}(\nabla f) = H \mathrm{d}x\]

The Hessian $H \in \mathbb{R}^{n \times n}$ contains the second-order partial derivatives of $f$:

\[H_{ij} = \frac{\partial^2 f}{\partial x^i \partial x^j}\]

The Hessian is symmetric (by Schwarz’s theorem, assuming continuous second partial derivatives), meaning $H = H^T$.

Using the Hessian, we can extend our Taylor series linearisation of $f$ to a quadratic approximation. For a small perturbation $\mathrm{d}x$:

\[f(x + \mathrm{d}x) \approx f(x) + (\nabla f)^T \mathrm{d}x + \frac{1}{2}\mathrm{d}x^T H \mathrm{d}x\]

This quadratic approximation is incredibly powerful for optimisation. In Newton’s method for finding the minimum of a function, we want to find the stationary point of the local approximation. We can take the derivative of the approximation with respect to $\mathrm{d}x$ and set it to zero:

\[\nabla f(x + \mathrm{d}x) \approx \nabla f(x) + H \mathrm{d}x = 0\]

Rearranging for the optimal step $\mathrm{d}x$:

\[\mathrm{d}x = -H^{-1} \nabla f(x)\]

This gives us the Newton update rule for optimisation:

\[x_{k+1} = x_k - H^{-1} \nabla f(x_k)\]

Unlike gradient descent, which only uses first-order information (slope) to step downhill, Newton’s method uses second-order information (curvature) to jump straight to the minimum of the local quadratic approximation, giving quadratic rather than linear convergence for well-behaved functions.
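Below is a minimal sketch of the Newton update on a toy objective of my own choosing; note that it solves $H\,\mathrm{d}x = -\nabla f$ rather than forming $H^{-1}$ explicitly, which is the usual practice:

```python
import numpy as np

# Toy smooth objective (my own choice): f(x, y) = (x - 1)^2 + exp(y) - y,
# which has its minimum at (1, 0).
def grad(v):
    x, y = v
    return np.array([2 * (x - 1), np.exp(y) - 1])

def hessian(v):
    x, y = v
    return np.array([[2.0, 0.0],
                     [0.0, np.exp(y)]])

v = np.array([5.0, 3.0])
for _ in range(8):
    # Newton step: solve H dx = -grad(f) instead of inverting H
    v = v + np.linalg.solve(hessian(v), -grad(v))

print(v)   # ~ [1, 0] after only a handful of iterations
```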

Brief Interlude on Linear Operators

A linear operator is any operator $L: \mathcal{U} \rightarrow \mathcal{V}$ that satisfies the following equation,

\[L(\alpha x + \beta y) = \alpha L(x) + \beta L(y),\]

where $x, y \in \mathcal{U}$ and $\alpha , \beta \in \mathbb{R}$. As a generic example, we can take $L: \mathbb{R}^n \rightarrow \mathbb{R}^m$ such that,

\[L(x) = Ax\]

for a constant $A \in \mathbb{R}^{m \times n}$. From matrix multiplication rules, we can quickly see that,

\[A(\alpha x + \beta y) = \alpha Ax + \beta Ay,\]

because matrix multiplication is distributive and multiplication by scalars is both associative and commutative. In a sense, this demonstration is mildly circular, since matrix multiplication is defined precisely so as to satisfy the criteria of linearity, given that it represents a linear transformation.

The Derivative as a Linear Operator

Let us play with an example,

\[f(X) = X^2.\]

Through this example, I hope to further establish the first principles for taking the derivative of a matrix function, as well as illustrating the derivative as a linear operator.

From first principles,

\[f'(X)\mathrm{d}X = \mathrm{d}f = f(X + \mathrm{d}X) - f(X).\]

Substitute our $f$ into the equation,

\[f'(X)\mathrm{d}X = (X + \mathrm{d}X)^2 - X^2,\] \[f'(X)\mathrm{d}X = X^2 + X \mathrm{d}X + \mathrm{d}X X + \mathrm{d}X^2 - X^2.\]

In expanding, care must be taken to preserve the order of matrix multiplication since it is a non-commutative operation. We can now discard the non-linear terms, i.e. $\mathcal{O}(\mathrm{d}X^2)$,

\[f'(X)\mathrm{d}X = X \mathrm{d}X + \mathrm{d}X X.\]

On the surface, it may appear as if we are in trouble! How are we going to represent $f'(X)$ as a matrix? In short, we cannot; instead, we will define $f'(X)$ as a linear operator, such that

\[f'(X)[\mathrm{d}X] = X \mathrm{d}X + \mathrm{d}X X.\]

This subtle change in notation allows us to differentiate a far greater set of matrix functions. The formal name for this is the Fréchet derivative.
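A quick numerical sanity check of this operator form (with an arbitrary random matrix and perturbation, sizes chosen purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((3, 3))
dX = 1e-6 * rng.standard_normal((3, 3))    # small matrix perturbation

actual = (X + dX) @ (X + dX) - X @ X       # true df
linear = X @ dX + dX @ X                   # f'(X)[dX]
print(np.max(np.abs(actual - linear)))     # ~O(|dX|^2), i.e. tiny
```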

Vectorising Matrices

One may recognise that an $m \times n$ matrix is isomorphic to an $mn$-vector, and indeed there is a linear map that performs this matrix-to-vector conversion. We call it the vectorising function, and with its aid we can form Jacobian matrices in cases where previously we had to settle for linear operators. Let us define,

\[\mathrm{vec} : \mathbb{R}^{m \times n} \rightarrow \mathbb{R}^{mn},\]

such that the column vectors within the matrix are stacked on top of each other, i.e.

\[\mathrm{vec}\begin{pmatrix} \uparrow & \dots & \uparrow \\ u_1 & \dots & u_n \\ \downarrow & \dots & \downarrow \end{pmatrix} = \begin{pmatrix} \uparrow \\ u_1 \\ \downarrow \\ \vdots \\ \uparrow \\ u_n \\ \downarrow \end{pmatrix}.\]

where $u_i$ represents the $i$-th column vector in the matrix. Using this new vec operator, we can actually construct a Jacobian matrix from the function $f(X) = X^2$. Let us take a $2 \times 2$ matrix $A$, such that

\[A = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix},\]

and we match to it some small perturbation,

\[\mathrm{d}A = \begin{pmatrix} \mathrm{d}a_{11} & \mathrm{d}a_{12} \\ \mathrm{d}a_{21} & \mathrm{d}a_{22} \end{pmatrix}.\]

We shall observe the example of using $f(X) = X^2$,

\[A^2 = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}^2 = \begin{pmatrix} a_{11}^2 + a_{12}a_{21} & a_{12}(a_{11}+a_{22}) \\ a_{21}(a_{11} + a_{22}) & a_{22}^2 + a_{12}a_{21} \end{pmatrix},\] \[(A+\mathrm{d}A)^2 = \begin{pmatrix} a_{11} + \mathrm{d}a_{11} & a_{12} + \mathrm{d}a_{12} \\ a_{21} + \mathrm{d}a_{21} & a_{22} + \mathrm{d}a_{22} \end{pmatrix}^2.\]

Here, we skip some steps and give the final result (omitted steps left as an exercise for the reader… the exercise can be approached by matrix algebra or by differentiating the elements of the matrix).

\[\mathrm{d}A^2 = (A+\mathrm{d}A)^2 - A^2\] \[= \begin{pmatrix} 2a_{11}\mathrm{d}a_{11} + a_{12}\mathrm{d}a_{21} + a_{21}\mathrm{d}a_{12} & (a_{11} + a_{22})\mathrm{d}a_{12} + a_{12}(\mathrm{d}a_{11} + \mathrm{d}a_{22}) \\ (a_{11} + a_{22})\mathrm{d}a_{21} + a_{21}(\mathrm{d}a_{11} + \mathrm{d}a_{22}) & 2a_{22}\mathrm{d}a_{22} + a_{12}\mathrm{d}a_{21} + a_{21}\mathrm{d}a_{12} \end{pmatrix}.\]

We vectorise this matrix into,

\[\mathrm{vec}(\mathrm{d}A^2) = \begin{pmatrix} 2a_{11}\mathrm{d}a_{11} + a_{12}\mathrm{d}a_{21} + a_{21}\mathrm{d}a_{12} \\ (a_{11} + a_{22})\mathrm{d}a_{21} + a_{21}(\mathrm{d}a_{11} + \mathrm{d}a_{22}) \\ (a_{11} + a_{22})\mathrm{d}a_{12} + a_{12}(\mathrm{d}a_{11} + \mathrm{d}a_{22}) \\ 2a_{22}\mathrm{d}a_{22} + a_{12}\mathrm{d}a_{21} + a_{21}\mathrm{d}a_{12} \end{pmatrix}.\]

Now, it is handy to realise that this can be expressed as,

\[\mathrm{vec}(\mathrm{d}A^2) = \begin{pmatrix} 2a_{11} & a_{12} & a_{21} & 0 \\ a_{21} & a_{11} + a_{22} & 0 & a_{21} \\ a_{12} & 0 & a_{11} + a_{22} & a_{12} \\ 0 & a_{12} & a_{21} & 2a_{22} \end{pmatrix} \mathrm{vec}(\mathrm{d}A),\]

and hence, we have acquired an expression for the Jacobian!
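To check the hand-derived Jacobian, we can plug in concrete numbers (the values below are arbitrary) and compare the Jacobian-vector product against the operator form $X\mathrm{d}X + \mathrm{d}X X$:

```python
import numpy as np

vec = lambda M: M.flatten(order='F')      # stack columns, as defined above

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
a11, a12, a21, a22 = A[0, 0], A[0, 1], A[1, 0], A[1, 1]

# Jacobian of f(X) = X^2 at A, written out exactly as derived above
J = np.array([[2*a11, a12,       a21,       0    ],
              [a21,   a11 + a22, 0,         a21  ],
              [a12,   0,         a11 + a22, a12  ],
              [0,     a12,       a21,       2*a22]])

dA = 1e-6 * np.array([[0.5, -0.3],
                      [0.2,  0.7]])

# Compare the Jacobian-vector product against the operator form X dX + dX X
print(np.allclose(J @ vec(dA), vec(A @ dA + dA @ A)))   # True
```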

N.B. The vectorisation approach should be used with care, as it encourages an element-centric mindset towards matrices, which is often not the best way to think about them.

Deriving General Differentiation Rules

In ordinary scalar calculus, we get so far with differentiation because of various rules that can be proved for general differentiable functions. These rules then form a toolkit for evaluating derivatives of greater complexity. In this section, we look at extending these rules to matrix calculus, alongside some matrix-specific rules. You should notice that all of these rules reduce to the familiar scalar rules when we take $1 \times 1$ matrices, i.e. scalars.

Sum Rule

Let us begin with $f(x) = g(x) + h(x)$, where $f, g$ and $h$ take matrix inputs and produce matrix outputs. Starting from first principles,

\[\mathrm{d}f = f(x + \mathrm{d}x) - f(x) = g(x + \mathrm{d}x) + h(x + \mathrm{d}x) - g(x) - h(x).\]

Matrix addition is commutative, so the sum on the right hand side can be trivially rearranged,

\[\mathrm{d}f = g(x + \mathrm{d}x) - g(x) + h(x + \mathrm{d}x) - h(x),\] \[\therefore \mathrm{d}(g+h) = \mathrm{d}g + \mathrm{d}h.\]

Product Rule

For the product rule, let $f(x) = g(x)h(x)$, where $f,g$ and $h$ are defined similarly. Adopting the same approach,

\[\mathrm{d}f = (f + \mathrm{d}f) - f = (g+\mathrm{d}g)(h+\mathrm{d}h) - gh.\]

The key is to preserve the order of matrix multiplication. Expanding the above equation and discarding second-order terms, we acquire

\[\mathrm{d}f = g\,\mathrm{d}h + (\mathrm{d}g)h.\] \[\therefore \mathrm{d}(gh) = g\,\mathrm{d}h + (\mathrm{d}g)h.\]

Chain Rule

To tackle the chain rule, we define $f(x) = h(g(x))$, such that $g$ acts on $x$ first, then followed by $h$. To cement this, we could apply an example (however the proof below is general beyond the example). Let $x \in \mathbb{R}^n$, $g: \mathbb{R}^n \rightarrow \mathbb{R}^m$, $h: \mathbb{R}^m \rightarrow \mathbb{R}^k$ and $f: \mathbb{R}^n \rightarrow \mathbb{R}^k$.

\[\mathrm{d}f = h(g(x+\mathrm{d}x)) - h(g(x)),\] \[\mathrm{d}f = h(g+\mathrm{d}g) - h(g(x)).\]

The observation we must make here is that $\mathrm{d}g = g'(x)\mathrm{d}x$ is still an infinitesimal. The way to consider the following step is to think of $h'(g(x))$ as the derivative of $h$ evaluated at a point $g(x)$ in $g$-space.

\[\mathrm{d}f = h(g(x)) + h'(g(x))\mathrm{d}g - h(g(x)) = h'(g(x))g'(x)\mathrm{d}x.\] \[\therefore f'(x) = h'(g(x))g'(x).\]

The example at the start of this section was not chosen blindly. It illustrates the composition of several vector-to-vector functions. You will notice that $h'(g(x))$ is just the Jacobian of $h$ evaluated at $g(x)$, and $g'(x)$ is just the Jacobian of $g$ evaluated at $x$. In short, the chain rule for vector-to-vector functions is simply the product of Jacobians. Computationally, when chains and vectors get long, it makes a massive difference which way we perform the matrix multiplication (left-to-right or right-to-left). We will discuss this in much further depth later on in this article.
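A small numerical check of the Jacobian-product form of the chain rule, using two toy maps of my own choosing ($g: \mathbb{R}^2 \rightarrow \mathbb{R}^3$ and $h: \mathbb{R}^3 \rightarrow \mathbb{R}^2$):

```python
import numpy as np

# Two toy vector-to-vector maps (illustrative choices)
g = lambda x: np.array([x[0] * x[1], x[0]**2, np.sin(x[1])])
h = lambda y: np.array([y[0] + y[1], y[1] * y[2]])

def Jg(x):   # analytic Jacobian of g
    return np.array([[x[1],     x[0]],
                     [2 * x[0], 0.0 ],
                     [0.0,      np.cos(x[1])]])

def Jh(y):   # analytic Jacobian of h
    return np.array([[1.0, 1.0,  0.0],
                     [0.0, y[2], y[1]]])

x = np.array([1.0, 2.0])
dx = 1e-6 * np.array([0.4, -0.9])

actual = h(g(x + dx)) - h(g(x))          # true df
linear = Jh(g(x)) @ Jg(x) @ dx           # product of Jacobians, h'(g(x)) g'(x) dx
print(actual, linear)                    # agree to ~O(|dx|^2)
```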

Inverse Rule

This rule is obviously a matrix-specific rule, or is it? We will see. Let $f(X) = X^{-1}$, where $X$ is an invertible matrix. We will employ a trick to evaluate this derivative.

\[X X^{-1} = I,\]

where $I$ is the identity matrix. This is a constant matrix, and so its differential is $0$.

\[\implies \mathrm{d}(XX^{-1}) = 0.\]

Applying the product rule,

\[X\mathrm{d}X^{-1} + \mathrm{d}XX^{-1} = 0,\] \[X\mathrm{d}X^{-1} = -\mathrm{d}XX^{-1},\] \[\therefore \mathrm{d}X^{-1} = -X^{-1}\mathrm{d}XX^{-1}.\]
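A quick numerical check of the inverse rule (random, well-conditioned $X$ chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((4, 4)) + 4 * np.eye(4)   # keep X comfortably invertible
dX = 1e-6 * rng.standard_normal((4, 4))

actual = np.linalg.inv(X + dX) - np.linalg.inv(X)     # true d(X^{-1})
linear = -np.linalg.inv(X) @ dX @ np.linalg.inv(X)    # predicted by the rule
print(np.max(np.abs(actual - linear)))                # ~O(|dX|^2)
```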

So let’s see if it really is matrix-specific. Suppose $X$ were a scalar: the scalar identity is just $1$, and so $X^{-1}$ is just the reciprocal of $X$, giving

\[\mathrm{d}X^{-1} = -X^{-1}\mathrm{d}XX^{-1} = -\frac{\mathrm{d}X}{X^2},\]

and

\[\frac{\mathrm{d}X^{-1}}{\mathrm{d}X} = -\frac{1}{X^2},\]

which is the expected result.

Defining an Inner Product and Norm

To define an inner product, we will make use of the matrix-vector isomorphism; specifically, the spaces $\mathbb{R}^{m\times n}$ and $\mathbb{R}^{mn}$ are isomorphic via the vectorising function. For the vectors formed by vectorising matrices $A$ and $B$,

\[\mathrm{vec}A \cdot \mathrm{vec}B = (\mathrm{vec}A)^T\mathrm{vec}B.\]

This can also be expressed without the vectorising function,

\[(\mathrm{vec}A)^T\mathrm{vec}B = A_{ij}B_{ij} = \mathrm{tr}(A^TB).\]

The most obvious generalisation is to say $\langle A, B \rangle = \mathrm{vec}A \cdot \mathrm{vec}B$. This is called the Frobenius Inner Product, and is defined,

\[\langle A, B \rangle = \mathrm{tr}(A^TB) = A_{ij}B_{ij}.\]

In essence, this is just multiplying corresponding elements of $A$ and $B$, then summing the products, which is again very intuitive.
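The three equivalent forms of the Frobenius inner product are easy to verify numerically (random matrices, sizes arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((3, 2))
B = rng.standard_normal((3, 2))

vec = lambda M: M.flatten(order='F')

dot_of_vecs = vec(A) @ vec(B)          # (vec A)^T vec B
trace_form  = np.trace(A.T @ B)        # tr(A^T B)
elementwise = np.sum(A * B)            # sum_ij A_ij B_ij

print(dot_of_vecs, trace_form, elementwise)   # all three agree
```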

Following on from the inner product, in vectors, we can define the norm of a vector as its length, which is given by

\[|\vec{a}| = \sqrt{a_i^2} = \sqrt{\vec{a} \cdot \vec{a}}.\]

Again, we make the obvious generalisation and define the Frobenius norm (the Euclidean norm of the vectorised matrix, not to be confused with the induced matrix 2-norm) as,

\[|A| = \sqrt{\langle A, A \rangle} = \sqrt{\mathrm{tr}(A^TA)} = \sqrt{A_{ij}^2}.\]

A vector space equipped with a defined norm, which is also complete, is formally known as a Banach space. Similarly, a vector space with a defined inner product that is complete is formally known as a Hilbert space. Note that while every inner product naturally induces a norm (meaning Hilbert spaces are inherently Banach spaces), the reverse is not generally true.

Both definitions fundamentally rely on the concept of Cauchy completeness. A space is complete if every Cauchy sequence within the space converges to a limit that is also within that space. This is an absolutely crucial property for calculus: when we take our perturbations $\mathrm{d}x \rightarrow 0$ to construct derivatives and integrals, completeness guarantees that the limits of our sequences actually exist inside our vector space! Without it, the entire foundation of differential matrix calculus would fail.

Element-wise Operations (Hadamard Product)

Often in machine learning, we apply a scalar function $f: \mathbb{R} \rightarrow \mathbb{R}$ to a matrix $X$ element-wise, creating a matrix output of the same dimensions. A common example is applying a nonlinear activation function like sigmoid or ReLU to a matrix of pre-activations.

Because this map is element-wise, each element of the output depends only on the corresponding element of the input, i.e. $Y_{ij} = f(X_{ij})$.

To represent the derivative of this operation, we introduce the Hadamard product (denoted $\odot$), which is simply element-wise matrix multiplication.

By looking at an individual element perturbed by an infinitesimal:

\[\mathrm{d}Y_{ij} = f(X_{ij} + \mathrm{d}X_{ij}) - f(X_{ij}) = f'(X_{ij})\mathrm{d}X_{ij}.\]

Generalising this back into matrix notation using the Hadamard product, our differential expression is:

\[\mathrm{d}Y = f'(X) \odot \mathrm{d}X.\]

where $f'(X)$ is the matrix formed by applying the scalar derivative $f'$ element-wise to $X$. This notation elegantly sidesteps the need to vectorise the matrix or invoke 4D tensors, keeping equations computationally intuitive.
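A quick numerical check of the element-wise rule, using the sigmoid as an illustrative activation:

```python
import numpy as np

sigmoid = lambda X: 1 / (1 + np.exp(-X))
sigmoid_prime = lambda X: sigmoid(X) * (1 - sigmoid(X))   # scalar derivative, applied element-wise

rng = np.random.default_rng(4)
X = rng.standard_normal((3, 3))
dX = 1e-6 * rng.standard_normal((3, 3))

actual = sigmoid(X + dX) - sigmoid(X)        # true dY
linear = sigmoid_prime(X) * dX               # f'(X) (Hadamard) dX, element-wise product
print(np.max(np.abs(actual - linear)))       # ~O(|dX|^2)
```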

Beyond Matrices: Tensors and Index Notation

If taking the derivative of a vector with respect to a vector yields a 2D matrix (the Jacobian), what happens when we take the derivative of a matrix with respect to a matrix? The result requires a 4-dimensional array of partial derivatives. Instead of relying on unwieldy multidimensional bricks of numbers, we lean on Tensors and Einstein Summation Convention, which we hinted at earlier.

The Kronecker Delta and Levi-Civita Symbol

Two of the most essential tools in tensor calculus are the Kronecker delta ($\delta_{ij}$) and the Levi-Civita symbol ($\epsilon_{ijk}$, together with its higher-dimensional analogues).

The Kronecker delta is the tensor equivalent of the identity matrix: \(\delta_{ij} = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{if } i \neq j \end{cases}\) It acts as a replacement operator: when summed against another tensor, it replaces one index with another, e.g. $A_{ik} \delta_{ij} = A_{jk}$.

The Levi-Civita symbol (or permutation symbol) applies to 3-dimensional spaces (and higher equivalents) and is defined completely by the parity of its index permutation: \(\epsilon_{ijk} = \begin{cases} +1 & \text{if } (i,j,k) \text{ is an even permutation of } (1,2,3) \\ -1 & \text{if } (i,j,k) \text{ is an odd permutation of } (1,2,3) \\ 0 & \text{if any index is repeated} \end{cases}\)

The nature of the Levi-Civita symbol is deeply rooted in group theory, specifically the symmetric group $S_3$. The elements of $S_3$ are permutations of a set of 3 items. Every permutation can be constructed from a series of pairwise swaps (transpositions). A permutation is “even” (sign of +1) if it requires an even number of swaps from the base unpermuted state, and “odd” (sign of -1) if it requires an odd number of swaps. The Levi-Civita symbol maps directly to the sign of the permutation group, which is why it is strictly non-zero only when all indices are pairwise different.

Proving the Epsilon-Delta Identity

One of the most famous and useful identities in vector calculus relates the product of two Levi-Civita symbols to Kronecker deltas:

\[\epsilon_{ijk}\epsilon_{ilm} = \delta_{jl}\delta_{km} - \delta_{jm}\delta_{kl}\]

Notice that the index $i$ is repeated, meaning it is being summed over $i \in \{1, 2, 3\}$. We can prove this identity using the group theory properties of the symbol.

For the left-hand side to be non-zero, $i$ must be different from $j$ and $k$, and also different from $l$ and $m$. Because $i$ takes exactly one value (the one not taken by $j,k$ and $l,m$), the remaining indices $(j,k)$ and $(l,m)$ must be assigned from the exact same set of two remaining integers.

This leaves exactly two possibilities for the indices $l$ and $m$:

  1. $\mathbf{j = l}$ and $\mathbf{k = m}$ (The indices match exactly). In this case, the permutations $(i,j,k)$ and $(i,l,m)$ are identical. Therefore, their group parity is the same. The product of their signs will always be $(\pm 1)^2 = 1$. Looking at the right-hand expression for this case, $\delta_{jl}\delta_{km} = (1)(1) = 1$, and $\delta_{jm}\delta_{kl} = (0)(0) = 0$. The result is $1$, which perfectly matches.

  2. $\mathbf{j = m}$ and $\mathbf{k = l}$ (The indices are swapped). In this case, $(i,l,m)$ is a single transposition away from $(i,j,k)$. Because a single swap flips the parity of the permutation under the symmetric group, their signs will be opposite. Their product will always be $(+1)(-1) = -1$. Looking at the right-hand expression, $\delta_{jl}\delta_{km} = (0)(0) = 0$, and $\delta_{jm}\delta_{kl} = (1)(1) = 1$. The result is $0 - 1 = -1$, which again matches perfectly.

By framing it under permutation parities, the identity emerges naturally, giving us a powerful mechanism for simplifying cross products and curl operations into native tensor algebra!
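As a closing sanity check, the identity can be verified numerically by building the Levi-Civita symbol as a $3 \times 3 \times 3$ array and comparing both sides (a minimal sketch using `numpy.einsum`):

```python
import numpy as np
from itertools import permutations

# Build the Levi-Civita symbol as a 3x3x3 array
eps = np.zeros((3, 3, 3))
for i, j, k in permutations(range(3)):
    # sign of the permutation: +1 for even, -1 for odd
    eps[i, j, k] = np.sign((j - i) * (k - i) * (k - j))

delta = np.eye(3)

# Left-hand side: sum over i of eps_ijk eps_ilm
lhs = np.einsum('ijk,ilm->jklm', eps, eps)
# Right-hand side: delta_jl delta_km - delta_jm delta_kl
rhs = np.einsum('jl,km->jklm', delta, delta) - np.einsum('jm,kl->jklm', delta, delta)

print(np.allclose(lhs, rhs))   # True
```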

Conclusion: The Matrix Cookbook

While deriving derivatives from first principles using infinitesimals and linearised differentials provides profound intuition, recalculating large expressions from scratch every time is impractical.

As a concluding takeaway, the practicing engineer and computer scientist’s best friend is The Matrix Cookbook by Petersen and Pedersen. For real-world implementation, you establish your loss function or linear system, and simply look up the resulting gradients in the Cookbook to work them directly into your algorithms.

Important matrix calculus results can be found in The Matrix Cookbook.

31/3/2026