Preface
“What does it actually mean to take the derivative of a loss function with respect to a matrix?” I wondered one day during my PhD.
“A matrix is nothing but a bunch of numbers structured in a 2D format. So, you just take the derivative of the loss with respect to each of these numbers and then put them in the same format at the corresponding positions. That’s what the derivative of a loss with respect to that matrix is.”
Okay, that is very convenient, but at the same time quite a tedious thing to do if we have to do it by hand.
“I’m sure there must be some smarter way to go about this. What is it?”
This blog is what I discovered as an early PhD student, and today I decided to share it with others who might be wondering too.
Differentiating with respect to a Vector
Before going into matrix derivatives, let’s think about derivatives with respect to vectors. Let $f$ be a scalar-valued function of a vector $x$; i.e., $f: \mathbb{R}^n \to \mathbb{R}$. Think of this as a loss function depending on some vector $x$ of dimension $n$.
The derivative $\frac{\partial f}{\partial x}$ is a vector in $\mathbb{R}^n$. By convention, we assume all vectors are column vectors and transpose whenever we need row vectors. Also, $\frac{\partial f}{\partial x}$ denotes a vector with individual elements $\left[\frac{\partial f}{\partial x}\right]_i = \frac{\partial f}{\partial x_i}$.
Example:
Let $a, x \in \mathbb{R}^n$ and $f(x) = a^\top x = \sum_{i=1}^{n} a_i x_i$. Then, $\frac{\partial f}{\partial x} = a$.
Note that the column vector $\frac{\partial f}{\partial x}$ is constructed by putting the derivative of $f$ with respect to the $i$-th element of $x$, namely $\frac{\partial f}{\partial x_i} = a_i$, at the $i$-th position.
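To make this concrete, here is a quick NumPy sanity check (the random data, step size, and tolerance are illustrative choices, not part of the math). It builds the gradient exactly as the definition says, one element at a time:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
a = rng.standard_normal(n)
x = rng.standard_normal(n)

f = lambda v: a @ v  # f(x) = a^T x

# Perturb x_i, measure the change in f, divide by the step size,
# and place the result at the i-th position of the gradient.
eps = 1e-6
grad = np.zeros(n)
for i in range(n):
    e = np.zeros(n)
    e[i] = eps
    grad[i] = (f(x + e) - f(x)) / eps

print(np.allclose(grad, a, atol=1e-5))  # True: df/dx = a
```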
We are interested in developing tools where we can perform such derivations directly. For example, if $A \in \mathbb{R}^{n \times n}$ is any matrix and $f(x) = x^\top A x$, then we want to be able to show $\frac{\partial f}{\partial x} = (A + A^\top)\,x$. This can be shown with a cumbersome element-wise calculation as follows:
Let $f(x) = x^\top A x = \sum_{i}\sum_{j} A_{ij}\, x_i x_j$. Then, for the $k$-th element,
$\frac{\partial f}{\partial x_k} = \sum_{j} A_{kj} x_j + \sum_{i} A_{ik} x_i = (Ax)_k + (A^\top x)_k$. Stacking these at the corresponding positions gives $\frac{\partial f}{\partial x} = Ax + A^\top x = (A + A^\top)\,x$.
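A small finite-difference sketch (illustrative only) confirms this result numerically:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
A = rng.standard_normal((n, n))
x = rng.standard_normal(n)

f = lambda v: v @ A @ v  # f(x) = x^T A x

# Finite-difference gradient, one coordinate at a time.
eps = 1e-6
grad = np.array([(f(x + eps * np.eye(n)[i]) - f(x)) / eps for i in range(n)])

print(np.allclose(grad, (A + A.T) @ x, atol=1e-4))  # True
```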
However, we want to do this in an easier way. Moreover, we also want the technique to be useful for deriving derivatives with respect to matrices.
Use of Differentials
Using the Taylor expansion: $f(x + dx) = f(x) + \left(\frac{\partial f}{\partial x}\right)^\top dx + \text{higher-order terms in } dx$.
The key idea here is that if we take $dx$ sufficiently small, then $df := f(x + dx) - f(x) = \left(\frac{\partial f}{\partial x}\right)^\top dx$. Why? Because all the rest of the terms are higher powers of $dx$ and will converge to zero in the limiting case. In fact, the very definition of the derivative comes from this limiting case of the Taylor expansion if we think about it. So, we can use this trick to compute derivatives. Hence, given any function $f$, we calculate the differential $df$, massage it into the form $df = g^\top dx$, and then obtain $\frac{\partial f}{\partial x} = g$.
Example:
$df = (x + dx)^\top A\,(x + dx) - x^\top A\,x = x^\top A\, dx + dx^\top A\, x + dx^\top A\, dx$. In the limiting case, as $dx \to 0$, the higher-order term $dx^\top A\, dx$ vanishes:
$df = x^\top A\, dx + dx^\top A\, x$.
Using $dx^\top A\, x = x^\top A^\top dx$ (because $dx^\top A\, x$ is a scalar, so it’s equal to its transpose),
$df = x^\top A\, dx + x^\top A^\top dx = \left((A + A^\top)\,x\right)^\top dx$. Therefore, $\frac{\partial f}{\partial x} = (A + A^\top)\,x$.
Using the same procedure, it is easy to show $\frac{\partial \left(a^\top x\right)}{\partial x} = a$.
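To see the “higher-order terms vanish” claim numerically, here is a small sketch (the scales are arbitrary): the gap between $df$ and the first-order term $\left(\frac{\partial f}{\partial x}\right)^\top dx$ shrinks quadratically as $dx$ shrinks.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4
A = rng.standard_normal((n, n))
x = rng.standard_normal(n)
dx = rng.standard_normal(n)

f = lambda v: v @ A @ v
g = (A + A.T) @ x  # the gradient we just derived

# Scaling dx by 10x should shrink the error by ~100x,
# because the leftover term dx^T A dx is quadratic in dx.
for scale in [1e-2, 1e-3, 1e-4]:
    df = f(x + scale * dx) - f(x)
    print(f"scale {scale:.0e}: error {abs(df - g @ (scale * dx)):.2e}")
```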
This technique of using differentials is powerful because the same approach extends to finding derivatives with respect to matrices.
Derivatives with respect to Matrices
Let $f$ be a function of a matrix. We denote $y = f(X)$, where $X \in \mathbb{R}^{m \times n}$. Note that $y$ is a scalar. What does it mean to take the derivative w.r.t. a matrix? The derivative $\frac{\partial f}{\partial X}$ is an $m \times n$ matrix with elements $\left[\frac{\partial f}{\partial X}\right]_{ij} = \frac{\partial f}{\partial X_{ij}}$. That is, we differentiate $f$ with respect to each element $X_{ij}$ of the matrix $X$, and then we fill the $(i, j)$ position of the matrix with this derivative to obtain the required derivative.
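As a sanity check of this element-wise definition, here is a NumPy sketch. It uses $f(X) = a^\top X b$, whose derivative works out to $a b^\top$, purely as an illustration; the choice of function is mine, not part of the derivation above:

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 3, 4
X = rng.standard_normal((m, n))
a = rng.standard_normal(m)
b = rng.standard_normal(n)

f = lambda M: a @ M @ b  # f(X) = a^T X b, a scalar

# Differentiate w.r.t. each element X[i, j] and place the result at (i, j).
eps = 1e-6
grad = np.zeros((m, n))
for i in range(m):
    for j in range(n):
        E = np.zeros((m, n))
        E[i, j] = eps
        grad[i, j] = (f(X + E) - f(X)) / eps

print(np.allclose(grad, np.outer(a, b), atol=1e-5))  # True: df/dX = a b^T
```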
Interestingly, we can use the same procedure of differentials to find derivatives with respect to matrices.
Steps to calculate the derivative w.r.t. a matrix $X$:
- Express $df = f(X + dX) - f(X)$, dropping the terms that are higher order in $dX$.
- Rearrange it into the form $df = \operatorname{tr}\left(G^\top dX\right)$ for some matrix $G$.
- Read off the derivative: $\frac{\partial f}{\partial X} = G$.
Why is there a trace of $G^\top$ and $dX$? Because at this point we need to define the inner product of two matrices. The Taylor-series equivalent with a matrix as the argument requires us to define an inner product on the space of matrices. For now, we can think of $\langle A, B \rangle = \operatorname{tr}(A^\top B)$ as the inner product of two matrices $A$ and $B$. To see this, you can take two small matrices and check that this expression gives us a scalar and satisfies the inner product properties, like linearity and inducing a norm ($\operatorname{tr}(A^\top A) = \lVert A \rVert_F^2$), etc.
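A quick numerical check of why $\operatorname{tr}(A^\top B)$ deserves to be called an inner product (the matrices here are arbitrary): it is exactly the element-wise dot product of the two matrices, and pairing a matrix with itself gives its squared Frobenius norm.

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((3, 3))
B = rng.standard_normal((3, 3))

# tr(A^T B) equals the sum of elementwise products of A and B.
print(np.isclose(np.trace(A.T @ B), np.sum(A * B)))          # True
# Pairing A with itself gives the squared Frobenius norm.
print(np.isclose(np.trace(A.T @ A), np.linalg.norm(A)**2))   # True
```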
Example 1: $f(X) = \operatorname{tr}\left(X^\top X\right)$. Then, $\frac{\partial f}{\partial X} = 2X$.
Proof: $df = \operatorname{tr}\left((X + dX)^\top (X + dX)\right) - \operatorname{tr}\left(X^\top X\right) = \operatorname{tr}\left(X^\top dX + dX^\top X + dX^\top dX\right)$.
Setting higher-order terms of $dX$ to zero: $df = \operatorname{tr}\left(X^\top dX + dX^\top X\right)$.
Using the properties $\operatorname{tr}(A + B) = \operatorname{tr}(A) + \operatorname{tr}(B)$ and $\operatorname{tr}(A^\top) = \operatorname{tr}(A)$: $df = 2\,\operatorname{tr}\left(X^\top dX\right) = \operatorname{tr}\left((2X)^\top dX\right)$.
Therefore, $\frac{\partial f}{\partial X} = 2X$.
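We can verify this result with finite differences (an illustrative sketch, not part of the derivation):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 3
X = rng.standard_normal((n, n))

f = lambda M: np.trace(M.T @ M)

# Element-wise finite-difference gradient w.r.t. the matrix X.
eps = 1e-6
grad = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        E = np.zeros((n, n))
        E[i, j] = eps
        grad[i, j] = (f(X + E) - f(X)) / eps

print(np.allclose(grad, 2 * X, atol=1e-4))  # True: df/dX = 2X
```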
Example 3: $f(X) = \operatorname{tr}\left(X^{-1} A\right)$. Then, $\frac{\partial f}{\partial X} = -\left(X^{-1} A\, X^{-1}\right)^\top$.
Proof: By now, you must have realized from the above exercises that the differential operator is linear. So, we can apply some tricks, like moving the differential operator inside another linear operator such as the trace: $d\,\operatorname{tr}(\cdot) = \operatorname{tr}\left(d(\cdot)\right)$. A direct application of this is as follows:
$df = d\,\operatorname{tr}\left(X^{-1} A\right) = \operatorname{tr}\left(d\left(X^{-1} A\right)\right) = \operatorname{tr}\left(d\left(X^{-1}\right) A\right)$, since $A$ is constant.
At this point, we use the following identity: $d\left(X^{-1}\right) = -X^{-1}\, dX\, X^{-1}$ —(1)
The proof will be provided later. For now, using this identity: $df = \operatorname{tr}\left(-X^{-1}\, dX\, X^{-1} A\right)$.
Using the cyclic property of trace, $\operatorname{tr}(ABC) = \operatorname{tr}(CAB) = \operatorname{tr}(BCA)$:
$df = \operatorname{tr}\left(-X^{-1} A\, X^{-1}\, dX\right) = \operatorname{tr}\left(\left(-\left(X^{-1} A\, X^{-1}\right)^\top\right)^\top dX\right)$.
Therefore, $\frac{\partial f}{\partial X} = -\left(X^{-1} A\, X^{-1}\right)^\top = -X^{-\top} A^\top X^{-\top}$.
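Again, an illustrative finite-difference check (the shift by $nI$ is only there to keep $X$ safely invertible):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 3
X = rng.standard_normal((n, n)) + n * np.eye(n)  # well-conditioned X
A = rng.standard_normal((n, n))

f = lambda M: np.trace(np.linalg.inv(M) @ A)

eps = 1e-6
grad = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        E = np.zeros((n, n))
        E[i, j] = eps
        grad[i, j] = (f(X + E) - f(X)) / eps

Xinv = np.linalg.inv(X)
print(np.allclose(grad, -(Xinv @ A @ Xinv).T, atol=1e-4))  # True
```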
Proof of equation (1):
We know $X X^{-1} = I$. Taking differentials on both sides: $d\left(X X^{-1}\right) = d(I) = 0$.
Or, $dX\, X^{-1} + X\, d\left(X^{-1}\right) = 0$ (by the product rule for differentials).
Or, $X\, d\left(X^{-1}\right) = -dX\, X^{-1}$.
Multiplying both sides by $X^{-1}$ from the left: $d\left(X^{-1}\right) = -X^{-1}\, dX\, X^{-1}$.
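A quick numerical confirmation of identity (1) (the perturbation size is arbitrary; the two sides agree up to second-order terms in $dX$):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 3
X = rng.standard_normal((n, n)) + n * np.eye(n)  # safely invertible
dX = 1e-6 * rng.standard_normal((n, n))          # a small perturbation

lhs = np.linalg.inv(X + dX) - np.linalg.inv(X)   # the actual change in X^{-1}
rhs = -np.linalg.inv(X) @ dX @ np.linalg.inv(X)  # the first-order formula (1)

print(np.allclose(lhs, rhs, atol=1e-10))  # True up to second-order terms
```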
Example 4:
$f(X) = \log \det(X)$, where $\det(X) > 0$. Then, $\frac{\partial f}{\partial X} = X^{-\top}$.
The proof of example 4 uses the identity $d\left(\log \det(X)\right) = \operatorname{tr}\left(X^{-1}\, dX\right)$ — (2).
For a proof of this identity, see the Convex Optimization book [2].
Using (2): $df = \operatorname{tr}\left(X^{-1}\, dX\right) = \operatorname{tr}\left(\left(X^{-\top}\right)^\top dX\right)$.
Therefore, $\frac{\partial f}{\partial X} = X^{-\top}$.
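And one last finite-difference check (constructing $X$ as positive definite is just my way of guaranteeing $\det(X) > 0$):

```python
import numpy as np

rng = np.random.default_rng(8)
n = 3
B = rng.standard_normal((n, n))
X = B @ B.T + n * np.eye(n)  # positive definite, so det(X) > 0

f = lambda M: np.log(np.linalg.det(M))

eps = 1e-6
grad = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        E = np.zeros((n, n))
        E[i, j] = eps
        grad[i, j] = (f(X + E) - f(X)) / eps

print(np.allclose(grad, np.linalg.inv(X).T, atol=1e-4))  # True: df/dX = X^{-T}
```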