Preface
“What does it actually mean to take the derivative of a loss function with respect to a matrix?” I wondered one day during my PhD.
“A matrix is nothing but a bunch of numbers structured in a 2D format. So, you just take the derivative of the loss with respect to each of these numbers and then put them in the same format at the corresponding positions. That’s what the derivative of a loss with respect to that matrix is.”
Okay, that is very convenient, but at the same time quite a tedious thing to do if we have to do it by hand.
“I’m sure there must be some smarter way to go about this. What is it?”
This blog is what I discovered as an early PhD student, and today I decided to share it with others who might be wondering too.
Differentiating with respect to a Vector
Before going into matrix derivatives, let’s think about derivatives with respect to vectors. Let $f$ be a scalar-valued function of a vector $x$; i.e., $f: \mathbb{R}^n \to \mathbb{R}$. Think of this as a loss function depending on some vector $x$ of dimension $n$.
The derivative $\frac{\partial f}{\partial x}$ is a vector in $\mathbb{R}^n$. By convention, we assume all vectors are column vectors and transpose whenever we need row vectors. Also, $\frac{\partial f}{\partial x}$ denotes a vector with individual elements $\left[\frac{\partial f}{\partial x}\right]_i = \frac{\partial f}{\partial x_i}$.
Example:
Let $a, x \in \mathbb{R}^n$ and $f(x) = a^\top x = \sum_{i=1}^{n} a_i x_i$. Then, $\frac{\partial f}{\partial x} = a$.
Note that the column vector $\frac{\partial f}{\partial x}$ is constructed by putting the derivative of $f$ with respect to the $i$-th element of $x$, namely $\frac{\partial f}{\partial x_i} = a_i$, at the $i$-th position.
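To make this concrete, here is a quick NumPy sanity check (the random data, step size, and tolerance are illustrative choices, not part of the math). It builds the gradient exactly as the definition says, one element at a time:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
a = rng.standard_normal(n)
x = rng.standard_normal(n)

f = lambda v: a @ v  # f(x) = a^T x

# Perturb x_i, measure the change in f, divide by the step size,
# and place the result at the i-th position of the gradient.
eps = 1e-6
grad = np.zeros(n)
for i in range(n):
    e = np.zeros(n)
    e[i] = eps
    grad[i] = (f(x + e) - f(x)) / eps

print(np.allclose(grad, a, atol=1e-5))  # True: df/dx = a
```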
We are interested in developing tools where we can perform such derivations directly. For example, if $A \in \mathbb{R}^{n \times n}$ is any matrix and $f(x) = x^\top A x$, then we want to be able to show $\frac{\partial f}{\partial x} = (A + A^\top)\,x$. This can be shown with a cumbersome element-wise calculation as follows:
Let $f(x) = x^\top A x = \sum_{i}\sum_{j} A_{ij}\, x_i x_j$. Then, for the $k$-th element,
$\frac{\partial f}{\partial x_k} = \sum_{j} A_{kj} x_j + \sum_{i} A_{ik} x_i = (Ax)_k + (A^\top x)_k$. Stacking these at the corresponding positions gives $\frac{\partial f}{\partial x} = Ax + A^\top x = (A + A^\top)\,x$.
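A small finite-difference sketch (illustrative only) confirms this result numerically:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
A = rng.standard_normal((n, n))
x = rng.standard_normal(n)

f = lambda v: v @ A @ v  # f(x) = x^T A x

# Finite-difference gradient, one coordinate at a time.
eps = 1e-6
grad = np.array([(f(x + eps * np.eye(n)[i]) - f(x)) / eps for i in range(n)])

print(np.allclose(grad, (A + A.T) @ x, atol=1e-4))  # True
```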
However, we want to do this in an easier way. Moreover, we also want the technique to be useful for deriving derivatives with respect to matrices.
Use of Differentials
Using the Taylor expansion: $f(x + dx) = f(x) + \left(\frac{\partial f}{\partial x}\right)^\top dx + \text{higher-order terms in } dx$.
The key idea here is that if we take $dx$ sufficiently small, then $df := f(x + dx) - f(x) = \left(\frac{\partial f}{\partial x}\right)^\top dx$. Why? Because all the rest of the terms are higher powers of $dx$ and will converge to zero in the limiting case. In fact, the very definition of the derivative comes from this limiting case of the Taylor expansion if we think about it. So, we can use this trick to compute derivatives. Hence, given any function $f$, we calculate the differential $df$, massage it into the form $df = g^\top dx$, and then obtain $\frac{\partial f}{\partial x} = g$.
Example:
$df = (x + dx)^\top A\,(x + dx) - x^\top A\,x = x^\top A\, dx + dx^\top A\, x + dx^\top A\, dx$. In the limiting case, as $dx \to 0$, the higher-order term $dx^\top A\, dx$ vanishes:
$df = x^\top A\, dx + dx^\top A\, x$.
Using $dx^\top A\, x = x^\top A^\top dx$ (because $dx^\top A\, x$ is a scalar, so it’s equal to its transpose),
$df = x^\top A\, dx + x^\top A^\top dx = \left((A + A^\top)\,x\right)^\top dx$. Therefore, $\frac{\partial f}{\partial x} = (A + A^\top)\,x$.
Using the same procedure, it is easy to show $\frac{\partial \left(a^\top x\right)}{\partial x} = a$.
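To see the “higher-order terms vanish” claim numerically, here is a small sketch (the scales are arbitrary): the gap between $df$ and the first-order term $\left(\frac{\partial f}{\partial x}\right)^\top dx$ shrinks quadratically as $dx$ shrinks.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4
A = rng.standard_normal((n, n))
x = rng.standard_normal(n)
dx = rng.standard_normal(n)

f = lambda v: v @ A @ v
g = (A + A.T) @ x  # the gradient we just derived

# Scaling dx by 10x should shrink the error by ~100x,
# because the leftover term dx^T A dx is quadratic in dx.
for scale in [1e-2, 1e-3, 1e-4]:
    df = f(x + scale * dx) - f(x)
    print(f"scale {scale:.0e}: error {abs(df - g @ (scale * dx)):.2e}")
```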
This technique of using differentials is powerful because the same approach extends to finding derivatives with respect to matrices.
Derivatives with respect to Matrices
Let $f$ be a function of a matrix. We denote $y = f(X)$, where $X \in \mathbb{R}^{m \times n}$. Note that $y$ is a scalar. What does it mean to take the derivative w.r.t. a matrix? The derivative $\frac{\partial f}{\partial X}$ is an $m \times n$ matrix with elements $\left[\frac{\partial f}{\partial X}\right]_{ij} = \frac{\partial f}{\partial X_{ij}}$. That is, we differentiate $f$ with respect to each element $X_{ij}$ of the matrix $X$, and then we fill the $(i, j)$ position of the matrix with this derivative to obtain the required derivative.
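As a sanity check of this element-wise definition, here is a NumPy sketch. It uses $f(X) = a^\top X b$, whose derivative works out to $a b^\top$, purely as an illustration; the choice of function is mine, not part of the derivation above:

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 3, 4
X = rng.standard_normal((m, n))
a = rng.standard_normal(m)
b = rng.standard_normal(n)

f = lambda M: a @ M @ b  # f(X) = a^T X b, a scalar

# Differentiate w.r.t. each element X[i, j] and place the result at (i, j).
eps = 1e-6
grad = np.zeros((m, n))
for i in range(m):
    for j in range(n):
        E = np.zeros((m, n))
        E[i, j] = eps
        grad[i, j] = (f(X + E) - f(X)) / eps

print(np.allclose(grad, np.outer(a, b), atol=1e-5))  # True: df/dX = a b^T
```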
Interestingly, we can use the same procedure of differentials to find derivatives with respect to matrices.
Steps to calculate the derivative w.r.t. a matrix $X$:
- Express $df = f(X + dX) - f(X)$, dropping the terms that are higher order in $dX$.
- Rearrange it into the form $df = \operatorname{tr}\left(G^\top dX\right)$ for some matrix $G$.
- Read off the derivative: $\frac{\partial f}{\partial X} = G$.
Why is there a trace of $G^\top$ and $dX$? Because at this point we need to define the inner product of two matrices. The Taylor-series equivalent with a matrix as the argument requires us to define an inner product on the space of matrices. For now, we can think of $\langle A, B \rangle = \operatorname{tr}(A^\top B)$ as the inner product of two matrices $A$ and $B$. To see this, you can take two small matrices and check that this expression gives us a scalar and satisfies the inner product properties, like linearity and inducing a norm ($\operatorname{tr}(A^\top A) = \lVert A \rVert_F^2$), etc.
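A quick numerical check of why $\operatorname{tr}(A^\top B)$ deserves to be called an inner product (the matrices here are arbitrary): it is exactly the element-wise dot product of the two matrices, and pairing a matrix with itself gives its squared Frobenius norm.

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((3, 3))
B = rng.standard_normal((3, 3))

# tr(A^T B) equals the sum of elementwise products of A and B.
print(np.isclose(np.trace(A.T @ B), np.sum(A * B)))          # True
# Pairing A with itself gives the squared Frobenius norm.
print(np.isclose(np.trace(A.T @ A), np.linalg.norm(A)**2))   # True
```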
Example 1: $f(X) = \operatorname{tr}\left(X^\top X\right)$. Then, $\frac{\partial f}{\partial X} = 2X$.
Proof: $df = \operatorname{tr}\left((X + dX)^\top (X + dX)\right) - \operatorname{tr}\left(X^\top X\right) = \operatorname{tr}\left(X^\top dX + dX^\top X + dX^\top dX\right)$.
Setting higher-order terms of $dX$ to zero: $df = \operatorname{tr}\left(X^\top dX + dX^\top X\right)$.
Using the properties $\operatorname{tr}(A + B) = \operatorname{tr}(A) + \operatorname{tr}(B)$ and $\operatorname{tr}(A^\top) = \operatorname{tr}(A)$: $df = 2\,\operatorname{tr}\left(X^\top dX\right) = \operatorname{tr}\left((2X)^\top dX\right)$.
Therefore, $\frac{\partial f}{\partial X} = 2X$.
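We can verify this result with finite differences (an illustrative sketch, not part of the derivation):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 3
X = rng.standard_normal((n, n))

f = lambda M: np.trace(M.T @ M)

# Element-wise finite-difference gradient w.r.t. the matrix X.
eps = 1e-6
grad = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        E = np.zeros((n, n))
        E[i, j] = eps
        grad[i, j] = (f(X + E) - f(X)) / eps

print(np.allclose(grad, 2 * X, atol=1e-4))  # True: df/dX = 2X
```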
Example 3: $f(X) = \operatorname{tr}\left(X^{-1} A\right)$. Then, $\frac{\partial f}{\partial X} = -\left(X^{-1} A\, X^{-1}\right)^\top$.
Proof: By now, you must have realized from the above exercises that the differential operator is linear. So, we can apply some tricks, like moving the differential operator inside another linear operator such as the trace: $d\,\operatorname{tr}(\cdot) = \operatorname{tr}\left(d(\cdot)\right)$. A direct application of this is as follows:
$df = d\,\operatorname{tr}\left(X^{-1} A\right) = \operatorname{tr}\left(d\left(X^{-1} A\right)\right) = \operatorname{tr}\left(d\left(X^{-1}\right) A\right)$, since $A$ is constant.
At this point, we use the following identity: $d\left(X^{-1}\right) = -X^{-1}\, dX\, X^{-1}$ —(1)
The proof will be provided later. For now, using this identity: $df = \operatorname{tr}\left(-X^{-1}\, dX\, X^{-1} A\right)$.
Using the cyclic property of trace, $\operatorname{tr}(ABC) = \operatorname{tr}(CAB) = \operatorname{tr}(BCA)$:
$df = \operatorname{tr}\left(-X^{-1} A\, X^{-1}\, dX\right) = \operatorname{tr}\left(\left(-\left(X^{-1} A\, X^{-1}\right)^\top\right)^\top dX\right)$.
Therefore, $\frac{\partial f}{\partial X} = -\left(X^{-1} A\, X^{-1}\right)^\top = -X^{-\top} A^\top X^{-\top}$.
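Again, an illustrative finite-difference check (the shift by $nI$ is only there to keep $X$ safely invertible):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 3
X = rng.standard_normal((n, n)) + n * np.eye(n)  # well-conditioned X
A = rng.standard_normal((n, n))

f = lambda M: np.trace(np.linalg.inv(M) @ A)

eps = 1e-6
grad = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        E = np.zeros((n, n))
        E[i, j] = eps
        grad[i, j] = (f(X + E) - f(X)) / eps

Xinv = np.linalg.inv(X)
print(np.allclose(grad, -(Xinv @ A @ Xinv).T, atol=1e-4))  # True
```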
Proof of equation (1):
We know $X X^{-1} = I$. Taking differentials on both sides: $d\left(X X^{-1}\right) = d(I) = 0$.
Or, $dX\, X^{-1} + X\, d\left(X^{-1}\right) = 0$ (by the product rule for differentials).
Or, $X\, d\left(X^{-1}\right) = -dX\, X^{-1}$.
Multiplying both sides by $X^{-1}$ from the left: $d\left(X^{-1}\right) = -X^{-1}\, dX\, X^{-1}$.
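A quick numerical confirmation of identity (1) (the perturbation size is arbitrary; the two sides agree up to second-order terms in $dX$):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 3
X = rng.standard_normal((n, n)) + n * np.eye(n)  # safely invertible
dX = 1e-6 * rng.standard_normal((n, n))          # a small perturbation

lhs = np.linalg.inv(X + dX) - np.linalg.inv(X)   # the actual change in X^{-1}
rhs = -np.linalg.inv(X) @ dX @ np.linalg.inv(X)  # the first-order formula (1)

print(np.allclose(lhs, rhs, atol=1e-10))  # True up to second-order terms
```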
Example 4:
$f(X) = \log \det(X)$, where $\det(X) > 0$. Then, $\frac{\partial f}{\partial X} = X^{-\top}$.
The proof of example 4 uses the identity $d\left(\log \det(X)\right) = \operatorname{tr}\left(X^{-1}\, dX\right)$ — (2).
For a proof of this identity, see the Convex Optimization book [2].
Using (2): $df = \operatorname{tr}\left(X^{-1}\, dX\right) = \operatorname{tr}\left(\left(X^{-\top}\right)^\top dX\right)$.
Therefore, $\frac{\partial f}{\partial X} = X^{-\top}$.
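And one last finite-difference check (constructing $X$ as positive definite is just my way of guaranteeing $\det(X) > 0$):

```python
import numpy as np

rng = np.random.default_rng(8)
n = 3
B = rng.standard_normal((n, n))
X = B @ B.T + n * np.eye(n)  # positive definite, so det(X) > 0

f = lambda M: np.log(np.linalg.det(M))

eps = 1e-6
grad = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        E = np.zeros((n, n))
        E[i, j] = eps
        grad[i, j] = (f(X + E) - f(X)) / eps

print(np.allclose(grad, np.linalg.inv(X).T, atol=1e-4))  # True: df/dX = X^{-T}
```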