michael orlitzky The Derivative of a Quadratic Form The Problem You want to take the derivative of f(x)=⟨Ax,x⟩=xTAx over the real numbers. You want it to make sense, so that you don't forget it. Notation Assume that all vectors are column vectors. Derivatives First, we need to talk about derivatives. The usual definition of f′(x) is, f′(x)=limh→0f(x+h)−f(x)h Where f:R→R is simply a function from one real number to another. This expression is called the derivative of f at x. Intuitively, it represents the rate of change of f at the point x. In particular—starting at x— if you move a distance of h and want to figure out how much f changed, you would multiply the derivative f′(x) by the amount of change, h, and add it to f(x) which is where you started: f(x+h)≈f(x)+f′(x)⋅h Note that the last expression, γ(h)=f′(x)⋅h is a linear function which approximates f around the point x. Generalized Derivatives Without getting into too much detail, I'll say that this approach can be generalized to functions f:Rm→Rn. The idea behind a generalized derivative is that we still want a linear approximation to f at a point x, although now we allow x to be a vector and not just a real number. We want these generalized derivatives to approximate f the same way the usual derivative does. If f′ is a generalized derivative for a function f, then we would like, f(x+h)≈f(x)+f′(x)⋅h We use the same notation for the generalized derivative as we do for the usual derivative. It shouldn't matter. The usual derivative is a special case, so we'll just think of f′(x) as that thing which satisfies the equation above. Jacobians You may have seen these generalized derivatives before; they're called Jacobians. The Jacobian of f at x (which we'll call J(f;x) below) is a matrix, and the point x is a vector. I'll claim without proof that this works: f(x+h)≈f(x)+J(f;x)⋅h where of course h is a vector as well. The vector h still represents the amount by which x changed. The formula for the Jacobian looks something like, J(f;x)=⎡⎣⎢⎢⎢⎢⎢⎢∂f1∂x1⋮∂fn∂x1∂f1∂x2⋮∂fn∂x2⋯⋯⋯∂f1∂xm⋮∂fn∂xm⎤⎦⎥⎥⎥⎥⎥⎥ where f1,f2,… are the component functions of f. That is, since the result of f(x) is an n-vector, we can think of f as working component-wise: f(x)=(f1(x),f2(x),…). In any case, when you write everything out, f(x+h)≈f(x)+⎡⎣⎢⎢⎢⎢⎢⎢∂f1∂x1⋮∂fn∂x1∂f1∂x2⋮∂fn∂x2⋯⋯⋯∂f1∂xm⋮∂fn∂xm⎤⎦⎥⎥⎥⎥⎥⎥⎡⎣⎢⎢⎢⎢h1h2⋮hm⎤⎦⎥⎥⎥⎥ None of this is too important if you've never seen a Jacobian before, but it will make the rest easier to remember if you have. Gradients When f:Rm→R, that is, when n=1, the Jacobian is a matrix with one row: J(f;x)=f′(x)=(∂f∂x1,∂f∂x2,…,∂f∂xm) Since there's only one component function of f, we've dropped the subscript on f1 for convenience. Again, without proof, I'll claim that this works: f(x+h)≈f(x)+f′(x)⋅h=f(x)+(∂f∂x1,∂f∂x2,…,∂f∂xm)⎡⎣⎢⎢⎢⎢h1h2⋮hm⎤⎦⎥⎥⎥⎥ The transpose of the row J(f;x)=f′(x) is usually called the gradient of f at x, and is denoted by ∇f(x). This causes great confusion, since the matrix multiplication above no longer works if we transpose the Jacobian. From now on, we'll ignore the gradient, and let J(f;x)=f′(x) be the row vector that acts as the derivative of f:Rm→R. Derivative of an Inner Product The first thing you'll need to remember to differentiate the quadratic form is the derivative of the standard inner product on Rm. Let b∈Rm be some other vector; then the inner product itself is defined as, f(x)=⟨x,b⟩=bTx=∑mk=1bk⋅xk If you have any intuition about these things, it should be clear (and why) that the derivative of this function is simply bT itself. Since b is a column vector, the sum above is bTx. One way to remember it is that “the derivative of a constant times x is the constant.” Another way is to note that bTx is a linear function that approximates f(x)—since it is f(x)—therefore, as we previously mentioned, it serves as the derivative of f. If all else fails, compute the Jacobian, and confirm that it equals bT. Note that it doesn't matter which order x and b appear in. Over the real numbers, the inner product is symmetric, so bTx=xTb both have the same derivative, bT. Derivative of a Matrix Times a Vector Now we'll compute the derivative of f(x)=Ax, where A is an m×m matrix, and x∈Rm. Intuitively, since Ax is linear, we expect its derivative to be A. This turns out to be the case. Recall that in the discussion of generalized derivatives, we said that we wanted a linear approximation of our function that satisfied, f(x+h)≈f(x)+f′(x)⋅h Well, f(x)=Ax, so f(x+h)=A(x+h)=Ax+Ah. If we set this equal to f(x)+f′(x)⋅h, then, Ax+Ah=f(x)+f′(x)⋅h Subtracting f(x)=Ax from both sides, Ah=f′(x)⋅h And dividing by h, A=f′(x) Total Derivatives Let's imagine now that f(x,y) is a function of two variables x and y which returns a real number. Then the derivative of f with respect to x is the partial derivative ∂f∂x. Recall: the derivative of f with respect to x tells us how fast f changes when x changes a little bit: f(x+h,y)≈f(x,y)+∂f∂x⋅h This all holds likewise for y. But what if y isn't independent of x? Write f(x,y(x)) to make explicit the fact that y depends on x. It makes intuitive sense that if we change x by a little bit (and thus y changes some as well), then the total change in f will be equal to (the change in f due to x changing) plus (the change in f due to y changing). The change in f due to x is roughly ∂f∂x⋅h as before. Likewise, the change in f due to y is about ∂f∂y⋅q, where q is how much y changed. But we can compute q; how much does y change when x changes by h? We already know: y(x+h)≈y(x)+∂y∂x⋅h Putting this all together, we have, f(x+h,y(x+h))≈f(x,y(x))+∂f∂x⋅h+∂f∂y∂y∂x⋅h=f(x,y(x))+(∂f∂x+∂f∂y∂y∂x)⋅h Remember that the derivative of f was the thing that, when multiplied by the change h, tells us how much f changed? The above formula therefore shows us that, f′(x,y(x))=∂f∂x+∂f∂y∂y∂x since it satisfies that criteria. Ultimately, f is a function of one variable x, so it makes sense to write f′ and speak of “the derivative of f.” The Chain Rule The only other fact we'll use is the chain rule. This should be familiar from calculus, f′(y(x))=f′(y)⋅y′(x) The chain rule actually applies to all generalized derivatives, in particular the single-row Jacobian. So it's safe to use the formula above, even when f and y are vector-valued functions. We'll use this below. The Quadratic Form With all that out of the way, this should be easy. Let, f(x)=xTAx where x∈Rm, and A is an m×m matrix. We can let y(x)=Ax so that, f(x,y(x))=xT⋅y(x) Using the formula for the total derivative above, f′(x,y(x))=∂f∂x+∂f∂y⋅y′(x) The first term ∂f∂x is the derivative of an inner product; if we hold y(x) fixed, then the derivative of xTy(x) is y(x)T, by the earlier discussion. Remember, it's a row vector. Likewise, in the second term, ∂f∂y is equal to xT. It's a row vector too. Furthermore in the second term, ∂y∂x is the derivative of a matrix times a vector. We determined above that its derivative is the coefficient matrix A. Substituting, f′(x,y(x))=y(x)T+xT⋅A Since y(x)=Ax, we have y(x)T=xTAT. We substitute again, f′(x,y(x))=xTAT+xTA=xT(AT+A) This is a derivative, and therefore, a row vector. If A is symmetric (often the case), then AT=A and we can simplify, f′(x,y(x))=2xTAT copyright sucks Under no circumstances should you send an email to ackbar@viabit.com. |

Software Engineering ➼ Machine learning ➼ Data Science ➼ Product Leadership 🎯 > Math > Linear Algebra >