Proof of weight normalization's mechanism



In Tim Salimans’ paper on Weight Normalization, the derivation of the gradient of the loss w.r.t. v is not spelled out. I’d be really glad if someone could show the exact steps to arrive at equation (3) from equation (2).


Not sure how this is done, but I would be interested in finding out too.
The paper is nevertheless very interesting.


@yaroslavvb helped me out with the derivation. It’s pretty neat and understandable if you’re comfortable with the basics of vector calculus.

The new gradient G* is written in terms of the old gradient G as given in his derivation. For computing the differential of the loss L, you can follow his notes where he derived K-FAC, an approximate second-order method for optimizing neural nets.
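To save others the search, here is a sketch of the steps from eq. (2) to eq. (3) in the paper's notation, where $w = \frac{g}{\|v\|}\,v$ (eq. (2)) and $G = \nabla_w L$. It's just the chain rule through the reparameterization:

```latex
% Gradient w.r.t. g: w depends on g linearly through v/||v||
\nabla_g L = \nabla_w L \cdot \frac{\partial w}{\partial g}
           = \frac{\nabla_w L \cdot v}{\|v\|}

% Gradient w.r.t. v: apply the Jacobian of w(v) = (g/||v||) v,
%   dw/dv = (g/||v||) I - (g/||v||^3) v v^T
\nabla_v L = \left(\frac{\partial w}{\partial v}\right)^{\!T} \nabla_w L
           = \frac{g}{\|v\|}\,\nabla_w L
             - \frac{g\,(\nabla_w L \cdot v)}{\|v\|^3}\, v
           = \frac{g}{\|v\|}\,\nabla_w L
             - \frac{g\,\nabla_g L}{\|v\|^2}\, v
```

The last equality just substitutes the expression for $\nabla_g L$ from the first line, which gives exactly eq. (3).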

The fundamental idea behind weight normalization is that by reparameterizing the weights as g*v/norm(v), you decouple the magnitude of the weight vector from its direction. A consequence is that the gradient with respect to v is always orthogonal to the current weight vector v. This stabilizes the scale of the weight updates, which in turn aids optimization.
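Both the gradient formula and the orthogonality property are easy to check numerically. Here is a small NumPy sketch (the linear loss `L(w) = c . w` is just a toy stand-in so that `grad_w L = c` is known exactly; the vectors and `g` are arbitrary test values):

```python
import numpy as np

rng = np.random.default_rng(0)
v = rng.normal(size=5)          # unnormalized weight vector
g = 1.7                         # scalar magnitude parameter
c = rng.normal(size=5)          # toy linear loss L(w) = c . w, so grad_w L = c

def loss(v_):
    # Loss evaluated through the reparameterization w = g * v / ||v||
    w = g * v_ / np.linalg.norm(v_)
    return c @ w

norm_v = np.linalg.norm(v)
grad_w = c
# Equation (3) of the paper:
grad_g = grad_w @ v / norm_v
grad_v = (g / norm_v) * grad_w - (g * grad_g / norm_v**2) * v

# Check 1: grad_v matches a central finite difference of the loss w.r.t. v
eps = 1e-6
fd = np.array([(loss(v + eps * np.eye(5)[i]) - loss(v - eps * np.eye(5)[i]))
               / (2 * eps) for i in range(5)])
print(np.max(np.abs(fd - grad_v)))   # tiny (finite-difference error only)

# Check 2: grad_v is orthogonal to v (and hence to w)
print(abs(grad_v @ v))               # ~0 up to float rounding
```

Running this shows the analytic gradient agreeing with the finite-difference one and the dot product `grad_v @ v` vanishing, which is the decoupling of magnitude from direction in action.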