@yaroslavvb helped me out with the derivation. It’s pretty neat and understandable if you’re comfortable with the basics of vector calculus.
The new gradient G* is written in terms of the old gradient G in his derivation. For computing the differential of the loss L, you can follow his notes on deriving KFAC, a second-order method for optimizing neural nets.
The fundamental idea behind weight normalization is that by reparameterizing the weights as w = g*v/norm(v), decoupling the magnitude g from the direction v, you ensure that the gradient with respect to v is orthogonal to the current weight vector v. This stabilizes the weight updates, which in turn aids optimization.
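As a quick numerical sketch (not the referenced derivation itself, just the standard chain-rule result for the reparameterization w = g*v/norm(v)), here is how the gradients with respect to g and v follow from an ordinary weight gradient grad_w, and a check of the orthogonality property; the random values are placeholders:

```python
import numpy as np

# With w = g * v / ||v||, the chain rule gives:
#   grad_g = grad_w . v / ||v||
#   grad_v = (g / ||v||) * grad_w - (g * grad_g / ||v||**2) * v

rng = np.random.default_rng(0)
v = rng.normal(size=5)               # direction parameter
g = 2.0                              # magnitude parameter
grad_w = rng.normal(size=5)          # stand-in for dL/dw

norm_v = np.linalg.norm(v)
grad_g = grad_w @ v / norm_v
grad_v = (g / norm_v) * grad_w - (g * grad_g / norm_v**2) * v

# The v-gradient is orthogonal to the current weight vector v,
# which is the decoupling property described above.
print(np.isclose(grad_v @ v, 0.0))  # True
```

Plugging grad_v back into the dot product with v makes the orthogonality obvious: the two terms both reduce to g*grad_g and cancel exactly.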