[add] LayerNorm operation

Context

Add the LayerNorm operation. It is used extensively in Transformer architectures.

Forward equation

y = \frac{x - E[x] \cdot \mathbb{1}_d}{\sqrt{Var[x]+\epsilon}} \odot \gamma + \beta
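For reference, a minimal NumPy sketch of the forward equation above. The function name `layer_norm_forward`, the `(N, d)` batch layout, and the `eps=1e-5` default are illustrative assumptions, not part of the proposed operator API.

```python
import numpy as np

def layer_norm_forward(x, gamma, beta, eps=1e-5):
    """LayerNorm over the last axis (the d features).

    x:     (N, d) input batch
    gamma: (d,)   scale parameter
    beta:  (d,)   shift parameter
    Returns y plus the cached values needed by the backward pass.
    """
    mean = x.mean(axis=-1, keepdims=True)        # E[x]
    var = x.var(axis=-1, keepdims=True)          # Var[x], biased (1/d)
    sigma = np.sqrt(var + eps)                   # sqrt(Var[x] + eps)
    x_hat = (x - mean) / sigma                   # normalized input
    y = x_hat * gamma + beta                     # affine transform
    return y, x_hat, sigma
```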

Gradient equations

\hat x = \frac{x - E[x] \cdot \mathbb{1}_{d}}{\sigma[x]}, \qquad \sigma[x] = \sqrt{Var[x] + \epsilon}

  • With respect to input x:

\frac{\partial \mathcal{L}}{\partial x} = \frac{1}{\sigma [x]} \cdot \left( I - \frac{1}{d} \mathbb{1} \mathbb{1}^\top - \frac{1}{d} \hat{x} \hat{x}^\top \right) (\frac{\partial \mathcal{L}}{\partial y} \odot \gamma)

  • With respect to \gamma:

\frac{\partial \mathcal{L}}{\partial \gamma} = \sum_{n=1}^{N} \left[ \frac{\partial \mathcal{L}}{\partial y^{(n)}} \odot \hat{x}^{(n)} \right]

  • With respect to \beta:

\frac{\partial \mathcal{L}}{\partial \beta} = \sum_{n=1}^{N} \frac{\partial \mathcal{L}}{\partial y^{(n)}}

where d is the number of features, \mathbb{1}_d the d-dimensional vector of ones, \odot the Hadamard product, and the sums run over the N samples of the batch.
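The three gradients can be sketched in the same way. Here `sigma` and `x_hat` are the values cached by the forward sketch above, and the row-wise expression applies the Jacobian I - \frac{1}{d}\mathbb{1}\mathbb{1}^\top - \frac{1}{d}\hat{x}\hat{x}^\top without materializing it; names and layout are again assumptions for illustration.

```python
def layer_norm_backward(dy, x_hat, sigma, gamma):
    """Gradients of LayerNorm from the equations above.

    dy:    (N, d) upstream gradient dL/dy
    x_hat: (N, d) normalized input cached in the forward pass
    sigma: (N, 1) sqrt(Var[x] + eps) cached in the forward pass
    gamma: (d,)   scale parameter
    """
    dx_hat = dy * gamma                                   # dL/dy ⊙ γ
    # (I - 1/d 11ᵀ - 1/d x̂ x̂ᵀ) applied row by row, then scaled by 1/σ[x]
    dx = (dx_hat
          - dx_hat.mean(axis=-1, keepdims=True)
          - x_hat * (dx_hat * x_hat).mean(axis=-1, keepdims=True)) / sigma
    dgamma = (dy * x_hat).sum(axis=0)                     # sum over the batch
    dbeta = dy.sum(axis=0)                                # sum over the batch
    return dx, dgamma, dbeta
```

A finite-difference comparison against the forward sketch is a quick way to sanity-check these formulas before implementing the operator's backward kernel.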