[add] LayerNorm operation
Context
Add the LayerNorm operation. It is used extensively in Transformers.
Forward equation
y = \frac{x - E[x] \cdot \mathbb{1}_d}{\sqrt{Var[x]+\epsilon}} \odot \gamma + \beta
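A minimal sketch to make the equation concrete, as a NumPy reference rather than the Aidge CPU kernel; the function name and the `eps` default are illustrative assumptions:

```python
import numpy as np

def layer_norm_forward(x, gamma, beta, eps=1e-5):
    """Reference LayerNorm forward over the last axis (the d features).

    y = (x - E[x]) / sqrt(Var[x] + eps) * gamma + beta
    Also returns x_hat and sigma = sqrt(Var[x] + eps) for the backward pass.
    """
    mean = x.mean(axis=-1, keepdims=True)   # E[x]
    var = x.var(axis=-1, keepdims=True)     # Var[x], biased estimator (1/d)
    sigma = np.sqrt(var + eps)              # sigma[x]
    x_hat = (x - mean) / sigma              # normalized input
    y = x_hat * gamma + beta
    return y, x_hat, sigma
```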
Gradient equations
\hat x = \frac{x - E[x] \cdot \mathbb{1}_{d}}{\sigma [x]}
- With respect to input x:
\frac{\partial \mathcal{L}}{\partial x} = \frac{1}{\sigma [x]} \cdot \left( I - \frac{1}{d} \mathbb{1} \mathbb{1}^\top - \frac{1}{d} \hat{x} \hat{x}^\top \right) \left( \frac{\partial \mathcal{L}}{\partial y} \odot \gamma \right)
- With respect to \gamma:
\frac{\partial \mathcal{L}}{\partial \gamma} = \sum_{n=1}^{N} \left[ \frac{\partial \mathcal{L}}{\partial y^{(n)}} \odot \hat{x}^{(n)} \right]
- With respect to \beta:
\frac{\partial \mathcal{L}}{\partial \beta} = \sum_{n=1}^{N} \frac{\partial \mathcal{L}}{\partial y^{(n)}}
With d the number of features, N the batch size, \sigma[x] = \sqrt{Var[x] + \epsilon}, and \odot the Hadamard product.
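A matching backward sketch under the same assumptions as the forward reference above (NumPy, illustrative names; `x_hat` and `sigma` are the values cached by `layer_norm_forward`):

```python
def layer_norm_backward(dy, gamma, x_hat, sigma):
    """Reference LayerNorm backward implementing the three gradient equations.

    dy:    upstream gradient dL/dy, shape (N, d)
    gamma: scale parameter, shape (d,)
    x_hat: normalized input from the forward pass, shape (N, d)
    sigma: sqrt(Var[x] + eps) per sample, shape (N, 1)
    """
    g = dy * gamma                                            # dL/dy (Hadamard) gamma
    # (1/sigma) * (I - (1/d) 1 1^T - (1/d) x_hat x_hat^T) g, applied row-wise
    dx = (g
          - g.mean(axis=-1, keepdims=True)                    # -(1/d) 1 1^T g
          - x_hat * (g * x_hat).mean(axis=-1, keepdims=True)  # -(1/d) x_hat x_hat^T g
          ) / sigma
    dgamma = (dy * x_hat).sum(axis=0)                         # sum_n dL/dy^(n) (Hadamard) x_hat^(n)
    dbeta = dy.sum(axis=0)                                    # sum_n dL/dy^(n)
    return dx, dgamma, dbeta
```

One way to validate a backend implementation is to compare its outputs against this reference, or against finite differences, on random inputs.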
- Core: !492 (merged)
- Backend - CPU:
  - aidge_backend_cpu!199 (merged) (forward)
  - backward
- Interoperability - ONNX: