[add] LayerNorm operation

Context

Add the LayerNorm operation. It is used extensively in Transformer architectures.

Forward equation

y = \frac{x - E[x] \cdot \mathbb{1}_d}{\sqrt{Var[x]+\epsilon}} \odot \gamma + \beta
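For reference, a minimal NumPy sketch of the forward equation above. The function name `layer_norm_forward`, the `(N, d)` batch layout, and the `eps=1e-5` default are illustrative assumptions, not part of the proposed operator API.

```python
import numpy as np

def layer_norm_forward(x, gamma, beta, eps=1e-5):
    """LayerNorm over the last axis (the d features).

    x:     (N, d) input batch
    gamma: (d,)   scale parameter
    beta:  (d,)   shift parameter
    Returns y plus the cached values needed by the backward pass.
    """
    mean = x.mean(axis=-1, keepdims=True)        # E[x]
    var = x.var(axis=-1, keepdims=True)          # Var[x], biased (1/d)
    sigma = np.sqrt(var + eps)                   # sqrt(Var[x] + eps)
    x_hat = (x - mean) / sigma                   # normalized input
    y = x_hat * gamma + beta                     # affine transform
    return y, x_hat, sigma
```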

Gradient equations

\hat x = \frac{x - E[x] \cdot \mathbb{1}_{d}}{\sigma[x]}, \qquad \sigma[x] = \sqrt{Var[x] + \epsilon}

  • With respect to input x:

\frac{\partial \mathcal{L}}{\partial x} = \frac{1}{\sigma [x]} \cdot \left( I - \frac{1}{d} \mathbb{1} \mathbb{1}^\top - \frac{1}{d} \hat{x} \hat{x}^\top \right) (\frac{\partial \mathcal{L}}{\partial y} \odot \gamma)

  • With respect to \gamma:

\frac{\partial \mathcal{L}}{\partial \gamma} = \sum_{n=1}^{N} \left[ \frac{\partial \mathcal{L}}{\partial y^{(n)}} \odot \hat{x}^{(n)} \right]

  • With respect to \beta:

\frac{\partial \mathcal{L}}{\partial \beta} = \sum_{n=1}^{N} \frac{\partial \mathcal{L}}{\partial y^{(n)}}

where d is the number of features, \mathbb{1}_d the d-dimensional vector of ones, \odot the Hadamard product, and the sums run over the N samples of the batch.
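The three gradients can be sketched in the same way. Here `sigma` and `x_hat` are the values cached by the forward sketch above, and the row-wise expression applies the Jacobian I - \frac{1}{d}\mathbb{1}\mathbb{1}^\top - \frac{1}{d}\hat{x}\hat{x}^\top without materializing it; names and layout are again assumptions for illustration.

```python
def layer_norm_backward(dy, x_hat, sigma, gamma):
    """Gradients of LayerNorm from the equations above.

    dy:    (N, d) upstream gradient dL/dy
    x_hat: (N, d) normalized input cached in the forward pass
    sigma: (N, 1) sqrt(Var[x] + eps) cached in the forward pass
    gamma: (d,)   scale parameter
    """
    dx_hat = dy * gamma                                   # dL/dy ⊙ γ
    # (I - 1/d 11ᵀ - 1/d x̂ x̂ᵀ) applied row by row, then scaled by 1/σ[x]
    dx = (dx_hat
          - dx_hat.mean(axis=-1, keepdims=True)
          - x_hat * (dx_hat * x_hat).mean(axis=-1, keepdims=True)) / sigma
    dgamma = (dy * x_hat).sum(axis=0)                     # sum over the batch
    dbeta = dy.sum(axis=0)                                # sum over the batch
    return dx, dgamma, dbeta
```

A finite-difference comparison against the forward sketch is a quick way to sanity-check these formulas before implementing the operator's backward kernel.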