Accumulation of the backward gradients
Context
The CUDA implementation of the backward propagation is currently heavily used for the development of QAT (Quantization-Aware Training).
While it works fine for streamlined single-branch architectures, the current policy, which consists of writing the gradient into the appropriate tensors, does not work well for multi-branch architectures.
Indeed, when a branch forks, the gradient coming from the left branch must be added to the gradient coming from the right one. A possible solution that handles all kinds of topologies is to accumulate the gradients in the tensors instead of overwriting them.
This MR does exactly that: it switches every operator's backward implementation (when one exists) to accumulation mode.
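To illustrate the idea, here is a minimal host-side sketch (hypothetical names, not code from this MR): when a tensor feeds two branches, its gradient is the sum of the gradients flowing back from each branch, which accumulation naturally produces, whereas a plain write would lose the first branch's contribution.

```cpp
// Illustrative sketch only: accumulate each branch's gradient contribution
// into the shared gradient tensor instead of overwriting it.
#include <vector>

void accumulateBackward(const std::vector<float>& branchGrad,
                        std::vector<float>& gradX)
{
    for (std::size_t i = 0; i < gradX.size(); ++i)
        gradX[i] += branchGrad[i];   // accumulate, do not overwrite
}

int main()
{
    const std::vector<float> gradFromLeft  = {1.f, 2.f, 3.f};
    const std::vector<float> gradFromRight = {4.f, 5.f, 6.f};

    std::vector<float> gradX(3, 0.f);         // zero-initialized once
    accumulateBackward(gradFromLeft,  gradX); // {1, 2, 3}
    accumulateBackward(gradFromRight, gradX); // {5, 7, 9}: sum of both branches
    return 0;
}
```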
Modifications
Every operator that implements a backward pass has been modified accordingly.
We use an alpha/beta blending scheme similar to what cuDNN currently proposes, that is:
curr_grad = alpha * computed_grad + beta * curr_grad
Both alpha and beta are set to 1, so the newly computed gradient is simply added to the existing one.
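For illustration, a minimal sketch of what such an accumulating element-wise backward kernel could look like (the name and signature are hypothetical, not the actual kernels touched by this MR):

```cuda
// Hypothetical accumulating backward kernel (illustrative sketch only):
// curr_grad = alpha * computed_grad + beta * curr_grad
__global__ void accumulateGradKernel(const float* computedGrad,
                                     float* currGrad,
                                     float alpha, float beta,
                                     std::size_t size)
{
    const std::size_t idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < size) {
        currGrad[idx] = alpha * computedGrad[idx] + beta * currGrad[idx];
    }
}
// With alpha = 1 and beta = 1 this reduces to a plain sum:
// currGrad[idx] += computedGrad[idx];
```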