
Accumulation of the backward gradients

Benjamin Halimi requested to merge accumulate into dev

Context

The CUDA implementation of the backward propagation is currently heavily used for the development of QAT.

While it works fine for streamlined single-branch architectures, the current policy of writing the gradient directly into the appropriate tensors does not work well for multi-branch architectures.

Indeed, when a branch forks, the gradient coming from the left branch should be added to the gradient coming from the right one. A solution that addresses all kinds of topologies is to accumulate the gradient in the tensors instead of writing into them.

This MR does exactly that: it switches every operator's backward implementation (when it exists) to accumulation mode.
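To make the idea concrete, here is a minimal, self-contained CUDA sketch (illustrative only, not the repository's actual kernels) of why accumulation is needed at a fork: if the input gradient is overwritten, the second branch's backward call erases the first one's contribution, whereas accumulation yields their sum.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Overwrite policy: the second branch's backward call erases the first one's result.
__global__ void backwardWrite(const float* branchGrad, float* inputGrad, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) inputGrad[i] = branchGrad[i];
}

// Accumulation policy: each branch adds its contribution, so after both
// backward calls the tensor holds the sum of the left and right gradients.
__global__ void backwardAccumulate(const float* branchGrad, float* inputGrad, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) inputGrad[i] += branchGrad[i];
}

int main() {
    const int n = 4;
    float hLeft[n]  = {1, 1, 1, 1};  // gradient coming from the left branch
    float hRight[n] = {2, 2, 2, 2};  // gradient coming from the right branch
    float hGrad[n]  = {0, 0, 0, 0};  // gradient tensor of the forked node

    float *dLeft, *dRight, *dGrad;
    cudaMalloc(&dLeft,  n * sizeof(float));
    cudaMalloc(&dRight, n * sizeof(float));
    cudaMalloc(&dGrad,  n * sizeof(float));
    cudaMemcpy(dLeft,  hLeft,  n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dRight, hRight, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dGrad,  hGrad,  n * sizeof(float), cudaMemcpyHostToDevice);

    // With accumulation both branches contribute: every element ends up at 3.
    // With backwardWrite instead, only the last call would survive (value 2).
    backwardAccumulate<<<1, n>>>(dLeft,  dGrad, n);
    backwardAccumulate<<<1, n>>>(dRight, dGrad, n);

    cudaMemcpy(hGrad, dGrad, n * sizeof(float), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; ++i) printf("%f\n", hGrad[i]);

    cudaFree(dLeft); cudaFree(dRight); cudaFree(dGrad);
    return 0;
}
```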

Modifications

Every operator that has a backward implementation has it modified accordingly.

We use an alpha/beta blending scheme similar to the one cuDNN already offers, that is:

curr_grad = alpha * computed_grad + beta * curr_grad

And we set both alpha and beta to 1.
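As an illustration (the kernel name and signature below are hypothetical, not the MR's actual code), the blending applied by each backward kernel can be pictured as follows; beta = 0 would reproduce the old overwrite policy, while alpha = beta = 1 reduces to plain accumulation:

```cuda
// Illustrative element-wise blend mirroring cuDNN's alpha/beta convention:
// alpha scales the freshly computed gradient, beta scales what is already
// stored; with alpha = 1 and beta = 1 every backward pass accumulates.
__global__ void blendGradient(float alpha, const float* computedGrad,
                              float beta, float* currGrad, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        currGrad[i] = alpha * computedGrad[i] + beta * currGrad[i];
    }
}
```

For the cuDNN-backed operators, the same effect is obtained by passing beta = 1 to the corresponding cuDNN backward routines, which then accumulate into the destination tensor rather than overwriting it.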
