Accumulation of the backward gradients
Context
The CUDA implementation of the backward propagation is currently heavily used for the development of QAT (Quantization-Aware Training).
While it works fine for streamlined single-branch architectures, the current policy, which consists of writing the gradient into the appropriate tensors, does not work well for multi-branch architectures.
Indeed, when a branch forks, the gradient coming from the left branch must be added to the gradient coming from the right one. A possible solution that handles all kinds of topologies is to accumulate the gradients in the tensors instead of overwriting them.
This MR does exactly that: it switches every operator's backward implementation (when one exists) to accumulation mode.
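To illustrate the idea, here is a minimal host-side sketch (hypothetical names, not code from this MR): when a tensor feeds two branches, its gradient is the sum of the gradients flowing back from each branch, which accumulation naturally produces, whereas a plain write would lose the first branch's contribution.

```cpp
// Illustrative sketch only: accumulate each branch's gradient contribution
// into the shared gradient tensor instead of overwriting it.
#include <vector>

void accumulateBackward(const std::vector<float>& branchGrad,
                        std::vector<float>& gradX)
{
    for (std::size_t i = 0; i < gradX.size(); ++i)
        gradX[i] += branchGrad[i];   // accumulate, do not overwrite
}

int main()
{
    const std::vector<float> gradFromLeft  = {1.f, 2.f, 3.f};
    const std::vector<float> gradFromRight = {4.f, 5.f, 6.f};

    std::vector<float> gradX(3, 0.f);         // zero-initialized once
    accumulateBackward(gradFromLeft,  gradX); // {1, 2, 3}
    accumulateBackward(gradFromRight, gradX); // {5, 7, 9}: sum of both branches
    return 0;
}
```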
Modifications
Every operator that implements a backward pass has been modified accordingly.
We use an alpha/beta blending scheme similar to what cuDNN currently proposes, that is:
curr_grad = alpha * computed_grad + beta * curr_grad
Both alpha and beta are set to 1, so the newly computed gradient is simply added to the existing one.
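For illustration, a minimal sketch of what such an accumulating element-wise backward kernel could look like (the name and signature are hypothetical, not the actual kernels touched by this MR):

```cuda
// Hypothetical accumulating backward kernel (illustrative sketch only):
// curr_grad = alpha * computed_grad + beta * curr_grad
__global__ void accumulateGradKernel(const float* computedGrad,
                                     float* currGrad,
                                     float alpha, float beta,
                                     std::size_t size)
{
    const std::size_t idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < size) {
        currGrad[idx] = alpha * computedGrad[idx] + beta * currGrad[idx];
    }
}
// With alpha = 1 and beta = 1 this reduces to a plain sum:
// currGrad[idx] += computedGrad[idx];
```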