Accumulation of the backward gradients
Context
The CUDA implementation of the backward propagation is currently heavily used for the development of QAT (quantization-aware training).
While it works fine for streamlined single-branch architectures, the current policy, which consists in writing the gradient directly into the appropriate tensors, does not work for multi-branch architectures.
Indeed, when the graph forks, the gradient coming from the left branch must be added to the gradient coming from the right one. A solution that addresses all kinds of topologies is to accumulate the gradients in the tensors instead of overwriting them (see the sketch below).
This MR does exactly that: it switches every operator's backward implementation (when it exists) to accumulation mode.
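For illustration (the notation here is mine, not from the MR): if a tensor x feeds two branches y1 = f(x) and y2 = g(x) in the forward pass, the chain rule gives

```math
\frac{\partial L}{\partial x} = \frac{\partial L}{\partial y_1}\,\frac{\partial y_1}{\partial x} + \frac{\partial L}{\partial y_2}\,\frac{\partial y_2}{\partial x}
```

so if each branch's backward simply wrote its contribution into x's gradient tensor, the second write would overwrite the first instead of summing with it; accumulating in the tensor preserves the sum.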
Modifications
All the operators that implement a backward have it modified.
We use an alpha/beta blending scheme similar to what cuDNN currently proposes, that is:

curr_grad = alpha * computed_grad + beta * curr_grad

and we set both alpha and beta to 1.
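As a minimal sketch of what accumulation mode means for a hand-written kernel (illustrative code under the alpha/beta convention above, not the actual aidge_backend_cuda implementation; the `accumulateGrad*` names are hypothetical):

```cpp
#include <cuda_runtime.h>
#include <cstddef>

// Illustrative only: blend a freshly computed gradient into the gradient
// already stored in the tensor, following
//   curr_grad = alpha * computed_grad + beta * curr_grad
__global__ void accumulateGradKernel(float* currGrad,
                                     const float* computedGrad,
                                     float alpha, float beta,
                                     std::size_t size)
{
    const std::size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < size) {
        currGrad[i] = alpha * computedGrad[i] + beta * currGrad[i];
    }
}

// Host-side launcher; alpha = beta = 1 means "add to the existing gradient".
void accumulateGrad(float* currGrad, const float* computedGrad, std::size_t size)
{
    constexpr float alpha = 1.0f;
    constexpr float beta  = 1.0f;  // beta = 0 would reproduce the old overwrite behaviour
    constexpr unsigned int threads = 256;
    const unsigned int blocks = static_cast<unsigned int>((size + threads - 1) / threads);
    accumulateGradKernel<<<blocks, threads>>>(currGrad, computedGrad, alpha, beta, size);
}
```

For the operators that delegate their backward to cuDNN, the same behaviour is obtained by passing beta = 1 to the corresponding cudnn*Backward call, since cuDNN applies dst = alpha * result + beta * dst to its output buffer.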
Merge request reports
Activity
added Enhancement ⭐ and Status: Review Ready labels
assigned to @bhalimi
added 45 commits
- d327fdfe...2b484b2c - 38 commits from branch dev
- 57813205 - base setup for scaling branch
- 53d94e0e - add the backend of the Scaling node
- 06c47f2d - add sqrt cuda backend
- ed1b204e - accumulate the gradients instead of replacing them (done for several nodes but not all)
- 6ca3699c - add the alpha and beta for the Padding operator
- ad5adffd - move every node with a backward in gradient accumulation mode
- 65cd7698 - Merge branch 'accumulate' of gitlab.eclipse.org:eclipse/aidge/aidge_backend_cuda into accumulate
Diff note on the added line `if (op.trainingMode())` (after the `getCudnnTensorDesc(input1)` call) - changed this line in version 5 of the diff.
requested review from @cmoineau
added 25 commits
- 65cd7698...75102058 - 13 commits from branch dev
- 75102058...a55884fc - 2 earlier commits
- e5dd3f9d - add sqrt cuda backend
- 73659134 - accumulate the gradients instead of replacing them (done for several nodes but not all)
- 1c40c8bd - add the alpha and beta for the Padding operator
- cb88eb71 - move every node with a backward in gradient accumulation mode
- 6d78b49e - minor changes
- 374e351a - add the backend of the Scaling node
- 24ab9fce - add sqrt cuda backend
- 73989490 - add sqrt cuda backend
- 3465120c - move every node with a backward in gradient accumulation mode
- 63b2d000 - Merge remote-tracking branch 'refs/remotes/origin/accumulate' into accumulate
mentioned in issue aidge_backend_cpu#34
added 26 commits
- 0517b3bd...d9c094f6 - 2 commits from branch dev
- d9c094f6...c22df87b - 14 earlier commits
- d1355343 - move every node with a backward in gradient accumulation mode
- bd2dbe48 - base setup for scaling branch
- 283ddbf8 - add the backend of the Scaling node
- c819699a - add sqrt cuda backend
- 69d4560f - add sqrt cuda backend
- fff9b4cd - move every node with a backward in gradient accumulation mode
- de94d8ac - remove the backend of the old Scaling node
- 830add3b - minor fix (remove the scaling op include)
- 08f3083e - remove duplicate include
- 94e29872 - Merge branch 'accumulate' of gitlab.eclipse.org:eclipse/aidge/aidge_backend_cuda into accumulate
enabled an automatic merge when all merge checks for a0191317 pass
mentioned in commit 57faa45a
mentioned in merge request aidge_learning!34 (merged)