[learning] Backward failure when using Data Provider with drop_last set to False

What commit version of aidge do you use

Current development version of core and CUDA modules

Problem description

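When using DataProvider with drop_last set to False, the last batch can be smaller than the previous ones (whenever the dataset size is not a multiple of the batch size). The tensor dimensions therefore have to be updated correctly during training, for both the forward and the backward pass.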
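To make the shape change concrete, here is a minimal, framework-independent sketch of the batch sizes a data provider would yield. It does not use the aidge API; `batch_sizes` is a hypothetical helper and the numbers (100 samples, batch size 32) are made up for illustration.

```python
def batch_sizes(num_samples, batch_size, drop_last):
    """Return the size of each batch for one pass over the dataset.

    Hypothetical helper for illustration only, not part of aidge.
    """
    full_batches, remainder = divmod(num_samples, batch_size)
    sizes = [batch_size] * full_batches
    # With drop_last=False the leftover samples form a final, smaller batch.
    if remainder and not drop_last:
        sizes.append(remainder)
    return sizes


print(batch_sizes(100, 32, drop_last=True))   # [32, 32, 32]
print(batch_sizes(100, 32, drop_last=False))  # [32, 32, 32, 4]
```

In this example the last batch has 4 samples instead of 32, so any buffer or descriptor sized for a batch of 32 has to be resized before that iteration runs.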

Unfortunately, when scheduler.backward() is called for this last batch, the following CUDA error is generated: "RuntimeError: CUDNN failure: CUDNN_STATUS_NOT_SUPPORTED (9)".

This error occurs in the Aidge::ReLUImpl_cuda::backward_ method, when cudnnActivationBackward is called with the CUDA descriptors / dimensions of the input and output tensors, which may not have been updated correctly for the smaller batch.

Reproducible example code

Use the tutorial notebook /examples/tutorials/Learning/learn.ipynb with the following modifications:

  • Set BACKEND to "cuda",
  • In the definition of "aidge_dataprovider", "drop_last" is set to "False",
  • The break after the first 5 iterations (at the end of the file) is removed.
Edited by Olivier Antoni