[backend_cuda] refFromCast slow down CUDA backend

What commit version of aidge do you use?

Issue description

For a simple MLP, the inference time with the CUDA backend is significantly higher than with the CPU backend:

| CPU | CUDA |
| --- | --- |
| 7.91s | 10.98s |

(This is the total time for 10k inferences.)
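
The exact benchmark loop is not shown in this issue; the measurement corresponds to something along these lines (a minimal sketch where `runInference` stands in for one forward pass of the scheduler on the chosen backend):

```cpp
#include <chrono>
#include <functional>

// Minimal timing-loop sketch (illustrative only): `runInference` is a
// placeholder for one forward pass on the backend being measured.
double timeInferences(const std::function<void()>& runInference, int n = 10000) {
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < n; ++i) {
        runInference();
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();  // seconds
}
```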

Intrigued by this finding, I decided to look into the CUDA backend and check for possible optimizations.

Sub/Add kernel issue

The Sub operator is really slow, especially compared to the FC kernel.

For the first inference, the Sub operator takes a very long time (screenshot: scheduler for the first inference).

The inference time afterwards is more normal (screenshot: scheduler for the second inference).

First optimisation

I locally refactored the Sub kernel so that the strides are not recomputed and the tensors are not re-initialized at every call, but only when the number of inputs changes.

This change improved the Sub kernel time from 37.86 µs to 27.45 µs.
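
For reference, the refactor looks roughly like the sketch below (the actual patch differs; `BroadcastCache` is a made-up helper name and in practice this state lives as members of `SubImpl_cuda`): the broadcasted dims and strides are cached and only rebuilt when the number of inputs changes, which assumes the input shapes stay the same across inferences, as in the 10k-inference benchmark above.

```cpp
#include <cstddef>
#include <vector>

// Sketch of the caching idea (hypothetical helper, not the actual Aidge code):
// broadcasted dims and row-major strides are computed once per input layout
// and reused across forward() calls.
struct BroadcastCache {
    std::size_t nbInputs = 0;               // number of inputs the cache was built for
    std::vector<std::vector<int>> dims;     // broadcasted dims, one entry per input
    std::vector<std::vector<int>> strides;  // matching strides, one entry per input

    void update(const std::vector<std::vector<int>>& inputDims, std::size_t outputNbDims) {
        if (inputDims.size() == nbInputs) {
            return;  // number of inputs unchanged: keep the cached dims/strides
        }
        nbInputs = inputDims.size();
        dims.assign(nbInputs, {});
        strides.assign(nbInputs, {});
        for (std::size_t i = 0; i < nbInputs; ++i) {
            // Left-pad the dims with 1s so every input has the output's rank.
            dims[i] = inputDims[i];
            dims[i].insert(dims[i].cbegin(), outputNbDims - dims[i].size(), 1);

            // Row-major strides matching the broadcasted dims.
            strides[i].resize(dims[i].size());
            int product = 1;
            for (std::size_t j = dims[i].size(); j > 0; --j) {
                strides[i][j - 1] = product;
                product *= dims[i][j - 1];
            }
        }
    }
};
```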

Benchmark of the Sub and FC kernels

I have tried to break down the FC and Sub forward passes into three steps: InputCheck, InputInit and Forward.

Taking the Sub kernel as an example:

```cpp
void Aidge::SubImpl_cuda::forward() {

    /****** InputCheck ******/
    const Sub_Op& op = static_cast<const Sub_Op&>(mOp);
    // Check inputs
    AIDGE_ASSERT(op.getInput(0), "missing input in Sub operator");
    AIDGE_ASSERT(op.getInput(0)->hasImpl(), "cannot run Sub forward because the 0-th input has no implementation.");
    DataType datatypeFirstInput = op.getInput(0)->dataType();
    for (IOIndex_t i = 1; i < op.nbInputs(); ++i) {
        AIDGE_ASSERT(op.getInput(i), "missing input in Sub operator");
        AIDGE_ASSERT(op.getInput(i)->hasImpl(), "cannot run Sub forward because the {}-th input has no implementation.", i);
        AIDGE_ASSERT(op.getInput(i)->dataType() == datatypeFirstInput, "Cannot run Sub forward with inputs of two different data types.");
    }
    /****** InputInit ******/
    std::vector<std::shared_ptr<Tensor>> inputFallbacks(op.nbInputs());
    std::vector<Tensor> inputs(op.nbInputs());
    std::vector<std::vector<int>> dims(op.nbInputs()); // For broadcasted dims
    std::vector<std::vector<int>> strides(op.nbInputs()); // For the corresponding strides
    for (IOIndex_t i = 0; i < op.nbInputs(); ++i) {
        inputs[i] = op.getInput(i)->refCastFrom(inputFallbacks[i], *op.getOutput(0));

        // Get tensor dims and broadcast them
        std::copy(inputs[i].dims().begin(), inputs[i].dims().end(), std::back_inserter(dims[i]));
        dims[i].insert(dims[i].cbegin(), op.getOutput(0)->nbDims() - dims[i].size(), int(1));

        // Compute the corresponding strides
        std::vector<int> tensorStrides(dims[i].size());
        int product = 1;
        for (size_t j = dims[i].size(); j > 0; --j) {
            tensorStrides[j - 1] = product;
            product *= dims[i][j - 1];
        }
        strides[i] = tensorStrides;
    }
    /****** Forward ******/
    switch(std::static_pointer_cast<Tensor>(mOp.getRawOutput(0))->dataType()) {
        case DataType::Float64:
            forward_<double>(inputs, dims, strides);
            break;
        case DataType::Float32:
            forward_<float>(inputs, dims, strides);
            break;
        case DataType::Float16:
            forward_<half>(inputs, dims, strides);
            break;
        default:
            AIDGE_THROW_OR_ABORT(std::runtime_error, "Data type is not supported by Backend Cuda");
    }
}
```

I measured the execution time of each of these blocks and got the following tables.

SubForward:

| Code Part | Exec time (ns) | % of execution |
| --- | --- | --- |
| InputCheck | 600 | 2% |
| InputInit | 11350 | 41% |
| Forward | 15540 | 57% |

Note: for this measurement, I refactored the code so that the stride computation etc. is not included in InputInit; only the call to refCastFrom is measured. (A minimal sketch of this kind of per-block timing is given after the tables.)

FCForward:

| Code Part | Exec time (ns) | % of execution |
| --- | --- | --- |
| InputCheck | 720 | 2% |
| InputInit | 1350 | 3% |
| Forward | 37480 | 95% |
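
For completeness, here is one way such per-block timings can be taken (a minimal sketch, not the exact instrumentation used): a `steady_clock` stopwatch around each block, with a device synchronization before stopping so that asynchronous CUDA work launched by the block is included in its time.

```cpp
#include <chrono>
#include <cuda_runtime.h>

// Minimal per-block stopwatch sketch. The cudaDeviceSynchronize() matters
// mostly for the Forward block, where kernels are launched asynchronously.
class BlockTimer {
    std::chrono::steady_clock::time_point mStart;
public:
    void start() { mStart = std::chrono::steady_clock::now(); }
    long long stopNs() {
        cudaDeviceSynchronize();  // wait for pending CUDA work before reading the clock
        const auto stop = std::chrono::steady_clock::now();
        return std::chrono::duration_cast<std::chrono::nanoseconds>(stop - mStart).count();
    }
};

// Usage inside SubImpl_cuda::forward():
//   BlockTimer timer;
//   timer.start();
//   /* ... InputCheck block ... */
//   const long long inputCheckNs = timer.stopNs();
```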

After further investigation, it appears that refCastFrom is the only function called during the InputInit phase, and for a reason I do not yet understand it is roughly 10 times slower for the Sub kernel than for FC. This function seems to be the bottleneck for the current optimization of the Sub/Add kernels.
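
One direction that could be worth testing during that investigation (purely a hypothesis on my side, not something I have benchmarked): keep the fallback tensors as members of the implementation instead of rebuilding the `inputFallbacks` vector at every call, so that whatever refCastFrom allocates into the fallback can potentially be reused across inferences. Sketched roughly:

```cpp
// Hypothetical sketch only (Aidge headers omitted): the member name and the
// reuse behaviour of refCastFrom are assumptions to be verified, not the
// current Aidge code.
class SubImpl_cuda_sketch {
    // Kept alive across forward() calls instead of being recreated each time.
    std::vector<std::shared_ptr<Tensor>> mInputFallbacks;

public:
    void initInputs(const Sub_Op& op, std::vector<Tensor>& inputs) {
        if (mInputFallbacks.size() != op.nbInputs()) {
            mInputFallbacks.resize(op.nbInputs());
        }
        for (IOIndex_t i = 0; i < op.nbInputs(); ++i) {
            // If refCastFrom can reuse an already-allocated fallback, the
            // per-inference cost should drop; this is what needs measuring.
            inputs[i] = op.getInput(i)->refCastFrom(mInputFallbacks[i], *op.getOutput(0));
        }
    }
};
```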

Conclusion

  • The Sub operator warm-up is very slow.
  • A small refactor of the Sub kernel improves its performance by about 25%.
  • An investigation of refCastFrom is required.
  • More investigation of the gap between the CPU and CUDA backends is required, as the gap is still huge despite these optimizations and does not seem to be solvable only by looking at refCastFrom. (Note: I have currently only looked at the kernel computation; an investigation of the functions called by the Scheduler may be required to explain the differences between the two backends.)