[backend_cuda] refFromCast slow down CUDA backend

What commit version of aidge do you use?

Issue description

For a simple MLP, the inference time with the CUDA backend is significantly higher than with the CPU backend:

| CPU | CUDA |
| --- | --- |
| 7.91s | 10.98s |

(This is the total time for 10k inferences.)
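
The exact benchmark loop is not shown in this issue; the measurement corresponds to something along these lines (a minimal sketch where `runInference` stands in for one forward pass of the scheduler on the chosen backend):

```cpp
#include <chrono>
#include <functional>

// Minimal timing-loop sketch (illustrative only): `runInference` is a
// placeholder for one forward pass on the backend being measured.
double timeInferences(const std::function<void()>& runInference, int n = 10000) {
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < n; ++i) {
        runInference();
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();  // seconds
}
```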

Intrigued by this finding, I decided to look into the CUDA backend and check for possible optimizations.

Sub/Add kernel issue

The Sub operator is really slow, especially compared to the FC kernel.

For the first inference, the Sub operator takes a very long time (screenshot: scheduler for the first inference).

The inference time afterwards is more normal (screenshot: scheduler for the second inference).

First optimisation

I locally refactored the Sub kernel so that the strides are not recomputed and the tensors are not re-initialized at every call, but only when the number of inputs changes.

This change improved the Sub kernel time from 37.86 µs to 27.45 µs.
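
For reference, the refactor looks roughly like the sketch below (the actual patch differs; `BroadcastCache` is a made-up helper name and in practice this state lives as members of `SubImpl_cuda`): the broadcasted dims and strides are cached and only rebuilt when the number of inputs changes, which assumes the input shapes stay the same across inferences, as in the 10k-inference benchmark above.

```cpp
#include <cstddef>
#include <vector>

// Sketch of the caching idea (hypothetical helper, not the actual Aidge code):
// broadcasted dims and row-major strides are computed once per input layout
// and reused across forward() calls.
struct BroadcastCache {
    std::size_t nbInputs = 0;               // number of inputs the cache was built for
    std::vector<std::vector<int>> dims;     // broadcasted dims, one entry per input
    std::vector<std::vector<int>> strides;  // matching strides, one entry per input

    void update(const std::vector<std::vector<int>>& inputDims, std::size_t outputNbDims) {
        if (inputDims.size() == nbInputs) {
            return;  // number of inputs unchanged: keep the cached dims/strides
        }
        nbInputs = inputDims.size();
        dims.assign(nbInputs, {});
        strides.assign(nbInputs, {});
        for (std::size_t i = 0; i < nbInputs; ++i) {
            // Left-pad the dims with 1s so every input has the output's rank.
            dims[i] = inputDims[i];
            dims[i].insert(dims[i].cbegin(), outputNbDims - dims[i].size(), 1);

            // Row-major strides matching the broadcasted dims.
            strides[i].resize(dims[i].size());
            int product = 1;
            for (std::size_t j = dims[i].size(); j > 0; --j) {
                strides[i][j - 1] = product;
                product *= dims[i][j - 1];
            }
        }
    }
};
```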

Benchmark of the Sub and FC kernels

I have tried to break down the FC and Sub forward passes into three steps: InputCheck, InputInit and Forward.

Taking the Sub kernel as an example:

```cpp
void Aidge::SubImpl_cuda::forward() {

    /****** InputCheck ******/
    const Sub_Op& op = static_cast<const Sub_Op&>(mOp);
    // Check inputs
    AIDGE_ASSERT(op.getInput(0), "missing input in Sub operator");
    AIDGE_ASSERT(op.getInput(0)->hasImpl(), "cannot run Sub forward because the 0-th input has no implementation.");
    DataType datatypeFirstInput = op.getInput(0)->dataType();
    for (IOIndex_t i = 1; i < op.nbInputs(); ++i) {
        AIDGE_ASSERT(op.getInput(i), "missing input in Sub operator");
        AIDGE_ASSERT(op.getInput(i)->hasImpl(), "cannot run Sub forward because the {}-th input has no implementation.", i);
        AIDGE_ASSERT(op.getInput(i)->dataType() == datatypeFirstInput, "Cannot run Sub forward with inputs of two different data types.");
    }
    /****** InputInit ******/
    std::vector<std::shared_ptr<Tensor>> inputFallbacks(op.nbInputs());
    std::vector<Tensor> inputs(op.nbInputs());
    std::vector<std::vector<int>> dims(op.nbInputs()); // For broadcasted dims
    std::vector<std::vector<int>> strides(op.nbInputs()); // For the corresponding strides
    for (IOIndex_t i = 0; i < op.nbInputs(); ++i) {
        inputs[i] = op.getInput(i)->refCastFrom(inputFallbacks[i], *op.getOutput(0));

        // Get tensor dims and broadcast them
        std::copy(inputs[i].dims().begin(), inputs[i].dims().end(), std::back_inserter(dims[i]));
        dims[i].insert(dims[i].cbegin(), op.getOutput(0)->nbDims() - dims[i].size(), int(1));

        // Compute the corresponding strides
        std::vector<int> tensorStrides(dims[i].size());
        int product = 1;
        for (size_t j = dims[i].size(); j > 0; --j) {
            tensorStrides[j - 1] = product;
            product *= dims[i][j - 1];
        }
        strides[i] = tensorStrides;
    }
    /****** Forward ******/
    switch(std::static_pointer_cast<Tensor>(mOp.getRawOutput(0))->dataType()) {
        case DataType::Float64:
            forward_<double>(inputs, dims, strides);
            break;
        case DataType::Float32:
            forward_<float>(inputs, dims, strides);
            break;
        case DataType::Float16:
            forward_<half>(inputs, dims, strides);
            break;
        default:
            AIDGE_THROW_OR_ABORT(std::runtime_error, "Data type is not supported by Backend Cuda");
    }
}
```

I measured the execution time of each of these blocks and got the following tables.

SubForward:

| Code Part | Exec time (ns) | % of execution |
| --- | --- | --- |
| InputCheck | 600 | 2% |
| InputInit | 11350 | 41% |
| Forward | 15540 | 57% |

Note: for this measurement, I refactored the code so that the stride computation etc. is not included in InputInit; only the call to refCastFrom is measured. (A minimal sketch of this kind of per-block timing is given after the tables.)

FCForward:

| Code Part | Exec time (ns) | % of execution |
| --- | --- | --- |
| InputCheck | 720 | 2% |
| InputInit | 1350 | 3% |
| Forward | 37480 | 95% |
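
For completeness, here is one way such per-block timings can be taken (a minimal sketch, not the exact instrumentation used): a `steady_clock` stopwatch around each block, with a device synchronization before stopping so that asynchronous CUDA work launched by the block is included in its time.

```cpp
#include <chrono>
#include <cuda_runtime.h>

// Minimal per-block stopwatch sketch. The cudaDeviceSynchronize() matters
// mostly for the Forward block, where kernels are launched asynchronously.
class BlockTimer {
    std::chrono::steady_clock::time_point mStart;
public:
    void start() { mStart = std::chrono::steady_clock::now(); }
    long long stopNs() {
        cudaDeviceSynchronize();  // wait for pending CUDA work before reading the clock
        const auto stop = std::chrono::steady_clock::now();
        return std::chrono::duration_cast<std::chrono::nanoseconds>(stop - mStart).count();
    }
};

// Usage inside SubImpl_cuda::forward():
//   BlockTimer timer;
//   timer.start();
//   /* ... InputCheck block ... */
//   const long long inputCheckNs = timer.stopNs();
```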

After further investigation, it appears that refCastFrom is the only function called during the InputInit phase, and for a reason I do not yet understand it is roughly 10 times slower for the Sub kernel than for FC. This function seems to be the bottleneck for the current optimization of the Sub/Add kernels.
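
One direction that could be worth testing during that investigation (purely a hypothesis on my side, not something I have benchmarked): keep the fallback tensors as members of the implementation instead of rebuilding the `inputFallbacks` vector at every call, so that whatever refCastFrom allocates into the fallback can potentially be reused across inferences. Sketched roughly:

```cpp
// Hypothetical sketch only (Aidge headers omitted): the member name and the
// reuse behaviour of refCastFrom are assumptions to be verified, not the
// current Aidge code.
class SubImpl_cuda_sketch {
    // Kept alive across forward() calls instead of being recreated each time.
    std::vector<std::shared_ptr<Tensor>> mInputFallbacks;

public:
    void initInputs(const Sub_Op& op, std::vector<Tensor>& inputs) {
        if (mInputFallbacks.size() != op.nbInputs()) {
            mInputFallbacks.resize(op.nbInputs());
        }
        for (IOIndex_t i = 0; i < op.nbInputs(); ++i) {
            // If refCastFrom can reuse an already-allocated fallback, the
            // per-inference cost should drop; this is what needs measuring.
            inputs[i] = op.getInput(i)->refCastFrom(mInputFallbacks[i], *op.getOutput(0));
        }
    }
};
```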

Conclusion

  • The Sub operator warm-up is very slow.
  • A small refactor of the Sub kernel improves its performance by about 25%.
  • An investigation of refCastFrom is required.
  • More investigation of the gap between the CPU and CUDA backends is required, as the gap is still huge despite these optimizations and does not seem to be solvable only by looking at refCastFrom. (Note: I have currently only looked at the kernel computation; an investigation of the functions called by the Scheduler may be required to explain the differences between the two backends.)