refCastFrom slows down the CUDA backend
What commit version of aidge do you use
- aidge_backend_cuda: f08d08ed
Issue description
For a simple MLP, inference with the CUDA backend is noticeably slower than with the CPU backend:
CPU | CUDA |
---|---|
7.91s | 10.98s |
(Total time for 10,000 inferences.)
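For reference, the kind of timing loop behind these numbers looks like the sketch below. It is illustrative only: the `benchmark10k` helper is mine, and it assumes the scheduler exposes one `forward()` call per inference.

```cpp
#include <chrono>

// Hypothetical benchmark loop: `scheduler` is assumed to be an already
// configured Aidge scheduler for the MLP on the selected backend.
template <typename Scheduler>
double benchmark10k(Scheduler& scheduler) {
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < 10000; ++i) {
        scheduler.forward();  // one inference
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();  // seconds
}
```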
Intrigued by this finding, I decided to look into the CUDA backend and search for optimization opportunities.
Sub/Add kernel issue
The Sub operator is very slow, especially compared to the FC kernel.
For the first inference, the Sub operator takes a very long time: Scheduler for the first inference
The inference time afterwards is more normal: Scheduler for the second inference
First optimisation
I locally refactored the Sub kernel so that the strides are computed and the tensors initialized only when the inputs change, rather than at every call.
This change improved the execution time from 37.86 µs to 27.45 µs.
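A self-contained sketch of this caching idea is shown below. It is not the actual patch: the `BroadcastCache` struct and its member names are mine, and it keys the cache on the input shapes, recomputing the broadcasted dims and strides only when those shapes differ from the previous call.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical cache for the broadcasted dims/strides of an element-wise
// kernel: they are recomputed only when the input shapes actually change.
struct BroadcastCache {
    std::vector<std::vector<std::size_t>> lastInputDims;
    std::vector<std::vector<int>> dims;     // broadcasted dims, one per input
    std::vector<std::vector<int>> strides;  // matching strides, one per input

    void update(const std::vector<std::vector<std::size_t>>& inputDims,
                std::size_t outputNbDims) {
        if (inputDims == lastInputDims) {
            return;  // shapes unchanged: reuse the cached dims/strides
        }
        lastInputDims = inputDims;
        dims.assign(inputDims.size(), {});
        strides.assign(inputDims.size(), {});
        for (std::size_t i = 0; i < inputDims.size(); ++i) {
            // Left-pad with 1s so every input has the output's rank
            // (assumes the output rank is >= each input rank).
            dims[i].assign(outputNbDims - inputDims[i].size(), 1);
            dims[i].insert(dims[i].end(), inputDims[i].begin(), inputDims[i].end());
            // Row-major strides for the padded shape.
            strides[i].resize(dims[i].size());
            int product = 1;
            for (std::size_t j = dims[i].size(); j > 0; --j) {
                strides[i][j - 1] = product;
                product *= dims[i][j - 1];
            }
        }
    }
};
```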
Benchmark of Sub and FC kernel
I broke down the FC and Sub forward passes into three steps: InputCheck, InputInit and Forward.
Taking the Sub kernel as an example:
```cpp
void Aidge::SubImpl_cuda::forward() {
    /****** InputCheck ******/
    const Sub_Op& op = static_cast<const Sub_Op&>(mOp);
    // Check inputs
    AIDGE_ASSERT(op.getInput(0), "missing input in Sub operator");
    AIDGE_ASSERT(op.getInput(0)->hasImpl(), "cannot run Sub forward because the 0-th input has no implementation.");
    DataType datatypeFirstInput = op.getInput(0)->dataType();
    for (IOIndex_t i = 1; i < op.nbInputs(); ++i) {
        AIDGE_ASSERT(op.getInput(i), "missing input in Sub operator");
        AIDGE_ASSERT(op.getInput(i)->hasImpl(), "cannot run Sub forward because the {}-th input has no implementation.", i);
        AIDGE_ASSERT(op.getInput(i)->dataType() == datatypeFirstInput, "Cannot add inputs with two differents data type.");
    }

    /****** InputInit ******/
    std::vector<std::shared_ptr<Tensor>> inputFallbacks(op.nbInputs());
    std::vector<Tensor> inputs(op.nbInputs());
    std::vector<std::vector<int>> dims(op.nbInputs());    // For broadcasted dims
    std::vector<std::vector<int>> strides(op.nbInputs()); // For the corresponding strides
    for (IOIndex_t i = 0; i < op.nbInputs(); ++i) {
        inputs[i] = op.getInput(i)->refCastFrom(inputFallbacks[i], *op.getOutput(0));

        // Get tensor dims and broadcast them
        std::copy(inputs[i].dims().begin(), inputs[i].dims().end(), std::back_inserter(dims[i]));
        dims[i].insert(dims[i].cbegin(), op.getOutput(0)->nbDims() - dims[i].size(), int(1));

        // Compute the corresponding strides
        std::vector<int> tensorStrides(dims[i].size());
        int product = 1;
        for (size_t j = dims[i].size(); j > 0; --j) {
            tensorStrides[j - 1] = product;
            product *= dims[i][j - 1];
        }
        strides[i] = tensorStrides;
    }

    /****** Forward ******/
    switch(std::static_pointer_cast<Tensor>(mOp.getRawOutput(0))->dataType()) {
        case DataType::Float64:
            forward_<double>(inputs, dims, strides);
            break;
        case DataType::Float32:
            forward_<float>(inputs, dims, strides);
            break;
        case DataType::Float16:
            forward_<half>(inputs, dims, strides);
            break;
        default:
            AIDGE_THROW_OR_ABORT(std::runtime_error, "Data type is not supported by Backend Cuda");
    }
}
```
Measuring the execution time of each of these blocks gives the following tables:
SubForward:
Code Part | Exec time (ns) | % of execution |
---|---|---|
InputCheck | 600 | 2% |
InputInit | 11350 | 41% |
Forward | 15540 | 57% |
Note: for this measurement, I refactored the code so that the InputInit block only measures the call to refCastFrom, not the stride computations etc.
FCForward:
Code Part | Exec time (ns) | % of execution |
---|---|---|
InputCheck | 720 | 2% |
InputInit | 1350 | 3% |
Forward | 37480 | 95% |
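For completeness, a sketch of the kind of timing wrapper used to isolate the three blocks is shown below. It is illustrative only: the `timeBlockNs` helper is mine, and it assumes the device is synchronized before and after each block so that asynchronous CUDA work is attributed to the right block.

```cpp
#include <chrono>
#include <cstdint>
#include <cuda_runtime.h>

// Illustrative timing helper: synchronize the device, run the block,
// synchronize again, and return the elapsed time in nanoseconds.
template <typename Block>
std::int64_t timeBlockNs(Block&& block) {
    cudaDeviceSynchronize();
    const auto start = std::chrono::steady_clock::now();
    block();
    cudaDeviceSynchronize();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
}

// Usage, inside SubImpl_cuda::forward() (names as in the snippet above):
//   const auto inputCheckNs = timeBlockNs([&] { /* InputCheck block */ });
//   const auto inputInitNs  = timeBlockNs([&] { /* InputInit block  */ });
//   const auto forwardNs    = timeBlockNs([&] { /* Forward block    */ });
```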
After further investigation, it appears that refCastFrom is the only significant call in the InputInit phase, and for a reason I do not understand yet it is about 10 times slower for the Sub kernel than for FC. This function seems to be the bottleneck for further optimization of the Sub/Add kernels.
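To confirm where the time goes, refCastFrom can also be timed in isolation. The helper below is a hypothetical sketch (the template parameters and the usage comment are mine, not part of Aidge); it wraps exactly the call made in the InputInit loop above.

```cpp
#include <chrono>
#include <cuda_runtime.h>

// Illustrative micro-benchmark of a single refCastFrom call; `input`,
// `fallback` and `output` are assumed to be the same objects used in
// SubImpl_cuda::forward() above.
template <typename TensorPtr, typename Fallback, typename Tensor>
long long timeRefCastFromNs(TensorPtr input, Fallback& fallback, const Tensor& output) {
    cudaDeviceSynchronize();
    const auto start = std::chrono::steady_clock::now();
    (void)input->refCastFrom(fallback, output);   // the call under investigation
    cudaDeviceSynchronize();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
}

// Usage, inside the InputInit loop of the forward() above (logging call is illustrative):
//   fmt::print("refCastFrom(input {}): {} ns\n", i,
//              timeRefCastFromNs(op.getInput(i), inputFallbacks[i], *op.getOutput(0)));
```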
Conclusion
- The Sub operator warm-up is very slow.
- A small refactor of the Sub kernel improves performance by 25%.
- An investigation of refCastFrom is required.
- More investigation of the gap between CPU and GPU is required, as the gap between the two backends is still huge despite the optimizations and does not seem to be solvable only by looking at refCastFrom. (Note: I have only looked at kernel computation so far; an investigation of the functions called by the Scheduler may be required to explain the differences between the two backends.)