Evaluate inference speed with CUDA

Measure the inference time of some operators with backend_cuda. One possible optimization is to reduce the work done in the forward operation, for example by moving some of it into the operator's constructor.
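As a rough illustration of the idea, here is a minimal CPU-only sketch in Python. It is not the backend_cuda API: the class names, the lookup table, and the `mean_forward_time` helper are all hypothetical and only show the pattern of hoisting input-independent setup out of `forward` and timing the result.

```python
import time


class SetupInForward:
    """Hypothetical operator that rebuilds its table on every forward call."""

    def __init__(self, size):
        self.size = size
        self.setup_calls = 0  # counts how often the setup work runs

    def _build_table(self):
        # Stand-in for setup work that does not depend on the input
        # (descriptor creation, workspace allocation, and so on).
        self.setup_calls += 1
        return [i * 0.5 for i in range(self.size)]

    def forward(self, x):
        table = self._build_table()  # repeated on each call
        return [v * t for v, t in zip(x, table)]


class SetupInConstructor(SetupInForward):
    """Same hypothetical operator, with the setup done once at construction."""

    def __init__(self, size):
        super().__init__(size)
        self.table = self._build_table()  # done a single time

    def forward(self, x):
        return [v * t for v, t in zip(x, self.table)]


def mean_forward_time(op, x, warmup=3, runs=20):
    """Return the mean wall-clock time of op.forward(x) in seconds."""
    for _ in range(warmup):  # warm-up calls are excluded from the timing
        op.forward(x)
    start = time.perf_counter()
    for _ in range(runs):
        op.forward(x)
    return (time.perf_counter() - start) / runs


if __name__ == "__main__":
    x = [1.0] * 10_000
    for op in (SetupInForward(len(x)), SetupInConstructor(len(x))):
        seconds = mean_forward_time(op, x)
        print(f"{type(op).__name__}: {seconds * 1e6:.1f} us per forward")
```

When timing real CUDA operators, kernel launches are asynchronous, so the device has to be synchronized (for example with `cudaDeviceSynchronize`) before the clock is read; otherwise only the launch overhead is measured.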