Enable choosing the weights layout for easy use of intrinsics
Problem description
In the cpp export, the fully connected layer's kernel says:
// Warning, there is a trick here !
// To use this kernel, the inputs have to be in NHWC and the weights are in NCHW
// It is only an issue if the FC was after a flatten layer.
// Otherwise it is not an issue for the other FC because CHANNELS_WIDTH = CHANNELS_HEIGHT = 1
As a result, the loops are complex and the weights are not accessed in memory order:
for (int ch = 0; ch < NB_CHANNELS; ++ch) {
    weightedSum += inputs[CHANNELS_WIDTH*NB_CHANNELS*iy + NB_CHANNELS*ix + ch]
        * weights[CHANNELS_HEIGHT*CHANNELS_WIDTH*NB_CHANNELS*och + CHANNELS_HEIGHT*CHANNELS_WIDTH*ch + CHANNELS_HEIGHT*iy + ix];
}
This prevents the use of highly optimized intrinsics on ARM architectures (or others), which rely on vector types such as float32x4_t and dedicated load instructions such as vld1q_f32 to accelerate the dot-product computation.
If the weights' layout were NHWC, as it is in the convolution kernel, these optimized instructions could be used efficiently; it would therefore be useful to be able to choose the layout.
Edited by Fabrice Auzanneau