Enable choosing the weights layout for easy use of intrinsics
Problem description
In the cpp export, the fully connected layer's kernel says:
// Warning, there is a trick here !
// To use this kernel, the inputs have to be in NHWC and the weights are in NCHW
// It is only an issue if the FC was after a flatten layer.
// Otherwise it is not an issue for the other FC because CHANNELS_WIDTH = CHANNELS_HEIGHT = 1
As a result, the loops are complex and the weights are not accessed in memory order:
for (int ch = 0; ch < NB_CHANNELS; ++ch) {
    weightedSum += inputs[CHANNELS_WIDTH*NB_CHANNELS*iy + NB_CHANNELS*ix + ch]
        * weights[CHANNELS_HEIGHT*CHANNELS_WIDTH*NB_CHANNELS*och + CHANNELS_HEIGHT*CHANNELS_WIDTH*ch + CHANNELS_HEIGHT*iy + ix];
}
This prevents the use of highly optimized intrinsics on ARM architectures (or others), which rely on vector types such as float32x4_t and dedicated load instructions such as vld1q_f32 to accelerate the dot-product computation.
If the weights' layout were NHWC, as it is in the convolution kernel, these optimized instructions could be used efficiently; it would therefore be useful to be able to choose the layout.
Edited by Fabrice Auzanneau