The derivative of the ReduceMean operator is incorrect when computing the loss gradient for the MSE and BCE losses.
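Replace target->dims()[0] with target->size() in the computation of the loss gradient, so that the gradient is scaled by the total number of elements in the target rather than by the size of its first dimension only.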