[Bug] A Dummy Pass is Required
Description
I made the empirical observation, on several setups, that training systematically stalls unless a call to forward() + backward() is performed once before the training loop. If this extra pass is done, we get the expected learning behaviour.
For now, the analysis of internal values and gradients is not conclusive: nothing obvious comes up, which is very strange. It does show, however, that the values diverge progressively (not abruptly) toward infinity and eventually become NaN.
I'm actively investigating this issue at the moment. For the record, here is the body of my code:
# DUMMY PASS (forward + backward once before training, required to avoid the stall)
x, _ = get_batch_pair(BATCH_SIZE, True)
scheduler.forward(True, [x])
scheduler.backward()

# TRAINING LOOP
for it in range(NB_IT):
    x, y_h = get_batch_pair(BATCH_SIZE, True)
    x.grad()
    scheduler.forward(True, [x])
    y = get_output_tensor()
    l = loss(y, y_h)
    scheduler.backward()
    optimizer.update()
    optimizer.reset_grad(classifier)
Note that calling reset_grad() before the forward() or after the update() does not change the observed behaviour.
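In the meantime, a per-iteration sanity check helps pinpoint the iteration at which values blow up. This is only a sketch: it assumes the output tensor can be read back as a NumPy array through a hypothetical to_numpy() helper, which is not part of the code above and whose real name may differ.
import numpy as np

def check_finite(name, values):
    # Report when values stop being finite (inf or NaN) and how large they got.
    if not np.all(np.isfinite(values)):
        print(f"{name}: non-finite values (max abs = {np.nanmax(np.abs(values))})")
        return False
    return True

# Usage inside the training loop, right after scheduler.forward(True, [x]):
#   out = to_numpy(get_output_tensor())   # hypothetical accessor
#   if not check_finite(f"it {it} / output", out):
#       break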