Overflow when converting to true quantize
When converting my model from "fake" quantize to "true" quantize, I noticed an overflow error.
One of the values in my weight tensor changed from 128 to -128. This is an overflow: the int8 range is [-128, 127], but the scaling maps my data to the [-128, 128] range.
This error is due either to an incorrect computation of the scaling factor or to missing clipping of the parameters.
I see two ways to fix this issue:
- Scale values to [-(2^{N-1} - 1), 2^{N-1} - 1] (with N being the number of bits to quantize to).
- Scale to [-2^{N-1}, 2^{N-1}] but truncate with a Clip.
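For reference, a minimal NumPy sketch of both options (the function name and shape are hypothetical, not taken from the actual codebase): the scale is computed against 2^{N-1} - 1 so the largest magnitude lands on 127 rather than 128, and an optional defensive clip guards against rounding pushing a value out of range.

```python
import numpy as np

def quantize(w, n_bits=8, clip=True):
    """Hypothetical sketch: quantize a float tensor to signed n-bit ints."""
    qmin, qmax = -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1
    # Fix 1: divide by qmax (127 for int8), not 2^{N-1} (128),
    # so the max-magnitude value maps inside the representable range.
    scale = np.abs(w).max() / qmax
    q = np.round(w / scale)
    if clip:
        # Fix 2: clip to [-2^{N-1}, 2^{N-1} - 1] so no value can overflow.
        q = np.clip(q, qmin, qmax)
    return q.astype(np.int8), scale

q, scale = quantize(np.array([-1.0, 0.5, 1.0]))
```

Either fix alone closes the overflow; doing both costs little and keeps the clip as a safety net if the scale computation regresses.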