Slight Output Differences after Integer Cast of Quantized Models
Problem
Current PTQ allows the user to choose the data representation of the output graph. Thus, one can choose to keep the internal values as floats (although all the numbers are actually equal to integers) or to cast them to true integers. In the latter case, if the BitShift option is set to True, the scaling operations of the quantizers are transformed into BitShifts with rounding.
While both choices should be mathematically equivalent, we do observe differences in the outputs between the two cases, which ultimately affect the accuracy of the models.
Explanation
After investigation, it appears that this is caused by a rounding issue. Indeed, the input values of the quantizers are integers, and the quantizer scalings are negative powers of two. Thus, the probability that a scaled value is a halfway case (e.g. 3.5, 7.5 or 2.5) is high. However, the handling of those values differs between fake and true quantization: in fake quantization, the scaled value is rounded by the Round operator, that is, to the nearest even integer. In true quantization, on the other hand, the scaled values are computed by the BitShift operator, which performs bitshift rounding, that is, rounding halfway cases up to the next integer (toward positive infinity).
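The mismatch can be reproduced with a minimal sketch (the values and the shift amount below are purely illustrative): Python's `round()` rounds halfway cases to the nearest even integer, while the add-half-then-shift scheme used for bitshift rounding always rounds them up.

```python
def fake_quant_scale(value: int, shift: int) -> int:
    # Fake quantization: scale by 2**-shift, then round like the Round operator,
    # i.e. halfway cases go to the nearest even integer (banker's rounding).
    return round(value / (1 << shift))

def true_quant_scale(value: int, shift: int) -> int:
    # True quantization: bitshift with rounding, i.e. add half of the divisor
    # before shifting, so halfway cases always round up (toward +infinity).
    return (value + (1 << (shift - 1))) >> shift

for v in (5, 7, 11, 13):  # scaled by 2**-1, these give 2.5, 3.5, 5.5, 6.5
    print(v, fake_quant_scale(v, 1), true_quant_scale(v, 1))

# 5  -> 2 vs 3  (mismatch: 2.5 rounds to the even 2, but shifts up to 3)
# 7  -> 4 vs 4
# 11 -> 6 vs 6
# 13 -> 6 vs 7  (mismatch: 6.5 rounds to the even 6, but shifts up to 7)
```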
Solution
To address this problem, a simple solution is to enrich the Round operator so that it implements the different halfway rounding techniques, that is:
- round halfway cases to the nearest even integer, which is the Python default behaviour and the ONNX-compliant option
- round halfway cases away from zero, which is the C++ STL default behaviour
- round halfway cases up to the next integer, which is consistent with the integer bitshift rounding
When doing fake quantization, one should use the third option (for the reasons explained above), which should resolve the issue.
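As a rough sketch of what the enriched operator could look like (the enum and function names below are hypothetical, not the actual core API), the three halfway rounding modes could be expressed as follows:

```python
import math
from enum import Enum

class HalfwayMode(Enum):
    # Hypothetical names; the actual attribute of the enriched Round operator may differ.
    TO_EVEN = 0         # Python / ONNX default
    AWAY_FROM_ZERO = 1  # C++ STL std::round behaviour
    UPWARD = 2          # consistent with integer bitshift rounding

def round_halfway(x: float, mode: HalfwayMode) -> int:
    if mode is HalfwayMode.TO_EVEN:
        return round(x)  # banker's rounding
    if mode is HalfwayMode.AWAY_FROM_ZERO:
        return int(math.copysign(math.floor(abs(x) + 0.5), x))
    # UPWARD: halfway cases go to the next integer (toward +infinity)
    return math.floor(x + 0.5)

assert round_halfway(2.5, HalfwayMode.TO_EVEN) == 2
assert round_halfway(2.5, HalfwayMode.AWAY_FROM_ZERO) == 3
assert round_halfway(-2.5, HalfwayMode.AWAY_FROM_ZERO) == -3
assert round_halfway(2.5, HalfwayMode.UPWARD) == 3
assert round_halfway(-2.5, HalfwayMode.UPWARD) == -2
```

Selecting the upward mode in the fake-quantized graph would make its rounding behaviour match the BitShift operator of the true-quantized graph.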
Merge Requests
We need at least 3 MRs to solve this issue: one for core, one for backend_cpu, and one for quantization.