CIF data-based synthesis: variable ordering may be slow
I applied data-based synthesis to the following CIF model anonymous.cif, and noticed that it takes ~8 minutes before synthesis even start with reachability computations. I looked a bit deeper, and found that the bottleneck is in VarOrderHelper#createGraph, in particular in the three nested for-loops at the top of that method to populate Map<Pair<Integer, Integer>, Integer> graphEdges. By default data-based synthesis executes createGraph four times, and the times it took to populate graphEdges is:
- First: 88931.2296ms
- Second: 137134.374ms
- Third: 88294.367101ms
- Fourth: 142364.6363ms
I looked further into why this takes so long. It seems that graphEdges.merge(pair(i, j), 1, (a, b) -> a + b) uses the hash function of pair(i, j) internally, to efficiently store the keys of the mapping. And the type of keys is Pair<Integer, Integer>. Now it seems that Pair#hashCode has a very poor hash distribution when dealing with Integers, since their hash values are the integer values themselves. As a result, graphEdges.merge has to deal with many hash collisions, which means that it effectively becomes a linear-time operation instead of a contant-time one. (Other map operations, like put, would have the same problem.)
A solution could be to improve Pair#hashCode. I changed it to:
@Override
public int hashCode() {
long bits = Objects.hashCode(left);
bits ^= (long)Objects.hashCode(right) * 31;
return (((int)bits) ^ ((int)(bits >> 32)));
}
which I adapted from java.awt.geom.Point2D#hashCode. With that, the times it took to populate graphEdges became:
- First: 8165.9029ms
- Second: 8903.2127ms
- Third: 8521.618201ms
- Fourth: 9969.175401ms
Which is a 10x improvement for this example. Another solution would be to not use Pair<Integer, Integer> for the keys, but some other, locally-defined data type with a good hash function.