CIF data-based synthesis may get 'stuck' due to repeated GC cycles

Context

I got the Lock Linne model from @lmoormann. This is the one used in the real-world test in April. The idea was to compute the controlled system's state space size, for a presentation later this year. With forward reachability disabled (the default), this number is approximate: it is an upper bound, not an exact count. It can be computed in about 5 seconds. With forward reachability enabled, synthesis took a very long time.

First analysis

I tried running it for a while, but at some point Eclipse ran out of memory due to too much debug output in the console. I then ran it with the debug output saved to a file instead, which ran for over 4 days, until it had written 1.7 TB of debug output and my disk was full. I analyzed the debug output for the transitions that are applied. Saturation starts at the bottom of the BDD, saturates there with all transitions, then moves up a level, saturates that level and all levels below, and so on. I found that in the first GB of output, it reached many new higher levels, but in the remaining 1,699 GB or so, it never again reached a higher level. It somehow got stuck in the lower levels.
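To make the saturation order concrete, here is a small explicit-state sketch (hypothetical code, not the actual BDD-based implementation): transitions are grouped by the topmost level they touch, each level is saturated to a fixpoint together with all levels below it, and only then does the algorithm move up a level.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.IntUnaryOperator;

public class SaturationSketch {
    // Transitions are grouped by the topmost level they touch; in this
    // sketch, level 0 is the bottom. Transitions that do not apply to a
    // state simply return it unchanged.
    static Set<Integer> saturate(Set<Integer> init,
            List<List<IntUnaryOperator>> transitionsPerLevel) {
        Set<Integer> reached = new HashSet<>(init);
        for (int level = 0; level < transitionsPerLevel.size(); level++) {
            boolean changed = true;
            while (changed) { // Fixpoint over this level and all levels below.
                changed = false;
                for (int l = 0; l <= level; l++) {
                    for (IntUnaryOperator t : transitionsPerLevel.get(l)) {
                        for (int s : new ArrayList<>(reached)) {
                            if (reached.add(t.applyAsInt(s))) {
                                changed = true;
                            }
                        }
                    }
                }
            }
        }
        return reached;
    }

    public static void main(String[] args) {
        // Level 0: increment within a block of 4; level 1: jump to the next block.
        IntUnaryOperator incWithinBlock = s -> s % 4 < 3 ? s + 1 : s;
        IntUnaryOperator nextBlock = s -> s < 8 ? s + 4 : s;
        List<List<IntUnaryOperator>> perLevel =
            List.of(List.of(incWithinBlock), List.of(nextBlock));
        Set<Integer> reached = saturate(Set.of(0), perLevel);
        System.out.println(reached.size() + " states reached"); // 12 states: 0..11
    }
}
```

The debug output I analyzed reflects this order: higher levels should keep being reached as saturation moves up, which is exactly what stopped happening.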

I tried to run it with a larger cache (5 times the default), and without any debug output, but this didn't help either. It ran for over 6 days, until my computer rebooted by itself during the night (I think due to a Windows update, which had been asking me to reboot for a few days already).

At the suggestion of @wytseoortwijn, I looked at garbage collection (GC). I enabled GC statistics, and got many lines like this:

BDD pre  garbage collection: #34,072,             0 of       100,003 nodes free
BDD post garbage collection: #34,072,        57,884 of       100,003 nodes free,             5 ms,       178,275 ms total
BDD pre  garbage collection: #34,073,             0 of       100,003 nodes free
BDD post garbage collection: #34,073,        57,242 of       100,003 nodes free,             6 ms,       178,281 ms total
BDD pre  garbage collection: #34,074,             0 of       100,003 nodes free
BDD post garbage collection: #34,074,        57,936 of       100,003 nodes free,             6 ms,       178,287 ms total
BDD pre  garbage collection: #34,075,             0 of       100,003 nodes free
BDD post garbage collection: #34,075,        57,499 of       100,003 nodes free,             8 ms,       178,295 ms total

It seems to get 'stuck' in GC cycles. It performs a GC to free up space to proceed. Doing a GC in JavaBDD 'resets' the operation cache, effectively emptying it. It then seems to do some more computations, likely recomputing things it already knew but, due to the cleared cache, needs to recompute. It then runs out of free space again, performs another GC, clears the cache again, and so on. It may well be doing almost nothing but re-computations of things it already knew at some point. This would mean it was effectively running without a cache, which is known to lead to severe performance issues, as caching is absolutely essential for good performance of BDD operations.
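Why running without a cache is so costly can be shown with a toy sketch (hypothetical, not JavaBDD code): on a maximally shared structure, a memoizing cache turns an exponential number of recursive calls into a linear one, which is what BDD operation caches achieve on shared BDD nodes.

```java
import java.util.HashSet;
import java.util.Set;

public class BddCacheDemo {
    static long calls;

    // No cache: both children of every 'node' are explored again, so the
    // number of calls grows exponentially with the depth.
    static void traverse(int level, int depth) {
        calls++;
        if (level == depth) return;
        traverse(level + 1, depth);
        traverse(level + 1, depth);
    }

    // With a cache (here simply the set of already-computed levels): each
    // level is computed once; later visits are cache hits and return at once.
    static void traverseCached(int level, int depth, Set<Integer> cache) {
        calls++;
        if (level == depth || !cache.add(level)) return;
        traverseCached(level + 1, depth, cache);
        traverseCached(level + 1, depth, cache);
    }

    public static void main(String[] args) {
        int depth = 25;
        calls = 0;
        traverse(0, depth);
        System.out.println("without cache: " + calls + " calls"); // 2^26 - 1 = 67,108,863
        calls = 0;
        traverseCached(0, depth, new HashSet<>());
        System.out.println("with cache:    " + calls + " calls"); // 2 * 25 + 1 = 51
    }
}
```

If every GC empties the cache, the computation is repeatedly pushed back toward the exponential case.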

Manual solution

I set --bdd-table=1000000, i.e., 1 million entries as the initial node table size, 10 times the default. Synthesis then completed (with forward reachability enabled) in about 5 seconds, so in about the same time as with forward reachability disabled and only backward reachability enabled.

Completed improvements

  1. Disable JavaBDD cache flushing on GC (!1356 (merged))

    I tried disabling 'flushing' the cache on GC, using this change to JavaBDD's JFactory class:

    -    public static boolean FLUSH_CACHE_ON_GC = true;
    +    public static boolean FLUSH_CACHE_ON_GC = false;

    It then doesn't clear the cache, but instead tries to keep all valid entries (if I understand the code correctly). This doesn't seem to help: I ran it for over 17 minutes, and it hadn't finished yet. I don't know whether it would ultimately have been faster, as I don't know how far synthesis got in a given amount of time with the flag set to true, nor with it set to false. Further analysis is likely needed to understand why it doesn't help. This is a logical change to make, and probably one of the ingredients required to solve this issue, as keeping cache entries prevents unnecessary re-computations. However, this change doesn't by itself solve the issue. Note that JavaDD (#1000) already preserves the still-valid cache entries on GC.
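Keeping valid entries rather than flushing could look roughly like the following (a hypothetical sketch, not JavaBDD's or JavaDD's actual code): after a GC, an operation cache entry remains usable only if its operand and result nodes all survived.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class CacheSweepSketch {
    // A cache entry records two operand node indices and a result node index.
    // After a GC, an entry stays valid only if all three nodes survived.
    static List<int[]> sweep(List<int[]> cache, Set<Integer> liveNodes) {
        List<int[]> kept = new ArrayList<>();
        for (int[] e : cache) {
            if (liveNodes.contains(e[0]) && liveNodes.contains(e[1])
                    && liveNodes.contains(e[2])) {
                kept.add(e);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<int[]> cache = List.of(new int[] {1, 2, 3}, new int[] {1, 4, 5});
        Set<Integer> live = Set.of(1, 2, 3, 5); // Node 4 was freed by the GC.
        System.out.println(sweep(cache, live).size() + " of 2 entries kept"); // 1 of 2
    }
}
```

Note that even with such a sweep, the entries for nodes that were freed are lost for good, which may be why this change alone didn't help here.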

Original ideas for full solution

The question is then how to prevent this problem in the future. Obviously, users can enable GC statistics, see that synthesis is stuck, and increase the initial node table size, as I did. However, it would be better if it just worked well out of the box. The following would seem to be needed to fully solve the problem:

  1. Detect getting stuck and fix it: We could for instance monitor GC, and detect that it gets 'stuck'. We could then, perhaps, proactively increase the node table size.

  2. Allow for more memory (beyond Java array limits), to allow increasing the node table to a large enough size for larger models. JavaDD (#1000) aims to support this.
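Idea (1) could be sketched as follows (hypothetical code; the class name, thresholds, and growth policy are assumptions, not an existing CIF or JavaBDD API). It flags 'thrashing' when several GCs occur in rapid succession, as in the statistics above, where consecutive GCs were only a few milliseconds apart:

```java
public class GcThrashMonitor {
    private final long minIntervalMillis; // GCs closer together than this are suspicious.
    private final int threshold;          // How many such GCs in a row to tolerate.
    private long lastGcMillis = Long.MIN_VALUE;
    private int rapidGcs;

    public GcThrashMonitor(long minIntervalMillis, int threshold) {
        this.minIntervalMillis = minIntervalMillis;
        this.threshold = threshold;
    }

    /** Report one GC; returns true if the node table should be grown. */
    public boolean onGc(long nowMillis) {
        boolean rapid = lastGcMillis != Long.MIN_VALUE
                && nowMillis - lastGcMillis < minIntervalMillis;
        lastGcMillis = nowMillis;
        rapidGcs = rapid ? rapidGcs + 1 : 0;
        if (rapidGcs >= threshold) {
            rapidGcs = 0;
            return true; // The caller could then, e.g., double the node table size.
        }
        return false;
    }

    public static void main(String[] args) {
        GcThrashMonitor monitor = new GcThrashMonitor(100, 3);
        long[] gcTimes = {0, 10, 20, 30}; // GCs only ~10 ms apart, as in the logs.
        for (long t : gcTimes) {
            if (monitor.onGc(t)) {
                System.out.println("GC thrashing detected at t=" + t);
            }
        }
    }
}
```

The right thresholds would need tuning; the GC statistics above suggest the interval between GCs is a better signal than the fraction of nodes freed, since each GC here did free over half the table.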

Switching to JavaDD (see #1000) could help, as it already does some things better than JavaBDD. However, it likely still has the same issue (it does preserve the cache entries for still-used nodes on GC, but presumably could also get stuck). Beyond these, other performance improvements could help, but likely don't fundamentally solve this issue.

Further investigation is needed

I tried out original idea number (2), and it turns out that it is not sufficient:

  • Starting with a large node table from the start differs from (rather quickly) increasing the node table size during synthesis. I conjecture this is partly a matter of luck. If you start with a large node table, everything fits in it, or at least GCs happen less often. You could then get lucky, in that for some high level of a BDD the cached result is (still) available. If you start with a smaller node table, you could be unlucky, and get stuck at much lower levels, levels you wouldn't have fallen back to with a larger node table.
  • In general, it seems that if 'important' cache entries are lost, you could get stuck in lower levels.
  • It could be that we need smarter caching, but more analysis is required.
Edited by Dennis Hendriks