Improve scheduling policy for hybrid parallel/sequential branch resolution
What commit version of aidge do you use
- aidge_core: 4d5711e4
- aidge_export_cpp: 9f86629bbc179698a89b4a85d2bc79a567a04522
Problem description
Related implem #280 (closed) made it possible to choose a compute-speed/memory compromise, as shown in the two examples below. (The model is an encoder made of 1D CNNs that is tiled into multiple slices for sequential computing, reducing the memory peak.)
But by default, full sequential scheduling computes the slices branch after branch, which means the original Identity node's output must be kept in memory during every branch computation. This could be improved even further by computing all the slices first, and then computing the convolutions on those slices, so that the full input vector (the Identity node's output) is no longer needed. This would reduce the size of the block outlined in red in the second figure below.
With 2 slices: ram_peak = 683 kB, time = 25 ms:

With 20 slices: ram_peak = 200 kB, time = 250 (ms?):
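To make the trade-off concrete, one way to reason about it is to simulate the peak of live tensor memory for a candidate schedule. This is a minimal sketch with illustrative names and toy sizes, not the Aidge API:

```python
def peak_memory(schedule, out_size, consumers):
    """Simulate the peak of live tensor memory for a topological schedule.

    schedule  : node names in execution order
    out_size  : dict node -> size of its output tensor
    consumers : dict node -> set of nodes that read its output
    A node's output is allocated when the node runs and freed once all of
    its consumers have run.
    """
    remaining = {n: set(c) for n, c in consumers.items()}
    live, peak = 0, 0
    for node in schedule:
        live += out_size[node]
        peak = max(peak, live)
        # free every tensor whose last consumer just executed
        for producer, cons in remaining.items():
            if node in cons:
                cons.discard(node)
                if not cons:
                    live -= out_size[producer]
    return peak
```

With toy sizes this makes the branch-after-branch vs slices-first comparison concrete; which order wins depends on the actual slice and convolution output sizes, which is exactly why a configurable scheduling policy would help.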
Quick solution proposition
I see two simple yet handy implementations in the scheduling algorithm:
- With a dedicated "force" option for the policy, like `force_n_parallel_nodes=X`, we could manually tune the scheduler so that it runs the first X nodes of each branch in parallel before running the rest sequentially. Or, if not X nodes, just an option to run the first parallel nodes.
- Another solution could be to sum the output_size of every first node of the branches after the fork. If the sum is less than or equal to the size of the output of the root node, then the nodes should be run in parallel, i.e. `if size_after <= size_before: run_the_nodes_in_parallel`. (EDIT: this does not work as-is, because the sum of the slices is bigger than the input's size.) This check could also be repeated for every next level of parallel nodes. This way it should run the `Slice` nodes in parallel, and then run the branches sequentially once it is checked that the parallel CNNs use more memory than the parallel nodes before them.
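The second proposition could be sketched as a per-level size check. All names here are hypothetical helpers, not the Aidge scheduler API, and per the EDIT above, overlapping Slice outputs can already fail the test at the very first level:

```python
def parallel_prefix(root, out_size, children):
    """Return the successive levels of nodes to run in parallel after `root`,
    stopping as soon as a level's summed output size exceeds the previous
    level's (the `size_after <= size_before` check from the proposition).

    root     : name of the fork node
    out_size : dict node -> size of its output tensor
    children : dict node -> list of next nodes along each branch
    """
    levels = []
    size_before = out_size[root]
    level = children.get(root, [])
    while level:
        size_after = sum(out_size[n] for n in level)
        if size_after > size_before:
            break  # parallel execution would grow memory: fall back to sequential
        levels.append(level)
        size_before = size_after
        level = [c for n in level for c in children.get(n, [])]
    return levels
```

On a toy fork where the slice outputs fit under the input size but the convolution outputs do not, this would schedule only the first level in parallel and leave the rest sequential.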