Improve scheduling policy for hybrid parallel/sequential branch resolution
What commit version of aidge do you use
- aidge_core: 4d5711e4
- aidge_export_cpp: 9f86629bbc179698a89b4a85d2bc79a567a04522
Problem description
Related implem #280 (closed) made it possible to choose a compute-speed/memory compromise, as shown in the two examples below. (The model is an encoder made of 1D CNNs that is tiled into multiple slices for sequential computing, reducing the memory peak.)
But by default, full sequential scheduling computes the slices branch after branch, which means the original Identity node's output must be kept in memory during every branch computation. This could be improved even further by computing all the slices first, and then computing the convolutions on those slices, so that the full input vector (the Identity node's output) is no longer needed. This would reduce the size of the block outlined in red in the second figure below.
With 2 slices: ram_peak = 683 kB, time = 25 ms:

With 20 slices: ram_peak = 200 kB, time = 250 (ms?):
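To make the trade-off concrete, one way to reason about it is to simulate the peak of live tensor memory for a candidate schedule. This is a minimal sketch with illustrative names and toy sizes, not the Aidge API:

```python
def peak_memory(schedule, out_size, consumers):
    """Simulate the peak of live tensor memory for a topological schedule.

    schedule  : node names in execution order
    out_size  : dict node -> size of its output tensor
    consumers : dict node -> set of nodes that read its output
    A node's output is allocated when the node runs and freed once all of
    its consumers have run.
    """
    remaining = {n: set(c) for n, c in consumers.items()}
    live, peak = 0, 0
    for node in schedule:
        live += out_size[node]
        peak = max(peak, live)
        # free every tensor whose last consumer just executed
        for producer, cons in remaining.items():
            if node in cons:
                cons.discard(node)
                if not cons:
                    live -= out_size[producer]
    return peak
```

With toy sizes this makes the branch-after-branch vs slices-first comparison concrete; which order wins depends on the actual slice and convolution output sizes, which is exactly why a configurable scheduling policy would help.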
Quick solution proposition
I see two simple yet handy implementations in the scheduling algorithm:
- With a dedicated "force" option for the policy, like `force_n_parallel_nodes=X`, we could manually tune the scheduler so that it runs the first X nodes of each branch in parallel before running the rest sequentially. Or, if not X nodes, just an option to run the first parallel nodes.
- Another solution could be to sum the output_size of every first node of the branches after the fork. If the sum is less than or equal to the size of the output of the root node, then the nodes should be run in parallel, i.e. `if size_after <= size_before: run_the_nodes_in_parallel`. (EDIT: this does not work as-is, because the sum of the slices is bigger than the input's size.) This check could also be repeated for every next level of parallel nodes. This way it should run the `Slice` nodes in parallel, and then run the branches sequentially once it is checked that the parallel CNNs use more memory than the parallel nodes before them.
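The second proposition could be sketched as a per-level size check. All names here are hypothetical helpers, not the Aidge scheduler API, and per the EDIT above, overlapping Slice outputs can already fail the test at the very first level:

```python
def parallel_prefix(root, out_size, children):
    """Return the successive levels of nodes to run in parallel after `root`,
    stopping as soon as a level's summed output size exceeds the previous
    level's (the `size_after <= size_before` check from the proposition).

    root     : name of the fork node
    out_size : dict node -> size of its output tensor
    children : dict node -> list of next nodes along each branch
    """
    levels = []
    size_before = out_size[root]
    level = children.get(root, [])
    while level:
        size_after = sum(out_size[n] for n in level)
        if size_after > size_before:
            break  # parallel execution would grow memory: fall back to sequential
        levels.append(level)
        size_before = size_after
        level = [c for n in level for c in children.get(n, [])]
    return levels
```

On a toy fork where the slice outputs fit under the input size but the convolution outputs do not, this would schedule only the first level in parallel and leave the rest sequential.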