Add looser constraints for memory instructions
Created by: moimfeld
Issue
A core that cannot throw memory exceptions loses clock cycles because it has to stall its pipeline due to overly tight rules defined by the core-v-xif. This loss is unnecessary and can make the core-v-xif slower than interfaces that take advantage of the fact that certain cores cannot throw memory exceptions.
The following rules prevent cores from continuing to run:
In particular, an instruction may not be offloaded if there is a memory operation pending in either the core's pipeline or any of the connected accelerator units.
Neither the offloading core nor any of the connected accelerators may commit any new instruction results while synchronous memory operations are being served.
Solution
Change the specification of the core-v-xif to allow a core that cannot throw memory exceptions to continue running. Such a core shall only stall its pipeline when it encounters a core-internal memory instruction while an offloaded memory instruction is still pending. This prevents write-after-write hazards.
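As a rough illustration of the difference, the two stall conditions can be written as predicates. This is a minimal sketch, not code from the commit: `IssueState` and both function names are hypothetical, and the first predicate models only the offload restriction quoted in the Issue section, not the restriction on committing results.

```python
from dataclasses import dataclass


@dataclass
class IssueState:
    """Hypothetical snapshot of what the core sees when issuing one instruction."""
    is_memory_op: bool          # the instruction accesses memory
    is_offloaded: bool          # the instruction is sent over the interface
    pending_core_mem: int       # core-internal memory operations still in flight
    pending_offloaded_mem: int  # offloaded memory operations not yet completed


def stall_current_rule(s: IssueState) -> bool:
    # Assumed reading of the existing rule: nothing may be offloaded while
    # any memory operation is pending in the core or in an accelerator.
    any_mem_pending = (s.pending_core_mem + s.pending_offloaded_mem) > 0
    return s.is_offloaded and any_mem_pending


def stall_proposed_rule(s: IssueState) -> bool:
    # Proposed rule for cores that cannot throw memory exceptions: stall only
    # on a core-internal memory instruction while an offloaded one is pending.
    core_internal_mem = s.is_memory_op and not s.is_offloaded
    return core_internal_mem and s.pending_offloaded_mem > 0
```

For example, a second offloaded store issued while the first is still pending stalls under the first predicate but not under the second. A core-internal load issued in the same situation stalls under the proposed rule.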
Allowing cores to offload multiple memory instructions makes it possible for an accelerator subsystem to offload multiple memory instructions back to the core. I forbid this (see line 151 in the commit), because the RTL of the core-v-xif cannot handle multiple memory instructions being offloaded back to the core.
This solution improves the latency of the core-v-xif for cores that cannot throw memory exceptions, but it is still not optimal.
Remaining Bottleneck
Preventing accelerator subsystems from offloading more than one memory instruction back to the core at a time can still cause unnecessary stalls in the core.
Picture the following program running on a core that is connected to an FPU subsystem via the x-interface (the core cannot throw memory exceptions and implements the x-interface according to the changes I proposed):
fsw ft0, 0(s0)
fsw ft1, 4(s0)
lw a2, 8(s0)
The core offloads the first fsw in the first clock cycle, and the FPU subsystem instantly offloads it back to the core. The core then offloads the second fsw in the second clock cycle. This one cannot be instantly offloaded back to the core, since the FPU subsystem has to wait until the first fsw has completed (this causes a stall in the FPU subsystem, but not in the core). In the third clock cycle the core encounters the lw and therefore has to wait until no offloaded memory instructions remain to be served. The core will therefore be stalled for at least 3 clock cycles (depending on how long the memory access takes).
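The walkthrough above can be replayed with a small cycle-counting toy model. This is only a sketch under assumptions of mine, not a model of the actual RTL: the core issues one instruction per cycle, an offloaded memory instruction is handed back in the same cycle if no other one is outstanding, and every memory access takes a fixed `mem_latency` cycles. `count_core_stalls` and `PROGRAM` are hypothetical names. The stall count depends entirely on these timing assumptions and is not meant to reproduce the figure quoted above.

```python
from collections import deque

# (mnemonic, is_offloaded, is_memory) for the three-instruction example
PROGRAM = [
    ("fsw ft0, 0(s0)", True, True),
    ("fsw ft1, 4(s0)", True, True),
    ("lw a2, 8(s0)", False, True),
]


def count_core_stalls(program, mem_latency):
    """Return how many cycles the core spends stalled in this toy model."""
    waiting = deque()  # offloaded memory ops held in the accelerator
    remaining = 0      # cycles left for the one op offloaded back (0 = none)
    pc = 0
    stalls = 0
    while pc < len(program):
        # 1. The memory access that was offloaded back makes progress.
        if remaining > 0:
            remaining -= 1
        # 2. The core tries to issue the next instruction.
        _, is_offloaded, is_memory = program[pc]
        offloaded_mem_pending = bool(waiting) or remaining > 0
        if is_memory and not is_offloaded and offloaded_mem_pending:
            stalls += 1  # core-internal memory op must wait
        else:
            if is_offloaded and is_memory:
                waiting.append(program[pc])
            pc += 1
        # 3. The accelerator hands back at most one memory op at a time.
        if waiting and remaining == 0:
            waiting.popleft()
            remaining = mem_latency
    return stalls
```

Calling `count_core_stalls(PROGRAM, mem_latency)` with different latencies shows how the stall on the lw grows with the memory access time, because the second fsw can only start once the first has finished.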
A core that fully exploits the fact that it cannot throw memory exceptions would not encounter any stalls for this program.
These stalls could also be resolved by changing the RTL of the core-v-xif and subsequently allowing accelerator subsystems to offload multiple memory instructions back to the core. I will not discuss the solution to this problem in detail here.