State of AMO support

Created by: ramonwirsch

The documentation says the RISC-V A-extension for atomic instructions is supported, but now that I am attempting to use it, it seems broken. How tested / working is the AMO support right now?

A first look at the code seems like it cannot work, but I have not tested this in-depth and could be wrong on some points.

Issues I have found thus far:

AMO-signals are passed combinatorially from the decoder (even bypassing the issue stage) to the DCache (through ls_unit), where they are fused with whatever address and data that is just arriving out of the load-store-queue. I do not see how this ensures that the address matches the AMO-signals
ls_offset is not 0 for AMO ops in decoder. ls_unit simply adds the offset to the base address for all inputs, even though AMO ops have no offset operand and reuse the same instruction bits for opcode-purposes
AMO operations (like amoadd) are treated as loads and go through the load-queue, not the store-queue. So they could be issued before the respective instruction is retired, unlike normal store operations. Seems wrong. I imagine the read-modify-write & swap operations may need to go through both queues
since AMO ops go only through the load-queue which does not have a data-field, the data_in provided to cache / memory does not match (from the store-queue, might be random/invalid). The data_in is used for the new data in case of amoswap or as an operand in case of amoadd and similar.
local memory ignores any AMO-signals silently.
- While that memory should only be for a single core and thus does not need atomic primitives the same way as main memory, this would require code to wrap every atomic primitive with range checks to switch to a non-primitive implementation for local memory. Although adding this should be easy, as there is only one source of atomic operations for local memory. The primitives would still provide a performance benefit over locking interrupts and doing the read-modify-write transaction with non-atomic instructions. ANd still be helpful for atomic operations in a multi-threaded environment.
clear_reservation input into dcache is documented to be used to reset reservation state on exceptions, but hardwired to 0
why is store-forwarding not used for AMO operations? It should provide the same benefit as it does for store-operations when rs2 is used by the operation
why do the sc_success & sc_complete signals not go through the l1_request & response interfaces? Do synthesis tools not optimize this properly when AMO is disabled?

Am I correct on all of those points? Are there more problems I have not even realized? Is there a design-intent behind the current state that I do not yet see and it is just unfinished or is it so out-of-date it should not be considered at all?

Edit:

Also seems that AMO operations are not marked as using rd, so the result is discarded. Though simply changing this, causes the retire logic to dead lock, since it waits for a writeback, but the store will never start as it is waiting for the retire-signal. I am guessing not marking the use of rd and writing it anyway worked before the renamer was added.