Floating-Point Performance Bug for FPU_*_LAT>1

Component: RTL

Hi, we've identified a performance bug in the decoder/APU dispatcher. If FPU_*_LAT>1, the current implementation stalls each FPU instruction until the previous one completes, even when there are no data hazards between them.

Steps to Reproduce

RTL Configuration

COREV_PULP=0, COREV_CLUSTER=0, FPU=1, ZFINX=0, FPU_ADDMUL_LAT=2, FPU_OTHERS_LAT=2

Software Test

A sequence of independent FMADD (or other FPU) instructions:

    878a:	b0001073          	csrw	mcycle,zero
    878e:	3200f073          	csrc	mcountinhibit,1
    8792:	68e7f043          	fmadd.s	ft0,fa5,fa4,fa3
    8796:	68e7f0c3          	fmadd.s	ft1,fa5,fa4,fa3
    879a:	68e7f143          	fmadd.s	ft2,fa5,fa4,fa3
    879e:	68e7f1c3          	fmadd.s	ft3,fa5,fa4,fa3
    87a2:	68e7f243          	fmadd.s	ft4,fa5,fa4,fa3
    87a6:	68e7f2c3          	fmadd.s	ft5,fa5,fa4,fa3
    87aa:	68e7f343          	fmadd.s	ft6,fa5,fa4,fa3
    87ae:	68e7f3c3          	fmadd.s	ft7,fa5,fa4,fa3
    87b2:	68e7fe43          	fmadd.s	ft8,fa5,fa4,fa3
    87b6:	68e7fec3          	fmadd.s	ft9,fa5,fa4,fa3
    87ba:	01d7a027          	fsw	ft9,0(a5)
    87be:	3200e073          	csrs	mcountinhibit,1
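
As a sanity check that the FMADD sequence really is free of data hazards, the instruction words above can be decoded with the standard R4-type field layout. This is only a sketch: the helper names `decode_r4` and `has_fp_hazard` are hypothetical and not part of any tool in the repository. Note that the trailing `fsw` does read `ft9`, so only the ten `fmadd.s` instructions are claimed to be independent.

```python
# Instruction words copied from the fmadd.s lines of the disassembly above.
FMADD_WORDS = [
    0x68E7F043, 0x68E7F0C3, 0x68E7F143, 0x68E7F1C3, 0x68E7F243,
    0x68E7F2C3, 0x68E7F343, 0x68E7F3C3, 0x68E7FE43, 0x68E7FEC3,
]


def decode_r4(word):
    """Split an R4-type RISC-V instruction word into its register fields."""
    return {
        "opcode": word & 0x7F,       # bits 6:0
        "rd": (word >> 7) & 0x1F,    # bits 11:7
        "rs1": (word >> 15) & 0x1F,  # bits 19:15
        "rs2": (word >> 20) & 0x1F,  # bits 24:20
        "rs3": (word >> 27) & 0x1F,  # bits 31:27
    }


def has_fp_hazard(words):
    """Return True if any instruction reads or rewrites an earlier destination."""
    written = set()
    for word in words:
        fields = decode_r4(word)
        sources = {fields["rs1"], fields["rs2"], fields["rs3"]}
        if sources & written:        # read-after-write
            return True
        if fields["rd"] in written:  # write-after-write
            return True
        written.add(fields["rd"])
    return False


if __name__ == "__main__":
    print("destinations:", [decode_r4(w)["rd"] for w in FMADD_WORDS])
    print("hazard found:", has_fp_hazard(FMADD_WORDS))
```

The destinations decode to f0..f7, f28 and f29 (ft0..ft9), while every instruction reads only f13, f14 and f15 (fa3, fa4, fa5), so no RAW or WAW hazard exists among them.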

Waves

Unnecessary pipeline bubbles:

For comparison, when FPU_ADDMUL_LAT=1, there are no stalls...

Source

The primary (but likely not the only) culprit seems to be that apu_lat (the encoding of the instruction latencies) is only 2 bits wide. For FPU_*_LAT>1, normal FPU instructions get the same encoding as the "max latency/always stall" case used for DIV/SQRT (2'h3).
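
To illustrate the collision, here is a small Python model of a saturating 2-bit latency encoding. It is a sketch, not the RTL: the function name `encode_apu_lat` is hypothetical, and the codes chosen for latencies 0 and 1 are assumptions. The only behavior taken from the observation above is that every latency greater than 1 ends up as 2'h3.

```python
# 2'h3: the "max latency/always stall" code that DIV/SQRT also use.
APU_LAT_ALWAYS_STALL = 0b11


def encode_apu_lat(fpu_lat):
    """Hypothetical model of a 2-bit latency code that saturates at 2'h3."""
    # Assumption: latencies 0 and 1 get their own distinct codes.
    if fpu_lat == 0:
        return 0b01
    if fpu_lat == 1:
        return 0b10
    # Everything deeper collapses into the always-stall code.
    return APU_LAT_ALWAYS_STALL


if __name__ == "__main__":
    for lat in range(5):
        code = encode_apu_lat(lat)
        stalls = code == APU_LAT_ALWAYS_STALL
        print(f"FPU_*_LAT={lat} -> apu_lat=2'h{code:x} always_stall={stalls}")
```

With only two bits there is no spare code to distinguish "pipelined with latency 2 or more" from "always stall", which is why a wider encoding or a separate stall indication would be needed to tell the two cases apart.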

Decoder...

Dispatcher...

Comments

The pipeline details page of the User Manual outlines the expected data hazard stalls, but doesn't mention this non-data-hazard stall at deeper pipeline depths. This text further down is ambiguous: "Floating-Point instructions are dispatched to the FPU. Following instructions can be executed by the Core as long as they are not FPU ones and there are no Read-After-Write or Write-After-Write data hazard between them and the destination register of the outstanding FPU instruction." It's unclear whether this says only that the integer pipeline can execute independent instructions while the FPU is active, or whether it also implies that any further FPU instructions will be stalled.

Given that the FPU is pipelined, either the core RTL should be fixed to utilize that pipeline, or the documentation should be updated to state clearly that only one FPU instruction can be in flight at a time. The latter is critical to know for deeper pipelines: you incur the full latency of every FPU instruction unless you can place enough independent integer instructions in between, which is not usually feasible.
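
The cost difference can be estimated with an idealized cycle model. This is a back-of-the-envelope sketch, not measured data: `latency` is assumed to be the number of cycles from issue to result for one FPU instruction, the function names are hypothetical, and all other core effects (fetch, load/store, writeback contention) are ignored.

```python
def serialized_cycles(n_instr, latency):
    """One FPU instruction in flight at a time: each pays the full latency."""
    return n_instr * latency


def pipelined_cycles(n_instr, latency):
    """Independent instructions issue back to back: latency is paid once."""
    if n_instr == 0:
        return 0
    return latency + (n_instr - 1)


def integer_fillers_needed(n_instr, latency):
    """Independent integer instructions needed to hide the serialized stalls."""
    return max(n_instr - 1, 0) * (latency - 1)


if __name__ == "__main__":
    n, lat = 10, 3  # ten independent FMADDs; a 3-cycle latency is assumed
    print("serialized:", serialized_cycles(n, lat))
    print("pipelined: ", pipelined_cycles(n, lat))
    print("fillers:   ", integer_fillers_needed(n, lat))
```

Under these assumptions the serialized cost grows as the product of instruction count and latency, while the pipelined cost grows as their sum, so the gap widens with every additional pipeline stage.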

More broadly speaking,

Was there any awareness of, or discussion around, fully utilizing the FPU pipeline? Was this intentionally not implemented? If so, why?
Were there discussions around doing performance verification, i.e. verifying that the micro-architecture is not only functionally correct, but also matches the spec in terms of the timing of different instruction sequences? If so, what were the factors and conclusions?