Floating-Point Performance Bug for FPU_*_LAT>1
Component: RTL
Hi, we've identified a performance bug in the decoder/APU dispatcher. If FPU_*_LAT>1, the current implementation stalls each FPU instruction until the previous one completes, even when there are no data hazards between them.
Steps to Reproduce
RTL Configuration
COREV_PULP=0, COREV_CLUSTER=0, FPU=1, ZFINX=0, FPU_ADDMUL_LAT=2, FPU_OTHERS_LAT=2
Software Test
A sequence of independent FMADD (or other FPU) instructions:
878a: b0001073 csrw mcycle,zero
878e: 3200f073 csrc mcountinhibit,1
8792: 68e7f043 fmadd.s ft0,fa5,fa4,fa3
8796: 68e7f0c3 fmadd.s ft1,fa5,fa4,fa3
879a: 68e7f143 fmadd.s ft2,fa5,fa4,fa3
879e: 68e7f1c3 fmadd.s ft3,fa5,fa4,fa3
87a2: 68e7f243 fmadd.s ft4,fa5,fa4,fa3
87a6: 68e7f2c3 fmadd.s ft5,fa5,fa4,fa3
87aa: 68e7f343 fmadd.s ft6,fa5,fa4,fa3
87ae: 68e7f3c3 fmadd.s ft7,fa5,fa4,fa3
87b2: 68e7fe43 fmadd.s ft8,fa5,fa4,fa3
87b6: 68e7fec3 fmadd.s ft9,fa5,fa4,fa3
87ba: 01d7a027 fsw ft9,0(a5)
87be: 3200e073 csrs mcountinhibit,1
Waves
For comparison, when FPU_ADDMUL_LAT=1, there are no stalls...
Source
The primary (but likely not only) culprit seems to be that apu_lat (the encoding of the instruction latencies) is only 2 bits wide, and for FPU_*_LAT>1, the encoding for normal FPU instructions is the same as the "max latency/always stall" encoding for DIV/SQRT (2'h3).
Decoder...
Dispatcher...
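To make the encoding collision described above concrete, here is a minimal Python sketch. It is not taken from the RTL: the mapping in `encode_apu_lat` (latency plus one, saturated to the 2-bit maximum) and all function names are assumptions, chosen only to be consistent with the description that FPU_*_LAT>1 ends up at 2'h3.

```python
# Illustrative model only. The mapping below is an ASSUMPTION chosen to match
# the issue description; it is not a copy of the decoder RTL.
APU_LAT_WIDTH = 2
# All-ones code (2'h3), which the report says is the "always stall" encoding
# used for DIV/SQRT.
ALWAYS_STALL = (1 << APU_LAT_WIDTH) - 1


def encode_apu_lat(fpu_lat: int) -> int:
    """Hypothetical encoding: latency + 1, saturated to the 2-bit maximum."""
    return min(fpu_lat + 1, ALWAYS_STALL)


def collides_with_always_stall(fpu_lat: int) -> bool:
    """True when a normal FPU op receives the same code as DIV/SQRT."""
    return encode_apu_lat(fpu_lat) == ALWAYS_STALL


if __name__ == "__main__":
    # Print latency parameter, 2-bit code, and whether it collides.
    for lat in range(4):
        code = format(encode_apu_lat(lat), "02b")
        print(f"FPU_*_LAT={lat} -> apu_lat=2'b{code} "
              f"collides={collides_with_always_stall(lat)}")
```

Under this assumed mapping, a 2-bit field has no code left to distinguish "latency 2" from "always stall", so any fix would probably need either a wider apu_lat or a separate signal for the DIV/SQRT case.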
Comments
The pipeline details page of the User Manual outlines the expected data hazard stalls, but doesn't mention this non-data-hazard stall at deeper pipeline depths. The following text further down the page is ambiguous: "Floating-Point instructions are dispatched to the FPU. Following instructions can be executed by the Core as long as they are not FPU ones and there are no Read-After-Write or Write-After-Write data hazard between them and the destination register of the outstanding FPU instruction." It's unclear whether this says only that the integer pipeline can execute independent instructions while the FPU is active, or whether it also implies that any further FPU instructions will be stalled.
Given that the FPU is pipelined, either the core RTL should be fixed to utilize that pipeline, or the documentation should be updated to state clearly that only one FPU instruction can be in flight at a time. The latter is critical to know for deeper pipelines: you incur the full latency of every FPU instruction unless you can hide enough integer instructions in between, which is usually not feasible.
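The cost of serializing can be estimated with a rough cycle model. The sketch below is an abstraction, not a measurement: `latency` means cycles from issue to result (not necessarily equal to the FPU_*_LAT parameter), the core is assumed to issue at most one instruction per cycle, and fetch, write-back, and other stalls are ignored. All function names are hypothetical.

```python
def serialized_cycles(n_ops: int, latency: int) -> int:
    """One FPU op in flight at a time: every op pays its full latency."""
    return n_ops * latency


def pipelined_cycles(n_ops: int, latency: int) -> int:
    """Ideal pipeline with one issue per cycle: fill once, then one result per cycle."""
    if n_ops == 0:
        return 0
    return latency + (n_ops - 1)


def filler_slots_per_op(latency: int) -> int:
    """Independent integer instructions needed after each FPU op to hide
    its latency when FPU ops are serialized (single-issue assumption)."""
    return max(latency - 1, 0)


if __name__ == "__main__":
    n, lat = 10, 3  # e.g. ten independent fmadd.s, assumed 3-cycle latency
    print("serialized:", serialized_cycles(n, lat))
    print("pipelined: ", pipelined_cycles(n, lat))
    print("integer fillers needed per FPU op:", filler_slots_per_op(lat))
```

In this model the serialized cost grows as n × latency while the pipelined cost grows as n + latency − 1, so the gap widens with every extra pipeline stage.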
More broadly speaking,
- Was there any awareness of, or discussion around, fully utilizing the FPU pipeline? Was this intentionally not implemented? If so, why?
- Was there any discussion around performance verification, i.e. verifying that the micro-architecture is not only functionally correct but also matches the spec in terms of the timing of different instruction sequences? If so, what were the factors and conclusions?