Arjan b obi fix base
Created by: Silabs-ArjanB
This pull request makes the prefetch unit and the load/store unit OBI-compatible. The external bus interface pins remain unchanged, but internally certain unwanted combinatorial paths have been removed.
This pull request fixes the following issues: #126 (closed), #127 (closed), #128 (closed), #129 (closed), #318 (closed), #80 (closed), #50 (closed), #176 (closed). To preserve backward compatibility for the ETH Zurich PULP team, the fixes for #127 (closed) and #128 (closed) can be undone via the newly introduced localparam PULP_OBI. This pull request also removes the 128-bit wide fetch interface (as agreed in https://github.com/openhwgroup/core-v-docs/blob/master/cores/cv32e40p/CV32E40P_and%20CV32E40_Features_Parameters.pdf), so #17 (closed), #125 (closed), and #257 (closed) can also be closed once this pull request is accepted.
Local verification was done with the following tests:
- firmware-vsim-run (i.e. firmware test cases, RISC-V compliance tests)
- custom-vsim-run (with and without random wait states)
- interrupt-vsim-run (interrupt test case)
- locally modified (not-pushed) version of firmware-vsim-run with added random wait state support and locally fixed/enhanced versions of I-MISALIGN_LDST-01.S, I-SB-01.S, I-SH-01.S
Synthesis confirms that the unwanted combinatorial paths have been removed, so the achievable system frequency will improve. OBI compliance has been checked with formal verification (the assertions for this will be contributed later).
Cycle count remained identical for firmware-vsim-run and custom-vsim-run when run without wait states. With random wait states (max 10 wait states on both gnt and rvalid), cycle count was reduced by 8% and 14% for firmware-vsim-run and custom-vsim-run respectively. This improvement is welcome, but it does not reflect a realistic scenario, as the number of wait states is unrealistically high. In principle the fixes can have a negative impact on branch performance, as explained in #128 (closed); the overall cycle count is nevertheless reduced because waits for gnt and waits for rvalid (for different transactions) can now occur in parallel, instead of sequentially as in the original design.
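The effect of overlapping gnt and rvalid waits can be illustrated with a toy cycle-count model. This is a minimal sketch and is not derived from the RTL: the function name `total_cycles`, the `overlap` flag, and the timing rules (at best one grant per cycle, in-order responses, no limit on outstanding transactions) are all assumptions made for illustration only.

```python
def total_cycles(gnt_waits, rvalid_waits, overlap):
    """Toy cycle count for a sequence of bus transactions (illustrative only).

    gnt_waits[i]    -- wait states before transaction i is granted
    rvalid_waits[i] -- wait states before transaction i's response arrives
    overlap         -- True:  the next request may be presented in the cycle
                              after the previous grant (assumed new behaviour)
                       False: the next request is held back until the previous
                              response arrives (assumed original behaviour)
    """
    gnt_cycle = -1     # cycle index of the previous grant
    rvalid_cycle = -1  # cycle index of the previous response
    for g, r in zip(gnt_waits, rvalid_waits):
        if overlap:
            start = gnt_cycle + 1
        else:
            start = max(gnt_cycle + 1, rvalid_cycle)
        gnt_cycle = start + g
        # Responses are in order and come at least one cycle after the grant.
        rvalid_cycle = max(gnt_cycle, rvalid_cycle) + 1 + r
    return rvalid_cycle + 1


if __name__ == "__main__":
    gnt = [2, 0, 3]
    rvalid = [1, 4, 0]
    print(total_cycles(gnt, rvalid, overlap=False))  # 14
    print(total_cycles(gnt, rvalid, overlap=True))   # 11
    print(total_cycles([0, 0, 0], [0, 0, 0], overlap=False))  # 4
    print(total_cycles([0, 0, 0], [0, 0, 0], overlap=True))   # 4
```

In this toy model both modes give the same count when there are no wait states, while with wait states the overlapped mode finishes earlier, which is consistent in direction with the measurements above. The actual percentages depend on the real traffic and are not reproduced by this sketch.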
This pull request is part of a multi-phase transformation as explained in the OpenHW TWG Cores channel. Hardware loops have been temporarily removed and will be re-introduced in a following phase. (There is a new top-level parameter called PULP_HWLP, which for now has to be set to 0; the value 1 will be re-enabled in a following phase. The actual instruction decoder still needs to take PULP_HWLP into account; there will be a separate pull request for that modification.)
There will be a separate pull request related to documentation updates as well.