mrisc32 / mrisc32-a1

A pipelined, in-order, scalar VHDL implementation of the MRISC32 ISA

Home Page: https://gitlab.com/mrisc32/mrisc32-a1

mrisc32-a1's Introduction

This repo has moved to: https://gitlab.com/mrisc32/mrisc32

This is an open and free 32-bit RISC/Vector instruction set architecture (ISA), primarily inspired by the Cray-1 and MIPS architectures. The focus is to create a clean, modern ISA that is equally attractive to software, hardware and compiler developers.

This repository contains LaTeX documentation and databases of architectural information (e.g. instructions and system registers).

Documentation

The latest MRISC32 Instruction Set Manual (PDF) describes the MRISC32 ISA in detail.

Features

Unified scalar/vector/integer/floating-point ISA.
There are two register files:
- R0-R31: 32 scalar registers, each 32 bits wide.
  - Three registers have special meaning in hardware: Z, LR, VL.
  - 29 registers are general purpose (of which three are reserved by the ABI: SP, FP, TP).
  - All registers can be used for all types (integers, addresses and floating-point).
- V0-V31: 32 vector registers, each with at least 16 32-bit elements.
  - All registers can be used for all types (integers, addresses and floating-point).
All instructions are 32 bits wide and easy to decode.
Most instructions are non-destructive 3-operand (two sources, one destination).
All conditionals are based on register content.
- There are no condition code flags (carry, overflow, ...).
- Compare instructions generate bit masks.
- Branch instructions can act on bit masks (all bits set, all bits zero, etc) as well as signed quantities (less than zero, etc).
- Bit masks are suitable for masking in conditional operations (for scalars, vectors and packed data types).
Powerful addressing modes:
- Scaled indexed load/store (x1, x2, x4, x8).
- Gather-scatter and stride-based vector load/store.
- PC-relative and absolute load/store:
  - ±4 MiB range with one instruction.
  - Full 32-bit range with two instructions.
- PC-relative and absolute branch:
  - ±4 MiB range with one instruction.
  - Full 32-bit range with two instructions.
Many traditional floating-point operations can be handled wholly or partially by integer operations, reducing the number of necessary instructions:
- Load/store.
- Branch.
- Sign and bit manipulation (e.g. neg, abs).
Vector operations use a Cray-like model:
- Vector operations are variable length (1-N elements).
- Most integer and floating-point instructions come in both scalar and vector variants.
- Vector instructions can use both vector and scalar operands (including immediate values), which removes the overhead of transferring scalar data into vector registers.
In addition to vector operations, there are also packed operations that operate on small data types (byte and half-word).
Fixed point operations are supported:
- Single instruction multiplication of Q31, Q15 and Q7 fixed point numbers.
- Single instruction conversion between floating-point and fixed point.
- Saturating and halving addition and subtraction.
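
The register-based conditional model listed above (compare instructions that produce bit masks, which are then used for masking) can be sketched in Python as follows. This is only an illustration of the idea: the function names are hypothetical, and the assumption that a true comparison yields an all-ones 32-bit mask is taken from the "all bits set / all bits zero" wording above rather than from the ISA manual.

```python
MASK32 = 0xFFFFFFFF


def to_signed32(x: int) -> int:
    """Interpret the low 32 bits of x as a two's complement signed value."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def compare_lt_signed(a: int, b: int) -> int:
    """Hypothetical compare: all bits set if a < b (signed), else all bits zero."""
    return MASK32 if to_signed32(a) < to_signed32(b) else 0


def select(mask: int, if_set: int, if_clear: int) -> int:
    """Bitwise select: take bits from if_set where mask is 1, else from if_clear."""
    return ((if_set & mask) | (if_clear & ~mask)) & MASK32


def min_signed(a: int, b: int) -> int:
    """Branch-free signed minimum built from a compare mask and a select."""
    return select(compare_lt_signed(a, b), a, b)
```

Because the mask is an ordinary register value, the same select works per element for vectors and per lane for packed data, with no condition flags involved.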

Note: There is no support for 64-bit floating-point operations (that is left for a 64-bit version of the ISA).
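
As an illustration of how sign manipulation of floating-point values can be done with integer operations, the following Python sketch works directly on IEEE 754 binary32 bit patterns. The helper names are hypothetical and the code says nothing about which MRISC32 instructions would be used; it only shows that negation and absolute value reduce to flipping or clearing the sign bit.

```python
import struct

SIGN_BIT = 0x80000000
MASK32 = 0xFFFFFFFF


def f32_to_bits(x: float) -> int:
    """Return the binary32 bit pattern of x as an unsigned 32-bit integer."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_to_f32(bits: int) -> float:
    """Reinterpret an unsigned 32-bit integer as a binary32 value."""
    return struct.unpack("<f", struct.pack("<I", bits & MASK32))[0]


def fneg_bits(bits: int) -> int:
    """Negate: flip the sign bit with an integer XOR."""
    return (bits ^ SIGN_BIT) & MASK32


def fabs_bits(bits: int) -> int:
    """Absolute value: clear the sign bit with an integer AND."""
    return bits & ~SIGN_BIT & MASK32
```

For example, `bits_to_f32(fneg_bits(f32_to_bits(1.5)))` yields `-1.5` without any floating-point arithmetic being performed.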

mrisc32-a1's Issues

float_compare: Properly handle 0.0 == -0.0
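
A minimal Python sketch of an equality test on binary32 bit patterns that treats +0.0 and -0.0 as equal. It assumes IEEE 754 semantics (NaN never compares equal); the function names are hypothetical and this is not the A1 `float_compare` logic itself.

```python
ABS_MASK = 0x7FFFFFFF   # everything except the sign bit
INF_BITS = 0x7F800000   # exponent all ones, mantissa zero


def is_nan(bits: int) -> bool:
    """A binary32 value is NaN if its magnitude bits exceed those of infinity."""
    return (bits & ABS_MASK) > INF_BITS


def f32_equal(a_bits: int, b_bits: int) -> bool:
    """Equality on bit patterns, with 0.0 == -0.0 and NaN != anything."""
    if is_nan(a_bits) or is_nan(b_bits):
        return False
    # Both operands are zero if no magnitude bit is set in either of them.
    both_zero = ((a_bits | b_bits) & ABS_MASK) == 0
    return both_zero or a_bits == b_bits
```

The zero check is a single OR plus a mask, so in hardware it would amount to a wide NOR over the magnitude bits of both operands.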

Implement FSQRT

This probably requires different solutions for different float widths. E.g. for f8 we can most likely use a simple LUT solution, while for f32 we may need to use Newton-Raphson or similar.
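
To make the Newton-Raphson option concrete, here is a software sketch in Python. It iterates on the reciprocal square root (which needs only multiplies and subtracts) and multiplies by the reduced input at the end. The range reduction, the fixed initial guess of 1.0 and the iteration count are assumptions chosen for illustration; a hardware unit would more likely seed the iteration from a small LUT to need fewer steps.

```python
import math


def nr_sqrt(x: float, iterations: int = 6) -> float:
    """Approximate sqrt(x) using Newton-Raphson on y = 1/sqrt(m)."""
    if x < 0.0:
        raise ValueError("square root of a negative number")
    if x == 0.0:
        return 0.0
    # Range reduction: x = m * 2**e with 0.5 <= m < 1.
    m, e = math.frexp(x)
    if e % 2:
        # Make the exponent even so that it can be halved exactly.
        m *= 2.0
        e -= 1
    # Now 0.5 <= m < 2, for which the starting guess y = 1.0 converges.
    y = 1.0
    for _ in range(iterations):
        y = y * (1.5 - 0.5 * m * y * y)
    # sqrt(m) = m * (1/sqrt(m)); then restore the halved exponent.
    return math.ldexp(m * y, e // 2)
```

Each iteration roughly doubles the number of correct bits, which is why a better initial guess translates directly into fewer multiplier passes.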

fmul & fdiv: Implement RTNE rounding
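
RTNE (round to nearest, ties to even) is commonly implemented with a guard bit and a sticky bit. The Python sketch below shows that decision on a non-negative integer significand from which `shift` low bits are dropped. The function name and interface are hypothetical, and the sketch leaves out the renormalization that is needed when rounding up overflows the significand.

```python
def round_rtne(value: int, shift: int) -> int:
    """Drop `shift` low bits of a non-negative integer, rounding ties to even."""
    if shift == 0:
        return value
    kept = value >> shift
    # Guard bit: the most significant of the discarded bits.
    guard = (value >> (shift - 1)) & 1
    # Sticky bit: set if any discarded bit below the guard bit is set.
    sticky = (value & ((1 << (shift - 1)) - 1)) != 0
    lsb = kept & 1
    # Round up when above the halfway point, or exactly halfway and odd.
    if guard and (sticky or lsb):
        kept += 1
    return kept
```

For instance, dropping two bits from `0b1010` (2.5) gives 2, while dropping two bits from `0b1110` (3.5) gives 4: both ties land on the even neighbour.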

Branch correction: Check PC when !is_branch

For non-branch instructions, the next PC must be PC+4.

Execute PC- and Z-relative unconditional branches in the ID stage rather than in the EX1 stage (?)

B and BL do not need access to any registers, so it should be possible to execute them in the ID stage (essentially compute PC + Imm and always branch). The same goes for zero-relative branches (j/jl z, addr).

This would reduce the branch misprediction penalty to 1 cycle for B/BL. C-style for-loops would benefit from this, for instance:

.loop:
    slt     s9, s20, s21    ; s20 < s21?
    ...
    bns     s9, .loop_done  ; Predicted not taken: +3 cycles on last iteration
    add     s20, s20, 1
    ...
    b       .loop           ; Predicted not taken: +1 cycle on first iteration

.loop_done:
    ...

The potential extra cost is additional muxing in the PC / BTB, as well as more logic for calculating the branch target in ID.

Caveat: A branch in ID must be considered speculative, since up to two earlier branches may be further down the pipeline, waiting to potentially invalidate the branch instruction in the ID stage. One solution is to not let the ID branch update the BTB, but wait until the instruction reaches EX to do the update.

Another problem is that more information may be required in order to determine the correctness of the program flow (i.e. whether or not the EX stage should cancel the following instructions).

Implement a DCache

Do #1 first.

Implement late forwarding for MADD

With late forwarding of the addend, the MADD instruction would work as an MAC with zero latency for consecutive multiply+add operations such as:

    madd  r1, r2, r3
    madd  r1, r4, r5
    madd  r1, r6, r7  ; r1 = r1 + r2 * r3 + r4 * r5 + r6 * r7

We probably only need to worry about forwarding of outputs from the MADD unit.

Don't stall bubbles

It should be relatively easy to "pop bubbles" during a stall (i.e. don't propagate the stall signal to earlier stages if a stage is currently holding a bubble).
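
The following Python sketch models the idea on a toy pipeline, where index 0 is the earliest stage and `None` represents a bubble. It is a behavioural illustration only: the `step` function, its parameters and the list-based pipeline are hypothetical and do not mirror the A1 VHDL. It also assumes that the stage raising the stall holds a valid instruction.

```python
from typing import List, Optional

Stage = Optional[str]  # None models a bubble


def step(stages: List[Stage], stall_at: int, fetch: Stage,
         pop_bubbles: bool = True) -> List[Stage]:
    """Advance the pipeline one cycle while stage `stall_at` requests a stall."""
    n = len(stages)
    keep = [False] * n
    stalled = True
    # Walk from the stalling stage towards the front of the pipeline.
    for i in range(stall_at, -1, -1):
        if pop_bubbles:
            # A stage holding a bubble absorbs the stall instead of passing it on.
            keep[i] = stalled and stages[i] is not None
            stalled = keep[i]
        else:
            keep[i] = True
    result: List[Stage] = []
    for i in range(n):
        if keep[i]:
            result.append(stages[i])      # held in place
        elif i == 0:
            result.append(fetch)          # first stage accepts a new instruction
        elif keep[i - 1]:
            result.append(None)           # upstream is held: insert a bubble
        else:
            result.append(stages[i - 1])  # normal advance
    return result
```

With `stages = ["i3", None, "i1", "i0"]` and a stall at stage 2, popping lets `i3` move into the bubble slot and a new instruction enter the first stage, whereas the plain stall keeps the bubble in place.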

A1: Consider switching from asynchronous reset to synchronous reset

Improve the branch predictor

The current branch predictor is a single-bit predictor (taken / not taken). Add weakly taken and weakly not-taken states (i.e. use a two-bit predictor state).

Also, a return-address stack predictor would be useful.
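
The difference between one bit and two bits of predictor state can be sketched with a saturating counter in Python. The class and the loop pattern below are illustrative assumptions, not the A1 predictor; the pattern models a loop branch that is taken three times and then falls through, repeated three times.

```python
class SaturatingPredictor:
    """N-bit saturating counter; the upper half of the states predicts taken."""

    def __init__(self, bits: int, state: int = 0):
        self.max_state = (1 << bits) - 1
        self.threshold = 1 << (bits - 1)
        self.state = state

    def predict(self) -> bool:
        return self.state >= self.threshold

    def update(self, taken: bool) -> None:
        # Move one step towards the observed outcome, saturating at the ends.
        if taken:
            self.state = min(self.max_state, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def count_mispredictions(predictor: SaturatingPredictor, outcomes) -> int:
    """Run the predictor over a sequence of branch outcomes and count misses."""
    misses = 0
    for taken in outcomes:
        if predictor.predict() != taken:
            misses += 1
        predictor.update(taken)
    return misses


# Loop branch: taken three times, then not taken, repeated three times.
loop_branch = ([True] * 3 + [False]) * 3
one_bit = count_mispredictions(SaturatingPredictor(bits=1, state=0), loop_branch)
two_bit = count_mispredictions(SaturatingPredictor(bits=2, state=1), loop_branch)
```

On this pattern the one-bit predictor misses twice per loop run (on exit and again on re-entry), while the two-bit predictor only misses on exit once it has warmed up.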

Investigate current BTB implementation:

- Is it optimal or does it contain redundant bits?
- Can we measure the BTB hit/fail rate?

Redesign the register files to save BRAM

Currently we use five RAM instances for the register files (35840 effective bits in total):

- Three 1024-bit RAM instances for the scalar register file (three read ports).
- Two 16384-bit RAM instances for the vector register file (two read ports).

In a Cyclone V FPGA this translates to 70 Kbits BRAM usage in total, as follows:

- 3 x M10K BRAM blocks for the scalar register file.
- 4 x M10K BRAM blocks for the vector register file.

That means that we are wasting 50% of the memory bits.
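
The numbers above can be reproduced with a few lines of Python. The sketch assumes that an M10K block holds 10,240 bits and that the number of blocks per RAM instance is determined by capacity alone (port-width constraints are ignored); the function name is hypothetical.

```python
import math

M10K_BITS = 10 * 1024  # capacity of one M10K block, in bits


def bram_usage(instance_bits):
    """Return (effective bits, M10K blocks, fraction of allocated bits used)."""
    effective = sum(instance_bits)
    blocks = sum(math.ceil(bits / M10K_BITS) for bits in instance_bits)
    utilization = effective / (blocks * M10K_BITS)
    return effective, blocks, utilization


scalar = [32 * 32] * 3        # three copies of 32 registers x 32 bits
vector = [32 * 16 * 32] * 2   # two copies of 32 registers x 16 elements x 32 bits

effective_bits, total_blocks, utilization = bram_usage(scalar + vector)
```

Each 1024-bit scalar copy occupies a whole M10K block while using only a tenth of it, which is why moving the scalar register file to MLABs / distributed RAM is the obvious first thing to try.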

Try out different strategies, e.g. using MLABs / distributed RAM for the scalar register file.

A1: Investigate treating short fw branches as predicates

If a branch instruction only advances the PC by 4 bytes when the branch is taken, we essentially have an instruction (the one following the branch) that is predicated. In some situations it would be beneficial to treat that instruction as conditionally executed instead of handling the branch as usual.

If we have a branch misprediction, we could just let the execution flow continue but replace the predicated instruction with a bubble. Or something like that.

fadd & fsub: Implement RTNE rounding

This is slightly trickier than for fmul & fdiv.

Optimize the ALU + operand forwarding

There are too many levels of muxing going on, especially around the compare logic.

Implement late forwarding for memory stores

A memory store does not need the data operand until the second execute pipeline stage. Being able to start the store instruction (to calculate the address) before the data operand is ready can save one clock cycle in certain situations, e.g.:

    ldw s1, s3, #0 
    stw s1, s4, #0

Implement an ICache

The ICache is more important than the DCache for most applications, and it is easier to implement.

Having an ICache will leave the shared memory bus free for the data interface most of the time, letting the instruction fetch stage run uninterrupted even during data operations.

An ICache is also very useful for systems with slow memory (e.g. SDRAM). More so than a DCache since the CPU needs one instruction per clock cycle, while it may not perform data accesses on every clock cycle.