mrisc32-a1's Introduction

This repo has moved to: https://gitlab.com/mrisc32/mrisc32

MRISC32

This is an open and free 32-bit RISC/Vector instruction set architecture (ISA), primarily inspired by the Cray-1 and MIPS architectures. The focus is to create a clean, modern ISA that is equally attractive to software, hardware and compiler developers.

This repository contains LaTeX documentation and databases of architectural information (e.g. instructions and system registers).

Documentation

The latest MRISC32 Instruction Set Manual (PDF) describes the MRISC32 ISA in detail.

Overview documents:

Features

  • Unified scalar/vector/integer/floating-point ISA.
  • There are two register files:
    • R0-R31: 32 scalar registers, each 32 bits wide.
      • Three registers have special meaning in hardware: Z, LR, VL.
      • 29 registers are general purpose (of which three are reserved by the ABI: SP, FP, TP).
      • All registers can be used for all types (integers, addresses and floating-point).
    • V0-V31: 32 vector registers, each with at least 16 32-bit elements.
      • All registers can be used for all types (integers, addresses and floating-point).
  • All instructions are 32 bits wide and easy to decode.
  • Most instructions are non-destructive 3-operand (two sources, one destination).
  • All conditionals are based on register content.
    • There are no condition code flags (carry, overflow, ...).
    • Compare instructions generate bit masks.
    • Branch instructions can act on bit masks (all bits set, all bits zero, etc) as well as signed quantities (less than zero, etc).
    • Bit masks are suitable for masking in conditional operations (for scalars, vectors and packed data types).
  • Powerful addressing modes:
    • Scaled indexed load/store (x1, x2, x4, x8).
    • Gather-scatter and stride-based vector load/store.
    • PC-relative and absolute load/store:
      • ±4 MiB range with one instruction.
      • Full 32-bit range with two instructions.
    • PC-relative and absolute branch:
      • ±4 MiB range with one instruction.
      • Full 32-bit range with two instructions.
  • Many traditional floating-point operations can be handled in whole or partially by integer operations, reducing the number of necessary instructions:
    • Load/store.
    • Branch.
    • Sign and bit manipulation (e.g. neg, abs).
  • Vector operations use a Cray-like model:
    • Vector operations are variable length (1-N elements).
    • Most integer and floating-point instructions come in both scalar and vector variants.
    • Vector instructions can use both vector and scalar operands (including immediate values), which removes the overhead of transferring scalar data into vector registers.
  • In addition to vector operations, there are also packed operations that operate on small data types (byte and half-word).
  • Fixed point operations are supported:
    • Single instruction multiplication of Q31, Q15 and Q7 fixed point numbers.
    • Single instruction conversion between floating-point and fixed point.
    • Saturating and halving addition and subtraction.

Note: There is no support for 64-bit floating-point operations (that is left for a 64-bit version of the ISA).
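
As an illustration of the mask-based conditionals listed above, here is a minimal Python sketch of a compare that produces a bit mask and a select that consumes it. The helper names (`cmp_lt_signed`, `select`, `min_signed`) are hypothetical and are not MRISC32 mnemonics; the sketch only models 32-bit scalar behavior.

```python
MASK32 = 0xFFFFFFFF  # all 32 bits set


def to_signed(x: int) -> int:
    """Interpret the low 32 bits of x as a signed two's complement value."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def cmp_lt_signed(a: int, b: int) -> int:
    """Hypothetical compare: all bits set if a < b (signed), otherwise zero."""
    return MASK32 if to_signed(a) < to_signed(b) else 0


def select(mask: int, if_set: int, if_clear: int) -> int:
    """Pick bits from if_set where mask is 1 and from if_clear where mask is 0."""
    return ((if_set & mask) | (if_clear & ~mask)) & MASK32


def min_signed(a: int, b: int) -> int:
    """Branch-free signed minimum built from a compare mask and a select."""
    return select(cmp_lt_signed(a, b), a, b)
```

Because the mask is plain register content, the same pattern would apply element by element to vectors and packed data types.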

mrisc32-a1's Issues

Implement FSQRT

This probably requires different solutions for different float widths. E.g. for f8 we can most likely use a simple LUT solution, while for f32 we may need to use Newton-Raphson or similar.
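
A minimal software sketch of the Newton-Raphson option, assuming the common formulation that iterates on the reciprocal square root (which needs no division) and multiplies by the input at the end. The function name `nr_sqrt`, the constant initial guess and the iteration count are illustrative assumptions, not part of the MRISC32 design; a hardware implementation would more likely seed the iteration from a small LUT and need fewer steps.

```python
import math


def nr_sqrt(x: float, iterations: int = 6) -> float:
    """Approximate sqrt(x) with Newton-Raphson on the reciprocal square root."""
    if x < 0.0:
        raise ValueError("square root of a negative number")
    if x == 0.0:
        return 0.0

    # Range reduction: x = m * 2**e with an even exponent and m in [0.5, 2).
    m, e = math.frexp(x)  # m in [0.5, 1)
    if e % 2 != 0:
        m *= 2.0
        e -= 1

    # Newton-Raphson for y = 1/sqrt(m): y <- y * (1.5 - 0.5 * m * y * y).
    # A constant initial guess of 1.0 converges for m in [0.5, 2).
    y = 1.0
    for _ in range(iterations):
        y = y * (1.5 - 0.5 * m * y * y)

    # sqrt(m) = m * (1/sqrt(m)); then restore the exponent (halved).
    return math.ldexp(m * y, e // 2)
```

Each iteration roughly squares the relative error, so a better initial guess directly reduces the number of iterations needed.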

Execute PC- and Z-relative unconditional branches in the ID stage rather than in the EX1 stage (?)

B and BL do not need access to any registers, so it should be possible to execute them in the ID stage (essentially compute PC + Imm and always branch). The same goes for zero-relative branches (j/jl z, addr).

This would reduce the branch misprediction penalty to 1 cycle for B/BL. C-style for-loops would benefit from this, for instance:

.loop:
    slt     s9, s20, s21    ; s20 < s21?
    ...
    bns     s9, .loop_done  ; Predicted not taken: +3 cycles on last iteration
    add     s20, s20, 1
    ...
    b       .loop           ; Predicted not taken: +1 cycle on first iteration

.loop_done:
    ...

The potential extra cost is additional muxing in the PC / BTB, as well as more logic for calculating the branch target in ID.

Caveat: A branch in ID must be considered speculative, since up to 2 earlier branches may be further down the pipeline, waiting to potentially invalidate the branch instruction in the ID stage. One solution is to not let the ID branch update the BTB, but wait until the instruction reaches EX to do the update.

Another problem is that more information may be required in order to determine the correctness of the program flow (i.e. whether or not the EX stage should cancel the following instructions).

Implement late forwarding for MADD

With late forwarding of the addend, the MADD instruction would work as a MAC with zero latency for consecutive multiply+add operations such as:

    madd  r1, r2, r3
    madd  r1, r4, r5
    madd  r1, r6, r7  ; r1 = r1 + r2 * r3 + r4 * r5 + r6 * r7

We probably only need to worry about forwarding of outputs from the MADD unit.

Don't stall bubbles

It should be relatively easy to "pop bubbles" during a stall (i.e. don't propagate the stall signal to earlier stages if a stage is currently holding a bubble).
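
A minimal behavioral sketch of the idea, assuming a linear pipeline where a stall originates in one stage and propagates backwards. The function name `blocked_stages` and the list-based pipeline model are hypothetical; they are not taken from the VHDL sources.

```python
def blocked_stages(valid, stall_stage):
    """Return which stages must keep their content during a stall.

    valid[i] is True if stage i holds a real instruction and False if it
    holds a bubble. Stages are ordered from earliest (index 0) to latest.
    stall_stage is the index of the stage that raised the stall.
    """
    blocked = [False] * len(valid)
    blocked[stall_stage] = True
    for i in range(stall_stage - 1, -1, -1):
        # Stage i is blocked only if the next stage is blocked AND holds a
        # real instruction. A blocked stage holding a bubble can simply be
        # overwritten, so the stall does not propagate past it.
        blocked[i] = blocked[i + 1] and valid[i + 1]
    return blocked
```

For example, with `valid = [True, True, False, True]` and a stall in the last stage, only the last two stages are blocked: the two earliest stages advance and the bubble in the third stage is overwritten.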

Improve the branch predictor

The current branch predictor is a one-bit predictor (taken / not taken). Add weakly taken / weakly not taken states (i.e. use a two-bit predictor state).

Also, a return-address stack predictor would be useful.
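
The two-bit predictor state mentioned above is commonly implemented as a saturating counter. The following is a minimal sketch of that textbook scheme; the class and function names are hypothetical and do not describe the current predictor implementation.

```python
class TwoBitPredictor:
    """Two-bit saturating counter.

    States: 0 = strongly not taken, 1 = weakly not taken,
            2 = weakly taken,       3 = strongly taken.
    """

    def __init__(self, state=1):
        self.state = state

    def predict(self):
        """Predict taken in the two upper states."""
        return self.state >= 2

    def update(self, taken):
        """Move one step towards the actual outcome, saturating at 0 and 3."""
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def count_mispredictions(outcomes, state=1):
    """Run a sequence of branch outcomes through one predictor entry."""
    predictor = TwoBitPredictor(state)
    misses = 0
    for taken in outcomes:
        if predictor.predict() != taken:
            misses += 1
        predictor.update(taken)
    return misses
```

The benefit shows up for loop branches: a single not-taken outcome at loop exit only moves the counter from strongly taken to weakly taken, so the branch is still predicted taken the next time the loop is entered.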

Investigate current BTB implementation:

  • Is it optimal or does it contain redundant bits?
  • Can we measure the BTB hit/fail rate?

Redesign the register files to save BRAM

Currently we use five RAM instances for the register files (35840 effective bits in total):

  • Three 1024-bit RAM instances for the scalar register file (three read ports).
  • Two 16384-bit RAM instances for the vector register file (two read ports).

In a Cyclone V FPGA this translates to 70 Kbits BRAM usage in total, as follows:

  • 3 x M10K BRAM blocks for the scalar register file.
  • 4 x M10K BRAM blocks for the vector register file.

That means that we are wasting 50% of the memory bits.
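
The numbers above can be reproduced with a few lines of arithmetic. This sketch assumes 16 elements per vector register (the minimum stated in the feature list, and consistent with the 16384-bit figure) and 10240 bits per M10K block; the variable names are illustrative.

```python
REGISTER_BITS = 32
NUM_REGISTERS = 32
VECTOR_ELEMENTS = 16   # assumption: minimum vector register length
M10K_BITS = 10 * 1024  # capacity of one M10K block in bits

scalar_ram_bits = NUM_REGISTERS * REGISTER_BITS                    # one scalar read port
vector_ram_bits = NUM_REGISTERS * VECTOR_ELEMENTS * REGISTER_BITS  # one vector read port

# Three scalar read ports and two vector read ports, one RAM instance each.
effective_bits = 3 * scalar_ram_bits + 2 * vector_ram_bits

# 3 M10K blocks for the scalar file, 4 for the vector file.
allocated_bits = (3 + 4) * M10K_BITS

wasted_fraction = 1 - effective_bits / allocated_bits

print(scalar_ram_bits, vector_ram_bits)  # 1024 16384
print(effective_bits, allocated_bits)    # 35840 71680
print(wasted_fraction)                   # 0.5
```

Most of the waste comes from the scalar register file, where each 1024-bit RAM instance occupies a full 10240-bit block.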

Try out different strategies. E.g. try using MLABs / distributed RAM for the scalar register file.

A1: Investigate treating short fw branches as predicates

If a branch instruction only advances the PC by 4 bytes when the branch is taken, we essentially have an instruction (the one following the branch) that is predicated. In some situations it would be beneficial to treat that instruction as conditionally executed instead of handling the branch as usual.

If we have a branch misprediction, we could just let the execution flow continue but replace the predicated instruction with a bubble. Or something like that.

Implement late forwarding for memory stores

A memory store does not need the data operand until the 2nd execute pipeline stage. Being able to start the store instruction (to calculate the address) before the data operand is ready can save one clock cycle in certain situations, e.g.:

    ldw s1, s3, #0 
    stw s1, s4, #0 

Implement an ICache

The ICache is more important than the DCache for most applications, and it is easier to implement.

Having an ICache will leave the shared memory bus free for the data interface most of the time, letting the instruction fetch stage run uninterrupted even during data operations.

An ICache is also very useful for systems with slow memory (e.g. SDRAM), more so than a DCache, since the CPU needs one instruction per clock cycle while it may not perform data accesses on every clock cycle.

Turn CPU config into a generic on the core entity

Right now the config (CPU capabilities) is given in a separate file in the design.

It would be better for users of the VHDL code if you could pass the configuration as parameters to the entity instantiation. This would also enable a single system to instantiate several cores with different configs.
