laforest / octavo

Verilog FPGA Parts Library. Old Octavo soft-CPU project.
Home Page: http://fpgacpu.ca/
License: Other
If an instruction reads from an I/O port, but the ack line is low by the end of the RD0 stage, then use that bit to trigger the reset of the instruction registers at some point before the ALU and the Controller, converting it to a NOP.
Possible implementation: add a mux of ack lines in RD0 mirroring the I/O line mux, and a mux in RD1 mirroring the final read mux, which either sends along the selected ack line (if it was an I/O read after all), or sends a hardwired logic HIGH (if not an I/O read).
A separate module in Octavo/Octavo/Memory will contain this hardware. Related to #19. Depends on #20.
Add an option to implement conditional add and subtract instructions to implement multiplication and division: faster than software, slower than hardware, but much cheaper than either. Use an I/O port to catch shifted upper bits.
(Also support for some form of newton-raphson, for quick 1/sqrt(n)?)
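As a rough illustration of what conditional adds buy us, here is a behavioural Python sketch of shift-add multiplication, where each loop iteration corresponds to one conditional-add instruction and the upper bits (caught via an I/O port in hardware) come back separately. Function and parameter names are illustrative, not from the codebase.

```python
def shift_add_multiply(a, b, width=8):
    """Model of multiplication built from `width` conditional adds:
    each step adds the shifted multiplicand only when the matching
    multiplier bit is set."""
    acc = 0
    for i in range(width):
        # The "conditional" part: the add happens only on a set bit of b.
        if (b >> i) & 1:
            acc += a << i
    # In hardware, bits above `width` would be caught on an I/O port.
    low = acc & ((1 << width) - 1)
    high = acc >> width
    return high, low
```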
Use case: too few multipliers on Cyclone devices (see #25), so it might be good to share a multiplier between two cores as an I/O module, where one core's even-numbered threads can write to the multiplier, and the other core's odd-numbered threads can. (Writing during the wrong thread just loses the data.)
Generalized: a parameterizable mux and counter to connect N cores to 1 functional unit, where each core can access the FU once every 2nd, 3rd, 4th, etc...thread.
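The grant rule for such a shared functional unit can be sketched behaviourally; the thread counter alone decides whose write reaches the FU, and writes during the wrong slot are dropped. These function names are illustrative.

```python
def fu_grant(thread, core, num_cores):
    """True when `core` owns the shared FU during this thread slot:
    core c gets every num_cores-th slot, offset by c."""
    return thread % num_cores == core

def shared_fu_input(thread, writes, num_cores):
    """The parameterizable mux: select the granted core's write data;
    all other cores' writes this slot are simply lost."""
    return writes[thread % num_cores]
```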
Currently, we support single and double-pipeline multipliers in both soft and hard versions. We need to add an option to place these on I/O ports instead of in the ALU, or have no multiplier at all.
See if we can replace the Octavo module parameter declarations (with default values) with some macros in Misc/params.v.
This change would compress the module declarations and instantiations considerably, and make future changes easier.
Check to see if we can properly chain macros with an added comma in-between, to concatenate parameter lists.
Currently, all FUs are active all the time, with the desired result selected at the end. This wastes power and prevents any future Instruction-Level Parallelism.
We could add small decoders to control the FUs and the write-enables of their pipeline registers to only operate/update when the correct opcode is present.
Add the option of a PC stack in the Controller, and of Data stacks in A and B Memories to support real, re-entrant subroutines.
Open question: use MLABs or M9Ks for stacks?
To better move data to/from cores, and to feed accelerators, create an instruction which does no computation, but moves the A/B data to two D destinations.
Split the D operand into two 6-bit fields, maybe offset by a programmable amount, so we can do double-writes in that window, one to A, one to B.
Add shuffle version, which swaps A/B at write.
Maybe short-circuit data so we don't go through ALU?
Consequences of non-8-cycle-latency operation?
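The D-operand split above can be sketched in a few lines of Python; the field widths and the programmable offsets are as described in the issue, while the function and parameter names are made up for illustration.

```python
def split_d_operand(d, offset_a=0, offset_b=0):
    """Split a 12-bit D operand into two 6-bit destination fields,
    each offset by a programmable base, enabling one write to A and
    one to B per instruction."""
    dest_b = (d & 0x3F) + offset_b          # low 6 bits -> B destination
    dest_a = ((d >> 6) & 0x3F) + offset_a   # high 6 bits -> A destination
    return dest_a, dest_b
```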
See: http://quartushelp.altera.com/12.1/mergedProjects/hdl/vlog/vlog_file_dir_msg_onoff.htm
Basically, place a message_off directive at each module instance where synthesis reports warnings that don't help. Example: the lack of write ports on the inferred memory in the Address_Translator, and the width warnings for non-multiple-of-4-width memories.
A Wishbone module, hooked-up to one or more I/O ports, would allow easy integration of OpenCores hardware.
In the worst case, an instruction will read from 2 I/O ports (or twice simultaneously from the same port!), and write to an I/O port (which may be at the same address as one of the read ports). The instruction will raise req on each of these ports to signal the other end that the handshake may complete.
However, if any of the ports cannot complete the handshake (low ack), then no port should raise req (or let a raised req get to the other end), because if the handshake were to complete, the data would be lost since the instruction will be annulled anyway.
This will be a tricky one to implement without impacting Fmax, as by itself, it implies a loop going from remote req to local ack, back out via local req, then back to remote ack, meaning double the normal propagation delay between ends. Also, it requires interconnecting all I/O port hardware, across both A and B memories.
Currently, only full Modelsim is supported. We should add generation of scripts for Icarus Verilog, Verilator, Altera-Modelsim, etc...
To increase usage of I/O ports, minimize their number, and easily connect duplicate per-thread resources (e.g.: stacks), we need a handshake-capable (de)multiplexer, controlled by the thread number. We can re-combine ports which serve more than one thread. The synthesis tool will automatically optimize away unused mux ports.
We can increase the Instruction Memory storage for each thread by demuxing the PC to 2 or more Instruction Memories, thus dividing the threads across them, to the extreme of having 8 such memories, so each thread can have the maximum addressable amount. We can then trivially remux the fetched instruction in-order.
Pro:
Con:
Thus, we might better increase the I Mem for a thread by either distributing the thread code over two or more hardware threads, or dividing the code across cores. Both would yield better density, and the second creates real concurrency in the thread.
An I/O module to talk to Altera Avalon busses (and whatever the newer Qsys uses) would make it easy to integrate Altera IP blocks.
Currently, Addressing selects a default offset of zero if the instruction operand points to an I/O port or to High Memory.
We should add another Address_Decoder to OR with the default offset selector to also enable direct access to a small region of memory to hold common (usually read-only) data: a literal pool, which would hold common values (0, 1, -1, 2^N, etc...), tables of coefficients, etc... used by all threads.
Really it's a shared, directly-addressed memory area for all threads. It would be a good place for mailboxes/semaphores/mutexes, etc...
To make complex concurrent systems easier to build, we need some small modules that allow combining multiple read/write handshakes.
Examples:
Currently, this combining litters the RTL diagrams with per-wire AND gates, making them harder to read. This new module would have an instance in the Control and Data paths.
Although the simplicity of just passing the opcode to the ALU to use a case statement (and some careful opcode encoding choices) to select the operation has appeal, it means decoding as we compute, and for non-trivial ALUs, there's not enough time.
Instead, decode and pipeline the opcode into control signals during the two stages where we read/write A/B, so the 4 ALU stages can perform more complex work. Relates a bit to issue #31 in that it modularizes the datapath.
While I/O Predication works fine, it should be factored out into its own module living in stages 2 and 3, before the DataPath begins on stage 4, passing to the latter only the IO_ready and misc. "this address is an I/O port" signals.
I/O Predication is a separate functionality, and really just clutters the DataPath code. (This was made clear by attempting to add Addressing modules to the DataPath.)
This will have to wait until after I graduate.
Currently, an annulled instruction gets turned into a no-op, by convention the all-zero instruction.
We should enable the reverse mapping: if the instruction is a no-op, then raise Annul.
The detector would be a set of all-zero detectors, with the sub-signal of D == 0 used to disable writes to A/B, enforcing the "read-only" behaviour of address 0.
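A behavioural sketch of that detector, in Python: the all-zero instruction raises Annul, and the D == 0 sub-signal disables A/B writes. The position of the D field in the word is purely illustrative here, as is the function name.

```python
def annul_from_noop(instr, d_width=12):
    """All-zero instruction -> raise Annul; D == 0 -> disable A/B writes,
    enforcing the read-only behaviour of address 0.
    (D assumed in the low bits only for this sketch.)"""
    annul = (instr == 0)
    d = instr & ((1 << d_width) - 1)
    write_enable = (d != 0)
    return annul, write_enable
```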
Cyclone V M10K BRAMs have 40-bit words and ECC support. Maybe leave Octavo architecturally at 36-bits (or somesuch), and use the ECC instead.
If that includes SECDED operation, then we could use the output of the ECC logic to detect a corrupt A/B/I read and if fixed (SEC), annul the instruction, store the corrected value(s) back into memory (might be stuck to the range of a double-move write), and re-issue later.
If not fixed (DED), then annul the instruction and jump to some error handler.
These provide support for self-healing memory, but what about errors in the logic? See DIVA paper and https://caesr.uwaterloo.ca/2013/11/08/urisc-micro/ ("Reliable Computing with Ultra-Reduced Instruction Set Co-processors").
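To make the SEC/DED decision above concrete, here is a toy SECDED code in Python: Hamming(7,4) plus an overall parity bit, protecting a 4-bit value. The M10K ECC protects whole words, but the classification logic (clean, correctable, uncorrectable) behaves the same way. This is an illustrative sketch, not Altera's actual ECC implementation.

```python
def secded_encode(nibble):
    """Hamming(7,4) with parity bits at positions 1, 2, 4 (1-indexed),
    plus an overall parity bit in code[0] for double-error detection."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code) % 2   # make total parity even
    return code

def secded_decode(code):
    """Classify a read: "ok", "corrected" (SEC: annul, fix, re-issue),
    or "double-error" (DED: annul and jump to an error handler)."""
    syndrome = 0
    for p in (1, 2, 4):
        s = 0
        for i in range(1, 8):
            if i & p:
                s ^= code[i]
        if s:
            syndrome |= p
    overall = sum(code) % 2
    if syndrome == 0 and overall == 0:
        return "ok"
    if overall == 1:
        return "corrected"    # single-bit error at position `syndrome`
    return "double-error"
```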
Add Jason & Matthew's proposal to use the last 2 bits in the 36-bit instruction format to control the write-enable lines for the I/A/B memories to eliminate the need to duplicate all code and data across all three memories.
For larger widths, instead of finding the special cases where 2 spare bits remain, we could limit the address space by stealing bits from the D/A/B operand fields. Enough range will remain in practice, and we can get more than 2 bits this way if needed.
When an instruction is annulled due to non-ready I/O, we subtract one from the issued PC to re-issue the annulled instruction.
Unfortunately, there is no slack between the Controller and the I_mem, dropping Fmax from ~550 to ~430. Adding two stages of PC_PIPELINE restores speed, but now we have 8 threads running around a 10-stage control pipeline. Not sure of the impact of that.
Alternately: rather than using 20 FFs to pipeline the PC more, what about using 1 MLAB (10 ALMs) to store the one-behind PC to re-issue the instruction? A 2:1 mux, controlled by IO_ready, should be fast enough and it moves the subtractor out of that critical path to I_mem?
But will it affect the future port to Cyclone? (See #6)
Currently, doing an I/O write updates all data registers, but only the wren for the updated I/O port changes, indicating where the data must go. This seems messy and maybe wasteful of power.
Now that fixing issue #38 adds a register stage, maybe we should use the wren signal to update only the data register of the same I/O port?
Extending the D operand by 2 bits (to 12 bits in the vanilla 36-bit configuration), and reconfiguring the write enables of the A/B/I memories, would eliminate data and code duplication across all memories, by controlling at compile time where ALU results get written.
Assuming all the memories are identical (1024 words), we can arrange them thus in the D address space:
0-1023: A
1024-2047: B
2048-3071: I
3072-4095: H (call it "high mem", as a DOS homage? ;) )
The A and B operands still read the A/B memories, and the PC reads the I memory, and all remain at 10 bits and index from 0. The D operand of branches now wraps around I memory 4 times, or put another way, ignores the two upper bits.
Thus, as-is, the H memory is write-only and a good place to put hardware extensions which aren't interactive. This reduces the number of A and B I/O ports needed.
Finally, this scheme won't work for word widths without 2 free instruction bits, but supporting all possible word widths (without modification) has lost importance. It was enough to prove that all word widths were usable and fast.
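The proposed decode is simple enough to state exactly, following the map above; this Python sketch uses illustrative function names.

```python
def decode_d(d):
    """Decode a 12-bit D operand into (memory, local address) per the
    proposed map: A, B, I, then write-only H ("high mem")."""
    mem = ("A", "B", "I", "H")[(d >> 10) & 0x3]
    return mem, d & 0x3FF

def branch_target(d):
    """Branch D operands wrap around I memory: the two upper bits are
    simply ignored."""
    return d & 0x3FF
```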
Extension to #40
Based on ideas in
"The Landscape of Parallel Computing Research: A View from Berkeley"
http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.pdf
Add two multi-threaded counters, likely external and attached to I/O ports.
Starting at 0 (reset by write), one increments each thread cycle when Annul is low, the other when Annul is high.
Their sum is a thread cycle counter.
Their ratio describes the I/O efficiency (how much of the total time was spent waiting on I/O). A designer can use this value to tune loops, DRAM operation, and other things which affect processing efficiency, if not correctness.
Similarly, counting on other events (particular instructions, operand values, data values, cancellations, shared/indirect memory accesses, taken/untaken branches, etc...) would provide easy and non-intrusive instrumentation.
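The two counters and the efficiency ratio they yield can be modelled in a few lines; the class and method names here are illustrative, not from the RTL.

```python
class EfficiencyCounters:
    """Two thread-cycle counters: `work` increments when Annul is low,
    `wait` when it is high. Their sum is a thread cycle counter; their
    ratio is the fraction of time spent waiting on I/O."""
    def __init__(self):
        self.work = 0
        self.wait = 0
    def tick(self, annul):
        if annul:
            self.wait += 1
        else:
            self.work += 1
    def total(self):
        return self.work + self.wait
    def io_wait_fraction(self):
        return self.wait / self.total() if self.total() else 0.0
```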
Generally speaking, you're better off dedicating one thread (or one whole core!) to polling and managing external asynchronous events. But interrupts might be nice anyway in some cases...
The basic configuration would have one interrupt line per thread, multiplexed to the Controller, and patched-in to the Jump logic, and instead of using the D operand, would take the destination address from an interrupt vector, one per thread, set from a multiplexed I/O port.
This way, each thread can independently respond to an external interrupt without affecting the other threads.
If done carefully, we wouldn't need to store the current PC, as the current instruction would have finished before servicing the interrupt (it's a hardware-invoked JMP, after all), and we can resume the thread from the next PC stored in the PC memory in the Controller. Nesting interrupts would complicate this a lot...requiring CALL/RET hardware effectively, but would we really need nested interrupts?
There are several ways to label the signals of a 4-phase handshake. I think they should just be "req/ack", regardless of whether they are used synchronously (as "ready" signals).
Naming: A_wren becomes A_out_req, paired with an A_out_ack input. A_rden becomes A_in_req, paired with an A_in_ack input.
This change should be doable with a careful search/replace. Though please do it on a separate branch.
At the time of checking the ack lines for both reads (in RD0), do the same for the write port: in WR0, select the correct ack line and use it to annul an instruction before it reaches the Controller and ALU.
Given a completed #21, this should implement in a similar way, and can just tack on to the existing muxing and annulling circuitry.
Related to #31.
Once the I/O Predication is factored out, there is no point in having stage 1-3 empty pipelines in the DataPath, so factor out the pipelines of things that don't get processed, like the opcode, into a "spine" which runs parallel to the other sub-systems and just carries stuff along.
Since we can propagate Annul down the Datapath to disable A/B reads, ALU register latching, and A/B writes, we might save power there. It should decrease in proportion to the number of "idle" threads looping on a no-op or waiting on I/O.
36-bit words make SWAR/SIMD processing easy: 4 bytes, each with a guard bit, or two 16-bit values each with 2 guard bits. The guard bits capture any overflow, and can be tested/set with bit ops.
Thus, for byte-wide SWAR, we can do in one 36-bit SIMD lane the equivalent work of 4 byte-wide SIMD lanes.
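A behavioural Python model of byte-wide SWAR in a 36-bit word (four 9-bit lanes, each a byte plus a guard bit) shows how the guard bits catch overflow without corrupting neighbouring lanes. Names and helpers are illustrative.

```python
LANE, LANE_MASK = 9, 0x1FF   # 9-bit lanes: 8 data bits + 1 guard bit

def swar_pack(bytes4):
    """Pack four bytes into the 9-bit lanes of a 36-bit word,
    guard bits clear."""
    w = 0
    for i, b in enumerate(bytes4):
        w |= (b & 0xFF) << (LANE * i)
    return w

def swar_add(x, y):
    """Lane-wise add: each lane's guard bit captures its carry-out,
    so overflow never leaks into the neighbouring byte."""
    w = 0
    for i in range(4):
        a = (x >> (LANE * i)) & LANE_MASK
        b = (y >> (LANE * i)) & LANE_MASK
        w |= ((a + b) & LANE_MASK) << (LANE * i)
    return w

def guards(w):
    """Read out the guard bits: per-lane overflow flags, testable
    with ordinary bit operations."""
    return [(w >> (LANE * i + 8)) & 1 for i in range(4)]
```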
We know that Fmax drops with larger numbers of homogeneous SIMD lanes, even with partitioning, but we don't know the drop for heterogeneous SIMD lanes.
From thesis:
SIMD Configuration | Fmax (MHz) |
---|---|
Octavo-L1 | 525 |
Octavo-L2 | 514 |
Octavo-L4 | 474 |
Octavo-L8 | 455 |
Octavo-L16 | 420 |
Octavo-L32 | 370 |
If we assume similar Fmax drops for byte-wide SIMD configurations, then using a word-wide Octavo-L1 for byte-wide SWAR instead of Octavo-L4 with byte-wide lanes should improve performance ~11%.
Even without a speed gain, the area will surely be less, as a heterogeneous byte-wide Octavo-L4 actually implements as 1x36 + 3x8 lanes, versus a single 1x36 lane in SWAR mode. The P&R time will also remain shorter. As we pack more SIMD lanes into SWAR lanes, the area and speed benefits should increase.
SWAR suggests bit packing/unpacking ops (see Hacker's Delight) to convert between packed bytes (as in 32-bit words) and SWAR words, and to manipulate the guard bits.
Create a simple multi-threaded word-wide counter, mapped into High memory.
Assign one of the "fast compare" condition codes to the counter, high when counter == 0.
A thread writes a count value to the counter, and a BTM entry for its condition code.
Each time the BTM entry selects that condition, the counter decrements by one, saturating at zero (else we might fall into quasi-infinite loops by error, rather than 0-length ones).
The branch is taken when the counter == 0 before decrement.
Why? Since Octavo code is already tight and reduced, this will greatly improve loop efficiency. It's a common trick in DSPs, although less flexible than what the BTM enables.
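The counter's take/decrement/saturate semantics can be captured in a small behavioural model (class and method names are illustrative):

```python
class LoopCounter:
    """Memory-mapped, per-thread loop counter: the branch is taken when
    the count is zero before decrement, and the decrement saturates at
    zero to avoid quasi-infinite loops on a stale count."""
    def __init__(self, count=0):
        self.count = count
    def write(self, value):
        # A thread writes an iteration count here via High memory.
        self.count = value
    def check_and_decrement(self):
        # Invoked each time the BTM entry selects this condition code.
        taken = (self.count == 0)
        if self.count > 0:
            self.count -= 1   # saturate at zero
        return taken
```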
Currently, the Memory module contains everything. Factoring it out into separate modules would more cleanly expose the useful reusable bits inside, like the universal address decoder and translator.
Handshaking support has the highest importance, as other enhancement and future expansions depend on it.
Currently, I/O reads and writes always complete, regardless of the readiness of the other end. Thus, any readiness check must use a busy-wait loop. We must add a simple 4-phase req/ack handshake on I/O reads and writes to effectively support many peripherals (i.e.: DRAM controllers), a larger memory hierarchy, and for overlapping communication and computation (a simple 1-stage handshaking write FIFO would allow a write to proceed and the writer to go on and compute while the reader eventually acknowledges the write.)
The tricky part comes when handling a not-ready I/O operation:
Finally, we must for now assume that no two threads read/write from the same I/O port, as the handshaking could cause their reads/writes to happen in the opposite order, destroying the assumption of fixed round-robin thread ordering.
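The all-or-nothing req rule (no port raises req unless every involved port is ready, so an annulled instruction never completes a handshake and loses data) reduces to a small combinational function, sketched here in Python with illustrative names:

```python
def io_reqs(ready, wants):
    """Given per-port ack/ready levels and which ports this instruction
    uses, raise req only when every involved port is ready; otherwise
    raise no req at all and annul the instruction."""
    all_ready = all(r for r, w in zip(ready, wants) if w)
    reqs = [w and all_ready for w in wants]
    annul = not all_ready
    return reqs, annul
```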
Have the bit which annulled an instruction with a failed I/O read/write (see #21 #22) follow the annulled instruction to the Controller, where instead of sending out the PC of the next instruction for the same thread (which was already stored in the PCM), we send out the previous PC to re-issue the annulled instruction.
This seems to imply having a subtractor in the output path of the Controller: we may have to move one of the 3 pipeline stages between I Mem and A/B back behind I Mem to compensate. Octavo already supports this change via a PC_PIPELINE_DEPTH parameter. Make sure to alter the other connected pipeline depths to match (adds up to 8 total).
Implement a popcount module. There is enough time (8 cycles) to pipeline properly the chains of small adders necessary. This instruction seems much too useful and powerful to not include.
With clever use of AND gates, this could also act as a CLZ instruction.
And they lay the basis for bit "flip", "sheeps and goats", compress, and expand operations, all powerful.
(see Hacker's Delight)
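A behavioural sketch of the adder tree, plus one known trick for deriving CLZ from popcount (smear the top set bit downward with OR, then count the zeros above it); the issue mentions a variant using AND gates, so treat this as one illustrative option, not the planned implementation:

```python
def popcount(x, width=36):
    """Adder-tree popcount: each loop level models one pipeline stage of
    small adders summing pairs of partial counts; 8 cycles is ample."""
    counts = [(x >> i) & 1 for i in range(width)]
    while len(counts) > 1:
        if len(counts) % 2:
            counts.append(0)            # pad odd levels with zero
        counts = [counts[i] + counts[i + 1] for i in range(0, len(counts), 2)]
    return counts[0]

def clz(x, width=36):
    """Count leading zeros: OR-smear x downward so every bit at or below
    the top set bit becomes 1, then popcount gives width - clz."""
    y, shift = x, 1
    while shift < width:
        y |= y >> shift
        shift <<= 1
    return width - popcount(y, width)
```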
Note: use a wrapper to tie the second, configuration port (if there are multiple functions per accelerator) to the H memory, rather than use up an I/O port.
Currently, we include a test program in the BRAM instance, and so the system comes up out of FPGA configuration ready to run. We should reduce it to a bootloader with "read mem", "write mem", "jump to mem" commands to enable us to select the test program without having to reconfigure the FPGA.
See: http://utoh.org/3ins4th.html (not arguing for Forth, but for this overall idea)
Having the boot loader waiting to load commands from I/O would also have the system start in the most idle (and thus low-power) state possible if handshaking also exists.
Downside: simulation time will increase, as we'll have to load the program each time.
Currently, the project generator only targets a Stratix IV FPGA device, which the free version of Quartus does not support. We should add the option to target Cyclone devices at least, as most entry-level and educational boards based on Altera chips include these, and are supported by the free version of Quartus.
This change might require some minor pipeline tuning to reach top-speed on Cyclone devices, but it's unlikely.
We should move back to simulating the full system with I/O and PLLs for clocks. This setup will make future integration and testing of changes/add-ons easier.
Upside: we can have a simulation setup for each supported FPGA board.
Downside: simulation time will increase significantly.
Now that, with the BTM, the Controller uses the result of the previous thread instruction instead of the value of operand A, the Datapath is no longer coupled to the Controlpath in a feed-forward manner.
Thus, we should be able to slide the Datapath down towards the end of the Controlpath, increasing the number of stages between the I and A/B memories, which means more stages for the AOM, BTM, and any other control hardware.
This has a potential major effect: being able to slide the AOM down past stage 1 means we can use the instruction operand addresses directly to read from the PO/PI AOM memories, eliminating the slow multiplexers in the AOM, which should improve speed and scalability.
The constraints are that the read from the next thread instruction in the pipeline cannot arrive to A/B earlier than the write of the current thread instruction. At most, they must coincide so the write-forwarding logic around the A/B BRAMs feeds the write to the read. Same for the read of the next thread instruction from the I memory: we don't want to introduce a second cycle of delay.
We must also be careful about the feeding of the ALU output back to the BTM, to make sure the thread numbers line up (i.e.: ALU output from thread N must reach the BTM stage containing the next instruction from the same thread).
Update Memory.v to support read-enable, and bring out those lines so they can be controlled by the processor. Being able to disable reads could reduce power, and selectively disable SIMD lanes if needed.
Since I/O writes happen in stage 4, not 5, they are 7 cycles later than I/O reads, not 8.
Thus even a direct connection from I/O write to I/O read cannot pass data inside a single thread! More seriously: an external accelerator would have to do contortions to compensate.
Solution (being addressed in simd_refit branch): add one cycle delay to I/O data and wren outputs.
Based on:
http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.pdf
Many computing "dwarfs" require scatter/gather memory operation for peak performance.
Having an Addressing module for each SIMD lane would allow per-lane indirection of reads and writes, which should work as scatter/gather with proper setup.
However, this means more High memory entries, which might be complicated to set-up, and we may exhaust H memory, limiting the number of SIMD lanes.
Implement a Look-Up Table instruction(s) to optimize boolean logic.
Load a table with 4 bits representing the 2-LUT truth table for 2 input bits, one each from the A and B read data. Four 4-bit entries per table, times 8 threads, equals 32 entries, meaning two 32x2 MLABs used completely. Memory-map the per-thread entries in a single word: 4*4 = 16 bits. Writing to the word updates all 4 entries.
Place table in ALU, replacing Logic Unit:
ALU0: read LUT Memories
ALU1: register MLAB output, as always.
ALU2: drive 4:1 muxes, one per output bit, selected by A and B bit, maps perfectly to 6-LUT hardware (Cyclone V and Stratix IV)
ALU3: replace final mux with 4:1 mux: Logic, Add/Sub, MHI, MLO
Will need careful experimentation to preserve Fmax, but shouldn't be hard.
Note: first investigate combining Add/Sub with Logic operation, as per Warren's "Hacker's Delight".
Use first 4 opcodes as index into 4 entries, giving programmable, universal 2-term boolean logic opcodes. Easy to default to XOR/AND/OR/NOT. By default, all zero entries encode to a result of 0, regardless of A/B inputs, preserving NOP and address 0 operation.
Benefits: reduces all 2-term boolean operations to a single instruction.
Best case: unoptimized logic (e.g.: ~(~A * ~B)), reduces 4 operations to 1, but that could be trivially optimized to A + B beforehand. DeMorgan's Law also guarantees that any operation with 2 negations can be converted to one with 1 negation.
Common case: operations with a single negation (e.g.: ~A * B), reduces from 2 to 1 operation. Basically all negations vanish, reducing any compound boolean sequence with N terms to the best sequential case of N-1 operations.
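A behavioural model of the programmable 2-LUT makes the 4:1-mux-per-bit structure concrete, including the property that an all-zero table yields 0 regardless of inputs (preserving NOP and address 0 operation). Constant and function names are illustrative.

```python
# Truth tables indexed by (a_bit << 1) | b_bit; all-zero table -> always 0.
XOR, AND, OR = 0b0110, 0b1000, 0b1110

def lut2_eval(table, a_bit, b_bit):
    """One output bit: the 4-bit truth table drives a 4:1 mux selected
    by the {A, B} bit pair (maps to 6-LUT hardware)."""
    return (table >> ((a_bit << 1) | b_bit)) & 1

def lut2_word(table, a, b, width=36):
    """Apply the 2-LUT bitwise across full words, one mux per bit."""
    out = 0
    for i in range(width):
        out |= lut2_eval(table, (a >> i) & 1, (b >> i) & 1) << i
    return out
```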
See: http://iie.fing.edu.uy/investigacion/grupos/microele//iberchip/pdf/75.pdf ("Classifying n-input boolean functions", Correia and Reis) for characterization of all 16 possible 2-term boolean functions.
See: bit-manipulation papers/thesis by Yedidya Hilewitz, Princeton PALMS group, now at Intel.
Would be much more powerful with 3 terms, as in CUDA 6.0, but that would require adding a C memory to fetch alongside A/B, which is an icky architectural change.
At some point, once the architecture stabilizes (Octavo as top-level, over Mesh, over SIMD, over Scalar), we need to convert (and replace with!) the entire codebase to SystemVerilog.
Support for interfaces in ports, which will eliminate the flat vectors passed in and out of Memory modules, which make each higher level harder and harder to write. Mesh already pushes at the limit of what's practical.
Similarly, support for structs, list parameters, etc... will make parameterizing large Octavo instances much cleaner. (e.g. we could easily implement heterogeneous Meshes)
Start a branch and rewrite from the bottom-up. The lowest modules (BRAMs, Controller, ALU, etc...) should convert almost without change. The Memory module will have port interfaces instead of flat vectors, and that will reflect in all higher modules, simplifying things.
After I graduate. :)
If the new no-op (see #45 and #35) becomes ADD 0, 0, 0, and reduces power via Annul, then we could create false no-ops (e.g.: SUB 0, 0, 0), which don't raise Annul, and thus hide some power profiles of code execution.
Similarly, based on #35, we could switch the implementation of some bit-manipulation functions with Boolean-equivalent versions using different LUT instructions, also maybe hiding some power profiles.
Taking this further, if Annul shuts down unused FU stages (e.g. Adder when doing a Multiply), then perhaps we can selectively (pseudo-randomly?) inhibit that effect to also hide power profiles.
Also, we could make the I/O Handshaking module behaviour programmable to issue a variety of false no-ops (or even instructions with real and/or decoy side-effects) while waiting, so as to hide some I/O operations. In effect, extending I/O handshaking with a co-routine instead of a no-op.
This co-routine (a stored PC in the I/O Handshaking module?) could also emulate some of the work recovery aimed-at by thread re-scheduling on I/O latency events, which we strictly disallow in Octavo to maintain determinacy.
Adding an I/O read port to the Instruction Memory would allow the direct execution of code from external memory without overhead. This will require handshaking support to support variable latency, and alterations to the Controller so the PC won't increment when reading from the I/O port. We can then use the Instruction Memory as a user-controllable cache for high-speed, deterministic code.