laforest / octavo

Verilog FPGA Parts Library. Old Octavo soft-CPU project.
Home Page: http://fpgacpu.ca/
License: Other
If an instruction reads from an I/O port, but the ack line is low by the end of the RD0 stage, then use that bit to trigger the reset of the instruction registers at some point before the ALU and the Controller, converting it to a NOP.
Possible implementation: add a mux of ack lines in RD0 mirroring the I/O line mux, and a mux in RD1 mirroring the final read mux, which either sends along the selected ack line (if it was an I/O read after all), or sends a hardwired logic HIGH (if not an I/O read).
A separate module in Octavo/Octavo/Memory will contain this hardware. Related to #19. Depends on #20.
Add an option to implement conditional add and subtract instructions to implement multiplication and division: faster than software, slower than hardware, but much cheaper than either. Use an I/O port to catch shifted upper bits.
(Also support for some form of newton-raphson, for quick 1/sqrt(n)?)
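As a rough illustration of what conditional adds buy us, here is a behavioural Python sketch of shift-add multiplication, where each loop iteration corresponds to one conditional-add instruction and the upper bits (caught via an I/O port in hardware) come back separately. Function and parameter names are illustrative, not from the codebase.

```python
def shift_add_multiply(a, b, width=8):
    """Model of multiplication built from `width` conditional adds:
    each step adds the shifted multiplicand only when the matching
    multiplier bit is set."""
    acc = 0
    for i in range(width):
        # The "conditional" part: the add happens only on a set bit of b.
        if (b >> i) & 1:
            acc += a << i
    # In hardware, bits above `width` would be caught on an I/O port.
    low = acc & ((1 << width) - 1)
    high = acc >> width
    return high, low
```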
Use case: too few multipliers on Cyclone devices (see #25), so it might be good to share a multiplier between two cores as an I/O module, where one core's even-numbered threads can write to the multiplier, and the other core's odd-numbered threads can. (Writing during the wrong thread just loses the data.)
Generalized: a parameterizable mux and counter to connect N cores to 1 functional unit, where each core can access the FU once every 2nd, 3rd, 4th, etc...thread.
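The grant rule for such a shared functional unit can be sketched behaviourally; the thread counter alone decides whose write reaches the FU, and writes during the wrong slot are dropped. These function names are illustrative.

```python
def fu_grant(thread, core, num_cores):
    """True when `core` owns the shared FU during this thread slot:
    core c gets every num_cores-th slot, offset by c."""
    return thread % num_cores == core

def shared_fu_input(thread, writes, num_cores):
    """The parameterizable mux: select the granted core's write data;
    all other cores' writes this slot are simply lost."""
    return writes[thread % num_cores]
```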
Currently, we support single and double-pipeline multipliers in both soft and hard versions. We need to add an option to place these on I/O ports instead of in the ALU, or have no multiplier at all.
See if we can replace the Octavo module parameter declarations (with default values) with some macros in Misc/params.v.
This change would compress the module declarations and instantiations considerably, and make future changes easier.
Check to see if we can properly chain macros with an added comma in-between, to concatenate parameter lists.
Currently, all FUs are active all the time, with the desired result selected at the end. This wastes power and prevents any future Instruction-Level Parallelism.
We could add small decoders to control the FUs and the write-enables of their pipeline registers to only operate/update when the correct opcode is present.
Add the option of a PC stack in the Controller, and of Data stacks in A and B Memories to support real, re-entrant subroutines.
Open question: use MLABs or M9Ks for stacks?
To better move data to/from cores, and to feed accelerators, create an instruction which does no computation, but moves the A/B data to two D destinations.
Split the D operand into two 6-bit fields, maybe offset by a programmable amount, so we can do double-writes in that window, one to A, one to B.
Add shuffle version, which swaps A/B at write.
Maybe short-circuit data so we don't go through ALU?
Consequences of non-8-cycle-latency operation?
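The D-operand split above can be sketched in a few lines of Python; the field widths and the programmable offsets are as described in the issue, while the function and parameter names are made up for illustration.

```python
def split_d_operand(d, offset_a=0, offset_b=0):
    """Split a 12-bit D operand into two 6-bit destination fields,
    each offset by a programmable base, enabling one write to A and
    one to B per instruction."""
    dest_b = (d & 0x3F) + offset_b          # low 6 bits -> B destination
    dest_a = ((d >> 6) & 0x3F) + offset_a   # high 6 bits -> A destination
    return dest_a, dest_b
```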
See: http://quartushelp.altera.com/12.1/mergedProjects/hdl/vlog/vlog_file_dir_msg_onoff.htm
Basically, place a message_off directive at each module instance where synthesis reports warnings that don't help. Example: the lack of write ports on the inferred memory in the Address_Translator, and the width warnings for non-multiple-of-4-width memories.
A Wishbone module, hooked-up to one or more I/O ports, would allow easy integration of OpenCores hardware.
In the worst case, an instruction will read from 2 I/O ports (or twice simultaneously from the same port!), and write to an I/O port (which may be at the same address as one of the read ports). The instruction will raise req on each of these ports to signal the other end that the handshake may complete.
However, if any of the ports cannot complete the handshake (low ack), then no port should raise req (or let a raised req get to the other end), because if the handshake were to complete, the data would be lost since the instruction will be annulled anyway.
This will be a tricky one to implement without impacting Fmax, as by itself, it implies a loop going from remote req to local ack, back out via local req, then back to remote ack, meaning double the normal propagation delay between ends. Also, it requires interconnecting all I/O port hardware, across both A and B memories.
Currently, only full Modelsim is supported. We should add generation of scripts for Icarus Verilog, Verilator, Altera-Modelsim, etc...
To increase usage of I/O ports, minimize their number, and easily connect duplicate per-thread resources (e.g.: stacks), we need a handshake-capable (de)multiplexer, controlled by the thread number. We can re-combine ports which serve more than one thread. The synthesis tool will automatically optimize away unused mux ports.
We can increase the Instruction Memory storage for each thread by demuxing the PC to 2 or more Instruction Memories, thus dividing the threads across them, to the extreme of having 8 such memories, so each thread can have the maximum addressable amount. We can then trivially remux the fetched instruction in-order.
Pro:
Con:
Thus, we might better increase the I Mem for a thread by either distributing the thread code over two or more hardware threads, or dividing the code across cores. Both would yield better density, and the second creates real concurrency in the thread.
An I/O module to talk to Altera Avalon busses (and whatever the newer Qsys uses) would make it easy to integrate Altera IP blocks.
Currently, Addressing selects a default offset of zero if the instruction operand points to an I/O port or to High Memory.
We should add another Address_Decoder to OR with the default offset selector to also enable direct access to a small region of memory to hold common (usually read-only) data: a literal pool, which would hold common values (0, 1, -1, 2^N, etc...), tables of coefficients, etc... used by all threads.
Really it's a shared, directly-addressed memory area for all threads. It would be a good place for mailboxes/semaphores/mutexes, etc...
To make complex concurrent systems easier to build, we need some small modules that allow combining multiple read/write handshakes.
Examples:
Currently, this combining litters the RTL diagrams with per-wire AND gates, making them harder to read. This new module would have an instance in the Control and Data paths.
Although the simplicity of just passing the opcode to the ALU to use a case statement (and some careful opcode encoding choices) to select the operation has appeal, it means decoding as we compute, and for non-trivial ALUs, there's not enough time.
Instead, decode and pipeline the opcode into control signals during the two stages where we read/write A/B, so the 4 ALU stages can perform more complex work. Relates a bit to issue #31 in that it modularizes the datapath.
While I/O Predication works fine, it should be factored out into its own module living in stages 2 and 3, before the DataPath begins on stage 4, passing to the latter only the IO_ready and misc. "this address is an I/O port" signals.
I/O Predication is a separate functionality, and really just clutters the DataPath code. (This was made clear by attempting to add Addressing modules to the DataPath.)
This will have to wait until after I graduate.
Currently, an annulled instruction gets turned into a no-op, by convention the all-zero instruction.
We should enable the reverse mapping: if the instruction is a no-op, then raise Annul.
The detector would be a set of all-zero detectors, with the sub-signal of D == 0 used to disable writes to A/B, enforcing the "read-only" behaviour of address 0.
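A behavioural sketch of that detector, in Python: the all-zero instruction raises Annul, and the D == 0 sub-signal disables A/B writes. The position of the D field in the word is purely illustrative here, as is the function name.

```python
def annul_from_noop(instr, d_width=12):
    """All-zero instruction -> raise Annul; D == 0 -> disable A/B writes,
    enforcing the read-only behaviour of address 0.
    (D assumed in the low bits only for this sketch.)"""
    annul = (instr == 0)
    d = instr & ((1 << d_width) - 1)
    write_enable = (d != 0)
    return annul, write_enable
```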
Cyclone V M10K BRAMs have 40-bit words and ECC support. Maybe leave Octavo architecturally at 36-bits (or somesuch), and use the ECC instead.
If that includes SECDED operation, then we could use the output of the ECC logic to detect a corrupt A/B/I read and if fixed (SEC), annul the instruction, store the corrected value(s) back into memory (might be stuck to the range of a double-move write), and re-issue later.
If not fixed (DED), then annul the instruction and jump to some error handler.
These provide support for self-healing memory, but what about errors in the logic? See DIVA paper and https://caesr.uwaterloo.ca/2013/11/08/urisc-micro/ ("Reliable Computing with Ultra-Reduced Instruction Set Co-processors").
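To make the SEC/DED decision above concrete, here is a toy SECDED code in Python: Hamming(7,4) plus an overall parity bit, protecting a 4-bit value. The M10K ECC protects whole words, but the classification logic (clean, correctable, uncorrectable) behaves the same way. This is an illustrative sketch, not Altera's actual ECC implementation.

```python
def secded_encode(nibble):
    """Hamming(7,4) with parity bits at positions 1, 2, 4 (1-indexed),
    plus an overall parity bit in code[0] for double-error detection."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code) % 2   # make total parity even
    return code

def secded_decode(code):
    """Classify a read: "ok", "corrected" (SEC: annul, fix, re-issue),
    or "double-error" (DED: annul and jump to an error handler)."""
    syndrome = 0
    for p in (1, 2, 4):
        s = 0
        for i in range(1, 8):
            if i & p:
                s ^= code[i]
        if s:
            syndrome |= p
    overall = sum(code) % 2
    if syndrome == 0 and overall == 0:
        return "ok"
    if overall == 1:
        return "corrected"    # single-bit error at position `syndrome`
    return "double-error"
```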
Add Jason & Matthew's proposal to use the last 2 bits in the 36-bit instruction format to control the write-enable lines for the I/A/B memories to eliminate the need to duplicate all code and data across all three memories.
For larger widths, instead of finding the special cases where 2 spare bits remain, we could limit the address space by stealing bits from the D/A/B operand fields. Enough range will remain in practice, and we can get more than 2 bits this way if needed.
When an instruction is annulled due to non-ready I/O, we subtract one from the issued PC to re-issue the annulled instruction.
Unfortunately, there is no slack between the Controller and the I_mem, dropping Fmax from ~550 to ~430. Adding two stages of PC_PIPELINE restores speed, but now we have 8 threads running around a 10-stage control pipeline. Not sure of the impact of that.
Alternately: rather than using 20 FFs to pipeline the PC more, what about using 1 MLAB (10 ALMs) to store the one-behind PC to re-issue the instruction? A 2:1 mux, controlled by IO_ready, should be fast enough and it moves the subtractor out of that critical path to I_mem?
But will it affect the future port to Cyclone? (See #6)
Currently, doing an I/O write updates all data registers, but only the wren for the updated I/O port changes, indicating where the data must go. This seems messy and maybe wasteful of power.
Now that fixing issue #38 adds a register stage, maybe we should use the wren signal to update only the data register of the same I/O port?
Extending the D operand by 2 bits (to 12 bits in the vanilla 36-bit configuration), and reconfiguring the write enables of the A/B/I memories, would eliminate data and code duplication across all memories, by controlling at compile time where ALU results get written.
Assuming all the memories are identical (1024 words), we can arrange them thus in the D address space:
0-1023: A
1024-2047: B
2048-3071: I
3072-4095: H (call it "high mem", as a DOS homage? ;) )
The A and B operands still read the A/B memories, and the PC reads the I memory, and all remain at 10 bits and index from 0. The D operand of branches now wraps around I memory 4 times, or put another way, ignores the two upper bits.
Thus, as-is, the H memory is write-only and a good place to put hardware extensions which aren't interactive. This reduces the number of A and B I/O ports needed.
Finally, this scheme won't work for word widths without 2 free instruction bits, but supporting all possible word widths (without modification) has lost importance. It was enough to prove that all word widths were usable and fast.
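The proposed decode is simple enough to state exactly, following the map above; this Python sketch uses illustrative function names.

```python
def decode_d(d):
    """Decode a 12-bit D operand into (memory, local address) per the
    proposed map: A, B, I, then write-only H ("high mem")."""
    mem = ("A", "B", "I", "H")[(d >> 10) & 0x3]
    return mem, d & 0x3FF

def branch_target(d):
    """Branch D operands wrap around I memory: the two upper bits are
    simply ignored."""
    return d & 0x3FF
```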
Extension to #40
Based on ideas in
"The Landscape of Parallel Computing Research: A View from Berkeley"
http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.pdf
Add two multi-threaded counters, likely external and attached to I/O ports.
Starting at 0 (reset by write), one increments each thread cycle when Annul is low, the other when Annul is high.
Their sum is a thread cycle counter.
Their ratio describes the I/O efficiency (how much of the total time was spent waiting on I/O). A designer can use this value to tune loops, DRAM operation, and other things which affect processing efficiency, if not correctness.
Similarly, counting on other events (particular instructions, operand values, data values, cancellations, shared/indirect memory accesses, taken/untaken branches, etc...) would provide easy and non-intrusive instrumentation.
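The two counters and the efficiency ratio they yield can be modelled in a few lines; the class and method names here are illustrative, not from the RTL.

```python
class EfficiencyCounters:
    """Two thread-cycle counters: `work` increments when Annul is low,
    `wait` when it is high. Their sum is a thread cycle counter; their
    ratio is the fraction of time spent waiting on I/O."""
    def __init__(self):
        self.work = 0
        self.wait = 0
    def tick(self, annul):
        if annul:
            self.wait += 1
        else:
            self.work += 1
    def total(self):
        return self.work + self.wait
    def io_wait_fraction(self):
        return self.wait / self.total() if self.total() else 0.0
```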
Generally speaking, you're better off dedicating one thread (or one whole core!) to polling and managing external asynchronous events. But interrupts might be nice anyway in some cases...
The basic configuration would have one interrupt line per thread, multiplexed to the Controller, and patched-in to the Jump logic, and instead of using the D operand, would take the destination address from an interrupt vector, one per thread, set from a multiplexed I/O port.
This way, each thread can independently respond to an external interrupt without affecting the other threads.
If done carefully, we wouldn't need to store the current PC, as the current instruction would have finished before servicing the interrupt (it's a hardware-invoked JMP, after all), and we can resume the thread from the next PC stored in the PC memory in the Controller. Nesting interrupts would complicate this a lot...requiring CALL/RET hardware effectively, but would we really need nested interrupts?
There are several ways to label the signals of a 4-phase handshake. I think they should just be "req/ack", regardless of whether they are used synchronously (as "ready" signals).
Naming: A_wren becomes A_out_req, paired with an A_out_ack input. A_rden becomes A_in_req, paired with an A_in_ack input.
This change should be doable with a careful search/replace. Though please do it on a separate branch.
At the time of checking the ack lines for both reads (in RD0), do the same for the write port: in WR0, select the correct ack line and use it to annul an instruction before it reaches the Controller and ALU.
Given a completed #21, this should implement in a similar way, and can just tack on to the existing muxing and annulling circuitry.
Related to #31.
Once the I/O Predication is factored out, there is no point in having stage 1-3 empty pipelines in the DataPath, so factor out the pipelines of things that don't get processed, like the opcode, into a "spine" which runs parallel to the other sub-systems and just carries stuff along.
Since we can propagate Annul down the Datapath to disable A/B reads, ALU register latching, and A/B writes, we might save power there. It should decrease in proportion to the number of "idle" threads looping on a no-op or waiting on I/O.
36-bit words make SWAR/SIMD processing easy: 4 bytes, each with a guard bit, or two 16-bit values each with 2 guard bits. The guard bits capture any overflow, and can be tested/set with bit ops.
Thus, for byte-wide SWAR, we can do in one 36-bit SIMD lane the equivalent work of 4 byte-wide SIMD lanes.
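A behavioural Python model of byte-wide SWAR in a 36-bit word (four 9-bit lanes, each a byte plus a guard bit) shows how the guard bits catch overflow without corrupting neighbouring lanes. Names and helpers are illustrative.

```python
LANE, LANE_MASK = 9, 0x1FF   # 9-bit lanes: 8 data bits + 1 guard bit

def swar_pack(bytes4):
    """Pack four bytes into the 9-bit lanes of a 36-bit word,
    guard bits clear."""
    w = 0
    for i, b in enumerate(bytes4):
        w |= (b & 0xFF) << (LANE * i)
    return w

def swar_add(x, y):
    """Lane-wise add: each lane's guard bit captures its carry-out,
    so overflow never leaks into the neighbouring byte."""
    w = 0
    for i in range(4):
        a = (x >> (LANE * i)) & LANE_MASK
        b = (y >> (LANE * i)) & LANE_MASK
        w |= ((a + b) & LANE_MASK) << (LANE * i)
    return w

def guards(w):
    """Read out the guard bits: per-lane overflow flags, testable
    with ordinary bit operations."""
    return [(w >> (LANE * i + 8)) & 1 for i in range(4)]
```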
We know that Fmax drops with larger numbers of homogeneous SIMD lanes, even with partitioning, but we don't know the drop for heterogeneous SIMD lanes.
From thesis:
SIMD Configuration | Fmax (MHz) |
---|---|
Octavo-L1 | 525 |
Octavo-L2 | 514 |
Octavo-L4 | 474 |
Octavo-L8 | 455 |
Octavo-L16 | 420 |
Octavo-L32 | 370 |
If we assume similar Fmax drops for byte-wide SIMD configurations, then using a word-wide Octavo-L1 for byte-wide SWAR instead of Octavo-L4 with byte-wide lanes should improve performance ~11%.
Even without a speed gain, the area will surely be less, as a heterogeneous byte-wide Octavo-L4 actually implements as 1x36 + 3x8 lanes, versus a single 1x36 lane in SWAR mode. The P&R time will also remain shorter. As we pack more SIMD lanes into SWAR lanes, the area and speed benefits should increase.
SWAR suggests bit packing/unpacking ops (see Hacker's Delight) to convert between packed bytes (as in 32-bit words) and SWAR words, and to manipulate the guard bits.
Create a simple multi-threaded word-wide counter, mapped into High memory.
Assign one of the "fast compare" condition codes to the counter, high when counter == 0.
A thread writes a count value to the counter, and a BTM entry for its condition code.
Each time the BTM entry selects that condition, the counter decrements by one, saturating at zero (else we might fall into quasi-infinite loops by error, rather than 0-length ones).
The branch is taken when the counter == 0 before decrement.
Why? Since Octavo code is already tight and reduced, this will greatly improve loop efficiency. It's a common trick in DSPs, although less flexible than what the BTM enables.
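The counter's take/decrement/saturate semantics can be captured in a small behavioural model (class and method names are illustrative):

```python
class LoopCounter:
    """Memory-mapped, per-thread loop counter: the branch is taken when
    the count is zero before decrement, and the decrement saturates at
    zero to avoid quasi-infinite loops on a stale count."""
    def __init__(self, count=0):
        self.count = count
    def write(self, value):
        # A thread writes an iteration count here via High memory.
        self.count = value
    def check_and_decrement(self):
        # Invoked each time the BTM entry selects this condition code.
        taken = (self.count == 0)
        if self.count > 0:
            self.count -= 1   # saturate at zero
        return taken
```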
Currently, the Memory module contains everything. Factoring it out into separate modules would more cleanly expose the useful reusable bits inside, like the universal address decoder and translator.
Handshaking support has the highest importance, as other enhancement and future expansions depend on it.
Currently, I/O reads and writes always complete, regardless of the readiness of the other end. Thus, any readiness check must use a busy-wait loop. We must add a simple 4-phase req/ack handshake on I/O reads and writes to effectively support many peripherals (i.e.: DRAM controllers), a larger memory hierarchy, and for overlapping communication and computation (a simple 1-stage handshaking write FIFO would allow a write to proceed and the writer to go on and compute while the reader eventually acknowledges the write.)
The tricky part comes when handling a not-ready I/O operation:
Finally, we must for now assume that no two threads read/write from the same I/O port, as the handshaking could cause their reads/writes to happen in the opposite order, destroying the assumption of fixed round-robin thread ordering.
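The all-or-nothing req rule (no port raises req unless every involved port is ready, so an annulled instruction never completes a handshake and loses data) reduces to a small combinational function, sketched here in Python with illustrative names:

```python
def io_reqs(ready, wants):
    """Given per-port ack/ready levels and which ports this instruction
    uses, raise req only when every involved port is ready; otherwise
    raise no req at all and annul the instruction."""
    all_ready = all(r for r, w in zip(ready, wants) if w)
    reqs = [w and all_ready for w in wants]
    annul = not all_ready
    return reqs, annul
```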
Have the bit which annulled an instruction with a failed I/O read/write (see #21 #22) follow the annulled instruction to the Controller, where instead of sending out the PC of the next instruction for the same thread (which was already stored in the PCM), we send out the previous PC to re-issue the annulled instruction.
This seems to imply having a subtractor in the output path of the Controller: we may have to move one of the 3 pipeline stages between I Mem and A/B back behind I Mem to compensate. Octavo already supports this change via a PC_PIPELINE_DEPTH parameter. Make sure to alter the other connected pipeline depths to match (adds up to 8 total).
Implement a popcount module. There is enough time (8 cycles) to pipeline properly the chains of small adders necessary. This instruction seems much too useful and powerful to not include.
With clever use of AND gates, this could also act as a CLZ instruction.
And they lay the basis for bit "flip", "sheeps and goats", compress, and expand operations, all powerful.
(see Hacker's Delight)
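A behavioural sketch of the adder tree, plus one known trick for deriving CLZ from popcount (smear the top set bit downward with OR, then count the zeros above it); the issue mentions a variant using AND gates, so treat this as one illustrative option, not the planned implementation:

```python
def popcount(x, width=36):
    """Adder-tree popcount: each loop level models one pipeline stage of
    small adders summing pairs of partial counts; 8 cycles is ample."""
    counts = [(x >> i) & 1 for i in range(width)]
    while len(counts) > 1:
        if len(counts) % 2:
            counts.append(0)            # pad odd levels with zero
        counts = [counts[i] + counts[i + 1] for i in range(0, len(counts), 2)]
    return counts[0]

def clz(x, width=36):
    """Count leading zeros: OR-smear x downward so every bit at or below
    the top set bit becomes 1, then popcount gives width - clz."""
    y, shift = x, 1
    while shift < width:
        y |= y >> shift
        shift <<= 1
    return width - popcount(y, width)
```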
Note: use a wrapper to tie the second, configuration port (if there are multiple functions per accelerator) to the H memory, rather than use up an I/O port.
Currently, we include a test program in the BRAM instance, and so the system comes up out of FPGA configuration ready to run. We should reduce it to a bootloader with "read mem", "write mem", "jump to mem" commands to enable us to select the test program without having to reconfigure the FPGA.
See: http://utoh.org/3ins4th.html (not arguing for Forth, but for this overall idea)
Having the boot loader waiting to load commands from I/O would also have the system start in the most idle (and thus low-power) state possible if handshaking also exists.
Downside: simulation time will increase, as we'll have to load the program each time.
Currently, the project generator only targets a Stratix IV FPGA device, which the free version of Quartus does not support. We should add the option to target Cyclone devices at least, as most entry-level and educational boards based on Altera chips include these, and are supported by the free version of Quartus.
This change might require some minor pipeline tuning to reach top-speed on Cyclone devices, but it's unlikely.
We should move back to simulating the full system with I/O and PLLs for clocks. This setup will make future integration and testing of changes/add-ons easier.
Upside: we can have a simulation setup for each supported FPGA board.
Downside: simulation time will increase significantly.
Now that, with the BTM, the Controller uses the result of the previous thread instruction instead of the value of operand A, the Datapath is no longer coupled to the Controlpath in a feed-forward manner.
Thus, we should be able to slide the Datapath down towards the end of the Controlpath, increasing the number of stages between the I and A/B memories, which means more stages for the AOM, BTM, and any other control hardware.
This has a potential major effect: being able to slide the AOM down past stage 1 means we can use the instruction operand addresses directly to read from the PO/PI AOM memories, eliminating the slow multiplexers in the AOM, which should improve speed and scalability.
The constraints are that the read from the next thread instruction in the pipeline cannot arrive to A/B earlier than the write of the current thread instruction. At most, they must coincide so the write-forwarding logic around the A/B BRAMs feeds the write to the read. Same for the read of the next thread instruction from the I memory: we don't want to introduce a second cycle of delay.
We must also be careful about the feeding of the ALU output back to the BTM, to make sure the thread numbers line up (i.e.: ALU output from thread N must reach the BTM stage containing the next instruction from the same thread).
Update Memory.v to support read-enable, and bring out those lines so they can be controlled by the processor. Being able to disable reads could reduce power, and selectively disable SIMD lanes if needed.
Since I/O writes happen in stage 4, not 5, they are 7 cycles later than I/O reads, not 8.
Thus even a direct connection from I/O write to I/O read cannot pass data inside a single thread! More seriously: an external accelerator would have to do contortions to compensate.
Solution (being addressed in simd_refit branch): add one cycle delay to I/O data and wren outputs.
Based on:
http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.pdf
Many computing "dwarfs" require scatter/gather memory operation for peak performance.
Having an Addressing module for each SIMD lane would allow per-lane indirection of reads and writes, which should work as scatter/gather with proper setup.
However, this means more High memory entries, which might be complicated to set-up, and we may exhaust H memory, limiting the number of SIMD lanes.
Implement a Look-Up Table instruction(s) to optimize boolean logic.
Load a table with 4 bits representing the 2-LUT truth table for 2 input bits, one each from the A and B read data. Four 4-bit entries per table, times 8 threads, equals 32 entries, meaning two 32x2 MLABs used completely. Memory-map the per-thread entries in a single word: 4*4 = 16 bits. Writing to the word updates all 4 entries.
Place table in ALU, replacing Logic Unit:
ALU0: read LUT Memories
ALU1: register MLAB output, as always.
ALU2: drive 4:1 muxes, one per output bit, selected by A and B bit, maps perfectly to 6-LUT hardware (Cyclone V and Stratix IV)
ALU3: replace final mux with 4:1 mux: Logic, Add/Sub, MHI, MLO
Will need careful experimentation to preserve Fmax, but shouldn't be hard.
Note: first investigate combining Add/Sub with Logic operation, as per Warren's "Hacker's Delight".
Use first 4 opcodes as index into 4 entries, giving programmable, universal 2-term boolean logic opcodes. Easy to default to XOR/AND/OR/NOT. By default, all zero entries encode to a result of 0, regardless of A/B inputs, preserving NOP and address 0 operation.
Benefits: reduces all 2-term boolean operations to a single instruction.
Best case: unoptimized logic (e.g.: ~(~A * ~B)), reduces 4 operations to 1, but that could be trivially optimized to A + B beforehand. DeMorgan's Law also guarantees that any operation with 2 negations can be converted to one with 1 negation.
Common case: operations with a single negation (e.g.: ~A * B), reduces from 2 to 1 operation. Basically all negations vanish, reducing any compound boolean sequence with N terms to the best sequential case of N-1 operations.
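A behavioural model of the programmable 2-LUT makes the 4:1-mux-per-bit structure concrete, including the property that an all-zero table yields 0 regardless of inputs (preserving NOP and address 0 operation). Constant and function names are illustrative.

```python
# Truth tables indexed by (a_bit << 1) | b_bit; all-zero table -> always 0.
XOR, AND, OR = 0b0110, 0b1000, 0b1110

def lut2_eval(table, a_bit, b_bit):
    """One output bit: the 4-bit truth table drives a 4:1 mux selected
    by the {A, B} bit pair (maps to 6-LUT hardware)."""
    return (table >> ((a_bit << 1) | b_bit)) & 1

def lut2_word(table, a, b, width=36):
    """Apply the 2-LUT bitwise across full words, one mux per bit."""
    out = 0
    for i in range(width):
        out |= lut2_eval(table, (a >> i) & 1, (b >> i) & 1) << i
    return out
```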
See: http://iie.fing.edu.uy/investigacion/grupos/microele//iberchip/pdf/75.pdf ("Classifying n-input boolean functions", Correia and Reis) for characterization of all 16 possible 2-term boolean functions.
See: bit-manipulation papers/thesis by Yedidya Hilewitz, Princeton PALMS group, now at Intel.
Would be much more powerful with 3 terms, as in CUDA 6.0, but that would require adding a C memory to fetch alongside A/B, which is an icky architectural change.
At some point, once the architecture stabilizes (Octavo as top-level, over Mesh, over SIMD, over Scalar), we need to convert (and replace with!) the entire codebase to SystemVerilog.
Support for interfaces in ports, which will eliminate the flat vectors passed in and out of Memory modules, which make each higher level harder and harder to write. Mesh already pushes at the limit of what's practical.
Similarly, support for structs, list parameters, etc... will make parameterizing large Octavo instances much cleaner. (e.g. we could easily implement heterogeneous Meshes)
Start a branch and rewrite from the bottom-up. The lowest modules (BRAMs, Controller, ALU, etc...) should convert almost without change. The Memory module will have port interfaces instead of flat vectors, and that will reflect in all higher modules, simplifying things.
After I graduate. :)
If the new no-op (see #45 and #35) becomes ADD 0, 0, 0, and reduces power via Annul, then we could create false no-ops (e.g.: SUB 0, 0, 0), which don't raise Annul, and thus hide some power profiles of code execution.
Similarly, based on #35, we could switch the implementation of some bit-manipulation functions with Boolean-equivalent versions using different LUT instructions, also maybe hiding some power profiles.
Taking this further, if Annul shuts down unused FU stages (e.g. Adder when doing a Multiply), then perhaps we can selectively (pseudo-randomly?) inhibit that effect to also hide power profiles.
Also, we could make the I/O Handshaking module behaviour programmable to issue a variety of false no-ops (or even instructions with real and/or decoy side-effects) while waiting, so as to hide some I/O operations. In effect, extending I/O handshaking with a co-routine instead of a no-op.
This co-routine (a stored PC in the I/O Handshaking module?) could also emulate some of the work recovery aimed-at by thread re-scheduling on I/O latency events, which we strictly disallow in Octavo to maintain determinacy.
Adding an I/O read port to the Instruction Memory would allow the direct execution of code from external memory without overhead. This will require handshaking support to support variable latency, and alterations to the Controller so the PC won't increment when reading from the I/O port. We can then use the Instruction Memory as a user-controllable cache for high-speed, deterministic code.