
05/06/2023: better LAB packing in nextpnr-mistral

more nextpnr-mistral things!


I've talked about LABs (Logic Array Blocks) a little bit before; they contain 10 ALMs (Adaptive Logic Modules) each, and are essentially the heart of the Intel Cyclone V.

each ALM has 8 inputs, and these inputs can either come from global routing through the TD (Tile Dispatch) multiplexers, or from other ALM outputs through the LD (Local Dispatch) multiplexers.

now, since there are 10 ALMs, and each ALM has two logic outputs (...ish), there are 20 LD muxes. to feed these 10 ALMs with their 8 inputs each, there must be 80 TD muxes, right?

unfortunately, there are 46 TD muxes in a LAB instead of 80. in other words, a LAB can have at most 46 unique external signals as inputs to its ALMs, and nextpnr-mistral reserves four TDs for control signals to the DFFs (D flip-flops), leaving 42 TDs for ALM input signals. this makes TD accounting important for maximising utilisation of the LAB.

the old way

in the very beginning, there was no TD accounting: the placer filled LABs with as much stuff as it could, entirely unaware of TD limits. nothing routed, because this inevitably required routing multiple signals through the same TD, which is not possible in hardware.

this was not a particularly great situation. so, having counted that a LAB has 42 TDs available for use by ALMs, gatecat wrote some code which counted the number of inputs to the ALMs, subtracted cases where individual ALMs shared inputs across multiple LUTs, and rejected placer solutions which had more than 42 inputs according to this count. I'm going to refer to this method as "counting".

counting effectively limits the LAB to having 42 unique inputs, reserving one TD mux for every ALM input. this overestimates TD usage; a signal going through a TD can route to multiple inputs, rendering the TDs reserved for those inputs unused. however, overestimating like this makes life easier for the router, because it results in fewer signals needing to be routed to a LAB, and less time spent figuring out how to map signals to TD muxes ("resolving TD congestion").

a different way

I'm not going to call my approach strictly "better" because increasing TD utilisation also increases time spent resolving TD congestion.

anyway, since TDs can each provide a signal, why not put all the signals into a hash set and reject solutions if the size of the hash set is over 42 elements? let's call this approach "hashing".
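in code, the check is about as simple as it sounds. here's a minimal sketch (the alm_input_signals iterable is hypothetical, standing in for however the placer enumerates the external signals feeding the ALMs):

# a minimal sketch of the "hashing" feasibility check.
# `alm_input_signals` is a hypothetical iterable of the external signals
# feeding each ALM of the LAB being evaluated.
def lab_is_feasible(alm_input_signals, max_tds=42):
    unique_signals = set(alm_input_signals)
    return len(unique_signals) <= max_tds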

hashing was my initial idea, and after adding some code to count the number of LABs used, it needed 10% fewer LABs than counting. however, a placement using hashing would not route; it would get really close to routing but never succeed, and I spent days bashing my head against this problem.

the depths of the LAB

a little-known feature of nextpnr is that it exposes a fairly large part of the internal architecture API to scripts through Python. this means one can, for example, bring up the Python REPL to play with things. combine this with Mistral's full knowledge of the routing graph and bitstream, and we can explore the LAB.

Mistral refers to the inputs of the ALM as GOUT (Global [Routing] Output) nodes, and these are exposed by nextpnr-mistral as wires with the name GOUT.XX.YY.ZZ, where XX and YY are the X and Y coordinates of the LAB, and ZZ is an index referring to the specific ALM input.

ZZ is the ALM index multiplied by eight (because there are eight ALM inputs), plus the ALM input index. for reasons, the 8 inputs of the ALM are, from index zero to seven: E0, F0, A, B, C, D, E1, F1. so, GOUT.1.1.2 is the A input of ALM 0 for the LAB at (1, 1).

knowing this, one can construct the name of an ALM input, and the nextpnr API will look up its corresponding wire. given a wire, you can use ctx.getPipsUphill(wire) to iterate over the PIPs (Programmable Interconnect Points) that use the wire as a destination, which gives you the TDs and LDs that connect to an ALM input.
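as a rough sketch of that REPL session (getWireByName and the exact name construction are assumptions here; adapt them to your nextpnr build):

# the 8 inputs of an ALM, in index order as described above.
INPUT_NAMES = ["E0", "F0", "A", "B", "C", "D", "E1", "F1"]

def alm_input_pips(ctx, x, y, alm, input_name):
    z = alm * 8 + INPUT_NAMES.index(input_name)
    wire = ctx.getWireByName("GOUT.%d.%d.%d" % (x, y, z))
    # the PIPs driving this wire are the TD and LD muxes.
    return list(ctx.getPipsUphill(wire))

# the A input of ALM 0 in the LAB at (1, 1), i.e. GOUT.1.1.2:
for pip in alm_input_pips(ctx, 1, 1, 0, "A"):
    print(pip)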

if one goes over all the ALM inputs within a LAB, one discovers a few things:

  • inputs can only read from about half of the available TDs. this is a not-uncommon thing for vendors to do, since having half the muxes takes half the area.
  • the pattern of mux connections is approximately this: there's a pattern for all A and C inputs, another pattern for all B and D inputs, yet another pattern for E0 and E1, and a fourth pattern for F0 and F1.
  • there is some overlap between these patterns, such that some TDs could also share signals between other input groups.

this is effectively a constraint satisfaction problem, and I have no experience with these (but I'd like to hear from those who do!).

so instead I decided on a compromise: hashing signals alone underestimates TD usage, but what if one assigns each TD mux pattern a number and hashes (signal, TD mux pattern) pairs instead?

while this packs less densely than pure hashing, it is still a density improvement over counting.
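concretely, the check becomes something like this (the assignments iterable of (signal, ALM input name) pairs is hypothetical, and the pattern numbering just follows the groups above):

# hash (signal, TD mux pattern) pairs instead of bare signals.
PATTERN = {"A": 0, "C": 0, "B": 1, "D": 1, "E0": 2, "E1": 2, "F0": 3, "F1": 3}

def lab_is_feasible(assignments, max_tds=42):
    # `assignments` is a hypothetical iterable of (signal, input name) pairs.
    used = {(signal, PATTERN[name]) for signal, name in assignments}
    return len(used) <= max_tds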

things for the future

I mentioned those LD muxes a while back and then didn't do anything with them. if one could do LD accountancy too, one would avoid reserving TDs for LAB-internal signals, which would further increase LAB utilisation. however, I have my suspicions that the LDs also only connect to some ALM inputs, so some more investigation is needed.

another thing is DFFs; DFFs packed with unrelated LUTs source their inputs from E0/E1 or F0/F1, and we don't know in advance which. that means we must reserve both, effectively costing two TDs per DFF. not cheap.

so does this help corescore?

probably not right now. perhaps in the future?

PS: are these getting formulaic? I'm trying to keep these interesting. also I have a patreon.

06/04/2023: 4am optimisation ramble

I'll be posting a few more articles from cohost, on the basis that one copy is none copy if nothing else.

here's a conversation I had with @ptrcnull while insomniac last night.


lofty

Whenever I have trouble sleeping, I think of one of the holy grails of FPGA synthesis. It doesn't help with the sleep but I feel like maybe I understand it a little better each time.

I could ramble at length about this, to be honest.

ptrc

you can ramble anytime; i'll probably understand ~10% of that, but that's still something!

lofty

An FPGA has a lot of weird bits to it, but the general bread and butter of it is:

  • look-up tables which can implement an arbitrary function of K inputs (where K is usually 4 or 6) in combinational logic.
  • D flip-flops, which store 1 bit of memory, latched by a clock.

Now, signals propagating through these look-up tables - LUTs - take time, and so the minimum clock period of the circuit is equal to the longest propagation delay between DFFs. For an optimal circuit, one would want the delay between all DFFs to be as balanced as possible, but with the blender that is an HDL synthesiser, humans are not very good at achieving that, and sometimes they just don't want to bother.

So, there is an optimisation called register retiming, which is essentially changing the position of DFFs in the circuit to balance the delay in the circuit.

One major problem is that retiming comes with an implicit proof obligation, and proving that a change to sequential logic is equivalent to the original circuit is a harder problem than proving that a change to combinational logic is equivalent.

Another major problem is that modern DFFs have a lot of special sauce to make things as powerful as possible. DFFs in ASICs power up to an unknown value depending on temperature, power, and the phase of the moon. I'll refer to this indeterminate state as x. x is poisonous, since the result of an operation with an x input is also x, so ASICs have special circuitry in them to force the flops to a known value. Forcing the flop output to 0 is called a "reset" and forcing the flop output to 1 is called a "set".

This brings a challenge to retiming: in a nice and convenient world where flops power up to x then what you do with them doesn't matter, because the state is still indeterminate. However, when you have resets, you need to prove that moving a flop with a reset preserves the circuit reset behaviour.

FPGAs can set/reset all flops at power up, since they do their own reset as part of configuration, which means they can also optionally reset other (user) flops along with it. But this means you have another proof obligation: the circuit must be equivalent at power up.

ptrc

as in, with every power up you get the exact same state?

lofty

Yes.

ptrc

hm, but wouldn't that always be the same if you just reset all flops significant to the logic?

lofty

Imagine the logic implements Q = ~(A & B) and A and B are both flops that initialise to 0.

If you want to move the flops from A and B to Q, then simply keeping it as a flop that initialises to 0 gives you the wrong result now, because ~(0 & 0) = 1 but Q initialises to 0.

Now, this is an easy thing to detect in this case: you can calculate the initial value of Q by just walking its input logic to see what it evaluates to.
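A tiny sketch of that walk, with a toy gate representation (nothing Yosys-specific):

# compute the new init value for a flop pushed through a gate: evaluate
# the gate over the old init values. toy representation only.
def pushed_flop_init(gate, input_inits):
    if gate == "nand":
        a, b = input_inits
        return int(not (a and b))
    raise NotImplementedError(gate)

print(pushed_flop_init("nand", [0, 0]))  # the moved flop must init to 1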

Deleting the flops at a gate's inputs (e.g. A and B) and adding the flop to the gate's output (Q) is known as forward retiming.

If there's a forward retiming, you can also guess there's backward retiming. Take the same Q = ~(A & B) circuit, and imagine that the flop on Q is deleted, and new flops are added to A and B. Question: what do A and B power up to?

ptrc

but.. that can result in multiple different things, right?

lofty

Exactly.

ptrc

if output is 0, then sure, it's straightforward

but all the other input options give you 1 on the output

lofty

Mhm~

ptrc

so.. how does one optimize that? do you, idk, look at all the other logic statements and make something up that makes sense?

lofty

Backwards retiming with initial values is an NP problem for this reason :p

But yeah, pretty much; finding a valid initialisation value is enough though.

This can get even worse when you consider that LUTs have 4 or 6 inputs; imagine the pain and suffering a 6-input XOR gate invites...

Then things get even worse than that ^.^

Now, I've handwaved a detail in the synthesiser which turns out to matter a lot: the synthesis tool takes in HDL, which describes logic, and outputs LUTs. But how does it do that? In simple terms, by keeping track of logic subgraphs that have no more than K inputs, which can each be implemented by a K-LUT.

If one performs forward retiming after LUT mapping, then the retiming logic is kind of imprecise; it's too late to break down a LUT. So, to do retiming for FPGAs "properly" one must simultaneously explore all possible LUT mappings of the input logic and all possible retimings. That is the holy grail of FPGA synthesis.

ptrc

hmm

i mean

that sounds impossible

lofty

Yosys - as well as several commercial tools - relies on a program called ABC by Alan Mishchenko to do LUT mapping. ABC can do a lot of things, it's very powerful and flexible and brittle.

The holy grail - "sequential LUT mapping" - is a problem so difficult that even Mishchenko gave up on it. It's even harder when one considers that his prototype for this also includes yet another important thing for optimisation: the concept of choice nodes.

In logic optimisation, one attempts to replace logic by an equivalent (but hopefully better) piece of logic, but you can't know for certain that it is better, because these optimisations are inherently local. They're effectively greedy algorithms :p

Instead of replacing the logic, however, ABC inserts a "choice node", which says "this bit of logic can be represented either by this original logic or this optimised logic, and I know these two forms are equivalent". A choice :p

This provides the obvious but surprisingly nontrivial bonus of never making the circuit worse :p

ptrc

ah, hm

so who's the one choosing then?

lofty

The LUT mapper; for nodes where delay matters, you can choose the mapping which leads to the lowest delay, and for nodes where delay doesn't matter, you can choose the mapping which leads to the lowest area.

(This "choosing" happens implicitly: mapping happens in a few different iterations based on different goals; the first iteration maps aiming greedily for the lowest depth mapping, which leads to the globally lowest depth mapping. Having found the longest path in the network one can then select for mappings which may increase the depth [logic delay from the network inputs to network outputs] for a path (as long as it doesn't increase global depth), but lead to a solution that uses fewer LUTs.)

Actually, LUT mapping is kind of its own kettle of fish, but the algorithm that ABC uses is from 2007 and there have been no improvements in the state of the art since, which is an age in the world of CS.

LUT mapping algorithms can be divided into two families, more or less:

  • "structural" LUT mappers map the input logic circuit as it is, which is very fast. this leads to some amount of chaos in the algorithm: changing the shape of the input circuit changes the shape of the output circuit.
  • "functional" LUT mappers create a LUT mapping that is equivalent to the input circuit, but generally using fancy but slower techniques to a) look for and remove redundancies in the input, and b) make the algorithm result less chaotic.

The ABC mapper is a structural mapper, and it is indeed very fast, and also scales very well. Choice nodes were intended to give a little bit of functional flavour to the otherwise structural mapper, since using the optimiser to vigorously shake the logic tree leads to a lot of choice nodes being discovered.

But at the end of the day, it's still a very chaotic mapper, and we had a bug report the other day about how adding an empty line to a file changes the result. (Yosys records line number and character as part of its front-end error message infrastructure, and also to help match up wires in the final netlist; inserting a new line changes that information, which makes it hash differently, which changes network traversal order.)

10/01/2023: vliw software pipelining

I'll be posting a few more articles from cohost, on the basis that one copy is none copy if nothing else.

the fediverse wanted me to post about this, but fedi itself is not so great for long-form content, so, cohost it is.

so, the itanium is generally considered to be quite crazy, but it comes with a surprising number of tricks up its sleeve. one of them is software pipelining, although the documentation also calls it modulo-scheduled loops because they like technical terms.

are you sitting comfortably? this'll be a relatively long explanation.


let's take a relatively simple loop, with four steps that must be executed in sequence. I'm going to omit the "check loop counter and branch" section for simplicity.

a diagram showing four boxes arranged vertically labeled "A", "B", "C", "D" connected with arrows in that order, with big arrows connecting D to A.

imagine that A loads from memory, B and C are logic operations, and then D writes to memory.

if we assume each step takes 1 cycle, then this loop takes 4 cycles to process 1 element. as a VLIW which would like to issue more than one instruction per cycle, this code makes us sad (Itanium can issue up to six instructions per cycle).

a better approach would be to execute N loop iterations simultaneously:

the above diagram, but with six parallel lines of A, B, C and D

this runs into a problem: it supposes we can issue all these instructions at once. if A and D are memory operations, then we need 6 load/store units in hardware; if B and C are logic operations, then we need 6 ALUs as well. this makes the CPU need a lot of silicon (although with SIMD operations in modern processors they did end up paying the silicon cost). the itanium was already a relatively huge chip, so Intel did not pay the silicon cost there.

what if we overlap loop iterations instead of executing them in parallel?

a loop of lines of boxes "A", "B", "C", and "D" arranged horizontally; there are arrows connecting the A of the first iteration to the B of the next iteration, for B to C and for C to D. above the loop is a prologue, where the B/C/D boxes of the first cycle are gone, then the C/D boxes are gone in the second cycle, and so on. below the loop is an epilogue, where the A box of the first cycle is gone, then the A/B boxes are gone in the second cycle, and so on.

I have unrolled the loop a little to make the dependencies clearer, because values being carried across loop iterations are hard to draw. assuming that A and D are load/store operations and B and C are ALU operations, this code needs only two load/store units and two ALUs, while achieving 4 instructions per cycle at peak. however, a prologue and epilogue are needed to populate and depopulate the registers around the loop.

prologues and epilogues do take up space, however, and Itanium doesn't actually need them; it supports conditional execution, where the result of an operation doesn't take effect if a predicate register is zero.

imagine that A, B, C and D are each predicated on their own independent register pA/pB/pC/pD: the prologue consists of setting pA on the first cycle, then pB on the second, then pC on the third. pD is set on the fourth cycle, and we can loop until "almost done", then clear pA on the first epilogue cycle, pB on the second, pC on the third. this is what the br.ctop instruction does.
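here's a little python simulation of that predicate dance, purely illustrative (real itanium rotates predicate registers; this just models which stages are active each cycle):

# which of the four stages A-D execute on each cycle, for six loop
# iterations; the prologue and epilogue fall out of the predicates.
def software_pipeline(n_elements, n_stages=4):
    preds = [False] * n_stages
    schedule = []
    for cycle in range(n_elements + n_stages - 1):
        # a new iteration enters stage A while elements remain; older
        # iterations shift down the pipeline, as br.ctop would arrange.
        preds = [cycle < n_elements] + preds[:-1]
        schedule.append([s for s, p in zip("ABCD", preds) if p])
    return schedule

for cycle, active in enumerate(software_pipeline(6)):
    print(cycle, active)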

so, on Itanium, the loop looks like this:

a line of boxes "A", "B", "C", and "D" arranged horizontally; there are arrows connecting the bottom to the top to show it's a loop

pretty neat, huh?

09/10/2020: FPGAs are Magic (II)

FPGAs Are Magic II

More boilerplate (but we're getting somewhere!)

This post covers the FPGA-flow-related boilerplate, as opposed to the ASIC-flow-related boilerplate of part I.

Our goal is to make the FPGA "good", but how do we quantify "good"?

We could measure the area of an FPGA design by multiplying the number of LUTs needed for it by the die area per LUT, giving us the die area needed to implement that design on the chip.
Equally, we could measure the speed of an FPGA by calculating the critical-path delay of a design implemented for it.

I decided to target both: by multiplying the area of the design by its critical-path delay, you get a figure called the delay-area product (DAP), which you can minimise.

From Yosys' point of view, area statistics can be calculated from the stat (statistics) command; critical-path delay is a bit trickier.
I opted to use the sta (static timing analysis) command from Yosys' eddie/sta branch, which re-uses timing information provided to ABC9 to calculate the critical path.
Even though that branch is a little outdated (it was last rebased in May), it's modern enough for our purposes.

Simulation models

Let's start off by writing simulation models of the LUT and DFF cells. I went up to LUT8s in my timing information, so here's a LUT8.
We can simulate smaller LUTs by just tying inputs to constants and limiting the maximum LUT size ABC uses.

I'll call this file rain_sim.v. We'll need to reference it in the synthesis script.

// The `(* abc9_lut=N *)` informs ABC9 that this cell describes the timings for
// a LUT, and the `N` parameter informs ABC9 of its relative area.
// e.g. on a frangible LUT6, you might want to inform ABC9 that a LUT6 uses 
// twice as much relative area as a LUT3 because you can pack two LUT3s together.
(* abc9_lut=1 *)
module rain_lut(input A, B, C, D, E, F, G, H, output Q);

// Declare the width explicitly so the 256-bit initialiser isn't truncated.
parameter [255:0] INIT = 0;

// Here's the other important part, a specify block, which describes cell timings
// to ABC9 and sta.
//
// `(A => Q) = N` means "there is a combinational path between A and Q, which has
// a delay of N units". By Yosys convention, these units are picoseconds.
// Note that `=>` is a one-to-one relationship; both sides must be the same width
// (1 bit in this case).
//
// To give a small example, from the last post I calculated a 1.018ns delay between 
// input H and the output. That would look like `(H => Q) = 1018;`.
specify
    (A => Q) = 0; // Fill
    (B => Q) = 0; // these
    (C => Q) = 0; // with
    (D => Q) = 0; // your
    (E => Q) = 0; // timing
    (F => Q) = 0; // information
    (G => Q) = 0; // in
    (H => Q) = 0; // picoseconds
endspecify

// This is pessimistic when it comes to Verilog x-propagation, but it's simple.
assign Q = INIT >> {H, G, F, E, D, C, B, A};

endmodule


module rain_dff(input CLK, D, output reg Q);

// Flops also have timing information, but since we're focusing on LUTs right
// now, it's okay to keep the delays at zero, even though they're unrealistic.
//
// `(posedge CLK => (Q : D)) = N` means "there is a path from D to Q that 
// changes on the positive edge of CLK, which has a delay of N units".
// This is the delay between the clock edge rising and the flop output changing.
//
// `$setup(D, posedge CLK, N)` means "the input at D must be stable for at least
// N units before the positive edge of CLK".
// It is possible to have negative setup times (where the input at D can 
// stabilise after the clock edge), but ABC9 doesn't support these and they will
// be clamped to zero with a warning; it's best to just put the actual value in 
// a comment and leave the field at zero.
specify
    (posedge CLK => (Q : D)) = 0;
    $setup(D, posedge CLK, 0);
endspecify

always @(posedge CLK)
    Q <= D;

endmodule

I named my FPGA "Rain" because Sky...Water...

You may also be wondering why I didn't use the term "fracturable LUT", and the answer is because "frangible" is funnier.

Mapping

ABC9 will produce a netlist which uses Yosys-internal $lut cells.
We need to map those to our LUT model, using the Yosys techmap pass that takes in a Verilog file.

I'll call this file rain_map.v.

// This uses an obscure corner of the Verilog standard: raw identifiers.
// An identifier starting with \ is parsed as an identifier until whitespace.
// This allows you to use special characters like $ and [].
module \$lut (A, Y);

parameter WIDTH = 0;
parameter LUT = 0;

// (* force_downto *) tells Yosys what to do in the event WIDTH is zero and this
// wire is [-1:0]: treat -1 as the MSB index and 0 as the LSB index, producing a
// zero-width wire that is otherwise unachievable in Verilog.
(* force_downto *)
input [WIDTH-1:0] A;
output reg Y;

// `_TECHMAP_REPLACE_` is a cell name specially recognised by `techmap`.
// It gives this cell the name and attributes of the cell it replaces, which is
// useful in 1:1 mappings like this.
//
// Likewise, `_TECHMAP_FAIL_` is a wire name specially recognised by `techmap`.
// When set to 1, the cell is skipped and this mapper has no effect, which means
// we don't have to map for all WIDTHs.
generate
    if (WIDTH == 1) begin
        rain_lut #(.INIT({128{LUT}})) _TECHMAP_REPLACE_ (.A(A[0]), .B(1), .C(1), .D(1), .E(1), .F(1), .G(1), .H(1), .Q(Y));
    end else
    if (WIDTH == 2) begin
        rain_lut #(.INIT({64{LUT}})) _TECHMAP_REPLACE_ (.A(A[0]), .B(A[1]), .C(1), .D(1), .E(1), .F(1), .G(1), .H(1), .Q(Y));
    end else
    if (WIDTH == 3) begin
        rain_lut #(.INIT({32{LUT}})) _TECHMAP_REPLACE_ (.A(A[0]), .B(A[1]), .C(A[2]), .D(1), .E(1), .F(1), .G(1), .H(1), .Q(Y));
    end else
    if (WIDTH == 4) begin
        rain_lut #(.INIT({16{LUT}})) _TECHMAP_REPLACE_ (.A(A[0]), .B(A[1]), .C(A[2]), .D(A[3]), .E(1), .F(1), .G(1), .H(1), .Q(Y));
    end else
    if (WIDTH == 5) begin
        rain_lut #(.INIT({8{LUT}})) _TECHMAP_REPLACE_ (.A(A[0]), .B(A[1]), .C(A[2]), .D(A[3]), .E(A[4]), .F(1), .G(1), .H(1), .Q(Y));
    end else
    if (WIDTH == 6) begin
        rain_lut #(.INIT({4{LUT}})) _TECHMAP_REPLACE_ (.A(A[0]), .B(A[1]), .C(A[2]), .D(A[3]), .E(A[4]), .F(A[5]), .G(1), .H(1), .Q(Y));
    end else
    if (WIDTH == 7) begin
        rain_lut #(.INIT({2{LUT}})) _TECHMAP_REPLACE_ (.A(A[0]), .B(A[1]), .C(A[2]), .D(A[3]), .E(A[4]), .F(A[5]), .G(A[6]), .H(1), .Q(Y));
    end else
    if (WIDTH == 8) begin
        rain_lut #(.INIT(LUT)) _TECHMAP_REPLACE_ (.A(A[0]), .B(A[1]), .C(A[2]), .D(A[3]), .E(A[4]), .F(A[5]), .G(A[6]), .H(A[7]), .Q(Y));
    end else
    begin
        wire _TECHMAP_FAIL_ = 1;
    end
endgenerate

endmodule


// Yosys' internal flop names get *complicated*, but this one means "a positive-
// edge clocked D flip flop".
module \$_DFF_P_ (input D, C, output reg Q);

rain_dff _TECHMAP_REPLACE_ (.CLK(C), .D(D), .Q(Q));

endmodule

Synthesis

Now we need another Yosys synthesis script, but this time for FPGAs instead of ASICs.

# Read the cells from the cell library, including their specify blocks, but treat
# them as black boxes.
read_verilog -lib -specify /path/to/rain_sim.v

# Perform generic synthesis for LUTs. Here, I'm targeting LUT8s, but you can change
# the -lut parameter to change that. More importantly, I'm stopping the `synth` pass
# partway through so we can manually set mapping options.
synth -auto-top -flatten -run :fine -lut 8

# Clean up the design.
opt -full

# Map the flops to the one flop type we currently have: an uninitialised D flip-flop.
# This will break on any design that requires initialised D flip-flops, or D latches, 
# but we'll get to that.
dfflegalize -cell $_DFF_P_ x

# Map the design to LUTs using ABC9.
# -maxlut tells ABC9 what the largest LUT is of this architecture (8)
# -W is a fudge factor representing interconnect delay, encouraging ABC9 to pack LUTs
# more efficiently.
abc9 -maxlut 8 -W 2000

# Clean up the design again.
opt -full

# It's useful to know the relative sizes of LUTs produced in the design, but since 
# there is one LUT cell in the library (for the sake of brevity), we'll check the 
# width of the Yosys-internal $lut cells before mapping them to rain_lut to obtain
# this information.
stat -width

# Map the $lut cells from ABC9 into our rain_lut cells.
techmap -map rain_map.v

# Run static timing analysis based on the information in the specify blocks to 
# calculate the critical path. Since the timing information is in the rain_lut 
# cells, we need to techmap before running this.
sta

You'll probably want to write a script to vary the size of the LUT and change the timings as necessary.
Then you can extract the relative LUT sizes from the output of stat and the critical path from sta.
I picked a few of the benchmarks from the EPFL combinational benchmark suite and here are my results.

Size   Relative Speed   Relative Area   Relative Delay-Area
LUT2   63.63%            103.76%         117.36%
LUT3   78.78%            113.34%         102.43%
LUT4   86.75%            202.30%         167.45%
LUT5   95.78%            315.72%         242.48%
LUT6   86.32%            566.59%         476.50%
LUT7   92.62%           1008.38%         828.96%
LUT8   68.87%           1842.33%        1954.12%

"Relative Speed" measures how fast the end result can go.
We don't want to artificially hobble the chip by using a slow architecture.

"Relative Area" measures the total die area needed to implement that design.
We don't want to spend an excessive amount of area for the architecture.

"Relative Delay-Area" measures how fast the end result is for its area.
We want a design which provides the best performance for its area.

From the data, implementing designs using LUT2s is slower and less efficient than LUT3s, because you need a lot of them for the same design.
Conversely, using LUT8s is slower and less efficient because it's difficult to use all of a LUT8, and all the logic results in slow switching speeds.
Let's discount them both.

The LUT5 is the fastest architecture here, with the LUT7 a close second.
This is surprising to me, because commercial FPGAs use LUT6s and LUT4s, and I was expecting these to be more competitive.

The LUT2 is the smallest design, with the LUT3 close behind.
This makes sense; they're the smallest LUTs, and going from LUT2 to LUT3 is a significant efficiency boost.

The lack of performance for the LUT2 is punished by the delay-area product, and so the LUT3 is the best there.

If I was going to implement an FPGA from that data, I would pick the LUT5.
It performs the best, and I think the performance of smaller LUTs will diminish when routing costs come into play.

But there are some other factors to be explored; you can combine smaller LUTs with fast muxes to increase performance, and you can use multiple outputs on large LUTs to reduce area.

In the next post, I will explore these.

26/10/2020: A LUT Mapper.

LUT Mapping with Priority Cuts

There's a disproportionately low number of people who know how to write a LUT mapper compared to the number of people who depend on one for a living.
So, in an effort to help with that, here's a description of what is - to my knowledge - the state of the art algorithm for it: Combinational and Sequential Mapping with Priority Cuts by Mishchenko et al.
I'm going to be writing it in Rust, using the itertools crate, because I enjoy Rust, but it should be implementable in any language.

The basics

The standard way to store logic networks is with an AND-Inverter Graph (AIG): AND gates with configurable inverted inputs.
This graph has "primary inputs" (the inputs to the logic network) and "primary outputs" (the outputs of the logic network).

[figure: an AND-Inverter Graph]

use itertools::Itertools;

/// An AND-Inverter Graph node.
#[derive(Copy, Clone)]
struct AigNode {
    /// The index of the left-hand-side child.
    /// The least significant bit indicates if the value is inverted.
    lhs: u32,
    /// The index of the right-hand-side child.
    /// The least significant bit indicates if the value is inverted.
    rhs: u32,
}

impl AigNode {
    /// Create a new `AigNode`.
    pub fn new(lhs_index: u32, lhs_is_inverted: bool, rhs_index: u32, rhs_is_inverted: bool) -> Option<Self> {
        // Since the indexes use one bit to store inversion (and having over 2^31 nodes is already impractically big),
        // fail if an index does not fit in 31 bits.
        if lhs_index >= 1 << 31 || rhs_index >= 1 << 31 {
            return None;
        }

        let lhs = lhs_index << 1 | lhs_is_inverted as u32;
        let rhs = rhs_index << 1 | rhs_is_inverted as u32;
        Some(Self { lhs, rhs })
    }

    /// Return the index of the left-hand-side node.
    pub fn lhs_index(self) -> u32 {
        self.lhs >> 1
    }

    /// Return whether the left-hand-side node is inverted.
    pub fn lhs_is_inverted(self) -> bool {
        self.lhs & 1 == 1
    }

    /// Return the index of the right-hand-side node.
    pub fn rhs_index(self) -> u32 {
        self.rhs >> 1
    }

    /// Return whether the right-hand-side node is inverted.
    pub fn rhs_is_inverted(self) -> bool {
        self.rhs & 1 == 1
    }
}

/// An AND-Inverter Graph.
/// Populating this graph is out of scope for this article, but see the AIGER format if you want more.
struct Aig {
    /// The primary inputs of the graph.
    inputs: Vec<u32>,
    /// The primary outputs of the graph.
    outputs: Vec<u32>,
    /// The graph nodes.
    nodes: Vec<AigNode>,
}

Cut enumeration

If you take an AIG node and recursively follow its inputs all the way down to the primary inputs of the graph, you have that node's "cone".

[figure: the cone of a node]

You can partition the nodes of this cone to produce a smaller contiguous cone called a "cut". A cut has at least one input and exactly one output.

[figure: a cut of a cone]

/// A partition of the graph, rooted at a node.
#[derive(Clone)]
struct Cut {
    inputs: Vec<u32>,
    output: u32,
}

impl Cut {
    /// Create a new cut. For efficiency, inputs must be in sorted order.
    pub fn new(inputs: Vec<u32>, output: u32) -> Self {
        // Vec::is_sorted is currently unstable :(
        assert!((0..inputs.len() - 1).all(|i| inputs[i] <= inputs[i + 1]));
        Self { inputs, output }
    }
}

A cut with no more than K inputs is a "K-feasible cut". To map to K-input LUTs, all cuts must be K-feasible. Since an AIG node has two inputs, K must be at least two.

[figure: a K-feasible cut]

impl Cut {
    /// Returns the number of inputs to this cut.
    pub fn input_count(&self) -> usize {
        self.inputs.len()
    }
}

You can make a "trivial cut" from a node by using the node's inputs as the cut inputs and the node's output as the cut output.

[figure: the trivial cut of a node]

A cut on a node's left-hand-side can be merged with another cut on the node's right-hand-side to create a new cut rooted at that node.

[figure: merging two cuts]

impl Cut {
    /// Merge two cuts together at `root`, returning the new cut.
    pub fn merge(&self, rhs: &Self, root: u32) -> Self {
        let inputs = self.inputs.iter()
            // Having the inputs sorted lets us merge the two cuts together in linear time...
            .merge(&rhs.inputs)
            // ...and deduplicate it with no additional space or allocations.
            .dedup()
            .copied()
            .collect::<Vec<u32>>();
        let output = root;
        Self { inputs, output }
    }
}

We can then explore all cuts in a graph by starting with the trivial cuts of the primary inputs, and then walking the graph in a topological order (a node's inputs before the node), building cuts by merging together the cuts of the node's children, adding the trivial cut of the node, and filtering out all cuts which are not K-feasible.

impl Aig {
    /// Return cuts in the graph that have at most `max_inputs` inputs.
    pub fn enumerate_cuts(&self, max_inputs: usize) -> Vec<Vec<Cut>> {
        // Mapping to 1-input LUTs doesn't make any sense.
        assert!(max_inputs >= 2);

        let mut cuts = vec![vec![]; self.nodes.len()];

        // Every primary input has exactly one cut: the trivial cut of that input.
        for input in &self.inputs {
            cuts[*input as usize] = vec![Cut::new(vec![*input], *input)];
        }

        // Writing a topological ordering is also out of scope, but see the Wikipedia article on it.
        let topological_ordering = self.topological_ordering();

        for node in topological_ordering {
            // Skip primary inputs.
            if self.inputs.contains(&node) {
                continue;
            }

            // Convenience variables for the children nodes.
            let lhs = self.nodes[node as usize].lhs_index();
            let rhs = self.nodes[node as usize].rhs_index();

            // Compute the trivial cut of this node.
            // Its inputs are the (sorted) inputs of this node.
            let mut trivial_cut_inputs = vec![lhs, rhs];
            trivial_cut_inputs.sort();
            // Its output is the output of this node.
            let trivial_cut_output = node;
            let trivial_cut = Cut::new(trivial_cut_inputs, trivial_cut_output);

            // Now for some iterator spam.
            let node_cuts = cuts[lhs as usize].iter()
            // Combine all child cuts together.
            .cartesian_product(&cuts[rhs as usize])
            // Merge the child cuts, using the current node as a root.
            .map(|(lhs, rhs)| lhs.merge(rhs, node))
            // Include the trivial cut of this node too.
            .chain(std::iter::once(trivial_cut))
            // Keep all `max_inputs`-feasible cuts.
            .filter(|cut| cut.input_count() <= max_inputs)
            // And then collect them into a vector.
            .collect::<Vec<Cut>>();

            // Set these cuts as the cuts at this node.
            cuts[node as usize] = node_cuts;
        }

        // Return the set of cuts for the graph.
        cuts
    }
}

This performs "exhaustive cut enumeration"; that is, it keeps all cuts for a node.
To produce a final mapping, we need to select cuts to form our final LUT network.

Cut selection

Before, we worked from the primary inputs and built cuts from the bottom up. To produce the final mapping, we start at the primary outputs and work down to the inputs.

impl Aig {
    /// Select the final mapping of LUTs.
    pub fn select_cuts(&self, max_inputs: usize) -> Vec<Cut> {
        // Calculate the cuts.
        let cuts = self.enumerate_cuts(max_inputs);

        // The mapping frontier is all the nodes which need a cut selected that do not presently have one.
        // We start with the primary outputs of the graph and let the algorithm do its job from there.
        let mut frontier = self.outputs.clone();

        // The selected mappings to return.
        let mut mapping = Vec::new();

        // The nodes which have a selected mapping, to avoid duplicate work for a node.
        let mut mapped_nodes = Vec::new();

        // Pick the next node from the frontier.
        while let Some(node) = frontier.pop() {
            // Pick the first cut as the one to use.
            let cut = cuts[node as usize][0].clone();

            // Add the cut's inputs to the mapping frontier if they are not primary inputs and not already mapped.
            for input in &cut.inputs {
                if !self.inputs.contains(input) && !mapped_nodes.contains(input) {
                    frontier.push(*input);
                }
            }

            // Mark this node as mapped.
            mapped_nodes.push(node);

            // Add the cut to the mapping.
            mapping.push(cut);
        }

        // Return the final mapping.
        mapping
    }
}

These cuts can then be turned into LUTs (with some difficulty).
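To sketch the idea (in Python for brevity; the tuple-based nodes representation here is a hypothetical stand-in for the AigNode vector above): the LUT mask falls out of simulating the cut's cone under every assignment of the cut inputs.

# `nodes[i]` is a hypothetical (lhs, lhs_inverted, rhs, rhs_inverted) tuple.
def cut_truth_table(nodes, cut_inputs, cut_output):
    mask = 0
    for assignment in range(1 << len(cut_inputs)):
        # Seed the cut inputs with the bits of this assignment.
        values = {node: (assignment >> bit) & 1 for bit, node in enumerate(cut_inputs)}

        def evaluate(node):
            # Recursion stops at the cut inputs, which bound the cone.
            if node not in values:
                lhs, lhs_inv, rhs, rhs_inv = nodes[node]
                values[node] = (evaluate(lhs) ^ lhs_inv) & (evaluate(rhs) ^ rhs_inv)
            return values[node]

        mask |= evaluate(cut_output) << assignment
    return mask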

I'm going to divide this series into two parts; this mapper produces a valid mapping, but the result is pretty terrible, and the algorithm doesn't scale well beyond 4 inputs per LUT.

We can make some reasonably small tweaks to drastically improve the quality of the mapping, and that will be the next page.

06/02/2022: Keep Calm and Carry On

Keep Calm and Carry On

Wow, it's been a while since I last wrote a blog article. Sorry about that.

Credit where credit is due: I learned this trick from the Altera Synthesis Cookbook.

Most modern FPGAs come with a carry chain: hard logic to quickly propagate the carry bit of a sum into the next bit.

If you think about a 1-bit full adder, there are some useful properties related to carries:

  • 1 + 1 always results in a carry-out. It "generates" the carry-out.
  • 1 + 0 + cin results in the carry-out equalling the carry-in. It "propagates" the carry-in to the carry-out.

Reduce-AND With A Carry Chain

In Verilog, using unary-& it is possible to AND every bit of a value together (which is 1 if every bit of the input is 1).
This is called "reduce-AND", and by itself it's not very common, but it's a fantastic building block.

If you think about how reduce-AND works, it is equivalent to asking if every pair of bits in the signal propagates a carry.
In other words, &x => most_significant_bit(x + 1). Using that identity, we can map a reduce-AND to a carry chain, getting the + 1 term by setting the carry-in to 1.

We can do better though. Since FPGAs are made up of K-input lookup tables (K-LUTs), we can first do a K-bit-wide reduce-AND using a LUT before going into the carry chain.
On an architecture like the Lattice ECP5, this is free since the LUT is before the carry chain, so the bits would have to go through the LUT anyway.
This transforms the problem into &x => &{&x[K-1:0], &x[2*K-1:K], ...} => most_significant_bit({&x[K-1:0], &x[2*K-1:K], ...} + 1).
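Both identities are easy to sanity-check in software. A quick Python check (widths and chunk sizes here are arbitrary):

# &x == carry-out of (x + 1): the (n+1)th bit of the sum, with carry-in 1.
def reduce_and(x, n):
    return (x + 1) >> n

# the chunked version: reduce-AND k-bit chunks in LUTs first, then chain
# the chunk results through the carry logic.
def reduce_and_chunked(x, n, k):
    chunks = [(x >> i) & ((1 << k) - 1) for i in range(0, n, k)]
    packed = sum(int(c == (1 << k) - 1) << i for i, c in enumerate(chunks))
    return (packed + 1) >> len(chunks)

for x in range(256):
    assert reduce_and(x, 8) == reduce_and_chunked(x, 8, 4) == int(x == 255)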

Notably, this is not free on the Lattice iCE40; because the carry chain is before the LUT, you need to use another LUT to perform this.
This leads to the interesting property that subtraction is more expensive on iCE40 than addition because you need a LUT for input inversion.

When I bring this idea up, I often get somebody mentioning big-O notation, since this approach is O(n/K) compared to O(log_K(n)) for a tree of K-LUTs.
However, propagating carries is really fast, so this approach is significantly faster than a tree of LUTs in practice.

Reduce-OR With A Carry Chain

Since reduce-AND is a thing, Verilog also includes reduce-OR as unary-| (which is 1 if any bit of the input is 1).

The core of the idea here is that through De Morgan's laws, one can transform |x into ~(&(~x)).

Since the K-LUTs are doing small reduce-ANDs already, we can pack the ~x terms into them by performing reduce-NOR (which doesn't exist in Verilog, but I hope makes sense) instead.
This then means the carry bit is 1 if all bits of the input are zero, which we can then invert somewhere to form the final reduce-OR.
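And the same quick check for the De Morgan version:

# |x == ~(&(~x)): invert the input, reuse the carry-out trick, invert the result.
def reduce_or(x, n):
    inverted = ~x & ((1 << n) - 1)
    return ((inverted + 1) >> n) ^ 1

for x in range(256):
    assert reduce_or(x, 8) == int(x != 0)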

The Gowin LittleBee FPGAs strongly resemble the Lattice ECP5, but because of a feature the ECP5 has but the LittleBee does not, reduce-OR like this is less efficient than reduce-AND.
It's still faster than a tree of LUTs however.

Equality With A Carry Chain

There are two ways of implementing this in practice: one way for equality between two signals, and another, more efficient way for equality between a signal and a constant.

The equality-between-two-signals approach is relatively straightforward.
Chunk the input into K/2-bit equalities, and then reduce-AND them together: A == B => &{A[K/2 - 1:0] == B[K/2 - 1:0], A[K-1:K/2] == B[K-1:K/2], ...}.

The equality-between-signal-and-constant approach is a little more complex.
Since the constant itself can be stored in the LUT, there is no need to share LUT inputs between A and B, so all of a LUT's inputs can be dedicated to the signal.
This results in A == B => &{A[K-1:0] == B[K-1:0], A[2*K-1:K] == B[2*K-1:K]}.

The equivalent transform can be made for inequality, as well.
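A quick sanity check of the chunking in Python (the chunk size is illustrative; the all() is the reduce-AND the carry chain performs):

# A == B iff every chunk matches; chunk equalities reduce-AND together.
def chunked_equal(a, b, width, chunk):
    mask = (1 << chunk) - 1
    return all(((a >> lo) & mask) == ((b >> lo) & mask)
               for lo in range(0, width, chunk))

assert chunked_equal(0xDEAD, 0xDEAD, 16, 2)
assert not chunked_equal(0xDEAD, 0xBEEF, 16, 2)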

The General Idea

Fundamentally, what this logic is doing is expressing a function as an AND of some other logic (this is "product of sums" form if the "other logic" is just OR, but there's nothing requiring it to be OR).
If the resulting function is relatively simple, it will probably map fairly efficiently to a carry chain.

15/02/2021: A Constructive Look At Pilkington's TS1 FPGA

A Constructive Look At Pilkington's TS1 FPGA

In case somebody hasn't read A Constructive Look At TempleOS, I really think you should.

Did you know that Pilkington - a glass manufacturer - once designed an FPGA? As you might have imagined, it wasn't very successful. One source suggests the fabs couldn't get the FPGA to work (for some unspecified definition of "work"), or perhaps that the Pilkington toolchain was not as mature as it could have been for a commercial offering.

But instead of saying "stick to your knitting", I tracked down a white paper and the architecture patent. These two conflict on some pieces of information, so I'll reference the patent whenever that happens. Let's take a look, shall we?

A Brief Description Of The Pilkington Architectures

Pilkington has two (known) architectures, both based on a sea-of-gates design, where a logic function is built up by linking gates together. This comes from the mid-80s, when different manufacturers had very different logic cell structures - only Xilinx had the lookup tables we're now familiar with.

The first architecture (I don't know of any particular name for it) was designed around a logic cell with a NAND gate and a latch, with each logic cell feeding into its adjacent neighbours through local interconnect. The white paper suggests that this was inefficient due to needing a lot of NAND gates for common functions like OR and XOR, as well as placement restrictions caused by needing to build a D flip-flop out of latches. I'm not going to go into too much more detail about it though, as it's not the focus of this article; if you want to read more, I'd suggest its architecture patent or the Toshiba paper.

The second architecture (the white paper calls it TS1, so TS1 it is) made me laugh when I looked at the diagrams of the patent. Logic optimisation programs like ABC represent logic as a structure of AND gates, XOR gates (which are common, but difficult to represent as ANDs), multiplexers (which are special cases that want to be preserved), and D flip-flops (as synchronous elements); all with per-input programmable inversion. Pilkington made an implementation of that in hardware in the mid-90s, when logic optimisers of the time were still using unwieldy sum-of-products form. It's quite ahead of its time, honestly.

TS1 Logic Cell

[figure: the TS1 combinational logic element]
The logic cell really is simplicity itself: the combinational logic element just sources the inputs from input selector muxes, optionally inverts them, feeds them to the inputs of a NAND, XOR and MUX, selects an output from them, and then inverts it to amplify the signal.

[figure: the TS1 sequential logic element]
The sequential logic element is along the same lines, except that the XOR is gone, and the MUX feeds into a DFF instead.

The output of a logic cell is directly connected to "local interconnect": fast links to the A and B input selectors of adjacent cells (patent Figure 2.4.B); it may also connect to the "medium interconnect": slower horizontal and vertical interconnect links - 6 per row/column - spread through a "zone" of logic cells.

TS1 Routing

Logic cells are grouped into a square of 3 combinational logic elements and a sequential logic element (Figure 2.5.B). These tiles come in two variants, A (Figure 2.4.C) and B (Figure 2.4.D), differing only in how they connect to the medium interconnect, and A and B are tiled together to form a zone of 5x5 tiles (Figure 2.4.E).

Each zone is surrounded by port cells that interface with "global interconnect": 4 lines along each row and column of zones to connect them together. This is - by modern standards - not very much global interconnect, so the placement tool has to make the most of the significantly more generous local routing per zone. The designers have another trick up their sleeve: if you make the logic gates fast enough, you can route logic through the gates without too much performance loss; this is why there are no direct links between the rows and columns of medium interconnect - those links are the logic gates.

Thoughts

This design feels really elegant to me, compared to some of the stranger logic families of the time (such as Actel's multiplexer architecture). The gates chosen are small and simple and easy for software to handle: synthesis isn't really a problem. Splitting the logic into zones was explicitly intended to make place-and-route easier by turning it into a divide and conquer problem, but perhaps modern algorithms would be fine treating it as a global problem (I can't see placement here being significantly trickier than on, say, an ECP5 85k).

So, perhaps, the implementation of the architecture had bugs that couldn't be fixed, or the 90's tooling was unable to exploit the hierarchy of the chip well enough. I think we should perhaps reconsider this architecture in light of the modern open tooling we have, both ASIC and FPGA. I think it's a lot more feasible to make the most of this nowadays than it was in the mid-90's when you had to roll your own toolchain.

Perhaps the actual moral should be "good ideas are timeless". This one certainly feels that way.

If you liked this article, why not support me on Patreon so I can eat?

25/09/2020: FPGAs are Magic (I).

FPGAs Are Magic I

Any sufficiently advanced technology is indistinguishable from magic.

  • Arthur C. Clarke

FPGAs are magic.

On my Twitter account, I have been posting diagrams of the Logic Parcel in a Hercules Micro HME-M7 and some routing that gets signals into the Logic Parcel, and the reactions were not the best.

So let's build one!

My only real prior knowledge of FPGA architecture is cursory knowledge of the Lattice iCE40 and ECP5 and Xilinx 7 Series, and some more detailed knowledge of the Intel Cyclone V and Hercules HME-M7.
I am relatively familiar with the Yosys toolchain, however, and I'll be using that as a synthesis tool.

Implementing a LUT and flop in nMigen

I'm going to skip a lot of the set-up, but I'm using nMigen, the 130nm SkyWater PDK, and OpenSTA.

To achieve anything, we need a LUT and a D flip-flop, and since I am a staunch advocate of ABC9, we need timings, too.

Let's imagine the most boring possible LUT with no carry logic, and the most boring possible D flip-flop with no init or resets.

Since we're experimenting, the LUT needs to be relatively flexibly designed. ASICs generally use latches instead of flops for storage, as they're smaller, so we'll use that for our LUT storage.

from nmigen import *
from nmigen.back import verilog


class Lut(Elaboratable):
    def __init__(self, k):
        """Instantiates a `k`-input, single-output LUT"""
        assert k > 1

        self.k      = k

        self.inputs = Signal(k) # LUT read address
        self.output = Signal(1, reset_less=True) # LUT read data

        # We need a write port to (theoretically) load our LUT data from.
        self.lut_wd = Signal(2**k) # LUT write data
        self.lut_we = Signal(1)    # LUT write enable (negative-true)
    
    def elaborate(self, platform):
        m = Module()

        inputs = Signal(self.k, reset_less=True)
        latchout = Signal(2**self.k)

        for i in range(2**self.k):
            # This is the latch primitive for the SkyWater PDK; Yosys can't map this to the cell by itself.
            m.submodules += Instance(
                "sky130_fd_sc_hs__dlxtn_1",
                i_D=self.lut_wd[i],
                i_GATE_N=self.lut_we,
                o_Q=latchout[i]
            )

        # HACK: OpenSTA measures delay between synchronous endpoints, so we need to encase the LUT in flops to time it.
        m.d.sync += [
            inputs.eq(self.inputs),
            self.output.eq(latchout.bit_select(inputs, width=1)),
        ]

        return m


class Dff(Elaboratable):
    def __init__(self):
        """Instantiates a simple D flip-flop"""
        self.d = Signal(1)
        self.q = Signal(1)

    def elaborate(self, platform):
        m = Module()

        m.d.sync += self.q.eq(self.d)

        return m


# Dump LUT as Verilog
with open("lut.v", "w") as f:
    lut = Lut(4) # Adjust as appropriate.
    ports = [
        lut.inputs,
        lut.lut_wd,
        lut.lut_we,

        lut.output
    ]
    f.write(verilog.convert(lut, ports=ports))

# Dump DFF as Verilog
with open("dff.v", "w") as f:
    dff = Dff()
    ports = [
        dff.d,
        dff.q,
    ]
    f.write(verilog.convert(dff, ports=ports))

Using Yosys for ASIC synthesis.

Now we need Yosys to perform ASIC synthesis, and this process is...not very well documented, so here's my stab at it.
I'm going to use the high-speed SkyWater cell library (sky130_fd_sc_hs), because this is a thought experiment and concerns like area and power usage don't matter to me right now.

I'm using a pretty conservative ABC synthesis script because I want ABC to use the multiplexer cells in the library. More aggressive synthesis tends to break up the structure of the mux.

# Read the cells from the cell library, but treat them as black boxes.
read_liberty -lib /path/to/skywater-pdk/libraries/sky130_fd_sc_hs/latest/timing/sky130_fd_sc_hs__tt_025C_1v80.lib

# Perform generic synthesis.
synth -auto-top -flatten

# Attempt to find cells that could be merged.
share -aggressive

# Clean up the design, removing dead cells and wires.
opt -purge

# Map flops in the design to the cell library.
dfflibmap -liberty /path/to/skywater-pdk/libraries/sky130_fd_sc_hs/latest/timing/sky130_fd_sc_hs__tt_025C_1v80.lib

# Map combinational cells in the design to the cell library, targeting smallest-possible delay.
abc -D 1 -liberty /path/to/skywater-pdk/libraries/sky130_fd_sc_hs/latest/timing/sky130_fd_sc_hs__tt_025C_1v80.lib

# Any undefined bits should be zeroes.
setundef -zero

# Break apart multi-bit wires into multiple single-bit wires.
splitnets

# Remove any wires that are dead because of that.
opt -purge

# Give cells better names 
autoname

# Pretty-print statistics about the design.
stat -liberty /path/to/skywater-pdk/libraries/sky130_fd_sc_hs/latest/timing/sky130_fd_sc_hs__tt_025C_1v80.lib

# Write the result, so that OpenSTA can use it.
write_verilog -noattr -noexpr -nohex -nodec netlist.v

Which prints something like this:

12. Printing statistics.

=== top ===

   Number of wires:                 32
   Number of wire bits:             50
   Number of public wires:          32
   Number of public wire bits:      50
   Number of memories:               0
   Number of memory bits:            0
   Number of processes:              0
   Number of cells:                 26
     sky130_fd_sc_hs__dfxtp_1        5
     sky130_fd_sc_hs__dlxtp_1       16
     sky130_fd_sc_hs__mux4_1         5

   Chip area for module '\top': 704.894400

Using OpenSTA for timing measurement

And then we need an OpenSTA script to print timing information about it.

If you missed the comment in the nMigen source, OpenSTA measures timing between synchronous endpoints, but LUTs are combinational, so we encase the LUT in flops to measure delays.
This makes checking timing a bit messy.

# Basic unit setup.
set_cmd_units -time ps -power W -current mA -voltage V

# If we use a cell that doesn't exist in the library, fail.
set link_make_black_boxes 0

# We are using one timing corner only.
define_corners tt_025C_1v80
read_liberty -corner tt_025C_1v80 /mnt/d/skywater-pdk/libraries/sky130_fd_sc_hs/latest/timing/sky130_fd_sc_hs__tt_025C_1v80.lib

read_verilog netlist.v

# Instantiate the design.
link_design top

# Create a clock, but we don't actually care about its time period, or its delays to inputs and outputs.
create_clock -name clk -period 0 {clk}
set_input_delay -clock clk 0 {rst inputs lut_wd lut_we}
set_output_delay -clock clk 0 {output}

# Then, calculate the timing delays from each input to the output (one report
# per LUT input; trim the list to match k).
report_checks -from "inputs\$1[0]_sky130_fd_sc_hs__dfxtp_1_Q" -to output_sky130_fd_sc_hs__dfxtp_1_Q -digits 3
report_checks -from "inputs\$1[1]_sky130_fd_sc_hs__dfxtp_1_Q" -to output_sky130_fd_sc_hs__dfxtp_1_Q -digits 3
report_checks -from "inputs\$1[2]_sky130_fd_sc_hs__dfxtp_1_Q" -to output_sky130_fd_sc_hs__dfxtp_1_Q -digits 3
report_checks -from "inputs\$1[3]_sky130_fd_sc_hs__dfxtp_1_Q" -to output_sky130_fd_sc_hs__dfxtp_1_Q -digits 3
report_checks -from "inputs\$1[4]_sky130_fd_sc_hs__dfxtp_1_Q" -to output_sky130_fd_sc_hs__dfxtp_1_Q -digits 3
report_checks -from "inputs\$1[5]_sky130_fd_sc_hs__dfxtp_1_Q" -to output_sky130_fd_sc_hs__dfxtp_1_Q -digits 3
report_checks -from "inputs\$1[6]_sky130_fd_sc_hs__dfxtp_1_Q" -to output_sky130_fd_sc_hs__dfxtp_1_Q -digits 3
report_checks -from "inputs\$1[7]_sky130_fd_sc_hs__dfxtp_1_Q" -to output_sky130_fd_sc_hs__dfxtp_1_Q -digits 3

exit

This will give you entries that look like this.

Startpoint: inputs$1[0]_sky130_fd_sc_hs__dfxtp_1_Q
            (rising edge-triggered flip-flop clocked by clk)
Endpoint: output_sky130_fd_sc_hs__dfxtp_1_Q
          (rising edge-triggered flip-flop clocked by clk)
Path Group: clk
Path Type: max

   Delay     Time   Description
-----------------------------------------------------------
   0.000    0.000   clock clk (rise edge)
   0.000    0.000   clock network delay (ideal)
   0.000    0.000 ^ inputs$1[0]_sky130_fd_sc_hs__dfxtp_1_Q/CLK (sky130_fd_sc_hs__dfxtp_1)
   2.430    2.430 ^ inputs$1[0]_sky130_fd_sc_hs__dfxtp_1_Q/Q (sky130_fd_sc_hs__dfxtp_1)
   0.371    2.800 v latchout[160]_sky130_fd_sc_hs__mux4_1_A0/X (sky130_fd_sc_hs__mux4_1)
   0.107    2.908 ^ latchout[160]_sky130_fd_sc_hs__mux4_1_A0_X_sky130_fd_sc_hs__o21ai_1_A2/Y (sky130_fd_sc_hs__o21ai_1)
   0.068    2.976 v $2_sky130_fd_sc_hs__mux4_1_X_A2_sky130_fd_sc_hs__mux4_1_X_A3_sky130_fd_sc_hs__o32ai_1_Y/Y (sky130_fd_sc_hs__o32ai_1)
   0.248    3.223 v $2_sky130_fd_sc_hs__mux4_1_X_A2_sky130_fd_sc_hs__mux4_1_X/X (sky130_fd_sc_hs__mux4_1)
   0.225    3.448 v $2_sky130_fd_sc_hs__mux4_1_X/X (sky130_fd_sc_hs__mux4_1)
   0.000    3.448 v output_sky130_fd_sc_hs__dfxtp_1_Q/D (sky130_fd_sc_hs__dfxtp_1)
            3.448   data arrival time

   0.000    0.000   clock clk (rise edge)
   0.000    0.000   clock network delay (ideal)
   0.000    0.000   clock reconvergence pessimism
            0.000 ^ output_sky130_fd_sc_hs__dfxtp_1_Q/CLK (sky130_fd_sc_hs__dfxtp_1)
  -0.129   -0.129   library setup time
           -0.129   data required time
-----------------------------------------------------------
           -0.129   data required time
           -3.448   data arrival time
-----------------------------------------------------------
           -3.577   slack (VIOLATED)

Here, OpenSTA is printing:

  • "arrival time", which is the time it takes for data to propagate through the network and stabilise.
  • "required time", which is the actual length a clock pulse is, minus the time an input needs to be stable for the flop to register it (the "setup time").
  • "slack", which is the required time minus the arrival time. If slack is positive, the design should run at this clock frequency; if slack is negative, it likely won't.

We don't care about required time at all (which is why the clock period is zero) as this is an asynchronous logic element.
Arrival time is important, however, as it contains the actual timings for the LUT.

The OpenSTA script reports timings from each input to the LUT output, which is the data we'll need for ABC9, but those timings also include an annoying synchronous delay: flops naturally have a delay from when the clock edge rises to when their output changes.
This delay (also called "arrival time") needs to be excluded from the timings.

The flop arrival time information is this line in the output:

   2.430    2.430 ^ inputs$1[0]_sky130_fd_sc_hs__dfxtp_1_Q/Q (sky130_fd_sc_hs__dfxtp_1)

Which tells us that there is a 2.43 nanosecond delay between the clock edge (the /CLK entry above it) and the flop output (Q) changing.

The total arrival time is in this line in the output:

            3.448   data arrival time

Which tells us there is a 3.448 nanosecond delay between the clock edge and the LUT output changing.

To find the actual delay, we just subtract the flop arrival time from the total arrival time, to get 3.448ns - 2.43ns = 1.018ns input to output delay.

To put that into perspective, here are the slowest input to output delays of some commercial FPGAs in Yosys:

  • Xilinx 7 Series (LUT6, 28nm): 0.642ns
  • Intel Cyclone V (LUT6, 28nm): 0.602ns
  • Lattice ECP5 (LUT4, 40nm): 0.379ns
  • Lattice iCE40HX (LUT4, 40nm): 0.449ns
  • Lattice iCE40UP (LUT4, 40nm): 1.285ns
  • Gowin GW1N (LUT4, 55nm): 1.638ns

Rinse and repeat for the size of LUT you're interested in. Don't assume that the timings of the same LUT with an extra input look similar; the resulting network could have different delay characteristics.

Alternatively, here are some timings I made earlier.

In part II, I'll be going into how these timings can be used in a Yosys FPGA flow to test and measure improvements.

02/06/2023: the corescore challenge (part 1)

this is a rough post on what I've tried doing to improve the performance of nextpnr-mistral on a specific benchmark: CoreScore. it might end up becoming a series, depending on how inspired I get to try to work on this.


what's corescore?

in 2018, there was a competition held by the risc-v foundation for open-source soft-cpu designs. the winner of the competition's "creativity prize" was SERV (bit-SErial RISC-V) by olof kindgren. SERV is a very tiny risc-v core because instead of doing operations with, say, 32 bits in one clock cycle, it processes a single bit per clock cycle ("bit-serial" logic), which makes the core logic minimal. if you want to learn more about it, olof's done a talk about it.
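
to make "bit-serial" concrete, here's a toy software model of a bit-serial adder (a sketch of the idea, not SERV's actual implementation): one result bit per loop iteration, with a single carry bit of state carried between iterations, which is why the datapath stays so small.

#include <cstdint>

// toy model: one "clock cycle" per loop iteration, producing one sum bit
// per cycle, with a single carry flop's worth of state between cycles.
uint32_t bitserial_add(uint32_t a, uint32_t b)
{
    uint32_t sum = 0;
    uint32_t carry = 0;
    for (int i = 0; i < 32; i++) {
        uint32_t ai = (a >> i) & 1;
        uint32_t bi = (b >> i) & 1;
        sum |= (ai ^ bi ^ carry) << i;
        carry = (ai & bi) | (carry & (ai ^ bi));
    }
    return sum;
}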

anyway, after the competition, olof kept tinkering with SERV, and eventually came up with a toolchain benchmark: how many SERV cores can one fit on an FPGA? that seemingly-simple question tests quite a lot of the toolchain: better synthesis makes SERV smaller, so more cores fit; better placement and routing makes more efficient use of the FPGA, again fitting more cores.

this benchmark is CoreScore, and there's even a small website about the largest number of SERV cores that can be fit on various boards.

for my purposes, the target board is the Terasic DE10-Nano which has an Intel Cyclone V FPGA on it, on which Quartus can fit 271 SERV cores.

Mistral...can only fit 84 of them. time to get to work.

the symptoms

Info: Device utilisation:
Info:           MISTRAL_COMB: 21873/83820    26%
Info:             MISTRAL_FF: 21469/167640    12%
Info:             MISTRAL_IO:     3/  472     0%
Info:         MISTRAL_CLKENA:     1/    2    50%
Info:    cyclonev_oscillator:     0/    1     0%
Info:   cyclonev_hps_interface_mpu_general_purpose:     0/    1     0%
Info:           MISTRAL_M10K:    84/  553    15%

these are the nextpnr device utilisation statistics for the 84-core SERV SoC (System-on-Chip) for Mistral.

one might first notice that utilising only 26% of the available LUTs (Look-Up Tables) suggests that there is a lot of headroom available, so why is this the limit for the tooling?

well, the device utilisation section only tells us a little bit of the story: there are more limits to FPGA utilisation than simply "how many potential places for LUTs (the "Basic Elements of Logic", or bels, from last article) are actually filled".

control set hell

FPGAs (Field-Programmable Gate Arrays) have LUTs to provide combinational logic, and DFFs (D Flip-Flops) to provide individual bits of memory to store the results of that logic.

a not-uncommon operation on DFFs is storing zero instead of the intended value when a particular condition holds; this is called synchronous-clear (synchronous because the clear only happens on a clock edge). it's relatively cheap to implement in hardware (an AND gate before the DFF input), so some vendors offer it to save LUTs.

on some FPGAs, synchronous-clear is cheap(ish). for example, a Lattice ECP5 PFU (Programmable Function Unit) containing eight LUT4s has two set/clear wires, each of which may be synchronous or asynchronous.

the Intel Cyclone V goes a different route: each LAB (Logic Array Block) has two dedicated asynchronous-clear wires, and one dedicated synchronous-clear wire which a flop may choose to ignore.

features such as synchronous-clear are called DFF controls, and the particular set of signals that drives the DFF controls is the "DFF control set". it's important that the control sets of flops packed in a LAB do not conflict: while a DFF in a LAB which has no synchronous-clear can be packed with a DFF which depends on a synchronous-clear, one cannot put two DFFs with different synchronous-clear signals in the same LAB. effectively, this means the maximum number of synchronous-clear signals available in the Cyclone V is the same as the number of LABs on the chip.
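
as a toy model of that rule (this is not nextpnr's actual legality checker), checking whether a set of flops can share a LAB's single synchronous-clear wire looks something like this:

#include <vector>

using NetId = int; // hypothetical net handle; 0 = no synchronous-clear

struct FlopCtrl {
    NetId sclr;
};

// a LAB has one synchronous-clear wire, and a flop may ignore it, so at
// most one distinct non-zero sclr net may appear among the packed flops.
bool lab_sclr_compatible(const std::vector<FlopCtrl> &flops)
{
    NetId lab_sclr = 0; // nobody has claimed the wire yet
    for (const auto &ff : flops) {
        if (ff.sclr == 0)
            continue; // this flop ignores the wire: always packable
        if (lab_sclr == 0)
            lab_sclr = ff.sclr; // first user claims the wire
        else if (ff.sclr != lab_sclr)
            return false; // two different sclr nets: control set conflict
    }
    return true;
}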

all this results in significantly fewer synchronous-clear resources on the Cyclone V compared to the ECP5. the chip on the DE10-Nano, the 5CSEBAU23I7, has 83,820 potential LUT slots; a comparable ECP5, the LFE5UM-85F, has 83,640 potential LUT slots, but the Cyclone V has only 4,191 synchronous-clears (one per LAB of 10 ALMs, each containing two LUT slots) compared to (up to) 20,910 synchronous-clears on the ECP5 (one per pair of LUTs).

why did I go on a multi-paragraph ramble about a relatively-obscure optimisation in FPGA hardware? well, Yosys really likes to infer synchronous-clear signals when available to reduce area, and SERV is optimised to make maximum use of these synchronous-clear signals to reduce area. all this combines to make SERV's optimisation for other architectures (where these things are cheap) rather expensive for Cyclone V.

now, the corescore rules require the toolchain to be as-upstream, so obviously I can't just hack my local copy of Yosys and claim victory.

a reasonable approach

I mentioned before that Yosys likes inferring synchronous-clears. this is because going from complex flops to simple flops is (relatively) easy, but going from simple flops to complex flops is harder. the Yosys pass for turning complex flops to simple flops is called dfflegalize, and it takes a description of the supported flops and the available flop-initialisation values.

in synth_intel_alm, one can legalise to:

  • positive-edge-clocked D flip-flops with negative-polarity asynchronous reset and positive-polarity clock enables that initialise to zero, or
  • positive-edge-clocked D flip-flops with positive-polarity synchronous reset and positive-polarity clock enables (that have priority over the reset).

that's a lot, and you don't have to worry about it too much, except that the description lets dfflegalize know that, for example, it can't use a flop that has both synchronous and asynchronous reset (which is a frankly terrifying proposition).

now, to handle cases like the Intel Cyclone V (or more specifically the Lattice iCE40, designed by SiliconBlue, a startup of ex-Intel employees), dfflegalize has an option called -minsrst N, which if specified will only use synchronous-clear when that signal is used by more than N flops, and otherwise emulate it with logic.
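
as an illustration, asking dfflegalize to emulate any synchronous-clear net with five or fewer users would look something like the line below; the exact cell patterns and init values are my sketch from the flop description above, so check synth_intel_alm's source for the real invocation.

dfflegalize -cell $_DFFE_PN0P_ 0 -cell $_SDFFCE_PP0P_ 0 -minsrst 5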

since 100 is a nice round number, let's see what it takes to build a SoC with 100 SERV cores.

since a synchronous-clear on the Cyclone V is roughly five times scarcer than on the ECP5, let's start with an N of 5.

Info: Device utilisation:
Info:           MISTRAL_COMB: 27837/83820    33%
Info:             MISTRAL_FF: 27819/167640    16%
Info:             MISTRAL_IO:     4/  472     0%
Info:         MISTRAL_CLKENA:     1/    2    50%
Info:    cyclonev_oscillator:     0/    1     0%
Info:   cyclonev_hps_interface_mpu_general_purpose:     0/    1     0%
Info:           MISTRAL_M10K:   101/  553    18%

Info: Placed 0 cells based on constraints.
Info: Creating initial analytic placement for 52917 cells, random placement wirelen = 5743416.
Info:     at initial placer iter 0, wirelen = 76
Info:     at initial placer iter 1, wirelen = 78
Info:     at initial placer iter 2, wirelen = 78
Info:     at initial placer iter 3, wirelen = 78
Info: Running main analytical placer, max placement attempts per cell = 10000.
ERROR: Unable to find legal placement for cell 'corescorecore.core_43.serving.rf_ram_if.rcnt_MISTRAL_FF_Q' after 10001 attempts, check constraints and utilisation. Use `--placer-heap-cell-placement-timeout` to change the number of attempts.
0 warnings, 1 error

one cup of tea later, it fails.

okay, let's double it and try again.

Info: Device utilisation:
Info:           MISTRAL_COMB: 28669/83820    34%
Info:             MISTRAL_FF: 27819/167640    16%
Info:             MISTRAL_IO:     4/  472     0%
Info:         MISTRAL_CLKENA:     1/    2    50%
Info:    cyclonev_oscillator:     0/    1     0%
Info:   cyclonev_hps_interface_mpu_general_purpose:     0/    1     0%
Info:           MISTRAL_M10K:   101/  553    18%

Info: Placed 0 cells based on constraints.
Info: Creating initial analytic placement for 53749 cells, random placement wirelen = 5794611.
Info:     at initial placer iter 0, wirelen = 76
Info:     at initial placer iter 1, wirelen = 78
Info:     at initial placer iter 2, wirelen = 78
Info:     at initial placer iter 3, wirelen = 78
Info: Running main analytical placer, max placement attempts per cell = 10000.
Info:     at iteration #1, type MISTRAL_COMB: wirelen solved = 21667, spread = 3861671, legal = 3862040; time = 0.83s
ERROR: Unable to find legal placement for cell 'corescorecore.core_48.serving.cpu.ctrl.i_jump_MISTRAL_FF_Q' after 10001 attempts, check constraints and utilisation. Use `--placer-heap-cell-placement-timeout` to change the number of attempts.
0 warnings, 1 error

nope. notice how the number of MISTRAL_COMB cells increases as more soft logic is used to emulate synchronous-clear.

no dice

so, I kept doubling the -minsrst value, but every reasonable value of -minsrst (and 1280, which verges on unreasonable) produces the same failure as above. I even tried -minsrst 12800, which still fails.

I think the problem comes down to there being a few giant synchronous-clear networks, and nextpnr is handling this situation terribly. it's certainly something to consider handling "properly" (perhaps we should do global promotion? to investigate).

but I've spent quite a while staring at Yosys building a netlist only for nextpnr to fail placing it. time to call it a day.

30/12/2020: Where have I been?

My general goal for this blog is for fairly well-edited technical writing. This particular post isn't it - it's actually quite depressing - and you're welcome to ignore it if you need to. But it's my blog, and I feel I need to write this for myself as much as anything.

So what was my sudden disappearance for? Mental health issues, mostly.

An incredibly abbreviated guide is that at the end of August my best friend of 8 years died from a rare form of prostate cancer. He wasn't even the first person I've lost to cancer, and that kind of thing messes with your head while you process it.

As much as anything, the articles I wrote before were a distraction from his death; with the university year starting up I tried putting it out of my mind to focus on my studies. These things often creep up on you though, and at the end of November I finally admitted I needed to take some time to process it, as I was struggling with university work.

So now I'm working for Antmicro on open FPGA tooling (presently Yosys, but it looks like I'll have the opportunity to work on the openfpga project too). Maybe I'll do a write-up of that if people are interested.

I didn't entirely intend to take a hiatus - I think more sources of technical writeups can only be a good thing - but I hope my readers will understand.

29/11/2022: so what have you been working on, lofty?

this is a crosspost from my cohost post a few days ago.
people might have missed me posting the link, and I also figured having an extra copy would be useful.

(yes, I meant to have this out on sunday, but oh well)

I spent most of my spare time during the week working on awooter. it's presently at the stage where, after partitioning the input netlist into arcs, it can find a route for every arc. there isn't any congestion avoidance or nextpnr state update yet, however, so this isn't usable in anger.

big thank you to @SpaceCat-Chan and @Jubilee for their help on various bits of debugging.


observation four: partitioning is highly parallel

I had assumed that the partitioner would run serially, casually picking pips, and this would be fine (enough) because the amount of parallelism in the router would make up for it. As it turns out, after writing the code, it's pretty easy to turn things into something MapReduce-like: for each arc, figure out if it needs partitioning, split it across a pip on the border, and then return the new arcs; these arcs can then be serially put into the four segments of the final partition.
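
in code, the shape is roughly the sketch below. the types and helpers here are hypothetical stand-ins, not awooter's actual API, and the real code splits across an actual border pip rather than a bare midpoint.

#include <array>
#include <cstddef>
#include <vector>

struct Point { int x, y; };
struct Arc { Point source, sink; };

// which of the four partition quadrants a point falls in (0..3)
static int quadrant(Point p, Point mid)
{
    return (p.x >= mid.x ? 1 : 0) | (p.y >= mid.y ? 2 : 0);
}

// map step: split one arc if it crosses the partition border.
// each call is independent of the others, so this parallelises trivially.
static std::vector<Arc> split_if_needed(const Arc &a, Point mid)
{
    if (quadrant(a.source, mid) == quadrant(a.sink, mid))
        return {a};
    return {{a.source, mid}, {mid, a.sink}};
}

// reduce step: serially bin the resulting arcs into the four segments.
std::array<std::vector<Arc>, 4> partition(const std::vector<Arc> &arcs, Point mid)
{
    std::vector<std::vector<Arc>> pieces(arcs.size());
    for (std::size_t i = 0; i < arcs.size(); i++) // a parallel-for in practice
        pieces[i] = split_if_needed(arcs[i], mid);

    std::array<std::vector<Arc>, 4> segments;
    for (const auto &piece : pieces)
        for (const auto &a : piece)
            segments[quadrant(a.source, mid)].push_back(a);
    return segments;
}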

observation five: you can't partition everything

this was annoying to realise after writing the partitioner and finding that the router couldn't find a path for some arcs. it comes down to FPGAs having special routing for clocks and for inter-slice carry signals: clock routing gets skipped (because nextpnr handles it separately), and carry signals get put into a fifth "segment". the special-case segment is a little annoying to have, but it's easy enough to deal with, and it only holds 1-2% of the total arcs in practice.

observation six: you can't partition everywhere

here's where nextpnr's low-level API bites a little: how do you know whether a pip is general routing that you can safely partition across? for a decent while my code used a duck test of sorts to guess, but that's impossible to get right. so for now, there's an explicit allow-list based on the names of the wires in the pip, as sketched below. this is... suboptimal, but effective.
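
the shape of that allow-list check is roughly this; the wire-name prefixes are made-up placeholders, not the real Cyclone V names.

#include <string>

// placeholder prefixes only; the real list names actual general-routing
// wire families on the Cyclone V.
bool is_partitionable_routing(const std::string &wire_name)
{
    for (const char *prefix : {"H_WIRE_", "V_WIRE_"}) {
        // starts_with, spelled for pre-C++20 compilers
        if (wire_name.rfind(prefix, 0) == 0)
            return true;
    }
    return false;
}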

29/05/2023: debugging a nextpnr-mistral issue: a write-up

so, I spent all night on this bug, and fedi thinks that a write-up of that might be useful for other people.


the symptoms

going into this, all I really knew was that the reporter said it was correlated to usage of the M10Ks (10-kilobit memory blocks) in nextpnr-mistral. let's run it and see.

$ ~/nextpnr/build-mistral/nextpnr-mistral --device ms --json blink_rgb.json --rbf blink_rgb.rbf --freq 50 --qsf io.qsf
Info: Initialising bels...
Info: Initialising routing graph...
Info:     imported 2739969 wires and 25465239 pips
Info: Constraining IO 'o_rgb_MISTRAL_OB_PAD_2' to pin PIN_W15 (bel MISTRAL_IO.89.8.1)
Info: Constraining IO 'o_rgb_MISTRAL_OB_PAD_1' to pin PIN_AA24 (bel MISTRAL_IO.89.9.2)
Info: Constraining IO 'o_rgb_MISTRAL_OB_PAD' to pin PIN_V16 (bel MISTRAL_IO.89.9.0)
Info: Constraining IO 'blink_rgb_0.i_reset_n_MISTRAL_IB_O' to pin PIN_AH17 (bel MISTRAL_IO.64.0.2)
Info: Constraining IO 'i_clock_MISTRAL_IB_PAD' to pin PIN_V11 (bel MISTRAL_IO.32.0.0)

Info: Annotating ports with timing budgets for target frequency 50.00 MHz
Info: Checksum: 0x12b4a4b4

Info: Device utilisation:
Info:           MISTRAL_COMB:   253/83820     0%
Info:             MISTRAL_FF:   103/167640     0%
Info:             MISTRAL_IO:     5/  472     1%
Info:         MISTRAL_CLKENA:     1/    2    50%
Info:    cyclonev_oscillator:     0/    1     0%
Info:   cyclonev_hps_interface_mpu_general_purpose:     0/    1     0%
Info:           MISTRAL_M10K:    16/  553     2%

Info: Placed 0 cells based on constraints.
Info: Creating initial analytic placement for 226 cells, random placement wirelen = 33762.
Info:     at initial placer iter 0, wirelen = 167
Info:     at initial placer iter 1, wirelen = 159
Info:     at initial placer iter 2, wirelen = 139
Info:     at initial placer iter 3, wirelen = 139
Info: Running main analytical placer, max placement attempts per cell = 17860.
Info:     at iteration #1, type MISTRAL_FF: wirelen solved = 756, spread = 1355, legal = 1526; time = 0.01s
Info:     at iteration #1, type MISTRAL_COMB: wirelen solved = 963, spread = 2087, legal = 2091; time = 0.01s
Info:     at iteration #1, type MISTRAL_M10K: wirelen solved = 1966, spread = 3097, legal = 3098; time = 0.01s
Info:     at iteration #1, type MISTRAL_CLKENA: wirelen solved = 3098, spread = 3098, legal = 3260; time = 0.00s
Info:     at iteration #1, type ALL: wirelen solved = 151, spread = 2739, legal = 3071; time = 0.01s
terminate called after throwing an instance of 'std::out_of_range'
  what():  dict::at()
Aborted (core dumped)

hmm, a crash in the analytical placer. odd.

dict is nextpnr's implementation of std::unordered_map if people were wondering; we use our own implementation to have identical behaviour across platforms.

the stack trace

#0  0x00007ffff73aa26c in ?? () from /usr/lib/libc.so.6
#1  0x00007ffff735aa08 in raise () from /usr/lib/libc.so.6
#2  0x00007ffff7343538 in abort () from /usr/lib/libc.so.6
#3  0x00007ffff76bba6f in __gnu_cxx::__verbose_terminate_handler ()
    at /usr/src/debug/gcc/gcc/libstdc++-v3/libsupc++/vterminate.cc:95
#4  0x00007ffff76cf11c in __cxxabiv1::__terminate (handler=<optimized out>)
    at /usr/src/debug/gcc/gcc/libstdc++-v3/libsupc++/eh_terminate.cc:48
#5  0x00007ffff76cf189 in std::terminate () at /usr/src/debug/gcc/gcc/libstdc++-v3/libsupc++/eh_terminate.cc:58
#6  0x00007ffff76cf3ed in __cxxabiv1::__cxa_throw (obj=<optimized out>,
    tinfo=0x7ffff788b0e8 <typeinfo for std::out_of_range>, dest=0x7ffff76e75c0 <std::out_of_range::~out_of_range()>)
    at /usr/src/debug/gcc/gcc/libstdc++-v3/libsupc++/eh_throw.cc:98
#7  0x000055555564396e in nextpnr_mistral::dict<nextpnr_mistral::IdString, nextpnr_mistral::ArchPinInfo, nextpnr_mistral::hash_ops<nextpnr_mistral::IdString> >::at (this=<optimized out>, key=...)
    at /home/lofty/nextpnr/common/kernel/hashlib.h:597
#8  0x0000555555695e69 in nextpnr_mistral::Arch::getBelPinsForCellPin (this=0x555566b67a20, pin=...,
    cell_info=<optimized out>) at /home/lofty/nextpnr/mistral/arch.h:446
#9  nextpnr_mistral::Context::predictArcDelay (this=0x555566b67a20, net_info=0x55558bfcf000, sink=...)
    at /home/lofty/nextpnr/common/kernel/context.cc:104
#10 0x00005555556ce13d in nextpnr_mistral::TimingAnalyser::get_route_delays (this=0x7fffffffb780)
    at /home/lofty/nextpnr/common/kernel/timing.cc:145
#11 nextpnr_mistral::TimingAnalyser::run (this=0x7fffffffb780, update_route_delays=<optimized out>)
    at /home/lofty/nextpnr/common/kernel/timing.cc:50
#12 0x000055555573b1bc in nextpnr_mistral::HeAPPlacer::place (this=0x7fffffffb600)
    at /home/lofty/nextpnr/common/place/placer_heap.cc:281
#13 nextpnr_mistral::placer_heap (cfg=..., ctx=<optimized out>) at /home/lofty/nextpnr/common/place/placer_heap.cc:1812
#14 nextpnr_mistral::Arch::place (this=0x555566b67a20) at /home/lofty/nextpnr/mistral/arch.cc:471
#15 0x000055555579f1e4 in nextpnr_mistral::CommandHandler::executeMain(std::unique_ptr<nextpnr_mistral::Context, std::default_delete<nextpnr_mistral::Context> >) [clone .constprop.0] (this=this@entry=0x7fffffffc640,
    ctx=std::unique_ptr<nextpnr_mistral::Context> = {...}) at /home/lofty/nextpnr/common/kernel/command.cc:469

the relevant frame here is #8: there's a crash in getBelPinsForCellPin. that function looks like this:

const std::vector<IdString> &getBelPinsForCellPin(const CellInfo *cell_info, IdString pin) const override
{
    return cell_info->pin_data.at(pin).bel_pins;
}

there's not much to go wrong here; the only call to at is the pin_data.at(pin), which suggests that pin doesn't exist, even though it should.

what is pin, anyway?

(gdb) p pin
$1 = {index = 82}

oh, right. nextpnr uses string interning to reduce memory usage, especially for repetitive sequences, and an interned string is called an IdString. so, we need to turn this IdString into an actual string.

one does this by calling the .str method of the IdString, passing it the Arch (...more specifically, the Context, of which Arch is a derived class) which contains the string pool. this gets one a std::string, so now one calls .c_str() on that to get a C string.

(can gdb print std::strings? probably, but .str(this).c_str() is so ingrained in my muscle memory that I pretty much always use it.)

(gdb) p pin.str(this).c_str()
$3 = 0x555566b6a828 "B1DATA"

B1DATA, hm.

some background

let's step back a little: what's a "bel"? what's a "bel pin"? what's a "cell" (in this context)?

nextpnr differentiates the abstract model of how the hardware can be configured - cells - from the concrete model of how the hardware actually is - bels (Basic Elements of Logic).

for example, a cell might be a LUT (Look-Up Table) generated by Yosys, but the corresponding bel would be an ALM (Adaptive Logic Module), which is a bit more complicated.

as one might see from the diagram, there needs to be a mapping from the inputs and outputs of an abstract cell to the concrete bel that will implement it: those bel inputs and outputs are "bel pins".

now, for some bels, you can use a static template, but the M10Ks need to have different cell-pin-to-bel-pin mappings depending on the configuration.

the M10K (10-kilobit memory block)

before I go into the code, I think it's worth describing why the code is what it is.

internally, the M10K is 40 sub-blocks of 8-bit address by 1-bit data; this gives a range of 8-bit address by 40-bit data, 9-bit address by 20-bit data, 10-bit address by 10-bit data, 11-bit address by 5-bit data, 12-bit address by 2-bit data, and 13-bit address by 1-bit data.
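
(sanity check: 40 sub-blocks × 256 addresses × 1 bit = 10,240 bits, and the first four geometries address all of them; the 12-bit by 2-bit and 13-bit by 1-bit modes can only address 8,192 bits.)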

it has two ports, each port containing a 12-bit address, a 20-bit data input and a 20-bit data output, plus read/write enables.

there is an internal data read multiplexer, but no corresponding data write demultiplexer, so one must manually wire up the write port.

  • 40-bit data mode...might be the subject of another post.
  • for 20-bit data mode, this is easy enough: data bit N goes to DATAAIN[N].
  • for 10-bit data mode, data bit N goes to the above and DATAAIN[N+10].
  • for 5-bit data mode, data bit N goes to the above, DATAAIN[N+5], and DATAAIN[N+15].
  • for 2-bit data mode, data bit N goes to the above, DATAAIN[N+2], DATAAIN[N+7], DATAAIN[N+12], DATAAIN[N+17].
  • for 1-bit data mode, data bit N goes in every bit except DATAAIN[4], DATAAIN[9], DATAAIN[14], and DATAAIN[19].

now, one may ask where the 13th address bit for 13x1-bit mode comes from; "obviously", it comes from DATAAIN[4].

it's an annoying exception to the rule, and printf debugging showed that this is where the problem lies.

the code

(this code has been lightly edited for clarity, but the bugs remain)

void setup_m10ks()
{
    for (auto &cell : ctx->cells) {
        CellInfo *ci = cell.second.get();
        if (ci->type != id_MISTRAL_M10K)
            continue;

        auto abits = ci->params.at(id_CFG_ABITS).as_int64();
        auto dbits = ci->params.at(id_CFG_DBITS).as_int64();
        NPNR_ASSERT(abits >= 7 && abits <= 13);
        NPNR_ASSERT(dbits == 1 || dbits == 2 || dbits == 5 || dbits == 10 || dbits == 20);

        // Quartus doesn't seem to generate ADDRSTALL[AB], BYTEENABLE[AB][01].

        // It *does* generate ACLR[01] but leaves them unconnected if unused.

        // Enables.
        // RDEN[1] and WREN[0] are left unconnected.
        ci->pin_data[ctx->id("A1EN")].bel_pins = {ctx->id("WREN[1]")};
        ci->pin_data[ctx->id("B1EN")].bel_pins = {ctx->id("RDEN[0]")};

        // Clocks.
        ci->pin_data[ctx->id("CLK1")].bel_pins = {ctx->id("CLKIN[0]")};

        // Clock enables left unconnected.

        // Address lines.
        int addr_offset = std::max(12 - std::max(abits, 9L), 0L);
        int bit_offset = (abits == 13);
        if (abits == 13) {
            ci->pin_data[ctx->id("A1ADDR[0]")].bel_pins = {ctx->id("DATAAIN[4]")};
            ci->pin_data[ctx->id("B1ADDR[0]")].bel_pins = {ctx->id("DATABIN[19]")};
        }
        for (int bit = bit_offset; bit < abits; bit++) {
            ci->pin_data[ctx->idf("A1ADDR[%d]", bit)].bel_pins = {ctx->idf("ADDRA[%d]", bit + addr_offset)};
            ci->pin_data[ctx->idf("B1ADDR[%d]", bit)].bel_pins = {ctx->idf("ADDRB[%d]", bit + addr_offset)};
        }

        // Data lines
        std::vector<int> offsets;
        offsets.push_back(0);
        if (abits >= 10 && dbits <= 10) {
            offsets.push_back(10);
        }
        if (abits >= 11 && dbits <= 5) {
            offsets.push_back(5);
            offsets.push_back(15);
        }
        if (abits >= 12 && dbits <= 2) {
            offsets.push_back(2);
            offsets.push_back(7);
            offsets.push_back(12);
            offsets.push_back(17);
        }
        if (abits == 13 && dbits == 1) {
            offsets.push_back(1);
            offsets.push_back(3);
            offsets.push_back(6);
            offsets.push_back(8);
            offsets.push_back(11);
            offsets.push_back(13);
            offsets.push_back(16);
            offsets.push_back(18);
        }
        for (int bit = 0; bit < dbits; bit++)
            for (int offset : offsets)
                ci->pin_data[ctx->idf("A1DATA[%d]", bit)].bel_pins.push_back(ctx->idf("DATAAIN[%d]", bit + offset));

        for (int bit = 0; bit < dbits; bit++)
            ci->pin_data[ctx->idf("B1DATA[%d]", bit)].bel_pins = {ctx->idf("DATABOUT[%d]", bit)};
    }
}

bug the first

if you remember from earlier, the missing entry is B1DATA. at the very bottom of the function is this:

        for (int bit = 0; bit < dbits; bit++)
            ci->pin_data[ctx->idf("B1DATA[%d]", bit)].bel_pins = {ctx->idf("DATABOUT[%d]", bit)};

where dbits is the data width in bits. in 13x1 mode, this is 1, so this loop assigns the IdString for DATABOUT[0] to ci->pin_data["B1DATA[0]"].bel_pins.

but the lookup isn't for the cell pin B1DATA[0]: because the port is a single bit wide, the netlist's cell pin is just B1DATA. so, we can special-case dbits == 1 and assign to the correct cell pin names.

        // snipped code as above.

        // In this corner case the pin name does not have indexing
        // because it's a single bit wide...
        if (abits == 13 && dbits == 1) {
            for (int offset : offsets)
                ci->pin_data[ctx->idf("A1DATA")].bel_pins.push_back(ctx->idf("DATAAIN[%d]", offset));
            ci->pin_data[ctx->idf("B1DATA")].bel_pins = {ctx->idf("DATABOUT[0]")};
            continue;
        }

        for (int bit = 0; bit < dbits; bit++)
            for (int offset : offsets)
                ci->pin_data[ctx->idf("A1DATA[%d]", bit)].bel_pins.push_back(ctx->idf("DATAAIN[%d]", bit + offset));

        for (int bit = 0; bit < dbits; bit++)
            ci->pin_data[ctx->idf("B1DATA[%d]", bit)].bel_pins = {ctx->idf("DATABOUT[%d]", bit)};
    }
}

let's compile it and run it again!

bug the second

$ ~/nextpnr/build-mistral/nextpnr-mistral --device ms --json blink_rgb.json --rbf blink_rgb.rbf --freq 50 --qsf io.qsf
Info: Initialising bels...
Info: Initialising routing graph...
Info:     imported 2739969 wires and 25465239 pips
Info: Constraining IO 'o_rgb_MISTRAL_OB_PAD_2' to pin PIN_W15 (bel MISTRAL_IO.89.8.1)
Info: Constraining IO 'o_rgb_MISTRAL_OB_PAD_1' to pin PIN_AA24 (bel MISTRAL_IO.89.9.2)
Info: Constraining IO 'o_rgb_MISTRAL_OB_PAD' to pin PIN_V16 (bel MISTRAL_IO.89.9.0)
Info: Constraining IO 'blink_rgb_0.i_reset_n_MISTRAL_IB_O' to pin PIN_AH17 (bel MISTRAL_IO.64.0.2)
Info: Constraining IO 'i_clock_MISTRAL_IB_PAD' to pin PIN_V11 (bel MISTRAL_IO.32.0.0)

Info: Annotating ports with timing budgets for target frequency 50.00 MHz
Info: Checksum: 0x12b4a4b4

Info: Device utilisation:
Info:           MISTRAL_COMB:   253/83820     0%
Info:             MISTRAL_FF:   103/167640     0%
Info:             MISTRAL_IO:     5/  472     1%
Info:         MISTRAL_CLKENA:     1/    2    50%
Info:    cyclonev_oscillator:     0/    1     0%
Info:   cyclonev_hps_interface_mpu_general_purpose:     0/    1     0%
Info:           MISTRAL_M10K:    16/  553     2%

Info: Placed 0 cells based on constraints.
Info: Creating initial analytic placement for 226 cells, random placement wirelen = 33762.
Info:     at initial placer iter 0, wirelen = 167
Info:     at initial placer iter 1, wirelen = 159
Info:     at initial placer iter 2, wirelen = 139
Info:     at initial placer iter 3, wirelen = 139
Info: Running main analytical placer, max placement attempts per cell = 17860.
Info:     at iteration #1, type MISTRAL_FF: wirelen solved = 756, spread = 1355, legal = 1526; time = 0.07s
Info:     at iteration #1, type MISTRAL_COMB: wirelen solved = 963, spread = 2087, legal = 2091; time = 0.03s
Info:     at iteration #1, type MISTRAL_M10K: wirelen solved = 1966, spread = 3097, legal = 3098; time = 0.03s
Info:     at iteration #1, type MISTRAL_CLKENA: wirelen solved = 3098, spread = 3098, legal = 3260; time = 0.02s
Info:     at iteration #1, type ALL: wirelen solved = 151, spread = 2739, legal = 3071; time = 0.11s
Info:     at iteration #2, type MISTRAL_FF: wirelen solved = 2954, spread = 2954, legal = 2954; time = 0.02s
Info:     at iteration #2, type MISTRAL_COMB: wirelen solved = 2533, spread = 2663, legal = 2816; time = 0.03s
Info:     at iteration #2, type MISTRAL_M10K: wirelen solved = 1444, spread = 2545, legal = 2540; time = 0.03s
Info:     at iteration #2, type MISTRAL_CLKENA: wirelen solved = 2380, spread = 2380, legal = 2540; time = 0.02s
Info:     at iteration #2, type ALL: wirelen solved = 241, spread = 2288, legal = 2525; time = 0.09s
Info:     at iteration #3, type MISTRAL_FF: wirelen solved = 2535, spread = 2535, legal = 2558; time = 0.05s
Info:     at iteration #3, type MISTRAL_COMB: wirelen solved = 2209, spread = 2589, legal = 2627; time = 0.03s
Info:     at iteration #3, type MISTRAL_M10K: wirelen solved = 1379, spread = 2346, legal = 2393; time = 0.03s
Info:     at iteration #3, type MISTRAL_CLKENA: wirelen solved = 2272, spread = 2272, legal = 2436; time = 0.02s
Info:     at iteration #3, type ALL: wirelen solved = 277, spread = 2239, legal = 2676; time = 0.09s
Info:     at iteration #4, type MISTRAL_FF: wirelen solved = 2453, spread = 2485, legal = 2578; time = 0.04s
Info:     at iteration #4, type MISTRAL_COMB: wirelen solved = 2335, spread = 2563, legal = 2434; time = 0.03s
Info:     at iteration #4, type MISTRAL_M10K: wirelen solved = 1301, spread = 2169, legal = 2168; time = 0.03s
Info:     at iteration #4, type MISTRAL_CLKENA: wirelen solved = 1998, spread = 1998, legal = 2168; time = 0.02s
Info:     at iteration #4, type ALL: wirelen solved = 284, spread = 1991, legal = 2312; time = 0.09s
Info:     at iteration #5, type MISTRAL_FF: wirelen solved = 2125, spread = 2097, legal = 2135; time = 0.03s
Info:     at iteration #5, type MISTRAL_COMB: wirelen solved = 1741, spread = 2248, legal = 2357; time = 0.03s
Info:     at iteration #5, type MISTRAL_M10K: wirelen solved = 1501, spread = 2288, legal = 2319; time = 0.03s
Info:     at iteration #5, type MISTRAL_CLKENA: wirelen solved = 2151, spread = 2151, legal = 2319; time = 0.02s
Info:     at iteration #5, type ALL: wirelen solved = 271, spread = 1924, legal = 2230; time = 0.09s
Info:     at iteration #6, type MISTRAL_FF: wirelen solved = 1975, spread = 1925, legal = 1930; time = 0.03s
Info:     at iteration #6, type MISTRAL_COMB: wirelen solved = 1783, spread = 2011, legal = 2047; time = 0.03s
Info:     at iteration #6, type MISTRAL_M10K: wirelen solved = 1268, spread = 2026, legal = 2029; time = 0.03s
Info:     at iteration #6, type MISTRAL_CLKENA: wirelen solved = 1897, spread = 1897, legal = 2067; time = 0.02s
Info:     at iteration #6, type ALL: wirelen solved = 375, spread = 1844, legal = 2087; time = 0.09s
Info:     at iteration #7, type MISTRAL_FF: wirelen solved = 2004, spread = 1974, legal = 2001; time = 0.03s
Info:     at iteration #7, type MISTRAL_COMB: wirelen solved = 1738, spread = 2205, legal = 2299; time = 0.03s
Info:     at iteration #7, type MISTRAL_M10K: wirelen solved = 1562, spread = 2258, legal = 2253; time = 0.03s
Info:     at iteration #7, type MISTRAL_CLKENA: wirelen solved = 2083, spread = 2083, legal = 2216; time = 0.02s
Info:     at iteration #7, type ALL: wirelen solved = 387, spread = 1759, legal = 2042; time = 0.08s
Info:     at iteration #8, type MISTRAL_FF: wirelen solved = 1975, spread = 1905, legal = 1945; time = 0.03s
Info:     at iteration #8, type MISTRAL_COMB: wirelen solved = 1720, spread = 2004, legal = 2044; time = 0.03s
Info:     at iteration #8, type MISTRAL_M10K: wirelen solved = 1350, spread = 2044, legal = 2042; time = 0.03s
Info:     at iteration #8, type MISTRAL_CLKENA: wirelen solved = 1911, spread = 1911, legal = 2079; time = 0.03s
Info:     at iteration #8, type ALL: wirelen solved = 489, spread = 1724, legal = 1902; time = 0.08s
Info:     at iteration #9, type MISTRAL_FF: wirelen solved = 1851, spread = 1832, legal = 2005; time = 0.05s
Info:     at iteration #9, type MISTRAL_COMB: wirelen solved = 1750, spread = 2052, legal = 2111; time = 0.03s
Info:     at iteration #9, type MISTRAL_M10K: wirelen solved = 1395, spread = 2109, legal = 2117; time = 0.03s
Info:     at iteration #9, type MISTRAL_CLKENA: wirelen solved = 1951, spread = 1951, legal = 2117; time = 0.02s
Info:     at iteration #9, type ALL: wirelen solved = 370, spread = 1731, legal = 1995; time = 0.08s
Info:     at iteration #10, type MISTRAL_FF: wirelen solved = 1960, spread = 2107, legal = 2141; time = 0.03s
Info:     at iteration #10, type MISTRAL_COMB: wirelen solved = 1812, spread = 2145, legal = 2226; time = 0.03s
Info:     at iteration #10, type MISTRAL_M10K: wirelen solved = 1509, spread = 2192, legal = 2202; time = 0.03s
Info:     at iteration #10, type MISTRAL_CLKENA: wirelen solved = 2042, spread = 2042, legal = 2168; time = 0.02s
Info:     at iteration #10, type ALL: wirelen solved = 423, spread = 1718, legal = 1892; time = 0.08s
Info:     at iteration #11, type MISTRAL_FF: wirelen solved = 1798, spread = 1757, legal = 1788; time = 0.03s
Info:     at iteration #11, type MISTRAL_COMB: wirelen solved = 1536, spread = 1738, legal = 1842; time = 0.03s
Info:     at iteration #11, type MISTRAL_M10K: wirelen solved = 1209, spread = 2137, legal = 2026; time = 0.03s
Info:     at iteration #11, type MISTRAL_CLKENA: wirelen solved = 1888, spread = 1888, legal = 2048; time = 0.03s
Info:     at iteration #11, type ALL: wirelen solved = 477, spread = 1786, legal = 2097; time = 0.09s
Info:     at iteration #12, type MISTRAL_FF: wirelen solved = 1943, spread = 1904, legal = 1919; time = 0.03s
Info:     at iteration #12, type MISTRAL_COMB: wirelen solved = 1751, spread = 2170, legal = 2291; time = 0.03s
Info:     at iteration #12, type MISTRAL_M10K: wirelen solved = 1580, spread = 2517, legal = 2390; time = 0.03s
Info:     at iteration #12, type MISTRAL_CLKENA: wirelen solved = 2242, spread = 2242, legal = 2390; time = 0.02s
Info:     at iteration #12, type ALL: wirelen solved = 544, spread = 2184, legal = 2221; time = 0.08s
Info:     at iteration #13, type MISTRAL_FF: wirelen solved = 2149, spread = 2152, legal = 2248; time = 0.04s
Info:     at iteration #13, type MISTRAL_COMB: wirelen solved = 1967, spread = 2696, legal = 2708; time = 0.03s
Info:     at iteration #13, type MISTRAL_M10K: wirelen solved = 1965, spread = 2734, legal = 2669; time = 0.03s
Info:     at iteration #13, type MISTRAL_CLKENA: wirelen solved = 2531, spread = 2531, legal = 2669; time = 0.02s
Info:     at iteration #13, type ALL: wirelen solved = 599, spread = 2158, legal = 2218; time = 0.08s
Info:     at iteration #14, type MISTRAL_FF: wirelen solved = 2099, spread = 2144, legal = 2143; time = 0.03s
Info:     at iteration #14, type MISTRAL_COMB: wirelen solved = 1890, spread = 2381, legal = 2397; time = 0.03s
Info:     at iteration #14, type MISTRAL_M10K: wirelen solved = 1613, spread = 2314, legal = 2283; time = 0.03s
Info:     at iteration #14, type MISTRAL_CLKENA: wirelen solved = 2151, spread = 2151, legal = 2283; time = 0.03s
Info:     at iteration #14, type ALL: wirelen solved = 673, spread = 1916, legal = 2049; time = 0.08s
Info:     at iteration #15, type MISTRAL_FF: wirelen solved = 2085, spread = 2077, legal = 2095; time = 0.04s
Info:     at iteration #15, type MISTRAL_COMB: wirelen solved = 1821, spread = 2250, legal = 2257; time = 0.03s
Info:     at iteration #15, type MISTRAL_M10K: wirelen solved = 1471, spread = 2170, legal = 2170; time = 0.03s
Info:     at iteration #15, type MISTRAL_CLKENA: wirelen solved = 2042, spread = 2042, legal = 2150; time = 0.03s
Info:     at iteration #15, type ALL: wirelen solved = 593, spread = 1840, legal = 1942; time = 0.08s
Info: HeAP Placer Time: 4.88s
Info:   of which solving equations: 1.94s
Info:   of which spreading cells: 0.63s
Info:   of which strict legalisation: 0.45s

Info: Running simulated annealing placer for refinement.
Info:   at iteration #1: temp = 0.000000, timing cost = 14, wirelen = 1892
Info:   at iteration #5: temp = 0.000000, timing cost = 4, wirelen = 1435
Info:   at iteration #10: temp = 0.000000, timing cost = 5, wirelen = 1359
Info:   at iteration #14: temp = 0.000000, timing cost = 4, wirelen = 1341
Info: SA placement time 0.81s

Info: Max frequency for clock 'blink_rgb_0.i_clock': 324.68 MHz (PASS at 50.00 MHz)

Info: Slack histogram:
Info:  legend: * represents 1 endpoint(s)
Info:          + represents [1,1) endpoint(s)
Info: [ 16920,  17043) |****
Info: [ 17043,  17166) |****
Info: [ 17166,  17289) |*******************
Info: [ 17289,  17412) |*************************
Info: [ 17412,  17535) |*********************
Info: [ 17535,  17658) |*************
Info: [ 17658,  17781) |**********************
Info: [ 17781,  17904) |*******************
Info: [ 17904,  18027) |*****************************
Info: [ 18027,  18150) |*****************
Info: [ 18150,  18273) |***************
Info: [ 18273,  18396) |***********
Info: [ 18396,  18519) |
Info: [ 18519,  18642) |
Info: [ 18642,  18765) |
Info: [ 18765,  18888) |
Info: [ 18888,  19011) |
Info: [ 19011,  19134) |*
Info: [ 19134,  19257) |***
Info: [ 19257,  19380) |***
Info: Checksum: 0xd45ef47f
Info: Preparing LABs for routing...
Info: Routing globals...
Info:     routed net 'blink_rgb_0.i_clock' using global resources
Info: Running router2...
Info: Setting up routing resources...
ERROR: No wire found for port B1ADDR[12] on destination cell blink_rgb_0.rom_0.123.0.0.0.
0 warnings, 1 error

we made it through the analytical placer at least?

I mentioned before that the M10K has 12-bit address inputs, but B1ADDR[12] (0-indexed) would refer to the 13th bit of this address input, which...doesn't exist.

// Address lines.
int addr_offset = std::max(12 - std::max(abits, 9L), 0L);
int bit_offset = (abits == 13);
if (abits == 13) {
    ci->pin_data[ctx->id("A1ADDR[0]")].bel_pins = {ctx->id("DATAAIN[4]")};
    ci->pin_data[ctx->id("B1ADDR[0]")].bel_pins = {ctx->id("DATABIN[19]")};
}
for (int bit = bit_offset; bit < abits; bit++) {
    ci->pin_data[ctx->idf("A1ADDR[%d]", bit)].bel_pins = {ctx->idf("ADDRA[%d]", bit + addr_offset)};
    ci->pin_data[ctx->idf("B1ADDR[%d]", bit)].bel_pins = {ctx->idf("ADDRB[%d]", bit + addr_offset)};
}

the cause of that is bit_offset. in 13x1-bit mode, the least-significant address bit gets stuffed into DATAAIN[4] and skipped by the loop, and the remaining bits get assigned as usual; however, this leads to ADDR{A,B} starting at one instead of zero.

so, to fix this we need to subtract bit_offset to make ADDR{A,B} zero-indexed again.

there is another trick available here, but I chose not to use it due to it being a bit more surprising: if addr_offset did not have the std::max to keep it non-negative, then in 13x1 mode, addr_offset would be -1, which would correct for the one-indexing.

// Address lines.
int addr_offset = std::max(12 - std::max(abits, 9L), 0L);
int bit_offset = (abits == 13);
if (abits == 13) {
    ci->pin_data[ctx->id("A1ADDR[0]")].bel_pins = {ctx->id("DATAAIN[4]")};
    ci->pin_data[ctx->id("B1ADDR[0]")].bel_pins = {ctx->id("DATABIN[19]")};
}
for (int bit = bit_offset; bit < abits; bit++) {
    ci->pin_data[ctx->idf("A1ADDR[%d]", bit)].bel_pins = {ctx->idf("ADDRA[%d]", bit + addr_offset - bit_offset)};
    ci->pin_data[ctx->idf("B1ADDR[%d]", bit)].bel_pins = {ctx->idf("ADDRB[%d]", bit + addr_offset - bit_offset)};
}

let's compile it and run it again!

[final fantasy victory sound]

$ ~/nextpnr/build-mistral/nextpnr-mistral --device ms --json blink_rgb.json --rbf blink_rgb.rbf --freq 50 --qsf io.qsf
Info: Initialising bels...
Info: Initialising routing graph...
Info:     imported 2739969 wires and 25465239 pips
Info: Constraining IO 'o_rgb_MISTRAL_OB_PAD_2' to pin PIN_W15 (bel MISTRAL_IO.89.8.1)
Info: Constraining IO 'o_rgb_MISTRAL_OB_PAD_1' to pin PIN_AA24 (bel MISTRAL_IO.89.9.2)
Info: Constraining IO 'o_rgb_MISTRAL_OB_PAD' to pin PIN_V16 (bel MISTRAL_IO.89.9.0)
Info: Constraining IO 'blink_rgb_0.i_reset_n_MISTRAL_IB_O' to pin PIN_AH17 (bel MISTRAL_IO.64.0.2)
Info: Constraining IO 'i_clock_MISTRAL_IB_PAD' to pin PIN_V11 (bel MISTRAL_IO.32.0.0)

Info: Annotating ports with timing budgets for target frequency 50.00 MHz
Info: Checksum: 0x12b4a4b4

Info: Device utilisation:
Info:           MISTRAL_COMB:   253/83820     0%
Info:             MISTRAL_FF:   103/167640     0%
Info:             MISTRAL_IO:     5/  472     1%
Info:         MISTRAL_CLKENA:     1/    2    50%
Info:    cyclonev_oscillator:     0/    1     0%
Info:   cyclonev_hps_interface_mpu_general_purpose:     0/    1     0%
Info:           MISTRAL_M10K:    16/  553     2%

Info: Placed 0 cells based on constraints.
Info: Creating initial analytic placement for 226 cells, random placement wirelen = 33762.
Info:     at initial placer iter 0, wirelen = 167
Info:     at initial placer iter 1, wirelen = 159
Info:     at initial placer iter 2, wirelen = 139
Info:     at initial placer iter 3, wirelen = 139
Info: Running main analytical placer, max placement attempts per cell = 17860.
Info:     at iteration #1, type MISTRAL_FF: wirelen solved = 756, spread = 1355, legal = 1526; time = 0.06s
Info:     at iteration #1, type MISTRAL_COMB: wirelen solved = 963, spread = 2087, legal = 2091; time = 0.03s
Info:     at iteration #1, type MISTRAL_M10K: wirelen solved = 1966, spread = 3097, legal = 3098; time = 0.03s
Info:     at iteration #1, type MISTRAL_CLKENA: wirelen solved = 3098, spread = 3098, legal = 3260; time = 0.02s
Info:     at iteration #1, type ALL: wirelen solved = 151, spread = 2739, legal = 3071; time = 0.11s
Info:     at iteration #2, type MISTRAL_FF: wirelen solved = 2954, spread = 2954, legal = 2954; time = 0.02s
Info:     at iteration #2, type MISTRAL_COMB: wirelen solved = 2533, spread = 2663, legal = 2816; time = 0.03s
Info:     at iteration #2, type MISTRAL_M10K: wirelen solved = 1444, spread = 2545, legal = 2540; time = 0.03s
Info:     at iteration #2, type MISTRAL_CLKENA: wirelen solved = 2380, spread = 2380, legal = 2540; time = 0.02s
Info:     at iteration #2, type ALL: wirelen solved = 241, spread = 2288, legal = 2525; time = 0.09s
Info:     at iteration #3, type MISTRAL_FF: wirelen solved = 2535, spread = 2535, legal = 2558; time = 0.05s
Info:     at iteration #3, type MISTRAL_COMB: wirelen solved = 2209, spread = 2589, legal = 2627; time = 0.03s
Info:     at iteration #3, type MISTRAL_M10K: wirelen solved = 1379, spread = 2346, legal = 2393; time = 0.03s
Info:     at iteration #3, type MISTRAL_CLKENA: wirelen solved = 2272, spread = 2272, legal = 2436; time = 0.02s
Info:     at iteration #3, type ALL: wirelen solved = 277, spread = 2239, legal = 2676; time = 0.09s
Info:     at iteration #4, type MISTRAL_FF: wirelen solved = 2453, spread = 2485, legal = 2578; time = 0.04s
Info:     at iteration #4, type MISTRAL_COMB: wirelen solved = 2335, spread = 2563, legal = 2434; time = 0.03s
Info:     at iteration #4, type MISTRAL_M10K: wirelen solved = 1301, spread = 2169, legal = 2168; time = 0.03s
Info:     at iteration #4, type MISTRAL_CLKENA: wirelen solved = 1998, spread = 1998, legal = 2168; time = 0.02s
Info:     at iteration #4, type ALL: wirelen solved = 284, spread = 1991, legal = 2312; time = 0.10s
Info:     at iteration #5, type MISTRAL_FF: wirelen solved = 2125, spread = 2097, legal = 2135; time = 0.03s
Info:     at iteration #5, type MISTRAL_COMB: wirelen solved = 1741, spread = 2248, legal = 2357; time = 0.03s
Info:     at iteration #5, type MISTRAL_M10K: wirelen solved = 1501, spread = 2288, legal = 2319; time = 0.03s
Info:     at iteration #5, type MISTRAL_CLKENA: wirelen solved = 2151, spread = 2151, legal = 2319; time = 0.02s
Info:     at iteration #5, type ALL: wirelen solved = 271, spread = 1924, legal = 2230; time = 0.09s
Info:     at iteration #6, type MISTRAL_FF: wirelen solved = 1975, spread = 1925, legal = 1930; time = 0.03s
Info:     at iteration #6, type MISTRAL_COMB: wirelen solved = 1783, spread = 2011, legal = 2047; time = 0.03s
Info:     at iteration #6, type MISTRAL_M10K: wirelen solved = 1268, spread = 2026, legal = 2029; time = 0.03s
Info:     at iteration #6, type MISTRAL_CLKENA: wirelen solved = 1897, spread = 1897, legal = 2067; time = 0.02s
Info:     at iteration #6, type ALL: wirelen solved = 375, spread = 1844, legal = 2087; time = 0.09s
Info:     at iteration #7, type MISTRAL_FF: wirelen solved = 2004, spread = 1974, legal = 2001; time = 0.03s
Info:     at iteration #7, type MISTRAL_COMB: wirelen solved = 1738, spread = 2205, legal = 2299; time = 0.03s
Info:     at iteration #7, type MISTRAL_M10K: wirelen solved = 1562, spread = 2258, legal = 2253; time = 0.03s
Info:     at iteration #7, type MISTRAL_CLKENA: wirelen solved = 2083, spread = 2083, legal = 2216; time = 0.02s
Info:     at iteration #7, type ALL: wirelen solved = 387, spread = 1759, legal = 2042; time = 0.08s
Info:     at iteration #8, type MISTRAL_FF: wirelen solved = 1975, spread = 1905, legal = 1945; time = 0.03s
Info:     at iteration #8, type MISTRAL_COMB: wirelen solved = 1720, spread = 2004, legal = 2044; time = 0.03s
Info:     at iteration #8, type MISTRAL_M10K: wirelen solved = 1350, spread = 2044, legal = 2042; time = 0.03s
Info:     at iteration #8, type MISTRAL_CLKENA: wirelen solved = 1911, spread = 1911, legal = 2079; time = 0.03s
Info:     at iteration #8, type ALL: wirelen solved = 489, spread = 1724, legal = 1902; time = 0.08s
Info:     at iteration #9, type MISTRAL_FF: wirelen solved = 1851, spread = 1832, legal = 2005; time = 0.05s
Info:     at iteration #9, type MISTRAL_COMB: wirelen solved = 1750, spread = 2052, legal = 2111; time = 0.03s
Info:     at iteration #9, type MISTRAL_M10K: wirelen solved = 1395, spread = 2109, legal = 2117; time = 0.03s
Info:     at iteration #9, type MISTRAL_CLKENA: wirelen solved = 1951, spread = 1951, legal = 2117; time = 0.02s
Info:     at iteration #9, type ALL: wirelen solved = 370, spread = 1731, legal = 1995; time = 0.08s
Info:     at iteration #10, type MISTRAL_FF: wirelen solved = 1960, spread = 2107, legal = 2141; time = 0.04s
Info:     at iteration #10, type MISTRAL_COMB: wirelen solved = 1812, spread = 2145, legal = 2226; time = 0.03s
Info:     at iteration #10, type MISTRAL_M10K: wirelen solved = 1509, spread = 2192, legal = 2202; time = 0.03s
Info:     at iteration #10, type MISTRAL_CLKENA: wirelen solved = 2042, spread = 2042, legal = 2168; time = 0.03s
Info:     at iteration #10, type ALL: wirelen solved = 423, spread = 1718, legal = 1892; time = 0.08s
Info:     at iteration #11, type MISTRAL_FF: wirelen solved = 1798, spread = 1757, legal = 1788; time = 0.03s
Info:     at iteration #11, type MISTRAL_COMB: wirelen solved = 1536, spread = 1738, legal = 1842; time = 0.03s
Info:     at iteration #11, type MISTRAL_M10K: wirelen solved = 1209, spread = 2137, legal = 2026; time = 0.03s
Info:     at iteration #11, type MISTRAL_CLKENA: wirelen solved = 1888, spread = 1888, legal = 2048; time = 0.03s
Info:     at iteration #11, type ALL: wirelen solved = 477, spread = 1786, legal = 2097; time = 0.09s
Info:     at iteration #12, type MISTRAL_FF: wirelen solved = 1943, spread = 1904, legal = 1919; time = 0.03s
Info:     at iteration #12, type MISTRAL_COMB: wirelen solved = 1751, spread = 2170, legal = 2291; time = 0.03s
Info:     at iteration #12, type MISTRAL_M10K: wirelen solved = 1580, spread = 2517, legal = 2390; time = 0.03s
Info:     at iteration #12, type MISTRAL_CLKENA: wirelen solved = 2242, spread = 2242, legal = 2390; time = 0.02s
Info:     at iteration #12, type ALL: wirelen solved = 544, spread = 2184, legal = 2221; time = 0.08s
Info:     at iteration #13, type MISTRAL_FF: wirelen solved = 2149, spread = 2152, legal = 2248; time = 0.04s
Info:     at iteration #13, type MISTRAL_COMB: wirelen solved = 1967, spread = 2696, legal = 2708; time = 0.03s
Info:     at iteration #13, type MISTRAL_M10K: wirelen solved = 1965, spread = 2734, legal = 2669; time = 0.03s
Info:     at iteration #13, type MISTRAL_CLKENA: wirelen solved = 2531, spread = 2531, legal = 2669; time = 0.02s
Info:     at iteration #13, type ALL: wirelen solved = 599, spread = 2158, legal = 2218; time = 0.08s
Info:     at iteration #14, type MISTRAL_FF: wirelen solved = 2099, spread = 2144, legal = 2143; time = 0.03s
Info:     at iteration #14, type MISTRAL_COMB: wirelen solved = 1890, spread = 2381, legal = 2397; time = 0.03s
Info:     at iteration #14, type MISTRAL_M10K: wirelen solved = 1613, spread = 2314, legal = 2283; time = 0.03s
Info:     at iteration #14, type MISTRAL_CLKENA: wirelen solved = 2151, spread = 2151, legal = 2283; time = 0.02s
Info:     at iteration #14, type ALL: wirelen solved = 673, spread = 1916, legal = 2049; time = 0.08s
Info:     at iteration #15, type MISTRAL_FF: wirelen solved = 2085, spread = 2077, legal = 2095; time = 0.04s
Info:     at iteration #15, type MISTRAL_COMB: wirelen solved = 1821, spread = 2250, legal = 2257; time = 0.03s
Info:     at iteration #15, type MISTRAL_M10K: wirelen solved = 1471, spread = 2170, legal = 2170; time = 0.03s
Info:     at iteration #15, type MISTRAL_CLKENA: wirelen solved = 2042, spread = 2042, legal = 2150; time = 0.03s
Info:     at iteration #15, type ALL: wirelen solved = 593, spread = 1840, legal = 1942; time = 0.08s
Info: HeAP Placer Time: 4.84s
Info:   of which solving equations: 1.93s
Info:   of which spreading cells: 0.63s
Info:   of which strict legalisation: 0.45s

Info: Running simulated annealing placer for refinement.
Info:   at iteration #1: temp = 0.000000, timing cost = 14, wirelen = 1892
Info:   at iteration #5: temp = 0.000000, timing cost = 4, wirelen = 1435
Info:   at iteration #10: temp = 0.000000, timing cost = 5, wirelen = 1359
Info:   at iteration #14: temp = 0.000000, timing cost = 4, wirelen = 1341
Info: SA placement time 0.81s

Info: Max frequency for clock 'blink_rgb_0.i_clock': 324.68 MHz (PASS at 50.00 MHz)

Info: Slack histogram:
Info:  legend: * represents 1 endpoint(s)
Info:          + represents [1,1) endpoint(s)
Info: [ 16920,  17043) |****
Info: [ 17043,  17166) |****
Info: [ 17166,  17289) |*******************
Info: [ 17289,  17412) |*************************
Info: [ 17412,  17535) |*********************
Info: [ 17535,  17658) |*************
Info: [ 17658,  17781) |**********************
Info: [ 17781,  17904) |*******************
Info: [ 17904,  18027) |*****************************
Info: [ 18027,  18150) |*****************
Info: [ 18150,  18273) |***************
Info: [ 18273,  18396) |***********
Info: [ 18396,  18519) |
Info: [ 18519,  18642) |
Info: [ 18642,  18765) |
Info: [ 18765,  18888) |
Info: [ 18888,  19011) |
Info: [ 19011,  19134) |*
Info: [ 19134,  19257) |***
Info: [ 19257,  19380) |***
Info: Checksum: 0xd45ef47f
Info: Preparing LABs for routing...
Info: Routing globals...
Info:     routed net 'blink_rgb_0.i_clock' using global resources
Info: Running router2...
Info: Setting up routing resources...
Info: Running main router loop...
Info:     iter=1 wires=4063 overused=146 overuse=156 archfail=NA
Info:     iter=2 wires=4139 overused=53 overuse=53 archfail=NA
Info:     iter=3 wires=4199 overused=18 overuse=18 archfail=NA
Info:     iter=4 wires=4210 overused=11 overuse=11 archfail=NA
Info:     iter=5 wires=4215 overused=4 overuse=4 archfail=NA
Info:     iter=6 wires=4223 overused=1 overuse=1 archfail=NA
Info:     iter=7 wires=4223 overused=1 overuse=1 archfail=NA
Info:     iter=8 wires=4223 overused=1 overuse=1 archfail=NA
Info:     iter=9 wires=4223 overused=1 overuse=1 archfail=NA
Info:     iter=10 wires=4223 overused=1 overuse=1 archfail=NA
Info:     iter=11 wires=4224 overused=1 overuse=1 archfail=NA
Info:     iter=12 wires=4223 overused=1 overuse=1 archfail=NA
Info:     iter=13 wires=4223 overused=1 overuse=1 archfail=NA
Info:     iter=14 wires=4228 overused=0 overuse=0 archfail=0
Info: Router2 time 2.51s
Info: Running router1 to check that route is legal...

Info: Routing..
Info: Setting up routing queue.
Info: Routing 0 arcs.
Info:            |   (re-)routed arcs  |   delta    | remaining|       time spent     |
Info:    IterCnt |  w/ripup   wo/ripup |  w/r  wo/r |      arcs| batch(sec) total(sec)|
Info:          0 |        0          0 |    0     0 |         0|       0.07       0.07|
Info: Routing complete.
Info: Router1 time 0.07s
Info: Checksum: 0x19b2a390

Info: Critical path report for clock 'blink_rgb_0.i_clock' (posedge -> posedge):
Info: curr total
Info:  0.7  0.7  Source blink_rgb_0.counter_MISTRAL_FF_Q_6.Q
Info:  1.6  2.3    Net blink_rgb_0.counter[19] budget 4.587000 ns (71,1) -> (71,1)
Info:                Sink blink_rgb_0.i_reset_n_MISTRAL_ALUT5_A_B_MISTRAL_ALUT5_Q_1_E_MISTRAL_ALUT4_Q.C
Info:  0.4  2.7  Source blink_rgb_0.i_reset_n_MISTRAL_ALUT5_A_B_MISTRAL_ALUT5_Q_1_E_MISTRAL_ALUT4_Q.Q
Info:  1.4  4.1    Net blink_rgb_0.i_reset_n_MISTRAL_ALUT5_A_B_MISTRAL_ALUT5_Q_1_E[4] budget 4.587000 ns (71,1) -> (72,1)
Info:                Sink blink_rgb_0.i_reset_n_MISTRAL_ALUT5_A_B_MISTRAL_ALUT5_Q_1.E
Info:                Defined in:
Info:                  /usr/local/bin/../share/yosys/intel_alm/common/alm_map.v:7.19-7.20
Info:  0.1  4.2  Source blink_rgb_0.i_reset_n_MISTRAL_ALUT5_A_B_MISTRAL_ALUT5_Q_1.Q
Info:  0.6  4.8    Net blink_rgb_0.i_reset_n_MISTRAL_ALUT5_A_B[4] budget 4.587000 ns (72,1) -> (72,1)
Info:                Sink blink_rgb_0.i_reset_n_MISTRAL_ALUT5_A.E
Info:                Defined in:
Info:                  /usr/local/bin/../share/yosys/intel_alm/common/alm_map.v:7.19-7.20
Info:  0.1  4.9  Source blink_rgb_0.i_reset_n_MISTRAL_ALUT5_A.Q
Info:  0.8  5.8    Net blink_rgb_0.i_reset_n_MISTRAL_ALUT5_A_Q budget 4.587000 ns (72,1) -> (72,2)
Info:                Sink blink_rgb_0.counter_MISTRAL_FF_Q_10.SCLR
Info: -0.2  5.6  Setup blink_rgb_0.counter_MISTRAL_FF_Q_10.SCLR
Info: 1.1 ns logic, 4.4 ns routing

Info: Max frequency for clock 'blink_rgb_0.i_clock': 180.02 MHz (PASS at 50.00 MHz)

Info: Slack histogram:
Info:  legend: * represents 1 endpoint(s)
Info:          + represents [1,1) endpoint(s)
Info: [ 14445,  14668) |*************
Info: [ 14668,  14891) |****
Info: [ 14891,  15114) |************************************
Info: [ 15114,  15337) |*****
Info: [ 15337,  15560) |************************************
Info: [ 15560,  15783) |**************
Info: [ 15783,  16006) |**********
Info: [ 16006,  16229) |************************
Info: [ 16229,  16452) |***********
Info: [ 16452,  16675) |*********************
Info: [ 16675,  16898) |*************
Info: [ 16898,  17121) |***
Info: [ 17121,  17344) |*****
Info: [ 17344,  17567) |*****
Info: [ 17567,  17790) |*
Info: [ 17790,  18013) |
Info: [ 18013,  18236) |
Info: [ 18236,  18459) |**
Info: [ 18459,  18682) |**
Info: [ 18682,  18905) |*
Info: Running signoff timing analysis...

Info: Critical path report for clock 'blink_rgb_0.i_clock' (posedge -> posedge):
Info: curr total
Info:  0.7  0.7  Source blink_rgb_0.counter_MISTRAL_FF_Q_6.Q
Info:  1.5  2.3    Net blink_rgb_0.counter[19] budget 4.587000 ns (71,1) -> (71,1)
Info:                Sink blink_rgb_0.i_reset_n_MISTRAL_ALUT5_A_B_MISTRAL_ALUT5_Q_1_E_MISTRAL_ALUT4_Q.C
Info:  0.4  2.7  Source blink_rgb_0.i_reset_n_MISTRAL_ALUT5_A_B_MISTRAL_ALUT5_Q_1_E_MISTRAL_ALUT4_Q.Q
Info:  1.7  4.4    Net blink_rgb_0.i_reset_n_MISTRAL_ALUT5_A_B_MISTRAL_ALUT5_Q_1_E[4] budget 4.587000 ns (71,1) -> (72,1)
Info:                Sink blink_rgb_0.i_reset_n_MISTRAL_ALUT5_A_B_MISTRAL_ALUT5_Q_1.E
Info:                Defined in:
Info:                  /usr/local/bin/../share/yosys/intel_alm/common/alm_map.v:7.19-7.20
Info:  0.1  4.5  Source blink_rgb_0.i_reset_n_MISTRAL_ALUT5_A_B_MISTRAL_ALUT5_Q_1.Q
Info:  0.5  4.9    Net blink_rgb_0.i_reset_n_MISTRAL_ALUT5_A_B[4] budget 4.587000 ns (72,1) -> (72,1)
Info:                Sink blink_rgb_0.i_reset_n_MISTRAL_ALUT5_A.E
Info:                Defined in:
Info:                  /usr/local/bin/../share/yosys/intel_alm/common/alm_map.v:7.19-7.20
Info:  0.1  5.0  Source blink_rgb_0.i_reset_n_MISTRAL_ALUT5_A.Q
Info:  0.9  5.9    Net blink_rgb_0.i_reset_n_MISTRAL_ALUT5_A_Q budget 4.587000 ns (72,1) -> (72,2)
Info:                Sink blink_rgb_0.counter_MISTRAL_FF_Q_10.SCLR
Info: -0.2  5.7  Setup blink_rgb_0.counter_MISTRAL_FF_Q_10.SCLR
Info: 1.1 ns logic, 4.6 ns routing

Info: Max frequency for clock 'blink_rgb_0.i_clock': 174.64 MHz (PASS at 50.00 MHz)

Info: Slack histogram:
Info:  legend: * represents 1 endpoint(s)
Info:          + represents [1,1) endpoint(s)
Info: [ 14274,  14507) |****************
Info: [ 14507,  14740) |
Info: [ 14740,  14973) |**************
Info: [ 14973,  15206) |
Info: [ 15206,  15439) |***********************
Info: [ 15439,  15672) |*******************
Info: [ 15672,  15905) |***********************************
Info: [ 15905,  16138) |************************
Info: [ 16138,  16371) |********************
Info: [ 16371,  16604) |*********************
Info: [ 16604,  16837) |*******
Info: [ 16837,  17070) |*********
Info: [ 17070,  17303) |*******
Info: [ 17303,  17536) |*****
Info: [ 17536,  17769) |*
Info: [ 17769,  18002) |
Info: [ 18002,  18235) |
Info: [ 18235,  18468) |*
Info: [ 18468,  18701) |*
Info: [ 18701,  18934) |***

Info: Program finished normally.

20/11/2022: so what have you been working on, lofty?

this is a crosspost from my cohost post two-ish weeks ago.
people might have missed me posting the link, and I also figured having an extra copy would be useful.

it's true that I've been quiet a lot recently (at least partially down to it being winter), but I decided to set myself a project: write a highly parallel FPGA router.

but to explain my project ideas, I must first explain how FPGA routing works in general.


how FPGA routing works in general

so, a router gets handed a list of (1 source wire, N sink wires) pairs ("nets"), and told to find a valid path for all of them, such that each physical wire on the FPGA has no more than one driver.

now, the first thing that happens to a net when it's routed is that it gets broken down into N (1 source wire, 1 sink wire) pairs ("arcs"), so that one can use A* from the source to the sink.

now, with exceptions (nextpnr's router1), A* is allowed to route through already-used wires, but a cost function makes doing so more expensive for arcs which aren't on the critical path. by iterating with steadily increasing costs, the wires eventually spread out enough to satisfy the "no more than one driver" criterion.
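
to make that concrete, here's a sketch of the shape such a negotiated-congestion cost function takes. everything here (names, weights) is made up for illustration; it is not nextpnr's actual code:

/// a sketch of a PathFinder-style negotiated-congestion cost for one wire.
/// all names and weights here are illustrative, not nextpnr's actual code.
fn wire_cost(base_cost: f32, history: f32, present_sharing: u32, iteration: u32) -> f32 {
    // the sharing penalty grows with each iteration, so overused wires
    // become progressively more expensive until the losing nets move.
    let pressure = 1.0 + iteration as f32 * 0.5;
    let present = 1.0 + present_sharing as f32 * pressure;
    (base_cost + history) * present
}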

observation one: this is entirely serial

how to parallelise this? one way is to draw a bounding box around the X/Y coordinates of the source and sink wires in a net, and categorise them by where in the FPGA the bounding boxes are, and then route as many of these categories in parallel as possible.

other suggestions are to use fancy graph libraries which attempt to route all arcs at the same time while trying not to step on the other threads' toes.

observation two: A* is worst-case exponential in path length

approaches like the ones laid out above strike me as optimising the wrong thing. categorising by bounding boxes routes the smallest nets in parallel, but these are also often the nets that are fastest to route; the long nets are still routed serially, and this is very slow.

observation three: FPGAs are grid-like

a little bit of this is exploited in the categorisation approach; nextpnr's router2 uses quarters and halves (split vertically and horizontally) of the FPGA to route more efficiently.

hierarchical A* implementations in the literature cache paths between nodes to optimise finding the best path between them in the future. directly implementing this doesn't work because the cost function for A* changes, but perhaps there are still ways to exploit hierarchy for parallelism.

a better way to route, I think

if one starts top-down with the grid-like structure of the FPGA, one can divide it into quarters. these segments have wires interconnecting them, and we can view those as the boundary of the segments. we can then partition nets along those boundaries, producing smaller nets which are inside those boundaries.

as an example, if a net goes from the northeast segment to the southeast segment, it gets partitioned along the east boundary wires, producing a net for the northeast source wire to the east wire, and a net for the east wire to the southeast sink wire.

I think it's best to do partitioning using A* and the same cost function as above; since there aren't many boundary wires, this is fast (enough).

since each partitioned net is now confined to a single segment, its maximum path length shrinks, and (per observation two) shorter paths are dramatically cheaper to search.
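
as a toy sketch of the partitioning step described above (the types here are stand-ins I made up, not real nextpnr data structures):

/// an arc from a source wire to a sink wire, identified by grid coordinates.
#[derive(Clone, Copy)]
struct Arc {
    source: (u32, u32),
    sink: (u32, u32),
}

/// split an arc crossing the vertical midline `mid_x` into two arcs that
/// meet at `boundary`, a wire on that midline; arcs that stay on one side
/// are returned unchanged.
fn partition_x(arc: Arc, mid_x: u32, boundary: (u32, u32)) -> Vec<Arc> {
    let crosses = (arc.source.0 < mid_x) != (arc.sink.0 < mid_x);
    if crosses {
        // in the real thing, `boundary` would be chosen by running A* with
        // the usual cost function over the (few) boundary wires.
        vec![
            Arc { source: arc.source, sink: boundary },
            Arc { source: boundary, sink: arc.sink },
        ]
    } else {
        vec![arc]
    }
}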

creating arbitrarily large amounts of parallelism

the above process can be done recursively; if one iteration splits an entire FPGA into four segments, the segments can be themselves split into four subsegments. that means we could potentially route 16 nets at the same time.

I think there's still room for improvement here, though. intuitively, the segments don't have to be exact quarters; they could instead be split such that each segment contains roughly the same number of nets.

there's still the question of what to do towards the end of a routing iteration, though: as threads complete their work, they run out of things to do. one could imagine the final thread splitting its segment for the other threads to work on, but I suspect stopping routing to re-split would, past a point, slow actual routing down a lot.

26/10/2020: A LUT Mapper (II).

LUT Mapping with Priority Cuts (II)

Some snippets are going to be in diff format, because I find that helps show the changes more. On the other hand, syntax highlighting dies.

The mapper works, but the results are mediocre. Let's fix that.

Depth-optimality

If you locally pick the lowest-depth solution at each node, the global solution found by the cut selector will also be the lowest-depth solution.
The depth of a cut is (for now) the maximum of the depth of its inputs plus one level.

/// An AND-Inverter Graph.
/// Populating this graph is out of scope for this article, but see the AIGER format if you want more.
struct Aig {
    /// The primary inputs of the graph.
    inputs: Vec<u32>,
    /// The primary outputs of the graph.
    outputs: Vec<u32>,
    /// The graph nodes.
    nodes: Vec<AigNode>,
+   /// The depth of the best cut at each node.
+   depth: Vec<i32>,
}

use itertools::Itertools; // for cartesian_product() and sorted_by()

impl Aig {
+   /// Returns the depth of a cut.
+   pub fn calculate_depth(&self, cut: &Cut) -> i32 {
+       cut.inputs.iter().map(|node| self.depth[*node as usize]).max().unwrap() + 1
+   }
+
    /// Return cuts in the graph that have at most `max_inputs` inputs.
-   pub fn enumerate_cuts(&self, max_inputs: usize) -> Vec<Vec<Cut>> {
+   pub fn enumerate_cuts(&mut self, max_inputs: usize) -> Vec<Vec<Cut>> {
        // Mapping to 1-input LUTs doesn't make any sense.
        assert!(max_inputs >= 2);

        let mut cuts = vec![vec![]; self.nodes.len()];

        // Every primary input has exactly one cut: the trivial cut of that input.
+       // The depth of a primary input is zero.
        for input in &self.inputs {
            cuts[*input as usize] = vec![Cut::new(vec![*input], *input)];
+           self.depth[*input as usize] = 0;
        }

        // Writing a topological ordering is also out of scope, but see the Wikipedia article on it.
        let topological_ordering = self.topological_ordering();

        for node in topological_ordering {
            // Skip primary inputs.
            if self.inputs.contains(&node) {
                continue;
            }

            // Convenience variables for the children nodes.
            let lhs = self.nodes[node as usize].lhs_index();
            let rhs = self.nodes[node as usize].rhs_index();

            // Compute the trivial cut of this node.
            // Its inputs are the (sorted) inputs of this node.
            let mut trivial_cut_inputs = vec![lhs, rhs];
            trivial_cut_inputs.sort();
            // Its output is the output of this node.
            let trivial_cut_output = node;
            let trivial_cut = Cut::new(trivial_cut_inputs, trivial_cut_output);

            // Now for some iterator spam.
            let node_cuts = cuts[lhs as usize].iter()
            // Combine all child cuts together.
            .cartesian_product(&cuts[rhs as usize])
            // Merge the child cuts, using the current node as a root.
            .map(|(lhs, rhs)| lhs.merge(rhs, node))
            // Include the trivial cut of this node too.
            .chain(std::iter::once(trivial_cut))
            // Keep all `max_inputs`-feasible cuts.
            .filter(|cut| cut.input_count() <= max_inputs)
+           // Sort them by depth.
+           .sorted_by(|lhs, rhs| self.calculate_depth(lhs).cmp(&self.calculate_depth(rhs)))
            // And then collect them into a vector.
            .collect::<Vec<Cut>>();
+
+           // If you somehow have zero cuts at a node there is a bug, because the trivial cut guarantees at least one input cut.
+           assert!(!node_cuts.is_empty());
+
+           let best_cut = &node_cuts[0];
+           // Set the depth of the best cut at this node.
+           self.depth[node as usize] = self.calculate_depth(best_cut);

            // Set these cuts as the cuts at this node.
            cuts[node as usize] = node_cuts;          
        }

        // Return the set of cuts for the graph.
        cuts
    }
}

Since the cut selector picks the first element of each node's cut list, it will now always pick a depth-optimal cut. This also lays the foundation for other things.

Using my own implementation in mignite, I get:

Mapped to 155433 LUTs
LUT0: 0
LUT1: 0
LUT2: 15838
LUT3: 1153
LUT4: 138442
Maximum delay: 8531

This is an optimal-depth solution, but it's huge. Having secured depth-optimality, we now need to aim for reducing solution area.

Area recovery consists of two parts: picking the smallest-area cut among the depth-optimal ones, and later relaxing depth where that can be done without affecting the overall delay.

Picking smaller depth-optimal cuts

Let's start by preferring depth-optimal cuts with fewer inputs, which reduces the number of times a node is covered by multiple LUTs.

impl Aig {
    /// Return cuts in the graph that have at most `max_inputs` inputs.
    pub fn enumerate_cuts(&mut self, max_inputs: usize) -> Vec<Vec<Cut>> {
        // Mapping to 1-input LUTs doesn't make any sense.
        assert!(max_inputs >= 2);

        let mut cuts = vec![vec![]; self.nodes.len()];

        // Every primary input has exactly one cut: the trivial cut of that input.
        // The depth of a primary input is zero.
        for input in &self.inputs {
            cuts[*input as usize] = vec![Cut::new(vec![*input], *input)];
            self.depth[*input as usize] = 0;
        }

        // Writing a topological ordering is also out of scope, but see the Wikipedia article on it.
        let topological_ordering = self.topological_ordering();

        for node in topological_ordering {
            // Skip primary inputs.
            if self.inputs.contains(&node) {
                continue;
            }

            // Convenience variables for the children nodes.
            let lhs = self.nodes[node as usize].lhs_index();
            let rhs = self.nodes[node as usize].rhs_index();

            // Compute the trivial cut of this node.
            // Its inputs are the (sorted) inputs of this node.
            let mut trivial_cut_inputs = vec![lhs, rhs];
            trivial_cut_inputs.sort();
            // Its output is the output of this node.
            let trivial_cut_output = node;
            let trivial_cut = Cut::new(trivial_cut_inputs, trivial_cut_output);

            // Now for some iterator spam.
            let node_cuts = cuts[lhs as usize].iter()
            // Combine all child cuts together.
            .cartesian_product(&cuts[rhs as usize])
            // Merge the child cuts, using the current node as a root.
            .map(|(lhs, rhs)| lhs.merge(rhs, node))
            // Include the trivial cut of this node too.
            .chain(std::iter::once(trivial_cut))
            // Keep all `max_inputs`-feasible cuts.
            .filter(|cut| cut.input_count() <= max_inputs)
-           // Sort them by depth.
-           .sorted_by(|lhs, rhs| self.calculate_depth(lhs).cmp(&self.calculate_depth(rhs)))
+           // Sort them by depth, breaking ties by picking the cut with the fewest inputs.
+           .sorted_by(|lhs, rhs| self.calculate_depth(lhs).cmp(&self.calculate_depth(rhs)).then_with(|| lhs.input_count().cmp(&rhs.input_count())))
            // And then collect them into a vector.
            .collect::<Vec<Cut>>();
 
            // If you somehow have zero cuts at a node there is a bug, because the trivial cut guarantees at least one input cut.
            assert!(!node_cuts.is_empty());
 
            let best_cut = &node_cuts[0];
            // Set the depth of the best cut at this node.
            self.depth[node as usize] = self.calculate_depth(best_cut);

            // Set these cuts as the cuts at this node.
            cuts[node as usize] = node_cuts;          
        }

        // Return the set of cuts for the graph.
        cuts
    }
}

This helps a lot:

Mapped to 99940 LUTs
LUT0: 0
LUT1: 0
LUT2: 40962
LUT3: 738
LUT4: 58240
Maximum delay: 8531

Required time calculation

Before we implement a heuristic to reduce area, we need to know how much each node's depth can increase without hurting the overall delay.
We do this by calculating a "required time" for each node: the maximum depth of the graph minus the node's distance (in LUT levels) from the outputs.
You can calculate that with a reverse topological traversal (each node before the inputs of its cut), working from the primary outputs back to the primary inputs.

/// An AND-Inverter Graph.
/// Populating this graph is out of scope for this article, but see the AIGER format if you want more.
struct Aig {
    /// The primary inputs of the graph.
    inputs: Vec<u32>,
    /// The primary outputs of the graph.
    outputs: Vec<u32>,
    /// The graph nodes.
    nodes: Vec<AigNode>,
    /// The depth of the best cut at each node.
    depth: Vec<i32>,
+   /// The required time of each node: the latest depth it can have without increasing the overall delay.
+   required: Vec<Option<i32>>,
}

impl Aig {
    /// Select the final mapping of LUTs.
    pub fn select_cuts(&mut self, max_inputs: usize) -> Vec<Cut> {
        // Calculate the cuts.
        let cuts = self.enumerate_cuts(max_inputs);

        // The mapping frontier is all the nodes which need a cut selected that do not presently have one.
        // We start with the primary outputs of the graph and let the algorithm do its job from there.
        let mut frontier = self.outputs.clone();

        // The selected mappings to return.
        let mut mapping = Vec::new();

        // The nodes which have a selected mapping, to avoid duplicate work for a node.
        let mut mapped_nodes = Vec::new();

        // Pick the next node from the frontier.
        while let Some(node) = frontier.pop() {
            // Pick the first cut as the one to use.
            let cut = cuts[node as usize][0].clone();

            // Add the cut's inputs to the mapping frontier if they are not primary inputs and not already mapped.
            for input in &cut.inputs {
                if !self.inputs.contains(input) && !mapped_nodes.contains(input) {
                    frontier.push(*input);
                }
            }

            // Mark this node as mapped.
            mapped_nodes.push(node);

            // Add the cut to the mapping.
            mapping.push(cut);
        }
+
+       // Calculate the maximum depth of the graph.
+       // Assuming no logic loops, this occurs at one of the primary outputs.
+       let max_depth = self.depth.iter().copied().max().unwrap();
+
+       // Initialise the required times.
+       for node in &mut self.required {
+           *node = Some(max_depth);
+       }
+
+       // Calculate the required times by walking the graph in reverse topological order
+       // (each node before the inputs of its cut), from primary outputs back to primary inputs.
+       for node in self.topological_ordering().into_iter().rev() {
+           // Inputs to this node have a required time less than this node.
+           let required = self.required[node as usize].unwrap() - 1;
+
+           // Pick the first cut as the one to use.
+           let cut = &cuts[node as usize][0];
+
+           // Update the required time of the input nodes if the current required time is stricter than the previous.
+           for input in &cut.inputs {
+               self.required[*input as usize] = Some(self.required[*input as usize].unwrap().min(required));
+           }
+       }

        mapping
    }
}

Area minimisation

Now, the heuristic we implement is called "area flow", and it attempts to measure the global usefulness of logic. The area flow of a cut is one (for the LUT implementing the cut) plus the sum of the area flows of its inputs, all divided by the fan-out of the cut's output:

  • a cut which has a smaller area will have a smaller numerator.
  • a cut whose output is frequently used will have a larger denominator, since its area is shared between more users.

For example, a cut whose inputs have area flows 0.5 and 1.0, and whose output has two users, has an area flow of (0.5 + 1.0 + 1) / 2 = 1.25. If area flow ties, we pick the cut whose inputs are used the least, to increase their efficiency; if that ties too, we fall back to depth.

/// An AND-Inverter Graph.
/// Populating this graph is out of scope for this article, but see the AIGER format if you want more.
struct Aig {
    /// The primary inputs of the graph.
    inputs: Vec<u32>,
    /// The primary outputs of the graph.
    outputs: Vec<u32>,
    /// The graph nodes.
    nodes: Vec<AigNode>,
    /// The depth of the best cut at each node.
    depth: Vec<i32>,
    /// The required time of each node: the latest depth it can have without increasing the overall delay.
    required: Vec<Option<i32>>,
+   /// The area flow of a node, measuring its usefulness.
+   area_flow: Vec<f32>,
+   /// The number of users of a node.
+   fanout: Vec<u32>,
+   /// The cuts computed by the previous mapping pass (empty before the first pass).
+   cuts: Vec<Vec<Cut>>,
}

impl Aig {
+   /// Returns the area flow of a cut.
+   pub fn calculate_area_flow(&self, cut: &Cut) -> f32 {
+       (cut.inputs.iter().map(|node| self.area_flow[*node as usize]).sum::<f32>() + 1.0) / (self.fanout[cut.output as usize].max(1) as f32)
+   }
+
+   /// Returns the average fan-out count of a cut's inputs.
+   pub fn calculate_fanin(&self, cut: &Cut) -> f32 {
+       cut.inputs.iter().map(|node| self.fanout[*node as usize] as f32).sum::<f32>() / (cut.input_count() as f32)
+   }
+
    /// Return cuts in the graph that have at most `max_inputs` inputs.
-   pub fn enumerate_cuts(&mut self, max_inputs: usize) -> Vec<Vec<Cut>> {
+   pub fn enumerate_cuts(&mut self, max_inputs: usize, relax_depth: bool) -> Vec<Vec<Cut>> {
        // Mapping to 1-input LUTs doesn't make any sense.
        assert!(max_inputs >= 2);

        let mut cuts = vec![vec![]; self.nodes.len()];

        // Every primary input has exactly one cut: the trivial cut of that input.
-       // The depth of a primary input is zero.
+       // The depth, fanout and area flow of a primary input are all zero.
        for input in &self.inputs {
            cuts[*input as usize] = vec![Cut::new(vec![*input], *input)];
            self.depth[*input as usize] = 0;
+           self.area_flow[*input as usize] = 0.0;
+           self.fanout[*input as usize] = 0;
        }

        // Writing a topological ordering is also out of scope, but see the Wikipedia article on it.
        let topological_ordering = self.topological_ordering();

        for node in topological_ordering {
            // Skip primary inputs.
            if self.inputs.contains(&node) {
                continue;
            }

            // Convenience variables for the children nodes.
            let lhs = self.nodes[node as usize].lhs_index();
            let rhs = self.nodes[node as usize].rhs_index();

            // Compute the trivial cut of this node.
            // Its inputs are the (sorted) inputs of this node.
            let mut trivial_cut_inputs = vec![lhs, rhs];
            trivial_cut_inputs.sort();
            // Its output is the output of this node.
            let trivial_cut_output = node;
            let trivial_cut = Cut::new(trivial_cut_inputs, trivial_cut_output);

            // Now for some iterator spam.
            let node_cuts = cuts[lhs as usize].iter()
            // Combine all child cuts together.
            .cartesian_product(&cuts[rhs as usize])
            // Merge the child cuts, using the current node as a root.
            .map(|(lhs, rhs)| lhs.merge(rhs, node))
            // Include the trivial cut of this node too.
            .chain(std::iter::once(trivial_cut))
+           // Also include the best cut from the previous mapping pass, if we have one.
+           .chain(self.cuts.get(node as usize).and_then(|c| c.first()).cloned())
            // Keep all `max_inputs`-feasible cuts.
            .filter(|cut| cut.input_count() <= max_inputs)
+           // Keep all cuts that meet the node's required time, if it has one.
+           .filter(|cut| self.required[cut.output as usize].is_none() || self.calculate_depth(cut) <= self.required[cut.output as usize].unwrap())
-           // Sort them by depth, breaking ties by picking the cut with the fewest inputs.
-           .sorted_by(|lhs, rhs| self.calculate_depth(lhs).cmp(&self.calculate_depth(rhs)).then_with(|| lhs.input_count().cmp(&rhs.input_count())))
+           // In depth mode: sort them by depth, breaking ties by picking the cut with the fewest inputs, then by area flow.
+           // In area mode: sort them by area flow, breaking ties by picking the cut with the smallest average fan-in count, then by depth.
+           .sorted_by(|lhs, rhs| {
+               if relax_depth {
+                   self.calculate_area_flow(lhs).partial_cmp(&self.calculate_area_flow(rhs)).unwrap()
+                       .then_with(|| self.calculate_fanin(lhs).partial_cmp(&self.calculate_fanin(rhs)).unwrap())
+                       .then_with(|| self.calculate_depth(lhs).cmp(&self.calculate_depth(rhs)))
+               } else {
+                   self.calculate_depth(lhs).cmp(&self.calculate_depth(rhs))
+                       .then_with(|| lhs.input_count().cmp(&rhs.input_count()))
+                       .then_with(|| self.calculate_area_flow(lhs).partial_cmp(&self.calculate_area_flow(rhs)).unwrap())
+               }
+           })
            // And then collect them into a vector.
            .collect::<Vec<Cut>>();
 
            // If you somehow have zero cuts at a node there is a bug, because the trivial cut guarantees at least one input cut.
            assert!(!node_cuts.is_empty());
 
            let best_cut = &node_cuts[0];
            // Set the depth of the best cut at this node.
            self.depth[node as usize] = self.calculate_depth(best_cut);

+           // Update the fan-out and area flow arrays.
+           for input in &best_cut.inputs {
+               self.fanout[*input as usize] += 1;
+               let flow = self.calculate_area_flow(&cuts[*input as usize][0]);
+               self.area_flow[*input as usize] += flow;
+           }
+
            // Set these cuts as the cuts at this node.
            cuts[node as usize] = node_cuts;          
        }

+       // Remember these cuts so the next mapping pass can reuse each node's best cut.
+       self.cuts = cuts.clone();
+
        // Return the set of cuts for the graph.
        cuts
    }
}

Now, if you run a mapping of depth, followed by a mapping of area:

Depth:
Mapped to 98644 LUTs
LUT0: 0
LUT1: 0
LUT2: 40162
LUT3: 720
LUT4: 57762
Maximum delay: 8531

Area:
Mapped to 99815 LUTs
LUT0: 0
LUT1: 0
LUT2: 38752
LUT3: 664
LUT4: 60399
Maximum delay: 8531

While not much of an improvement here, area flow comes into its own when you include generalised delay and area models. (Part 3, maybe?)
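
For reference, here's a sketch of how the two passes chain together; it assumes select_cuts has been updated to take relax_depth and forward it to enumerate_cuts (in the snippets above it only takes max_inputs), and the LUT size of 4 is just an example value:

// A sketch of the two-pass driver; assumes select_cuts(max_inputs, relax_depth)
// forwards relax_depth to enumerate_cuts. Values here are illustrative.
fn map(aig: &mut Aig) -> Vec<Cut> {
    // Pass 1: depth mode, which fills in depths and required times.
    let _depth_mapping = aig.select_cuts(4, false);
    // Pass 2: area mode, free to relax depth wherever the required times
    // computed by pass 1 allow it.
    aig.select_cuts(4, true)
}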

Partial cut enumeration

The less patient of you might have noticed that for sufficiently large designs, this algorithm takes forever.

real    1m55.251s

Fortunately, just as exhaustive cut enumeration exists, partial cut enumeration exists, too; we can set a number of cuts to keep per node, which reduces the time we're spending on them.

-   pub fn enumerate_cuts(&mut self, max_inputs: usize, relax_depth: bool) -> Vec<Vec<Cut>> {
+   pub fn enumerate_cuts(&mut self, max_inputs: usize, max_cuts: usize, relax_depth: bool) -> Vec<Vec<Cut>> {
        // Mapping to 1-input LUTs doesn't make any sense.
        assert!(max_inputs >= 2);

        let mut cuts = vec![vec![]; self.nodes.len()];

        // Every primary input has exactly one cut: the trivial cut of that input.
        // The depth, fanout and area flow of a primary input are all zero.
        for input in &self.inputs {
            cuts[*input as usize] = vec![Cut::new(vec![*input], *input)];
            self.depth[*input as usize] = 0;
            self.area_flow[*input as usize] = 0.0;
            self.fanout[*input as usize] = 0;
        }

        // Writing a topological ordering is also out of scope, but see the Wikipedia article on it.
        let topological_ordering = self.topological_ordering();

        for node in topological_ordering {
            // Skip primary inputs.
            if self.inputs.contains(&node) {
                continue;
            }

            // Convenience variables for the children nodes.
            let lhs = self.nodes[node as usize].lhs_index();
            let rhs = self.nodes[node as usize].rhs_index();

            // Compute the trivial cut of this node.
            // Its inputs are the (sorted) inputs of this node.
            let mut trivial_cut_inputs = vec![lhs, rhs];
            trivial_cut_inputs.sort();
            // Its output is the output of this node.
            let trivial_cut_output = node;
            let trivial_cut = Cut::new(trivial_cut_inputs, trivial_cut_output);

            // Now for some iterator spam.
            let node_cuts = cuts[lhs as usize].iter()
            // Combine all child cuts together.
            .cartesian_product(&cuts[rhs as usize])
            // Merge the child cuts, using the current node as a root.
            .map(|(lhs, rhs)| lhs.merge(rhs, node))
            // Include the trivial cut of this node too.
            .chain(std::iter::once(trivial_cut))
            // Also include the best cut from the previous mapping pass, if we have one.
            .chain(self.cuts.get(node as usize).and_then(|c| c.first()).cloned())
            // Keep all `max_inputs`-feasible cuts.
            .filter(|cut| cut.input_count() <= max_inputs)
            // Keep all cuts that meet the node's required time, if it has one.
            .filter(|cut| self.required[cut.output as usize].is_none() || self.calculate_depth(cut) <= self.required[cut.output as usize].unwrap())
            // In depth mode: sort them by depth, breaking ties by picking the cut with the fewest inputs, then by area flow.
            // In area mode: sort them by area flow, breaking ties by picking the cut with the smallest average fan-in count, then by depth.
            .sorted_by(|lhs, rhs| {
                if relax_depth {
                    self.calculate_area_flow(lhs).partial_cmp(&self.calculate_area_flow(rhs)).unwrap()
                        .then_with(|| self.calculate_fanin(lhs).partial_cmp(&self.calculate_fanin(rhs)).unwrap())
                        .then_with(|| self.calculate_depth(lhs).cmp(&self.calculate_depth(rhs)))
                } else {
                    self.calculate_depth(lhs).cmp(&self.calculate_depth(rhs))
                        .then_with(|| lhs.input_count().cmp(&rhs.input_count()))
                        .then_with(|| self.calculate_area_flow(lhs).partial_cmp(&self.calculate_area_flow(rhs)).unwrap())
                }
            })
+           // Keep only `max_cuts` cuts.
+           .take(max_cuts)
            // And then collect them into a vector.
            .collect::<Vec<Cut>>();
 
            // If you somehow have zero cuts at a node there is a bug, because the trivial cut guarantees at least one input cut.
            assert!(!node_cuts.is_empty());
 
            let best_cut = &node_cuts[0];
            // Set the depth of the best cut at this node.
            self.depth[node as usize] = self.calculate_depth(best_cut);

            // Update the fan-out and area flow arrays.
            for input in &best_cut.inputs {
                self.fanout[*input as usize] += 1;
                let flow = self.calculate_area_flow(&cuts[*input as usize][0]);
                self.area_flow[*input as usize] += flow;
            }

            // Set these cuts as the cuts at this node.
            cuts[node as usize] = node_cuts;          
        }

        // Remember these cuts so the next mapping pass can reuse each node's best cut.
        self.cuts = cuts.clone();

        // Return the set of cuts for the graph.
        cuts
    }
}

A good value for max_cuts is 8, which lets you do exhaustive enumeration where it isn't too expensive.

After that small tweak, we have:

Depth:
Mapped to 98626 LUTs
LUT0: 0
LUT1: 0
LUT2: 40218
LUT3: 734
LUT4: 57674
Maximum delay: 8532

Area:
Mapped to 100351 LUTs
LUT0: 0
LUT1: 0
LUT2: 42571
LUT3: 648
LUT4: 57132
Maximum delay: 8532

real    0m21.303s

And now we have the algorithm: you can add other heuristics to prioritise whatever you think is important.
