Pipelining with Filament

While we've designed a correct ALU, it is quite slow: it processes one input completely before moving on to the next. Such a hardware design is called a "fully sequential". A standard technique to improve throughput of the design is pipelining, which allows a hardware module to process multiple inputs at the same time.

Filament is designed so that check that a pipeline can support the throughput specified in its interface. We'll take our sequential ALU design and use Filament's type system to guide us to a correctly pipelined design. Before that, however, let's run the design and see how it performs:

fud e --to cocotb-out examples/tut-seq.fil \
      -s cocotb.data examples/data.json \
      -s calyx.flags ' -d canonicalize'

Which generates the following output:

{"out": {"0": [28], "1": [10], "2": [66], "3": [16]}, "cycles": 12}

Note that this sequential design takes 12 cycles to process 4 inputs.

We'll start with pipelining our program:

import "primitives/core.fil";
import "./sequential.fil";

/// ANCHOR: sig
comp main<'G: 3>(
/// ANCHOR_END: sig
    go: interface['G],
     op: ['G, 'G+3] 1,
     left: ['G, 'G+1] 32,
     right: ['G, 'G+1] 32,
) -> ( out: ['G+2, 'G+3] 32)
{
    A := new Add[32];
    M := new Mult[32];
    m0 := M<'G>(left, right);
    a0 := A<'G>(left, right);
    // Use register to hold the adder's value
    r0 := new Register[32]<'G, 'G+3>(a0.out);
    // Use the multiplexer when the mult's output is ready
    mx := new Mux[32]<'G+2>(op, r0.out, m0.out);
    out = mx.out;
}

Delays and Throughput

Filament uses an event's delay to determine when the module is can accept new inputs.

comp main<'G: 3>(

Note that the delay for the event G is 3 which indicates to Filament that the ALU process new inputs every three cycles. We can tell Filament that we instead want a module that can process new inputs every cycle by changing the delay to 1:

comp main<'G: 1>(

And run the compiler:

cargo run -- alu.fil

However, much to our dismay, Filament tells us that this program cannot be pipelined to achieve throughput 1. However, being the nice type checker it is, it will tell us exactly why the design cannot be pipelined in the form of type errors.

Let's work through each error and see how we can fix it.

Availability Intervals of Ports

The first error message points out that one of the inputs is required for three cycles, but the module may re-execute every cycle:

error: bundle's availability is greater than the delay of the event
  ┌─ examples/tut-pipe-wrong-1.fil:8:10
  │
5 │ comp main<'G: 1>(
  │               - event's delay
  ·
8 │      op: ['G, 'G+3] 1,
  │          ^^^^^^^^^^ available for 3 cycles

error: event provided to invocation triggers more often that invocation's event's delay allows

This is problematic because op represents a physical wire; it is incapable of holding multiple values. Our request to process inputs every cycle and have op last for three cycles is physically impossible. The fix is easy: looking at our original design, we see that op is only used by the multiplexer in the interval [G+2, G+3) so we can change the availability interval of op to be [G+2, G+3).

Note: If this step feel like divine insight, another way to reach the same conclusion is by changing the availability interval of op to be [G, G+1) which will cause the compiler to point out all availability intervals where op is used.

Delays of Subcomponents

The second error message points out that the ALU component may execute every cycle but the multiplier we used can only execute every two cycles:

error: event provided to invocation triggers more often that invocation's event's delay allows
   ┌─ examples/tut-pipe-wrong-1.fil:15:13
   │
 5 │ comp main<'G: 1>(
   │               - this event triggers every 1 cycles
   ·
15 │     m0 := M<'G>(left, right);
   │             ^^ event provided to invoke triggers too often
   │
   ┌─ examples/./sequential.fil:3:18
   │
 3 │ comp Mult[W]<'G: 2>(
   │                  - invocation's event is allowed to trigger every 2 cycles

Yet again, our request is physically impossible to satisfy: our multiplier circuit is fundamentally incapable of executing every cycle. Thankfully for us, the primitives/math.fil file provides a component called FastMult which does have delay 1:

/// Implementation of a multiplier with initiation interval 1 and latency 3.
/// Written in a way to allow Vivado to infer a DSP.
comp FastMult[W]<'G: 1>(
   left: ['G, 'G+1] W,
   right: ['G, 'G+1] W,
) -> (
   out: ['G+3, 'G+4] W,
) where W > 0

We can change out program to use this component instead:

import "primitives/core.fil";
import "./sequential.fil";

comp main<'G: 1>(
    go: interface['G],
     op: ['G+2, 'G+3] 1,
     left: ['G, 'G+1] 32,
     right: ['G, 'G+1] 32,
) -> ( out: ['G+2, 'G+3] 32)
{
    A := new Add[32];
    M := new FastMult[32];
    m0 := M<'G>(left, right);
    a0 := A<'G>(left, right);
    // Use register to hold the adder's value
    r0 := new Register[32]<'G, 'G+3>(a0.out);
    // Use the multiplexer when the mult's output is ready
    mx := new Mux[32]<'G+2>(op, r0.out, m0.out);
    out = mx.out;
}

However, in making this change, we've created a new problem for ourselves:

error: source port does not provide value for as long as destination requires
   ┌─ examples/tut-pipe-wrong-2.fil:18:41
   │
18 │     mx := new Mux[32]<'G+2>(op, r0.out, m0.out);
   │                                         ^^^^^^
   │                                         │
   │                                         source is available for ['G+3, 'G+4]
   │                                         required for ['G+2, 'G+3]

Compilation failed with 2 errors.

Filament tells us that FastMult's out port is available in the interval [G+3, G+4) instead of [G+2, G+3) for Mult, i.e., the latency of FastMult is different from the latency of Mult.

Filament catching this bug is important-it would be very easy to miss such a mistake in a Verilog program. Fixing it is quite mechanical:

import "primitives/core.fil";
import "./sequential.fil";

comp main<'G: 1>(
    go: interface['G],
     op: ['G+3, 'G+4] 1,
     left: ['G, 'G+1] 32,
     right: ['G, 'G+1] 32,
) -> ( out: ['G+3, 'G+4] 32)
{
    A := new Add[32];
    M := new FastMult[32];
    m0 := M<'G>(left, right);
    a0 := A<'G>(left, right);
    // Use register to hold the adder's value
    r0 := new Register[32]<'G, 'G+4>(a0.out);
    // Use the multiplexer when the mult's output is ready
    mx := new Mux[32]<'G+3>(op, r0.out, m0.out);
    out = mx.out;
}

Registers that Hold on for too Long

The final problem is quite similar to the previous one:

error: event provided to invocation triggers more often that invocation's event's delay allows
   ┌─ examples/tut-pipe-wrong-3.fil:16:28
   │
 4 │ comp main<'G: 1>(
   │               - this event triggers every 1 cycles
   ·
16 │     r0 := new Register[32]<'G, 'G+4>(a0.out);
   │                            ^^ event provided to invoke triggers too often
   │
   ┌─ ./primitives/./state.fil:6:29
   │
 6 │    comp Register[WIDTH]<'G: 'L-('G+1), 'L: 1>(
   │                             --------- invocation's event is allowed to trigger every 3 cycles

Compilation failed with 1 errors.

The compiler is telling us the register's delay is 3 cycles. However, unlike the multiplier, this is a consequence of our decision: we make the register hold on to its value for three cycles which increases its delay. The last line of the error message points to the problem: the delay of a register depends on how we use it; this means that if we make it hold onto a value for exactly one cycle, its delay is reduced to one.

However, the problem is that we need the computation from the adder to be available three cycles from when it starts. To get both the pipelining and correctness we want, we need to instantiate a chain of registers that feed values forward.

The intuition behind this is that because we want our ALU to process inputs every cycle, we need to "save" the computation in every cycle and push it forward.

The final program will look like this:

import "primitives/core.fil";
import "./sequential.fil";

comp main<'G: 1>(
    go: interface['G],
     op: ['G+3, 'G+4] 1,
     left: ['G, 'G+1] 32,
     right: ['G, 'G+1] 32,
) -> ( out: ['G+3, 'G+4] 32) {
    A := new Add[32];
    M := new FastMult[32];
    m0 := M<'G>(left, right);
    a0 := A<'G>(left, right);
    r0 := new Register[32]<'G, 'G+2>(a0.out);
    r1 := new Register[32]<'G+1, 'G+3>(r0.out);
    r2 := new Register[32]<'G+2, 'G+4>(r1.out);
    mx := new Mux[32]<'G+3>(op, r2.out, m0.out);
    out = mx.out;
}

Running the Pipelined Design

Now to the moment of truth: let's run the design and see how it performs:

fud e --to cocotb-out examples/tut-pipe.fil \
      -s cocotb.data examples/data.json \
      -s calyx.flags ' -d canonicalize'

We get the following output which shows that the design took only 7 cycles to process 4 inputs:

{"out": {"0": [28], "1": [10], "2": [66], "3": [16]}, "cycles": 7}

If you're still not convinced, try adding another transaction to the data file in examples/data.json and see how the cycle count for the original sequential and pipelined designs change.

Something Remarkable

During the process of pipelining, we all of our time looking at type errors. Once the program type checked, it produced the correct output and was correctly pipelined. Filament supports many other features but at its heart, this is the guarantee it provides: if your program type checks, it is correctly pipelined. Fast and correct, you can have both!