Session 5 – FPGA Building Blocks

In the previous sessions, we’ve learned about combinational and sequential logic, memory systems, and how digital circuits work in theory. Now it’s time to see how all of this gets implemented in practice on one of the most versatile hardware platforms: Field-Programmable Gate Arrays (FPGAs).

Unlike ASICs (Application-Specific Integrated Circuits) that are manufactured with fixed functionality, FPGAs are reconfigurable. You can program them to implement virtually any digital circuit—from a simple LED blinker to a complete RISC-V processor, a neural network accelerator, or a high-speed network switch. This flexibility makes FPGAs invaluable for prototyping, research, and applications where you need custom hardware without the multi-million dollar cost of ASIC fabrication.

This session will walk you through the architecture of modern FPGAs, focusing on the building blocks you’ll encounter in devices from Xilinx (now AMD), Intel (formerly Altera), and Lattice. We’ll look at how lookup tables implement logic, how block RAM is organized, what DSP slices do, and how clock management works.

By the end of this session, you should be comfortable with:

LUT (lookup table) architecture and how they implement combinational logic
The structure of a complete logic cell (LUT + flip-flop + carry chain)
DSP slices and when to use them
Block RAM (BRAM) organization, modes, and instantiation
Clock resources: PLLs, MMCMs, and global clock networks
Vendor-specific primitives and how to infer vs. instantiate them
Resource utilization and optimization strategies
How to read FPGA datasheets and architecture guides

1. FPGA overview: the big picture

1.1 What is an FPGA?

An FPGA is a programmable integrated circuit consisting of:

Configurable logic blocks (CLBs) or logic elements (LEs): The basic building blocks that implement your logic.
Programmable interconnect: A network of switches and wires that connect the logic blocks together.
Block RAM (BRAM): Embedded memory blocks scattered across the chip.
DSP blocks: Hardened multipliers and arithmetic units for signal processing.
I/O blocks: Configurable input/output pins with various standards (LVDS, DDR, etc.).
Clock management: PLLs (Phase-Locked Loops) and MMCMs (Mixed-Mode Clock Managers) for generating and distributing clocks.
Hard IP blocks: Pre-built circuits like PCIe controllers, Ethernet MACs, memory controllers, and sometimes even processor cores (ARM, PowerPC, etc.).

The configuration memory (usually SRAM-based) stores the bitstream that defines how all these resources are connected and configured. When you “program” an FPGA, you’re loading this bitstream.

1.2 FPGA vs. ASIC vs. CPU

Aspect	FPGA	ASIC	CPU (software)
Development cost	Low ($100s to $10Ks)	Very high ($millions)	Very low
Development time	Hours to weeks	Months to years	Hours to weeks
Performance	Medium to high	Highest	Lower (sequential execution)
Power efficiency	Medium	Highest	Lower
Flexibility	Reconfigurable	Fixed after fabrication	Fully flexible (software)
Unit cost	Higher	Lower (at volume)	N/A (processor already there)
Time to market	Fast	Slow	Fastest

FPGAs sit in the sweet spot: custom hardware performance with software-like flexibility.

1.3 Major FPGA vendors

Xilinx (AMD): Market leader. Families include Spartan (low-cost), Artix/Kintex/Virtex (mid to high-end), and Versal (AI-optimized with AI engines).
Intel (formerly Altera): Cyclone (low-cost), Arria (mid-range), Stratix (high-end), and Agilex (latest).
Lattice: iCE40 and ECP5 (low-power, small form factor), CrossLink (video/imaging).
Microchip (formerly Microsemi/Actel): PolarFire (low-power, radiation-tolerant).

We’ll focus primarily on Xilinx/AMD terminology since they’re the most common in academia and industry, but the concepts translate across vendors.

2. Lookup tables (LUTs): the heart of FPGA logic

2.1 What is a LUT?

A lookup table (LUT) is a small memory that implements arbitrary combinational logic. Think of it as a truth table stored in SRAM.

A 6-input LUT (LUT6) has:

6 input pins (A, B, C, D, E, F)
1 output pin (O)
$2^6 = 64$ bits of configuration memory

The configuration memory stores the truth table for any 6-input Boolean function. When you present a 6-bit input, the LUT simply looks up the corresponding bit in its memory and outputs it.

2.2 Example: 2-input AND gate

For a 2-input LUT (LUT2), the truth table for Y = A AND B is:

A	B	Y
0	0	0
0	1	0
1	0	0
1	1	1

The LUT configuration memory is programmed to [0, 0, 0, 1]. When (A, B) = (1, 1), the LUT outputs the bit at index 3, which is 1.
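
The lookup itself can be modeled in a few lines of Verilog. This is a behavioral sketch rather than a vendor primitive; the module name `lut2_model` and its `INIT` parameter are illustrative assumptions.

```verilog
// Behavioral model of a 2-input LUT: the inputs form an index
// into a 4-bit truth table held in configuration memory.
module lut2_model #(
    parameter [3:0] INIT = 4'b1000  // INIT[3] = 1, INIT[2:0] = 0 -> Y = A AND B
)(
    input  wire a,
    input  wire b,
    output wire y
);
    wire [3:0] config_mem = INIT;

    // {a, b} is the 2-bit index: (A, B) = (1, 1) selects config_mem[3]
    assign y = config_mem[{a, b}];
endmodule
```

Changing only `INIT` turns the same structure into any other 2-input function, for example `4'b0110` for XOR.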

2.3 LUT capacity and flexibility

Modern FPGAs use LUT6 (6-input LUTs) as the standard. Why 6?

Flexibility: Can implement any function of up to 6 inputs. This covers most common logic expressions.
Efficiency: Larger LUTs (7, 8 inputs) are possible but increase memory and delay. Smaller LUTs (4 inputs) waste resources for complex functions.
Fracturable: On many devices a LUT6 can be split into two smaller LUTs (up to five inputs each) that share inputs, so small functions do not waste a whole LUT6.

2.4 LUT implementation example

Let’s implement Y = (A & B) | (C & D & E):

The synthesis tool creates the truth table for this function.
The bitstream programs the LUT’s configuration memory with the 64-bit truth table.
At runtime, the LUT simply performs a memory lookup: config_mem[{A, B, C, D, E, F}]. The function does not depend on F, so any two entries that differ only in F hold the same value.

This is why FPGAs can implement any combinational logic—as long as it fits within the LUT’s input limit.
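
As a rough illustration of the steps above, the sketch below builds the 64-entry truth table at elaboration time and then performs the lookup. The module name, the `build_init` helper, and the {A, B, C, D, E, F} index order are assumptions made for this example; real tools choose their own pin assignment.

```verilog
// Behavioral sketch of a LUT6 programmed with Y = (A & B) | (C & D & E).
module lut6_example (
    input  wire a, b, c, d, e, f,
    output wire y
);
    // Build the 64-bit truth table the way a synthesis tool would:
    // evaluate the function for every input combination.
    function [63:0] build_init;
        input dummy;  // Verilog-2001 functions need at least one input
        integer i;
        reg va, vb, vc, vd, ve;
        begin
            build_init = 64'd0;
            for (i = 0; i < 64; i = i + 1) begin
                // Index order is {A, B, C, D, E, F}: A is bit 5, F is bit 0
                va = (i >> 5) & 1;
                vb = (i >> 4) & 1;
                vc = (i >> 3) & 1;
                vd = (i >> 2) & 1;
                ve = (i >> 1) & 1;
                build_init[i] = (va & vb) | (vc & vd & ve);  // F is a don't-care
            end
        end
    endfunction

    localparam [63:0] INIT = build_init(1'b0);
    wire [63:0] config_mem = INIT;

    // Runtime behavior: a plain memory lookup
    assign y = config_mem[{a, b, c, d, e, f}];
endmodule
```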

2.5 LUT fracturable modes

Modern Xilinx FPGAs (7 series, UltraScale, etc.) have fracturable LUTs:

One LUT6 can implement any single 6-input function.
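Or, it can act as two LUT5s that share the same five inputs and drive two separate outputs (O5 and O6).
Two smaller functions with different inputs can also share one LUT6, as long as they use no more than five distinct inputs in total (for example, a 2-input and a 3-input function).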

This flexibility improves LUT utilization. If you only need a 3-input function, the rest of the LUT6 can often hold a second small function instead of going to waste.
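
A simplified behavioral model helps to see how one 64-bit table can serve two functions. The sketch assumes the usual dual-output arrangement (a 6-input lookup on one output and a 5-input lookup into the lower half on the other); the module and port names are made up for illustration.

```verilog
// Behavioral sketch of a fracturable LUT6 with two outputs.
module fracturable_lut_model #(
    parameter [63:0] INIT = 64'h0
)(
    input  wire [5:0] i,   // i[5] is the sixth input
    output wire       o6,  // lookup over all 64 entries
    output wire       o5   // lookup over the lower 32 entries only
);
    wire [63:0] config_mem = INIT;

    assign o6 = config_mem[i];
    assign o5 = config_mem[{1'b0, i[4:0]}];
endmodule
```

With `i[5]` tied high, `o6` reads only the upper 32 entries and `o5` the lower 32, giving two independent functions of the shared inputs `i[4:0]`. For example, `INIT = 64'h6666_6666_8888_8888` yields `o6 = i[0] ^ i[1]` and `o5 = i[0] & i[1]`.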

2.6 Carry logic and arithmetic

LUTs alone are inefficient for arithmetic (adders, counters). FPGAs include dedicated carry chains alongside LUTs:

Each logic cell has a fast carry-in and carry-out.
These are hardwired connections that bypass the programmable interconnect, creating a ripple-carry adder with minimal delay.
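An N-bit adder uses N LUTs plus N carry cells. The carry still ripples through all N bits, so total delay grows with N, but each bit adds only a very small delay because the carry never enters the general routing fabric.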

When you write a + b in Verilog, the synthesis tool infers a carry chain.
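
To make the structure concrete, here is a hand-written ripple-carry adder in which each bit uses a propagate signal (the LUT's job), a carry multiplexer, and a sum XOR, which is roughly how a carry chain is organized. It is a teaching sketch with assumed names; in real designs, simply writing `a + b` is preferable.

```verilog
// Ripple-carry adder written out bit by bit to mirror a carry chain.
module carry_chain_adder #(
    parameter N = 8
)(
    input  wire [N-1:0] a,
    input  wire [N-1:0] b,
    input  wire         cin,
    output wire [N-1:0] sum,
    output wire         cout
);
    wire [N:0] carry;
    assign carry[0] = cin;

    genvar k;
    generate
        for (k = 0; k < N; k = k + 1) begin : bit_slice
            // Propagate: computed in the LUT
            wire p = a[k] ^ b[k];
            // Carry mux: pass the incoming carry if propagating,
            // otherwise the carry equals a[k] (a and b are equal here)
            assign carry[k+1] = p ? carry[k] : a[k];
            // Sum XOR
            assign sum[k] = p ^ carry[k];
        end
    endgenerate

    assign cout = carry[N];
endmodule
```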

3. The complete logic cell (slice)

A single LUT doesn’t exist in isolation. It’s part of a larger structure called a slice (Xilinx) or logic element (LE) (Intel).

3.1 Xilinx 7-series slice structure (SLICEL)

A typical slice in Xilinx 7-series FPGAs contains:

4 × LUT6: Four 6-input LUTs (can be fractured into 8 × LUT5).
8 × flip-flops: Two registers per LUT, which can store LUT outputs or be used independently of the LUTs.
Multiplexers: To route LUT outputs to flip-flops or directly to outputs.
Carry chain logic: CARRY4 primitive for fast arithmetic.
Wide multiplexers: For efficient mux trees (using the MUXF7 and MUXF8 primitives).

3.2 SLICEL vs. SLICEM

Xilinx has two types of slices:

SLICEL (logic-only): Standard logic cells. LUTs can only be used as lookup tables.
SLICEM (memory-capable): LUTs can also be configured as small distributed RAMs (e.g., 32×1, 64×1) or shift registers (SRL16E, SRLC32E). These are more flexible but scarcer (roughly a quarter to a third of the slices in a 7-series device are SLICEM).

If you need small memories or shift registers, SLICEM is essential.
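
A shift register written without a reset is the kind of code that synthesis tools can typically map onto SRL resources in a SLICEM. The sketch below is one such description; the module name and parameter are illustrative, and the actual mapping depends on the tool and its settings.

```verilog
// Single-bit delay line. With no reset on the shift register,
// synthesis can usually pack it into LUT-based shift registers.
module srl_delay #(
    parameter DEPTH = 16   // number of clock cycles of delay (>= 2)
)(
    input  wire clk,
    input  wire ce,
    input  wire din,
    output wire dout
);
    reg [DEPTH-1:0] shreg = {DEPTH{1'b0}};

    always @(posedge clk) begin
        if (ce)
            shreg <= {shreg[DEPTH-2:0], din};
    end

    assign dout = shreg[DEPTH-1];
endmodule
```

Adding a reset that clears every stage would generally force the tool back to individual flip-flops, since an SRL has no reset input.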

3.3 Logic cell block diagram

        ┌──────────────────────────────────────┐
        │             Slice (SLICEL)           │
        │                                      │
        │  ┌─────┐    ┌────┐    ┌─────┐       │
   A ───┼─→│LUT6 │───→│ MUX├───→│ FF  ├──→ Q  │
   B ───┼─→│  A  │    └────┘    └─────┘       │
   C ───┼─→└─────┘                             │
   D ───┤                                      │
   E ───┤  ┌─────┐    ┌────┐    ┌─────┐       │
   F ───┼─→│LUT6 │───→│ MUX├───→│ FF  ├──→ Q  │
        │  │  B  │    └────┘    └─────┘       │
        │  └─────┘                             │
        │    ...     (2 more LUT6 + FF pairs)  │
        │                                      │
        │  ┌────────────────────────┐          │
        │  │   Carry Chain (CARRY4) │          │
        │  └────────────────────────┘          │
        └──────────────────────────────────────┘

3.4 Flip-flops in detail

Each flip-flop in a slice has:

D input: Data input (usually from a LUT or interconnect).
Clock: Connected to a global or regional clock net.
Clock enable (CE): Allows conditional updating (e.g., if (enable) q <= d;).
Set/Reset (SR): Synchronous or asynchronous set or reset.

The synthesis tool automatically infers flip-flops when you write:

always @(posedge clk)
    q <= d;
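
The clock enable and set/reset pins are inferred the same way. A minimal sketch (module and signal names are illustrative) that uses a synchronous reset and a clock enable:

```verilog
// Flip-flop with synchronous reset and clock enable.
// Reset has priority over the enable.
module dff_ce_sr (
    input  wire clk,
    input  wire rst,  // synchronous, active-high
    input  wire ce,   // clock enable
    input  wire d,
    output reg  q
);
    always @(posedge clk) begin
        if (rst)
            q <= 1'b0;
        else if (ce)
            q <= d;
    end
endmodule
```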

4. DSP slices: hardened arithmetic

4.1 What is a DSP slice?

Modern FPGAs include DSP blocks (Xilinx calls them DSP48E1/DSP48E2)—hardened arithmetic units optimized for signal processing and math-heavy applications.

A typical DSP slice contains:

25×18-bit multiplier (or 27×18 in newer devices)
48-bit accumulator
Pre-adder (for symmetric filters)
Pattern detector (for rounding, overflow detection)
Pipeline registers (multiple stages for high-speed operation)

4.2 DSP48E1 block diagram (simplified)

        ┌─────────────────────────────────────┐
        │          DSP48E1                    │
        │                                     │
   A ───┤  ┌──────────┐    ┌──────────────┐  │
 (25b)  │  │ Pre-adder│───→│              │  │
   D ───┤  │  (D ± A) │    │  Multiplier  │  │
 (25b)  │  └──────────┘    │   25 × 18    │──┤
        │                  │              │  │
   B ───┤─────────────────→│              │  │
 (18b)  │                  └──────────────┘  │
        │                        │           │
        │                        ▼           │
        │                  ┌──────────────┐  │
   C ───┤─────────────────→│ ALU + Accum. │──┼──→ P (48b)
 (48b)  │                  │   (48-bit)   │  │
        │                  └──────────────┘  │
        │                                     │
        │  [Pipeline registers at each stage] │
        └─────────────────────────────────────┘

4.3 Common DSP operations

Multiply: P = A × B
Multiply-accumulate (MAC): P = P + (A × B) — critical for filters, dot products, neural networks.
Pre-add + multiply: P = (D ± A) × B — efficient for symmetric FIR filters.
Multiply-add: P = (A × B) + C — single-cycle multiply-add.
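
As an example of the MAC operation, the sketch below describes a pipelined multiply-accumulate in plain Verilog. The register stages are meant to line up with the input, multiplier, and output registers of a DSP slice, but whether the tool actually maps it to one DSP block depends on the device and settings. The module name and the `clear` port are assumptions for this example.

```verilog
// Pipelined multiply-accumulate: p = sum of (a * b).
// Asserting clear together with a sample restarts the sum at that sample.
module mac_dsp (
    input  wire               clk,
    input  wire               clear,
    input  wire signed [24:0] a,
    input  wire signed [17:0] b,
    output reg  signed [47:0] p
);
    reg signed [24:0] a_r;
    reg signed [17:0] b_r;
    reg signed [42:0] m_r;
    reg               clear_r1, clear_r2;

    always @(posedge clk) begin
        // Stage 1: input registers
        a_r      <= a;
        b_r      <= b;
        clear_r1 <= clear;
        // Stage 2: multiplier register
        m_r      <= a_r * b_r;
        clear_r2 <= clear_r1;   // clear is delayed to stay aligned with the data
        // Stage 3: accumulator
        if (clear_r2)
            p <= m_r;           // load the first product
        else
            p <= p + m_r;       // accumulate
    end
endmodule
```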

4.4 Why use DSP slices?

Performance: DSP blocks can run at 500–900 MHz (depending on the device), much faster than logic-based multipliers.
Resource efficiency: A 25×18 multiplier in logic would consume hundreds of LUTs and have poor timing.
Power: Hardened blocks are more power-efficient than equivalent soft logic.

4.5 When to use DSP slices

Use DSP blocks for:

Multiplications (16-bit and larger)
Multiply-accumulate operations (filters, convolution, matrix multiply)
Wide adders and subtractors (the post-adder can be used standalone)
Counters and accumulators

Don’t use them for:

Trivial multiplies, such as multiplication by a constant power of 2 (use shifts instead)
Simple logic—LUTs are better

4.6 Inference vs. instantiation

Inference (recommended):

// Synthesis tool infers a DSP block
always @(posedge clk) begin
    product <= a * b;
end

Instantiation (for fine control):

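// Abbreviated sketch: a complete instantiation also needs OPMODE, ALUMODE,
// INMODE, clock enables, resets, and the remaining ports.
DSP48E1 #(
    .AREG(1),    // pipeline register on A
    .BREG(1),    // pipeline register on B
    .MREG(1),    // register after the multiplier
    .PREG(1)     // output register
    // ... many more parameters
) dsp_inst (
    .CLK(clk),
    .A(a),       // 30-bit port; the multiplier uses the low 25 bits
    .B(b),       // 18-bit port
    .P(product)  // 48-bit result
    // ... many more ports
);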

Instantiation gives you precise control over pipelining and features but is vendor-specific and harder to port.

5. Block RAM (BRAM): embedded memory

5.1 What is BRAM?

FPGAs have dedicated memory blocks called Block RAM (BRAM) or M20K (Intel). These are much more efficient than using LUTs as distributed RAM.

Xilinx 7-series BRAM characteristics:

36 Kb per block (can be split into two 18 Kb blocks)
True dual-port: Two independent read/write ports (A and B)
Configurable width and depth (e.g., 32K×1, 16K×2, 4K×9, 2K×18, 1K×36, etc.)
Optional output registers for improved timing
Byte-enable signals for partial writes
Built-in ECC (error correction) in some devices

5.2 BRAM configurations

A single 36 Kb BRAM can be configured in many ways:

Configuration	Width	Depth	Notes
1K × 36	36	1024	Maximum width in true dual-port mode (512 × 72 is available in simple dual-port mode)
2K × 18	18	2048	Common for 16-bit data
4K × 9	9	4096	8 data + 1 parity bit
8K × 4	4	8192	Narrow, deep memory
16K × 2	2	16384
32K × 1	1	32768	Maximum depth (bit-serial)

You can also split one 36 Kb block into two independent 18 Kb blocks.

5.3 True dual-port operation

Each BRAM has two ports (A and B), and both can:

Read and write independently
Access different addresses in the same cycle
Have different clocks (asynchronous dual-port)

This is perfect for:

Register files (a CPU’s two-read, one-write register file is usually built from two BRAM copies written in parallel, since each BRAM has only two ports)
Ping-pong buffers (one port writes, the other reads)
Multi-clock domain FIFOs
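
A true dual-port memory can usually be inferred from a description with one clocked block per port, as sketched below. The names are illustrative, and the exact template that a given tool accepts should be checked in its synthesis guide. Writing the same address from both ports in the same cycle gives an undefined result.

```verilog
// True dual-port RAM: two independent read/write ports,
// each with its own clock.
module true_dual_port_ram #(
    parameter WIDTH = 18,
    parameter DEPTH = 2048,
    parameter ADDR_WIDTH = $clog2(DEPTH)
)(
    // Port A
    input  wire                  clk_a,
    input  wire                  we_a,
    input  wire [ADDR_WIDTH-1:0] addr_a,
    input  wire [WIDTH-1:0]      din_a,
    output reg  [WIDTH-1:0]      dout_a,
    // Port B
    input  wire                  clk_b,
    input  wire                  we_b,
    input  wire [ADDR_WIDTH-1:0] addr_b,
    input  wire [WIDTH-1:0]      din_b,
    output reg  [WIDTH-1:0]      dout_b
);
    (* ram_style = "block" *)
    reg [WIDTH-1:0] mem [0:DEPTH-1];

    // Port A: synchronous write and synchronous read
    always @(posedge clk_a) begin
        if (we_a)
            mem[addr_a] <= din_a;
        dout_a <= mem[addr_a];
    end

    // Port B: same behavior on its own clock
    always @(posedge clk_b) begin
        if (we_b)
            mem[addr_b] <= din_b;
        dout_b <= mem[addr_b];
    end
endmodule
```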

5.4 Example: simple dual-port RAM

module simple_dual_port_ram #(
    parameter WIDTH = 32,
    parameter DEPTH = 1024,
    parameter ADDR_WIDTH = $clog2(DEPTH)
)(
    input  wire                     clk,
    // Write port
    input  wire                     we,
    input  wire [ADDR_WIDTH-1:0]    waddr,
    input  wire [WIDTH-1:0]         din,
    // Read port
    input  wire [ADDR_WIDTH-1:0]    raddr,
    output reg  [WIDTH-1:0]         dout
);
    (* ram_style = "block" *)  // Hint to use BRAM, not distributed RAM
    reg [WIDTH-1:0] mem [0:DEPTH-1];

    always @(posedge clk) begin
        if (we)
            mem[waddr] <= din;
        dout <= mem[raddr];  // Synchronous read (required for BRAM inference)
    end
endmodule

The (* ram_style = "block" *) attribute tells the synthesis tool to use BRAM. Without it, small memories might be implemented in distributed RAM (LUTs).

5.5 Output register modes

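BRAM reads are always synchronous: data appears one clock cycle after the address is presented. On top of that, there are two output options:

No output register: One cycle of read latency, but the BRAM’s clock-to-out delay is relatively long.
Output register: The BRAM has a built-in output register, adding one cycle of latency but improving timing closure.

READ_FIRST, WRITE_FIRST, and NO_CHANGE are separate write-mode settings. They determine what a port’s output shows while that same port is writing: the old contents, the new data, or the previous output value.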

For high-speed designs, always use the output register—it’s “free” in terms of resources (the register is part of the BRAM tile).
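
In inferred code, the output register is usually expressed as a second register stage after the synchronous read, which the tool can then absorb into the BRAM. A minimal sketch under that assumption (names are illustrative):

```verilog
// Simple dual-port RAM with two cycles of read latency:
// one for the synchronous read, one for the output register.
module sdp_ram_outreg #(
    parameter WIDTH = 32,
    parameter DEPTH = 1024,
    parameter ADDR_WIDTH = $clog2(DEPTH)
)(
    input  wire                  clk,
    input  wire                  we,
    input  wire [ADDR_WIDTH-1:0] waddr,
    input  wire [WIDTH-1:0]      din,
    input  wire [ADDR_WIDTH-1:0] raddr,
    output reg  [WIDTH-1:0]      dout
);
    (* ram_style = "block" *)
    reg [WIDTH-1:0] mem [0:DEPTH-1];
    reg [WIDTH-1:0] ram_out;  // result of the synchronous read

    always @(posedge clk) begin
        if (we)
            mem[waddr] <= din;
        ram_out <= mem[raddr];  // cycle 1: read the array
        dout    <= ram_out;     // cycle 2: output register
    end
endmodule
```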

5.6 BRAM resource calculation

Example: You need 128 KB of memory in a Xilinx 7-series FPGA. How many BRAMs?

128 KB = 128 × 1024 bytes = 131,072 bytes = 1,048,576 bits
Each BRAM = 36 Kb = 36,864 bits
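Number of BRAMs = 1,048,576 / 36,864 ≈ 28.4 blocks → 29 BRAMs (round up). This is a lower bound that assumes the parity bits hold data (widths of 9, 18, or 36). With 8-, 16-, or 32-bit words, each block holds only 32 Kb of data, so plan for 1,048,576 / 32,768 = 32 BRAMs.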

Always check the datasheet for the exact count of BRAMs in your device (e.g., Artix-7 100T has 135 BRAMs).

5.7 When to use BRAM vs. distributed RAM

Use BRAM when…	Use distributed RAM when…
Memory is large (> 1 Kb)	Memory is tiny (< 256 bits)
You need dual-port access	You need many small, independent RAMs
Timing is critical (use output regs)	You have LUTs to spare
You want to save logic resources	BRAMs are scarce and logic is plenty

6. Clock resources: PLLs, MMCMs, and clock networks

6.1 Clock domains and distribution

FPGAs have dedicated global clock networks with low skew and low jitter. These are essential for synchronous designs.

Xilinx 7-series has:

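32 global clock buffers (BUFG) per device; up to 12 global clocks can be used within any one clock region
Regional clock buffers (BUFR) that drive a single clock region and can also divide the clock
I/O clock buffers (BUFIO) that drive only the I/O logic, for source-synchronous interfaces (e.g., DDR, LVDS)

Using a BUFG ensures that your clock reaches every flip-flop with low skew. The timing report shows the actual skew on each path.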

6.2 PLLs (Phase-Locked Loops)

A PLL is a clock generation circuit that can:

Multiply the input clock frequency (e.g., 100 MHz → 200 MHz)
Divide the input clock (e.g., 100 MHz → 25 MHz)
Phase-shift the clock (e.g., 90° shift for DDR capture)
Filter jitter from the input clock

Xilinx devices have multiple PLLs (often called PLLE2 in 7-series).

6.3 MMCMs (Mixed-Mode Clock Managers)

An MMCM is a more advanced version of a PLL, with additional features:

Multiple output clocks (up to 7 in Xilinx 7-series)
Dynamic reconfiguration (change frequency on-the-fly)
Fine phase adjustment (in ps steps)
Jitter filtering and dynamic phase shifting

MMCMs are used when you need multiple clock domains (e.g., 50 MHz, 100 MHz, 200 MHz) from a single input.

6.4 MMCM block diagram (simplified)

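                     ┌──────────────────────────────┐
                     │           MMCM               │
                     │                              │
   CLK_IN ───────────┤→ Phase Detector → VCO       │
                     │     ↓                        │
   CLKFB_IN ─────────┤─ Feedback Divider (/M)      │
                     │                              │
                     │   VCO freq = CLK_IN × (M/D)  │
                     │                              │
                     │   ┌──────────────────┐       │
                     │   │  Output Dividers │       │
                     │   │   (/D0, /D1...)  │       │
                     │   └──────────────────┘       │
                     │         │  │  │  │           │
                     └─────────┼──┼──┼──┼───────────┘
                               │  │  │  │
                            CLK0 CLK1 CLK2 ...

Here M is the feedback divider value (CLKFBOUT_MULT_F), which multiplies the VCO frequency, and D is the input divider (DIVCLK_DIVIDE, 1 by default), which is not drawn. Each output clock is the VCO frequency divided by its own output divider.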

6.5 Example: generating 50 MHz and 200 MHz from 100 MHz input

MMCME2_BASE #(
    .CLKIN1_PERIOD(10.0),      // 100 MHz input = 10 ns period
    .DIVCLK_DIVIDE(1),         // Input divider D = 1
    .CLKFBOUT_MULT_F(8.0),     // VCO = 100 MHz × 8 / 1 = 800 MHz
    .CLKOUT0_DIVIDE_F(16.0),   // CLK0 = 800 MHz / 16 = 50 MHz
    .CLKOUT1_DIVIDE(4)         // CLK1 = 800 MHz / 4 = 200 MHz
) mmcm_inst (
    .CLKIN1(clk_in),
    .CLKFBOUT(clkfb),
    .CLKFBIN(clkfb),           // Feedback loop (direct, no buffer)
    .CLKOUT0(clk_50),          // Route through a BUFG before clocking logic
    .CLKOUT1(clk_200),
    .RST(reset),
    .PWRDWN(1'b0),             // Keep the MMCM powered up
    .LOCKED(locked)            // High when MMCM is stable
);

Always wait for LOCKED to go high before using the output clocks.
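
One common way to enforce this is to hold downstream logic in reset until LOCKED is high, then release the reset synchronously. The sketch below shows that idea; the module name, port names, and two-stage depth are assumptions for illustration.

```verilog
// Reset generator driven by an MMCM/PLL LOCKED signal.
// rst asserts immediately when lock is lost and deasserts
// two clock edges after lock is achieved.
module lock_reset_sync (
    input  wire clk,     // one of the MMCM output clocks
    input  wire locked,  // LOCKED output of the MMCM
    output wire rst      // active-high reset for logic on clk
);
    (* ASYNC_REG = "TRUE" *) reg [1:0] sync = 2'b11;

    always @(posedge clk or negedge locked) begin
        if (!locked)
            sync <= 2'b11;             // asynchronous assert
        else
            sync <= {sync[0], 1'b0};   // synchronous release
    end

    assign rst = sync[1];
endmodule
```

Each output clock domain would normally get its own instance so that every reset is released synchronously to the clock it serves.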

6.6 Clock domain crossing (CDC)

When you have multiple clock domains, you need to handle clock domain crossings (CDC) carefully:

Asynchronous FIFOs: For transferring data between domains.
Synchronizer chains: Two or more flip-flops in series to reduce the risk of metastability (for single-bit control signals).
Handshake protocols: Request/acknowledge for safe data transfer.

Never directly connect a signal from one clock domain to another without proper synchronization—this causes metastability and intermittent failures.
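
For a single-bit level signal, the synchronizer chain can be sketched as follows. The module and port names are illustrative; `ASYNC_REG` is the Vivado attribute that asks the tools to keep the two flip-flops close together and treat them as a synchronizer.

```verilog
// Two-flip-flop synchronizer for a single-bit signal.
// Not suitable for multi-bit buses, where bits could arrive in different cycles.
module sync_2ff (
    input  wire clk_dst,   // destination clock
    input  wire async_in,  // signal from another clock domain
    output wire sync_out   // synchronized to clk_dst
);
    (* ASYNC_REG = "TRUE" *) reg [1:0] sync_ff = 2'b00;

    always @(posedge clk_dst)
        sync_ff <= {sync_ff[0], async_in};

    assign sync_out = sync_ff[1];
endmodule
```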

7. Vendor-specific primitives

7.1 What are primitives?

Primitives are low-level building blocks provided by the FPGA vendor. They give you direct control over hardware features that might not be easily inferred by synthesis tools.

Examples:

BUFG: Global clock buffer
CARRY4: 4-bit carry logic
DSP48E1: DSP slice
RAMB36E1: 36 Kb block RAM
IBUF/OBUF: Input/output buffers
ODDR/IDDR: Output/input DDR (double-data-rate) registers
SRL16E/SRLC32E: Shift register LUTs

7.2 Inference vs. instantiation

Inference (recommended for portability):

Write standard Verilog/VHDL
Let the synthesis tool map to primitives
Easier to port to different FPGA families

Instantiation (for fine control or special features):

Directly instantiate a primitive in your code
Vendor-specific, not portable
Necessary for features that can’t be inferred (e.g., ODDR, IDELAYCTRL)

7.3 Example: BUFG for clock buffering

Inference (the tool inserts a BUFG automatically on nets that clock registers):

wire clk_internal;
assign clk_internal = clk_in;  // Plain wire; the tool inserts the BUFG on the clock net

Instantiation (explicit control):

BUFG bufg_inst (
    .I(clk_in),
    .O(clk_buffered)
);

7.4 Example: ODDR for DDR output

You can’t infer an ODDR (output double-data-rate register) with standard Verilog—you must instantiate it:

ODDR #(
    .DDR_CLK_EDGE("SAME_EDGE")
) oddr_inst (
    .Q(ddr_out),       // DDR output
    .C(clk),           // Clock
    .CE(1'b1),         // Clock enable (always enabled)
    .D1(data_rise),    // Data on rising edge
    .D2(data_fall),    // Data on falling edge
    .R(1'b0),
    .S(1'b0)
);

This primitive outputs data_rise on the rising edge of clk and data_fall on the falling edge, effectively doubling the data rate.

7.5 When to instantiate primitives

BUFG: Usually auto-inferred, but instantiate if you need precise control over clock routing.
MMCM/PLL: Always instantiate (or use the Clocking Wizard IP).
DSP48: Inference is fine for most cases; instantiate for exotic modes.
BRAM: Inference is usually fine; instantiate for specific cascade or ECC modes.
ODDR/IDDR: Must instantiate (no inference).
IDELAYCTRL/IDELAYE2: Must instantiate (for input delay calibration).

7.6 Reading the primitive library

Every FPGA vendor provides a Libraries Guide (e.g., Xilinx 7 Series Libraries Guide (Vivado Design Suite)) that documents every primitive:

Available parameters
Port descriptions
Timing characteristics
Usage examples

Bookmark this guide—you’ll refer to it constantly when doing low-level FPGA work.

8. Resource utilization and optimization

8.1 Understanding utilization reports

After synthesis and implementation, you get a utilization report showing how many resources your design uses:

+----------------------------+-------+
|         Resource           | Used  |
+----------------------------+-------+
| Slice LUTs                 | 12345 |
|   LUT as Logic             | 11000 |
|   LUT as Memory            | 1345  |
| Slice Registers            | 18000 |
| Block RAM Tile             | 45    |
| DSPs                       | 32    |
| BUFG                       | 4     |
| MMCM                       | 2     |
+----------------------------+-------+

Compare this to your device’s total resources (e.g., Artix-7 100T has 63,400 LUTs, 126,800 FFs, 135 BRAMs, 240 DSPs).

8.2 Optimization strategies

Use the right resource for the job

Arithmetic: Use DSP blocks for multiplies and multiply-accumulates; plain adders and counters usually map well to carry chains.
Memory: Use BRAM, not distributed RAM (unless memory is tiny).
Shift registers: Use SRL primitives (SLICEM), not chains of flip-flops.

Pipeline for speed

Add registers between logic stages to break up long combinational paths:

// Before: long combinational path
assign result = ((a + b) * c) - (d & e);

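// After: pipelined (3 stages); side inputs are delayed to stay aligned
always @(posedge clk) begin
    // Stage 1
    stage1 <= a + b;
    c_d1   <= c;
    de_d1  <= d & e;
    // Stage 2
    stage2 <= stage1 * c_d1;
    de_d2  <= de_d1;
    // Stage 3
    result <= stage2 - de_d2;
end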

This increases latency but allows higher clock frequency.

Avoid unnecessary logic

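Don’t expect coding style alone to save logic: if (enable == 1'b1) and if (enable) synthesize to the same hardware. Prefer the shorter form for readability.
Use parameters: Size buses and counters with parameters so that modules are no wider than they need to be; oversized hard-coded widths waste resources.
Remove debug logic in production: Debug counters and logic analyzer cores consume resources. ($display is ignored by synthesis, and unused signals are trimmed automatically.)

Share resources

If two operations never happen simultaneously, they can share the same hardware: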

always @(posedge clk) begin
    if (state == LOAD)
        result <= a + b;
    else if (state == COMPUTE)
        result <= c + d;  // Synthesis can share the adder
end

8.3 Retiming

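Modern synthesis tools can retime your design: they move registers across combinational logic to balance delays. In Vivado, retiming is a synthesis option. In project mode, for example:

set_property STEPS.SYNTH_DESIGN.ARGS.RETIMING true [get_runs synth_1]

In non-project mode, pass -retiming to synth_design.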

This can significantly improve timing without changing functionality.

9. Practical examples

9.1 Example 1: FIR filter using DSP blocks

A simple 4-tap FIR filter:

module fir_filter (
    input  wire        clk,
    input  wire signed [15:0] x_in,    // Input sample
    output reg  signed [31:0] y_out    // Output sample
);
    // Coefficients (fixed-point, Q15: value / 32768)
    parameter signed [15:0] H0 = 16'sh1000;  // 0.125 in Q15 format
    parameter signed [15:0] H1 = 16'sh2000;  // 0.25
    parameter signed [15:0] H2 = 16'sh2000;  // 0.25
    parameter signed [15:0] H3 = 16'sh1000;  // 0.125

    // Delay line
    reg signed [15:0] x[0:3];

    always @(posedge clk) begin
        // Shift delay line
        x[0] <= x_in;
        x[1] <= x[0];
        x[2] <= x[1];
        x[3] <= x[2];

        // Compute FIR output (synthesis infers DSP blocks)
        y_out <= (H0 * x[0]) + (H1 * x[1]) + (H2 * x[2]) + (H3 * x[3]);
    end
endmodule

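The synthesis tool can map the multiplications and additions onto DSP48E1 blocks. With the power-of-two coefficients used here it may reduce the multiplies to shifts instead, and summing four products in a single cycle limits the clock rate. Non-trivial coefficients and extra pipeline registers give a more typical DSP mapping.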
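
Because this filter is symmetric (H0 = H3 and H1 = H2), the pre-add + multiply form can halve the number of multipliers. The sketch below is one pipelined way to write it; the module name and coefficient values are illustrative assumptions, and whether the pre-adders end up inside the DSP blocks depends on the tool.

```verilog
// 4-tap symmetric FIR: y = H0*(x0 + x3) + H1*(x1 + x2)
// Two multipliers instead of four, with a register after each step.
module fir4_symmetric #(
    parameter signed [15:0] H0 = 16'sd3277,   // taps 0 and 3 (about 0.1 in Q15)
    parameter signed [15:0] H1 = 16'sd13107   // taps 1 and 2 (about 0.4 in Q15)
)(
    input  wire               clk,
    input  wire signed [15:0] x_in,
    output reg  signed [33:0] y_out
);
    reg signed [15:0] x0, x1, x2, x3;  // delay line
    reg signed [16:0] pre0, pre1;      // pre-adder results (one extra bit)
    reg signed [32:0] prod0, prod1;    // 17 x 16 bit products

    always @(posedge clk) begin
        // Delay line
        x0 <= x_in;
        x1 <= x0;
        x2 <= x1;
        x3 <= x2;
        // Pre-adders pair up the taps that share a coefficient
        pre0 <= x0 + x3;
        pre1 <= x1 + x2;
        // Multipliers
        prod0 <= pre0 * H0;
        prod1 <= pre1 * H1;
        // Final sum
        y_out <= prod0 + prod1;
    end
endmodule
```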

9.2 Example 2: Asynchronous FIFO for CDC

module async_fifo #(
    parameter DATA_WIDTH = 32,
    parameter DEPTH = 16,
    parameter ADDR_WIDTH = $clog2(DEPTH)
)(
    input  wire                     wr_clk,
    input  wire                     wr_en,
    input  wire [DATA_WIDTH-1:0]    wr_data,
    output wire                     wr_full,

    input  wire                     rd_clk,
    input  wire                     rd_en,
    output wire [DATA_WIDTH-1:0]    rd_data,
    output wire                     rd_empty,

    input  wire                     rst
);
    // Gray code counters for read/write pointers
    reg [ADDR_WIDTH:0] wr_ptr, rd_ptr;
    reg [ADDR_WIDTH:0] wr_ptr_gray, rd_ptr_gray;
    reg [ADDR_WIDTH:0] rd_ptr_gray_sync1, rd_ptr_gray_sync2;  // Synchronizer
    reg [ADDR_WIDTH:0] wr_ptr_gray_sync1, wr_ptr_gray_sync2;

    // Dual-port RAM (infers BRAM)
    (* ram_style = "block" *)
    reg [DATA_WIDTH-1:0] mem [0:DEPTH-1];

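    // Memory write (kept out of the reset block so the array can map to RAM)
    always @(posedge wr_clk) begin
        if (wr_en && !wr_full)
            mem[wr_ptr[ADDR_WIDTH-1:0]] <= wr_data;
    end

    // Write pointer
    always @(posedge wr_clk or posedge rst) begin
        if (rst)
            wr_ptr <= 0;
        else if (wr_en && !wr_full)
            wr_ptr <= wr_ptr + 1;
    end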

    // Gray code conversion for write pointer
    always @(posedge wr_clk or posedge rst) begin
        if (rst)
            wr_ptr_gray <= 0;
        else
            wr_ptr_gray <= wr_ptr ^ (wr_ptr >> 1);
    end

    // Synchronize read pointer to write clock domain
    always @(posedge wr_clk or posedge rst) begin
        if (rst) begin
            rd_ptr_gray_sync1 <= 0;
            rd_ptr_gray_sync2 <= 0;
        end else begin
            rd_ptr_gray_sync1 <= rd_ptr_gray;
            rd_ptr_gray_sync2 <= rd_ptr_gray_sync1;
        end
    end

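    // Full when the Gray code of the current write pointer equals the
    // synchronized read pointer with its two MSBs inverted. The current
    // pointer is used because the registered wr_ptr_gray lags it by one cycle.
    wire [ADDR_WIDTH:0] wr_ptr_gray_now = wr_ptr ^ (wr_ptr >> 1);
    assign wr_full = (wr_ptr_gray_now == {~rd_ptr_gray_sync2[ADDR_WIDTH:ADDR_WIDTH-1],
                                           rd_ptr_gray_sync2[ADDR_WIDTH-2:0]});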

    // Read logic (similar structure, omitted for brevity)
    // ...

endmodule

This uses Gray code counters and a two-stage synchronizer to safely cross clock domains.
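
Gray code is used because consecutive values differ in exactly one bit, so a pointer sampled mid-transition is off by at most one count. The conversions in both directions can be sketched as a small combinational module (the module name and ports are illustrative):

```verilog
// Binary <-> Gray code conversion.
module gray_code_conv #(
    parameter W = 5
)(
    input  wire [W-1:0] bin_in,
    output wire [W-1:0] gray_out,
    input  wire [W-1:0] gray_in,
    output reg  [W-1:0] bin_out
);
    // Binary to Gray: XOR each bit with its upper neighbor
    assign gray_out = bin_in ^ (bin_in >> 1);

    // Gray to binary: each bit is the XOR of all Gray bits at or above it
    integer k;
    always @* begin
        bin_out[W-1] = gray_in[W-1];
        for (k = W - 2; k >= 0; k = k - 1)
            bin_out[k] = bin_out[k+1] ^ gray_in[k];
    end
endmodule
```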

10. Reading FPGA datasheets

10.1 Key sections to understand

When evaluating an FPGA for a project, look for:

Logic resources: Number of LUTs, FFs, slices.
BRAM: Total memory capacity and number of blocks.
DSP slices: Count and capability (e.g., 18×25 vs. 18×18).
I/O pins: Total user I/O, supported standards (LVDS, LVTTL, HSTL, etc.), maximum toggle rate.
Clock resources: Number of PLLs, MMCMs, global clock networks.
Transceivers: If you need high-speed serial (PCIe, 10G Ethernet), check GTP/GTX/GTH/GTY counts and data rates.
Hard IP: PCIe blocks, Ethernet MACs, memory controllers, processor cores.
Speed grade: Faster speed grades support higher clock frequencies but cost more. For Xilinx parts a larger number is faster (-3 is faster than -1).

10.2 Example: Xilinx Artix-7 100T (XC7A100T)

Resource	Count
Logic Cells	101,440
Slices	15,850
CLB Flip-Flops	126,800
Max Distributed RAM	1,188 Kb
Block RAM (36 Kb)	135 (4,860 Kb)
DSP Slices	240
PCIe Blocks	1
MMCM	6
PLL	6
User I/O	Up to 300 (package-dependent)
GTP Transceivers	Up to 8 (package-dependent)

This is a mid-range FPGA suitable for signal processing, embedded systems, and moderate compute tasks.

10.3 Estimating if your design will fit

Rule of thumb:

Logic: Aim to use < 70% of LUTs/FFs (beyond that, routing congestion makes placement, routing, and timing closure much harder).
BRAM: Sum your memories and add 10–20% margin.
DSP: Count multipliers and MACs; ensure you have enough DSP blocks.
I/O: Verify you have enough pins and the right standards.

If you’re close to 100% utilization, implementation will be slow, timing closure will be hard, and small changes can break the design. Leave headroom.

11. Common pitfalls and best practices

Pitfall 1: Underutilizing DSP blocks

Problem: Implementing multipliers in logic when DSP blocks are available.

Solution: Write clean arithmetic code (a * b) and let synthesis infer DSP blocks. Check the synthesis report to confirm DSP usage.

Pitfall 2: Ignoring clock domain crossings

Problem: Directly connecting signals between different clock domains causes metastability.

Solution: Always use synchronizers (two-FF chain minimum) for control signals, and async FIFOs for data buses.

Pitfall 3: Overusing global clocks

Problem: Using too many global clocks (> 32 in 7-series) forces some clocks onto regional or local routing, increasing skew and reducing performance.

Solution: Consolidate clock domains where possible. Use clock enables instead of multiple clocks.
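
As a sketch of the clock-enable approach, the module below produces a one-cycle tick every DIVIDE cycles of the fast clock. Slow logic then stays on the fast clock and updates only when the tick is high. The names are illustrative.

```verilog
// Generates a one-cycle enable pulse every DIVIDE clock cycles.
// Use it as a clock enable instead of creating a divided clock.
module ce_divider #(
    parameter DIVIDE = 4
)(
    input  wire clk,
    input  wire rst,   // synchronous, active-high
    output reg  tick
);
    localparam CNT_W = (DIVIDE > 1) ? $clog2(DIVIDE) : 1;
    reg [CNT_W-1:0] count;

    always @(posedge clk) begin
        if (rst) begin
            count <= 0;
            tick  <= 1'b0;
        end else if (count == DIVIDE - 1) begin
            count <= 0;
            tick  <= 1'b1;
        end else begin
            count <= count + 1'b1;
            tick  <= 1'b0;
        end
    end
endmodule
```

A register in the slow part of the design would then be written as `always @(posedge clk) if (tick) slow_reg <= next_value;`, keeping the whole design in a single clock domain.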

Pitfall 4: Not pipelining DSP blocks

Problem: Using DSP blocks without internal pipeline registers, resulting in long combinational paths.

Solution: Enable all internal registers in DSP blocks (AREG, BREG, MREG, PREG). This allows DSP blocks to run at maximum speed (500+ MHz).

Pitfall 5: Ignoring timing reports

Problem: Assuming your design works because it compiled.

Solution: Always check the timing report. Look at the worst negative slack (WNS) for setup and the worst hold slack (WHS). If either is negative, the design is not guaranteed to work at the target frequency.

12. Advanced topics (preview)

These topics are beyond this session but important for serious FPGA work:

Partial reconfiguration: Reconfiguring part of the FPGA while the rest keeps running.
High-level synthesis (HLS): Writing FPGA designs in C/C++ instead of HDL.
NoC (Network-on-Chip): Versal FPGAs have a built-in network for inter-block communication.
AI engines: Versal AI Core has dedicated AI/ML vector processors (clocked at roughly 1 GHz, with INT8/INT16 MACs).
SerDes (GTX/GTH/GTY): Multi-gigabit transceivers for PCIe, 10G/100G Ethernet, Aurora, etc.
Memory interfaces: DDR3/DDR4 controllers, QDRII+, HBM2.

As you gain experience, these will become essential tools in your FPGA design toolkit.

13. Hands-on exercises

Exercise 1: LUT calculation

A 5-input LUT implements the function Y = (A & B) | (C & D & E). How many bits of configuration memory does it have? Write out the truth table (first 8 rows only).

Exercise 2: DSP resource estimation

You’re designing a 16-tap FIR filter with 16-bit coefficients and 16-bit input samples. How many DSP48E1 blocks will you need? (Assume one DSP per multiply-accumulate.)

Exercise 3: BRAM capacity

You need to store a 1920×1080 frame buffer with 24-bit color (8 bits per channel). How many Xilinx 36 Kb BRAMs are required?

Exercise 4: Clock generation

Given a 100 MHz input clock, design an MMCM configuration to generate:

25 MHz (for VGA)
125 MHz (for Ethernet PHY)
250 MHz (for internal logic)

Specify the VCO frequency and output dividers.

Exercise 5: Clock domain crossing

You have a signal valid in the 100 MHz domain that needs to cross into the 50 MHz domain. Draw the synchronizer circuit (at least two flip-flops). Why do you need two stages?

14. Solutions to exercises

Solution 1

A 5-input LUT has $2^5 = 32$ bits of configuration memory.

Truth table (first 8 rows, with A as the most significant input, so A = B = 0 throughout):

A	B	C	D	E	Y
0	0	0	0	0	0
0	0	0	0	1	0
0	0	0	1	0	0
0	0	0	1	1	0
0	0	1	0	0	0
0	0	1	0	1	0
0	0	1	1	0	0
0	0	1	1	1	1	← (C & D & E) = 1

Solution 2

A 16-tap FIR filter needs 16 multiplications per sample. If you implement it as a fully parallel filter, you need 16 DSP blocks.

However, if you can tolerate multiple cycles per sample, you can time-multiplex a single DSP block (or a small number) and compute the taps sequentially. For maximum throughput (one sample per cycle), you need 16 DSPs.

Solution 3

Frame size = 1920 × 1080 × 24 bits = 49,766,400 bits

Each BRAM = 36,864 bits

Number of BRAMs = 49,766,400 / 36,864 = 1,350 BRAMs (ten times the 135 blocks in an Artix-7 100T)

(This is huge! In practice, you’d use external DRAM, not BRAM, for frame buffers.)

Solution 4

Goal: Generate 25 MHz, 125 MHz, 250 MHz from 100 MHz.

Choose a VCO frequency that is a common multiple of all three outputs and falls inside the MMCM’s allowed VCO range (roughly 600 to 1200 MHz in the slowest 7-series speed grade; check the datasheet): 1000 MHz (1 GHz).

VCO multiplier: 100 MHz × 10 = 1000 MHz → CLKFBOUT_MULT_F = 10.0
Output dividers:
- 25 MHz: 1000 / 40 = 25 → CLKOUT0_DIVIDE_F = 40.0
- 125 MHz: 1000 / 8 = 125 → CLKOUT1_DIVIDE = 8
- 250 MHz: 1000 / 4 = 250 → CLKOUT2_DIVIDE = 4

Configuration summary:

CLKIN1_PERIOD = 10.0       (100 MHz input)
CLKFBOUT_MULT_F = 10.0     (VCO = 1000 MHz)
CLKOUT0_DIVIDE_F = 40.0    (25 MHz)
CLKOUT1_DIVIDE = 8         (125 MHz)
CLKOUT2_DIVIDE = 4         (250 MHz)

Solution 5

Synchronizer circuit:

100 MHz domain      │      50 MHz domain
                    │
                    │   ┌───────┐      ┌───────┐
  valid ────────────┼──►│D FF1 Q├─────►│D FF2 Q├────► valid_sync
                    │   └───▲───┘      └───▲───┘
                    │       │              │
                    │   clk_50MHz      clk_50MHz

Why two stages?

When valid changes asynchronously with respect to clk_50MHz, the first flip-flop (FF1) may enter a metastable state (output undefined, neither 0 nor 1). If we used FF1’s output directly, this metastability could propagate through the design, causing unpredictable behavior.

The second flip-flop (FF2) gives the first flip-flop time to settle. Metastability decays exponentially, so after one full clock cycle the probability that FF1 is still unresolved when FF2 samples it is extremely small, and the resulting mean time between failures is typically far longer than the life of the product.

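Result: valid_sync is now a clean, synchronized signal safe to use in the 50 MHz domain, with 2 cycles of latency. This only works if valid stays high for longer than one 50 MHz period (20 ns); a single 10 ns pulse can fall between two destination clock edges and be missed.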
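
When valid is only a one-cycle pulse (10 ns, shorter than the 20 ns destination period), a level synchronizer is not enough. One common alternative is a toggle-based pulse synchronizer, sketched below with illustrative names. It assumes that source pulses are spaced at least a few destination clock cycles apart.

```verilog
// Toggle-based pulse synchronizer: each source pulse flips a level,
// the level is synchronized, and an edge detector recreates the pulse.
module pulse_sync (
    input  wire clk_src,
    input  wire pulse_src,  // one-cycle pulse in the source domain
    input  wire clk_dst,
    output wire pulse_dst   // one-cycle pulse in the destination domain
);
    // Source domain: convert the pulse into a level change
    reg toggle = 1'b0;
    always @(posedge clk_src)
        if (pulse_src)
            toggle <= ~toggle;

    // Destination domain: two-stage synchronizer plus one history bit
    (* ASYNC_REG = "TRUE" *) reg [1:0] sync = 2'b00;
    reg sync_d = 1'b0;
    always @(posedge clk_dst) begin
        sync   <= {sync[0], toggle};
        sync_d <= sync[1];
    end

    // Any change of the synchronized level marks one source pulse
    assign pulse_dst = sync[1] ^ sync_d;
endmodule
```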

15. Summary and takeaways

Modern FPGAs are powerful, flexible platforms for digital design. To use them effectively:

Understand the building blocks: LUTs implement logic, flip-flops store state, DSP blocks handle arithmetic, BRAMs store data, and clock resources distribute timing.
Let synthesis do its job: Write clean, synthesizable code and let the tools infer resources. Instantiate primitives only when necessary.
Mind your clocks: Use PLLs/MMCMs for clock generation, BUFGs for distribution, and proper CDC techniques when crossing clock domains.
Optimize for resources: Use DSP blocks for math, BRAMs for memory, and pipeline for speed. Aim for < 70% utilization to leave routing headroom.
Check your reports: Always review synthesis and implementation reports. Look for timing violations, resource utilization, and warnings.
Read the docs: Vendor libraries guides, architecture manuals, and datasheets are your friends. Bookmark them.

With these fundamentals in hand, you’re ready to tackle real FPGA projects—whether it’s a simple Verilog exercise or a full SoC with processors, peripherals, and high-speed I/O.

16. Further reading and resources

Xilinx 7 Series FPGAs Configurable Logic Block User Guide (UG474): Detailed explanation of CLBs, LUTs, and carry logic.
Xilinx 7 Series FPGAs Memory Resources User Guide (UG473): Block RAM and distributed RAM architecture.
Xilinx 7 Series FPGAs Clocking Resources User Guide (UG472): MMCMs, PLLs, clock distribution.
Xilinx 7 Series DSP48E1 Slice User Guide (UG479): Everything about DSP blocks.
Intel (Altera) Stratix/Cyclone Handbook: For Intel FPGA architecture (similar concepts, different terminology).
“FPGA Prototyping by Verilog Examples” by Pong P. Chu: Hands-on book with many practical examples.
Xilinx Forums and GitHub: Community designs and IP cores.

That’s it for Session 5! You now have a solid understanding of FPGA architecture and how to use its resources effectively. Practice designing small modules (counters, FIFOs, filters) and synthesizing them to see how the tools map your code to hardware. Over time, you’ll develop an intuition for what code structures map efficiently to FPGAs.

Happy FPGA hacking, and see you in the next session!