Session 5 – FPGA Building Blocks
In the previous sessions, we’ve learned about combinational and sequential logic, memory systems, and how digital circuits work in theory. Now it’s time to see how all of this gets implemented in practice on one of the most versatile hardware platforms: Field-Programmable Gate Arrays (FPGAs).
Unlike ASICs (Application-Specific Integrated Circuits) that are manufactured with fixed functionality, FPGAs are reconfigurable. You can program them to implement virtually any digital circuit—from a simple LED blinker to a complete RISC-V processor, a neural network accelerator, or a high-speed network switch. This flexibility makes FPGAs invaluable for prototyping, research, and applications where you need custom hardware without the multi-million dollar cost of ASIC fabrication.
This session will walk you through the architecture of modern FPGAs, focusing on the building blocks you’ll encounter in devices from Xilinx (now AMD), Intel (formerly Altera), and Lattice. We’ll look at how lookup tables implement logic, how block RAM is organized, what DSP slices do, and how clock management works.
By the end of this session, you should be comfortable with:
- LUT (lookup table) architecture and how they implement combinational logic
- The structure of a complete logic cell (LUT + flip-flop + carry chain)
- DSP slices and when to use them
- Block RAM (BRAM) organization, modes, and instantiation
- Clock resources: PLLs, MMCMs, and global clock networks
- Vendor-specific primitives and how to infer vs. instantiate them
- Resource utilization and optimization strategies
- How to read FPGA datasheets and architecture guides
1. FPGA overview: the big picture
1.1 What is an FPGA?
An FPGA is a programmable integrated circuit consisting of:
- Configurable logic blocks (CLBs) or logic elements (LEs): The basic building blocks that implement your logic.
- Programmable interconnect: A network of switches and wires that connect the logic blocks together.
- Block RAM (BRAM): Embedded memory blocks scattered across the chip.
- DSP blocks: Hardened multipliers and arithmetic units for signal processing.
- I/O blocks: Configurable input/output pins with various standards (LVDS, DDR, etc.).
- Clock management: PLLs (Phase-Locked Loops) and MMCMs (Mixed-Mode Clock Managers) for generating and distributing clocks.
- Hard IP blocks: Pre-built circuits like PCIe controllers, Ethernet MACs, memory controllers, and sometimes even processor cores (ARM, PowerPC, etc.).
The configuration memory (usually SRAM-based) stores the bitstream that defines how all these resources are connected and configured. When you “program” an FPGA, you’re loading this bitstream.
1.2 FPGA vs. ASIC vs. CPU
| Aspect | FPGA | ASIC | CPU (software) |
|---|---|---|---|
| Development cost | Low ($100s to $10Ks) | Very high ($millions) | Very low |
| Development time | Hours to weeks | Months to years | Hours to weeks |
| Performance | Medium to high | Highest | Lower (sequential execution) |
| Power efficiency | Medium | Highest | Lower |
| Flexibility | Reconfigurable | Fixed after fabrication | Fully flexible (software) |
| Unit cost | Higher | Lower (at volume) | N/A (processor already there) |
| Time to market | Fast | Slow | Fastest |
FPGAs sit in the sweet spot: custom hardware performance with software-like flexibility.
1.3 Major FPGA vendors
- Xilinx (AMD): Market leader. Families include Spartan (low-cost), Artix/Kintex/Virtex (mid to high-end), and Versal (AI-optimized with AI engines).
- Intel (formerly Altera): Cyclone (low-cost), Arria (mid-range), Stratix (high-end), and Agilex (latest).
- Lattice: iCE40 and ECP5 (low-power, small form factor), CrossLink (video/imaging).
- Microchip (formerly Microsemi/Actel): PolarFire (low-power, radiation-tolerant).
We’ll focus primarily on Xilinx/AMD terminology since they’re the most common in academia and industry, but the concepts translate across vendors.
2. Lookup tables (LUTs): the heart of FPGA logic
2.1 What is a LUT?
A lookup table (LUT) is a small memory that implements arbitrary combinational logic. Think of it as a truth table stored in SRAM.
A 6-input LUT (LUT6) has:
- 6 input pins (A, B, C, D, E, F)
- 1 output pin (O)
- $2^6 = 64$ bits of configuration memory
The configuration memory stores the truth table for any 6-input Boolean function. When you present a 6-bit input, the LUT simply looks up the corresponding bit in its memory and outputs it.
2.2 Example: 2-input AND gate
For a 2-input LUT (LUT2), the truth table for Y = A AND B is:
| A | B | Y |
|---|---|---|
| 0 | 0 | 0 |
| 0 | 1 | 0 |
| 1 | 0 | 0 |
| 1 | 1 | 1 |
The LUT configuration memory is programmed to [0, 0, 0, 1]. When (A, B) = (1, 1), the LUT outputs the bit at index 3, which is 1.
2.3 LUT capacity and flexibility
Modern FPGAs use LUT6 (6-input LUTs) as the standard. Why 6?
- Flexibility: Can implement any function of up to 6 inputs. This covers most common logic expressions.
- Efficiency: Larger LUTs (7, 8 inputs) are possible but increase memory and delay. Smaller LUTs (4 inputs) waste resources for complex functions.
- Fracturable: On many devices a LUT6 has two outputs and can act as two smaller LUTs (for example, two 5-input functions that share their inputs), so small functions don't each consume a whole LUT6.
2.4 LUT implementation example
Let’s implement Y = (A & B) | (C & D & E):
- The synthesis tool creates the truth table for this function.
- The bitstream programs the LUT’s configuration memory with the 64-bit truth table.
- At runtime, the LUT simply performs a memory lookup: config_mem[{A, B, C, D, E, F}].
This is why FPGAs can implement any combinational logic—as long as it fits within the LUT’s input limit.
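The lookup described above can be sketched as a small behavioral model. This is only an illustration, not a vendor primitive: the module names `lut6_model` and `lut6_model_tb` and the parameter `INIT` are assumptions made for this example, and the index order follows the text (A is the most significant address bit). The INIT value is the truth table of Y = (A & B) | (C & D & E), worked out by hand, and the testbench checks it against the expression for all 64 input combinations.

```verilog
// Behavioral model of a 6-input LUT: a 64-bit truth table plus a bit select.
// lut6_model and INIT are illustrative names, not a vendor primitive.
module lut6_model #(
    parameter [63:0] INIT = 64'h0000_0000_0000_0000
)(
    input  wire [5:0] addr,   // {A, B, C, D, E, F}, A is the MSB
    output wire       y
);
    wire [63:0] config_mem = INIT;   // stands in for the SRAM configuration bits
    assign y = config_mem[addr];     // the "lookup"
endmodule

// Self-checking testbench for Y = (A & B) | (C & D & E); F is unused.
module lut6_model_tb;
    reg  [5:0] addr;
    wire       y;
    integer    i, errors;

    // Bits 48..63 cover A & B; bits 14, 15 of every 16-bit group cover C & D & E
    lut6_model #(.INIT(64'hFFFF_C000_C000_C000)) dut (.addr(addr), .y(y));

    initial begin
        errors = 0;
        for (i = 0; i < 64; i = i + 1) begin
            addr = i[5:0];
            #1;
            if (y !== ((addr[5] & addr[4]) | (addr[3] & addr[2] & addr[1])))
                errors = errors + 1;
        end
        if (errors == 0) $display("LUT model matches the expression");
        else             $display("%0d mismatches", errors);
        $finish;
    end
endmodule
```

Changing only the INIT value turns the same structure into any other function of up to six inputs, which is the point of the LUT.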
2.5 LUT fracturable modes
Modern Xilinx FPGAs (7 series, UltraScale, etc.) have fracturable LUTs:
- One LUT6 can implement any single 6-input function.
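- Or it can implement two functions of up to 5 inputs each, provided both use the same five inputs (the LUT then drives its two outputs, O5 and O6).
- Two smaller functions with different inputs can also share one LUT6, as long as they need no more than five distinct inputs in total (for example, a 3-input and a 2-input function).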
This flexibility increases resource utilization. If you only need a 3-input function, you don’t waste an entire LUT6.
2.6 Carry logic and arithmetic
LUTs alone are inefficient for arithmetic (adders, counters). FPGAs include dedicated carry chains alongside LUTs:
- Each logic cell has a fast carry-in and carry-out.
- These are hardwired connections that bypass the programmable interconnect, creating a ripple-carry adder with minimal delay.
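- An N-bit adder uses N LUTs plus N carry stages. The carry still ripples through all N bits, but each stage adds only a very small, fixed delay because the signal never leaves the dedicated carry path.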
When you write a + b in Verilog, the synthesis tool infers a carry chain.
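As a minimal sketch of what that looks like in practice, the adder below keeps the carry-out by assigning to a concatenation. The module name `carry_adder` and the parameter `N` are illustrative; whether the tool maps this onto the dedicated carry chain depends on the synthesis tool and the width, although that is the usual result.

```verilog
// Parameterized adder with carry-in and carry-out.
// carry_adder and N are illustrative names chosen for this sketch.
module carry_adder #(
    parameter N = 16
)(
    input  wire [N-1:0] a,
    input  wire [N-1:0] b,
    input  wire         cin,
    output wire [N-1:0] sum,
    output wire         cout
);
    // The left-hand side is N+1 bits wide, so the addition is evaluated
    // at N+1 bits and the carry out of the top bit is kept.
    assign {cout, sum} = a + b + cin;
endmodule
```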
3. The complete logic cell (slice)
A single LUT doesn’t exist in isolation. It’s part of a larger structure called a slice (Xilinx) or logic element (LE) (Intel).
3.1 Xilinx 7-series slice structure (SLICEL)
A typical slice in Xilinx 7-series FPGAs contains:
- 4 × LUT6: Four 6-input LUTs (can be fractured into 8 × LUT5).
- 8 × flip-flops: Two storage elements per LUT, used to register the LUT outputs (or to bypass the LUT and be used independently).
- Multiplexers: To route LUT outputs to flip-flops or directly to outputs.
- Carry chain logic: CARRY4 primitive for fast arithmetic.
- Wide multiplexers: For efficient mux trees (using the MUXF7 and MUXF8 primitives).
3.2 SLICEL vs. SLICEM
Xilinx has two types of slices:
- SLICEL (logic-only): Standard logic cells. LUTs can only be used as lookup tables.
- SLICEM (memory-capable): LUTs can be configured as small distributed RAMs (32×1, 64×1) or shift registers (SRL16E, SRLC32E). These are more flexible but scarcer (roughly one third of the slices in 7-series devices are SLICEM).
If you need small memories or shift registers, SLICEM is essential.
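A shift register that the tools can map onto SLICEM LUTs might be written as below. This is a sketch under assumptions: the module name `delay_line` and its parameters are made up for the example, and whether an SRL is actually inferred depends on the tool settings and the depth.

```verilog
// Fixed-length delay line written so that it can map to SRL-style LUTs.
// delay_line, WIDTH and DEPTH are illustrative names.
module delay_line #(
    parameter WIDTH = 8,
    parameter DEPTH = 16          // delay in clock cycles
)(
    input  wire             clk,
    input  wire             ce,   // shift only when high
    input  wire [WIDTH-1:0] din,
    output wire [WIDTH-1:0] dout
);
    reg [WIDTH-1:0] taps [0:DEPTH-1];
    integer i;

    // No reset on purpose: SRL primitives have no reset input, so adding
    // one would push the tool back to ordinary flip-flops.
    always @(posedge clk) begin
        if (ce) begin
            taps[0] <= din;
            for (i = 1; i < DEPTH; i = i + 1)
                taps[i] <= taps[i-1];
        end
    end

    assign dout = taps[DEPTH-1];
endmodule
```

Only the last tap is read, which is what allows a compact mapping; reading every tap in parallel would need individual flip-flops.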
3.3 Logic cell block diagram
┌──────────────────────────────────────┐
│ Slice (SLICEL) │
│ │
│ ┌─────┐ ┌────┐ ┌─────┐ │
A ───┼─→│LUT6 │───→│ MUX├───→│ FF ├──→ Q │
B ───┼─→│ A │ └────┘ └─────┘ │
C ───┼─→└─────┘ │
D ───┤ │
E ───┤ ┌─────┐ ┌────┐ ┌─────┐ │
F ───┼─→│LUT6 │───→│ MUX├───→│ FF ├──→ Q │
│ │ B │ └────┘ └─────┘ │
│ └─────┘ │
│ ... (2 more LUT6 + FF pairs) │
│ │
│ ┌────────────────────────┐ │
│ │ Carry Chain (CARRY4) │ │
│ └────────────────────────┘ │
└──────────────────────────────────────┘
3.4 Flip-flops in detail
Each flip-flop in a slice has:
- D input: Data input (usually from a LUT or interconnect).
- Clock: Connected to a global or regional clock net.
- Clock enable (CE): Allows conditional updating (e.g., if (enable) q <= d;).
- Set/Reset (SR): Synchronous or asynchronous set or reset.
The synthesis tool automatically infers flip-flops when you write:
always @(posedge clk)
q <= d;
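A slightly fuller sketch uses the clock-enable and set/reset pins listed above. The module name `reg_ce_sr` and the port names are illustrative; the coding pattern (reset checked first, then enable) is the one synthesis tools commonly map straight onto the flip-flop's SR and CE inputs.

```verilog
// Register with synchronous reset and clock enable.
// reg_ce_sr and its port names are illustrative.
module reg_ce_sr #(
    parameter WIDTH = 8
)(
    input  wire             clk,
    input  wire             rst,   // synchronous, active high
    input  wire             ce,    // clock enable
    input  wire [WIDTH-1:0] d,
    output reg  [WIDTH-1:0] q
);
    always @(posedge clk) begin
        if (rst)
            q <= {WIDTH{1'b0}};    // maps to the SR pin
        else if (ce)
            q <= d;                // maps to the CE pin
        // otherwise q holds its value
    end
endmodule
```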
4. DSP slices: hardened arithmetic
4.1 What is a DSP slice?
Modern FPGAs include DSP blocks (Xilinx calls them DSP48E1/DSP48E2)—hardened arithmetic units optimized for signal processing and math-heavy applications.
A typical DSP slice contains:
- 25×18-bit multiplier (or 27×18 in newer devices)
- 48-bit accumulator
- Pre-adder (for symmetric filters)
- Pattern detector (for rounding, overflow detection)
- Pipeline registers (multiple stages for high-speed operation)
4.2 DSP48E1 block diagram (simplified)
┌─────────────────────────────────────┐
│ DSP48E1 │
│ │
A ───┤ ┌──────────┐ ┌──────────────┐ │
(25b) │ │ Pre-adder│───→│ │ │
D ───┤ │ (D ± A) │ │ Multiplier │ │
(25b) │ └──────────┘ │ 25 × 18 │──┤
│ │ │ │
B ───┤─────────────────→│ │ │
(18b) │ └──────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────┐ │
C ───┤─────────────────→│ ALU + Accum. │──┼──→ P (48b)
(48b) │ │ (48-bit) │ │
│ └──────────────┘ │
│ │
│ [Pipeline registers at each stage] │
└─────────────────────────────────────┘
4.3 Common DSP operations
- Multiply: P = A × B
- Multiply-accumulate (MAC): P = P + (A × B), critical for filters, dot products, neural networks.
- Pre-add + multiply: P = (D ± A) × B, efficient for symmetric FIR filters.
- Multiply-add: P = (A × B) + C, single-cycle multiply-add.
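The multiply-accumulate case can be sketched in inferable Verilog as follows. The module name `mac_pipelined` and the signal names are assumptions for illustration; the register stages are placed where a DSP slice has its input, multiplier, and output registers, which gives the tool the chance to pack everything into one slice, but the actual mapping should be confirmed in the synthesis report.

```verilog
// Pipelined signed multiply-accumulate: p <= p + a * b.
// mac_pipelined and the register names are illustrative.
module mac_pipelined (
    input  wire               clk,
    input  wire               clear,   // synchronous clear of the accumulator
    input  wire signed [24:0] a,
    input  wire signed [17:0] b,
    output reg  signed [47:0] p
);
    reg signed [24:0] a_reg;   // input register
    reg signed [17:0] b_reg;   // input register
    reg signed [42:0] m_reg;   // product register: 25 + 18 = 43 bits

    always @(posedge clk) begin
        a_reg <= a;
        b_reg <= b;
        m_reg <= a_reg * b_reg;
        if (clear)
            p <= 48'sd0;
        else
            p <= p + m_reg;    // accumulator register
    end
endmodule
```

Note that `clear` acts on the accumulator directly, while data takes two extra cycles to reach it, so a surrounding design has to account for that offset.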
4.4 Why use DSP slices?
- Performance: DSP blocks can run at 500–900 MHz (depending on the device), much faster than logic-based multipliers.
- Resource efficiency: A 25×18 multiplier in logic would consume hundreds of LUTs and have poor timing.
- Power: Hardened blocks are more power-efficient than equivalent soft logic.
4.5 When to use DSP slices
Use DSP blocks for:
- Multiplications (16-bit and larger)
- Multiply-accumulate operations (filters, convolution, matrix multiply)
- Wide adders and subtractors (the post-adder can be used standalone)
- Counters and accumulators
Don’t use them for:
- Small multipliers (e.g., multiply by a constant power of 2—use shifts)
- Simple logic—LUTs are better
4.6 Inference vs. instantiation
Inference (recommended):
// Synthesis tool infers a DSP block
always @(posedge clk) begin
product <= a * b;
end
Instantiation (for fine control):
DSP48E1 #(
.AREG(1),
.BREG(1),
.PREG(1),
// ... many parameters
) dsp_inst (
.A(a),
.B(b),
.P(product),
.CLK(clk),
// ... many ports
);
Instantiation gives you precise control over pipelining and features but is vendor-specific and harder to port.
5. Block RAM (BRAM): embedded memory
5.1 What is BRAM?
FPGAs have dedicated memory blocks called Block RAM (BRAM) or M20K (Intel). These are much more efficient than using LUTs as distributed RAM.
Xilinx 7-series BRAM characteristics:
- 36 Kb per block (can be split into two 18 Kb blocks)
- True dual-port: Two independent read/write ports (A and B)
- Configurable width and depth (e.g., 32K×1, 16K×2, 8K×4, 4K×9, 2K×18, 1K×36)
- Optional output registers for improved timing
- Byte-enable signals for partial writes
- Built-in ECC (error correction) in some devices
5.2 BRAM configurations
A single 36 Kb BRAM can be configured in many ways:
| Configuration | Width | Depth | Notes |
|---|---|---|---|
| 1K × 36 | 36 | 1024 | Maximum width in true dual-port mode (simple dual-port mode also offers 512 × 72) |
| 2K × 18 | 18 | 2048 | Common for 16-bit data |
| 4K × 9 | 9 | 4096 | 8 data + 1 parity bit |
| 8K × 4 | 4 | 8192 | Narrow, deep memory |
| 16K × 2 | 2 | 16384 | |
| 32K × 1 | 1 | 32768 | Maximum depth (bit-serial) |
You can also split one 36 Kb block into two independent 18 Kb blocks.
5.3 True dual-port operation
Each BRAM has two ports (A and B), and both can:
- Read and write independently
- Access different addresses in the same cycle
- Have different clocks (asynchronous dual-port)
This is perfect for:
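- Register files (for the two read ports and one write port of a typical CPU, a common approach is two BRAM copies written together, since each BRAM has only two ports)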
- Ping-pong buffers (one port writes, the other reads)
- Multi-clock domain FIFOs
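A true dual-port memory with independent clocks can be described with one always block per port. This is a sketch in the style of common inference templates; the module name `true_dual_port_ram` and the port names are assumptions, and the behavior when both ports write the same address in the same cycle is undefined and has to be avoided by the surrounding design.

```verilog
// True dual-port RAM: each port has its own clock and can read or write.
// true_dual_port_ram and the port names are illustrative.
module true_dual_port_ram #(
    parameter WIDTH      = 16,
    parameter DEPTH      = 1024,
    parameter ADDR_WIDTH = $clog2(DEPTH)
)(
    // Port A
    input  wire                  clka,
    input  wire                  wea,
    input  wire [ADDR_WIDTH-1:0] addra,
    input  wire [WIDTH-1:0]      dina,
    output reg  [WIDTH-1:0]      douta,
    // Port B
    input  wire                  clkb,
    input  wire                  web,
    input  wire [ADDR_WIDTH-1:0] addrb,
    input  wire [WIDTH-1:0]      dinb,
    output reg  [WIDTH-1:0]      doutb
);
    (* ram_style = "block" *)
    reg [WIDTH-1:0] mem [0:DEPTH-1];

    always @(posedge clka) begin
        if (wea)
            mem[addra] <= dina;
        douta <= mem[addra];   // old data is returned while writing
    end

    always @(posedge clkb) begin
        if (web)
            mem[addrb] <= dinb;
        doutb <= mem[addrb];
    end
endmodule
```

For a ping-pong buffer, port A would only write and port B would only read, with the two halves of the address space swapped each frame.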
5.4 Example: simple dual-port RAM
module simple_dual_port_ram #(
parameter WIDTH = 32,
parameter DEPTH = 1024,
parameter ADDR_WIDTH = $clog2(DEPTH)
)(
input wire clk,
// Write port
input wire we,
input wire [ADDR_WIDTH-1:0] waddr,
input wire [WIDTH-1:0] din,
// Read port
input wire [ADDR_WIDTH-1:0] raddr,
output reg [WIDTH-1:0] dout
);
(* ram_style = "block" *) // Hint to use BRAM, not distributed RAM
reg [WIDTH-1:0] mem [0:DEPTH-1];
always @(posedge clk) begin
if (we)
mem[waddr] <= din;
dout <= mem[raddr]; // Synchronous read (one cycle of latency), needed for BRAM inference
end
endmodule
The (* ram_style = "block" *) attribute tells the synthesis tool to use BRAM. Without it, small memories might be implemented in distributed RAM (LUTs).
5.5 Output register modes
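BRAM reads are always synchronous: the data appears one clock cycle after the address is presented. Two settings control what the output port shows:
- Write mode (READ_FIRST, WRITE_FIRST, or NO_CHANGE): Determines whether a port that is writing shows the old data, the new data, or keeps its previous output.
- Output register: The BRAM has an optional built-in output register, which adds a second cycle of read latency but improves timing closure.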
For high-speed designs, always use the output register—it’s “free” in terms of resources (the register is part of the BRAM tile).
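In inferred code, the output register is usually expressed as a second register after the memory read. The sketch below assumes a single-port memory; the names `bram_with_outreg` and `ram_q` are illustrative, and whether the second register is absorbed into the BRAM tile is up to the tool (it generally requires that the register has no incompatible reset).

```verilog
// Single-port RAM with a two-cycle read: array read plus output register.
// bram_with_outreg and ram_q are illustrative names.
module bram_with_outreg #(
    parameter WIDTH      = 32,
    parameter DEPTH      = 1024,
    parameter ADDR_WIDTH = $clog2(DEPTH)
)(
    input  wire                  clk,
    input  wire                  we,
    input  wire [ADDR_WIDTH-1:0] addr,
    input  wire [WIDTH-1:0]      din,
    output reg  [WIDTH-1:0]      dout
);
    (* ram_style = "block" *)
    reg [WIDTH-1:0] mem [0:DEPTH-1];
    reg [WIDTH-1:0] ram_q;        // value read from the memory array

    always @(posedge clk) begin
        if (we)
            mem[addr] <= din;
        ram_q <= mem[addr];       // cycle 1: synchronous array read
        dout  <= ram_q;           // cycle 2: output register
    end
endmodule
```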
5.6 BRAM resource calculation
Example: You need 128 KB of memory in a Xilinx 7-series FPGA. How many BRAMs?
- 128 KB = 128 × 1024 bytes = 131,072 bytes = 1,048,576 bits
- Each BRAM = 36 Kb = 36,864 bits
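- Number of BRAMs = 1,048,576 / 36,864 ≈ 28.4 blocks → 29 BRAMs (round up). This assumes every bit, including the parity bits, holds data. With 8-, 16-, or 32-bit words only 32 Kb (32,768 bits) per block is usable, so budget 1,048,576 / 32,768 = 32 BRAMs.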
Always check the datasheet for the exact count of BRAMs in your device (e.g., Artix-7 100T has 135 BRAMs).
5.7 When to use BRAM vs. distributed RAM
| Use BRAM when… | Use distributed RAM when… |
|---|---|
| Memory is large (> 1 Kb) | Memory is tiny (< 256 bits) |
| You need dual-port access | You need many small, independent RAMs |
| Timing is critical (use output regs) | You have LUTs to spare |
| You want to save logic resources | BRAMs are scarce and logic is plenty |
6. Clock resources: PLLs, MMCMs, and clock networks
6.1 Clock domains and distribution
FPGAs have dedicated global clock networks with low skew and low jitter. These are essential for synchronous designs.
Xilinx 7-series has:
- 32 global clock buffers (BUFG) per device, of which up to 12 global clocks can be used within any one clock region
- Regional clocks for lower power (BUFRs)
- High-speed regional clocks (BUFIOs) for I/O serialization (e.g., DDR, LVDS)
Using a BUFG ensures that your clock reaches every flip-flop with low skew across the entire chip (the device datasheet gives the exact figures for each part and speed grade).
6.2 PLLs (Phase-Locked Loops)
A PLL is a clock generation circuit that can:
- Multiply the input clock frequency (e.g., 100 MHz → 200 MHz)
- Divide the input clock (e.g., 100 MHz → 25 MHz)
- Phase-shift the clock (e.g., 90° shift for DDR capture)
- Filter jitter from the input clock
Xilinx devices have multiple PLLs (often called PLLE2 in 7-series).
6.3 MMCMs (Mixed-Mode Clock Managers)
An MMCM is a more advanced version of a PLL, with additional features:
- Multiple output clocks (up to 7 in Xilinx 7-series)
- Dynamic reconfiguration (change frequency on-the-fly)
- Fine phase adjustment (in ps steps)
- Jitter filtering and dynamic phase shifting
MMCMs are used when you need multiple clock domains (e.g., 50 MHz, 100 MHz, 200 MHz) from a single input.
6.4 MMCM block diagram (simplified)
┌──────────────────────────────┐
│ MMCM │
│ │
CLK_IN ───────────┤→ Phase Detector → VCO │
│ ↓ │
CLKFB_IN ─────────┤─ Feedback Divider (/M) │
│ │
│ VCO freq = CLK_IN × M / D (D = input divider) │
│ │
│ ┌──────────────────┐ │
│ │ Output Dividers │ │
│ │ (/D0, /D1...) │ │
│ └──────────────────┘ │
│ │ │ │ │ │
└─────────┼──┼──┼──┼───────────┘
│ │ │ │
CLK0 CLK1 CLK2 ...
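The frequency relationships behind the diagram can be written out explicitly. Here $M$ is the feedback multiplier (CLKFBOUT_MULT_F in the 7-series primitive), $D$ is the input divider (DIVCLK_DIVIDE), and $O_k$ is the divider of output $k$; the symbol names themselves are just notation chosen for this note.

$$
f_{\text{VCO}} = f_{\text{in}} \times \frac{M}{D},
\qquad
f_{\text{out},k} = \frac{f_{\text{VCO}}}{O_k}
$$

For instance, with $f_{\text{in}} = 100$ MHz, $M = 8$ and $D = 1$, the VCO runs at 800 MHz, and output dividers of 16 and 4 give 50 MHz and 200 MHz. The VCO frequency must also stay inside the range the datasheet allows for the device and speed grade.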
6.5 Example: generating 50 MHz and 200 MHz from 100 MHz input
MMCME2_BASE #(
.CLKIN1_PERIOD(10.0), // 100 MHz input = 10 ns period
.CLKFBOUT_MULT_F(8.0), // VCO = 100 MHz × 8 = 800 MHz
.CLKOUT0_DIVIDE_F(16.0), // CLK0 = 800 MHz / 16 = 50 MHz
.CLKOUT1_DIVIDE(4) // CLK1 = 800 MHz / 4 = 200 MHz
) mmcm_inst (
.CLKIN1(clk_in),
.CLKFBOUT(clkfb),
.CLKFBIN(clkfb), // Feedback loop
.CLKOUT0(clk_50),
.CLKOUT1(clk_200),
.RST(reset),
.PWRDWN(1'b0), // Keep the MMCM powered up
.LOCKED(locked) // High when MMCM is stable
);
Always wait for LOCKED to go high before using the output clocks.
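One common way to act on that advice is to derive the reset of each clock domain from LOCKED. The sketch below is an assumption-laden illustration: the module name `reset_after_lock`, the parameter `STAGES` (which must be at least 2), and the choice of an active-high reset are all made up for this example.

```verilog
// Holds logic in reset until the MMCM has locked.
// Reset asserts as soon as LOCKED drops and releases synchronously.
// reset_after_lock and STAGES are illustrative names; STAGES must be >= 2.
module reset_after_lock #(
    parameter STAGES = 4
)(
    input  wire clk,      // one of the MMCM output clocks
    input  wire locked,   // LOCKED output of the MMCM
    output wire rst       // active-high reset for logic clocked by clk
);
    reg [STAGES-1:0] hold = {STAGES{1'b1}};

    always @(posedge clk or negedge locked) begin
        if (!locked)
            hold <= {STAGES{1'b1}};              // assert immediately
        else
            hold <= {hold[STAGES-2:0], 1'b0};    // shift in zeros once locked
    end

    assign rst = hold[STAGES-1];
endmodule
```

With the default of four stages, `rst` stays high until LOCKED has been high for four consecutive rising edges of `clk`.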
6.6 Clock domain crossing (CDC)
When you have multiple clock domains, you need to handle clock domain crossings (CDC) carefully:
- Asynchronous FIFOs: For transferring data between domains.
- Synchronizer chains: Two or more flip-flops to avoid metastability (for control signals).
- Handshake protocols: Request/acknowledge for safe data transfer.
Never directly connect a signal from one clock domain to another without proper synchronization—this causes metastability and intermittent failures.
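For a single-bit control signal, the synchronizer chain mentioned above is short enough to show in full. The module and signal names (`sync_2ff`, `meta`, `stable`) are illustrative. `ASYNC_REG` is a Vivado attribute that asks the tool to keep the two flip-flops together; other tools use different markers. This structure is only suitable for single bits or Gray-coded values, not for arbitrary multi-bit buses.

```verilog
// Two-flip-flop synchronizer for one asynchronous bit.
// sync_2ff, meta and stable are illustrative names.
module sync_2ff (
    input  wire clk_dst,    // destination-domain clock
    input  wire async_in,   // signal coming from another clock domain
    output wire sync_out
);
    (* ASYNC_REG = "TRUE" *) reg meta   = 1'b0;
    (* ASYNC_REG = "TRUE" *) reg stable = 1'b0;

    always @(posedge clk_dst) begin
        meta   <= async_in;   // may go metastable
        stable <= meta;       // gets a full clock period to settle
    end

    assign sync_out = stable;
endmodule
```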
7. Vendor-specific primitives
7.1 What are primitives?
Primitives are low-level building blocks provided by the FPGA vendor. They give you direct control over hardware features that might not be easily inferred by synthesis tools.
Examples:
- BUFG: Global clock buffer
- CARRY4: 4-bit carry logic
- DSP48E1: DSP slice
- RAMB36E1: 36 Kb block RAM
- IBUF/OBUF: Input/output buffers
- ODDR/IDDR: Output/input DDR (double-data-rate) registers
- SRL16E/SRLC32E: Shift register LUTs
7.2 Inference vs. instantiation
Inference (recommended for portability):
- Write standard Verilog/VHDL
- Let the synthesis tool map to primitives
- Easier to port to different FPGA families
Instantiation (for fine control or special features):
- Directly instantiate a primitive in your code
- Vendor-specific, not portable
- Necessary for features that can’t be inferred (e.g., ODDR, IDELAYCTRL)
7.3 Example: BUFG for clock buffering
Inference (synthesis tool adds BUFG automatically):
wire clk_internal;
assign clk_internal = clk_in; // Synthesis adds BUFG
Instantiation (explicit control):
BUFG bufg_inst (
.I(clk_in),
.O(clk_buffered)
);
7.4 Example: ODDR for DDR output
You can’t infer an ODDR (output double-data-rate register) with standard Verilog—you must instantiate it:
ODDR #(
.DDR_CLK_EDGE("SAME_EDGE")
) oddr_inst (
.Q(ddr_out), // DDR output
.C(clk), // Clock
.CE(1'b1), // Clock enable (always enabled)
.D1(data_rise), // Data on rising edge
.D2(data_fall), // Data on falling edge
.R(1'b0),
.S(1'b0)
);
This primitive drives data_rise onto the output after the rising edge of clk and data_fall after the falling edge, effectively doubling the data rate. In SAME_EDGE mode, both D1 and D2 are sampled on the rising edge.
7.5 When to instantiate primitives
- BUFG: Usually auto-inferred, but instantiate if you need precise control over clock routing.
- MMCM/PLL: Always instantiate (or use the Clock Wizard IP).
- DSP48: Inference is fine for most cases; instantiate for exotic modes.
- BRAM: Inference is usually fine; instantiate for specific cascade or ECC modes.
- ODDR/IDDR: Must instantiate (no inference).
- IDELAYCTRL/IDELAYE2: Must instantiate (for input delay calibration).
7.6 Reading the primitive library
Every FPGA vendor provides a Libraries Guide (e.g., Xilinx 7 Series Libraries Guide (Vivado Design Suite)) that documents every primitive:
- Available parameters
- Port descriptions
- Timing characteristics
- Usage examples
Bookmark this guide—you’ll refer to it constantly when doing low-level FPGA work.
8. Resource utilization and optimization
8.1 Understanding utilization reports
After synthesis and implementation, you get a utilization report showing how many resources your design uses:
+----------------------------+-------+
| Resource | Used |
+----------------------------+-------+
| Slice LUTs | 12345 |
| LUT as Logic | 11000 |
| LUT as Memory | 1345 |
| Slice Registers | 18000 |
| Block RAM Tile | 45 |
| DSPs | 32 |
| BUFG | 4 |
| MMCM | 2 |
+----------------------------+-------+
Compare this to your device’s total resources (e.g., Artix-7 100T has 63,400 LUTs, 126,800 FFs, 135 BRAMs, 240 DSPs).
8.2 Optimization strategies
Use the right resource for the job
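- Arithmetic: Use DSP blocks for multipliers and multiply-accumulates; plain adders and counters usually map well to LUTs with carry chains.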
- Memory: Use BRAM, not distributed RAM (unless memory is tiny).
- Shift registers: Use SRL primitives (SLICEM), not chains of flip-flops.
Pipeline for speed
Add registers between logic stages to break up long combinational paths:
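// Before: long combinational path
assign result = ((a + b) * c) - (d & e);
// After: pipelined (3 stages). c, d and e are delayed alongside the data
// so that every stage combines operands from the same input cycle.
always @(posedge clk) begin
// Stage 1
stage1 <= a + b;
c_d1 <= c;
de_d1 <= d & e;
// Stage 2
stage2 <= stage1 * c_d1;
de_d2 <= de_d1;
// Stage 3
result <= stage2 - de_d2;
end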
This increases latency but allows higher clock frequency.
Avoid unnecessary logic
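- Don't compare to 1'b1 or 1'b0: if (enable == 1'b1) synthesizes to the same logic as if (enable) for a 1-bit signal, so it costs nothing, but the short form is easier to read.
- Size signals to what you need: Oversized hard-coded bit widths waste LUTs and flip-flops. Parameterized modules let each instance use only the width it requires.
- Remove debug logic in production: Debug counters and other debug-only registers that drive outputs consume resources. $display is ignored by synthesis, and unused signals are trimmed automatically, so those cost nothing.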
Share resources
If two operations never happen simultaneously, use the same resource:
always @(posedge clk) begin
if (state == LOAD)
result <= a + b;
else if (state == COMPUTE)
result <= c + d; // Synthesis can share the adder
end
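If the tool does not share the adder on its own, the sharing can be made explicit by multiplexing the operands in front of a single adder. This is a sketch with illustrative names (`shared_adder`, `sel`); unlike the snippet above it updates `result` every cycle, so a real design would add the state decoding and enable it needs.

```verilog
// One adder shared by two operations, selected by sel.
// shared_adder and sel are illustrative names.
module shared_adder #(
    parameter WIDTH = 16
)(
    input  wire             clk,
    input  wire             sel,    // 0: a + b, 1: c + d
    input  wire [WIDTH-1:0] a,
    input  wire [WIDTH-1:0] b,
    input  wire [WIDTH-1:0] c,
    input  wire [WIDTH-1:0] d,
    output reg  [WIDTH-1:0] result
);
    // Operand multiplexers in front of the single adder
    wire [WIDTH-1:0] x = sel ? c : a;
    wire [WIDTH-1:0] y = sel ? d : b;

    always @(posedge clk)
        result <= x + y;
endmodule
```

The trade is two multiplexers for one adder, which pays off mainly for wide or expensive operators such as multipliers.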
8.3 Retiming
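Modern synthesis tools can retime your design, that is, move registers across combinational logic to balance delays. In Vivado, retiming can be enabled for the whole design with the -retiming option of synth_design, or for one module instance (my_module_inst is a placeholder name) with a block-level synthesis property in the XDC file:
set_property BLOCK_SYNTH.RETIMING 1 [get_cells my_module_inst]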
This can significantly improve timing without changing functionality.
9. Practical examples
9.1 Example 1: FIR filter using DSP blocks
A simple 4-tap FIR filter:
module fir_filter (
input wire clk,
input wire signed [15:0] x_in, // Input sample (signed)
output reg signed [31:0] y_out // Output sample (signed)
);
// Coefficients (fixed-point, Q15: value / 32768)
parameter signed [15:0] H0 = 16'sh1000; // 0.125 in Q15 format
parameter signed [15:0] H1 = 16'sh2000; // 0.25
parameter signed [15:0] H2 = 16'sh2000; // 0.25
parameter signed [15:0] H3 = 16'sh1000; // 0.125
// Delay line
reg signed [15:0] x[0:3];
always @(posedge clk) begin
// Shift delay line
x[0] <= x_in;
x[1] <= x[0];
x[2] <= x[1];
x[3] <= x[2];
// Compute FIR output (synthesis infers DSP blocks)
y_out <= (H0 * x[0]) + (H1 * x[1]) + (H2 * x[2]) + (H3 * x[3]);
end
endmodule
The synthesis tool will typically infer DSP48E1 blocks for the multiplications and, where the structure allows, for the additions as well. Check the utilization report to confirm.
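Because the coefficients of this filter are symmetric (H0 = H3 and H1 = H2), the pre-adder idea from section 4.3 applies: samples that share a coefficient are added first, which halves the number of multiplications. The following is a sketch, not the author's implementation; the module name `fir_symmetric` and the internal names are assumptions, the output is widened to 34 bits to hold the full result, and it has one more cycle of latency than the direct form.

```verilog
// 4-tap symmetric FIR using pre-addition: y = H0*(x0 + x3) + H1*(x1 + x2).
// fir_symmetric and the internal names are illustrative.
module fir_symmetric (
    input  wire               clk,
    input  wire signed [15:0] x_in,
    output reg  signed [33:0] y_out
);
    localparam signed [15:0] H0 = 16'sh1000;   // outer taps (H0 = H3)
    localparam signed [15:0] H1 = 16'sh2000;   // inner taps (H1 = H2)

    reg signed [15:0] x0, x1, x2, x3;          // delay line
    reg signed [16:0] pre_outer, pre_inner;    // pre-adder results, one extra bit

    always @(posedge clk) begin
        // Shift delay line
        x0 <= x_in;
        x1 <= x0;
        x2 <= x1;
        x3 <= x2;
        // Pre-add the samples that share a coefficient
        pre_outer <= x0 + x3;
        pre_inner <= x1 + x2;
        // Two multiplications instead of four
        y_out <= (H0 * pre_outer) + (H1 * pre_inner);
    end
endmodule
```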
9.2 Example 2: Asynchronous FIFO for CDC
module async_fifo #(
parameter DATA_WIDTH = 32,
parameter DEPTH = 16,
parameter ADDR_WIDTH = $clog2(DEPTH)
)(
input wire wr_clk,
input wire wr_en,
input wire [DATA_WIDTH-1:0] wr_data,
output wire wr_full,
input wire rd_clk,
input wire rd_en,
output wire [DATA_WIDTH-1:0] rd_data,
output wire rd_empty,
input wire rst
);
// Gray code counters for read/write pointers
reg [ADDR_WIDTH:0] wr_ptr, rd_ptr;
reg [ADDR_WIDTH:0] wr_ptr_gray, rd_ptr_gray;
reg [ADDR_WIDTH:0] rd_ptr_gray_sync1, rd_ptr_gray_sync2; // Synchronizer
reg [ADDR_WIDTH:0] wr_ptr_gray_sync1, wr_ptr_gray_sync2;
// Dual-port RAM (infers BRAM)
(* ram_style = "block" *)
reg [DATA_WIDTH-1:0] mem [0:DEPTH-1];
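// Write pointer (asynchronously reset)
always @(posedge wr_clk or posedge rst) begin
if (rst)
wr_ptr <= 0;
else if (wr_en && !wr_full)
wr_ptr <= wr_ptr + 1;
end
// Memory write in its own block, without reset, which helps RAM inference
always @(posedge wr_clk) begin
if (wr_en && !wr_full)
mem[wr_ptr[ADDR_WIDTH-1:0]] <= wr_data;
end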
// Gray code conversion for write pointer
always @(posedge wr_clk or posedge rst) begin
if (rst)
wr_ptr_gray <= 0;
else
wr_ptr_gray <= wr_ptr ^ (wr_ptr >> 1);
end
// Synchronize read pointer to write clock domain
always @(posedge wr_clk or posedge rst) begin
if (rst) begin
rd_ptr_gray_sync1 <= 0;
rd_ptr_gray_sync2 <= 0;
end else begin
rd_ptr_gray_sync1 <= rd_ptr_gray;
rd_ptr_gray_sync2 <= rd_ptr_gray_sync1;
end
end
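// Full when the Gray-coded write pointer equals the synchronized read pointer
// with its two MSBs inverted. The comparison uses the current wr_ptr: the
// registered wr_ptr_gray lags by one cycle and would report full one write too late.
wire [ADDR_WIDTH:0] wr_gray_now = wr_ptr ^ (wr_ptr >> 1);
assign wr_full = (wr_gray_now == {~rd_ptr_gray_sync2[ADDR_WIDTH:ADDR_WIDTH-1],
rd_ptr_gray_sync2[ADDR_WIDTH-2:0]});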
// Read logic (similar structure, omitted for brevity)
// ...
endmodule
This uses Gray code counters and a two-stage synchronizer to safely cross clock domains.
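The Gray-code conversion is what makes the pointer crossing safe: consecutive values differ in exactly one bit, so a synchronizer that samples mid-transition sees either the old or the new pointer, never a mix. The sketch below isolates the two conversions and checks both properties in simulation; the module names `bin2gray`, `gray2bin`, and `gray_tb` are illustrative.

```verilog
// Binary <-> Gray converters with a self-checking testbench.
// bin2gray, gray2bin and gray_tb are illustrative names.
module bin2gray #(parameter WIDTH = 5) (
    input  wire [WIDTH-1:0] bin,
    output wire [WIDTH-1:0] gray
);
    assign gray = bin ^ (bin >> 1);
endmodule

module gray2bin #(parameter WIDTH = 5) (
    input  wire [WIDTH-1:0] gray,
    output reg  [WIDTH-1:0] bin
);
    integer i;
    always @* begin
        // Binary bit i is the XOR of Gray bits WIDTH-1 down to i
        for (i = 0; i < WIDTH; i = i + 1)
            bin[i] = ^(gray >> i);
    end
endmodule

module gray_tb;
    reg  [4:0] value, prev_gray, diff;
    wire [4:0] gray, back;
    integer    i, errors;

    bin2gray #(.WIDTH(5)) enc (.bin(value), .gray(gray));
    gray2bin #(.WIDTH(5)) dec (.gray(gray), .bin(back));

    initial begin
        errors    = 0;
        prev_gray = 5'd0;
        for (i = 0; i < 32; i = i + 1) begin
            value = i[4:0];
            #1;
            if (back !== value) errors = errors + 1;          // round trip
            diff = gray ^ prev_gray;
            if (i > 0 && (diff == 0 || (diff & (diff - 1)) != 0))
                errors = errors + 1;                           // exactly one bit flips
            prev_gray = gray;
        end
        if (errors == 0) $display("Gray code checks passed");
        else             $display("%0d Gray code errors", errors);
        $finish;
    end
endmodule
```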
10. Reading FPGA datasheets
10.1 Key sections to understand
When evaluating an FPGA for a project, look for:
- Logic resources: Number of LUTs, FFs, slices.
- BRAM: Total memory capacity and number of blocks.
- DSP slices: Count and capability (e.g., 18×25 vs. 18×18).
- I/O pins: Total user I/O, supported standards (LVDS, LVTTL, HSTL, etc.), maximum toggle rate.
- Clock resources: Number of PLLs, MMCMs, global clock networks.
- Transceivers: If you need high-speed serial (PCIe, 10G Ethernet), check GTX/GTH/GTY counts and data rates.
- Hard IP: PCIe blocks, Ethernet MACs, memory controllers, processor cores.
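- Speed grade: Faster speed grades support higher clock frequencies but cost more and may consume more power. On Xilinx parts a larger number is faster (-3 is faster than -1); other vendors may number them the other way round.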
10.2 Example: Xilinx Artix-7 100T (XC7A100T)
| Resource | Count |
|---|---|
| Logic Cells | 101,440 |
| Slices | 15,850 |
| CLB Flip-Flops | 126,800 |
| Max Distributed RAM | 1,188 Kb |
| Block RAM (36 Kb) | 135 (4,860 Kb) |
| DSP Slices | 240 |
| PCIe Blocks | 1 |
| MMCM | 6 |
| PLL | 6 |
| User I/O | 170-300 (package-dependent) |
| GTP Transceivers | up to 8 (package-dependent) |
This is a mid-range FPGA suitable for signal processing, embedded systems, and moderate compute tasks.
10.3 Estimating if your design will fit
Rule of thumb:
- Logic: Aim to use < 70% of LUTs/FFs (routing congestion increases failure rates beyond that).
- BRAM: Sum your memories and add 10–20% margin.
- DSP: Count multipliers and MACs; ensure you have enough DSP blocks.
- I/O: Verify you have enough pins and the right standards.
If you’re close to 100% utilization, implementation will be slow, timing closure will be hard, and small changes can break the design. Leave headroom.
11. Common pitfalls and best practices
Pitfall 1: Underutilizing DSP blocks
Problem: Implementing multipliers in logic when DSP blocks are available.
Solution: Write clean arithmetic code (a * b) and let synthesis infer DSP blocks. Check the synthesis report to confirm DSP usage.
Pitfall 2: Ignoring clock domain crossings
Problem: Directly connecting signals between different clock domains causes metastability.
Solution: Always use synchronizers (two-FF chain minimum) for control signals, and async FIFOs for data buses.
Pitfall 3: Overusing global clocks
Problem: Using too many global clocks (> 32 in 7-series) forces some clocks onto regional or local routing, increasing skew and reducing performance.
Solution: Consolidate clock domains where possible. Use clock enables instead of multiple clocks.
Pitfall 4: Not pipelining DSP blocks
Problem: Using DSP blocks without internal pipeline registers, resulting in long combinational paths.
Solution: Enable all internal registers in DSP blocks (AREG, BREG, MREG, PREG). This allows DSP blocks to run at maximum speed (500+ MHz).
Pitfall 5: Ignoring timing reports
Problem: Assuming your design works because it compiled.
Solution: Always check the timing report. Look at WNS (Worst Negative Slack) for setup and WHS (Worst Hold Slack) for hold. If either is negative, the design is not guaranteed to work at the target frequency, even if it appears to run on the bench.
12. Advanced topics (preview)
These topics are beyond this session but important for serious FPGA work:
- Partial reconfiguration: Reconfiguring part of the FPGA while the rest keeps running.
- High-level synthesis (HLS): Writing FPGA designs in C/C++ instead of HDL.
- NoC (Network-on-Chip): Versal FPGAs have a built-in network for inter-block communication.
- AI engines: Versal AI Core devices have arrays of dedicated vector processors for AI/ML and DSP workloads (clocked at around 1 GHz or more, with INT8/INT16 MACs).
- SerDes (GTX/GTH/GTY): Multi-gigabit transceivers for PCIe, 10G/100G Ethernet, Aurora, etc.
- Memory interfaces: DDR3/DDR4 controllers, QDRII+, HBM2.
As you gain experience, these will become essential tools in your FPGA design toolkit.
13. Hands-on exercises
Exercise 1: LUT calculation
A 5-input LUT implements the function Y = (A & B) | (C & D & E). How many bits of configuration memory does it have? Write out the truth table (first 8 rows only).
Exercise 2: DSP resource estimation
You’re designing a 16-tap FIR filter with 16-bit coefficients and 16-bit input samples. How many DSP48E1 blocks will you need? (Assume one DSP per multiply-accumulate.)
Exercise 3: BRAM capacity
You need to store a 1920×1080 frame buffer with 24-bit color (8 bits per channel). How many Xilinx 36 Kb BRAMs are required?
Exercise 4: Clock generation
Given a 100 MHz input clock, design an MMCM configuration to generate:
- 25 MHz (for VGA)
- 125 MHz (for Ethernet PHY)
- 250 MHz (for internal logic)
Specify the VCO frequency and output dividers.
Exercise 5: Clock domain crossing
You have a signal valid in the 100 MHz domain that needs to cross into the 50 MHz domain. Draw the synchronizer circuit (at least two flip-flops). Why do you need two stages?
14. Solutions to exercises
Solution 1
A 5-input LUT has $2^5 = 32$ bits of configuration memory.
Truth table (first 8 rows):
| A | B | C | D | E | Y | Note |
|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | |
| 0 | 0 | 0 | 0 | 1 | 0 | |
| 0 | 0 | 0 | 1 | 0 | 0 | |
| 0 | 0 | 0 | 1 | 1 | 0 | |
| 0 | 0 | 1 | 0 | 0 | 0 | |
| 0 | 0 | 1 | 0 | 1 | 0 | |
| 0 | 0 | 1 | 1 | 0 | 0 | |
| 0 | 0 | 1 | 1 | 1 | 1 | ← (C & D & E) = 1 |
Solution 2
A 16-tap FIR filter needs 16 multiplications per sample. If you implement it as a fully parallel filter, you need 16 DSP blocks.
However, if you can tolerate multiple cycles per sample, you can time-multiplex a single DSP block (or a small number) and compute the taps sequentially. For maximum throughput (one sample per cycle), you need 16 DSPs.
Solution 3
Frame size = 1920 × 1080 × 24 bits = 49,766,400 bits
Each BRAM = 36,864 bits
Number of BRAMs = 49,766,400 / 36,864 = 1,350 BRAMs
(This is huge: ten times the 135 BRAMs of an Artix-7 100T. In practice, you'd use external DRAM, not BRAM, for frame buffers.)
Solution 4
Goal: Generate 25 MHz, 125 MHz, 250 MHz from 100 MHz.
Choose VCO frequency as a common multiple: 1000 MHz (1 GHz).
- VCO multiplier: 100 MHz × 10 = 1000 MHz → CLKFBOUT_MULT_F = 10.0
- Output dividers:
  - 25 MHz: 1000 / 40 = 25 → CLKOUT0_DIVIDE_F = 40.0
  - 125 MHz: 1000 / 8 = 125 → CLKOUT1_DIVIDE = 8
  - 250 MHz: 1000 / 4 = 250 → CLKOUT2_DIVIDE = 4
Configuration summary:
CLKIN1_PERIOD = 10.0 (100 MHz input)
CLKFBOUT_MULT_F = 10.0 (VCO = 1000 MHz)
CLKOUT0_DIVIDE_F = 40.0 (25 MHz)
CLKOUT1_DIVIDE = 8 (125 MHz)
CLKOUT2_DIVIDE = 4 (250 MHz)
Solution 5
Synchronizer circuit:
100 MHz domain 50 MHz domain
valid ──────────────────────┬──────────┬──────────► valid_sync
│ │
┌───▼───┐ ┌───▼───┐
│ FF1 │ │ FF2 │
└───────┘ └───────┘
▲ ▲
│ │
clk_50MHz clk_50MHz
Why two stages?
When valid changes asynchronously with respect to clk_50MHz, the first flip-flop (FF1) may enter a metastable state (output undefined, neither 0 nor 1). If we used FF1’s output directly, this metastability could propagate through the design, causing unpredictable behavior.
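The second flip-flop (FF2) gives the first flip-flop time to settle. Metastability decays exponentially with time, so after a full clock period of settling the probability that FF2 also goes metastable is extremely small. The resulting mean time between failures depends on the clock rates and the device, and is usually far longer than the product's lifetime.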
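Result: valid_sync is now a clean, synchronized signal safe to use in the 50 MHz domain, with up to 2 cycles of latency. Because the destination clock is slower, valid must stay high for longer than one 50 MHz period (at least three 100 MHz cycles is a safe choice); a single-cycle 100 MHz pulse can fall between two 50 MHz edges and be missed.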
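When valid is only a one-cycle pulse in the faster domain, a plain two-flip-flop chain can miss it. One widely used alternative is a toggle synchronizer: the source turns each pulse into a level change, and the destination synchronizes the level and detects the change. The sketch below uses illustrative names (`pulse_sync`, `toggle_src`, `sync1` to `sync3`) and assumes that source pulses are spaced at least a few destination clock periods apart.

```verilog
// Toggle-based pulse synchronizer (works fast-to-slow and slow-to-fast).
// pulse_sync and the internal names are illustrative.
module pulse_sync (
    input  wire clk_src,     // e.g. the 100 MHz clock
    input  wire pulse_src,   // one-cycle pulse in the source domain
    input  wire clk_dst,     // e.g. the 50 MHz clock
    output wire pulse_dst    // one-cycle pulse in the destination domain
);
    // Source domain: every pulse flips a level
    reg toggle_src = 1'b0;
    always @(posedge clk_src)
        if (pulse_src)
            toggle_src <= ~toggle_src;

    // Destination domain: synchronize the level, then detect the change
    (* ASYNC_REG = "TRUE" *) reg sync1 = 1'b0;
    (* ASYNC_REG = "TRUE" *) reg sync2 = 1'b0;
    reg sync3 = 1'b0;

    always @(posedge clk_dst) begin
        sync1 <= toggle_src;
        sync2 <= sync1;
        sync3 <= sync2;
    end

    assign pulse_dst = sync2 ^ sync3;   // high for one clk_dst cycle per toggle
endmodule
```

If pulses can arrive back to back, a handshake or an asynchronous FIFO is the safer choice.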
15. Summary and takeaways
Modern FPGAs are powerful, flexible platforms for digital design. To use them effectively:
- Understand the building blocks: LUTs implement logic, flip-flops store state, DSP blocks handle arithmetic, BRAMs store data, and clock resources distribute timing.
- Let synthesis do its job: Write clean, synthesizable code and let the tools infer resources. Instantiate primitives only when necessary.
- Mind your clocks: Use PLLs/MMCMs for clock generation, BUFGs for distribution, and proper CDC techniques when crossing clock domains.
- Optimize for resources: Use DSP blocks for math, BRAMs for memory, and pipeline for speed. Aim for < 70% utilization to leave routing headroom.
- Check your reports: Always review synthesis and implementation reports. Look for timing violations, resource utilization, and warnings.
- Read the docs: Vendor libraries guides, architecture manuals, and datasheets are your friends. Bookmark them.
With these fundamentals in hand, you’re ready to tackle real FPGA projects—whether it’s a simple Verilog exercise or a full SoC with processors, peripherals, and high-speed I/O.
16. Further reading and resources
- Xilinx 7 Series FPGAs Configurable Logic Block User Guide (UG474): Detailed explanation of CLBs, LUTs, and carry logic.
- Xilinx 7 Series FPGAs Memory Resources User Guide (UG473): Block RAM and distributed RAM architecture.
- Xilinx 7 Series FPGAs Clocking Resources User Guide (UG472): MMCMs, PLLs, clock distribution.
- Xilinx DSP48E1 Slice User Guide (UG479): Everything about DSP blocks.
- Intel (Altera) Stratix/Cyclone Handbook: For Intel FPGA architecture (similar concepts, different terminology).
- “FPGA Prototyping by Verilog Examples” by Pong P. Chu: Hands-on book with many practical examples.
- Xilinx Forums and GitHub: Community designs and IP cores.
That’s it for Session 5! You now have a solid understanding of FPGA architecture and how to use its resources effectively. Practice designing small modules (counters, FIFOs, filters) and synthesizing them to see how the tools map your code to hardware. Over time, you’ll develop an intuition for what code structures map efficiently to FPGAs.
Happy FPGA hacking, and see you in the next session!