A high-performance hardware accelerator for 2D convolution operations, designed as part of the CMP3020 – VLSI course. This project implements a streaming coprocessor architecture that efficiently performs convolution operations under tight on-chip memory constraints.
- Overview
- Project Features
- Architecture
- Project Structure
- Getting Started
- Design Specifications
- Documentation
- Future Work
- Team Contributions
This project presents a Weight Stationary (WS) dataflow architecture optimized for 2D convolution acceleration. Rather than implementing a straightforward convolution approach, the design evolved through rigorous analysis, failed assumptions, and trade-offs—closely resembling a real hardware development process.
The accelerator addresses the challenge of limited on-chip memory by:
- Using kernel folding to decompose large kernels (up to 16×16) into smaller blocks (8×8)
- Accumulating partial results across multiple passes
- Employing a split-kernel approach that distributes computation across multiple phases
The accelerator is designed as a streaming coprocessor that:
- Accepts input image and kernel data from external DRAM (see the handshake sketch after this list)
- Performs efficient 2D convolution operations
- Returns output results to DRAM
- Works in tight integration with a host system
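The DRAM-facing streams are byte-wide valid/ready handshakes (the `rx_*` and `tx_*` ports listed in the interface description below). The following is a minimal protocol sketch, not the RTL; it assumes the usual convention that a byte transfers on any cycle where valid and ready are both asserted:

```python
# Minimal model of a valid/ready byte stream (protocol only, not the RTL).
# Assumption: a byte is transferred on every cycle where valid and ready are high.
def stream_bytes(payload, ready_per_cycle):
    """Yield (cycle, data, accepted) until every byte has been accepted."""
    idx = 0
    for cycle, ready in enumerate(ready_per_cycle):
        if idx == len(payload):
            break                      # nothing left to send: valid would drop
        data = payload[idx]            # source holds data stable while valid
        accepted = ready               # valid is high, so ready alone decides
        yield cycle, data, accepted
        if accepted:
            idx += 1                   # advance only after a completed handshake

# Example: the sink applies back-pressure every other cycle.
received = [d for _, d, ok in stream_bytes([0x11, 0x22, 0x33],
                                           [True, False, True, False, True]) if ok]
assert received == [0x11, 0x22, 0x33]
```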
- 8×8 Systolic Array - Parallel processing element array for MAC operations
- Dual-Port SRAM Architecture - Concurrent read/write for efficient data movement
- SRAM0 (64-bit × 1024): Image and kernel storage
- SRAM1 (32-bit × 4096): Packed partial output buffer
- DMA-Based Data Loading - Efficient data movement from external DRAM
- Split-Kernel Support - Handles kernels up to 16×16 on 8×8 array
- Column-Major Output - Memory-efficient streaming of results
- Control Unit FSM - Orchestrates complex multi-phase kernel execution
| Metric | Value |
|---|---|
| Total Power | 0.444 W |
| Core Area | 17,089,700 µm² |
| Core Utilization | 28.3% |
| Array Dimension | 8×8 |
| Max Kernel Size | 16×16 |
| Supported Image Size | Up to 64×64 |
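A quick back-of-the-envelope check of these limits (a sketch that assumes "valid" convolution with no padding, which matches the split-kernel flow described below):

```python
# Sizing sketch for the largest supported configuration above.
# Assumption: 'valid' convolution (no padding), so the output is (N-K+1) x (N-K+1).
import math

N, K, SA_DIM = 64, 16, 8                 # max image, max kernel, array dimension
out_dim = N - K + 1                      # 49x49 outputs in the largest case
phases  = math.ceil(K / SA_DIM) ** 2     # a 16x16 kernel folds into 4 blocks (A-D)
print(out_dim, phases)                   # -> 49 4
```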
- Load Phase: Image and kernel loaded from DRAM into SRAM0 via DMA
- Kernel Streaming: 8×8 kernel blocks streamed to systolic array
- Convolution: SA computes partial contributions for each kernel block
- Writeback: Partial results accumulated in SRAM1 using byte-masked writes
- Drain Phase: Final results summed and streamed back to DRAM
For kernels larger than 8×8:
- Phase A: Top-left 8×8 kernel block → partial output
- Phase B: Top-right 8×8 kernel block → accumulated
- Phase C: Bottom-left 8×8 kernel block → accumulated
- Phase D: Bottom-right 8×8 kernel block → accumulated
Final output = sum of all partial contributions (see the reference sketch below)
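A minimal NumPy/SciPy reference sketch of this decomposition (it assumes "valid" cross-correlation as the index convention; the RTL's exact convention may differ):

```python
# Check that one 16x16 pass equals the sum of four shifted 8x8 block passes,
# mirroring the Phase A-D accumulation described above (reference model only).
import numpy as np
from scipy.signal import correlate2d

N, K, B = 64, 16, 8                               # image, kernel, block size
img = np.random.randint(0, 256, (N, N))
ker = np.random.randint(0, 256, (K, K))

reference = correlate2d(img, ker, mode="valid")   # direct 16x16 result

out = np.zeros_like(reference)
for a in (0, 1):                                  # block row: 0 = top, 1 = bottom
    for b in (0, 1):                              # block col: 0 = left, 1 = right
        block   = ker[a*B:(a+1)*B, b*B:(b+1)*B]
        partial = correlate2d(img, block, mode="valid")
        # Each block contributes its partial result shifted by the block offset.
        out += partial[a*B : a*B + out.shape[0], b*B : b*B + out.shape[1]]

assert np.array_equal(out, reference)             # the four phases sum to the full result
```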
```
convolution-accelerator/
├── rtl/                          # RTL Design Files (Verilog)
│   ├── conv_accelerator_top.v    # Top-level module
│   ├── control_unit/             # FSM-based control unit
│   ├── data-loader-agu/          # Data loader and AGU
│   │   ├── src/                  # Core streaming modules
│   │   ├── Python_scripts/       # Helper scripts for memory generation
│   │   └── designs/              # SRAM design files
│   ├── systolic_array/           # Systolic array implementation
│   │   ├── pe.v                  # Processing element
│   │   └── systolic_array.v      # 8×8 array
│   └── tb/                       # Testbenches
│
├── config/                       # Configuration files
│   ├── config.json               # Design parameters
│   └── macro_placement.cfg       # Placement configuration
│
├── docs/                         # Documentation
│
├── scripts/                      # Testing scripts
│
├── test_cases/                   # Test configurations
│   └── 01 -> 10                  # Test case directories 01 through 10
│
└── sim/                          # Simulation scripts
```
- Verilog/SystemVerilog simulator (ModelSim, VCS, etc.)
- Python 3.x (for test generation and verification scripts)
- Make or equivalent build tool (optional)
Systolic Array Test:
```bash
cd rtl/systolic_array
vsim -do ../../sim/systolic_array_sim.do
```

Processing Element Test:
```bash
cd rtl/systolic_array
vsim -do ../../sim/pe_sim.do
```

Control Unit Test:
```bash
cd rtl/control_unit
vsim -do run_tb.do
```

Full Test Suite:
```bash
cd scripts
python3 run_all_tests.py
```

This will:
- Load test configurations from `test_cases/`
- Generate stimulus data
- Run full integration simulations
- Compare outputs with golden references
Verification Script:
```bash
bash scripts/verify.sh
```

| Parameter | Default | Description |
|---|---|---|
| `ADDR_W` | 10 | SRAM0 word address width (1024 words) |
| `BYTE_ADDR_W` | 13 | Byte address width (8 KB) |
| `KER_BASE_BYTE` | 4096 | Kernel base address in SRAM0 |
| `IMG_BASE_BYTE` | 0 | Image base address in SRAM0 |
| `SRAM1_ADDR_W` | 12 | SRAM1 word address width (4096 words) |
| `SA_DIM` | 8 | Systolic array dimension |
| `SA_INPUT_FILL_TIME` | 8 | SA pipeline fill time |
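A small consistency check of these defaults (a standalone sketch; it is not generated from config.json):

```python
# Sanity-check the default address parameters: SRAM0 is 1024 x 64-bit words
# (8 bytes each), giving an 8 KB byte address space, and the kernel region
# starts at byte 4096, i.e. the upper half of SRAM0.
ADDR_W        = 10          # 2**10 = 1024 SRAM0 words
WORD_BYTES    = 64 // 8     # one 64-bit word = 8 bytes
BYTE_ADDR_W   = 13          # 2**13 = 8192 bytes = 8 KB
IMG_BASE_BYTE = 0
KER_BASE_BYTE = 4096
SRAM1_ADDR_W  = 12          # 2**12 = 4096 SRAM1 words (32-bit each)

assert (2 ** ADDR_W) * WORD_BYTES == 2 ** BYTE_ADDR_W        # 1024 * 8 == 8192
assert KER_BASE_BYTE == (2 ** BYTE_ADDR_W) // 2              # kernel in the upper half
assert 2 ** SRAM1_ADDR_W == 4096
print("parameter defaults are self-consistent")
```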
```verilog
// Inputs
input        clk        // System clock
input        rst_n      // Active-low reset
input        start      // Start convolution operation
input  [6:0] cfg_N      // Image dimension (N×N)
input  [4:0] cfg_K      // Kernel dimension (K×K)
input  [7:0] rx_data    // Input data from DRAM
input        rx_valid   // Input data valid signal
input        tx_ready   // Output ready signal

// Outputs
output       done       // Convolution complete
output       rx_ready   // Ready to accept input data
output       tx_valid   // Output data valid
output [7:0] tx_data    // Output data to DRAM
```

- Stores full input image and kernel weights
- Dual-port for concurrent reads
- Image stored from address 0
- Kernel stored from address 4096 (configurable)
- Stores packed partial outputs
- 4 bytes per pixel (one byte per kernel phase)
- Byte-masked writes enable atomic lane updates (modeled in the sketch below)
- No read-modify-write cycles required
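An illustrative model of a byte-masked write into one packed 32-bit SRAM1 word (the lane-to-phase packing shown is an assumption for illustration, not the exact RTL format):

```python
# Model of a byte-enable write: only lanes whose byte_en bit is set are updated,
# so one phase can deposit its byte without reading or disturbing the others.
def masked_write(word: int, data: int, byte_en: int) -> int:
    """Merge 32-bit `data` into `word` on the byte lanes selected by `byte_en`."""
    result = word
    for lane in range(4):                         # 4 byte lanes per 32-bit word
        if byte_en & (1 << lane):
            lane_mask = 0xFF << (8 * lane)
            result = (result & ~lane_mask) | (data & lane_mask)
    return result & 0xFFFFFFFF

# Example: phase B (lane 1, hypothetical mapping) writes 0xAA; the other lanes
# keep their previously written phase bytes.
word = 0x44332211
word = masked_write(word, 0x0000AA00, byte_en=0b0010)
assert word == 0x4433AA11
```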
| Metric | Value |
|---|---|
| Worst Setup Slack | -11.6 ns |
| Total Negative Slack | -1113.27 ns |
| Max Operating Frequency | ~20-30 MHz (after timing closure) |
The final architecture employs Weight Stationary (WS) rather than Output Stationary (OS) because:
- Kernel is reused across the entire input image
- Keeping weights fixed in PEs minimizes redundant weight movement (sketched below)
- Simplifies kernel loading and reduces data communication
- Well-suited for single-kernel, large-input-image scenarios
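To make the weight-stationary behaviour concrete, here is a tiny behavioural sketch of a single PE (illustrative only; not the pe.v implementation):

```python
# A weight-stationary PE: the weight is loaded once and stays resident while
# activations and partial sums stream through, one MAC per step.
from dataclasses import dataclass

@dataclass
class PE:
    weight: int = 0                      # stationary operand, loaded once per phase

    def load_weight(self, w: int) -> None:
        self.weight = w

    def step(self, activation: int, psum_in: int) -> int:
        """One MAC: add this PE's contribution to the incoming partial sum."""
        return psum_in + self.weight * activation

# Example: a single PE accumulating a dot product over streamed activations.
pe = PE()
pe.load_weight(3)
psum = 0
for act in (1, 2, 4):
    psum = pe.step(act, psum)
assert psum == 3 * (1 + 2 + 4)           # weight reused without being reloaded
```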
For kernels larger than 8×8:
- Decompose into 8×8 sub-kernels
- Process sequentially through multiple phases
- Accumulate partial outputs in SRAM1
- Final results obtained by summing all partial contributions
- SRAM0 (64-bit): Optimized for unaligned window reads and kernel loading
- SRAM1 (32-bit): Packed output format with byte-lane isolation
- Enables pipelined data movement without stalls
For detailed information, refer to the documentation files:
- Comprehensive_Architecture.md - Complete system architecture and module descriptions
- Convolution Accelerator.md - Project journey, design decisions, and architectural evolution
- Memory Organization.md - SRAM layout, address generation, and memory mapping
- Control Unit.md - FSM design and state transitions
- Systolic Array Documentation - Detailed PE and array specifications
- Metrics.md - PPA (Power, Performance, Area) results
- Team_Contributions.md - Team member roles and module ownership
The team is currently exploring two additional implementations that are expected to further improve the performance metrics.
- DiP-based systolic arrays
  Referenced from this paper: https://arxiv.org/pdf/2412.09709
  Current work can be found in the feat/sa-dip branch.
  DiP eliminates the input/output synchronization FIFOs required by state-of-the-art weight-stationary systolic arrays by adopting diagonal input movement and weight permutation.
- A slight timing adjustment to the current 101 implementation
  Inspired by this article: https://telesens.co/2018/07/30/systolic-architectures
  Current work can be found in the feat/sa-101-optimized branch.
This project implements concepts from CNN accelerator literature, including:
- Systolic array design principles
- Dataflow mapping techniques for convolution
- Memory hierarchy optimization for embedded systems
- Ahmed Sobhy
- AhmedAmrNabil
- Ahmed Fathy
- Ziad Montaser
- Tasneem Mohamed
- Habiba Ayman
- Tony Nagy
- Helana Nady