
Hardware accelerator for 2D convolution using an 8×8 weight-stationary systolic array with split-kernel support, dual-port SRAM architecture, and DMA-based streaming


Convolution Accelerator

A high-performance hardware accelerator for 2D convolution operations, designed as part of the CMP3020 – VLSI course. This project implements a streaming coprocessor architecture that efficiently performs convolution operations under tight on-chip memory constraints.



Overview

This project presents a Weight Stationary (WS) dataflow architecture optimized for 2D convolution acceleration. Rather than implementing a straightforward direct convolution, the design evolved through iterative analysis, revised assumptions, and deliberate trade-offs, closely resembling a real hardware development process.

Key Innovation

The accelerator addresses the challenge of limited on-chip memory by:

  • Using kernel folding to decompose large kernels (up to 16×16) into smaller blocks (8×8)
  • Accumulating partial results across multiple passes
  • Employing a split-kernel approach that distributes computation across multiple phases

Use Case

The accelerator is designed as a streaming coprocessor that:

  • Accepts input image and kernel data from external DRAM
  • Performs efficient 2D convolution operations
  • Returns output results to DRAM
  • Works in tight integration with a host system

Project Features

🔧 Core Components

  • 8×8 Systolic Array - Parallel processing element array for MAC operations
  • Dual-Port SRAM Architecture - Concurrent read/write for efficient data movement
    • SRAM0 (64-bit × 1024): Image and kernel storage
    • SRAM1 (32-bit × 4096): Packed partial output buffer
  • DMA-Based Data Loading - Efficient data movement from external DRAM
  • Split-Kernel Support - Handles kernels up to 16×16 on 8×8 array
  • Column-Major Output - Memory-efficient streaming of results
  • Control Unit FSM - Orchestrates complex multi-phase kernel execution

📊 Performance Metrics

| Metric | Value |
| --- | --- |
| Total Power | 0.444 W |
| Core Area | 17,089,700 µm² |
| Core Utilization | 28.3% |
| Array Dimension | 8×8 |
| Max Kernel Size | 16×16 |
| Supported Image Size | Up to 64×64 |

Architecture

System Overview

(System overview block diagram)

Data Flow Phases

  1. Load Phase: DRAM image and kernel loaded into SRAM0 via DMA
  2. Kernel Streaming: 8×8 kernel blocks streamed to systolic array
  3. Convolution: SA computes partial contributions for each kernel block
  4. Writeback: Partial results accumulated in SRAM1 using byte-masked writes
  5. Drain Phase: Final results summed and streamed back to DRAM
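The five phases above can be sketched as a simple sequencer; for kernels larger than the array, the stream/convolve/writeback loop repeats once per 8×8 kernel block. This is a behavioral sketch only: the state names are descriptive and are not taken from the RTL control unit.

```python
import math
from enum import Enum, auto

class Phase(Enum):
    LOAD = auto()            # DMA image + kernel into SRAM0
    KERNEL_STREAM = auto()   # load one 8x8 kernel block into the array
    CONVOLVE = auto()        # systolic array computes partial outputs
    WRITEBACK = auto()       # byte-masked accumulation into SRAM1
    DRAIN = auto()           # sum partials, stream results to DRAM
    DONE = auto()

def phase_sequence(kernel_dim, sa_dim=8):
    """Yield the phase order for a K x K kernel on an sa_dim x sa_dim array."""
    blocks = math.ceil(kernel_dim / sa_dim) ** 2   # 8x8 kernel -> 1 block, 16x16 -> 4
    yield Phase.LOAD
    for _ in range(blocks):
        yield Phase.KERNEL_STREAM
        yield Phase.CONVOLVE
        yield Phase.WRITEBACK
    yield Phase.DRAIN
    yield Phase.DONE
```

For a 16×16 kernel this yields four stream/convolve/writeback iterations, one per 8×8 kernel block.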

Split-Kernel Approach

For kernels larger than 8×8:

  • Phase A: Top-left 8×8 kernel block → partial output
  • Phase B: Top-right 8×8 kernel block → accumulated
  • Phase C: Bottom-left 8×8 kernel block → accumulated
  • Phase D: Bottom-right 8×8 kernel block → accumulated

Final output = sum of all partial contributions
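The decomposition can be checked numerically. The sketch below (a behavioral model, not the RTL) shows that a valid 2D convolution with a 16×16 kernel equals the sum of four partial convolutions, one per 8×8 sub-kernel, each reading the input window shifted by that sub-kernel's offset:

```python
import numpy as np

def conv2d_valid(img, ker):
    """Direct 'valid' 2D convolution (correlation form, as in CNN accelerators)."""
    n, k = img.shape[0], ker.shape[0]
    out = np.zeros((n - k + 1, n - k + 1))
    for i in range(n - k + 1):
        for j in range(n - k + 1):
            out[i, j] = np.sum(img[i:i + k, j:j + k] * ker)
    return out

def conv2d_split(img, ker, block=8):
    """Same result via kernel folding: accumulate one pass per 8x8 block."""
    n, k = img.shape[0], ker.shape[0]
    out = np.zeros((n - k + 1, n - k + 1))
    for du in range(0, k, block):          # phases A-D when k = 16
        for dv in range(0, k, block):
            sub = ker[du:du + block, dv:dv + block]
            bh, bw = sub.shape
            for i in range(out.shape[0]):
                for j in range(out.shape[1]):
                    # each phase reads the input window shifted by the
                    # sub-kernel's offset (du, dv) inside the full kernel
                    out[i, j] += np.sum(img[i + du:i + du + bh,
                                            j + dv:j + dv + bw] * sub)
    return out
```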


Project Structure

convolution-accelerator/
├── rtl/                           # RTL Design Files (Verilog)
│   ├── conv_accelerator_top.v     # Top-level module
│   ├── control_unit/              # FSM-based control unit
│   ├── data-loader-agu/           # Data loader and AGU
│   │   ├── src/                   # Core streaming modules
│   │   ├── Python_scripts/        # Helper scripts for memory generation
│   │   └── designs/               # SRAM design files
│   ├── systolic_array/            # Systolic array implementation
│   │   ├── pe.v                   # Processing element
│   │   └── systolic_array.v       # 8×8 array
│   └── tb/                        # Testbenches
│
├── config/                        # Configuration files
│   ├── config.json                # Design parameters
│   └── macro_placement.cfg        # Placement configuration
│
├── docs/                          # Documentation
│
├── scripts/                       # Testing scripts
│
├── test_cases/                    # Test configurations
│   ├── 01 … 10                # Test cases 01 through 10
│
├── sim/                           # Simulation scripts

Getting Started

Prerequisites

  • Verilog/SystemVerilog simulator (ModelSim, VCS, etc.)
  • Python 3.x (for test generation and verification scripts)
  • Make or equivalent build tool (optional)

Running Simulations

1. Simulate Individual Components

Systolic Array Test:

cd rtl/systolic_array
vsim -do ../../sim/systolic_array_sim.do

Processing Element Test:

cd rtl/systolic_array
vsim -do ../../sim/pe_sim.do

Control Unit Test:

cd rtl/control_unit
vsim -do run_tb.do

2. Run Full System Tests

cd scripts
python3 run_all_tests.py

This will:

  • Load test configurations from test_cases/
  • Generate stimulus data
  • Run full integration simulations
  • Compare outputs with golden references

3. Verify Output

bash scripts/verify.sh

Design Specifications

Top-Level Module: conv_accelerator_top

Parameters

| Parameter | Default | Description |
| --- | --- | --- |
| ADDR_W | 10 | SRAM0 word address width (1024 words) |
| BYTE_ADDR_W | 13 | Byte address width (8 KB) |
| KER_BASE_BYTE | 4096 | Kernel base address in SRAM0 |
| IMG_BASE_BYTE | 0 | Image base address in SRAM0 |
| SRAM1_ADDR_W | 12 | SRAM1 word address width (4096 words) |
| SA_DIM | 8 | Systolic array dimension |
| SA_INPUT_FILL_TIME | 8 | SA pipeline fill time |
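The default widths are mutually consistent, which a few lines of arithmetic on the values above confirm:

```python
# Pure arithmetic checks on the default parameter values.
ADDR_W, BYTE_ADDR_W, SRAM1_ADDR_W = 10, 13, 12

SRAM0_WORDS = 1 << ADDR_W            # 1024 x 64-bit words
SRAM0_BYTES = SRAM0_WORDS * 8        # 64 bits = 8 bytes per word
assert SRAM0_BYTES == 1 << BYTE_ADDR_W   # 8 KB byte-addressable space

KER_BASE_BYTE = 4096
assert 0 <= KER_BASE_BYTE < SRAM0_BYTES  # kernel base sits inside SRAM0

SRAM1_WORDS = 1 << SRAM1_ADDR_W      # 4096 x 32-bit words for packed partials
assert SRAM1_WORDS == 64 * 64        # enough for one packed word per pixel
                                     # of a 64x64 image (sizing assumption)
```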

Port Interface

// Inputs
input clk                    // System clock
input rst_n                  // Active-low reset
input start                  // Start convolution operation
input [6:0] cfg_N           // Image dimension (N×N)
input [4:0] cfg_K           // Kernel dimension (K×K)
input [7:0] rx_data         // Input data from DRAM
input rx_valid              // Input data valid signal
input tx_ready              // Output ready signal

// Outputs
output done                 // Convolution complete
output rx_ready             // Ready to accept input data
output tx_valid             // Output data valid
output [7:0] tx_data        // Output data to DRAM
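Both the rx and tx sides follow a standard valid/ready handshake: a byte moves only on cycles where valid and ready are both high, so backpressure stalls a byte rather than dropping it. A minimal behavioral model of one direction (not derived from the RTL):

```python
def handshake_transfer(src_bytes, ready_per_cycle):
    """Model one stream direction: the producer holds each byte with
    valid high until the consumer's ready is high in the same cycle."""
    received, i = [], 0
    for ready in ready_per_cycle:
        valid = i < len(src_bytes)   # producer still has data to offer
        if valid and ready:          # a transfer fires on valid & ready
            received.append(src_bytes[i])
            i += 1
    return received

# Ready low on cycles 1-2 stalls the second byte; nothing is lost.
assert handshake_transfer([0xAA, 0xBB], [1, 0, 0, 1]) == [0xAA, 0xBB]
```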

Memory Architecture

SRAM0 (64-bit × 1024 words)

  • Stores full input image and kernel weights
  • Dual-port for concurrent reads
  • Image stored from address 0
  • Kernel stored from address 4096 (configurable)

SRAM1 (32-bit × 4096 words)

  • Stores packed partial outputs
  • 4 bytes per pixel (one byte per kernel phase)
  • Byte-masked writes enable atomic lane updates
  • No read-modify-write cycles required
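A behavioral model of the byte-masked write path makes the packing concrete. The lane assignment (phase p writes byte lane p) is an assumption for illustration; in hardware the per-byte write enables merge the lanes inside the SRAM array, so the controller never performs a read-modify-write.

```python
class Sram32:
    """32-bit-wide SRAM with per-byte write enables, modeling SRAM1.
    The software model merges lanes explicitly; real byte enables do
    this inside the array, with no controller read-modify-write."""

    def __init__(self, words):
        self.mem = [0] * words

    def write(self, addr, data, byte_mask):
        """byte_mask bit b enables byte lane b (bits 8b .. 8b+7)."""
        word = self.mem[addr]
        for b in range(4):
            if (byte_mask >> b) & 1:
                lane = (data >> (8 * b)) & 0xFF
                word = (word & ~(0xFF << (8 * b))) | (lane << (8 * b))
        self.mem[addr] = word

# Each split-kernel phase deposits only its own byte lane, so four
# phases pack their partials into one word independently.
sram = Sram32(4096)
for phase, partial in enumerate([0x11, 0x22, 0x33, 0x44]):
    sram.write(0, partial << (8 * phase), 1 << phase)
assert sram.mem[0] == 0x44332211   # all four phase bytes packed
```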

Timing Constraints

| Metric | Value |
| --- | --- |
| Worst Setup Slack | -11.6 ns |
| Total Negative Slack | -1113.27 ns |
| Max Operating Frequency | ~20-30 MHz (after timing closure) |

Key Design Decisions

Weight Stationary Dataflow

The final architecture employs Weight Stationary (WS) rather than Output Stationary (OS) because:

  • Kernel is reused across the entire input image
  • Keeping weights fixed in PEs minimizes redundant weight movement
  • Simplifies kernel loading and reduces data communication
  • Well-suited for single-kernel, large-input-image scenarios
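The weight-stationary dataflow can be illustrated with a small cycle-level model: weights sit fixed in the PEs, activations shift left to right with a one-cycle skew per row, and partial sums flow top to bottom. This is a sketch under those common WS conventions, not the RTL's exact signal timing, and the names are illustrative.

```python
import numpy as np

def ws_systolic(W, A):
    """Cycle-level weight-stationary systolic array model.

    W (R x C) is preloaded, one weight per PE. A (V x R) holds V input
    vectors, injected with a one-cycle skew per row. Returns A @ W:
    each output row is the array's response to one input vector.
    """
    R, C = W.shape
    V = A.shape[0]
    act = np.zeros((R, C))    # activation registers, shifting left -> right
    psum = np.zeros((R, C))   # partial-sum registers, flowing top -> bottom
    out = np.zeros((V, C))
    for t in range(V + R + C):           # enough cycles to fill and drain
        # bottom-row psums carry finished results: vector v's column-c
        # result appears at cycle v + R + c
        for c in range(C):
            v = t - R - c
            if 0 <= v < V:
                out[v, c] = psum[R - 1, c]
        # next register state: all PEs update simultaneously
        act_in = np.zeros((R, C))
        for r in range(R):
            for c in range(C):
                if c == 0:
                    vec = t - r                      # row-skewed injection
                    act_in[r, c] = A[vec, r] if 0 <= vec < V else 0.0
                else:
                    act_in[r, c] = act[r, c - 1]
        psum_in = np.vstack([np.zeros((1, C)), psum[:-1]])  # shift down
        psum = psum_in + W * act_in                  # MAC in every PE
        act = act_in
    return out
```

Note that the first result emerges only after R = 8 cycles of pipeline fill, consistent with the SA_INPUT_FILL_TIME parameter above.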

Split-Kernel Approach

For kernels larger than 8×8:

  • Decompose into 8×8 sub-kernels
  • Process sequentially through multiple phases
  • Accumulate partial outputs in SRAM1
  • Final results obtained by summing all partial contributions

Dual-Port SRAM Strategy

  • SRAM0 (64-bit): Optimized for unaligned window reads and kernel loading
  • SRAM1 (32-bit): Packed output format with byte-lane isolation
  • Enables pipelined data movement without stalls

Documentation

For detailed information, refer to the documentation files in docs/.


Future Work

The team is currently investigating two alternative implementations that are expected to improve the performance metrics further.

  1. DiP-architected systolic arrays
    Referenced from this paper: https://arxiv.org/pdf/2412.09709
    Current work can be found in the feat/sa-dip branch.
    DiP eliminates the input/output synchronization FIFOs required by state-of-the-art weight-stationary systolic arrays by adopting diagonal input movement and weight permutation.
  2. A slight timing adjustment on the current 101 implementation
    Inspired by this article: https://telesens.co/2018/07/30/systolic-architectures
    Current progress can be found in the feat/sa-101-optimized branch.

References

This project implements concepts from CNN accelerator literature, including:

  • Systolic array design principles
  • Dataflow mapping techniques for convolution
  • Memory hierarchy optimization for embedded systems

Team Contributions


Ahmed Sobhy

AhmedAmrNabil

Ahmed Fathy

Ziad Montaser

Tasneem Mohamed

Habiba Ayman

Tony Nagy

Helana Nady
