
Educational implementation of a weight-stationary systolic array architecture for an ASIC design flow (65nm TSMC). Selected modules are shared publicly within NDA restrictions, converted to a FreePDK45 version.


davidlee1229/ws-systolic-65nm


Weight Stationary - Systolic Array in 65nm TSMC

  • Educational implementation of a weight-stationary systolic array architecture for an ASIC design flow (65nm TSMC). Selected modules are shared publicly within NDA restrictions, converted to a FreePDK45 version.

Design Overview

Systolic Array Architecture

The design implements a systolic array of processing elements (PEs) for efficient matrix multiplication.

[Figure: MMU Diagram]
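The weight-stationary dataflow can be illustrated with a small cycle-by-cycle model. This is a minimal sketch of the textbook scheme, not a description of the repository's RTL: it assumes each PE holds one weight, activations enter from the left with a one-cycle skew per row and move right, and partial sums move down. The function name `systolic_matmul` and the register layout are hypothetical, and saturation is left out to keep the dataflow visible.

```python
def systolic_matmul(a, w):
    """Cycle-level sketch of a weight-stationary array computing a x w.

    a is an m x k activation matrix, w is a k x n weight matrix.
    PE(r, c) is assumed to hold w[r][c] for the whole computation.
    """
    m, k, n = len(a), len(w), len(w[0])
    act = [[0] * n for _ in range(k)]    # activation register of each PE
    psum = [[0] * n for _ in range(k)]   # partial-sum register of each PE
    out = [[0] * n for _ in range(m)]

    for t in range(m + k + n - 2):
        new_act = [[0] * n for _ in range(k)]
        new_psum = [[0] * n for _ in range(k)]
        for r in range(k):
            for c in range(n):
                if c == 0:
                    # Array row r receives a[i][r] at cycle i + r (skewed input).
                    i = t - r
                    a_in = a[i][r] if 0 <= i < m else 0
                else:
                    a_in = act[r][c - 1]          # activation from the left PE
                p_in = psum[r - 1][c] if r > 0 else 0  # partial sum from above
                new_act[r][c] = a_in
                new_psum[r][c] = p_in + a_in * w[r][c]
        act, psum = new_act, new_psum

        # The bottom PE of column c holds result row i at cycle i + (k - 1) + c.
        for c in range(n):
            i = t - (k - 1) - c
            if 0 <= i < m:
                out[i][c] = psum[k - 1][c]
    return out


result = systolic_matmul([[1, 2], [3, 4]], [[0, 1], [2, 3]])
print(result)
```

The weights never move once loaded, so only activations and partial sums cross PE boundaries each cycle.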

| Feature | Description |
|---|---|
| Number of rows | 4 (Programmable by Designer) |
| Number of columns | 4 (Programmable by Designer) |
| Input data format | Signed Integer (2's Complement) |
| Input data width | 8 bits |
| Output data width | 8 bits |
| Handling data overflow | Saturation |
| Modes supported | Memory, External, & BiST |
| IO ports | See matrix_mult.sv |
| Reset | Active-Low Reset (Reset when 0) |
| Process node | TSMC 65GP / FreePDK45 |
| Clock frequency | 200 MHz |
| Timing model | NLDM |
| Power supply | 0.9 ~ 1.1 V |
| Highest metal allowed | M6 |
| Target Area | 172.8 µm × 172.8 µm |

Control States in MatMul Array

  • Idle
  • Load
  • Compute

Verification with Golden Model

[Figure: Verification Pipeline]
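A golden model for this kind of design can be as small as an integer matrix product followed by saturation. The sketch below is illustrative only: the contents of the repository's golden.py are not shown here, and the names `saturate_int8` and `golden_matmul` are hypothetical. It assumes signed 8-bit outputs and that saturation is applied once to the full-precision sum; the hardware may instead saturate at each accumulation step, which can give different results.

```python
INT8_MIN, INT8_MAX = -128, 127


def saturate_int8(value):
    """Clamp an integer to the signed 8-bit range."""
    return max(INT8_MIN, min(INT8_MAX, value))


def golden_matmul(a, b):
    """Reference a x b with saturation applied to each final sum (assumption)."""
    inner, cols = len(b), len(b[0])
    return [
        [saturate_int8(sum(row[k] * b[k][j] for k in range(inner)))
         for j in range(cols)]
        for row in a
    ]


# 100 * 2 = 200 overflows INT8 and clamps to 127; -200 clamps to -128.
print(golden_matmul([[100, 100], [-100, 1]], [[2, 0], [0, 1]]))
```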

Tools & Dependencies

  • Python 3.8 (for test generation and verification)
  • ICC2
  • VCS
  • Verdi

Project Structure

.
├── README.md
├── sim
│   ├── behav
│   ├── apr
│   └── syn
├── syn
├── apr
├── src
│   ├── Makefiles                              # Makefiles
│   ├── syn                                    # TCL files
│   ├── apr                                    # TCL files
│   └── verilog                                # SV files
│       ├── sdf.max.cfg                        
│       ├── sdf.min.cfg                        
│       ├── matrix_mult_wrapper.include        # Pre-synthesis include file
│       ├── matrix_mult_wrapper_syn.include    # Post-synthesis include file
│       ├── matrix_mult_wrapper_apr.include    # Post-APR include file
│       ├── memory/                            # Memory emulator modules
│       ├── misc/                              # Monitor module         
│       ├── bist/                              # Driver module          
│       ├── matrix_mult/                       # Matrix multiplication SystemVerilog files
│       │   ├── matrix_mult_wrapper.sv         # Top-level wrapper
│       │   ├── matrix_mult_pkg.sv             # Package file
│       │   ├── matrix_mult.sv                 # Matrix multiplication unit
│       │   ├── matrix_mult_array.sv           # PE array with activation/weight inputs
│       │   ├── matrix_mult_control.sv         # Control & memory address generation
│       │   ├── matrix_mult_pe.sv              # Processing element (PE)
│       ├── testcase/                          # Edge case examples
│       ├── gold_result/                       # Golden result examples
│       ├── tb_matrix_mult.sv                  # Testbench (pre-synthesis)
│       ├── tb_matrix_mult_syn.sv              # Testbench (post-synthesis)
│       ├── tb_matrix_mult_sapr.sv             # Testbench (post-SAPR)
│       ├── tasks.sv                           # Read, operate, and write tasks
│       ├── golden.py                          # Golden data comparison
│       └── generate.py                        # Random data generation

Pre-Synthesis Simulation

Run 5 Pre-defined Test Cases

  1. Modify the TESTNAME in ./src/Makefiles/Makefile_sim_presyn to one of the following:

    • Group A: memory, external, offset_ext, offset_mem
    • Group B: bist, offset_bist, consec, reset, overwrite, long_act_ext, long_act_mem, long_act_bist

    If Group A:

    cd sim/behav/
    make run_example

    If Group B:

    cd sim/behav/
    make run_behavior
  2. Check output files in ./results.


Creating Custom Test Cases

Example

Matrix multiplication A × B, where:

  • A = [[a, b], [c, d]]
  • B = [[0, 1], [2, 3]]

1. Weight Memory Format

File: ./src/verilog/testcase/[number]_wb_init.mem

Note: [number] should start from 1.

Format:

B[1][1]B[1][0]
B[0][1]B[0][0]

Example (hex):

0302
0100

2. Input Memory Format

File: ./src/verilog/testcase/[number]_ib_init.mem

Note: [number] should start from 1.

Format:

X        A[0][0]
A[0][1]  A[1][0]
A[1][1]  X       

Example (hex):

000A
0B0C
0D00

Note: X represents unused/don't-care values, set to 0.
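The two memory layouts above can be produced programmatically. The following is a minimal sketch that reproduces the 2×2 example; the helper names `weight_words` and `input_words` are hypothetical and are not taken from generate.py. Extending the same pattern to larger matrices is an assumption based on the 2×2 layout shown, so check it against the provided test cases before relying on it.

```python
def to_hex_byte(value):
    """Format a signed 8-bit value as two uppercase hex digits (two's complement)."""
    return format(value & 0xFF, "02X")


def weight_words(b):
    """One word per row of B, last row first, highest column in the top byte."""
    return ["".join(to_hex_byte(v) for v in reversed(row)) for row in reversed(b)]


def input_words(a):
    """Skewed activation words: byte lane j carries A[t - j][j] at word t."""
    rows, cols = len(a), len(a[0])
    words = []
    for t in range(rows + cols - 1):
        lanes = []
        for lane in range(cols):  # lane 0 is the lowest byte of the word
            i = t - lane
            lanes.append(a[i][lane] if 0 <= i < rows else 0)  # X is written as 0
        words.append("".join(to_hex_byte(v) for v in reversed(lanes)))
    return words


b = [[0, 1], [2, 3]]
a = [[0x0A, 0x0B], [0x0C, 0x0D]]
print("\n".join(weight_words(b)))  # 0302, 0100
print("\n".join(input_words(a)))   # 000A, 0B0C, 0D00
```

Writing each list to `[number]_wb_init.mem` and `[number]_ib_init.mem` with one word per line gives files in the format described above.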

3. Update Makefile

Modify line 7 of ./src/Makefiles/Makefile_sim_presyn:

TESTNUM ?= [your_test_number]

Replace [your_test_number] with the total number of custom test cases.

4. Run Custom Test

Follow the same procedure as in Run 5 Pre-defined Test Cases.


Synthesis

cd ../../syn/
make design

Post-Synthesis Simulation

cd ../sim/syn

  1. Modify the TESTNAME in ./src/Makefiles/Makefile_sim_postsyn as described above.

    If Group A:

    make run_example

    If Group B:

    make run_behavior
  2. Check output files in ./results.

Auto Place and Route

cd ../../apr/
make design

Post-APR Simulation

cd ../sim/apr

  1. Modify the TESTNAME in ./src/Makefiles/Makefile_sim_postapr as described above.

    If Group A:

    make run_example

    If Group B:

    make run_behavior
  2. Check output files in ./results.

APR Update Summary

Modified Files and Key Changes:

  1. rtl, syn
  • Trimmed the RTL logic to reduce cell count and utilization
  • Changed the target clock frequency to 100 MHz to minimize area utilization and power
  2. user_config.tcl
  • Updated to match the metal and area requirements
  3. 04_place_opt.tcl
  • Enabled congestion-layer-aware placement
  • Added refine_placement step with high effort to reduce congestion
  • Maintained timing update and PG connectivity check
  4. 05_clock_opt.tcl
  • Added clock tree skew target (set_clock_tree_options -target_skew 0.03)
  • Increased hold optimization effort (opt.common.hold_effort = high)
  • Maintained all existing clock routing and reporting configs
  5. 06_route.tcl
  • Added CCD-based hold fixing flow with escalating effort (high → ultra)
  • Marked hold buffer cells and enabled CCD timing flow
  • Performed iterative route_opt + update_timing cycles
  6. 08_report.tcl
  • Removed fanout threshold option (-threshold 64) from high-fanout net report
  • Added congestion and placement reports

SDF post-processing

  • Created backup copies of the original SDF files (matrix_mult_wrapper_08.wc.sdf.bak, matrix_mult_wrapper_08.bc.sdf.bak)
  • Used sed to clean up escaped array index notations inside the SDF files (\[number\][number])
  • This fixes issues where backslashes were added during SDF generation, which can cause simulators to misread bus or net names like data\[3\] instead of data[3]
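The cleanup described above amounts to removing the backslashes in front of bracketed indices. The exact sed expression used in this flow is not given, so the sketch below is an assumed equivalent written in Python; the function name `unescape_sdf_indices` is hypothetical, and it handles only the simple `name\[3\]` form from the example.

```python
import re

# Matches an escaped index such as \[3\] and captures the digits.
ESCAPED_INDEX = re.compile(r"\\\[(\d+)\\\]")


def unescape_sdf_indices(text):
    """Rewrite escaped indices like data\\[3\\] as data[3]; leave other text alone."""
    return ESCAPED_INDEX.sub(r"[\1]", text)


line = r"(INSTANCE u_pe/data\[3\])"
print(unescape_sdf_indices(line))  # (INSTANCE u_pe/data[3])
```

As in the flow above, keep a backup of the original SDF file before rewriting it in place.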

Post APR Verification

  • Added a simple OFFSET to prevent timing violations

Matrix Multiplier ASIC Implementation Results

[Figure: Final APR Result]

[Figure: Clock Tree]

Design Specification

  • Design: 4×4 Matrix Multiplier
  • Data Type: INT8 MAC Operations
  • Technology: TSMC 65nm GP
  • Tool: Synopsys ICC2 W-2024.09-SP3

Results

Area

  • Total Core Area: 29,859.84 μm² (0.0299 mm²)
  • Standard Cells: 9,356 cells
  • Utilization: 78.0%

Power

  • Total Power: 0.620 mW @ 0.9V, 125°C
    • Dynamic: 4.449e+05 nW (444.9 µW)
    • Leakage: 1.753e+05 nW (175.3 µW)

Performance

  • Frequency: 100 MHz
  • MAC Units: 16 (4×4 array)
  • Max Throughput: ~3.2 GOPS (INT8)
    • 16 PEs × 2 ops per MAC (multiply + add) × 100 MHz = 3.2 GOPS

Key Metrics

  • Power Efficiency @100MHz: 5.16 GOPS/mW
  • Area Efficiency @100MHz: 107 GOPS/mm²
  • DRC/LVS: Clean (0 violations)
  • Timing: All constraints met
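The throughput and efficiency figures follow directly from the numbers reported above. The short calculation below recomputes them as a sanity check; the variable names are illustrative, and it counts each MAC as two operations, as the throughput formula does.

```python
num_pes = 16                  # 4 x 4 array
ops_per_mac = 2               # one multiply plus one add
clock_hz = 100e6              # 100 MHz
total_power_mw = 0.620        # reported total power
core_area_um2 = 172.8 * 172.8 # target area, in square micrometres

core_area_mm2 = core_area_um2 / 1e6  # 1 mm^2 = 1e6 um^2
throughput_gops = num_pes * ops_per_mac * clock_hz / 1e9
power_eff = throughput_gops / total_power_mw  # GOPS/mW
area_eff = throughput_gops / core_area_mm2    # GOPS/mm^2

print(f"Core area:        {core_area_um2:.2f} um^2")
print(f"Throughput:       {throughput_gops:.1f} GOPS")
print(f"Power efficiency: {power_eff:.2f} GOPS/mW")
print(f"Area efficiency:  {area_eff:.0f} GOPS/mm^2")
```

The same check confirms that the reported dynamic and leakage power (444.9 µW + 175.3 µW) add up to the 0.620 mW total.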
