Weight Stationary - Systolic Array in 65nm TSMC

Educational implementation of a weight-stationary systolic array architecture for an ASIC design flow (65nm TSMC). Because of NDA restrictions, only selected modules are shared publicly, and they have been converted to a FreePDK45 version.

Design Overview

Systolic Array Architecture

The design implements a systolic array of processing elements (PEs) for efficient matrix multiplication.

Feature	Description
Number of rows	4 (programmable by the designer)
Number of columns	4 (programmable by the designer)
Input data format	Signed Integer (2's Complement)
Input data width	8 bits
Output data width	8 bits
Handling data overflow	Saturation
Modes supported	Memory, External, & BiST
IO ports	See matrix_mult.sv
Reset	Active-Low Reset (Reset when 0)
Process node	TSMC 65GP / FreePDK45
Clock frequency	200 MHz
Timing model	NLDM
Power supply	0.9 ~ 1.1 V
Highest metal allowed	M6
Target Area	172.8 um x 172.8 um
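
To make the weight-stationary dataflow concrete, here is a minimal cycle-level sketch in Python. It is not derived from matrix_mult_array.sv: the orientation (PE[k][n] holds weight B[k][n], activations move left to right, partial sums move top to bottom, one register per hop) is an assumption, and saturation is deliberately left out so that only the data movement is shown.

```python
def systolic_matmul(A, B):
    """Cycle-level model of a weight-stationary array computing C = A x B.

    Assumed (hypothetical) organisation: PE[k][n] stores B[k][n]; row k of the
    array receives column k of A, delayed by k cycles; partial sums flow down.
    """
    M, K, N = len(A), len(B), len(B[0])
    act = [[0] * N for _ in range(K)]    # activation register in each PE
    psum = [[0] * N for _ in range(K)]   # partial-sum register in each PE
    C = [[0] * N for _ in range(M)]

    for t in range(M + K + N - 2):       # cycles until the last result drains
        new_act = [[0] * N for _ in range(K)]
        new_psum = [[0] * N for _ in range(K)]
        for k in range(K):
            for n in range(N):
                if n == 0:
                    # Skewed input: row k sees A[t - k][k], zero otherwise
                    m = t - k
                    a_in = A[m][k] if 0 <= m < M else 0
                else:
                    a_in = act[k][n - 1]                 # from the left neighbour
                p_in = psum[k - 1][n] if k > 0 else 0    # from the PE above
                new_act[k][n] = a_in
                new_psum[k][n] = p_in + a_in * B[k][n]   # stationary weight
        act, psum = new_act, new_psum

        # The bottom row now holds finished results for output row m
        for n in range(N):
            m = t - (K - 1) - n
            if 0 <= m < M:
                C[m][n] = psum[K - 1][n]
    return C


if __name__ == "__main__":
    print(systolic_matmul([[1, 2], [3, 4]], [[0, 1], [2, 3]]))
```

Under these assumptions, the one-cycle delay per input row is what produces a diagonal (skewed) activation layout in memory.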

Control States in MatMul Array

Idle
Load
Compute
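
The three states above can be sketched as a small state machine. The transition conditions are not documented here, so the signal names `start`, `load_done`, and `compute_done` below are hypothetical placeholders, not ports of matrix_mult_control.sv.

```python
from enum import Enum


class State(Enum):
    IDLE = 0
    LOAD = 1
    COMPUTE = 2


def next_state(state, start=False, load_done=False, compute_done=False):
    """Return the next control state (assumed transition conditions)."""
    if state is State.IDLE:
        # Wait for a start request before loading weights
        return State.LOAD if start else State.IDLE
    if state is State.LOAD:
        # Hold until all weights are latched into the PEs
        return State.COMPUTE if load_done else State.LOAD
    # COMPUTE: stream activations until the last result has drained
    return State.IDLE if compute_done else State.COMPUTE
```

A simple Idle → Load → Compute → Idle cycle like this is one plausible ordering; the actual controller may differ, for example by skipping Load when the weights are reused.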

Verification with Golden Model
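
The contents of golden.py are not reproduced in this README. As a rough sketch of the kind of reference computation a golden model for this design might perform, the snippet below multiplies two signed integer matrices and saturates each result to 8 bits. It assumes the sum is accumulated at full precision and clamped only once at the output; the hardware may instead saturate at a different point in the datapath.

```python
INT8_MIN, INT8_MAX = -128, 127


def saturate_int8(value):
    """Clamp an integer to the signed 8-bit range."""
    return max(INT8_MIN, min(INT8_MAX, value))


def golden_matmul(A, B):
    """Reference A x B with saturation on the final sum (an assumption)."""
    K, N = len(B), len(B[0])
    return [
        [saturate_int8(sum(row[k] * B[k][n] for k in range(K))) for n in range(N)]
        for row in A
    ]


if __name__ == "__main__":
    print(golden_matmul([[1, 2], [3, 4]], [[0, 1], [2, 3]]))
    print(golden_matmul([[100, 100]], [[2], [2]]))   # overflows, so it clamps
```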

Tools & Dependencies

Python 3.8 (for test generation and verification)
ICC2
VCS
Verdi

Project Structure

.
├── README.md
├── sim
│   ├── behav
│   ├── apr
│   └── syn
├── syn
├── apr
├── src
│   ├── Makefiles                              # Makefiles
│   ├── syn                                    # TCL files
│   ├── apr                                    # TCL files
│   └── verilog                                # SV files
│       ├── sdf.max.cfg                        
│       ├── sdf.min.cfg                        
│       ├── matrix_mult_wrapper.include        # Pre-synthesis include file
│       ├── matrix_mult_wrapper_syn.include    # Post-synthesis include file
│       ├── matrix_mult_wrapper_apr.include    # Post-APR include file
│       ├── memory/                            # Memory emulator modules
│       ├── misc/                              # Monitor module         
│       ├── bist/                              # Driver module          
│       ├── matrix_mult/                       # Matrix multiplication SystemVerilog files
│       │   ├── matrix_mult_wrapper.sv         # Top-level wrapper
│       │   ├── matrix_mult_pkg.sv             # Package file
│       │   ├── matrix_mult.sv                 # Matrix multiplication unit
│       │   ├── matrix_mult_array.sv           # PE array with activation/weight inputs
│       │   ├── matrix_mult_control.sv         # Control & memory address generation
│       │   └── matrix_mult_pe.sv              # Processing element (PE)
│       ├── testcase/                          # Edge case examples
│       ├── gold_result/                       # Golden result examples
│       ├── tb_matrix_mult.sv                  # Testbench (pre-synthesis)
│       ├── tb_matrix_mult_syn.sv              # Testbench (post-synthesis)
│       ├── tb_matrix_mult_sapr.sv             # Testbench (post-SAPR)
│       ├── tasks.sv                           # Read, operate, and write tasks
│       ├── golden.py                          # Golden data comparison
│       └── generate.py                        # Random data generation

Pre-Synthesis Simulation

Run 5 Pre-defined Test Cases

Modify the TESTNAME in ./src/Makefiles/Makefile_sim_presyn to one of the following:
- Group A: memory, external, offset_ext, offset_mem
- Group B: bist, offset_bist, consec, reset, overwrite, long_act_ext, long_act_mem, long_act_bist
If Group A:
```
cd sim/behav/
make run_example
```
If Group B:
```
cd sim/behav/
make run_behavior
```
Check output files in ./results.

Creating Custom Test Cases

Example

Matrix multiplication A × B, where:

A = [[a, b], [c, d]]
B = [[0, 1], [2, 3]]

1. Weight Memory Format

File: ./src/verilog/testcase/[number]_wb_init.mem

Note: [number] should start from 1.

Format:

B[1][1]B[1][0]
B[0][1]B[0][0]

Example (hex):

0302
0100
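
The weight layout above can be produced with a short helper. This is a sketch, not the repository's generate.py: the function names are hypothetical, and it assumes the 2×2 pattern (rows written last to first, bytes within a line written from the highest column to the lowest) extends unchanged to larger arrays.

```python
def to_hex_byte(value):
    """Encode a signed 8-bit integer as two hex digits (two's complement)."""
    if not -128 <= value <= 127:
        raise ValueError(f"{value} does not fit in a signed 8-bit integer")
    return format(value & 0xFF, "02X")


def pack_weights(B):
    """Return the lines of a *_wb_init.mem file for weight matrix B."""
    return ["".join(to_hex_byte(v) for v in reversed(row)) for row in reversed(B)]


if __name__ == "__main__":
    print("\n".join(pack_weights([[0, 1], [2, 3]])))
```

For B = [[0, 1], [2, 3]] this yields the two lines of the example, 0302 followed by 0100.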

2. Input Memory Format

File: ./src/verilog/testcase/[number]_ib_init.mem

Note: [number] should start from 1.

Format:

X        A[0][0]
A[0][1]  A[1][0]
A[1][1]  X

Example (hex):

000A
0B0C
0D00

Note: X represents unused/don't-care values, set to 0.
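
The skewed input layout above can be generated as follows. This is a sketch under assumptions: the function name is hypothetical, and it generalizes the 2×2 example by assuming that column k of A is delayed by k lines and that each line lists the highest column first.

```python
def pack_inputs(A):
    """Return the lines of a *_ib_init.mem file for activation matrix A.

    Line t holds, from left to right, the bytes for columns K-1 .. 0, where
    column k carries A[t - k][k] and unused slots (X) are written as 00.
    """
    M, K = len(A), len(A[0])
    lines = []
    for t in range(M + K - 1):
        fields = []
        for k in reversed(range(K)):
            m = t - k
            value = A[m][k] if 0 <= m < M else 0   # X -> 0
            fields.append(format(value & 0xFF, "02X"))
        lines.append("".join(fields))
    return lines


if __name__ == "__main__":
    # a, b, c, d chosen as 0x0A..0x0D to match the example above
    print("\n".join(pack_inputs([[0x0A, 0x0B], [0x0C, 0x0D]])))
```

With A = [[0x0A, 0x0B], [0x0C, 0x0D]] the helper reproduces the three example lines 000A, 0B0C, and 0D00.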

3. Update Makefile

Modify line 7 of ./src/Makefiles/Makefile_sim_presyn:

TESTNUM ?= [your_test_number]

Replace [your_test_number] with the total number of custom test cases.

4. Run Custom Test

Follow the same procedure as in Run 5 Pre-defined Test Cases.

Synthesis

cd ../../syn/
make design

Post-Synthesis Simulation

cd ../sim/syn

Modify the TESTNAME in ./src/Makefiles/Makefile_sim_postsyn as described above.

If Group A:

make run_example

If Group B:

make run_behavior

Check output files in ./results.

Auto Place and Route

cd ../../apr/
make design

Post-APR Simulation

cd ../sim/apr

Modify the TESTNAME in ./src/Makefiles/Makefile_sim_postapr as described above.

If Group A:

make run_example

If Group B:

make run_behavior

Check output files in ./results.

APR Update Summary

Modified Files and Key Changes:

rtl, syn

Trimmed the RTL logic to reduce cell count and utilization.
Changed the target clock frequency to 100 MHz to minimize area utilization and power.

user_config.tcl

Updated to match the metal-layer and area requirements

04_place_opt.tcl

Enabled congestion-layer-aware placement
Added refine_placement step with high effort to reduce congestion
Maintained timing update and PG connectivity check

05_clock_opt.tcl

Added clock tree skew target (set_clock_tree_options -target_skew 0.03)
Increased hold optimization effort (opt.common.hold_effort = high)
Maintained all existing clock routing and reporting configs

06_route.tcl

Added CCD-based hold fixing flow with escalating effort (high → ultra)
Marked hold buffer cells and enabled CCD timing flow
Performed iterative route_opt + update_timing cycles

08_report.tcl

Removed fanout threshold option (-threshold 64) from high-fanout net report
Added congestion and placement reports

SDF post-processing

Created backup copies of the original SDF files (matrix_mult_wrapper_08.wc.sdf.bak, matrix_mult_wrapper_08.bc.sdf.bak)
Used sed to clean up escaped array index notations inside the SDF files (\[number\] → [number])
This fixes issues where backslashes were added during SDF generation, which can cause simulators to misread bus or net names like data\[3\] instead of data[3]
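
For readers who prefer a script to sed, the same cleanup can be sketched in Python. The exact sed expression used in the flow is not shown here, so the pattern below is an assumption: it rewrites only escaped numeric indices of the form \[number\], and the function name is hypothetical.

```python
import re

# Matches a backslash-escaped numeric index such as \[3\]
_ESCAPED_INDEX = re.compile(r"\\\[(\d+)\\\]")


def unescape_indices(text):
    """Replace every \\[number\\] in text with [number]."""
    return _ESCAPED_INDEX.sub(r"[\1]", text)


if __name__ == "__main__":
    print(unescape_indices(r"data\[3\]"))
```

The function works on a string, so it could be applied to the contents of an SDF file after the .bak copy has been made.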

Post APR Verification

Added a simple OFFSET to prevent timing violations.

Matrix Multiplier ASIC Implementation Results

[Figure: Final APR Result]

[Figure: Clock Tree]

Design Specification

Design: 4×4 Matrix Multiplier
Data Type: INT8 MAC Operations
Technology: TSMC 65nm GP
Tool: Synopsys ICC2 W-2024.09-SP3

Results

Area

Total Core Area: 29,859.84 μm² (0.0299 mm²)
Standard Cells: 9,356 cells
Utilization: 78.0%

Power

Total Power: 0.620 mW @ 0.9V, 125°C
- Dynamic: 4.449e+05 nW (444.9 µW)
- Leakage: 1.753e+05 nW (175.3 µW)

Performance

Frequency: 100 MHz
MAC Units: 16 (4×4 array)
Max Throughput: ~3.2 GOPS (INT8)
- 16 PEs × 2 ops per MAC (multiply + add) × 100 MHz = 3.2 GOPS

Key Metrics

Power Efficiency @100MHz: 5.16 GOPS/mW
Area Efficiency @100MHz: 107 GOPS/mm²
DRC/LVS: Clean (0 violations)
Timing: All constraints met
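
The headline numbers above can be cross-checked with a few lines of arithmetic. The inputs are the figures reported in this README. Two choices here are interpretation rather than stated fact: counting each MAC as two operations, and deriving the core area from the 172.8 µm × 172.8 µm target, which matches the reported 29,859.84 µm².

```python
num_pes = 16             # 4 x 4 array
ops_per_mac = 2          # one multiply + one add (assumed counting convention)
freq_hz = 100e6          # 100 MHz

dynamic_uw = 444.9
leakage_uw = 175.3
side_um = 172.8          # core edge length taken from the target area

throughput_gops = num_pes * ops_per_mac * freq_hz / 1e9
total_power_mw = (dynamic_uw + leakage_uw) / 1000
core_area_um2 = side_um * side_um
core_area_mm2 = core_area_um2 / 1e6

power_eff = throughput_gops / total_power_mw    # GOPS/mW
area_eff = throughput_gops / core_area_mm2      # GOPS/mm^2

if __name__ == "__main__":
    print(f"Throughput:       {throughput_gops:.1f} GOPS")
    print(f"Total power:      {total_power_mw:.3f} mW")
    print(f"Core area:        {core_area_um2:.2f} um^2")
    print(f"Power efficiency: {power_eff:.2f} GOPS/mW")
    print(f"Area efficiency:  {area_eff:.0f} GOPS/mm^2")
```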
