- Educational implementation of a weight-stationary systolic array architecture for an ASIC design flow (TSMC 65nm). Because of NDA restrictions, only selected modules are shared publicly, and they have been converted to a FreePDK45 version.
The design implements a systolic array of processing elements (PEs) for efficient matrix multiplication.

| Feature | Description |
|---|---|
| Number of rows | 4 (Programmable by Designer) |
| Number of columns | 4 (Programmable by Designer) |
| Input data format | Signed Integer (2's Complement) |
| Input data width | 8 bits |
| Output data width | 8 bits |
| Handling data overflow | Saturation |
| Modes supported | Memory, External, & BiST |
| IO ports | See matrix_mult.sv |
| Reset | Active-Low Reset (Reset when 0) |
| Process node | TSMC 65GP / FreePDK45 |
| Clock frequency | 200 MHz |
| Timing model | NLDM |
| Power supply | 0.9 ~ 1.1 V |
| Highest metal allowed | M6 |
| Target Area | 172.8 μm x 172.8 μm |
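
The table specifies signed 8-bit inputs and an 8-bit output with saturation on overflow. A minimal Python sketch of that behavior follows, assuming the full-precision sum is clamped once at the output; whether the hardware clamps at every accumulation step or only at the end is not stated here. The names `saturate_int8` and `matmul_sat` are hypothetical and not part of the repository.

```python
INT8_MIN, INT8_MAX = -128, 127


def saturate_int8(value):
    """Clamp an integer to the signed 8-bit (two's complement) range."""
    return max(INT8_MIN, min(INT8_MAX, value))


def matmul_sat(a, b):
    """Reference A x B with every output element saturated to int8.

    Assumption: saturation is applied once, to the full-precision sum.
    """
    rows, inner, cols = len(a), len(b), len(b[0])
    return [
        [saturate_int8(sum(a[i][k] * b[k][j] for k in range(inner)))
         for j in range(cols)]
        for i in range(rows)
    ]


print(matmul_sat([[10, 11], [12, 13]], [[0, 1], [2, 3]]))  # [[22, 43], [26, 51]]
print(matmul_sat([[100, 100]], [[1], [1]]))                # [[127]] (200 clamps to 127)
```
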
- Idle
- Load
- Compute
- Python 3.8 (for test generation and verification)
- ICC2
- VCS
- Verdi
.
├── README.md
├── sim
│   ├── behav
│   ├── apr
│   └── syn
├── syn
├── apr
├── src
│   ├── Makefiles  # Makefiles
│   ├── syn  # TCL files
│   ├── apr  # TCL files
│   └── verilog  # SV files
│       ├── sdf.max.cfg
│       ├── sdf.min.cfg
│       ├── matrix_mult_wrapper.include  # Pre-synthesis include file
│       ├── matrix_mult_wrapper_syn.include  # Post-synthesis include file
│       ├── matrix_mult_wrapper_apr.include  # Post-APR include file
│       ├── memory/  # Memory emulator modules
│       ├── misc/  # Monitor module
│       ├── bist/  # Driver module
│       ├── matrix_mult/  # Matrix multiplication SystemVerilog files
│       │   ├── matrix_mult_wrapper.sv  # Top-level wrapper
│       │   ├── matrix_mult_pkg.sv  # Package file
│       │   ├── matrix_mult.sv  # Matrix multiplication unit
│       │   ├── matrix_mult_array.sv  # PE array with activation/weight inputs
│       │   ├── matrix_mult_control.sv  # Control & memory address generation
│       │   └── matrix_mult_pe.sv  # Processing element (PE)
│       ├── testcase/  # Edge case examples
│       ├── gold_result/  # Golden result examples
│       ├── tb_matrix_mult.sv  # Testbench (pre-synthesis)
│       ├── tb_matrix_mult_syn.sv  # Testbench (post-synthesis)
│       ├── tb_matrix_mult_sapr.sv  # Testbench (post-SAPR)
│       ├── tasks.sv  # Read, operate, and write tasks
│       ├── golden.py  # Golden data comparison
│       └── generate.py  # Random data generation
- Modify the `TESTNAME` in `./src/Makefiles/Makefile_sim_presyn` to one of the following:
  - Group A: `memory`, `external`, `offset_ext`, `offset_mem`
  - Group B: `bist`, `offset_bist`, `consec`, `reset`, `overwrite`, `long_act_ext`, `long_act_mem`, `long_act_bist`
- If Group A, run `cd sim/behav/` and then `make run_example`. If Group B, run `cd sim/behav/` and then `make run_behavior`.
- Check output files in `./results`.
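
The group rule above amounts to a lookup from test name to make target. The sketch below is illustrative only: `make_target_for` is a hypothetical helper that does not exist in the repository, and the test names are copied from the two groups listed above.

```python
GROUP_A = {"memory", "external", "offset_ext", "offset_mem"}
GROUP_B = {
    "bist", "offset_bist", "consec", "reset", "overwrite",
    "long_act_ext", "long_act_mem", "long_act_bist",
}


def make_target_for(testname):
    """Return the make target that matches a given TESTNAME."""
    if testname in GROUP_A:
        return "run_example"
    if testname in GROUP_B:
        return "run_behavior"
    raise ValueError(f"unknown TESTNAME: {testname!r}")


print(make_target_for("external"))  # run_example
print(make_target_for("consec"))    # run_behavior
```
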
Matrix multiplication A × B, where:
- A = `[[a, b], [c, d]]`
- B = `[[0, 1], [2, 3]]`
File: ./src/verilog/testcase/[number]_wb_init.mem
Note: `[number]` should start from 1.
Format:
B[1][1] B[1][0]
B[0][1] B[0][0]
Example (hex):
0302
0100
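
The weight file layout above (last row first, highest column index leftmost, one byte per element) can be reproduced with a few lines of Python. This is a minimal sketch; `weight_lines` is a hypothetical name, and extending the pattern beyond the 2×2 example is an assumption.

```python
def weight_lines(b):
    """Format weight matrix B as hex lines for a *_wb_init.mem file.

    Rows are emitted from last to first, and within each row the
    highest column index comes first. Values are encoded as 8-bit
    two's complement.
    """
    return [
        "".join(f"{value & 0xFF:02X}" for value in reversed(row))
        for row in reversed(b)
    ]


for line in weight_lines([[0, 1], [2, 3]]):
    print(line)
# 0302
# 0100
```
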
File: ./src/verilog/testcase/[number]_ib_init.mem
Note: `[number]` should start from 1.
Format:
X A[0][0]
A[0][1] A[1][0]
A[1][1] X
Example (hex):
000A
0B0C
0D00
Note: `X` represents unused/don't-care values, set to 0.
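
In the input file format above, the rightmost byte of each line carries column 0 of A, and each further column to the left is delayed by one more line, with don't-care slots written as 0. The sketch below reproduces that skew; `skewed_input_lines` is a hypothetical name, and generalizing from the 2×2 example to larger matrices is an assumption.

```python
def skewed_input_lines(a):
    """Format input matrix A as skewed hex lines for a *_ib_init.mem file.

    Lane k (counted from the right) streams column k of A, delayed by
    k lines. Empty slots are filled with 0. Values are encoded as
    8-bit two's complement.
    """
    rows, cols = len(a), len(a[0])
    lines = []
    for t in range(rows + cols - 1):
        lanes = []
        for k in reversed(range(cols)):  # highest lane is the leftmost byte
            r = t - k
            value = a[r][k] if 0 <= r < rows else 0
            lanes.append(f"{value & 0xFF:02X}")
        lines.append("".join(lanes))
    return lines


for line in skewed_input_lines([[0x0A, 0x0B], [0x0C, 0x0D]]):
    print(line)
# 000A
# 0B0C
# 0D00
```
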
Modify line 7 of `./src/Makefiles/Makefile_sim_presyn`:

`TESTNUM ?= [your_test_number]`

Replace `[your_test_number]` with the total number of custom test cases.
Follow the same procedure as in Run 5 Pre-defined Test Cases.
- Run `cd ../../syn/`, then `make design`, then `cd ../sim/syn`.
- Modify the `TESTNAME` in `./src/Makefiles/Makefile_sim_postsyn` as described above.
- If Group A, run `make run_example`. If Group B, run `make run_behavior`.
- Check output files in `./results`.
- Run `cd ../../apr/`, then `make design`, then `cd ../sim/apr`.
- Modify the `TESTNAME` in `./src/Makefiles/Makefile_sim_postapr` as described above.
- If Group A, run `make run_example`. If Group B, run `make run_behavior`.
- Check output files in `./results`.
- rtl, syn
  - Trimmed the RTL logic to reduce cell count and utilization.
  - Changed the target clock frequency to 100 MHz for the lowest area utilization and power.
- user_config.tcl
  - Updated to match the metal and area requirements.
- 04_place_opt.tcl
  - Enabled congestion-layer-aware placement.
  - Added a refine_placement step with high effort to reduce congestion.
  - Maintained the timing update and PG connectivity check.
- 05_clock_opt.tcl
  - Added a clock tree skew target (set_clock_tree_options -target_skew 0.03).
  - Increased hold optimization effort (opt.common.hold_effort = high).
  - Maintained all existing clock routing and reporting configs.
- 06_route.tcl
  - Added a CCD-based hold fixing flow with escalating effort (high → ultra).
  - Marked hold buffer cells and enabled the CCD timing flow.
  - Performed iterative route_opt + update_timing cycles.
- 08_report.tcl
  - Removed the fanout threshold option (-threshold 64) from the high-fanout net report.
  - Added congestion and placement reports.
- Created backup copies of the original SDF files (matrix_mult_wrapper_08.wc.sdf.bak, matrix_mult_wrapper_08.bc.sdf.bak).
- Used sed to clean up escaped array index notation inside the SDF files (`\[number\]` → `[number]`).
  - This fixes issues where backslashes were added during SDF generation, which can cause simulators to misread bus or net names such as `data\[3\]` instead of `data[3]`.
- Added a simple OFFSET to prevent timing violations.
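
The SDF clean-up above was done with sed. For reference, an equivalent substitution can be sketched in Python. It assumes the only escapes to remove are backslashes directly before `[` and `]` around a decimal index; `unescape_sdf_indices` and the sample instance path are hypothetical.

```python
import re

# Matches a backslash-escaped index such as \[3\] and captures the digits.
_ESCAPED_INDEX = re.compile(r"\\\[(\d+)\\\]")


def unescape_sdf_indices(text):
    """Rewrite \\[N\\] as [N] so bus and net names read as data[3]."""
    return _ESCAPED_INDEX.sub(r"[\1]", text)


sample = r"(INSTANCE u_pe/data_reg\[3\])"  # hypothetical SDF fragment
print(unescape_sdf_indices(sample))        # (INSTANCE u_pe/data_reg[3])
```

As in the original flow, keeping a `.bak` copy of each SDF file before rewriting it makes the change easy to revert.
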
[Figure: Final APR Result] [Figure: Clock Tree]
- Design: 4×4 Matrix Multiplier
- Data Type: INT8 MAC Operations
- Technology: TSMC 65nm GP
- Tool: Synopsys ICC2 W-2024.09-SP3
- Total Core Area: 29,859.84 μm² (0.0299 mm²)
- Standard Cells: 9,356 cells
- Utilization: 78.0%
- Total Power: 0.620 mW @ 0.9V, 125°C
  - Dynamic: 4.449e+05 nW (444.9 µW)
  - Leakage: 1.753e+05 nW (175.3 µW)
- Frequency: 100 MHz
- MAC Units: 16 (4×4 array)
- Max Throughput: ~3.2 GOPS (INT8)
  - 16 PEs × 2 ops per MAC (multiply + add) × 100 MHz = 3.2 GOPS
- Power Efficiency @100MHz: 5.16 GOPS/mW
- Area Efficiency @100MHz: 107 GOPS/mm²
- DRC/LVS: Clean (0 violations)
- Timing: All constraints met
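
The throughput and efficiency figures above follow from the reported numbers by simple arithmetic. The sketch below recomputes them, counting each MAC as two operations (one multiply, one add); every input is a value listed above.

```python
num_pes = 16             # 4x4 array
ops_per_mac = 2          # multiply + add
freq_hz = 100e6          # 100 MHz
power_mw = 0.620         # total power
area_mm2 = 29859.84e-6   # core area, converted from µm² to mm²

gops = num_pes * ops_per_mac * freq_hz / 1e9

print(f"Throughput:       {gops:.1f} GOPS")                  # 3.2 GOPS
print(f"Power efficiency: {gops / power_mw:.2f} GOPS/mW")    # 5.16 GOPS/mW
print(f"Area efficiency:  {gops / area_mm2:.0f} GOPS/mm^2")  # 107 GOPS/mm^2
```
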


