Terafly enables high-throughput, low-latency inference of Large Language Models (LLMs) by leveraging a multi-node FPGA architecture optimized for cooperative execution.
We provide HLS kernels that can be rapidly customized for research purposes, enabling efficient experimentation and algorithm validation on FPGAs.
Terafly is designed to maximize memory bandwidth and computational efficiency on FPGA platforms—specifically targeting embedded and datacenter FPGAs like the Xilinx Alveo U50lv. It supports end-to-end LLM inference with minimal host intervention, and includes tooling for weight packing, hardware generation, and interactive demo deployment.
If you're exploring FPGA-based LLM acceleration, you might also be interested in our related work LoopLynx, a scalable dataflow architecture for efficient LLM inference (DATE 2025); see the citation at the end of this README.
To ensure compatibility, we recommend replicating our experimental environment:
| Component | Version / Configuration |
|---|---|
| OS | Ubuntu 18.04 |
| Shell | xilinx_u50lv_gen3x4_xdma_base_2 |
| XRT | 2023.2 |
| Vitis HLS & Vivado | 2023.2 |
💡 Ensure your Alveo U50lv card is properly flashed with the matching shell.
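
To confirm that the card is visible to the host and flashed with the expected shell, the standard XRT utilities can be used (these ship with XRT itself, not with this repository; the device address below is a placeholder):

```bash
# Standard XRT tooling; replace the BDF with your own card's address.
xbutil examine                           # lists devices, installed shell, and XRT version
xbutil validate --device 0000:3b:00.1    # runs the platform's built-in validation tests
```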
| File/Directory | Description |
|---|---|
| `template/` | Template HLS code used by the generation framework. |
| `OPT-1.3b_optimize/` | Directory for the generated code tailored for the Vitis development flow. |
| `LLM-demo-gui/` | Contains files for WebUI interaction. |
| `OPT-1.3b_optimize/connectivity.cfg` | Configuration file to specify the multi-node accelerator topology. |
| `codegen.py` | Python script to modify the template based on configuration. |
| `OPT-1.3b.json` | Configuration file to specify performance and model parameters. |
| `weight_packer.py` | Python script to pack model weights into the Terafly memory layout. |
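
The table above describes a generate-then-pack flow: `codegen.py` specializes the HLS templates under `template/` according to `OPT-1.3b.json`, and `weight_packer.py` rearranges the model weights into the memory layout the accelerator expects. The scripts define their own command-line interfaces, so the sketch below only illustrates the intended order of operations; the argument forms are assumptions, not the documented CLI.

```bash
# Illustrative sketch only: check each script's source or --help for the
# actual interface. The JSON configuration drives both steps.
python codegen.py OPT-1.3b.json        # emit specialized HLS code for the Vitis flow
python weight_packer.py OPT-1.3b.json  # pack OPT-1.3B weights into Terafly's memory layout
```

The `connectivity.cfg` file follows the standard Vitis `v++ --config` syntax. Below is a minimal example of what a multi-node topology specification can look like; the kernel and port names are placeholders, not Terafly's actual kernel names.

```
[connectivity]
# Two compute-node kernel instances pinned to different SLRs and chained
# via an AXI stream ("node" and the port names are placeholders).
nk=node:2:node_0.node_1
slr=node_0:SLR0
slr=node_1:SLR1
sc=node_0.out:node_1.in
```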
Follow these steps to quickly set up and run the Terafly accelerator.
Download the pre-packed model weights (OPT-1.3B) from the provided link:
Model Weights Download (Password: bcbf).
Navigate to the optimized code directory and run the compilation command. This will automatically generate the xclbin file and program your Alveo card.

```bash
cd OPT-1.3b_optimize/
make run
```

Compile and execute the host-side application to run the LAMBADA benchmark.

- Note: Check `tokenizer_predict_eigen.cpp` to verify that the code correctly loads the packed data.

```bash
cd tokenizer/
sh ./command.sh
```

You can also interact with the LLM via a WebUI interface:
- Start the Python server (requires `python==3.6`).
- Open the web interface in your browser: `LLM-demo-gui/llm-gui/web/index.html`. (Please open the HTML file directly in your browser to chat with the LLM.)

```bash
cd LLM-demo-gui/alveo
python client-v3.py   # requires python==3.6
```

If you find Terafly or LoopLynx useful in your research or project, please cite our papers. We appreciate your interest in our work!
```bibtex
@ARTICLE{Terafly,
  author={Zheng, Jianing and Chen, Gang and Huang, Libo and Lou, Xin and Zheng, Wei-Shi},
  journal={IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems},
  title={Terafly: A Multi-Node FPGA Based Accelerator Design for Efficient Cooperative Inference in LLMs},
  year={2025},
  volume={},
  number={},
  pages={1-1}
}

@inproceedings{LoopLynx,
  author    = {Jianing Zheng and Gang Chen},
  title     = {LoopLynx: {A} Scalable Dataflow Architecture for Efficient {LLM} Inference},
  booktitle = {Design, Automation {\&} Test in Europe Conference, {DATE} 2025, Lyon, France, March 31 - April 2, 2025},
  pages     = {1--7},
  publisher = {{IEEE}},
  year      = {2025}
}
```