Terafly enables high-throughput, low-latency inference of Large Language Models (LLMs) by leveraging a multi-node FPGA architecture optimized for cooperative execution.
We provide HLS kernels that can be rapidly customized for research purposes, enabling efficient experimentation and algorithm validation on FPGAs.
Terafly is designed to maximize memory bandwidth and computational efficiency on FPGA platforms—specifically targeting embedded and datacenter FPGAs like the Xilinx Alveo U50lv. It supports end-to-end LLM inference with minimal host intervention, and includes tooling for weight packing, hardware generation, and interactive demo deployment.
If you're exploring FPGA-based LLM acceleration, you might also be interested in our related work LoopLynx, a scalable dataflow architecture for efficient LLM inference (DATE 2025); see the citation at the end of this README.
To ensure compatibility, we recommend replicating our experimental environment:
| Component | Version / Configuration |
|---|---|
| OS | Ubuntu 18.04 |
| Shell | xilinx_u50lv_gen3x4_xdma_base_2 |
| XRT | 2023.2 |
| Vitis HLS & Vivado | 2023.2 |
💡 Ensure your Alveo U50lv card is properly flashed with the matching shell.
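
To confirm that the card is visible to the host and flashed with the expected shell, the standard XRT utilities can be used (these ship with XRT itself, not with this repository; the device address below is a placeholder):

```bash
# Standard XRT tooling; replace the BDF with your own card's address.
xbutil examine                           # lists devices, installed shell, and XRT version
xbutil validate --device 0000:3b:00.1    # runs the platform's built-in validation tests
```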
| File/Directory | Description |
|---|---|
| `template/` | Template HLS code used by the generation framework. |
| `OPT-1.3b_optimize/` | Directory for the generated code tailored for the Vitis development flow. |
| `LLM-demo-gui/` | Contains files for WebUI interaction. |
| `OPT-1.3b_optimize/connectivity.cfg` | Configuration file to specify the multi-node accelerator topology. |
| `codegen.py` | Python script to modify the template based on configuration. |
| `OPT-1.3b.json` | Configuration file to specify performance and model parameters. |
| `weight_packer.py` | Python script to pack model weights into the Terafly memory layout. |
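
The table above describes a generate-then-pack flow: `codegen.py` specializes the HLS templates under `template/` according to `OPT-1.3b.json`, and `weight_packer.py` rearranges the model weights into the memory layout the accelerator expects. The scripts define their own command-line interfaces, so the sketch below only illustrates the intended order of operations; the argument forms are assumptions, not the documented CLI.

```bash
# Illustrative sketch only: check each script's source or --help for the
# actual interface. The JSON configuration drives both steps.
python codegen.py OPT-1.3b.json        # emit specialized HLS code for the Vitis flow
python weight_packer.py OPT-1.3b.json  # pack OPT-1.3B weights into Terafly's memory layout
```

The `connectivity.cfg` file follows the standard Vitis `v++ --config` syntax. Below is a minimal example of what a multi-node topology specification can look like; the kernel and port names are placeholders, not Terafly's actual kernel names.

```
[connectivity]
# Two compute-node kernel instances pinned to different SLRs and chained
# via an AXI stream ("node" and the port names are placeholders).
nk=node:2:node_0.node_1
slr=node_0:SLR0
slr=node_1:SLR1
sc=node_0.out:node_1.in
```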
Follow these steps to quickly set up and run the Terafly accelerator.
Download the pre-packed model weights (OPT-1.3B) from the provided link:
Model Weights Download (Password: bcbf).
Navigate to the optimized code directory and run the compilation command. This will automatically generate the xclbin file and program your Alveo card.

```bash
cd OPT-1.3b_optimize/
make run
```

Compile and execute the host-side application to run the LAMBADA benchmark.

- Note: Check `tokenizer_predict_eigen.cpp` to verify that the code correctly loads the packed data.

```bash
cd tokenizer/
sh ./command.sh
```

You can also interact with the LLM via a WebUI interface:
- Start the Python server (requires `python==3.6`).
- Open the web interface in your browser: `LLM-demo-gui/llm-gui/web/index.html`. (Please open the HTML file directly in your browser to chat with the LLM.)

```bash
cd LLM-demo-gui/alveo
python client-v3.py   # requires python==3.6
```

If you find Terafly or LoopLynx useful in your research or project, please cite our papers. We appreciate your interest in our work!
```bibtex
@ARTICLE{Terafly,
  author={Zheng, Jianing and Chen, Gang and Huang, Libo and Lou, Xin and Zheng, Wei-Shi},
  journal={IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems},
  title={Terafly: A Multi-Node FPGA Based Accelerator Design for Efficient Cooperative Inference in LLMs},
  year={2025},
  volume={},
  number={},
  pages={1-1}
}

@inproceedings{LoopLynx,
  author    = {Jianing Zheng and Gang Chen},
  title     = {LoopLynx: {A} Scalable Dataflow Architecture for Efficient {LLM} Inference},
  booktitle = {Design, Automation {\&} Test in Europe Conference, {DATE} 2025, Lyon, France, March 31 - April 2, 2025},
  pages     = {1--7},
  publisher = {{IEEE}},
  year      = {2025}
}
```