171 changes: 9 additions & 162 deletions README.md
@@ -1,172 +1,14 @@
# bitnet.cpp
[![License: MIT](https://img.shields.io/badge/license-MIT-blue.svg)](https://opensource.org/licenses/MIT)
![version](https://img.shields.io/badge/version-1.0-blue)

[<img src="./assets/header_model_release.png" alt="BitNet Model on Hugging Face" width="800"/>](https://huggingface.co/microsoft/BitNet-b1.58-2B-4T)

Try it out via this [demo](https://bitnet-demo.azurewebsites.net/), or build and run it on your own [CPU](https://github.com/microsoft/BitNet?tab=readme-ov-file#build-from-source) or [GPU](https://github.com/microsoft/BitNet/blob/main/gpu/README.md).

bitnet.cpp is the official inference framework for 1-bit LLMs (e.g., BitNet b1.58). It offers a suite of optimized kernels that support **fast** and **lossless** inference of 1.58-bit models on CPU and GPU (NPU support is coming next).

The first release of bitnet.cpp supports inference on CPUs. bitnet.cpp achieves speedups of **1.37x** to **5.07x** on ARM CPUs, with larger models experiencing greater performance gains. Additionally, it reduces energy consumption by **55.4%** to **70.0%**, further boosting overall efficiency. On x86 CPUs, speedups range from **2.37x** to **6.17x** with energy reductions between **71.9%** and **82.2%**. Furthermore, bitnet.cpp can run a 100B BitNet b1.58 model on a single CPU, achieving speeds comparable to human reading (5-7 tokens per second), significantly enhancing the potential for running LLMs on local devices. Please refer to the [technical report](https://arxiv.org/abs/2410.16144) for more details.

<img src="./assets/m2_performance.jpg" alt="m2_performance" width="800"/>
<img src="./assets/intel_performance.jpg" alt="m2_performance" width="800"/>

>The tested models are dummy setups used in a research context to demonstrate the inference performance of bitnet.cpp.

## Demo

A demo of bitnet.cpp running a BitNet b1.58 3B model on Apple M2:

https://github.com/user-attachments/assets/7f46b736-edec-4828-b809-4be780a3e5b1

## What's New:
- 05/20/2025 [BitNet Official GPU inference kernel](https://github.com/microsoft/BitNet/blob/main/gpu/README.md) ![NEW](https://img.shields.io/badge/NEW-red)
- 04/14/2025 [BitNet Official 2B Parameter Model on Hugging Face](https://huggingface.co/microsoft/BitNet-b1.58-2B-4T)
- 02/18/2025 [Bitnet.cpp: Efficient Edge Inference for Ternary LLMs](https://arxiv.org/abs/2502.11880)
- 11/08/2024 [BitNet a4.8: 4-bit Activations for 1-bit LLMs](https://arxiv.org/abs/2411.04965)
- 10/21/2024 [1-bit AI Infra: Part 1.1, Fast and Lossless BitNet b1.58 Inference on CPUs](https://arxiv.org/abs/2410.16144)
- 10/17/2024 bitnet.cpp 1.0 released.
- 03/21/2024 [The-Era-of-1-bit-LLMs__Training_Tips_Code_FAQ](https://github.com/microsoft/unilm/blob/master/bitnet/The-Era-of-1-bit-LLMs__Training_Tips_Code_FAQ.pdf)
- 02/27/2024 [The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits](https://arxiv.org/abs/2402.17764)
- 10/17/2023 [BitNet: Scaling 1-bit Transformers for Large Language Models](https://arxiv.org/abs/2310.11453)

## Acknowledgements

This project is based on the [llama.cpp](https://github.com/ggerganov/llama.cpp) framework. We would like to thank all the authors for their contributions to the open-source community. Also, bitnet.cpp's kernels are built on top of the Lookup Table methodologies pioneered in [T-MAC](https://github.com/microsoft/T-MAC/). For inference of general low-bit LLMs beyond ternary models, we recommend using T-MAC.
## Official Models
<table>
<tr>
<th rowspan="2">Model</th>
<th rowspan="2">Parameters</th>
<th rowspan="2">CPU</th>
<th colspan="3">Kernel</th>
</tr>
<tr>
<th>I2_S</th>
<th>TL1</th>
<th>TL2</th>
</tr>
<tr>
<td rowspan="2"><a href="https://huggingface.co/microsoft/BitNet-b1.58-2B-4T">BitNet-b1.58-2B-4T</a></td>
<td rowspan="2">2.4B</td>
<td>x86</td>
<td>&#9989;</td>
<td>&#10060;</td>
<td>&#9989;</td>
</tr>
<tr>
<td>ARM</td>
<td>&#9989;</td>
<td>&#9989;</td>
<td>&#10060;</td>
</tr>
</table>
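
As a reference for the table above, the upstream BitNet README prepares this model with its `setup_env.py` helper, selecting one of the listed kernels via `-q`. The commands below follow the upstream instructions and assume this fork keeps the same script and flags:

```bash
# Download the official GGUF weights, then set up the environment with the I2_S kernel
# (setup_env.py and its -md/-q flags are taken from the upstream README; pick the kernel
#  according to the table above, e.g. tl1 on ARM)
huggingface-cli download microsoft/BitNet-b1.58-2B-4T-gguf --local-dir models/BitNet-b1.58-2B-4T
python setup_env.py -md models/BitNet-b1.58-2B-4T -q i2_s
```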

## Supported Models
❗️**We use existing 1-bit LLMs available on [Hugging Face](https://huggingface.co/) to demonstrate the inference capabilities of bitnet.cpp. We hope the release of bitnet.cpp will inspire the development of 1-bit LLMs in large-scale settings in terms of model size and training tokens.**

<table>
<tr>
<th rowspan="2">Model</th>
<th rowspan="2">Parameters</th>
<th rowspan="2">CPU</th>
<th colspan="3">Kernel</th>
</tr>
<tr>
<th>I2_S</th>
<th>TL1</th>
<th>TL2</th>
</tr>
<tr>
<td rowspan="2"><a href="https://huggingface.co/1bitLLM/bitnet_b1_58-large">bitnet_b1_58-large</a></td>
<td rowspan="2">0.7B</td>
<td>x86</td>
<td>&#9989;</td>
<td>&#10060;</td>
<td>&#9989;</td>
</tr>
<tr>
<td>ARM</td>
<td>&#9989;</td>
<td>&#9989;</td>
<td>&#10060;</td>
</tr>
<tr>
<td rowspan="2"><a href="https://huggingface.co/1bitLLM/bitnet_b1_58-3B">bitnet_b1_58-3B</a></td>
<td rowspan="2">3.3B</td>
<td>x86</td>
<td>&#10060;</td>
<td>&#10060;</td>
<td>&#9989;</td>
</tr>
<tr>
<td>ARM</td>
<td>&#10060;</td>
<td>&#9989;</td>
<td>&#10060;</td>
</tr>
<tr>
<td rowspan="2"><a href="https://huggingface.co/HF1BitLLM/Llama3-8B-1.58-100B-tokens">Llama3-8B-1.58-100B-tokens</a></td>
<td rowspan="2">8.0B</td>
<td>x86</td>
<td>&#9989;</td>
<td>&#10060;</td>
<td>&#9989;</td>
</tr>
<tr>
<td>ARM</td>
<td>&#9989;</td>
<td>&#9989;</td>
<td>&#10060;</td>
</tr>
<tr>
<td rowspan="2"><a href="https://huggingface.co/collections/tiiuae/falcon3-67605ae03578be86e4e87026">Falcon3 Family</a></td>
<td rowspan="2">1B-10B</td>
<td>x86</td>
<td>&#9989;</td>
<td>&#10060;</td>
<td>&#9989;</td>
</tr>
<tr>
<td>ARM</td>
<td>&#9989;</td>
<td>&#9989;</td>
<td>&#10060;</td>
</tr>
<tr>
<td rowspan="2"><a href="https://huggingface.co/collections/tiiuae/falcon-edge-series-6804fd13344d6d8a8fa71130">Falcon-E Family</a></td>
<td rowspan="2">1B-3B</td>
<td>x86</td>
<td>&#9989;</td>
<td>&#10060;</td>
<td>&#9989;</td>
</tr>
<tr>
<td>ARM</td>
<td>&#9989;</td>
<td>&#9989;</td>
<td>&#10060;</td>
</tr>
</table>
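
For the community models listed above, the upstream README fetches them straight from Hugging Face through `setup_env.py`'s `--hf-repo` option; the example below mirrors that usage and assumes the option is unchanged in this fork:

```bash
# Fetch a supported 1-bit model from Hugging Face and quantize it with the I2_S kernel
# (--hf-repo and -q follow the upstream setup_env.py interface; assumed unchanged here)
python setup_env.py --hf-repo HF1BitLLM/Llama3-8B-1.58-100B-tokens -q i2_s
```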


# bitnet.cpp using MinGW
This repository is a fork of [BitNet](https://github.com/microsoft/BitNet).
I have made changes to the codebase and configured it to build with MinGW instead of Visual Studio 2022, so make sure you have [MinGW](https://sourceforge.net/projects/mingw/) installed and follow the instructions below.

## Installation

### Requirements
- python>=3.9
- cmake>=3.22
- clang>=18 (a quick way to verify these versions is shown after this list)
- For Windows users, install [Visual Studio 2022](https://visualstudio.microsoft.com/downloads/). In the installer, toggle on at least the following options (this also automatically installs the required additional tools like CMake):
    - Desktop-development with C++
    - C++-CMake Tools for Windows
    - Git for Windows
    - C++-Clang Compiler for Windows
    - MS-Build Support for LLVM-Toolset (clang)
- For Windows users, install [MinGW](https://sourceforge.net/projects/mingw/). In the installer, toggle on the Clang and C++ dependencies.
- For Debian/Ubuntu users, you can install LLVM/Clang with the [automatic installation script](https://apt.llvm.org/):

`bash -c "$(wget -O - https://apt.llvm.org/llvm.sh)"`
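
To confirm the toolchain matches the versions listed above (run from the MinGW/MSYS2 shell on Windows; on Debian/Ubuntu the Python interpreter may be `python3`):

```bash
# Quick sanity check of the required tool versions
python --version   # expect >= 3.9
cmake --version    # expect >= 3.22
clang --version    # expect >= 18
g++ --version      # MinGW g++ used in the CMake invocation below
```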
@@ -191,6 +33,11 @@ conda activate bitnet-cpp
pip install -r requirements.txt
```
3. Build the project
To configure and build the project, run:
```bash
# Configure with the MinGW Makefiles generator (GGML_NATIVE=ON builds with native CPU optimizations)
cmake -B build -S . -G "MinGW Makefiles" -DCMAKE_CXX_STANDARD=11 -DCMAKE_CXX_COMPILER=C:/msys64/bin/g++.exe -DCMAKE_CXX_FLAGS="-D_WIN32_WINNT=0x0602" -DGGML_NATIVE=ON
# Compile in Release mode
cmake --build build --config Release
```
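
Once the build finishes, the executables should end up under `build/bin` (this layout is inherited from llama.cpp and assumed rather than verified for this fork). After downloading a model in the next step, a quick generation can typically be started through the upstream `run_inference.py` wrapper, whose flags below are taken from the upstream README:

```bash
# Check that the build produced the llama.cpp-style binaries (path assumed from the upstream layout)
ls build/bin

# After the model download below, run a short interactive generation
# (run_inference.py and its -m/-p/-cnv flags follow the upstream README; assumed unchanged in this fork)
python run_inference.py -m models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf -p "You are a helpful assistant" -cnv
```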
```bash
# Manually download the model and run with local path
huggingface-cli download microsoft/BitNet-b1.58-2B-4T-gguf --local-dir models/BitNet-b1.58-2B-4T