Releases · AmpereComputingAI/llama.cpp
v3.4.0
Based on ggml-org/llama.cpp b6735 (https://github.com/ggml-org/llama.cpp/releases/tag/b6735)
- Flash Attention for SWA models fixed
- New Flash Attention algorithm, optimized for long contexts (above 1024 tokens). See the "Flash Attention algorithm selection" section for details on how to select the attention algorithm manually; a minimal long-context invocation sketch follows below.
Also available at: DockerHub
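A minimal sketch of running with a context longer than 1024 tokens, where the new algorithm is intended to apply. It assumes the standard llama-cli flags from upstream llama.cpp and a placeholder model path; the Ampere-specific algorithm override itself is described in the fork's "Flash Attention algorithm selection" section.

```bash
# Run with a 4096-token context so the long-context Flash Attention path is exercised.
# The model path is a placeholder; adjust to your own GGUF file.
./llama-cli -m ./models/my-model-Q8R16.gguf -c 4096 -n 256 \
    -p "Summarize the following document: ..."
```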
v3.3.1
v3.3.0
v3.2.1
v3.2.0
v3.1.2
v3.1.0
v2.2.1
Update benchmark.py
v2.0.0
- The upgraded upstream tag enables Llama 3.1 in ollama
- Support for the AmpereOne platform
- Breaking change: due to changed weight type IDs, models must be re-quantized to the Q8R16 and Q4_K_4 formats with the current llama-quantize tool (see the sketch below).
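A minimal re-quantization sketch, assuming an existing F16 GGUF model and the llama-quantize binary built from this repository; the file names are placeholders, and Q8R16 / Q4_K_4 are the target types named in these notes.

```bash
# Re-quantize an existing F16 GGUF model into the Ampere-optimized formats.
# Usage: llama-quantize <input.gguf> <output.gguf> <type>
./llama-quantize ./models/my-model-F16.gguf ./models/my-model-Q8R16.gguf Q8R16
./llama-quantize ./models/my-model-F16.gguf ./models/my-model-Q4_K_4.gguf Q4_K_4
```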
v1.2.6
Create README.md