Releases: AmpereComputingAI/llama.cpp

v3.4.0

26 Nov 12:51
9af0385

Based on ggml-org/llama.cpp b6735 (https://github.com/ggml-org/llama.cpp/releases/tag/b6735)

  • Fixed Flash Attention for SWA models
  • New Flash Attention algorithm, optimized for long contexts (above 1024 tokens). See the
    "Flash Attention algorithm selection" section for details on how to select the attention
    algorithm manually.

Also available at: DockerHub

v3.3.1

15 Oct 16:32
6219c16

Also available at: DockerHub

v3.3.0

09 Oct 12:54
6219c16

Also available at: DockerHub

v3.2.1

03 Sep 10:24
ecbcf6e

Also available at: DockerHub

v3.2.0

06 Aug 21:39
ecbcf6e

Also available at: DockerHub

v3.1.2

07 Jul 12:40
aa0a5d7

Also available at: DockerHub

v3.1.0

11 Jun 21:21
aa0a5d7

Also available at: DockerHub

v2.2.1

03 Jun 15:44
aa0a5d7

Update benchmark.py

v2.0.0

23 Sep 20:15
4f32b2c

  • Upgraded upstream tag enables Llama 3.1 in ollama
  • Support for the AmpereOne platform
  • Breaking change: due to changed weight type IDs, models must be re-quantized to the Q8R16 and Q4_K_4 formats with the current llama-quantize tool (see the sketch after this list).
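A minimal re-quantization sketch, assuming the positional usage of llama-quantize (input GGUF, output GGUF, target type) matches upstream llama.cpp; the file paths below are placeholders, and Q8R16 / Q4_K_4 are the type names from the note above:

    # Re-quantize an existing F16 GGUF model to the Ampere-optimized Q8R16 format
    ./llama-quantize ./models/my-model-f16.gguf ./models/my-model-q8r16.gguf Q8R16

    # Likewise for the Q4_K_4 format
    ./llama-quantize ./models/my-model-f16.gguf ./models/my-model-q4k4.gguf Q4_K_4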

v1.2.6

16 Jul 23:03
06e1efb

Create README.md