Based on ggml-org/llama.cpp b6735 (https://github.com/ggml-org/llama.cpp/releases/tag/b6735)
- Fixed Flash Attention for SWA (sliding-window attention) models
- New Flash Attention algorithm, optimized for long contexts (above 1024 tokens). See the
"Flash Attention algorithm selection" section for details on how to select the attention
algorithm manually.
Also available at: DockerHub