# Inference Speed Tests on Local LLMs

Inference speed tests for local large language models on various devices. Feel free to contribute your results.

**Note:** None of the following results have been verified.

All models were tested with the same prompt: `Write a 500 word story`.

## MacBook Pro

### Ollama

| GGUF models | M4 Max (128 GB RAM, 40-core GPU) | M4 Max (36 GB RAM, 32-core GPU) | M1 Pro (32 GB RAM, 16-core GPU) |
| --- | --- | --- | --- |
| Qwen2.5:7B (4-bit) | 72.50 tokens/s | 60.71 tokens/s | 26.85 tokens/s |
| Qwen2.5:14B (4-bit) | 38.23 tokens/s | Didn't test | 14.66 tokens/s |
| Qwen2.5:32B (4-bit) | 19.35 tokens/s | Didn't test | 6.95 tokens/s |
| Qwen2.5:72B (4-bit) | 8.76 tokens/s | Didn't test | Didn't test |
| gpt-oss:20B (4-bit) | Didn't test | 68.20 tokens/s | Didn't test |

### LM Studio

| MLX models | M4 Max (128 GB RAM, 40-core GPU) | M4 Max (36 GB RAM, 32-core GPU) | M1 Pro (32 GB RAM, 16-core GPU) |
| --- | --- | --- | --- |
| Qwen2.5-7B-Instruct (4-bit) | 101.87 tokens/s | 81.60 tokens/s | 38.99 tokens/s |
| Qwen2.5-14B-Instruct (4-bit) | 52.22 tokens/s | Didn't test | 18.88 tokens/s |
| Qwen2.5-32B-Instruct (4-bit) | 24.46 tokens/s | Didn't test | 9.10 tokens/s |
| Qwen2.5-32B-Instruct (8-bit) | 13.75 tokens/s | Didn't test | Wouldn't complete (crashed) |
| Qwen2.5-72B-Instruct (4-bit) | 10.86 tokens/s | Didn't test | Didn't test |
| gpt-oss:20B (4-bit) | Didn't test | 82.66 tokens/s | Didn't test |
| GGUF models | M4 Max (128 GB RAM, 40-core GPU) | M1 Pro (32 GB RAM, 16-core GPU) |
| --- | --- | --- |
| Qwen2.5-7B-Instruct (4-bit) | 71.73 tokens/s | 26.12 tokens/s |
| Qwen2.5-14B-Instruct (4-bit) | 39.04 tokens/s | 14.67 tokens/s |
| Qwen2.5-32B-Instruct (4-bit) | 19.56 tokens/s | 4.53 tokens/s |
| Qwen2.5-72B-Instruct (4-bit) | 8.31 tokens/s | Didn't test |

## Mac Studio

### Ollama

| GGUF models | M1 Max (32 GB RAM, 23-core GPU) | M3 Ultra (256 GB RAM, 80-core GPU) |
| --- | --- | --- |
| mistral-small:23b (4-bit) | 15.11 tokens/s | Didn't test |
| mistral-large:123b (4-bit) | Didn't test | 8.42 tokens/s |
| llama3.1:8b (4-bit) | 38.73 tokens/s | 85.02 tokens/s |
| llama3.2-vision:9b (4-bit) | 39.05 tokens/s | Didn't test |
| deepseek-r1:14b (4-bit) | 21.16 tokens/s | 46.50 tokens/s |
| deepseek-r1:32b (4-bit) | Didn't test | 25.58 tokens/s |
| deepseek-r1:70b (4-bit) | Didn't test | 13.16 tokens/s |
| hermes3:405b (4-bit) | Didn't test | 2.47 tokens/s |
| Qwen2.5:7B (4-bit) | Didn't test | 88.87 tokens/s |
| Qwen2.5:14B (4-bit) | Didn't test | 47.25 tokens/s |
| Qwen2.5:32B (4-bit) | Didn't test | 26.02 tokens/s |
| Qwen2.5:72B (4-bit) | Didn't test | 12.21 tokens/s |

## Contributing

### Using Ollama

1. Run your model with the verbose flag, e.g. `ollama run mistral-small --verbose`.
2. Enter the prompt `Write a 500 word story`.
3. Add the `eval rate` TPS (tokens per second) that Ollama reports to the column for your device, as shown in the example session below.
4. If your device is not in the list, add it.
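For reference, a session looks roughly like the sketch below. With `--verbose`, Ollama prints a timing block after the response; the `eval rate` line is the value to record. The numbers shown here are illustrative placeholders (loosely based on the mistral-small M1 Max entry above), and the exact set of fields may vary between Ollama versions.

```console
$ ollama run mistral-small --verbose
>>> Write a 500 word story
... model output ...

total duration:       44.3s
load duration:        1.1s
prompt eval count:    12 token(s)
prompt eval duration: 122ms
prompt eval rate:     98.40 tokens/s
eval count:           650 token(s)
eval duration:        43.02s
eval rate:            15.11 tokens/s
```

Note that `eval rate` (generation speed), not `prompt eval rate` (prompt processing speed), is what the tables above report.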
