# Inference Speed Tests on Local LLMs

Inference speed tests for local large language models on various devices. Feel free to contribute your results.

**Note:** None of the following results have been verified.

All models were tested with the same prompt: `Write a 500 word story`.

## MacBook Pro

### Ollama

| GGUF models | M4 Max (128 GB RAM, 40-core GPU) | M4 Max (36 GB RAM, 32-core GPU) | M1 Pro (32 GB RAM, 16-core GPU) |
| --- | --- | --- | --- |
| Qwen2.5:7B (4-bit) | 72.50 tokens/s | 60.71 tokens/s | 26.85 tokens/s |
| Qwen2.5:14B (4-bit) | 38.23 tokens/s | Didn't test | 14.66 tokens/s |
| Qwen2.5:32B (4-bit) | 19.35 tokens/s | Didn't test | 6.95 tokens/s |
| Qwen2.5:72B (4-bit) | 8.76 tokens/s | Didn't test | Didn't test |
| gpt-oss:20B (4-bit) | Didn't test | 68.20 tokens/s | Didn't test |

### LM Studio

| MLX models | M4 Max (128 GB RAM, 40-core GPU) | M4 Max (36 GB RAM, 32-core GPU) | M1 Pro (32 GB RAM, 16-core GPU) |
| --- | --- | --- | --- |
| Qwen2.5-7B-Instruct (4-bit) | 101.87 tokens/s | 81.60 tokens/s | 38.99 tokens/s |
| Qwen2.5-14B-Instruct (4-bit) | 52.22 tokens/s | Didn't test | 18.88 tokens/s |
| Qwen2.5-32B-Instruct (4-bit) | 24.46 tokens/s | Didn't test | 9.10 tokens/s |
| Qwen2.5-32B-Instruct (8-bit) | 13.75 tokens/s | Didn't test | Wouldn't complete (crashed) |
| Qwen2.5-72B-Instruct (4-bit) | 10.86 tokens/s | Didn't test | Didn't test |
| gpt-oss:20B (4-bit) | Didn't test | 82.66 tokens/s | Didn't test |
| GGUF models | M4 Max (128 GB RAM, 40-core GPU) | M1 Pro (32 GB RAM, 16-core GPU) |
| --- | --- | --- |
| Qwen2.5-7B-Instruct (4-bit) | 71.73 tokens/s | 26.12 tokens/s |
| Qwen2.5-14B-Instruct (4-bit) | 39.04 tokens/s | 14.67 tokens/s |
| Qwen2.5-32B-Instruct (4-bit) | 19.56 tokens/s | 4.53 tokens/s |
| Qwen2.5-72B-Instruct (4-bit) | 8.31 tokens/s | Didn't test |

## Mac Studio

### Ollama

| GGUF models | M1 Max (32 GB RAM, 23-core GPU) | M3 Ultra (256 GB RAM, 80-core GPU) |
| --- | --- | --- |
| mistral-small:23b (4-bit) | 15.11 tokens/s | Didn't test |
| mistral-large:123b (4-bit) | Didn't test | 8.42 tokens/s |
| llama3.1:8b (4-bit) | 38.73 tokens/s | 85.02 tokens/s |
| llama3.2-vision:9b (4-bit) | 39.05 tokens/s | Didn't test |
| deepseek-r1:14b (4-bit) | 21.16 tokens/s | 46.50 tokens/s |
| deepseek-r1:32b (4-bit) | Didn't test | 25.58 tokens/s |
| deepseek-r1:70b (4-bit) | Didn't test | 13.16 tokens/s |
| hermes3:405b (4-bit) | Didn't test | 2.47 tokens/s |
| Qwen2.5:7B (4-bit) | Didn't test | 88.87 tokens/s |
| Qwen2.5:14B (4-bit) | Didn't test | 47.25 tokens/s |
| Qwen2.5:32B (4-bit) | Didn't test | 26.02 tokens/s |
| Qwen2.5:72B (4-bit) | Didn't test | 12.21 tokens/s |

## Contributing

### Using Ollama

1. Run your model with the verbose flag, e.g. `ollama run mistral-small --verbose`.
2. Enter the prompt `Write a 500 word story`.
3. Add the `eval rate` TPS (tokens per second) that Ollama reports to the column for your device, as shown in the example session below.
4. If your device is not in the list, add it.
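For reference, a session looks roughly like the sketch below. With `--verbose`, Ollama prints a timing block after the response; the `eval rate` line is the value to record. The numbers shown here are illustrative placeholders (loosely based on the mistral-small M1 Max entry above), and the exact set of fields may vary between Ollama versions.

```console
$ ollama run mistral-small --verbose
>>> Write a 500 word story
... model output ...

total duration:       44.3s
load duration:        1.1s
prompt eval count:    12 token(s)
prompt eval duration: 122ms
prompt eval rate:     98.40 tokens/s
eval count:           650 token(s)
eval duration:        43.02s
eval rate:            15.11 tokens/s
```

Note that `eval rate` (generation speed), not `prompt eval rate` (prompt processing speed), is what the tables above report.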
