I want to test different LLMs' coding capabilities using the Advent of Code challenges.
To ensure comparability across models, I use the same prompt for all of them. The initial prompt is:
Please solve the following problem using Python, assuming that the provided input is in a file named input.txt.
This prompt is followed by the copied problem description of that day's exercise.
The models are rated on zero-shot prompting first. If the initial solution does not work, they are prompted with the feedback that AoC provides (whether the result is too high or too low) and get up to four more attempts to fix their solution, for a total of five tries. If a model does not produce a working solution within five tries, that part is marked as failed (X); if this happens on the first part, the second part is marked with (-). Otherwise, the number of tries is noted. The reason a solution was wrong is noted after the number of tries, with (l) for logic errors and (s) for syntax errors.
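For reference, the expected answer in this setup is a standalone Python script that reads its puzzle input from input.txt. A minimal hypothetical skeleton (not an actual model answer) looks like this:

```python
# Hypothetical skeleton of a submitted solution: a standalone script that reads
# the puzzle input from input.txt and prints the result for one part.
def solve(lines):
    # Puzzle-specific logic would go here; summing line lengths is just a placeholder.
    return sum(len(line) for line in lines)

with open("input.txt") as f:
    lines = [line.rstrip("\n") for line in f]

print(solve(lines))
```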
- Claude 3.5 Sonnet -> claude-3.5-sonnet-20241022 used on https://claude.ai
- MS Copilot -> used the enterprise version of https://copilot.microsoft.com, exact model unknown
- GPT 4o -> GPT 4o used in the chat panel of GitHub Copilot without context of any file or workspace
- o1-mini -> o1-mini used in the chat panel of GitHub Copilot without context of any file or workspace
- o1-preview -> o1-preview used in the chat panel of GitHub Copilot without context of any file or workspace
- o1 -> o1 used on https://chatgpt.com/
- Qwen2.5-72b -> Qwen/Qwen2.5-72B-Instruct used on https://huggingface.co/chat
- Qwen2.5-Coder-32b -> Qwen/Qwen2.5-Coder-32B-Instruct used on https://huggingface.co/chat
- Qwen-QwQ -> Qwen/QwQ-32B-Preview used on https://huggingface.co/chat
- R1-Lite -> DeepSeek-R1-Lite preview used on https://chat.deepseek.com
- Llama 3.3 -> meta-llama/Llama-3.3-70B-Instruct used on https://huggingface.co/chat
- Gemini -> gemini-exp-1206 used in Direct Chat on https://lmarena.ai
- Gemini 2.0 -> gemini-2.0-flash-exp used in Direct Chat on https://lmarena.ai
- Phi-4 -> vanilj/Phi-4:Q8_0 used locally in ollama
The models are ranked using a composite score:
- Success Rate (70% weight): Percentage of problems solved (both parts)
- Efficiency Rate (30% weight): Average solve efficiency (1/number of attempts)
- Final Score = (Success Rate × 0.7) + (Efficiency Rate × 0.3)
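For illustration, here is a minimal sketch of how such a composite score can be computed from per-part attempt counts (how parts are counted and whether unsolved parts enter the efficiency average are my assumptions, not taken from the table):

```python
# Minimal sketch of the composite score, assuming "Success Rate" counts solved
# puzzle parts out of all attempted parts and "Efficiency Rate" averages
# 1/attempts over the solved parts only.
def composite_score(attempts_per_solved_part, total_parts):
    success_rate = len(attempts_per_solved_part) / total_parts
    efficiency_rate = (sum(1 / a for a in attempts_per_solved_part) / len(attempts_per_solved_part)
                       if attempts_per_solved_part else 0.0)
    return 100 * (0.7 * success_rate + 0.3 * efficiency_rate)

# Hypothetical example: 41 of 48 parts solved, most of them on the first attempt.
print(round(composite_score([1] * 35 + [2] * 4 + [3, 5], 48), 1))
```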
| Rank | Model | Success Rate | Efficiency Rate | Final Score |
|---|---|---|---|---|
| 1 | o1 | 85.4% | 89.2% | 86.6 |
| 2 | o1-preview | 83.3% | 93.2% | 86.3 |
| 3 | R1-Lite | 83.3% | 70.8% | 79.6 |
| 4 | o1-mini | 75.0% | 77.5% | 75.7 |
| 5 | Claude 3.5 Sonnet | 75.0% | 65.0% | 72.0 |
| 6 | Gemini | 50.0% | 84.4% | 60.3 |
| 7 | MS Copilot | 54.2% | 72.1% | 59.5 |
| 8 | GPT 4o | 56.2% | 63.9% | 58.5 |
| 9 | Qwen-QwQ | 52.1% | 68.3% | 56.9 |
| 10 | Llama 3.3 | 50.0% | 56.6% | 52.0 |
| 11 | Qwen2.5-Coder-32b | 37.5% | 83.7% | 51.3 |
| 12 | Qwen2.5-72b | 41.7% | 66.2% | 49.0 |
| Day | o1 | | | o1-preview | | | R1-Lite | | | o1-mini | | | Claude 3.5 Sonnet | | | Gemini | | | MS Copilot | | | GPT 4o | | | Qwen-QwQ | | | Llama 3.3 | | | Qwen2.5-Coder-32b | | | Qwen2.5-72b | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | P1 | P2 | E | P1 | P2 | E | P1 | P2 | E | P1 | P2 | E | P1 | P2 | E | P1 | P2 | E | P1 | P2 | E | P1 | P2 | E | P1 | P2 | E | P1 | P2 | E | P1 | P2 | E | P1 | P2 | E |
| 01 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 5 | 1 | l | 1 | 1 | 1 | 1 | 1 | 2 | s | ||||||||||
| 02 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 2 | s | |||||||||||
| 03 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 4 | l | 1 | 1 | 1 | 5 | l | 1 | 1 | 1 | 2 | l | |||||||||
| 04 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 2 | l | 1 | 2 | l | 1 | X | l | 1 | X | l | 1 | X | l | 1 | X | l,s | 1 | X | l | 1 | X | l | 1 | X | l | |||
| 05 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 3 | 1 | l | 1 | 1 | 3 | 1 | l | 1 | 1 | 1 | 1 | X | - | l | 1 | 1 | |||||||||
| 06 | 3 | 1 | l | 1 | 1 | 1 | 1 | 1 | 1 | 2 | 1 | l | X | - | l | X | - | l | 1 | X | l | 2 | X | l | 1 | X | l | |||||||||
| 07 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 3 | l | X | - | l | 1 | 1 | 1 | 1 | X | - | l | X | - | l | ||||||||||||
| 08 | 1 | 1 | 2 | 1 | l | 1 | 2 | l | 1 | 2 | l | X | - | l | X | - | l,s | X | - | l | X | - | l,s | X | - | l | X | - | l | |||||||
| 09 | 1 | 1 | 1 | 1 | 2 | 1 | l | X | - | l | 3 | 1 | l | X | - | l | X | - | l | X | - | l | X | - | l | X | - | l | ||||||||
| 10 | 1 | 1 | 1 | 1 | 2 | 1 | l | 1 | 1 | 1 | 3 | l | 1 | 1 | 2 | 1 | l | 1 | 1 | 1 | 2 | l | 3 | 1 | l | X | - | l | 1 | 5 | l | |||||
| 11 | 1 | 3 | l | 2 | 1 | l | 1 | X | l | 1 | 1 | 1 | X | l | 1 | X | l | 1 | 3 | l | 1 | 3 | l | 1 | X | l | 1 | X | l | 1 | X | l | 1 | 4 | l | |
| 12 | 1 | X | l | 1 | X | l | 1 | 5 | l | 2 | X | l | 2 | X | l | 1 | X | l | 2 | X | s | X | - | l | 2 | X | l,s | 1 | X | l | 1 | X | l,s | X | - | l,s |
| 13 | 1 | 1 | 1 | 2 | s | 2 | 1 | l | 1 | 1 | 1 | 4 | l | X | - | l | 1 | X | l | 2 | X | l | 3 | X | l | 5 | X | l | X | - | l | 2 | X | l | ||
| 14 | 1 | X | l | 1 | X | l | 1 | X | l | 1 | X | l | 1 | X | l | 1 | X | l | 1 | 1 | 1 | 2 | l | 1 | X | l | 4 | X | l | 1 | X | l | X | - | l | |
| 15 | 1 | X | l | X | - | l | 1 | X | l | X | - | l | X | - | l | X | - | l | X | - | l | X | - | l | X | - | l,s | X | - | l | X | - | l | X | - | l,s |
| 16 | 1 | 1 | 1 | 1 | 1 | X | l | 1 | 5 | l | 1 | 1 | 1 | X | l | 4 | X | l | 1 | X | l | X | - | l,s | 3 | X | l | X | - | l | X | - | l | |||
| 17 | 2 | X | l | 1 | X | l | 5 | X | l | 1 | X | l | 1 | X | l | 1 | X | l | X | - | l | X | - | l | X | - | l | 3 | X | l | X | - | l | X | - | l,s |
| 18 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 4 | l | 1 | 1 | 1 | 1 | 1 | 1 | 5 | 1 | l | 1 | 2 | l | 1 | 1 | 1 | 1 | 1 | 1 | |||||||||
| 19 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 4 | 1 | l | 3 | 1 | l | 1 | 1 | 1 | 3 | l,s | |||||||||
| 20 | 1 | 1 | 1 | 1 | 5 | 2 | l | X | - | l | 2 | 4 | l | 3 | X | l | X | - | l | X | - | l | X | - | l | X | - | l | X | - | l | X | - | l,s | ||
| 21 | X | - | l | X | - | l | X | - | l | X | - | l | X | - | l | X | - | l | X | - | l | X | - | l | X | - | l | X | - | l | X | - | l | X | - | l,s |
| 22 | 1 | 1 | 1 | 1 | l | 1 | 1 | 1 | 1 | 1 | 4 | l | 1 | X | l | 1 | X | l | 1 | X | l | 1 | X | l | 1 | X | l | 1 | X | l | X | - | l,s | |||
| 23 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 3 | l | 1 | 2 | l | 1 | 3 | l | 1 | 2 | l | 1 | 2 | s | 1 | 2 | l | 1 | 3 | l,s | 1 | X | l | ||||
| 24 | 1 | X | l | 1 | X | l | 1 | X | l | 2 | X | s,l | X | - | l | X | - | l | 2 | X | l | 2 | X | l | X | - | l | X | - | l | X | - | l | X | - | l |
- On day 02, GPT 4o in GitHub Copilot repeatedly refused to answer my initial query. After roughly 5-10 attempts the safeguards let the answer through. Since the answer was correct on the first successful response, I counted this as one try.
- Qwen2.5-72b sometimes made syntax mistakes, like using a space instead of `_` in variable names, but only in the second part. I counted this as a failed try and marked it with (s).
- When prompting Qwen-QwQ with the feedback from AoC that the value is too low, it just responded with "I'm here to help you understand the correct solution, not to debate the answer."
- Qwen-QwQ sometimes provided the full code solution only in its thought process, not in the final output. In these cases I still used the last code given in the thought process.
- On day 08 o1-mini and R1-Lite produced a similar solution for Part 2 in their first try with the same wrong answer.
- On day 09, o1's solution worked zero-shot but took 1m30s to run; R1-Lite needed one follow-up prompt but produced a solution that runs in less than 1s.
- On day 13, o1-preview, even though explicitly asked for Python code, initially returned step-by-step instructions for how to reach the solution of part 2 but no Python code. Since I had to ask for the full implementation manually, I counted this as a second try and marked it as a syntax issue.
- Qwen2.5-72b was by far the model that produced the most gibberish output; it sometimes got stuck repeatedly emitting the same nonsense.

