
Advent of Code 2024 - as LLM benchmark

I want to test different LLMs' coding capabilities using the Advent of Code 2024 challenges.

Methodology

To ensure comparability between the models, I use the same prompt for all of them. The initial prompt is:

Please solve the following problem using Python, assuming that the provided input is in a file named input.txt.

This prompt is followed by the copied problem description of that day's exercise.

The models are first rated on 0-shot prompting. If the initial solution does not work, they are prompted with the feedback that AoC provides (whether the result is too high or too low) and get up to four more chances to fix their solution, for a total of five tries. If a model does not provide a working solution within five tries, the part is marked as failed (X); if this already happens in the first part, the second part is marked with (-). Otherwise, the number of tries is noted. The reason why a solution is wrong is noted after the number of tries, with (l) for logic errors and (s) for syntax errors.
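For concreteness, the grading protocol above roughly corresponds to the loop sketched below. This is only an illustration, not code from this repository; ask_model, run_solution, and submit_answer are hypothetical callables standing in for the manual steps (querying the LLM, running its script, submitting the answer to AoC).

```python
# Minimal sketch of the grading protocol described above.
# ask_model, run_solution and submit_answer are hypothetical stand-ins for the
# manual steps (querying the LLM, running its script, submitting to AoC).
MAX_TRIES = 5

BASE_PROMPT = ("Please solve the following problem using Python, assuming "
               "that the provided input is in a file named input.txt.\n\n")

def grade_part(ask_model, run_solution, submit_answer, problem_text):
    """Return the number of tries (1-5) on success, or 'X' on failure."""
    prompt = BASE_PROMPT + problem_text
    for attempt in range(1, MAX_TRIES + 1):
        code = ask_model(prompt)                  # Python solution proposed by the model
        answer = run_solution(code, "input.txt")  # run it against the puzzle input
        verdict = submit_answer(answer)           # AoC feedback: "correct", "too high", "too low"
        if verdict == "correct":
            return attempt                        # number of tries recorded in the table
        # Otherwise, pass the AoC feedback back to the model and let it retry.
        prompt += f"\n\nThe answer {answer} is {verdict}. Please fix your solution."
    return "X"                                    # no working solution within five tries
```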

Models:

  • o1
  • o1-preview
  • R1-Lite
  • o1-mini
  • Claude 3.5 Sonnet
  • Gemini
  • MS Copilot
  • GPT 4o
  • Qwen-QwQ
  • Llama 3.3
  • Qwen2.5-Coder-32b
  • Qwen2.5-72b

Results

Model Rankings

The models are ranked using a composite score:

  • Success Rate (70% weight): Percentage of problems solved (both parts)
  • Efficiency Rate (30% weight): Average solve efficiency (1/number of attempts)
  • Final Score = (Success Rate × 0.7) + (Efficiency Rate × 0.3); a small calculation sketch follows the table below
| Rank | Model | Success Rate | Efficiency Rate | Final Score |
|------|-------|--------------|-----------------|-------------|
| 1 | o1 | 85.4% | 89.2% | 86.6 |
| 2 | o1-preview | 83.3% | 93.2% | 86.3 |
| 3 | R1-Lite | 83.3% | 70.8% | 79.6 |
| 4 | o1-mini | 75.0% | 77.5% | 75.7 |
| 5 | Claude 3.5 Sonnet | 75.0% | 65.0% | 72.0 |
| 6 | Gemini | 50.0% | 84.4% | 60.3 |
| 7 | MS Copilot | 54.2% | 72.1% | 59.5 |
| 8 | GPT 4o | 56.2% | 63.9% | 58.5 |
| 9 | Qwen-QwQ | 52.1% | 68.3% | 56.9 |
| 10 | Llama 3.3 | 50.0% | 56.6% | 52.0 |
| 11 | Qwen2.5-Coder-32b | 37.5% | 83.7% | 51.3 |
| 12 | Qwen2.5-72b | 41.7% | 66.2% | 49.0 |
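As a reference for how the final scores above come about, here is a small calculation sketch. The description above does not spell out how failed parts enter the efficiency average, so this sketch assumes the average of 1/attempts is taken over solved parts only; the function and its input format are illustrative, not code from this repository.

```python
def composite_score(results):
    """Compute the composite score from per-day results.

    results: list of (p1_tries, p2_tries) per day, with None for a failed part.
    Assumption: the efficiency average only covers parts that were solved.
    """
    # Success Rate: share of days where both parts were solved.
    success_rate = sum(1 for p1, p2 in results if p1 and p2) / len(results)

    # Efficiency Rate: average of 1 / (number of attempts) over solved parts.
    tries = [t for p1, p2 in results for t in (p1, p2) if t]
    efficiency_rate = sum(1 / t for t in tries) / len(tries)

    # Final Score = (Success Rate × 0.7) + (Efficiency Rate × 0.3), in percent.
    return 100 * (0.7 * success_rate + 0.3 * efficiency_rate)

# Example: day 1 solved in 1 and 2 tries, day 2 part 1 solved but part 2 failed:
# composite_score([(1, 2), (1, None)]) ≈ 100 * (0.7 * 0.5 + 0.3 * 2.5 / 3) ≈ 60.0
```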

Detailed Results

Day o1 o1-preview R1-Lite o1-mini Claude 3.5 Sonnet Gemini MS Copilot GPT 4o Qwen-QwQ Llama 3.3 Qwen2.5-Coder-32b Qwen2.5-72b
Each model column lists the number of tries for Part 1 (P1) and Part 2 (P2), followed by the error type (E): (l) for logic and (s) for syntax errors. X marks a part not solved within five tries; - marks a second part that was not attempted.
01 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 5 1 l 1 1 1 1 1 2 s
02 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 s
03 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 4 l 1 1 1 5 l 1 1 1 2 l
04 1 1 1 1 1 1 1 2 l 1 2 l 1 X l 1 X l 1 X l 1 X l,s 1 X l 1 X l 1 X l
05 1 1 1 1 1 1 1 1 1 1 3 1 l 1 1 3 1 l 1 1 1 1 X - l 1 1
06 3 1 l 1 1 1 1 1 1 2 1 l X - l X - l 1 X l 2 X l 1 X l
07 1 1 1 1 1 1 1 1 1 3 l X - l 1 1 1 1 X - l X - l
08 1 1 2 1 l 1 2 l 1 2 l X - l X - l,s X - l X - l,s X - l X - l
09 1 1 1 1 2 1 l X - l 3 1 l X - l X - l X - l X - l X - l
10 1 1 1 1 2 1 l 1 1 1 3 l 1 1 2 1 l 1 1 1 2 l 3 1 l X - l 1 5 l
11 1 3 l 2 1 l 1 X l 1 1 1 X l 1 X l 1 3 l 1 3 l 1 X l 1 X l 1 X l 1 4 l
12 1 X l 1 X l 1 5 l 2 X l 2 X l 1 X l 2 X s X - l 2 X l,s 1 X l 1 X l,s X - l,s
13 1 1 1 2 s 2 1 l 1 1 1 4 l X - l 1 X l 2 X l 3 X l 5 X l X - l 2 X l
14 1 X l 1 X l 1 X l 1 X l 1 X l 1 X l 1 1 1 2 l 1 X l 4 X l 1 X l X - l
15 1 X l X - l 1 X l X - l X - l X - l X - l X - l X - l,s X - l X - l X - l,s
16 1 1 1 1 1 X l 1 5 l 1 1 1 X l 4 X l 1 X l X - l,s 3 X l X - l X - l
17 2 X l 1 X l 5 X l 1 X l 1 X l 1 X l X - l X - l X - l 3 X l X - l X - l,s
18 1 1 1 1 1 1 1 4 l 1 1 1 1 1 1 5 1 l 1 2 l 1 1 1 1 1 1
19 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 4 1 l 3 1 l 1 1 1 3 l,s
20 1 1 1 1 5 2 l X - l 2 4 l 3 X l X - l X - l X - l X - l X - l X - l,s
21 X - l X - l X - l X - l X - l X - l X - l X - l X - l X - l X - l X - l,s
22 1 1 1 1 l 1 1 1 1 1 4 l 1 X l 1 X l 1 X l 1 X l 1 X l 1 X l X - l,s
23 1 1 1 1 1 1 1 1 1 3 l 1 2 l 1 3 l 1 2 l 1 2 s 1 2 l 1 3 l,s 1 X l
24 1 X l 1 X l 1 X l 2 X s,l X - l X - l 2 X l 2 X l X - l X - l X - l X - l

Color-coded version of the results table (image).

Data analytics

Bar plots showing the solve rates and the average number of tries needed (images).

Weird occurrences

  • On day 02, GitHub Copilot GPT 4o repeatedly refused to answer my initial query. After ~5-10 attempts the safeguards allowed the answer. Since the answer was correct on the first successful response, I count this as one try.
  • Qwen2.5-72b sometimes made syntax mistakes, such as using a space instead of _ in variable names, but only in the second part. I counted this as a failed try and marked it with (s).
  • When prompted with the feedback from AoC that the value is too low, Qwen QwQ just responded with "I'm here to help you understand the correct solution, not to debate the answer."
  • Qwen QwQ sometimes only provided the full code solution in its thought process, not in the final output. In these cases I still used the last code provided in the thought process.
  • On day 08, o1-mini and R1-Lite produced a similar solution for Part 2 on their first try, with the same wrong answer.
  • On day 09, o1's solution was 0-shot but took 1m30s to run; R1-Lite needed one follow-up prompt but produced a solution that runs in less than 1s.
  • On day 13, o1-preview, even though explicitly asked for the Python code, initially returned step-by-step instructions on how to reach the solution for part 2 rather than the Python code itself. Since I had to ask for the full implementation manually, I counted this as a second try and marked it as a syntax issue.
  • Qwen2.5-72b was the model that produced by far the most gibberish output, sometimes repeatedly getting stuck outputting the same nonsense.
