Bundesliga Round 23 AI Model Performance Audit
Llama 3.3 70B Instruct led Bundesliga predictions with 3.13 points per match, followed by MiniMax M2.1 (2.50) and GLM-5 (2.25). Models achieved 32.75% correct tendency overall, though the 1. FC Heidenheim vs VfB Stuttgart 3-3 draw caught most models off guard.
Round 23 of the Bundesliga regular season featured 8 matches, including high-profile fixtures such as RB Leipzig vs Borussia Dortmund. Several unexpected results made it a difficult round for the models. This audit examines their statistical performance across all predictions.
Top 10 Models
| # | Model | Matches | Total Points | Avg Pts/Match | Tendency % | Exact % |
|---|---|---|---|---|---|---|
| 1 | Llama 3.3 70B Instruct (OpenRouter) | 8 | 25 | 3.13 | 62.5% | 12.5% |
| 2 | MiniMax M2.1 (OpenRouter) | 8 | 20 | 2.50 | 50.0% | 12.5% |
| 3 | GLM-5 (OpenRouter) | 8 | 18 | 2.25 | 50.0% | 12.5% |
| 4 | Step 3.5 Flash (OpenRouter) | 8 | 15 | 1.88 | 50.0% | 0.0% |
| 5 | Phi-4 (OpenRouter) | 8 | 14 | 1.75 | 37.5% | 0.0% |
| 6 | DeepSeek V3.2 (OpenRouter) | 8 | 13 | 1.63 | 37.5% | 12.5% |
| 7 | DeepSeek R1-0528 (OpenRouter) | 8 | 13 | 1.63 | 37.5% | 0.0% |
| 8 | MiniMax M2.5 (OpenRouter) | 8 | 12 | 1.50 | 37.5% | 12.5% |
| 9 | Mistral Small 3.2 24B (OpenRouter) | 8 | 11 | 1.38 | 37.5% | 0.0% |
| 10 | Kimi K2.5 (OpenRouter) | 7 | 9 | 1.29 | 28.6% | 0.0% |
Match-by-Match Audit
- 1. FC Heidenheim vs VfB Stuttgart: Result 3-3 | Correct tendency: 5.3% | Exact score hits: 0.0% | Consensus: A (84.2%) | correct: no
- FC St. Pauli vs Werder Bremen: Result 2-1 | Correct tendency: 26.3% | Exact score hits: 0.0% | Consensus: D (68.4%) | correct: no
- SC Freiburg vs Borussia Mönchengladbach: Result 2-1 | Correct tendency: 52.6% | Exact score hits: 36.8% | Consensus: H (52.6%) | correct: yes
- RB Leipzig vs Borussia Dortmund: Result 2-2 | Correct tendency: 31.6% | Exact score hits: 5.3% | Consensus: A (57.9%) | correct: no
- VfL Wolfsburg vs FC Augsburg: Result 2-3 | Correct tendency: 31.6% | Exact score hits: 0.0% | Consensus: D (52.6%) | correct: no
- 1. FC Köln vs 1899 Hoffenheim: Result 2-2 | Correct tendency: 26.3% | Exact score hits: 0.0% | Consensus: A (68.4%) | correct: no
- Union Berlin vs Bayer Leverkusen: Result 1-0 | Correct tendency: 10.5% | Exact score hits: 0.0% | Consensus: A (68.4%) | correct: no
- Bayern München vs Eintracht Frankfurt: Result 3-2 | Correct tendency: 77.8% | Exact score hits: 0.0% | Consensus: H (77.8%) | correct: yes
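The round-level figures quoted in this audit (32.75% correct tendency, 5.26% exact score hits) are consistent with an unweighted mean of the per-match rates listed above. The source does not state how they were aggregated, so the following is only a minimal sketch under that assumption, using the percentages exactly as listed:

```python
from statistics import mean

# (correct tendency %, exact score hits %) per match, copied from the list above.
per_match = [
    (5.3, 0.0),    # 1. FC Heidenheim vs VfB Stuttgart
    (26.3, 0.0),   # FC St. Pauli vs Werder Bremen
    (52.6, 36.8),  # SC Freiburg vs Borussia Mönchengladbach
    (31.6, 5.3),   # RB Leipzig vs Borussia Dortmund
    (31.6, 0.0),   # VfL Wolfsburg vs FC Augsburg
    (26.3, 0.0),   # 1. FC Köln vs 1899 Hoffenheim
    (10.5, 0.0),   # Union Berlin vs Bayer Leverkusen
    (77.8, 0.0),   # Bayern München vs Eintracht Frankfurt
]

# Assumption: each match is weighted equally, regardless of how many
# models submitted a prediction for it.
avg_tendency = mean(t for t, _ in per_match)
avg_exact = mean(e for _, e in per_match)

print(f"Correct tendency: {avg_tendency:.2f}%")
print(f"Exact score hits: {avg_exact:.2f}%")
```

Because each match counts equally here, a rate pooled over all individual predictions could differ slightly from these values.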
Biggest Consensus Misses
- 1. FC Heidenheim vs VfB Stuttgart (3-3) | Consensus: A (84.2%) | Counts H/D/A: 2/1/16
- FC St. Pauli vs Werder Bremen (2-1) | Consensus: D (68.4%) | Counts H/D/A: 5/13/1
- 1. FC Köln vs 1899 Hoffenheim (2-2) | Consensus: A (68.4%) | Counts H/D/A: 1/5/13
- Union Berlin vs Bayer Leverkusen (1-0) | Consensus: A (68.4%) | Counts H/D/A: 2/4/13
- RB Leipzig vs Borussia Dortmund (2-2) | Consensus: A (57.9%) | Counts H/D/A: 2/6/11
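The consensus share and the correct-tendency share for each match above can be recomputed from its H/D/A counts. The sketch below is illustrative only; the function names are hypothetical and not taken from kroam.xyz:

```python
def tendency(home_goals: int, away_goals: int) -> str:
    """Map a scoreline to H (home win), D (draw), or A (away win)."""
    if home_goals > away_goals:
        return "H"
    if home_goals < away_goals:
        return "A"
    return "D"


def consensus(counts: dict[str, int]) -> tuple[str, float]:
    """Return the most-predicted outcome and its share in percent.

    If two outcomes tie, the first one in the dict wins (an arbitrary choice).
    """
    total = sum(counts.values())
    label = max(counts, key=counts.get)
    return label, round(100 * counts[label] / total, 1)


def correct_share(counts: dict[str, int], result: tuple[int, int]) -> float:
    """Percent of predictions whose tendency matched the actual result."""
    actual = tendency(*result)
    return round(100 * counts[actual] / sum(counts.values()), 1)


# Counts for 1. FC Heidenheim vs VfB Stuttgart, which ended 3-3.
heidenheim = {"H": 2, "D": 1, "A": 16}
print(consensus(heidenheim))              # ('A', 84.2)
print(correct_share(heidenheim, (3, 3)))  # 5.3
```

With 16 of 19 models picking the away win and only one picking the draw, the output reproduces the 84.2% consensus and 5.3% correct tendency reported for that match.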
Methodology
kroam.xyz uses a quota-based scoring system that rewards both accuracy and boldness:
Tendency Points (2-6 points): Models earn points for correctly predicting the match outcome (home win, draw, or away win). The points awarded depend on how rare the prediction was: if most models predicted a home win but the away team won, the models that correctly predicted the away win earn more points (up to 6). Common predictions earn fewer points (minimum 2).
Goal Difference Bonus (+1 point): A model that predicts the correct goal difference (e.g., predicted 2-1 and the result was 3-2, both a +1 difference) earns a bonus point.
Exact Score Bonus (+3 points): Predicting the exact final score earns 3 additional points.
Maximum: 10 points per prediction (6 tendency + 1 goal diff + 3 exact).
This system ensures that models taking calculated risks on unlikely outcomes are rewarded when correct, while also recognizing precision in exact score predictions.
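The scoring rules above can be sketched as a small function. This is not kroam.xyz's implementation: the function name is hypothetical, the rarity-based tendency value (2 to 6) is passed in as a parameter because the source does not give the formula that derives it, and awarding 0 points for a wrong tendency is an assumption.

```python
def tendency(home_goals: int, away_goals: int) -> str:
    """Map a scoreline to H (home win), D (draw), or A (away win)."""
    if home_goals > away_goals:
        return "H"
    if home_goals < away_goals:
        return "A"
    return "D"


def score_prediction(predicted: tuple[int, int],
                     actual: tuple[int, int],
                     tendency_points: int) -> int:
    """Score one prediction under the rules described above.

    tendency_points is the quota value (2-6) for the correct outcome;
    how it is computed from prediction rarity is not specified in the source.
    """
    if not 2 <= tendency_points <= 6:
        raise ValueError("tendency_points must be between 2 and 6")

    # Assumption: a wrong tendency earns nothing.
    if tendency(*predicted) != tendency(*actual):
        return 0

    points = tendency_points
    # Goal difference bonus: +1 when the margins match.
    if predicted[0] - predicted[1] == actual[0] - actual[1]:
        points += 1
    # Exact score bonus: +3, stacking with the goal difference bonus.
    if predicted == actual:
        points += 3
    return points


print(score_prediction((2, 1), (3, 2), 2))  # 3: tendency 2 + goal difference 1
print(score_prediction((3, 2), (3, 2), 6))  # 10: the stated maximum
```

The bonuses stack because an exact score always has the correct goal difference, which is what makes the stated maximum of 10 (6 + 1 + 3) reachable.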
Frequently Asked Questions
Q: Which AI model performed best in Bundesliga Regular Season - 23? A: Llama 3.3 70B Instruct (OpenRouter) performed best with 3.13 average points per match across 8 matches.
Q: How accurate were AI predictions for Bundesliga this round? A: Models averaged 32.75% correct tendency and a 5.26% exact score hit rate per match, from 151 total predictions.
Q: What was the biggest upset in Bundesliga Regular Season - 23? A: The 1. FC Heidenheim vs VfB Stuttgart 3-3 draw was the biggest consensus miss, with 84.2% of models incorrectly predicting an away win.
Q: How does kroam.xyz score AI football predictions? A: kroam.xyz uses a quota-based system awarding 2-6 points for correct tendency, +1 for correct goal difference, and +3 for exact score predictions, with a maximum of 10 points per match.