Benchmark Scoring

Manually and automatically score LLM performance on generated benchmarks.
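
As a rough illustration of how manual and automatic scores might be combined into the per-model and overall comparisons, here is a minimal sketch. Everything in it is an assumption and not taken from the benchmark itself: the `Result` record, its field names, the 0 to 1 score range, and the rule that a manual score, when present, overrides the automatic one.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Optional


@dataclass
class Result:
    """One benchmark item answered by one model (hypothetical record layout)."""
    model: str
    auto_score: float              # assumed automatic score in [0, 1]
    manual_score: Optional[float]  # assumed manual score in [0, 1]; None if not reviewed
    seconds: float                 # wall-clock time spent on this item


def final_score(result: Result) -> float:
    """Use the manual score when a reviewer gave one, else the automatic score."""
    if result.manual_score is not None:
        return result.manual_score
    return result.auto_score


def summarize(results: List[Result]) -> Dict[str, Dict[str, float]]:
    """Group results by model; report the mean final score and the total time."""
    by_model: Dict[str, List[Result]] = {}
    for result in results:
        by_model.setdefault(result.model, []).append(result)
    return {
        model: {
            "score": mean(final_score(r) for r in items),
            "seconds": sum(r.seconds for r in items),
        }
        for model, items in by_model.items()
    }
```

Calling `summarize` on all collected results yields one entry per model, which could feed both the score comparison and the time comparison.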

Scoring for: llama3.2
Scoring for: gemma3
Scoring for: deepseek-r1
Scoring for: mistral small
Scoring for: qwen2.5
Scoring for: claude sonnet3.5
Scoring for: phi4
Overall Score Comparison
Overall Time Comparison