Methodology

How the AI War Tracker leaderboard is built — what we measure, how the intelligence index is computed, where the data comes from, and where it falls short. We'd rather you trust the numbers because you understand them.

The intelligence index

Every model on the leaderboard gets a single intelligence index from 0 to 100. It's computed in two steps so that no single benchmark, or category with lots of benchmarks, dominates:

1. Per-category indices. Each benchmark is grouped into a category (reasoning, coding, math, …). A model's index for a category is the average of its scores on that category's benchmarks (all normalized to a 0–100 scale).
2. Overall index = balanced mean of category indices. We average the per-category indices, not the raw benchmark scores, so each category carries equal weight no matter how many benchmarks it contains. This keeps a model with many strong coding results from outranking a well-rounded model on benchmark count alone. A model needs at least one category index to appear on the leaderboard.
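The two steps above can be sketched in a few lines of Python. This is a minimal illustration, not the site's actual code: the benchmark names, the category mapping, and the scores are all made up, and every score is assumed to be already normalized to 0-100.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical benchmark-to-category mapping (illustrative only,
# not the leaderboard's real benchmark list).
BENCHMARK_CATEGORY = {
    "code_bench_a": "coding",
    "code_bench_b": "coding",
    "code_bench_c": "coding",
    "reasoning_bench": "reasoning",
    "math_bench": "math",
}


def category_indices(scores):
    """Step 1: average a model's 0-100 scores within each category."""
    by_category = defaultdict(list)
    for benchmark, score in scores.items():
        by_category[BENCHMARK_CATEGORY[benchmark]].append(score)
    return {category: mean(values) for category, values in by_category.items()}


def intelligence_index(scores):
    """Step 2: average the category indices, not the raw scores."""
    per_category = category_indices(scores)
    if not per_category:
        return None  # no category index, so the model would not be ranked
    return mean(per_category.values())


# Made-up scores for one model: three coding results, one each elsewhere.
model_scores = {
    "code_bench_a": 90.0,
    "code_bench_b": 80.0,
    "code_bench_c": 70.0,
    "reasoning_bench": 60.0,
    "math_bench": 40.0,
}

per_category = category_indices(model_scores)  # coding 80, reasoning 60, math 40
overall = intelligence_index(model_scores)     # balanced mean of the three: 60
raw_mean = mean(model_scores.values())         # plain mean of all five scores: 68
```

With these invented numbers, the plain mean of all five scores is 68, because the three coding results count three times. The balanced mean is 60, because coding counts once, the same as reasoning and math.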

Currently computed across 430 ranked models and 42 benchmarks.

Benchmarks we use

The featured benchmarks that feed the index, by category. Every model page shows the exact scores behind its index.

Benchmarks excluded from the index

Some well-known benchmarks are shown on model pages but deliberately left out of the index, because their practical score range is far from that of the other benchmarks and would distort a simple average: either they're saturated (top models all score near the maximum) or they're brutally hard, with even the best models scoring very low. Coverage of these benchmarks is also uneven across models. Excluded:

MMLU, HumanEval, MBPP, GSM8K, MATH, MGSM, HellaSwag, Winogrande, ARC-C, TruthfulQA, Humanity’s Last Exam
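One way to see the distortion is with a small worked example. The sketch below uses entirely made-up scores for two hypothetical models; it only illustrates the arithmetic of a simple average and is not drawn from leaderboard data.

```python
from statistics import mean

# Made-up scores on three benchmarks whose score ranges are comparable.
model_a = [80.0, 70.0, 75.0]
model_b = [60.0, 50.0, 55.0]


def gap(a, b):
    """Difference between two models' simple averages."""
    return mean(a) - mean(b)


baseline_gap = gap(model_a, model_b)  # 75.0 - 55.0

# Add a saturated benchmark: both models score near the maximum.
saturated_gap = gap(model_a + [98.0], model_b + [97.0])  # 80.75 - 65.5

# Add a very hard benchmark instead: both models score near the floor.
hard_gap = gap(model_a + [6.0], model_b + [4.0])  # 57.75 - 42.25
```

In both cases the extra benchmark barely separates the two models, yet it shifts both averages (up for the saturated one, down for the hard one) and shrinks the gap between them from 20 points to roughly 15.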

Where the data comes from

AI War Tracker aggregates and normalizes published results — we don't run our own evaluations. Sources:

Artificial Analysis — independent benchmark results (GPQA Diamond, AIME, MMLU-Pro, LiveCodeBench, MATH-500, and more), plus output speed and pricing.
OpenRouter — the model catalog: providers, context windows, modalities, capabilities, and API pricing.
arXiv — the source for the research papers we curate.

The catalog refreshes automatically every day — new models are added and existing pricing, specs, and scores are updated from these sources.

Automated vs. curated

The factual data — specs, pricing, benchmark scores, the index — is 100% machine-maintained, so it stays current without manual data entry. The only human layer is editorial: takes on notable models and the significance notes on papers. Those are clearly editorial opinion; everything else is sourced data.

Limitations

We aggregate third-party and self-reported results; we don't independently re-run evals.
Benchmark coverage is uneven: not every model is tested on every benchmark, so a model's index averages only the benchmarks and categories it has scores for.
Benchmarks can be saturated, gamed, or contaminated by training data. Treat the index as a strong directional guide, not a verdict.
Open benchmarks favor models that publish results; some closed models may be under-represented.

Spotted something wrong?

Accuracy matters to us. If a number looks off, tell us on X (@aiwartracker) and we'll fix it.