1. Cells are color-coded based on official medal thresholds. Models are ranked in descending order (↓) of their average score across all 13 Olympiad exams.
2. Medal cutoffs are derived from the theoretical exam scores of human medalists.
3. Only the theoretical components of each exam are evaluated; experimental and diagram-drawing problems are excluded, so Full Mark (Model) ≤ Full Mark (Human).
4. Each model was run 8 times. For each problem, the scores from the 8 runs were averaged, and these per-problem averages were summed to compute the final exam score.
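
The aggregation in note 4 and the color-coding in note 1 can be sketched as follows. This is a minimal illustration, not the evaluation code itself: the function names `exam_score` and `medal_for`, the cutoff values, and the choice to treat a score equal to a cutoff as earning that medal are all assumptions made for the example.

```python
from statistics import mean


def exam_score(runs):
    """Average each problem's score over all runs, then sum over problems.

    `runs` is a list of runs; each run is a list of per-problem scores
    in the same problem order (hypothetical data layout).
    """
    n_problems = len(runs[0])
    if any(len(run) != n_problems for run in runs):
        raise ValueError("every run must score the same set of problems")
    per_problem_avg = [mean(run[i] for run in runs) for i in range(n_problems)]
    return sum(per_problem_avg)


def medal_for(score, cutoffs):
    """Map an exam score to a medal tier given theoretical-exam cutoffs.

    Assumption: a score equal to a cutoff earns that medal.
    """
    for tier in ("gold", "silver", "bronze"):
        if score >= cutoffs[tier]:
            return tier
    return "none"


# Illustrative numbers only: 2 runs of a 2-problem exam (the notes use 8 runs).
example_runs = [[8.0, 5.0], [6.0, 7.0]]
example_cutoffs = {"gold": 20.0, "silver": 12.0, "bronze": 8.0}
total = exam_score(example_runs)
tier = medal_for(total, example_cutoffs)
```

Because the mean is linear, summing per-problem averages gives the same total as averaging the per-run exam totals; the per-problem form is shown here because it matches the wording of note 4.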