coachbench

Research Benchmark Interface

3,863 items · 25 sports · 11 models · 18 settings

Overview

Parses the benchmark JSONL files automatically and summarizes overall accuracy, subgroup performance, and high-miss examples (the items that models most often answer incorrectly).
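
As a rough illustration of the kind of summary described above, here is a minimal Python sketch. The record fields (`model`, `item_id`, `sport`, `correct`) and the sample rows are assumptions made for the example; the actual benchmark files may use a different schema.

```python
import json
from collections import defaultdict

def parse_jsonl(lines):
    """Parse an iterable of JSONL lines into dicts, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]

def summarize(records, top_k=3):
    """Return per-model accuracy, per-sport accuracy, and the most-missed items.

    Assumes each record has `model`, `item_id`, `sport`, and a boolean
    `correct` field (hypothetical schema).
    """
    per_model = defaultdict(lambda: [0, 0])  # model -> [correct, total]
    per_sport = defaultdict(lambda: [0, 0])  # sport -> [correct, total]
    misses = defaultdict(int)                # item_id -> number of wrong answers
    for r in records:
        ok = int(bool(r["correct"]))
        for bucket in (per_model[r["model"]], per_sport[r["sport"]]):
            bucket[0] += ok
            bucket[1] += 1
        if not ok:
            misses[r["item_id"]] += 1
    # Sort by miss count (descending), then item id for a stable order.
    high_miss = sorted(misses.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
    return {
        "model_accuracy": {m: c / n for m, (c, n) in per_model.items()},
        "sport_accuracy": {s: c / n for s, (c, n) in per_sport.items()},
        "high_miss": high_miss,
    }

# Illustrative rows only; not taken from the benchmark.
SAMPLE = [
    '{"model": "model-a", "item_id": "q1", "sport": "tennis", "correct": true}',
    '{"model": "model-a", "item_id": "q2", "sport": "golf", "correct": false}',
    '{"model": "model-b", "item_id": "q1", "sport": "tennis", "correct": true}',
    '{"model": "model-b", "item_id": "q2", "sport": "golf", "correct": false}',
    '{"model": "model-b", "item_id": "q3", "sport": "golf", "correct": true}',
]

summary = summarize(parse_jsonl(SAMPLE))
```

In practice the lines would come from an open file handle rather than an in-memory list; `parse_jsonl` accepts either.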


Compares overall model accuracy horizontally, with the axis localized to the observed score range so that small gaps are easier to read.
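
One plausible way to localize an axis is to bound it by the lowest and highest scores plus a small margin, instead of spanning the full 0 to 1 range. The sketch below assumes accuracies are fractions in [0, 1]; the function name `localized_axis` and the 10% padding default are illustrative choices, not the interface's documented behavior.

```python
def localized_axis(scores, pad_fraction=0.1, lower=0.0, upper=1.0):
    """Return (axis_min, axis_max) tightly bracketing `scores`.

    The range is padded by `pad_fraction` of the score spread on each side
    and clamped to [lower, upper]. If all scores are equal, a fixed padding
    of 0.01 is used so the axis still has nonzero width.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    lo, hi = min(scores), max(scores)
    span = hi - lo
    pad = span * pad_fraction if span > 0 else 0.01
    return max(lower, lo - pad), min(upper, hi + pad)

# Two hypothetical accuracies ten points apart get an axis of roughly
# 0.59 to 0.71, so the gap fills most of the plotted width.
axis_min, axis_max = localized_axis([0.60, 0.70])
```

The trade-off is that a truncated axis exaggerates differences visually, so the axis limits should remain clearly labeled.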


Shows a multidimensional comparison of leading models across overall accuracy and mid-level class aggregates.


Switch between class and sport to locate where each model gains or loses accuracy.
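
A grouped view like this can be approximated by comparing each model's accuracy within a group to that model's own overall accuracy. The following sketch assumes records carry `model`, `class`, `sport`, and `correct` fields (a hypothetical schema); `group_deltas` is an illustrative helper, not part of the interface.

```python
from collections import defaultdict

def group_deltas(records, group_key):
    """Map (model, group) -> group accuracy minus the model's overall accuracy.

    `group_key` selects the grouping dimension, e.g. "class" or "sport".
    Positive values mark groups where the model gains accuracy; negative
    values mark groups where it loses accuracy.
    """
    grouped = defaultdict(lambda: [0, 0])  # (model, group) -> [correct, total]
    overall = defaultdict(lambda: [0, 0])  # model -> [correct, total]
    for r in records:
        ok = int(bool(r["correct"]))
        for bucket in (grouped[(r["model"], r[group_key])], overall[r["model"]]):
            bucket[0] += ok
            bucket[1] += 1
    return {
        (model, group): c / n - overall[model][0] / overall[model][1]
        for (model, group), (c, n) in grouped.items()
    }

# Illustrative records only; not taken from the benchmark.
records = [
    {"model": "model-a", "class": "rules", "sport": "tennis", "correct": True},
    {"model": "model-a", "class": "tactics", "sport": "tennis", "correct": False},
    {"model": "model-a", "class": "rules", "sport": "golf", "correct": True},
    {"model": "model-a", "class": "tactics", "sport": "golf", "correct": True},
]

by_sport = group_deltas(records, "sport")
by_class = group_deltas(records, "class")
```

Switching the grouping dimension only changes the `group_key` argument, which mirrors the class/sport toggle described above.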


Ranks models by overall accuracy and distinguishes direct, thinking, and search-augmented variants.
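
A ranking of this kind might be produced as in the sketch below. It assumes each entry carries a `variant` label drawn from the three kinds named above; the field names, the tie-breaking rule, and the sample entries are hypothetical.

```python
VARIANTS = ("direct", "thinking", "search")

def rank_models(entries):
    """Return (rank, model, variant, accuracy) tuples, best accuracy first.

    Ties are broken alphabetically by model name so the order is stable.
    Raises ValueError for a variant outside VARIANTS.
    """
    for e in entries:
        if e["variant"] not in VARIANTS:
            raise ValueError(f"unknown variant: {e['variant']!r}")
    ordered = sorted(entries, key=lambda e: (-e["accuracy"], e["model"]))
    return [
        (rank, e["model"], e["variant"], e["accuracy"])
        for rank, e in enumerate(ordered, start=1)
    ]

# Illustrative entries only; not real benchmark results.
entries = [
    {"model": "model-a", "variant": "direct", "accuracy": 0.62},
    {"model": "model-b", "variant": "thinking", "accuracy": 0.71},
    {"model": "model-c", "variant": "search", "accuracy": 0.71},
]

leaderboard = rank_models(entries)
```

Keeping the variant as a separate column, rather than folding it into the model name, makes it straightforward to filter or color the ranking by variant.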
