Automatically parses the JSONL files and summarizes overall accuracy, subgroup performance, and high-miss examples.
Compares overall model accuracy in a horizontal chart, with the axis limited to the observed score range so small gaps are easier to read.
Shows a multidimensional comparison of leading models across overall accuracy and mid-level class aggregates.
Switch between the class and sport breakdowns to locate where each model gains or loses accuracy.
Ranks models by overall accuracy and distinguishes direct, thinking, and search-augmented variants.
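
The parsing, subgroup summary, and ranking described above can be sketched roughly as follows. This is a minimal illustration, not the tool's actual implementation: the record fields ("sport", "correct"), the function names, and the sample data are all assumptions made for the example.

```python
import json
from collections import defaultdict


def load_jsonl(lines):
    """Parse an iterable of JSONL lines into dicts, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def summarize(records, group_key="sport"):
    """Return (overall accuracy, per-group accuracy) for one model's records.

    Assumes each record has a boolean "correct" field (hypothetical schema).
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for rec in records:
        group = rec.get(group_key, "unknown")
        totals[group] += 1
        hits[group] += int(bool(rec["correct"]))
    overall = sum(hits.values()) / sum(totals.values()) if totals else 0.0
    by_group = {g: hits[g] / totals[g] for g in totals}
    return overall, by_group


def rank_models(results):
    """Sort a {model name: overall accuracy} mapping from best to worst."""
    return sorted(results.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical sample records, one JSON object per line.
sample = [
    '{"id": 1, "sport": "tennis", "correct": true}',
    '{"id": 2, "sport": "tennis", "correct": false}',
    '{"id": 3, "sport": "golf", "correct": true}',
    '{"id": 4, "sport": "golf", "correct": true}',
]
overall, by_sport = summarize(load_jsonl(sample))
```

Passing group_key="class" instead of "sport" would give the other breakdown, provided the records carry such a field.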