SREGym Leaderboard

This leaderboard compares SRE agents on diagnosis, mitigation, and end-to-end incident resolution on SREGym. Agents are ranked by E2E success rate, which requires both a correct root-cause diagnosis and a successful mitigation in the same run.

Rank  Agent        Provider  Model              Noise  Diag.  Mit.  E2E   TTD    TTM     Tokens
1     Claude Code  Claude    Claude Sonnet 4.6  No     72.6   75.6  60.7  295.6  709.6   1.47M
2     Stratus      Claude    Claude Sonnet 4.6  No     61.5   78.5  54.8  114.9  1145.0  812K
3     Claude Code  Claude    Claude Sonnet 4.6  Yes    62.6   76.3  53.7  316.1  739.1   1.71M
4     Codex        OpenAI    GPT-5.4            No     70.0   65.2  53.3  172.1  374.2   1.98M
5     Codex        OpenAI    GPT-5.4            Yes    59.3   64.0  45.9  214.3  389.8   1.88M
6     Stratus      Claude    Claude Sonnet 4.6  Yes    51.5   65.5  40.2  128.4  582.9   464K
7     Stratus      Kimi      Kimi K2.5          No     41.3   60.6  32.9  417.6  892.6   413K
8     Stratus      Kimi      Kimi K2.5          Yes    38.9   57.3  30.4  469.4  848.3   443K

Diag. Diagnosis success rate · Mit. Mitigation success rate · E2E End-to-end (both diagnosis and mitigation correct) · TTD Time-to-diagnose (seconds) · TTM Time-to-mitigate (seconds) · Tokens Mean token usage per run
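
To illustrate how the three success rates relate, here is a minimal sketch that computes them from per-run outcomes. The `Run` record and its field names are hypothetical and are not taken from SREGym's own schema or scoring code; the only thing assumed from the leaderboard is that a run counts toward E2E when both diagnosis and mitigation succeed on that same run.

```python
from dataclasses import dataclass


@dataclass
class Run:
    # Hypothetical per-run record; field names are assumptions for illustration.
    diagnosed: bool  # root cause was correctly identified
    mitigated: bool  # fault was successfully mitigated


def success_rates(runs: list[Run]) -> dict[str, float]:
    """Return diagnosis, mitigation, and E2E success rates as percentages."""
    n = len(runs)
    if n == 0:
        raise ValueError("need at least one run")
    diag = sum(r.diagnosed for r in runs)
    mit = sum(r.mitigated for r in runs)
    # E2E counts a run only when both outcomes hold on that same run.
    e2e = sum(r.diagnosed and r.mitigated for r in runs)
    return {
        "diag": 100 * diag / n,
        "mit": 100 * mit / n,
        "e2e": 100 * e2e / n,
    }


# Four made-up runs: two fully resolved, one diagnosed only, one mitigated only.
runs = [Run(True, True), Run(True, False), Run(False, True), Run(True, True)]
rates = success_rates(runs)
```

Because E2E is counted per run, it can never exceed the smaller of the diagnosis and mitigation rates, which is consistent with every row of the table.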