How I Built Eval Tools for Karpathy's Autoresearch
Source: DEV Community
TL;DR: Karpathy's autoresearch runs hundreds of GPT pretraining experiments overnight. It doesn't tell you which ones mattered. I built three CLIs that do: autojudge (noise floor + Pareto analysis), autosteer (what to try next), autoevolve (competing agents, cross-pollinate winners).

The problem

After running autoresearch for a week, I had a TSV with thousands of rows and no idea what to trust. The built-in keep/discard logic is: did val_bpb go down? That's it. No noise floor estimation. No way to know whether a 0.02% improvement is real signal or run-to-run jitter. After 700 experiments I had 6 "improvements" and zero confidence in any of them. The eval layer isn't there. Karpathy left it as an exercise.

What I built

autojudge

Reads results.tsv and run.log, estimates the noise floor from recent experiments, checks whether the improvement is on the Pareto front (val_bpb vs memory), and returns a verdict with a confidence score.

pip install autojudge
autojudge --results results.tsv --run run.log

O
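To make the two checks concrete, here is a minimal sketch of what a noise-floor test and a Pareto-front test can look like. This is not autojudge's actual implementation; the function names, the k-sigma threshold, and the (val_bpb, memory) tuples are illustrative assumptions.

```python
import statistics

def noise_floor(recent_bpbs, k=2.0):
    # Hypothetical estimator: treat k standard deviations of recent
    # val_bpb results as run-to-run jitter. Anything smaller than
    # this is indistinguishable from noise.
    return k * statistics.stdev(recent_bpbs)

def is_real_improvement(baseline_bpb, new_bpb, recent_bpbs):
    # An improvement counts only if it clears the noise floor.
    return (baseline_bpb - new_bpb) > noise_floor(recent_bpbs)

def pareto_front(runs):
    # runs: list of (val_bpb, memory) pairs; lower is better on both.
    # A run is on the front if no other run is at least as good on
    # both axes and strictly better on one.
    front = []
    for i, (bpb, mem) in enumerate(runs):
        dominated = any(
            b <= bpb and m <= mem and (b < bpb or m < mem)
            for j, (b, m) in enumerate(runs)
            if j != i
        )
        if not dominated:
            front.append((bpb, mem))
    return front
```

With recent runs jittering by ~0.02 bpb, a 0.05 bpb drop clears a 2-sigma floor while a 0.01 drop does not, which is exactly the distinction the built-in keep/discard logic misses.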