Put Your Agent Evals in CI or Stop Calling Them Evals
Most teams I talk to have "evals." I ask them where the evals run. The answer is almost always the same: a notebook, a dashboard, a spreadsheet someon…
Latest Programming news from Tech News
The Argument: The AI safety conversation is dominated by two camps: the alignment researchers thinking about existential risk, and the product engineer…
What is an LLM evaluation harness? A deep dive into lm-eval-harness
You fine-tuned a 7B model. It aced your smoke tests, your colleague ran a few prom…
If your LLM-as-judge calibration kappa moves around week to week and you cannot explain it from labeller behavior, the usual cause is the marginal dis…