Put Your Agent Evals in CI or Stop Calling Them Evals
Most teams I talk to have "evals." I ask them where the evals run. The answer is almost always the same: a notebook, a dashboard, a spreadsheet someon…
Latest Testing & QA news from Tech News
The Argument: The AI safety conversation is dominated by two camps: the alignment researchers thinking about existential risk, and the product engineer…
What is an LLM evaluation harness? A deep dive into lm-eval-harness
You fine-tuned a 7B model. It aced your smoke tests, your colleague ran a few prom…
This is Sergey Smirnov, AI engineer and founder of LLMStart.ru. Today we look at the sorest spot in AI agent development: how to prove that they…