How to Benchmark LLM Inference Performance: TTFT, ITL, and Throughput Metrics
When deploying large language models to production, measuring performance accurately is critical. Whether you're using vLLM, SGLang, TensorRT-LLM, or …
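The three metrics named in the title can be computed from per-token arrival timestamps recorded while streaming a response. As a minimal illustrative sketch (not tied to vLLM, SGLang, or TensorRT-LLM — the helper name and dict keys are invented for this example): TTFT is the gap from request send to first token, ITL is the gap between consecutive tokens, and output throughput is tokens divided by total elapsed time.

```python
import statistics

def summarize_stream(request_start: float, token_times: list[float]) -> dict:
    """Compute TTFT, mean ITL, and output throughput from token arrival times.

    request_start: wall-clock time the request was sent (seconds)
    token_times:   arrival time of each generated token, in order (seconds)
    Illustrative helper only -- not part of any specific benchmarking tool.
    """
    # Time To First Token: prefill + scheduling delay before the first token
    ttft = token_times[0] - request_start
    # Inter-Token Latency: gaps between consecutive token arrivals (decode speed)
    itls = [b - a for a, b in zip(token_times, token_times[1:])]
    total = token_times[-1] - request_start
    return {
        "ttft_s": ttft,
        "mean_itl_s": statistics.mean(itls) if itls else 0.0,
        "throughput_tok_per_s": len(token_times) / total,
    }

# Example: 5 tokens, first arriving after a 0.2 s prefill, then one every 50 ms
metrics = summarize_stream(0.0, [0.2, 0.25, 0.30, 0.35, 0.40])
print(metrics)
```

In a real benchmark these timestamps come from the client side of a streaming API, and percentiles (p50/p99) over many requests matter more than a single run's mean.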