Tech News — Latest News

EN

AdvancedMathBench: A New Benchmark for LLM Advanced Mathematical Reasoning

What Changed: Large language models (LLMs) have demonstrated proficiency in high-school and olympiad-style mathematics. However, their performance in a…

llm mathematics benchmark proofgeneration

EN

Which LLM should I actually code with? I built a small benchmark to find out

LLM code benchmark — Peculiar Engineer A small, self-run coding benchmark: 3 models on 14 problems across 3 languages, scored on pass@k, cost, and spe…

ai llm benchmark programming

EN

I Benchmarked 42 Compression Formats Spanning Four Decades. Here's What to Actually Use.

I run ezyZip, a browser-based archive tool, so "which format should I use?" is a question I field constantly. The honest answer is usually "it depend…

compression zip benchmark cli

RU

How I wrote an in-memory vector engine in Go, and where it outperformed hnswlib

Six months ago I started writing an in-memory database with vector search in Go: RESP protocol, HNSW index, WAL, multithreading. I describe what came of it…

vector databases vector search hnsw golang go in-memory benchmark quantization rag

EN

AI Coding Tools Benchmark 2026: Cursor vs Copilot vs Windsurf vs Claude Code

I spent two weeks testing Cursor, GitHub Copilot, Windsurf, and Claude Code on the same set of tasks. Not vibes. Not feature lists. Actual work: build…

coding benchmark cursor githubcopilot

EN

I built a neutral benchmarking layer for quantum simulators in Rust — and it revealed a silent disagreement between two backends

rust quantumcomputing opensource benchmark

EN

GLM Is the New Hotness, So Let's Test It On the Homelab

GLM is the new hotness. I'm hearing it from both sides of the AI builder world. Software engineers are talking about it because the benchmark numbers …

modelshowdown benchmark ai llm

RU

Harness Bench: how to evaluate an agent harness and choose a harness-model pairing

Hi! I'm Andrey Ivanov, an NLP researcher at the red_mad_robot R&D lab. When we build an AI agent, the first thing we do is choose a model for the task. …

ai-agents agent harness llm mcp ai evaluation harness benchmark

RU

I built a personal AI benchmark from 2 months of my own sessions, and the expensive models lost to the cheap ones

I fed two months of my AI sessions to a script and built a benchmark for MY work, not for someone else's leaderboard. The result: the trio of "best open models" …

llm benchmark llm-as-a-judge gemma glm self-hosting

EN

Too cheap to be good? Think again.

For years, I ran my WordPress sites on OpenLiteSpeed. Fast server, LSCache is genuinely impressive, and the OLS/WordPress combo is hard to beat on raw…

ai benchmark devops webdev

EN

A UMAP With Arrows Is Not a Benchmark. This Is

How I built a three-task evaluation framework for RNA velocity trajectory inference -- measuring global ordering, pairwise rank preservation, and robu…

benchmark bioinformatics rna scientificsoftware

EN

Engineering CellFateBench: A Reproducible Python Benchmark for Single-Cell Genomics Reasoning

CellFateBench is a scientific software and benchmark-engineering project for evaluating reasoning over single-cell genomics workflows. The project was…

bioinformatics genomics benchmark python

RU

The 99% F1 Illusion in Time Series: how metrics get distorted in anomaly detection and what a real test of 14 architectures shows

Many recent SOTA papers on time-series anomaly detection claim F1 ≈ 99%. We tested one such method, and it turned out that the magic…

time series anomaly detection predictive maintenance predictive analytics benchmark transformers graph neural networks MVTS

RU

Running openai/gpt-oss-20b MXFP4 GGUF locally on a laptop without a discrete GPU: a practical test with 32 GB RAM

I ran the MXFP4 GGUF variant of openai/gpt-oss-20b on an ordinary laptop without a discrete GPU: just the CPU, an integrated Radeon 780M, and shared…

local LLMs openai gpt-oss-20b GGUF MXFP4 LM Studio Radeon 780M Ryzen laptop without a discrete GPU Windows 11 benchmark

RU

Looking for a Black Cat in a 2000-Dimensional Dark Room. A Tournament of Machine Learning Algorithms

Welcome to my little testing ground. In this article I describe how I pitted twenty-one machine learning algorithms against each other, from the old…

machine+learning neural networks benchmark model comparison lightgbm xgboost catboost random forest research

EN

LLM Wire Format Benchmark: Which Format Can AI Actually Read and Write?

Every LLM wire format claims token savings. Nobody proves whether AI models can actually comprehend the format at scale, or produce valid output in it…

llm benchmark ai webdev

EN

Ideogram 4.0 is Good. Just Good.

A blind test across 240 images and 10 professional designers just dropped. Ideogram 4.0 against Gemini 3.1, Grok Imagine, and FLUX.2 Max. The results …

ai review imagegeneration benchmark

RU

I tried computing a neural network layer in the finite Galois field GF(137): 4x on memory, ARM NEON, and honest limitations

I tested a small neural network layer in GF(137) arithmetic: not by quantizing a finished float32 model, but directly in a byte-sized finite-field representation…

GF137 finite-fields edge-inference ARM-NEON uint8 benchmark cpp reproducibility

RU

[Translation] What's New in Swift: May 2026

"What's New in Swift" is a curated digest of releases, videos, and discussions from the Swift project and community. To start, we look at some local…

swift hummingbird llm benchmark

EN

How fast is LlamaStash? Overhead, throughput, and a fair comparison with Ollama and LM Studio

Originally published at deepu.tech. In my release post for LlamaStash I made a claim I need to back up. The wrapper adds zero overhead vs running lla…

ai llamacpp benchmark llm

RU

How coding agents are tested in 2026, and why your production needs its own benchmark

It's no secret that the era of "ask GPT something" is gradually fading into the past. Generative AI is giving way to Agentic AI, which doesn't just…

ml ai benchmark ai-agents ai-agent swe-bench swe-bench verified OSWorld GAIA terminal-bench

EN

Benchmarking the Claude Agent SDK on a local LLM: Haiku and Sonnet tier performance

The Claude Agent SDK exposes three budget tiers (haiku, sonnet, opus) and reads its routing target from environment variables on every call. That …

llm claude llamacpp benchmark

EN

I Benchmarked 17 ESLint Security Plugins. Only One Found Every Vulnerability.

TL;DR: I built a benchmark suite with 40 vulnerable code patterns across 14 …

security eslint javascript benchmark

EN

Multi-Shot vs Zero-Shot: When Adding Examples Actually Hurts Accuracy

Book: Prompt Engineering Pocket Guide: Techniques for Getting the Most from LLMs Also by me: Thinking in Go (2-book series) — Complete Guide to Go Pro…

ai llm prompt benchmark

EN

LMR-BENCH: Can LLM Agents Reproduce NLP Research Code? (EMNLP 2025)

A research team from the University of Texas at Dallas published LMR-BENCH at EMNLP 2025, asking a specific question: can LLM agents reproduce the cor…

benchmark researchreproducibility llmagents paperpoc

EN

AI-generated accessibility, an update — frontier models still fail, but skills change the game

A few months ago I shared early results from the A11y LLM Eval project, a benchmark that measures how accessibly LLMs generate UI code. The previous p…

a11y llm ai benchmark

EN

Benchmarks: Kubernetes MCP Servers Passed. That Was Not Enough.

Kubernetes MCP servers passed our live benchmark. That was not the interesting part. The interesting part was what happened on the way to the green ch…

kubernetes ai benchmark opensource

RU

Does the AI agent really catch bugs? Let it prove it on a benchmark

Hi! It's Mikhail Fedorov again. The first article covered the QA Assist architecture: 11 AI agents, from requirements decomposition to finished automated tests. The second…

ai assistant qa automation llm-agent claude benchmark

EN

How do you benchmark an MCP server you built?

I built a code-intelligence MCP server. Then I built a benchmark for code-intelligence MCP servers. Then my tool placed first on every scenario. I did…

ai mcp claude benchmark

EN

Model Showdown Round 4: Opus vs Qwen — Writers, Not Coders

Two models. Same prompt. Same five fodder files. Same 27 published posts to check for redundancy. Same writing style guide. One chose the Dev.to syndi…

ai llm benchmark agents