Tech News — Latest News

EN

You're Not Paying for Compute. You're Paying for Memory Bandwidth

TL;DR: Inference cost conversations obsess over FLOPs and token prices, but the real constraint on LLM serving is memory bandwidth, specifically the c…

ai llm inference mlops

RU

How to optimize LLM inference: caching, response time, and GPU resources

You've launched LLM inference in production. The request stream hasn't changed and the load is the same as yesterday, yet Time to First Token has suddenly tripled…

ai ml inference mlops

EN

Facing US export controls, China's DeepSeek plans to make its own chips

It's early, but the plan is to reduce dependency on Nvidia and Huawei.

AI china data centers deepseek Huawei inference NVIDIA openai silicon

EN

Two labs race to make AI write whole paragraphs at once instead of word by word

Diffusion text models — which draft an entire block of text at once and then iteratively refine it, rather than generating one token at a time left to…

diffusion openweight google inference

EN

KV Cache Is Eating Your VRAM — Here's How to Estimate It Before You Run Out

Every LLM inference engineer hits this wall eventually. You deployed a model, it works in testing, then production traffic arrives. Suddenly your 80GB…

llm inference engineering ai

EN

Lossless, But Not Free: When Speculative Decoding Actually Pays Off (and When It Doesn't)

One of the hottest topics in LLM inference acceleration right now is Speculative Decoding. DSpark claims 60%–85% single-user speedup at the same thro…

ai llm inference engineering

EN

Extract Structured JSON from Messy Text with Telnyx AI Inference

Messy text is everywhere: support tickets, lead forms, emails, contracts, incident reports, call notes, Slack messages. The annoying part is that the …

ai inference telnyx json

EN

96% of cuBLAS, no `unsafe`: what cuTile Rust proves

GPU programming usually asks Rust developers to surrender the borrow checker at the launch boundary: references collapse into raw pointers, and aliasi…

cutile rust gpu inference

EN

OpenAI and Broadcom announce chip designed for LLM inference at scale

The silicon race is heating up amid the struggle to keep up with demand.

AI Tech Broadcom ChatGPT Codex compute data centers inference Jalapeño LLM openai silicon

EN

Sipp: a local-first runtime for Hybrid AI Applications

Over the past few months, I had the opportunity to contribute to llama.cpp’s WebGPU backend, helping push it from isolated operator support toward a m…

inference ai localai llm

EN

Can You Tell When an LLM API Swaps in a Cheaper Model?

If you call an open-weight model behind an API, whether that is your own box, a hosted endpoint, or a router, you are trusting that the thing answerin…

localai llm inference verification

EN

How to Build a Secure Homelab for LLM Inference

We’ve treated local AI deployments as experimental toys for too long. The moment a homelab becomes a dependency for work, the security posture must sh…

homelab llmsecurity inference supplychain

EN

Speculative decoding: when and why it actually speeds up inference

Your chat endpoint serves 200 requests per second. The model is a 70B Llama 3 fine-…

llm ai inference performance

EN

Intel: Our upcoming AI chip will be cheaper, run cooler than Nvidia, AMD options

Crescent Island is an air-cooled chip that uses LPDDR5 memory.

AI AI inference AMD data centers inference Intel NVIDIA

RU

How an adaptive RAG that needs no LLM at all works

One of the most popular ways to reduce the hallucination rate of language models is RAG, a scheme in which the model, when necessary, turns…

rag llm machinelearning inference compute optimization artificial intelligence classifier AI architecture inference llm

RU

[Translation] Disaggregated LLM inference in Kubernetes: prefill, decode, and pod scheduling

As large language model (LLM) inference workloads grow more complex, a single monolithic serving process hits its limits. The prefil…

vk cloud llm kubernetes inference gpu nvidia disaggregated inference orchestration autoscaling pod scheduling

RU

MELT-1: a 7B transformer dies in 11 hours, while our agent lives for 95

TL;DR. We have released MELT-1, an open benchmark. It measures not how much a model knows under ideal conditions (MMLU & co), but how long it survives under drift…

inference ai-agents sovereign AI