nvidia-smi Reports 97% Utilization While the GPU Sits Idle
TL;DR A GPU shows 97% utilization in nvidia-smi, but training throughput is a fraction of what benchmarks promise. The GPU is not computing; it is wa…
Latest AI & ML news from Tech News
Hi, Habr! I'm Misha Onyanov, a Python developer and platform engineer on F&R, the largest project at MAGNIT TECH. In this article you will learn how, with the help…
In most business scenarios, the LLM is no longer just a chatbot. Modern models are becoming part of agentic systems: they have tools, …
I run a production multi-agent AI system on a single M1 Mac in Jamaica. 6 autonomous agents. 26 cron workflows. 5-layer persistent memory. All contain…
Quantization formats compared: GGUF vs GPTQ vs AWQ vs NF4 You just finished fine-tuning a 7B parameter model. The raw FP16 weights are 14 GB. Your tar…
In the early stages of adoption, LLMs look like a quick win for a company: an external API (for example, ChatGPT) is plugged in, work with text speeds up, …
I Built a Production RAG System on My M1 Mac for $0 Most RAG tutorials stop at "it answers questions." But answering questions is table stakes. The re…
LLMs have reduced launching AI features to a few API calls, and data scientists seem to have dropped off the critical path. In practice, this is exactly where…
TL;DR. If your LLM bill is one line item on a cloud invoice, you cannot answer "which team spent that." We fixed this by tagging every gateway span wi…
In recent years, LLM development has followed a path of extensive scaling: the assumption was that the more weights and data, the smarter the model. The industry even…
TL;DR: We spent three weeks chasing a 6 mAP regression in an event-camera object detector. The model was fine. The bug was the accumulation window we …
I Built a Complete AI Infrastructure Stack from Scratch — Here's What I Learned Most AI projects start at the top of the stack. You grab an LLM API, w…
We describe how we integrated a CodeBERT-based secret-classification model into a production product with tight hardware constraints, cutting the time…
TL;DR: We ran post-training quantisation (PTQ) and quantisation-aware training (QAT) side by side on the same defect-classification model deployed on …
TL;DR: Our DPO pipeline used a single LLM as the preference judge. Training reward climbed every run. Production accuracy fell 4 points. The judge was…
TL;DR: Our SDXL LoRA fine-tune for a Photoroom product photography model trained for six days while silently corrupting its adapter weights. The cause…
TL;DR: We turned on vLLM's prefix cache for our agent workloads at Nexus Labs and watched TTFT drop from 480ms to 110ms on one tenant and stay exactly…
Part 1 showed how to evaluate binary classification thresholds in Python. This part asks the harder enterprise question: What happens when that thresh…
Same prompt, two models, different outputs. No tooling was actually showing me where they diverged. Built tokenflame that gives entropy heatmaps, toke…
GPU waste in Kubernetes does not announce itself. Your cluster shows healthy utilization. Your dashboards are green. But 20–40% of your GPU capacity i…
Everyone is building AI agents right now. Autonomous systems that reason, plan, and act without humans in the loop. Agents that write code, manage wor…
PyPI: pip install llm-nano-vm GitHub: http://github.com/Ale007XD/nano_vm MCP gateway: http://github.com/Ale007XD/nano-vm-mcp I've been building a dete…
TL;DR: Running LLM evaluations on every PR will burn your GPU budget faster than you can blink. We cut our eval spend by about 60% by batching jobs in…
Some companies believe that many cheap juniors are better than a few expensive seniors. Obviously, opinions may differ, so…
TL;DR: Single-image diffusion inference is bottlenecked by kernel launch overhead and attention memory traffic, not raw FLOPs. torch.compile with mode…
Most AI tutorials stop at “Hello World.” You wire up a model, send a prompt, get a response, and feel like you’ve built something. But the moment you …
The first number anyone quotes when asked what generative AI costs is a per-token figure. It is a comfortable number — small, unambiguous, available o…
The "build an agent in 5 minutes" tutorials get you to a demo. They don't get you to production. Here's the field guide for the four primitives that d…
"If something can go wrong, it will." We are not trying to prevent failure; we design the system so that the failure of a single element…
TL;DR — ONNX Runtime's QNN execution provider will quietly route unsupported ops to the CPU instead of the Hexagon NPU. Your accuracy is fine. Your ev…