DevOps — Tech News

EN

nvidia-smi Reports 97% Utilization While the GPU Sits Idle

TL;DR: A GPU shows 97% utilization in nvidia-smi, but training throughput is a fraction of what benchmarks promise. The GPU is not computing; it is wa…

gpu ebpf observability mlops

RU

LLM Sandbox: an isolated environment for executing LLM-generated code [part 1, theory]

In most business scenarios, the LLM is no longer just a chatbot. Modern models are becoming part of agentic systems: they have tools, …

agent llm sandbox docker security agents ai devops mlops

EN

I Processed 2.4 Billion Tokens Across 52 AI Models for $0.52. Here's the Full Breakdown.

I run a production multi-agent AI system on a single M1 Mac in Jamaica. 6 autonomous agents. 26 cron workflows. 5-layer persistent memory. All contain…

agenticai openrouter mlops costoptimization

EN

Quantization formats compared: GGUF vs GPTQ vs AWQ vs NF4

You just finished fine-tuning a 7B parameter model. The raw FP16 weights are 14 GB. Your tar…

llm quantization mlops tutorial

RU

When a company should build its own LLM cluster instead of using external APIs

In the early stages of adoption, LLMs look like a quick win for a company: an external API (for example, ChatGPT) is plugged in, work with text speeds up, …

llm on-premise ai kubernetes mlops model-inference gpu-clusters rag llama 3 ai-infrastructure enterprise-ai

EN

I Built a Production RAG System on My M1 Mac for $0

Most RAG tutorials stop at "it answers questions." But answering questions is table stakes. The re…

rag mlops ai python

EN

Per-project LLM cost attribution with OTel spans: the wiring

TL;DR. If your LLM bill is one line item on a cloud invoice, you cannot answer "which team spent that." We fixed this by tagging every gateway span wi…

devops observability opentelemetry mlops

EN

Our event-camera detector lost 6 mAP to a badly chosen accumulation window

TL;DR: We spent three weeks chasing a 6 mAP regression in an event-camera object detector. The model was fine. The bug was the accumulation window we …

computervision machinelearning pytorch mlops

EN

I Built a Complete AI Infrastructure Stack from Scratch — Here's What I Learned

Most AI projects start at the top of the stack. You grab an LLM API, w…

distributedsystems mlops cpp go

EN

QAT vs PTQ on our edge vision model: 6 months of A/B data

TL;DR: We ran post-training quantisation (PTQ) and quantisation-aware training (QAT) side by side on the same defect-classification model deployed on …

machinelearning computervision mlops pytorch

EN

LLM-as-judge variance broke our DPO training signal for 3 weeks

TL;DR: Our DPO pipeline used a single LLM as the preference judge. Training reward climbed every run. Production accuracy fell 4 points. The judge was…

machinelearning mlops llm pytorch

EN

The bf16 grad accumulator that killed our SDXL LoRA training

TL;DR: Our SDXL LoRA fine-tune for a Photoroom product photography model trained for six days while silently corrupting its adapter weights. The cause…

machinelearning pytorch mlops computervision

EN

Prefix caching in vLLM under multi-tenant agent traffic

TL;DR: We turned on vLLM's prefix cache for our agent workloads at Nexus Labs and watched TTFT drop from 480ms to 110ms on one tenant and stay exactly…

llm mlops infrastructure pytorch

EN

Part 2: Enterprise Decision Intelligence Architecture: AI Governance, Threshold Policy Engines, and Operational AI Systems

Part 1 showed how to evaluate binary classification thresholds in Python. This part asks the harder enterprise question: What happens when that thresh…

ai architecture governance mlops

EN

How to Detect GPU Waste in a Kubernetes Cluster

GPU waste in Kubernetes does not announce itself. Your cluster shows healthy utilization. Your dashboards are green. But 20–40% of your GPU capacity i…

kubernetes gpu mlops devops

EN

Why 91% of AI Agents Fail in Production (And What the 9% Do Differently)

Everyone is building AI agents right now. Autonomous systems that reason, plan, and act without humans in the loop. Agents that write code, manage wor…

ai mlops systemdesign productionai

EN

llm-nano-vm v0.8.0 — deterministic FSM runtime for LLM pipelines, now with output validation and per-step timeouts

PyPI: pip install llm-nano-vm | GitHub: http://github.com/Ale007XD/nano_vm | MCP gateway: http://github.com/Ale007XD/nano-vm-mcp | I've been building a dete…

mlops backend opensource fintech

EN

Stop paying for idle GPUs in your CI: batching LLM eval jobs

TL;DR: Running LLM evaluations on every PR will burn your GPU budget faster than you can blink. We cut our eval spend by about 60% by batching jobs in…

devops mlops llm sre

EN

Why your diffusion model is slow at batch size 1 (and what actually helps)

TL;DR: Single-image diffusion inference is bottlenecked by kernel launch overhead and attention memory traffic, not raw FLOPs. torch.compile with mode…

machinelearning pytorch computervision mlops

EN

When AI Meets Reality: Why “Hello World” Isn’t Enough for LLM Systems

Most AI tutorials stop at “Hello World.” You wire up a model, send a prompt, get a response, and feel like you’ve built something. But the moment you …

ai architecture llm mlops

EN

What GenAI Actually Costs in Production

The first number anyone quotes when asked what generative AI costs is a per-token figure. It is a comfortable number — small, unambiguous, available o…

llm mlops aiengineering cost

EN

The Missing Engineering Stack for Production AI Agents

The "build an agent in 5 minutes" tutorials get you to a demo. They don't get you to production. Here's the field guide for the four primitives that d…

agentskills promptengineering mcp mlops

EN

How we catch silent NPU fallback on Snapdragon in CI (and why your eval set won't)

TL;DR — ONNX Runtime's QNN execution provider will quietly route unsupported ops to the CPU instead of the Hexagon NPU. Your accuracy is fine. Your ev…

edgeai mlops onnxruntime cicd

EN

Beyond Monitoring: Building AI-Powered Predictive Observability for Retail Data Pipelines

Three numbers before we start: average detection time with traditional monitoring: 4.2 hours; average detection time with predictive observability: 11 …

dataengineering observability mlops dataquality