OpenTelemetry in Production: Traces, Context, and What Actually Matters
Why OpenTelemetry Won. Three years ago, the observability landscape was fragmented. Jaeger for tracing, Prometheus for metrics, Fluentd for logs, each …
Latest Architecture news from Tech News
AI agents are distributed systems. They fan out across LLM calls, tool invocations, memory lookups, and multi-step reasoning loops — often asynchronou…
In an earlier post I argued that event-driven agents reduce scope, cost, and decision dispersion because they narrow the decision space before the mod…
A single slow GPU – a straggler – in a 1,000-node training cluster idles 999 healthy GPUs at every AllReduce barrier. The job does not crash. There is…
OpenTelemetry eBPF Instrumentation (OBI) — The Complete Guide: KubeCon EU 2026 Beta Launch, Zero-Code Observability, and the 1.0 GA Roadmap Published …
The Day Prometheus Fell Over Prometheus memory usage spiked from 8GB to 32GB overnight. OOM-killed. Monitoring was down for 20 minutes while we scramb…
A practical guide. In the first part, I covered the two initial signals to diagnose that something is wrong: latency and traffic. Those two alone explain …
Originally published on cubeapm.com. As organizations adopt cloud-native architectures, Kubernetes, and microservices, systems have become more distri…
TL;DR: APM = metrics + traces + logs: use all three together. Auto-instrument first: agents cover HTTP, DB, queues. Add custom tags (order_id, cust…
Book: Observability for LLM Applications — paperback and hardcover on Amazon · Ebook from Apr 22 Also by me: Thinking in Go (2-book series) — Complete…
Book: Observability for LLM Applications — paperback and hardcover on Amazon · Ebook from Apr 22 My project: Hermes IDE | GitHub — an IDE for develope…
Why event-driven agents reduce scope, cost, and decision dispersion Most agent systems do not control their costs because they spend tokens letting th…
The State of Observability 2026 report is out — here's what 407 DevOps engineers and SREs actually told us. Let's be honest. Most of us are juggling m…
I'm Dmitry Shevkoplyas, technical lead of Swapno, a service for swapping cars key-for-key, without dealers. The mechanics work like Tinder: swipe…
You have Grafana. It shows graphs from Prometheus. Prometheus scrapes metrics from your services. If a service goes down, you see red on the dashboard…
How Observability Engineering Cut Incident Response Time by 85% in Production Part 1 of 3: Structured Logs and Correlation IDs Part of a three-part se…
For many teams, Let’s Encrypt expiry reminder emails were a quiet but important safety net. When those reminders stopped, something subtle changed: Ce…
Text Generation Inference (TGI) has a very specific energy. It is not the newest kid in the inference street, but it is the one that already learned h…
When something goes wrong in my applications, logging is almost always the first tool I reach for. I'll throw a few log statements at the start and en…
Implementing Visual Audit Trails for LLM Agents in Production — A Step-by-Step Guide Your LLM agent is live in production. It's handling 500+ customer…