Building an AI SRE That Learns From Every Outage: Inside Nexus Sentinel
Every engineering team has experienced it. A production incident happens at 2 AM. An engineer joins the bridge call, opens dashboards, checks logs, se…
Latest Open Source news from Tech News
Every engineering team has experienced it. A production incident happens at 2 AM. An engineer joins the bridge call, opens dashboards, checks logs, se…
The thesis Quorum is built on is uncomfortable and true: the tools a team uses to coordinate an incident often live in the same region as the thing th…
If you're a DevOps, SRE, or Cloud engineer in India — or hiring one — the market in 2026 looks very different from a few years ago. Instead of guessin…
The ecosystem surrounding Kubernetes has always been a rapidly moving target. Just when Site Reliability Engineers and Platform Engineers feel they ha…
I'm going to tell you about an engineer I worked with. Call him Mark. Mark was talented, well-liked, and utterly ineffective. Here's what I learned fr…
Error budgets are useless without a policy. 'We're out of error budget' should trigger consequences. If it doesn't, you don't have an error budget — y…
A production-focused redesign of a Stage 6 LGTM observability platform, moving from a single-service Anvila monitoring setup to a reusable, secure, hi…
TL;DR: sre-skills is an open-source (Apache-2.0) library of SRE methodology skills an AI agent can load: the decision procedure for working an inciden…
Three years ago we had our first real outage. Six hours of downtime. Thousands of angry users. Multiple executives on the call. Here's what we did rig…
Production is down. Slack is on fire. Your phone is ringing. You've seen this exact error before — ConnectionResetError: [Errno 104] cascading through…
TL;DR 3am page: GPU training pipeline missed its SLA. Datadog shows 95% GPU utilization. nvidia-smi agrees. Everything looks green, but the job is 3x …
Abstract Distributed financial systems are described through explicit interfaces. Services call APIs, consume events, write to databases, submit trans…
1. Введение Всем привет! Меня зовут Яблоков Олег, я — ведущий инженер ИТ-отдела Navio и отвечаю за систему мониторинга основной инфраструктуры …
A startup raised $50M this week to help companies move AI out of stalled pilots and into production. Investors called it "the defining gap of 2026." S…
A production debugging story: tracing recurring 2–5-second read-only storms on a ClickHouse cluster down to a single 32-bit integer — and the one-line…
TL;DR Alert on symptoms, not causes – users feel latency and errors, not high CPU. Alert on p95 latency and error rates, not internal metrics. Use SLO…
TL;DR: Our ECS build workers were quietly killing in-flight jobs every time we scaled in or deployed. The fix wasn't a bigger timeout, it was actually…
The test passed. The runbook completed. Infrastructure came back online inside the RTO window. None of that means the organization can recover from an…
From Eclipses to P95 Latency: What the Joseon Dynasty Can Teach Us About Incident Response The Joseon Dynasty ruled Korea for more than five centuries…
Over the weekend, I vibe coded a cooking game. You combine random ingredients, and the game generates a dish with a score and a snarky review — stuff …
Production incidents almost never break in one place. The alert fires in one tool. The broken deploy is in Netlify. The suspicious change is in GitHub…
The Non-Negotiable Imperative: Architecting Predictive AIOps for IBM ACE/MQ The era of reactive integration management is dead. In today's hyper-conne…
Привет. В предыдущих статьях этого цикла мы разбирали, как Kubernetes-объекты читаются ( первая — informer и кэш в controller-runtime ) и записываются…
TL;DR: Our metrics bill went 6x in a single month. Traffic was flat. One Prometheus label carrying per-build IDs spawned millions of time series, and …
The explosion of artificial intelligence retrieval applications has transformed the way enterprises deploy document databases. However, transitioning …
Yesterday a piece came out that framed something I've been watching build across production environments for months. There is a category of production…
For years, database teams have relied on a simple assumption: “The backup completed successfully, so we are safe.” Unfortunately, reality is very diff…
As I discussed in my SLO Design article, traditional reliability metrics fail for agentic AI systems. Now let's look at how to actually implement sema…
I'm an SRE at Sony Interactive Entertainment. After a week where my teammate had four incidents (and four RCAs), I built something for the blank-page …
If you manage Kubernetes on bare metal or on prem environments, you'll eventually encounter the KubeAPIErrorBudgetBurn alert from the kube-prometheus-…