Open Source — Tech News

EN

Building an AI SRE That Learns From Every Outage: Inside Nexus Sentinel

Every engineering team has experienced it. A production incident happens at 2 AM. An engineer joins the bridge call, opens dashboards, checks logs, se…

agents ai devops sre

EN

Surviving the region you run in: failover on Aurora DSQL, and what the demo proves

The thesis Quorum is built on is uncomfortable and true: the tools a team uses to coordinate an incident often live in the same region as the thing th…

aws database reliability sre

EN

DevOps Salaries & Hiring in India 2026: What 800+ Live Job Listings Reveal

If you're a DevOps, SRE, or Cloud engineer in India — or hiring one — the market in 2026 looks very different from a few years ago. Instead of guessin…

career devops news sre

EN

Supercharging Kubernetes: How eBPF is Revolutionizing SRE and Platform Engineering

The ecosystem surrounding Kubernetes has always been a rapidly moving target. Just when Site Reliability Engineers and Platform Engineers feel they ha…

kubernetes devops sre platformengineering

EN

The Engineer Who Owns Nothing: A Cautionary Tale

I'm going to tell you about an engineer I worked with. Call him Mark. Mark was talented, well-liked, and utterly ineffective. Here's what I learned fr…

sre devops culture ownership

EN

Error Budget Policies That Hold Leadership Accountable

Error budgets are useless without a policy. 'We're out of error budget' should trigger consequences. If it doesn't, you don't have an error budget — y…

sre devops slo leadership

EN

Engineering Design Document: Reusable Observability Platform V2

A production-focused redesign of a Stage 6 LGTM observability platform, moving from a single-service Anvila monitoring setup to a reusable, secure, hi…

devops observability architecture sre

EN

Open-source SRE methodology skills an AI agent can load. Apache-2.0, runnable offline against fixtures, no credentials.

TL;DR: sre-skills is an open-source (Apache-2.0) library of SRE methodology skills an AI agent can load: the decision procedure for working an inciden…

sre devops ai opensource

EN

How We Handled Our First Major Outage (And Survived)

Three years ago we had our first real outage. Six hours of downtime. Thousands of angry users. Multiple executives on the call. Here's what we did rig…

sre devops incident culture

EN

How I Built an AI Agent That Fixes Production Errors Using Memory — And Why Memory Changes Everything

Production is down. Slack is on fire. Your phone is ringing. You've seen this exact error before — ConnectionResetError: [Errno 104] cascading through…

agents ai rag sre

EN

GPU Incident at 3am: eBPF Tracing from Page to Root Cause in 60 Seconds

TL;DR 3am page: GPU training pipeline missed its SLA. Datadog shows 95% GPU utilization. nvidia-smi agrees. Everything looks green, but the job is 3x …

gpu ebpf observability sre

EN

Hidden Coupling in Distributed Financial Systems: Dependencies You Didn't Know You Had

Abstract: Distributed financial systems are described through explicit interfaces. Services call APIs, consume events, write to databases, submit trans…

distributedsystems fintech sre systemdesign

RU

From Prometheus to Victoria Metrics: How We Rebuilt Monitoring in Kubernetes

1. Introduction. Hi everyone! My name is Oleg Yablokov. I am the lead engineer in the IT department at Navio, and I am responsible for the monitoring system of the core infrastructure …

victoriametrics prometheus kubernetes мониторинг observability sre devops gitops grafana alertmanager

EN

The AI Pilot-to-Production Gap Is an SRE Problem And We Already Know How to Close It

A startup raised $50M this week to help companies move AI out of stalled pilots and into production. Investors called it "the defining gap of 2026." S…

sre aws devops agentaichallenge

EN

The 32-bit Hidden Countdown in ClickHouse Keeper: How an XID Overflow Gave Us Weekly Read-Only Bursts

A production debugging story: tracing recurring 2–5-second read-only storms on a ClickHouse cluster down to a single 32-bit integer — and the one-line…

database distributedsystems sre systems

EN

Setting Up Alerts and Notifications for Performance Bottlenecks

TL;DR: Alert on symptoms, not causes – users feel latency and errors, not high CPU. Alert on p95 latency and error rates, not internal metrics. Use SLO…

devops monitoring performance sre

EN

The SIGTERM our build workers ignored, and the 90s that fixed it

TL;DR: Our ECS build workers were quietly killing in-flight jobs every time we scaled in or deployed. The fix wasn't a bigger timeout, it was actually…

sre infrastructure devops kubernetes

EN

Why Most Disaster Recovery Tests Don't Test Recovery

The test passed. The runbook completed. Infrastructure came back online inside the RTO window. None of that means the organization can recover from an…

disasterrecovery sre infrastructure devops

EN

From Eclipses to P95 Latency: What the Joseon Dynasty Can Teach Us About Incident Response

The Joseon Dynasty ruled Korea for more than five centuries…

devops monitoring performance sre

EN

Agentic Ops: How I Shipped My Vibe-Coded Game to Production

Over the weekend, I vibe coded a cooking game. You combine random ingredients, and the game generates a dish with a score and a snarky review — stuff …

vibecoding devops sre aiops

EN

Building ReefWatch, a Coral-Powered Production Triage Agent

Production incidents almost never break in one place. The alert fires in one tool. The broken deploy is in Netlify. The suspicious change is in GitHub…

agents ai showdev sre

EN

Observability Telemetry and Predictive AIOps

The Non-Negotiable Imperative: Architecting Predictive AIOps for IBM ACE/MQ. The era of reactive integration management is dead. In today's hyper-conne…

ai architecture automation sre

RU

The Lifecycle of an Object in Kubernetes: From kubectl apply to Full Deletion

Hi. In the previous articles in this series, we looked at how Kubernetes objects are read (the first: informer and cache in controller-runtime) and written…

kubernetes controller-runtime api go golang open sou platform engineering devops sre cloud

EN

The Prometheus label that blew our monitoring bill out 6x

TL;DR: Our metrics bill went 6x in a single month. Traffic was flat. One Prometheus label carrying per-build IDs spawned millions of time series, and …

devops infrastructure sre

EN

How to Optimize MongoDB on Bare Metal Servers: SRE Playbook

The explosion of artificial intelligence retrieval applications has transformed the way enterprises deploy document databases. However, transitioning …

mongodb devops sre database

EN

Your Agent Acts Without Checking Your Error Budget — That's the Failure Mode Nobody Is Tracking

Yesterday a piece came out that framed something I've been watching build across production environments for months. There is a category of production…

ai sre devops cursor

EN

Why Backup Success Does Not Mean Database Recoverability

For years, database teams have relied on a simple assumption: “The backup completed successfully, so we are safe.” Unfortunately, reality is very diff…

database devops sre

EN

Why Your AI Agent Monitoring is Wrong (And How to Fix It)

As I discussed in my SLO Design article, traditional reliability metrics fail for agentic AI systems. Now let's look at how to actually implement sema…

ai sre devops fintech

EN

I got tired of writing post-mortems — so I built RCAi for SREs

I'm an SRE at Sony Interactive Entertainment. After a week where my teammate had four incidents (and four RCAs), I built something for the blank-page …

sre devops ai postmortem

EN

Diagnosing KubeAPIErrorBudgetBurn: When a 7-Year-Old Disk Takes Down Your Control Plane

If you manage Kubernetes on bare-metal or on-prem environments, you'll eventually encounter the KubeAPIErrorBudgetBurn alert from the kube-prometheus-…

kubernetes devops etcd sre