The Case for a Dedicated Reliability Engineer
Many engineering teams treat reliability as 'everyone's responsibility.' In practice, that means it's nobody's responsibility. Here's why you need som…
Over the weekend, I vibe coded a cooking game. You combine random ingredients, and the game generates a dish with a score and a snarky review — stuff …
Production incidents almost never break in one place. The alert fires in one tool. The broken deploy is in Netlify. The suspicious change is in GitHub…
The Non-Negotiable Imperative: Architecting Predictive AIOps for IBM ACE/MQ. The era of reactive integration management is dead. In today's hyper-conne…
TL;DR: Our metrics bill went 6x in a single month. Traffic was flat. One Prometheus label carrying per-build IDs spawned millions of time series, and …
The explosion of artificial intelligence retrieval applications has transformed the way enterprises deploy document databases. However, transitioning …
Yesterday a piece came out that framed something I've been watching build across production environments for months. There is a category of production…
For years, database teams have relied on a simple assumption: “The backup completed successfully, so we are safe.” Unfortunately, reality is very diff…
As I discussed in my SLO Design article, traditional reliability metrics fail for agentic AI systems. Now let's look at how to actually implement sema…
I'm an SRE at Sony Interactive Entertainment. After a week where my teammate had four incidents (and four RCAs), I built something for the blank-page …
If you manage Kubernetes on bare-metal or on-prem environments, you'll eventually encounter the KubeAPIErrorBudgetBurn alert from the kube-prometheus-…
A few weeks ago I started building SafeRun — inline reliability infrastructure for AI agents in production. The temptation, when you're building somet…
TL;DR: Running LLM evaluations on every PR will burn your GPU budget faster than you can blink. We cut our eval spend by about 60% by batching jobs in…
The Microsoft team that built the Azure SRE Agent published something in January that I keep coming back to. Six months into building it, they realize…
Running large language model inference servers in production exposes gaps that neither stock Prometheus dashboards nor the official documentation of v…
The game development ecosystem is scaling at an unprecedented rate. Modern studio teams are engineering massive, interconnected virtual worlds operati…
Over the last few years, people have been asking the same question about AI: with so much money going into models, GPUs, and data centers, when will i…
AI platforms have a unique failure mode: they can bankrupt you. A runaway inference loop. A cascading retry storm. An agent that decides to call GPT-4…
Introduction: In modern DevOps, simply knowing whether your application is "up" or "down" isn't enough. Users care about latency, reliability, and the …
The Monitoring Stack We Actually Use in Production: Prometheus, Grafana, and three things nobody talks about until they break. Our Stack: Prometheus for…
We're building the agentic OS for DevOps — AI agents that make cloud environments self-building, self-healing, and self-optimizing. We're looking for …
I no longer think the most dangerous cloud outage looks like an outage. The servers may be healthy. The dashboard may load. The data may still exist. …
LinkedIn Draft, Workflow (2026-05-19): A hard-earned rule from incident retrospectives: Incident RCA without a data-backed timeline is just a story yo…
TL;DR: We bolted an LLM gateway in front of the AI features in our build pipeline tooling and ended up running Bifrost instead of LiteLLM or Kong. The…
Three weeks ago someone on the AWS Builders Slack posted something that stopped me cold. Their production AI agent had been running for six hours. CPU…
A single LLM writing a production runbook is like asking one engineer to design, review, and approve their own code. It works. Sometimes. But the fail…
Agentic AI in DevOps: Useful Only After You Add Guardrails. Most DevOps teams do not need an AI agent with production access on day one. What they actu…
Most developers spend years learning how to build software, but far fewer spend time studying how software breaks. Yet some of the most valuable engin…
It was a regular working day when the first alert landed. Kubernetes health check showing the control plane was degraded. I'd seen these before. Usual…
I've been writing software and running production infrastructure for over 20 years. I've been on call at 3am, written post-mortems, and had the kind o…