The Case for a Dedicated Reliability Engineer
Many engineering teams treat reliability as 'everyone's responsibility.' In practice, that means it's nobody's responsibility. Here's why you need som…
Over the weekend, I vibe coded a cooking game. You combine random ingredients, and the game generates a dish with a score and a snarky review — stuff …
Production incidents almost never break in one place. The alert fires in one tool. The broken deploy is in Netlify. The suspicious change is in GitHub…
The Non-Negotiable Imperative: Architecting Predictive AIOps for IBM ACE/MQ The era of reactive integration management is dead. In today's hyper-conne…
After an incident, the team almost always wants to see more: add a field to the log, keep one more label, leave a dashboard in place "just in case." In the moment…
Hi. In the previous articles in this series, we looked at how Kubernetes objects are read (the first one: the informer and cache in controller-runtime) and written…
TL;DR: Our metrics bill went 6x in a single month. Traffic was flat. One Prometheus label carrying per-build IDs spawned millions of time series, and …
The explosion of artificial intelligence retrieval applications has transformed the way enterprises deploy document databases. However, transitioning …
Junior salaries in IT are usually low. Employers look for candidates with at least a year of experience even for entry-level positions, and vacancies that require no experience often…
Yesterday a piece came out that framed something I've been watching build across production environments for months. There is a category of production…
In the first part, we covered how requests from Mattermost arrive in n8n, get classified by category, and are routed to the right processing branch. In this…
For years, database teams have relied on a simple assumption: “The backup completed successfully, so we are safe.” Unfortunately, reality is very diff…
As I discussed in my SLO Design article, traditional reliability metrics fail for agentic AI systems. Now let's look at how to actually implement sema…
I'm an SRE at Sony Interactive Entertainment. After a week where my teammate had four incidents (and four RCAs), I built something for the blank-page …
If you manage Kubernetes on bare metal or on prem environments, you'll eventually encounter the KubeAPIErrorBudgetBurn alert from the kube-prometheus-…
A few weeks ago I started building SafeRun — inline reliability infrastructure for AI agents in production. The temptation, when you're building somet…
TL;DR: Running LLM evaluations on every PR will burn your GPU budget faster than you can blink. We cut our eval spend by about 60% by batching jobs in…
The Microsoft team that built the Azure SRE Agent published something in January that I keep coming back to. Six months into building it, they realize…
Running large language model inference servers in production exposes gaps that neither stock Prometheus dashboards nor the official documentation of v…
The game development ecosystem is scaling at an unprecedented rate. Modern studio teams are engineering massive, interconnected virtual worlds operati…
Over the last few years, people have been asking the same question about AI: with so much money going into models, GPUs, and data centers, when will i…
AI platforms have a unique failure mode: they can bankrupt you. A runaway inference loop. A cascading retry storm. An agent that decides to call GPT-4…
Introduction In modern DevOps, simply knowing whether your application is "up" or "down" isn't enough. Users care about latency, reliability, and the …
The Monitoring Stack We Actually Use in Production Prometheus, Grafana, and three things nobody talks about until they break. Our Stack Prometheus for…
We're building the agentic OS for DevOps — AI agents that make cloud environments self-building, self-healing, and self-optimizing. We're looking for …
I no longer think the most dangerous cloud outage looks like an outage. The servers may be healthy. The dashboard may load. The data may still exist. …
LinkedIn Draft — Workflow (2026-05-19) A hard-earned rule from incident retrospectives: Incident RCA without a data-backed timeline is just a story yo…
Yandex's infrastructure runs thousands of microservices that generate millions of time series (metrics) every second. These can be counts…
TL;DR: We bolted an LLM gateway in front of the AI features in our build pipeline tooling and ended up running Bifrost instead of LiteLLM or Kong. The…
Three weeks ago someone on the AWS Builders Slack posted something that stopped me cold. Their production AI agent had been running for six hours. CPU…