What is SRE? A Beginner's Guide to Site Reliability Engineering
Why This Matters: The 2 AM Problem It's 2 AM. Your phone rings. Your production database is down. Customers can't log in. Revenue is dropping by the s…
Latest Testing & QA news from Tech News
Incident response automation is a trap. Some things should be automated. Some things absolutely should not be. Getting the line wrong is worse than au…
The ecosystem surrounding Kubernetes has always been a rapidly moving target. Just when Site Reliability Engineers and Platform Engineers feel they ha…
A production-focused redesign of a Stage 6 LGTM observability platform, moving from a single-service Anvila monitoring setup to a reusable, secure, hi…
TL;DR: sre-skills is an open-source (Apache-2.0) library of SRE methodology skills an AI agent can load: the decision procedure for working an inciden…
TL;DR 3am page: GPU training pipeline missed its SLA. Datadog shows 95% GPU utilization. nvidia-smi agrees. Everything looks green, but the job is 3x …
Abstract Distributed financial systems are described through explicit interfaces. Services call APIs, consume events, write to databases, submit trans…
A production debugging story: tracing recurring 2–5-second read-only storms on a ClickHouse cluster down to a single 32-bit integer — and the one-line…
TL;DR Alert on symptoms, not causes – users feel latency and errors, not high CPU. Alert on p95 latency and error rates, not internal metrics. Use SLO…
TL;DR: Our ECS build workers were quietly killing in-flight jobs every time we scaled in or deployed. The fix wasn't a bigger timeout, it was actually…
The test passed. The runbook completed. Infrastructure came back online inside the RTO window. None of that means the organization can recover from an…
Production incidents almost never break in one place. The alert fires in one tool. The broken deploy is in Netlify. The suspicious change is in GitHub…
The Non-Negotiable Imperative: Architecting Predictive AIOps for IBM ACE/MQ The era of reactive integration management is dead. In today's hyper-conne…
Yesterday a piece came out that framed something I've been watching build across production environments for months. There is a category of production…
As I discussed in my SLO Design article, traditional reliability metrics fail for agentic AI systems. Now let's look at how to actually implement sema…
A few weeks ago I started building SafeRun — inline reliability infrastructure for AI agents in production. The temptation, when you're building somet…
TL;DR: Running LLM evaluations on every PR will burn your GPU budget faster than you can blink. We cut our eval spend by about 60% by batching jobs in…
Running large language model inference servers in production exposes gaps that neither stock Prometheus dashboards nor the official documentation of v…
Introduction In modern DevOps, simply knowing whether your application is "up" or "down" isn't enough. Users care about latency, reliability, and the …
Yandex's infrastructure runs thousands of microservices that generate millions of time series (metrics) every second. These can be counts of…
TL;DR: We bolted an LLM gateway in front of the AI features in our build pipeline tooling and ended up running Bifrost instead of LiteLLM or Kong. The…
Agentic AI in DevOps: Useful Only After You Add Guardrails Most DevOps teams do not need an AI agent with production access on day one. What they actu…
Most developers spend years learning how to build software, but far fewer spend time studying how software breaks. Yet some of the most valuable engin…
It was a regular working day when the first alert landed. Kubernetes health check showing the control plane was degraded. I'd seen these before. Usual…
Summary In this iteration, I improved the observability of GBIM on both the backend and frontend, based on the latest origin/staging branch. The focus…
Key Takeaways Most teams do not yet auto-remediate inside CI/CD. Per JetBrains' AI Pulse coverage (April 2026), 78.2% of respondents don't use AI in …
A rocket is not sent into space just because its engine and pump each passed bench tests separately. Before launch, engineers…
A Pipelock instance running in a Kubernetes cluster watched its config file for hours while four edits to the underlying ConfigMap landed in etcd. The…
The failover was declared at 02:14. The runbook was followed. DNS records updated. Health checks passing on secondary. The on-call engineer closed the…
IRAS: Building a Production-Grade Autonomous Incident Response Agent Incident response at 3 AM is brutal. Your on-call engineer is woken up, scrambles…