Testing & QA — Tech News

EN

Reviving K6's StatsD Extension for Datadog Observability

A lot gets said about how data can help us with observability, but much less gets said about how much we can shape the collection itself to actually h…

k6 datadog devops sre

EN

Building a Production-Safe AI Remediation Firewall for Amazon EKS

A reproducible multi-AZ resilience walkthrough: spread a service across simulated zones, kill one under load, and measure the dropped requests — plus …

ai aiops devops sre

EN

SRE Playbook: A Guide to Discover and Catalog Non-Human Identities (NHI)

As a site reliability engineer in a global company, I'm running a modern (well, relatively modern, to be honest and modest) cloud-native stack: HashiC…

security sre devops nhi

EN

Knowing What’s Under the Hood Helps

I’m not a car guy. I can drive them, but I don’t know the first thing about fixing them, and not much about maintaining them. My son, on the other han…

learning monitoring sre writing

EN

Your gRPC stream is "healthy" and serving nothing: synthetic monitoring for server-side streams

By Daniil Romashov — SRE/DevOps engineer. The tool described here is open source: github.com/youngpabl0/grpc-streams-checker (Apache-2.0). Uptime chec…

opensource sre devops grpc

EN

Getting Real Numbers Into the VPA Model: The Commands and Tools

The companion post to this one, "What the VPA Recommender Is Actually Computing," walks through the decay-weighted percentile math, the OOM bump, and …

devops sre kubernetes containers

EN

Incident Postmortem Template & Guide for Engineering Teams

Every engineering team has outages. The teams that improve fastest are not the ones that have the fewest incidents — they are the ones that extract th…

software devops productivity sre

EN

Building a Culture of Reliability: Beyond the SRE Handbook

You Can't Hire Your Way to Reliability. I've seen companies hire 5 SREs and expect reliability to magically improve. It doesn't. Reliability is a cultu…

sre culture reliability engineering

EN

SRE AI Agent Safe Failure Implementation

Building Trustworthy AI Agents in Site Reliability Engineering. Site Reliability Engineering is entering a new phase where agentic AI can assist with a…

ai sre

EN

Incident Communication: The Status Page That Builds Trust

Silence Destroys Trust. During our worst outage, we went 35 minutes without updating the status page. Twitter filled the void. Theories ranged from dat…

incidents communication sre devops

EN

How We Built an AI That Never Forgets Production Incidents

Can AI become your smartest Site Reliability Engineer? We decided to find out. Every softwa…

ai automation showdev sre

EN

SLOs That Product Managers Actually Understand

The SLO Translation Problem. You define an SLO: 99.95% availability with p99 latency under 200ms. Engineering loves it. Product managers glaze over. Th…

sre slo product reliability

EN

I built a production risk scanner in one day, here's what it caught

If you're an SRE or DevOps engineer — try blastradar.vercel.app and tell me what you actually think. The tool: BlastRadar scores any code diff for prod…

devops sre programming ai

EN

Google SRE Review - Cheat Sheet

If you're a software engineer, architect, engineering manager, or platform engineer, I consider the Google SRE Book to be one of the handful of books …

google sre devops

EN

The Ultimate Guide to Production-Grade AI Agents

Production-grade AI agents are systems that execute multi-step workflows autonomously while maintaining reliability, security, and observability guara…

agents ai production sre

EN

AI Production Readiness Checklist After the POC: From Sandbox to Real System

The POC is done, the demo ran smoothly, and the stakeholders nodded in agreement. The next step is not simply "deploy to production" but making sure…

ai devops machinelearning sre

EN

Post-Mortem Best Practices That Actually Drive Change

The Post-Mortem Nobody Learns From. I've sat through hundreds of post-mortems. Most follow the same pattern: something breaks, someone writes a Google …

sre postmortem incidents devops

EN

Humanizing Artificial Intelligence for SRE Teams: Reducing Alert Fatigue With Smarter AI Guidance

The pager goes off at 3:11 a.m. It's the fifth time tonight, and it's the same alert: HighMemoryUsage on a node that's running a memory-mapped cache d…

sre devops ai observability

EN

How an AI Terminal Assistant Became My Team's Most Productive Engineer - Opencode + Claude + MCP

Table of Contents: The Moment That Changed Everything; What It Actually Is; The Setup Nobody Believes Is This Simple; Focused Sessions — One Agent, One Mi…

ai sre productivity aws

EN

Auto-verifying your AI-SRE's fixes (Part II): HolmesGPT end-to-end on a real cluster

Part II of two. See Part I for the recipe. In Part I we discussed how you can plug mirrord into your AI-SRE so it can autonomously test its fix in …

ai sre kubernetes devops

EN

Capacity Planning Without ML: The 80/20 Approach

There's a small industry of vendors that want to sell you machine learning capacity planning. For 95% of teams, you don't need it. You need a spreadsh…

sre devops capacity scaling

EN

EBS gp2 burst credits ran dry and our builds slowed to a crawl

TL;DR: A chunk of our EC2 build agents got slow at the same time every afternoon. No CPU pressure, no memory pressure, no network weirdness. It was EB…

infrastructure devops sre aws

EN

Semantic caching our flaky-test summariser: 58% fewer LLM calls

TL;DR: Our internal flaky-test summariser at Buildkite was firing ~40k LLM calls a day, and most were near-duplicates of failures we'd already explain…

sre devops llm mlops

EN

Chaos Engineering Is Theater Without These Three Things

Chaos engineering has a credibility problem. Half the teams that adopt it are doing it because it's fashionable, not because it makes their systems mo…

sre devops chaos resilience

EN

Humanizing Artificial Intelligence in DevOps Documentation: Making Runbooks Easier to Create and Use

The Runbook That Lied to Me at 3am. The pager went off at 3:14am for a wedged OpenStack Neutron agent. I did what any tired engineer does: I opened the…

devops ai documentation sre

EN

How I Built an Autonomous Incident Investigation Agent That Reduced MTTR by 65%

Series: AI-Native SRE. Table of Contents: The Problem Every On-Call Engineer Knows; What FRIDAY Does; Architecture Overview; Key Design Decisions; The Tool-…

aiops sre cloudnative aws

EN

What is SRE? A Beginner's Guide to Site Reliability Engineering

Why This Matters: The 2 AM Problem. It's 2 AM. Your phone rings. Your production database is down. Customers can't log in. Revenue is dropping by the s…

sre devops infrastructure

EN

Incident Automation: What to Automate, What to Leave to Humans

Incident response automation is a trap. Some things should be automated. Some things absolutely should not be. Getting the line wrong is worse than au…

sre devops automation incident

EN

Supercharging Kubernetes: How eBPF is Revolutionizing SRE and Platform Engineering

The ecosystem surrounding Kubernetes has always been a rapidly moving target. Just when Site Reliability Engineers and Platform Engineers feel they ha…

kubernetes devops sre platformengineering

EN

Engineering Design Document: Reusable Observability Platform V2

A production-focused redesign of a Stage 6 LGTM observability platform, moving from a single-service Anvila monitoring setup to a reusable, secure, hi…

devops observability architecture sre