Tech News — Latest News

Fix Docker Exit Code 137 (OOMKilled): Why It Happens and How to Stop It

Your container died and docker ps -a shows something like Exited (137) 4 minutes ago. Nine times out of ten that's the kernel's OOM killer, not your …

docker devops sre tutorial

Your uptime SLA means nothing when the physical process can't wait for your rollback

There’s a conversation that happens when IT developers first encounter operational technology. It usually goes something like this: “What’s your uptim…

devops sre programming iot

AWS puts gray zone failures into the EKS control loop

What AWS is calling out: The New Stack, on July 10, published a walk-through of what running Kubernetes across a very large EKS fleet has taught AWS ab…

aws eks kubernetes sre

🚀 Calling all DevOps, SRE, and Platform Engineers! Let’s build the future of AI for DevOps together.

Over the last few years, I've been exploring AI agents, and one thing became obvious. There are hundreds of AI agents available today, but almost all …

devops ai programming sre

Service Level Objectives for Complex Microservices

Why SLOs Break in Microservices: An SLO that works for a monolith often collapses when you distribute the same logic across 30 services. The math of ava…

sre slo microservices reliability

Building a Culture of Reliability: Beyond the SRE Handbook

You Can't Hire Your Way to Reliability: I've seen companies hire 5 SREs and expect reliability to magically improve. It doesn't. Reliability is a cultu…

sre culture reliability engineering

The expensive half of your incident bot is the half you didn't build

An incident bot caught the CrashLoopBackOff at 3:12 a.m., proposed delete_pod, and the on-call approved it half asleep at 3:14. The new pod went Runni…

devops sre observability kubernetes

SRE AI Agent Safe Failure Implementation

Building Trustworthy AI Agents in Site Reliability Engineering: Site Reliability Engineering is entering a new phase where agentic AI can assist with a…

ai sre

End of week. Here's the thing I kept coming back to:

LinkedIn Draft, Insight (2026-07-10). SLOs work when they create conversations, not when they cre…

observability sre devops platformengineering

Incident Communication: The Status Page That Builds Trust

Silence Destroys Trust: During our worst outage, we went 35 minutes without updating the status page. Twitter filled the void. Theories ranged from dat…

incidents communication sre devops

How to Configure Apache as a Reverse Proxy with mod_proxy

The shape of a typical modern enterprise deployment involves Apache serving as a TLS-terminating reverse proxy sitting in front of upstream applicatio…

apache devops sre sysadmin

Docker Containerization Habits That Keep Production Calm

Most of the container incidents I've helped clean up didn't come from anything exotic. They came from small shortcuts that felt reasonable on a Tuesda…

docker devops containers sre

Debugging Containers From the Terminal: A Practical Docker CLI Workflow

A container that's misbehaving is one of those problems where your instinct works against you. The pressure pushes you toward the dramatic move — rest…

docker devops cli sre

Why Your Microservices Need Circuit Breakers (And How to Add Them)

The Cascading Failure That Took Down Everything: Our payment service went down for 3 minutes. No big deal, right? Except every service that called paym…

microservices reliability sre devops

How We Built an AI That Never Forgets Production Incidents

Can AI become your smartest Site Reliability Engineer? We decided to find out. Every softwa…

ai automation showdev sre

I let an AI handle an outage. It invented a hack that never happened, then spiraled

One evening, a monitoring alert went off: a server behind a web service was down. I handed the incident to an AI coding agent. Half experiment, half l…

ai llm sre incident

SLOs That Product Managers Actually Understand

The SLO Translation Problem: You define an SLO of 99.95% availability with p99 latency under 200ms. Engineering loves it. Product managers glaze over. Th…

sre slo product reliability

Something I wish someone had told me five years earlier:

LinkedIn Draft, Insight (2026-07-03). Distributed tracing: the gap between having it and usin…

observability sre devops platformengineering

I built a production risk scanner in one day, here's what it caught

If you're an SRE or DevOps engineer, try blastradar.vercel.app and tell me what you actually think. The tool: BlastRadar scores any code diff for prod…

devops sre programming ai

Self-healing and secure. Good combo.

Build software that heals itself in the agentic era, by Gabe LG (Jul 1) #ai #ag…

agents ai security sre

Google SRE Review - Cheat Sheet

If you're a software engineer, architect, engineering manager, or platform engineer, I consider the Google SRE Book to be one of the handful of books …

google sre devops

Planning network checks before running them: a local-first workflow pattern

Many operations tasks do not begin as tickets, dashboards, or scripts. They begin as intent. Someone says: Check whether this subnet looks normal. Or:…

sre devops automation aiops

Kubernetes resource requests and limits explained: scheduling, throttling, and OOMKill

This is part of the Platform engineering with Go series: a growing collection of posts on Kubernetes, Go tooling, and infrastructure automation. View …

devops k8s kubernetes sre

Log Management at Scale: How We Cut Costs 70% Without Losing Signal

$12,000/Month for Logs Nobody Reads: Our logging bill was $12,000/month. We were ingesting 2TB/day. When I asked the team what percentage of logs they …

logging observability devops sre

The Ultimate Guide to Production-Grade AI Agents

Production-grade AI agents are systems that execute multi-step workflows autonomously while maintaining reliability, security, and observability guara…

agents ai production sre

Blameless Postmortems in Practice

Most teams claim they do blameless postmortems. Then the incident happens. "Jane didn't validate the input." "The on-call missed the alert." "We shoul…

devops management sre

Circuit Breaker and Bulkhead Thresholds You Can Tune Live (Kiponos Java SDK)

Circuit breakers and bulkheads are design patterns — their numbers are operational weapons. Failure ratio 50% or 30%? Max concurrent calls 25 or 100? …

architecture java showdev sre

AI Production Readiness Checklist After the POC: From Sandbox to Real System

The POC is done, the demo ran smoothly, and the stakeholders nodded in agreement. The next step is not simply "deploy to production" but rather making sure …

ai devops machinelearning sre

Kubernetes 1.36: 8 Features Worth Your Attention

Kubernetes 1.36 (Haru) brings around 70 enhancements, ranging from security improvements to new scheduling capabilities. While most release summaries …

kubernetes aws devops sre

The Golden Signals: A Practical Implementation Guide

Four Metrics to Rule Them All: Google's SRE book introduced the four golden signals: Latency, Traffic, Errors, and Saturation. Simple concept, but I've…

sre monitoring observability devops