Team Management — Tech News

EN

When Your Homelab Grows Up: How SQLite Took Down My k3s Control Plane

Originally published at wostal.eu. TL;DR: My Hetzner k3s lab quietly became a platform. Dozens of operators with leader-election leases hammered the…

kubernetes sqlite sre homelab

EN

How to Write an Incident Postmortem in 2026 (With Template)

A postmortem turns an outage into something your team learns from instead of repeats. Here is the structure, why blameless matters, and a template you…

devops monitoring sre

EN

Building your own AI SRE moves the toil; it does not remove it

Now that Chronosphere leaders are telling engineering teams to build their own AI SRE, the honest question is who ends up carrying its pager. Their pi…

sre aisre incidentresponse observability

EN

The Reliability Roadmap: A 90-Day Plan for New SRE Teams

New SRE team at your company? Here's a 90-day plan I've used twice. It works because it balances 'show immediate value' with 'build for the long term.…

sre devops strategy onboarding

EN

Incident Retrospectives Without Blame

I've run over 100 post-mortems. The worst ones end with 'Alice will be more careful.' The best ones end with 'we fixed the system.' Here's how you get…

sre devops postmortem culture

EN

What if the safest Kubernetes fix is no fix at all?

AI-assisted DevOps and SRE tools are becoming more common. Tools like K8sGPT can scan Kubernetes clusters, detect issues, and explain what might be go…

ai k8sgpt sre devops

EN

Building a Culture of Reliability: Beyond the SRE Handbook

You Can't Hire Your Way to Reliability. I've seen companies hire 5 SREs and expect reliability to magically improve. It doesn't. Reliability is a cultu…

sre culture reliability engineering

EN

SRE AI Agent Safe Failure Implementation

Building Trustworthy AI Agents in Site Reliability Engineering. Site Reliability Engineering is entering a new phase where agentic AI can assist with a…

ai sre

RU

A System Grows Where Mistakes Are Allowed: A Story from the Minsk IT Hub

Hello, Habr! I'm Artem, a group lead at T-Bank. I joined T almost five years ago, and since then I have in one way or another always worked in the Cashback domain, one of the…

sre career in it office

EN

Something I wish someone had told me five years earlier:

LinkedIn Draft — Insight (2026-07-03) Something I wish someone had told me five years earlier: Distributed tracing: the gap between having it and usin…

observability sre devops platformengineering

EN

Google SRE Review - Cheat Sheet

If you're a software engineer, architect, engineering manager, or platform engineer, I consider the Google SRE Book to be one of the handful of books …

google sre devops

EN

Planning network checks before running them: a local-first workflow pattern

Many operations tasks do not begin as tickets, dashboards, or scripts. They begin as intent. Someone says: Check whether this subnet looks normal. Or:…

sre devops automation aiops

EN

The Ultimate Guide to Production-Grade AI Agents

Production-grade AI agents are systems that execute multi-step workflows autonomously while maintaining reliability, security, and observability guara…

agents ai production sre

EN

Blameless Postmortems in Practice

Most teams claim they do blameless postmortems. Then the incident happens. "Jane didn't validate the input." "The on-call missed the alert." "We shoul…

devops management sre

EN

Kubernetes 1.36: 8 Features Worth Your Attention

Kubernetes 1.36 (Haru) brings around 70 enhancements, ranging from security improvements to new scheduling capabilities. While most release summaries …

kubernetes aws devops sre

EN

On-Call Wellness: Protecting Your Engineers from Burnout

The On-Call Burnout Epidemic. I watched three senior SREs leave our team in six months. Exit interviews all said the same thing: on-call was unsustaina…

sre oncall burnout culture

EN

Post-Mortem Best Practices That Actually Drive Change

The Post-Mortem Nobody Learns From. I've sat through hundreds of post-mortems. Most follow the same pattern: something breaks, someone writes a Google …

sre postmortem incidents devops

EN

Humanizing Artificial Intelligence for SRE Teams: Reducing Alert Fatigue With Smarter AI Guidance

The pager goes off at 3:11 a.m. It's the fifth time tonight, and it's the same alert: HighMemoryUsage on a node that's running a memory-mapped cache d…

sre devops ai observability

RU

SLO as Code: You Can't Trust People

Hi everyone, my name is Vyacheslav, and I'm the SRE Team Lead at Kuper. This article covers how we rolled out SLOs, what we achieved, and which life hacks…

slo sre oncall grafana reliability reliability engineering

EN

Exponential backoff with jitter stopped our CI retry storms

TL;DR: Exponential backoff with jitter spreads client retries over time so a recovering service doesn't get flattened by a synchronised wave. We added…

sre devops infrastructure reliability
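
To make the teaser's idea concrete, here is a minimal sketch of exponential backoff with "full jitter", assuming a retry ceiling that doubles per attempt up to a cap. The function name `backoff_delays` and the parameters `base`, `cap`, and `rng` are hypothetical and chosen for illustration; this is not the article's code.

```python
import random


def backoff_delays(attempts, base=1.0, cap=60.0, rng=None):
    """Return one randomized delay (in seconds) per retry attempt.

    Hypothetical helper: each delay is drawn uniformly from
    [0, min(cap, base * 2**attempt)], so clients that failed at the
    same moment are unlikely to retry at the same moment.
    """
    if rng is None:
        rng = random.Random()
    delays = []
    for attempt in range(attempts):
        # The ceiling doubles every attempt but never exceeds the cap.
        ceiling = min(cap, base * (2 ** attempt))
        # Jitter: pick a random point below the ceiling instead of the ceiling itself.
        delays.append(rng.uniform(0, ceiling))
    return delays


if __name__ == "__main__":
    for n, delay in enumerate(backoff_delays(5, base=0.5, cap=8.0)):
        print(f"retry {n}: sleep up to {delay:.2f}s")
```

Without the random draw, every client would sleep for exactly the same ceiling and hit the recovering service in one synchronised wave, which is the failure mode the teaser describes.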

EN

How an AI Terminal Assistant Became My Team's Most Productive Engineer - Opencode + Claude + MCP

Table of Contents: The Moment That Changed Everything; What It Actually Is; The Setup Nobody Believes Is This Simple; Focused Sessions - One Agent, One Mi…

ai sre productivity aws

EN

Capacity Planning Without ML: The 80/20 Approach

There's a small industry of vendors that want to sell you machine learning capacity planning. For 95% of teams, you don't need it. You need a spreadsh…

sre devops capacity scaling

EN

What 60+ Claude Code memory entries taught me about solo ops

I run a paid infrastructure service. Alone. No co-founder, no on-call rotation, no senior engineer to escalate to. My only collaborator is Claude Code…

claude sre devops ai

EN

Chaos Engineering Is Theater Without These Three Things

Chaos engineering has a credibility problem. Half the teams that adopt it are doing it because it's fashionable, not because it makes their systems mo…

sre devops chaos resilience

EN

Humanizing Artificial Intelligence in DevOps Documentation: Making Runbooks Easier to Create and Use

The Runbook That Lied to Me at 3am. The pager went off at 3:14am for a wedged OpenStack Neutron agent. I did what any tired engineer does: I opened the…

devops ai documentation sre

EN

What is SRE? A Beginner's Guide to Site Reliability Engineering

Why This Matters: The 2 AM Problem. It's 2 AM. Your phone rings. Your production database is down. Customers can't log in. Revenue is dropping by the s…

sre devops infrastructure

EN

DevOps Salaries & Hiring in India 2026: What 800+ Live Job Listings Reveal

If you're a DevOps, SRE, or Cloud engineer in India — or hiring one — the market in 2026 looks very different from a few years ago. Instead of guessin…

career devops news sre

EN

The Engineer Who Owns Nothing: A Cautionary Tale

I'm going to tell you about an engineer I worked with. Call him Mark. Mark was talented, well-liked, and utterly ineffective. Here's what I learned fr…

sre devops culture ownership

EN

Error Budget Policies That Hold Leadership Accountable

Error budgets are useless without a policy. 'We're out of error budget' should trigger consequences. If it doesn't, you don't have an error budget — y…

sre devops slo leadership

EN

Engineering Design Document: Reusable Observability Platform V2

A production-focused redesign of a Stage 6 LGTM observability platform, moving from a single-service Anvila monitoring setup to a reusable, secure, hi…

devops observability architecture sre