TECH ENGLISH

Incident Post-Mortems

Learn how to discuss production incidents in English — describe root causes, propose action items, and run blameless post-mortems effectively.

Scenario Context

Your team's payment processing service went down for 47 minutes during peak hours, affecting approximately 12,000 transactions. You're leading the post-mortem meeting. You need to walk through the timeline, identify the root cause (database connection pool exhaustion), discuss contributing factors, and assign action items, all without pointing fingers.

Why This Matters for Engineers

When production breaks, emotions run high and communication becomes critical. Engineers who can calmly describe what happened, explain the root cause without assigning blame, and propose concrete action items earn enormous trust. For non-native speakers, the pressure of an incident makes it even harder to find the right words. Practicing post-mortem English beforehand ensures you can lead these critical conversations with clarity and professionalism.

Essential Phrases

Let's start with the timeline. The incident began at 14:23 UTC when our monitoring detected elevated error rates.
Opening a post-mortem (formal)

The root cause was connection pool exhaustion in the primary database.
Stating root cause (neutral)

This is a blameless post-mortem. We're here to fix the system, not to find fault.
Setting the tone (formal)

A contributing factor was the lack of connection pool monitoring.
Identifying contributing factors (formal)

What could we have done to detect this earlier?
Discussing detection gaps (neutral)

The time to detection was too long. We need better alerting.
Evaluating response (neutral)

I'll take the action item to add connection pool metrics to our dashboard.
Volunteering for action items (neutral)

Let's set a severity level for this. I'd call it a SEV-1.
Classifying severity (neutral)

Were any other services affected downstream?
Assessing blast radius (formal)

We need a runbook for this failure mode going forward.
Proposing documentation (neutral)

The mitigation was a rolling restart, which bought us time until the fix was deployed.
Describing mitigation (neutral)

Technical Pronunciation

Word | ❌ Common Error | ✅ Correct | Tip
outage | ow-TAHJ | OW-tij | Stress on the first syllable. The ending sounds like the '-age' in 'cottage'.
triage | TRY-age | tree-AHZH | French origin: stress on the second syllable, soft 'zh' at the end.
daemon | DAY-mon | DEE-muhn | Sounds like 'demon'. Refers to a background process.
nginx | N-G-I-N-X | engine-X | Say 'engine X'. Don't spell it out.
DevOps | dev-OPS | DEV-ops | Stress on DEV, not OPS.

Written vs. Spoken English

Engineers often write one way on Slack or GitHub, but speak differently in meetings. Here's how to translate.

Describing root cause

Written (Slack/PR)
Root Cause: Database connection pool exhaustion due to unreleased connections in the PaymentProcessor module.
Spoken (Meeting)
What happened is the database ran out of connections because a recent code change wasn't releasing them properly.
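
To make "wasn't releasing them properly" concrete, here is a minimal Python sketch of how a connection leak can drain a pool. `TinyPool`, `leaky_charge`, `safe_charge`, and `free_after_failures` are hypothetical names invented for illustration. The actual PaymentProcessor code is not shown in this lesson, so treat this as one plausible shape of such a bug, not the real one.

```python
import queue
from contextlib import contextmanager


class TinyPool:
    """Toy fixed-size connection pool (hypothetical, for illustration only)."""

    def __init__(self, size):
        self._free = queue.Queue()
        for i in range(size):
            self._free.put(f"conn-{i}")

    def acquire(self, timeout=0.1):
        # Raises queue.Empty when every connection is already checked out.
        return self._free.get(timeout=timeout)

    def release(self, conn):
        self._free.put(conn)

    def available(self):
        return self._free.qsize()

    @contextmanager
    def connection(self):
        conn = self.acquire()
        try:
            yield conn
        finally:
            # Runs on success and on error, so the connection always comes back.
            self.release(conn)


def leaky_charge(pool, fail):
    conn = pool.acquire()
    if fail:
        raise ValueError("card declined")  # early exit skips release()
    pool.release(conn)


def safe_charge(pool, fail):
    with pool.connection():
        if fail:
            raise ValueError("card declined")


def free_after_failures(charge, attempts=3, size=2):
    """Run failing charges, then report how many connections are still free."""
    pool = TinyPool(size)
    for _ in range(attempts):
        try:
            charge(pool, fail=True)
        except (ValueError, queue.Empty):
            pass
    return pool.available()
```

Under these assumptions, `free_after_failures(leaky_charge)` leaves no free connections, while `free_after_failures(safe_charge)` leaves both. In a meeting you might say: "The error path returned early and never gave the connection back."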

Proposing fix

Written (Slack/PR)
Action Item: Implement connection pool monitoring with alerting threshold at 80% utilization.
Spoken (Meeting)
We need to add monitoring for the connection pool and set an alert when it hits 80%, so we catch it before it's completely full.
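
The alert rule in this action item can be sketched in a few lines of Python. This is an illustrative check only: `PoolStats` and `should_alert` are hypothetical names, and a real setup would read these numbers from whatever the pool or metrics system actually exposes.

```python
from dataclasses import dataclass

ALERT_THRESHOLD_PERCENT = 80  # the 80% utilization figure from the action item


@dataclass
class PoolStats:
    """Snapshot of a connection pool (hypothetical shape)."""
    in_use: int
    max_size: int


def should_alert(stats: PoolStats, threshold_percent: int = ALERT_THRESHOLD_PERCENT) -> bool:
    """Return True once utilization reaches the threshold."""
    if stats.max_size <= 0:
        raise ValueError("max_size must be positive")
    # Integer comparison avoids floating-point rounding at the boundary.
    return stats.in_use * 100 >= threshold_percent * stats.max_size
```

For example, a pool with 16 of 20 connections in use is at exactly 80%, so `should_alert(PoolStats(in_use=16, max_size=20))` is true, while 15 of 20 is not.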

Quantifying impact

Written (Slack/PR)
Impact: ~12,000 failed transactions over a 47-minute window.
Spoken (Meeting)
About twelve thousand transactions failed during the forty-seven minutes we were down.

Example Dialogue

You: Thanks everyone for joining. Before we begin, I want to remind us that this is a blameless post-mortem. We're here to understand what happened and prevent it from happening again.
You: Let me walk through the timeline. At 14:23 UTC, our error rate spiked to 30%. By 14:28, we confirmed that the payment service was returning 500 errors.
On-Call Engineer: I was the one who got paged. I initially thought it was a downstream dependency, so I checked the payment provider first.
You: That's a reasonable first step. When did we identify the actual root cause?
On-Call Engineer: About fifteen minutes in. I checked the database metrics and saw that all connections in the pool were exhausted.
You: Right. The root cause was that a recent code change introduced a query that wasn't releasing connections properly. Combined with peak traffic, the pool ran dry in minutes.
Manager: How do we prevent this from happening again?
You: Three action items. First, add connection pool utilization to our monitoring dashboard; that should have caught this before it became an outage. Second, add a circuit breaker so the service degrades gracefully instead of crashing. Third, update our code review checklist to include connection management verification.
Manager: Good plan. Who's owning each item?
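
If someone in the meeting asks what "add a circuit breaker" means in practice, a rough Python sketch like the one below can help you explain it. It is a simplified illustration under stated assumptions: `CircuitBreaker`, `CircuitOpenError`, and the `max_failures` and `reset_after` parameters are hypothetical names, not part of any specific library or of the service described in the scenario.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and calls are rejected immediately."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after  # cool-down in seconds
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                # Fail fast instead of piling more work onto a struggling dependency.
                raise CircuitOpenError("circuit open, failing fast")
            # Cool-down elapsed: allow one trial call (half-open state).
            self._opened_at = None
            self._failures = self.max_failures - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self.max_failures:
                self._opened_at = self._clock()
            raise
        self._failures = 0  # a success closes the circuit fully
        return result
```

The idea to put into words: after a few failures in a row, the service stops calling the database for a short cool-down and returns a quick error, which is what "degrades gracefully" means in the third speaker turn above.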

Common Questions

What does 'blameless post-mortem' mean?
A blameless post-mortem focuses on system and process failures, not individual mistakes. The idea is that humans make errors — the system should prevent those errors from causing outages. In English, you signal this by saying 'This is about fixing the process, not pointing fingers.'
How do I describe an incident I caused without sounding defensive?
Be direct and factual. Say 'I deployed a change that introduced a connection leak — here's what happened and what I've learned.' Avoid phrases like 'but it wasn't my fault' or 'I didn't know.' Taking ownership shows maturity and builds trust.
What's the difference between mitigation and remediation?
Mitigation is the quick fix to stop the bleeding — like restarting a service. Remediation is the permanent fix — like fixing the underlying bug. In a post-mortem, describe both: 'We mitigated by restarting the service, and we'll remediate by fixing the connection leak and adding monitoring.'
