An ITIL Incident Manager works a lot like a fireman. An SRE rebuilds the department between the fires. Same uniform. Different shift.
An incident, in ITIL's vocabulary, is an unplanned interruption to a service or a reduction in its quality. A fireman doesn't decide what's burning or why. They just know something is on fire and people need help, fast. An Incident Manager gets paged at 2am because production is down, payments are failing, or a service is throwing errors at a million users a minute. The cause is unknown, the stakes are real, and the clock is already running.
A fireman arriving at a burning building doesn't run in alone with a bucket. They size up the scene, figure out where the fire is spreading, decide whether it's a rescue or a containment situation, and direct their crew — one team on the hose, one doing search and rescue, one ventilating the roof.
The Incident Manager does the same thing with engineers: who's looking at the database, who's checking the load balancers, who's talking to customer support, who's drafting the status page update. They're not usually the person typing the fix — they're the one keeping the response coherent so five smart people don't all debug the same thing while the actual problem festers somewhere else.
The job isn't to put out the fire yourself. It's to make sure the people who can put it out aren't all standing in the same room.
Both have to manage information flow outward. The fire chief talks to the homeowner, the police, the press. The Incident Manager talks to the execs asking "is it fixed yet," to support teams getting buried in tickets, to legal if it's a data issue.
They translate technical chaos into something the rest of the world can act on, while shielding the responders from the noise so they can actually work. A good Incident Manager is, in part, a human firewall.
This is where most people get the role wrong. ITIL Incident Management is measured on one thing: restoration of normal service operation. The metric is MTTR, not MTTU: mean time to restore, not mean time to understand. A rollback that gets payments flowing again at 04:12 is a win, even if nobody yet knows what the bad deploy actually did.
The fireman doesn't stop to inspect the wiring before knocking down the flames. They knock down the flames. A workaround beats a root cause every time, because the customer is on fire now. Cause-and-origin can wait until everyone's safe.
This is where the analogy gets sharp, and where a lot of orgs blur a line ITIL drew on purpose. When the fire is out, the fireman packs up and goes home. They don't sift through the ash with a clipboard. That's a different person — a fire marshal, an arson investigator — with a different mandate, a different uniform, and a different boss.
ITIL draws the same line. Once normal service is restored, the incident is closed. What remains — the why, the underlying defect, the how do we keep this from happening again — becomes a problem. And Problem Management owns it. Not Incident Management.
The Incident Manager's notes from the bridge become the Problem Manager's opening file. The handoff matters. Conflate the two roles and you get one of two failure modes: the incident drags on for hours while everyone debates root cause and customers stay broken, or the post-mortem never gets written because the responders are already onto the next page.
An Incident Manager who refuses to close until they understand is not doing the job better. They're doing a different job — badly, and at the customer's expense.
A fireman is a single response unit. A fire department is the whole institution: training, dispatch, prevention codes, hydrant maintenance, drills, mutual-aid agreements, the budget meeting at city hall. The town that funds only the trucks gets one set of outcomes. The town that funds the department gets another.
ITIL Incident Management gives you the trucks. Site Reliability Engineering — when it's done as Google originally wrote it down — gives you the department. Same fire-service shape, larger surface area: the work between the fires that decides how bad the next one gets.
Four load-bearing SRE practices, in fire-service vocabulary:
The 50% engineering rule. A department where every shift is calls all day burns out. The crew needs hours that aren't running into buildings — for training, prevention, gear maintenance. Google said the same thing about SREs: cap toil at half the role. The other half is engineering away the next page.
Error budgets. Zero fires is not the goal. Zero fires is the goal of a town with no buildings. A service that promises four-nines uptime is allowed about thirteen minutes of downtime a quarter, fifty-odd a year, and inside that budget you ship boldly. Outside it you freeze and harden. The budget is the negotiation between speed and safety, made explicit and shared. The arithmetic is sketched after this list.
The four golden signals. Latency, traffic, errors, saturation. The four gauges every dispatcher watches. The four readings the chief asks for first. Everything else is decoration.
Blameless post-mortems. Same shape as the marshal's report: what happened, what we'd do differently, what the system should make impossible next time. Names appear; blame doesn't. Because the next person to make the same mistake is the system, not the person.
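That error-budget arithmetic, as a minimal Python sketch. The 99.99% target and the 90-day window are illustrative stand-ins for whatever your service actually promises, not numbers from any particular SLO.

```python
# Error-budget arithmetic: how much downtime a given availability SLO allows.
# The SLO targets and windows below are illustrative, not prescriptive.

def error_budget_minutes(slo: float, window_days: float) -> float:
    """Minutes of allowed downtime for an availability SLO over a window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - slo)

if __name__ == "__main__":
    for slo in (0.999, 0.9995, 0.9999):
        per_quarter = error_budget_minutes(slo, 90)
        per_year = error_budget_minutes(slo, 365)
        print(f"{slo:.4%} -> {per_quarter:6.1f} min/quarter, {per_year:6.1f} min/year")
    # Four nines works out to roughly 13 minutes a quarter, 53 a year.
```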
An Incident Manager fights this fire. An SRE makes sure the city wasn't built to burn.
Look at the line items of any American city. State and local governments spend roughly fifty billion dollars a year on fire protection — about a hundred and fifty dollars per American, on average — and somewhere in the neighbourhood of five to ten percent of the typical municipal general fund. More for full-service big cities. Less for towns that lean on volunteers. Twenty-four-hour staffing. Dedicated stations. Dedicated equipment. Mutual-aid agreements with neighbours. It is mandatory. It has been mandatory since the great urban fires of the eighteen-hundreds taught everyone what the alternative looks like — and it has been defended, fiercely, every budget cycle since.
Fire is contagious. One house lights the next. The cost lands on the public regardless of whose stove it was. Insurance frameworks have spent a hundred and fifty years forcing the issue. The political cost of a city block burning down because the council cut the fire budget is career-ending. So the budget gets defended.
Now look at the eng org. Google's Site Reliability Workbook says, on the record, that SRE has historically run five to ten percent of their engineering staff. The other mature shops — AWS, Netflix, the few who learned the lesson the hard way — cluster in the same neighbourhood. Same shape as a municipal fire budget. That is the punchline.
The median company is nowhere near it. Many have zero dedicated reliability function. On-call is whatever side duty fell to whoever wrote the code. The tooling line items — Datadog, PagerDuty — are real but rounding errors next to dev-velocity spend. Reliability gets cut first when the quarter looks tight.
The asymmetry is not an accident. Outage cost is diffuse. Customer churn from a four-hour incident does not show up on the same line of the P&L as the SRE headcount that would have prevented it. There is no city council watching. The political cost of not funding reliability lands on whoever inherits the pager next quarter — not on the exec who blocked the hire. So the budget gets cut.
A city that spent on fire what most companies spend on reliability would be ash by Tuesday.
A fireman's fires are mostly independent events. A house burns; the house next door usually doesn't. Incidents in a complex distributed system aren't like that. One small flame in a dependency graph can light up half the org in seconds — a slow database makes the API slow, which makes the frontend slow, which makes the retry storm, which takes down a service that didn't even know the database existed.
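A toy model of that amplification, in Python. The three-layer call chain and the retry count of three are made-up numbers for illustration, not a measurement of any real system.

```python
# Toy model of a retry storm: when the layer below slows to the point of
# timing out, every caller retries, multiplying the load it offers downstream.
# Layer names and retry counts are hypothetical.

def offered_load(user_requests: int, layers: list[str],
                 retries_on_failure: int, dependency_healthy: bool) -> None:
    load = user_requests
    for layer in layers:
        print(f"{layer:10s} sees {load:>8,} requests")
        if not dependency_healthy:
            # Calls to the next layer down time out, so each one is retried.
            load *= (1 + retries_on_failure)

if __name__ == "__main__":
    chain = ["frontend", "api", "database"]  # hypothetical call chain
    print("-- healthy dependency --")
    offered_load(10_000, chain, retries_on_failure=3, dependency_healthy=True)
    print("-- slow database, every caller retries --")
    offered_load(10_000, chain, retries_on_failure=3, dependency_healthy=False)
    # 10,000 user requests become 160,000 hits on the database: the retry
    # storm, not the original slowness, is what finishes the job.
```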
So the line between one incident and many incidents tracing to one problem gets harder to draw. The Incident Manager has to think more like a fire chief commanding a wildfire than a single-engine response. Containment lines. Spot fires. Wind direction. The blaze you're not looking at yet. And sometimes the handoff to Problem Management has to happen while parts of the fire are still burning — because the underlying defect is the only thing that explains why three apparently unrelated services all caught at once.
That's the posture, across both uniforms. The radios are different. The hoses are different. The job has two halves now — the one that runs toward the smoke, and the one that goes back the next morning to draw the new building code.
The fire is out. The bridge is closed. The runbook is open. Your task now is to brief. The Marshal needs the timeline. The COO needs the language. The next on-call needs the lesson.
Forward this dispatch to the people who set the budget for the next one.