An ITIL Incident Manager works a lot like a fireman. An SRE rebuilds the department between the fires. Same uniform. Different shift.
An incident, in ITIL's vocabulary, is an unplanned interruption to a service or a reduction in its quality. A fireman doesn't decide what's burning or why. They just know something is on fire and people need help, fast. An Incident Manager gets paged at 2am because production is down, payments are failing, or a service is throwing errors at a million users a minute. The cause is unknown, the stakes are real, and the clock is already running.
A fireman arriving at a burning building doesn't run in alone with a bucket. They size up the scene, figure out where the fire is spreading, decide whether it's a rescue or a containment situation, and direct their crew — one team on the hose, one doing search and rescue, one ventilating the roof.
The Incident Manager does the same thing with engineers: who's looking at the database, who's checking the load balancers, who's talking to customer support, who's drafting the status page update. They're not usually the person typing the fix — they're the one keeping the response coherent so five smart people don't all debug the same thing while the actual problem festers somewhere else.
The job isn't to put out the fire yourself. It's to make sure the people who can put it out aren't all standing in the same room.
Both have to manage information flow outward. The fire chief talks to the homeowner, the police, the press. The Incident Manager talks to the execs asking "is it fixed yet," to support teams getting buried in tickets, to legal if it's a data issue.
They translate technical chaos into something the rest of the world can act on, while shielding the responders from the noise so they can actually work. A good Incident Manager is, in part, a human firewall.
This is where most people get the role wrong. ITIL Incident Management is measured on one thing: restoration of normal service operation. The metric is MTTR, not MTTU: mean time to restore, not mean time to understand. A rollback that gets payments flowing again at 04:12 is a win, even if nobody yet knows what the bad deploy actually did.
The fireman doesn't stop to inspect the wiring before knocking down the flames. They knock down the flames. A workaround beats a root cause every time, because the customer is on fire now. Cause-and-origin can wait until everyone's safe.
This is where the analogy gets sharp, and where a lot of orgs blur a line ITIL drew on purpose. When the fire is out, the fireman packs up and goes home. They don't sift through the ash with a clipboard. That's a different person — a fire marshal, an arson investigator — with a different mandate, a different uniform, and a different boss.
ITIL draws the same line. Once normal service is restored, the incident is closed. What remains — the why, the underlying defect, the how do we keep this from happening again — becomes a problem. And Problem Management owns it. Not Incident Management.
The Incident Manager's notes from the bridge become the Problem Manager's opening file. The handoff matters. Conflate the two roles and you get one of two failure modes: the incident drags on for hours while everyone debates root cause and customers stay broken, or the post-mortem never gets written because the responders are already onto the next page.
An Incident Manager who refuses to close until they understand is not doing the job better. They're doing a different job — badly, and at the customer's expense.
A fireman is a single response unit. A fire department is the whole institution: training, dispatch, prevention codes, hydrant maintenance, drills, mutual-aid agreements, the budget meeting at city hall. The town that funds only the trucks gets one set of outcomes. The town that funds the department gets another.
ITIL Incident Management gives you the trucks. Site Reliability Engineering — when it's done as Google originally wrote it down — gives you the department. Same fire-service shape, larger surface area: the work between the fires that decides how bad the next one gets.
Four load-bearing SRE practices, in fire-service vocabulary:
The 50% engineering rule. A department where every shift is calls all day burns out. The crew needs hours that aren't running into buildings — for training, prevention, gear maintenance. Google said the same thing about SREs: cap toil at half the role. The other half is engineering away the next page.
Error budgets. Zero fires is not the goal. Zero fires is the goal of a town with no buildings. A service that promises four-nines uptime is allowed about thirteen minutes of downtime a quarter, fifty-odd a year, and inside that budget you ship boldly. Outside it you freeze and harden. The budget is the negotiation between speed and safety, made explicit and shared. The arithmetic is sketched after this list.
The four golden signals. Latency, traffic, errors, saturation. The four gauges every dispatcher watches. The four readings the chief asks for first. Everything else is decoration.
Blameless post-mortems. Same shape as the marshal's report: what happened, what we'd do differently, what the system should make impossible next time. Names appear; blame doesn't. Because the next person to make the same mistake is the system, not the person.
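That error-budget arithmetic, as a minimal Python sketch. The 99.99% target and the 90-day window are illustrative stand-ins for whatever your service actually promises, not numbers from any particular SLO.

```python
# Error-budget arithmetic: how much downtime a given availability SLO allows.
# The SLO targets and windows below are illustrative, not prescriptive.

def error_budget_minutes(slo: float, window_days: float) -> float:
    """Minutes of allowed downtime for an availability SLO over a window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - slo)

if __name__ == "__main__":
    for slo in (0.999, 0.9995, 0.9999):
        per_quarter = error_budget_minutes(slo, 90)
        per_year = error_budget_minutes(slo, 365)
        print(f"{slo:.4%} -> {per_quarter:6.1f} min/quarter, {per_year:6.1f} min/year")
    # Four nines works out to roughly 13 minutes a quarter, 53 a year.
```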
An Incident Manager fights this fire. An SRE makes sure the city wasn't built to burn.
Look at the line items of any American city. State and local governments spend roughly fifty billion dollars a year on fire protection — about a hundred and fifty dollars per American, on average — and somewhere in the neighbourhood of five to ten percent of the typical municipal general fund. More for full-service big cities. Less for towns that lean on volunteers. Twenty-four-hour staffing. Dedicated stations. Dedicated equipment. Mutual-aid agreements with neighbours. It is mandatory. It has been mandatory since the great urban fires of the eighteen-hundreds taught everyone what the alternative looks like — and it has been defended, fiercely, every budget cycle since.
Fire is contagious. One house lights the next. The cost lands on the public regardless of whose stove it was. Insurance frameworks have spent a hundred and fifty years forcing the issue. The political cost of a city block burning down because the council cut the fire budget is career-ending. So the budget gets defended.
Now look at the eng org. Google's Site Reliability Workbook says, on the record, that SRE has historically run five to ten percent of their engineering staff. The other mature shops — AWS, Netflix, the few who learned the lesson the hard way — cluster in the same neighbourhood. Same shape as a municipal fire budget. That is the punchline.
The median company is nowhere near it. Many have zero dedicated reliability function. On-call is whatever side duty fell to whoever wrote the code. The tooling line items — Datadog, PagerDuty — are real but rounding errors next to dev-velocity spend. Reliability gets cut first when the quarter looks tight.
The asymmetry is not an accident. Outage cost is diffuse. Customer churn from a four-hour incident does not show up on the same line of the P&L as the SRE headcount that would have prevented it. There is no city council watching. The political cost of not funding reliability lands on whoever inherits the pager next quarter — not on the exec who blocked the hire. So the budget gets cut.
A city that spent on fire what most companies spend on reliability would be ash by Tuesday.
A fireman's fires are mostly independent events. A house burns; the house next door usually doesn't. Incidents in a complex distributed system aren't like that. One small flame in a dependency graph can light up half the org in seconds — a slow database makes the API slow, which makes the frontend slow, which makes the retry storm, which takes down a service that didn't even know the database existed.
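A toy model of that amplification, in Python. The three-layer call chain and the retry count of three are made-up numbers for illustration, not a measurement of any real system.

```python
# Toy model of a retry storm: when the layer below slows to the point of
# timing out, every caller retries, multiplying the load it offers downstream.
# Layer names and retry counts are hypothetical.

def offered_load(user_requests: int, layers: list[str],
                 retries_on_failure: int, dependency_healthy: bool) -> None:
    load = user_requests
    for layer in layers:
        print(f"{layer:10s} sees {load:>8,} requests")
        if not dependency_healthy:
            # Calls to the next layer down time out, so each one is retried.
            load *= (1 + retries_on_failure)

if __name__ == "__main__":
    chain = ["frontend", "api", "database"]  # hypothetical call chain
    print("-- healthy dependency --")
    offered_load(10_000, chain, retries_on_failure=3, dependency_healthy=True)
    print("-- slow database, every caller retries --")
    offered_load(10_000, chain, retries_on_failure=3, dependency_healthy=False)
    # 10,000 user requests become 160,000 hits on the database: the retry
    # storm, not the original slowness, is what finishes the job.
```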
So the line between one incident and many incidents tracing to one problem gets harder to draw. The Incident Manager has to think more like a fire chief commanding a wildfire than a single-engine response. Containment lines. Spot fires. Wind direction. The blaze you're not looking at yet. And sometimes the handoff to Problem Management has to happen while parts of the fire are still burning — because the underlying defect is the only thing that explains why three apparently unrelated services all caught at once.
That's the posture, across both uniforms. The radios are different. The hoses are different. The job has two halves now — the one that runs toward the smoke, and the one that goes back the next morning to draw the new building code.
The fire is out. The bridge is closed. The runbook is open. Your task now is to brief. The Marshal needs the timeline. The COO needs the language. The next on-call needs the lesson.
Forward this dispatch to the people who set the budget for the next one.