Why Firefighters Are Better at IT Outages
I am a Sysadmin at a Fintech company. I am also a volunteer firefighter. When a house is burning, we don't argue about who is in charge. We don't have 50 people shouting on the radio. We have a system. It's time IT adopted the Incident Command System (ICS).
🚒 The Fire Ground
- ✓ Clear Chain of Command
- ✓ Standardized Terminology
- ✓ Bias for Action, but Safety First
- ✓ Designated Staging Areas
💻 The IT War Room
- ✗ "Hero" Culture (SPOF)
- ✗ C-Levels interrupting Engineers
- ✗ Alert Fatigue & Noise
- ✗ Infinite "Helpers" slowing progress
1. Span of Control
In NIMS (National Incident Management System), one supervisor can effectively manage 3 to 7 people; the ideal is 5. In IT outages, we often have one "Incident Commander" trying to listen to 20 engineers, 3 VPs, and customer support simultaneously. That is roughly three times the upper limit, and it all but guarantees cognitive overload and failure.
Status: CRITICAL OVERLOAD
The IC cannot process information. Messages are being missed. Decisions are delayed.
[Chart: Cognitive Load vs. Team Size]
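To make the span-of-control rule concrete, here is a minimal Python sketch. The 3-to-7 range and the ideal of 5 come from the text above; the function names, the "STRETCHED" label, and the "first person in each group is the lead" convention are my own assumptions for illustration, not ICS terminology.

```python
# Span-of-control limits quoted above: 3 to 7 direct reports, ideally 5.
MAX_SPAN = 7
IDEAL_SPAN = 5


def span_status(direct_reports: int) -> str:
    """Classify the IC's load by how many people report to them directly."""
    if direct_reports > MAX_SPAN:
        return "CRITICAL OVERLOAD"
    if direct_reports > IDEAL_SPAN:
        return "STRETCHED"  # hypothetical label, not an ICS term
    return "OK"


def split_into_groups(people: list[str], size: int = IDEAL_SPAN) -> list[list[str]]:
    """Chunk responders into groups of at most `size` people."""
    return [people[i:i + size] for i in range(0, len(people), size)]


responders = [f"eng{n}" for n in range(1, 21)]  # the 20 engineers from the example
groups = split_into_groups(responders)
leads = [group[0] for group in groups]  # assumption: first member acts as group lead

print(span_status(len(responders)))  # CRITICAL OVERLOAD: all 20 talk to the IC
print(span_status(len(leads)))       # OK: only the 4 group leads talk to the IC
```

The point is not the code but the shape: the IC hears from four leads, and each lead handles a group small enough to keep track of.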
2. The 360 Size-Up
Firefighters walk *around* the burning building before entering. IT Admins often SSH into the first server they see. Stop. Look. Think. Act. The "Size-Up" determines strategy (Offensive vs. Defensive) and required resources.
[Interactive chart: Initial Assessment Radar, showing where the "fire" is spreading based on an initial incident report]
Current Mode: DEFENSIVE
Infrastructure load is high. Stabilize the perimeter before attempting to restore individual services.
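A size-up can be as simple as scoring a few dimensions before anyone touches a server. The sketch below is one way to do it; the dimension names, the 0 to 10 scale, and the cutoff of 7 are all assumptions I picked for illustration, so tune them to your own environment.

```python
# Hypothetical size-up: each dimension scored 0 (fine) to 10 (fully involved).
SizeUp = dict[str, int]

DEFENSIVE_THRESHOLD = 7  # assumed cutoff, not a standard value


def choose_mode(size_up: SizeUp) -> str:
    """Pick a strategy from the size-up before acting on any single host."""
    # High load on shared infrastructure means the problem can spread,
    # so contain it first instead of repairing individual services.
    if size_up["infrastructure_load"] >= DEFENSIVE_THRESHOLD:
        return "DEFENSIVE"
    return "OFFENSIVE"


def hottest_area(size_up: SizeUp) -> str:
    """Return the dimension with the highest score."""
    return max(size_up, key=size_up.get)


initial_report: SizeUp = {
    "infrastructure_load": 8,
    "error_rate": 6,
    "customer_impact": 5,
    "data_integrity_risk": 2,
}

print(choose_mode(initial_report))   # DEFENSIVE
print(hottest_area(initial_report))  # infrastructure_load
```

Walking all four sides of the building here means filling in every score, even roughly, before choosing a mode.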
3. Radio Discipline (CAN Reports)
On the fireground, radio airtime is scarce. We use CAN Reports: Conditions (What do I see?), Actions (What am I doing?), Needs (What resources do I need?). In IT, we clutter the bridge with theories and "umms".
🚫 Typical IT Bridge
Dev1: I'm checking the logs... wait, seeing 500s.
Manager: Is it fixed yet?
Dev2: I think it might be the redis cache, or maybe DNS?
CEO: Why is the site down?
Dev1: Still looking... umm...
✅ CAN Report Standard
Ops Lead: [Conditions] 500 error rate at 90% on API Gateway.
Ops Lead: [Actions] I am rolling back the last deployment to v4.2.
Ops Lead: [Needs] I need DB Admin to verify connection pool health.
-- Silence until complete --
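If your bridge lives in chat, you can enforce the CAN shape with a small template. This is a sketch, assuming a hypothetical `CanReport` class of my own; the three fields and the bracketed tags mirror the example above, and nothing here is a standard library or an ICS tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CanReport:
    sender: str
    conditions: str  # What do I see?
    actions: str     # What am I doing?
    needs: str       # What resources do I need? Say "None" explicitly if nothing.

    def __post_init__(self) -> None:
        # Refuse half-finished reports so every message carries all three parts.
        for name in ("conditions", "actions", "needs"):
            if not getattr(self, name).strip():
                raise ValueError(f"CAN report is missing '{name}'")

    def render(self) -> str:
        """Format the report as three tagged lines for the incident channel."""
        return (
            f"{self.sender}: [Conditions] {self.conditions}\n"
            f"{self.sender}: [Actions] {self.actions}\n"
            f"{self.sender}: [Needs] {self.needs}"
        )


report = CanReport(
    sender="Ops Lead",
    conditions="500 error rate at 90% on API Gateway.",
    actions="I am rolling back the last deployment to v4.2.",
    needs="I need DB Admin to verify connection pool health.",
)
print(report.render())
```

A message like "Still looking... umm..." cannot be built with this template, which is exactly the idea.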
[Chart: Signal vs. Resolution Time. Data is simulated, based on Mean Time To Restore (MTTR) trends.]
4. The Staging Area
When a big fire happens, every volunteer drives to the scene. If they all park in front of the building, the trucks can't get in. In IT, "Staging" is a separate Slack channel or Zoom breakout where volunteers wait. They do not enter the main incident channel until requested.
Keep the Hot Zone clear. Only move resources in when they have a specific task.
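The staging rule can be modeled in a few lines. The sketch below uses a hypothetical `Incident` class with made-up names and skills; it is not tied to Slack, Zoom, or any real API. It only shows the flow: everyone checks in to staging, and the IC pulls people into the hot zone one specific task at a time.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Incident:
    staging: dict[str, str] = field(default_factory=dict)   # name -> skill
    hot_zone: dict[str, str] = field(default_factory=dict)  # name -> assigned task

    def check_in(self, name: str, skill: str) -> None:
        """Volunteers always land in staging, never directly in the hot zone."""
        self.staging[name] = skill

    def request(self, skill: str, task: str) -> Optional[str]:
        """Move one person with the needed skill into the hot zone with a task."""
        name = next((n for n, s in self.staging.items() if s == skill), None)
        if name is None:
            return None  # nobody with that skill is waiting in staging
        del self.staging[name]
        self.hot_zone[name] = task
        return name


incident = Incident()
incident.check_in("alice", "dba")
incident.check_in("bob", "network")
incident.check_in("carol", "dba")

assigned = incident.request("dba", "verify connection pool health")
print(assigned)                  # alice
print(sorted(incident.staging))  # ['bob', 'carol']
```

Note that there is no method for entering the hot zone without a task. That omission is deliberate.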
5. Transfer of Command
Incidents can last hours. Fatigue sets in. When a new Incident Commander takes over, a formal face-to-face transfer is required: the outgoing IC briefs the incoming IC on the Situation, Resources, and Priorities. Skip this, and the new IC starts from zero.
Transfer Checklist
- Situation Status: What happened? What is happening now?
- Deployment/Assignments: Who is doing what right now?
- Immediate Priorities: What must happen in the next 15 mins?
- Communications Plan: What channel are we on? Who knows?
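The checklist above can double as a gate: no complete briefing, no handoff. Here is a minimal sketch of that idea. The `CommandTransfer` class, the `transfer_command` function, and the sample names and values are hypothetical; only the four checklist items come from the text.

```python
from dataclasses import dataclass, fields


@dataclass
class CommandTransfer:
    situation_status: str      # What happened? What is happening now?
    assignments: str           # Who is doing what right now?
    immediate_priorities: str  # What must happen in the next 15 mins?
    communications_plan: str   # What channel are we on? Who knows?

    def missing(self) -> list[str]:
        """List checklist items that were left blank."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]


def transfer_command(outgoing: str, incoming: str, briefing: CommandTransfer) -> str:
    """Hand over command only when every checklist item has been covered."""
    gaps = briefing.missing()
    if gaps:
        raise ValueError(f"Transfer blocked, briefing incomplete: {', '.join(gaps)}")
    return f"{incoming} has command (relieved {outgoing})"


briefing = CommandTransfer(
    situation_status="Primary DB unresponsive, 500s on the API gateway.",
    assignments="Ops Lead rolling back; DB Admin checking connection pools.",
    immediate_priorities="Confirm rollback result, then decide on failover.",
    communications_plan="Main incident channel; support is posting status updates.",
)
print(transfer_command("Dana", "Sam", briefing))  # Sam has command (relieved Dana)
```

On the fireground the same announcement goes out over the radio, so everyone knows who the IC is now.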
🚨 LIVE INCIDENT: DB FAILOVER
Role: Incident Commander
You have just been paged. The main database is unresponsive. 500 errors are spiking. 15 engineers have jumped into the #incidents channel and are posting screenshots.
What is your first move?