🔥 Fireground to Cloud

Why Firefighters are Better at IT Outages

I am a sysadmin at a fintech company. I am also a volunteer firefighter. When a house is burning, we don't argue about who is in charge. We don't have 50 people shouting on the radio. We have a system. It's time IT adopted the Incident Command System (ICS).

🚒 The Fire Ground

  • Clear Chain of Command
  • Standardized Terminology
  • Bias for Action, but Safety First
  • Designated Staging Areas

💻 The IT War Room

  • "Hero" Culture (SPOF)
  • C-Levels interrupting Engineers
  • Alert Fatigue & Noise
  • Infinite "Helpers" slowing progress

1. Span of Control

NIMS (National Incident Management System) guidance holds that one person can effectively manage 3 to 7 people. The ideal is 5. In IT outages, we often have one "Incident Commander" trying to listen to 20 engineers, 3 VPs, and customer support simultaneously. At that ratio the IC cannot track who is doing what, so updates get missed and decisions stall.
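The span-of-control rule is easy to check mechanically. The sketch below is only an illustration: the function names and the "LIGHT" / "OK" / "OVERLOAD" labels are hypothetical, and only the 3-to-7 range and the ideal of 5 come from the text above.

```python
MIN_SPAN, IDEAL_SPAN, MAX_SPAN = 3, 5, 7  # the 3-to-7 range and ideal of 5 cited above


def span_status(direct_reports: int) -> str:
    """Classify one supervisor's load against the span-of-control range."""
    if direct_reports > MAX_SPAN:
        return "OVERLOAD"
    if direct_reports < MIN_SPAN:
        return "LIGHT"  # hypothetical label for fewer than 3 reports
    return "OK"


def split_into_teams(responders: list[str], size: int = IDEAL_SPAN) -> list[list[str]]:
    """Chunk responders into teams of at most `size`; each team reports through one lead."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [responders[i:i + size] for i in range(0, len(responders), size)]


if __name__ == "__main__":
    engineers = [f"eng-{n}" for n in range(1, 21)]
    # 20 engineers, 3 VPs, and customer support all talking to one IC
    print(span_status(len(engineers) + 3 + 1))
    # Same engineers grouped into teams; the IC now hears only from team leads
    teams = split_into_teams(engineers)
    print(len(teams), span_status(len(teams)))
```

Grouping 20 engineers into teams of 5 leaves the IC with four team leads to listen to, which is back inside the range.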

[Interactive chart: Cognitive Load vs. Team Size. Slider range: 1 to 15 people, shown at 12.]

Status: CRITICAL OVERLOAD. The IC cannot process information. Messages are being missed. Decisions are delayed.

2. The 360 Size-Up

Firefighters walk *around* the burning building before entering. IT Admins often SSH into the first server they see. Stop. Look. Think. Act. The "Size-Up" determines strategy (Offensive vs. Defensive) and required resources.

[Interactive chart: Initial Assessment Radar. Sliders simulate an initial incident report and show where the "fire" is spreading.]

Current Mode: DEFENSIVE. Infrastructure load is high. Stabilize the perimeter before attempting to restore individual services.
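A size-up can be reduced to a few scores and one strategy decision. This is a rough sketch, not the logic behind the radar above: the three axes, the 0-to-10 scale, and the threshold of 7 are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class SizeUp:
    # Hypothetical 0-10 scores gathered before touching any server
    infrastructure_load: int
    services_affected: int
    spread_rate: int


def choose_mode(size_up: SizeUp, threshold: int = 7) -> str:
    """Pick a strategy: contain first when the platform itself is at risk."""
    if max(size_up.infrastructure_load, size_up.spread_rate) >= threshold:
        return "DEFENSIVE"  # stabilize the perimeter before restoring services
    return "OFFENSIVE"      # safe to go straight at the failing service


if __name__ == "__main__":
    print(choose_mode(SizeUp(infrastructure_load=9, services_affected=3, spread_rate=2)))
    print(choose_mode(SizeUp(infrastructure_load=2, services_affected=3, spread_rate=1)))
```

The point of writing it down is that the mode is decided once, out loud, before anyone opens an SSH session.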

3. Radio Discipline (CAN Reports)

On the fireground, radio airtime is scarce. We use CAN Reports: Conditions (What do I see?), Actions (What am I doing?), Needs (What resources do I need?). In IT, we clutter the bridge with theories and "umms".

🚫 Typical IT Bridge

Dev1: I'm checking the logs... wait, seeing 500s.

Manager: Is it fixed yet?

Dev2: I think it might be the Redis cache, or maybe DNS?

CEO: Why is the site down?

Dev1: Still looking... umm...

✅ CAN Report Standard

Ops Lead: [Conditions] 500 error rate at 90% on API Gateway.

Ops Lead: [Actions] I am rolling back the last deployment to v4.2.

Ops Lead: [Needs] I need DB Admin to verify connection pool health.

-- Silence until complete --
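One way to enforce the CAN format on a chat bridge is to make it a structure rather than a habit. The class below is a minimal sketch; `CanReport` and `render` are hypothetical names, and the rule that Conditions and Actions may not be empty is an assumption about how strict a team would want to be.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CanReport:
    sender: str
    conditions: str      # What do I see?
    actions: str         # What am I doing?
    needs: str = "None"  # What resources do I need?

    def __post_init__(self) -> None:
        # Reject reports that skip the observation or the action
        for name in ("conditions", "actions"):
            if not getattr(self, name).strip():
                raise ValueError(f"CAN report is missing '{name}'")

    def render(self) -> str:
        """Format the report as three short lines for the incident channel."""
        return (
            f"{self.sender}: [Conditions] {self.conditions}\n"
            f"{self.sender}: [Actions] {self.actions}\n"
            f"{self.sender}: [Needs] {self.needs}"
        )


if __name__ == "__main__":
    report = CanReport(
        sender="Ops Lead",
        conditions="500 error rate at 90% on API Gateway.",
        actions="I am rolling back the last deployment to v4.2.",
        needs="I need DB Admin to verify connection pool health.",
    )
    print(report.render())
```

A theory with no observation behind it ("maybe DNS?") has nowhere to go in this structure, which is the point.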

[Chart: Signal vs. Resolution Time. Data simulated based on Mean Time To Restore (MTTR) trends.]

4. The Staging Area

When a big fire happens, every volunteer drives to the scene. If they all park in front of the building, the trucks can't get in. In IT, "Staging" is a separate Slack channel or Zoom breakout where volunteers wait. They do not enter the main incident channel until requested.

  • Hot Zone (Incident Channel): Essential Personnel Only
  • Staging Area (Waiting Room): Fresh Resources

Keep the Hot Zone clear. Only move resources in when they have a specific task.
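The staging rule can be modeled as two rosters with one gate between them. This sketch is tool-agnostic and does not call any real Slack or Zoom API; `IncidentRoster` and its method names are hypothetical.

```python
class IncidentRoster:
    """Tracks who is waiting in staging and who is working in the hot zone."""

    def __init__(self) -> None:
        self.staging: list[str] = []         # volunteers waiting to be called
        self.hot_zone: dict[str, str] = {}   # responder -> assigned task

    def check_in(self, name: str) -> None:
        """Everyone who shows up lands in staging, never in the hot zone."""
        if name not in self.staging and name not in self.hot_zone:
            self.staging.append(name)

    def deploy(self, name: str, task: str) -> None:
        """Move a responder into the hot zone only with a specific task."""
        if not task.strip():
            raise ValueError("no task, no entry to the hot zone")
        if name not in self.staging:
            raise LookupError(f"{name} is not checked in at staging")
        self.staging.remove(name)
        self.hot_zone[name] = task

    def release(self, name: str) -> None:
        """Task done: send the responder back to staging."""
        self.hot_zone.pop(name)
        self.staging.append(name)


if __name__ == "__main__":
    roster = IncidentRoster()
    for person in ("dba-1", "dev-1", "dev-2"):
        roster.check_in(person)
    roster.deploy("dba-1", "verify connection pool health")
    print(roster.hot_zone, roster.staging)
```

The `deploy` check is the whole idea: nobody enters the incident channel without a task attached to their name.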

5. Transfer of Command

Incidents can last hours, and fatigue sets in. When a new Incident Commander takes over, a formal face-to-face transfer is required: the outgoing IC recounts the situation, resources, and priorities. If you skip this, the new IC starts from zero.

Transfer Checklist

  01  Situation Status: What happened? What is happening now?
  02  Deployment/Assignments: Who is doing what right now?
  03  Immediate Priorities: What must happen in the next 15 mins?
  04  Communications Plan: What channel are we on? Who knows?
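The checklist above can double as a gate: command does not change hands until all four items are filled in. The sketch below assumes one field per checklist item; the class, function, and field names are hypothetical.

```python
from dataclasses import dataclass, fields


@dataclass
class TransferBriefing:
    situation: str               # 01: what happened, what is happening now
    assignments: dict[str, str]  # 02: who is doing what right now
    priorities: list[str]        # 03: what must happen in the next 15 minutes
    comms_channel: str           # 04: what channel are we on


def missing_items(briefing: TransferBriefing) -> list[str]:
    """Return the names of checklist items that were left empty."""
    return [f.name for f in fields(briefing) if not getattr(briefing, f.name)]


def transfer_command(outgoing: str, incoming: str, briefing: TransferBriefing) -> str:
    """Hand over command only when the briefing is complete; return the new IC."""
    gaps = missing_items(briefing)
    if gaps:
        raise ValueError(f"{outgoing} cannot hand off yet, missing: {', '.join(gaps)}")
    return incoming


if __name__ == "__main__":
    briefing = TransferBriefing(
        situation="Primary DB unresponsive, 500s spiking",
        assignments={"dba-1": "verify connection pool health"},
        priorities=["confirm failover target is healthy"],
        comms_channel="#incidents",
    )
    print(transfer_command("alice", "bob", briefing))
```

An empty string, list, or dict counts as missing, so a rushed handoff with no stated priorities is rejected rather than silently accepted.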

🚨 LIVE INCIDENT: DB FAILOVER

Role: Incident Commander

You have just been paged. The main database is unresponsive. 500 errors are spiking. 15 engineers have jumped into the #incidents channel and are posting screenshots.

What is your first move?