How to Respond to a Power/Fabric Outage Affecting Our Data Centres or Digital Systems

Modified on Wed, 10 Sep, 2025 at 4:15 PM

Introduction

Power or fabric outages (mains loss, critical circuit failure, UPS/generator issues, network core/switch stack failures, cooling failure, or wider plant problems) can disrupt guest experience and core business services across hotels, membership sites, and the data centre. A coordinated, safety-first response across Operations (Duty Manager), Facilities, IT Service Desk/Engineering, and the Executive is essential. This guide defines roles, information capture, and the end-to-end process from first response to lessons learnt.

1) Response

First Responder

The First Responder is the person who first identifies or receives the report (e.g., reception, engineering, security, duty manager, IT on-call, or any colleague on site).
Their role is to log the incident promptly by calling the site's 24/7 IT telephone contact, then confirm with the engineer who answers that a unique call reference has been issued and that the call has been set to high priority. This triggers escalation to IT leadership and engineering.

Essential information to capture (read verbatim if needed):

Location & scope: site/building/floor/room; data centre or comms room ID; affected circuits/racks/cabinets.
Time first noticed and whether power is site-wide, partial, or local (e.g., a single distribution board, a single rack PDU).
Safety indicators: emergency lighting, alarms active/failed, lifts stopped, visible damage, water ingress, heat/smell, electrical arcing.
Critical systems impacted: PMS/POS/Payments, Wi-Fi, telephony, door access/CCTV, BMS, data centre services (virtualisation clusters, storage/SAN/NAS, core switches, fire suppression), cooling/CRAC status.
Backup state observed: UPS on battery %, runtime, alarms; generator running/not running; ATS status; any recent plant works.
Immediate actions taken: areas isolated, guest moves, manual processes started, plant checks performed.
Reporter contact details and best call-back route.
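
As an illustration of how the capture list above could be turned into a structured record, the sketch below models the items as a small Python dataclass and reports which ones are still blank, so the call handler knows what to ask next. The class name, field names, and sample values are hypothetical and are not taken from any existing service desk tool.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class OutageReport:
    """First-responder capture form; all field names are illustrative only."""
    location_scope: str = ""
    time_first_noticed: str = ""
    power_extent: str = ""        # "site-wide", "partial", or "local"
    safety_indicators: str = ""
    critical_systems: str = ""
    backup_state: str = ""
    immediate_actions: str = ""
    reporter_contact: str = ""

    def missing_fields(self) -> List[str]:
        """Return the names of fields that are still blank, in checklist order."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]


# Hypothetical partial report taken during the first call.
report = OutageReport(
    location_scope="Comms room (example ID), racks A1-A3",
    time_first_noticed="02:14",
    power_extent="local",
    reporter_contact="Night security, radio channel 2",
)
print(report.missing_fields())
```

A blank result from missing_fields() would indicate that every item on the list has been captured at least once; it says nothing about the quality of the answers.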

Immediate triage & notifications

IT Service Desk: open a Major Incident and set Priority 1; open a conference bridge/Teams war-room; notify IT Leadership and the Facilities Lead; start an update cadence (every 30 minutes or as agreed).
Facilities: attend electrical plant (LV panels, DBs, UPS, ATS, generators, cooling/CRAC, BMS) and assess for hazards.
Duty Manager: assume site control for guest operations, crowd safety, and immediate business continuity actions.
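
The update cadence mentioned above can be sketched as a simple schedule calculation. This is a minimal example, assuming the default 30-minute interval; the function name and the sample opening time are hypothetical.

```python
from datetime import datetime, timedelta
from typing import List


def update_times(opened_at: datetime, count: int, interval_minutes: int = 30) -> List[datetime]:
    """Return the next `count` update deadlines after the incident was opened."""
    step = timedelta(minutes=interval_minutes)
    return [opened_at + step * i for i in range(1, count + 1)]


# Hypothetical Major Incident opened at 02:20.
opened = datetime(2025, 1, 1, 2, 20)
for due in update_times(opened, 3):
    print(due.strftime("%H:%M"))
```

If the Incident Manager agrees a different cadence, only the interval_minutes argument changes.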

2) Invoking the Business Continuity Plan (BCP)

The Duty Manager, in consultation with an Executive Team member, determines whether to invoke the documented BCP based on:
- Expected duration (supplier ETA, plant diagnosis).
- Health, safety, and guest welfare impacts.
- Loss of critical services (payments, access control, fire systems, CCTV, PMS).
Once invoked: brief teams, switch to manual/alternative operating procedures (manual check-in/out, offline payments, door-lists, radio comms), and record all BCP actions against the Major Incident.

3) Remediation (Stabilise & Make Safe)

Facilities (lead for plant/power/cooling)

Verify incoming mains supply with utility; check ATS position; confirm generator availability and fuel.
Assess UPS alarms, battery runtime, and load; consider load shedding of non-critical circuits to preserve runtime.
Check cooling/CRAC and mechanical plant—protect data centre rooms from over-temperature (open hot-aisle doors only if instructed and safe).
Engage approved electrical/mechanical contractors if required.
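
To show why load shedding preserves runtime, the sketch below uses a deliberately simple linear estimate (usable battery energy divided by load). Real batteries do not discharge linearly, so the runtime reported by the UPS itself should always take precedence; the function name and all figures are hypothetical.

```python
def estimated_runtime_minutes(usable_energy_wh: float, load_w: float) -> float:
    """Rough linear estimate of UPS runtime; not a substitute for the UPS reading."""
    if load_w <= 0:
        raise ValueError("load_w must be positive")
    return usable_energy_wh * 60 / load_w


# Hypothetical figures: 5 kWh usable energy, 10 kW load, 2.5 kW of non-critical load shed.
before = estimated_runtime_minutes(5000, 10000)
after = estimated_runtime_minutes(5000, 10000 - 2500)
print(f"before shedding: {before:.0f} min, after shedding: {after:.0f} min")
```

Under these assumed figures, shedding a quarter of the load extends the estimate from about 30 to about 40 minutes, which may be enough to complete a graceful shutdown.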

IT (lead for digital systems/data centre)

Protect critical systems: confirm graceful shutdown order if UPS runtime is limited (e.g., apps → DBs → hypervisors → storage last).
Prioritise core network, identity/DNS/DHCP, storage, virtualisation; park non-critical services.
Validate fire suppression status and access control to technical rooms.
Ensure backups and snapshots are recent/healthy; if protecting runtime, limit backup jobs to reduce load.
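
The example shutdown order above (apps, then DBs, then hypervisors, with storage last) can be expressed as a tier ranking. The sketch below is illustrative only: the tier labels and system names are hypothetical, and the approved runbook remains the authority on the real order.

```python
from typing import Dict, List

# Tiers in the order they should be shut down, following the example in the text.
SHUTDOWN_TIERS = ["application", "database", "hypervisor", "storage"]


def shutdown_order(systems: Dict[str, str]) -> List[str]:
    """Sort system names by tier; systems in the same tier keep their input order."""
    rank = {tier: i for i, tier in enumerate(SHUTDOWN_TIERS)}
    return sorted(systems, key=lambda name: rank[systems[name]])


# Hypothetical inventory mapping system name to tier.
inventory = {
    "san-01": "storage",
    "pms-app": "application",
    "hv-01": "hypervisor",
    "pms-db": "database",
    "pos-app": "application",
}
print(shutdown_order(inventory))
```

An unknown tier raises a KeyError rather than being silently placed somewhere in the sequence, which is the safer failure mode when runtime is limited.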

4) Switch-On (Safe Power Restoration & Controlled Start-Up)

Facilities:

Confirm the environment is safe and faults are cleared; restore power in phases, circuit by circuit, to limit inrush current.
Verify generator → mains transfer and UPS returns to normal mode; check harmonic alarms and load balance.

IT:

Follow the approved start-up sequence:
1. Power & cooling stable (room < 22°C, humidity in range).
2. Core network: core/distribution switches, firewalls, VPN, DHCP/DNS/AD.
3. Storage platforms (SAN/NAS/object) and replication links.
4. Virtualisation clusters (Hyper-V), then management tooling/monitoring.
5. Databases & middleware, then application tiers (PMS, POS, payments, Wi-Fi controllers, telephony/CC, membership/CRM, BMS integrations).
Bring guest-facing systems online last; coordinate with Duty Manager to manage load/surge (e.g., stagger POS terminals, AP radios, TV head-ends).
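
The start-up sequence above behaves like a series of gates: a stage is only started once the previous one is confirmed healthy. The sketch below models that idea under stated assumptions; the stage names paraphrase the list, and the is_healthy callback is a hypothetical stand-in for whatever checks the approved runbook defines.

```python
from typing import Callable, List

# Stage names paraphrase the sequence in the text; guest-facing systems come last.
STARTUP_STAGES = [
    "power_and_cooling",
    "core_network",
    "storage",
    "virtualisation",
    "databases_and_middleware",
    "applications",
    "guest_facing",
]


def run_startup(is_healthy: Callable[[str], bool]) -> List[str]:
    """Return the stages completed, stopping at the first stage that fails its check."""
    completed: List[str] = []
    for stage in STARTUP_STAGES:
        if not is_healthy(stage):
            break
        completed.append(stage)
    return completed


# Hypothetical run in which the storage stage fails its health check.
done = run_startup(lambda stage: stage != "storage")
print(done)
```

Stopping at the first failed gate keeps later tiers powered down until the dependency they rely on has been fixed and rechecked.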

5) Testing (Functional & Safety Validation)

Facilities:

Life-safety systems: fire alarm panels, emergency lighting, lifts, BMS alarms/points, plant interlocks.
Power integrity: UPS health, battery test schedule, ATS status, generator exercise logs, CRAC alarms.

IT:

Networking: core reachability, WAN/MPLS/SD-WAN, Wi-Fi SSIDs, bandwidth/utilisation.
Identity/Access: AD/SSO, MFA, door access controllers, card encoders.
Business systems: PMS check-in/out, POS transactions (test card-present, contactless, and fallback payments), membership/CRM, booking engines, TV/CAST, telephony.
Data integrity: DB health checks, application logs, queued jobs, integrations (e.g., payment gateways, Oracle/Snowflake pipelines).
Monitoring: ensure all servers, appliances, and network elements are visible and showing green; re-enable any alert suppressions.
Backups/DR: confirm the last successful backup; run an ad-hoc integrity verification; check that RPO/RTO have not been breached; validate that DR replication is healthy.
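
The RPO/RTO check above reduces to two time comparisons: the gap between the last good backup and the start of the outage (potential data loss) against the RPO, and the gap between outage start and service restoration against the RTO. The sketch below is a minimal illustration; the function names, timestamps, and targets are hypothetical, and the real targets come from the BCP.

```python
from datetime import datetime, timedelta


def rpo_breached(last_good_backup: datetime, outage_start: datetime, rpo: timedelta) -> bool:
    """True if more data could have been lost than the RPO allows."""
    return (outage_start - last_good_backup) > rpo


def rto_breached(outage_start: datetime, restored_at: datetime, rto: timedelta) -> bool:
    """True if the service took longer to restore than the RTO allows."""
    return (restored_at - outage_start) > rto


# Hypothetical timeline and targets.
last_backup = datetime(2025, 1, 1, 1, 0)
outage = datetime(2025, 1, 1, 2, 14)
restored = datetime(2025, 1, 1, 5, 30)

print(rpo_breached(last_backup, outage, rpo=timedelta(hours=1)))
print(rto_breached(outage, restored, rto=timedelta(hours=4)))
```

In this assumed timeline the backup is 74 minutes old against a one-hour RPO, so the RPO is breached, while the 3 hour 16 minute restoration stays inside the four-hour RTO.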

Operations:

Validate front-of-house flows: room key encoding, payment receipts, printing, guest Wi-Fi portal, TVs/OTT casting, spa/restaurant bookings.

Record test outcomes and any residual risk/workarounds against the Major Incident.

6) Communication

During the outage

IT Service Desk: Major Incident updates every 30 minutes (or as set by Incident Manager).
Duty Manager: guest and colleague briefings; signage; calm, factual updates.
Executive: briefed on impact, ETA to restore, and any regulatory/brand risks.

After restoration

Service Desk issues All-Clear once Facilities/IT testing is complete.
Customer-facing comms (if required): apologies, explanation at high level, reassurance on data/security if relevant.
Management note: short summary of impact, downtime, cause, and next steps.

7) Lessons Learnt (Post-Incident Review within 5 Working Days)

Root cause (utility, plant, configuration, procedural).
What went well / gaps in BCP, call-out, comms, runbooks, and tooling.
Improvements & owners:
- UPS/generator maintenance, fuel contracts, spares, ATS testing cadence.
- Data centre runbooks (shutdown/start-up), rack labelling, PDU load balancing.
- Monitoring thresholds/alerting; out-of-hours escalation; bridge etiquette.
- Resilience posture: RPO/RTO review, DR test schedule, circuit diversity, dual-UPS/dual-PSU coverage, network redundancy, cooling redundancy.
Publish minutes, actions, and due dates; update BCP and technical SOPs accordingly.

Roles & Responsibilities (at a glance)

First Responder — Detect/report; provide essential info; initiate 24/7 IT call; confirm unique call reference and Priority 1 set.
IT Service Desk (Incident Manager on duty) — Open Major Incident; assemble bridge; comms cadence; coordinate IT actions & escalation.
Facilities Lead — Electrical/mechanical plant diagnosis; UPS/generator/ATS/cooling; make-safe and restore supply.
IT Engineering (Infra/Network/App) — Protect systems; controlled shutdown/start-up; verify services, data integrity, monitoring, backups/DR.
Duty Manager (Site Lead) — Safety, guest operations, invoke BCP with Exec consultation, manual processes, on-site coordination.
Executive (On-call/Accountable) — Oversight, external comms approval, risk acceptance, regulatory/brand considerations.

Conclusion

Following this playbook ensures a safe, controlled restoration of power and digital services across our hotels, membership sites, and data centre. Clear ownership, disciplined shutdown/start-up sequences, rigorous testing, timely communication, and a structured lessons-learnt cycle protect guest experience, safeguard data, and strengthen resilience for the next event.