What Is Failover?
Failover is the automatic transfer of work from a failed component to a standby one — the mechanism that turns redundant hardware into actual resilience. When it works, users never notice. When it silently rots, the second failure becomes the outage the first one should have been.
Failover at Every Layer
Servers
Clustering, plus automatic VM restart or migration when a host dies.
Network
Redundant links, VRRP gateways, and dynamic rerouting.
Storage
Multipath I/O, RAID, and replicated arrays.
Power
Dual PSUs on A/B feeds, UPS, and generator transfer.
Redundancy Rots When Nobody Watches It
The most dangerous state in infrastructure isn't a failure; it's running on the standby without knowing it. A power supply died last month and the second one is carrying the box alone; one path of the storage multipath dropped and nobody noticed; the standby's health degraded while it idled. Everything keeps working until the next single failure arrives and there is no backup left to absorb it.
That's why failover is a monitoring problem as much as an architecture problem. Component-level hardware telemetry — both PSUs, every fan, each link and path — is what catches redundancy loss while it's still a maintenance ticket instead of a 2 a.m. outage.
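As a rough illustration of what "catching redundancy loss" can look like, here is a minimal Python sketch. It assumes component health has already been collected into plain booleans; the RedundancyGroup class, the redundancy_status function, and the component names are hypothetical and not taken from any particular monitoring product.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class RedundancyGroup:
    """A set of interchangeable components (hypothetical model)."""
    name: str                  # e.g. "psu" or "storage-paths"
    members: dict[str, bool]   # component id -> currently healthy?
    required: int = 1          # how many members are needed to carry the load


def redundancy_status(group: RedundancyGroup) -> str:
    """Classify a group as 'ok', 'degraded', or 'failed'."""
    healthy = sum(1 for ok in group.members.values() if ok)
    if healthy < group.required:
        return "failed"      # the service itself is affected
    if healthy < len(group.members):
        return "degraded"    # still running, but some redundancy is gone
    return "ok"


if __name__ == "__main__":
    psus = RedundancyGroup("psu", {"psu1": True, "psu2": False})
    # The box is up, yet one more PSU failure would take it down.
    print(psus.name, redundancy_status(psus))
```

The point of the middle state is that "degraded" should raise a ticket even though nothing is down: the service is healthy while the redundancy is not.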
Common Questions About Failover
What is failover in simple terms?
Failover is the automatic switch to a standby system when the primary one fails — a second server, network path, or power feed takes over so the service keeps running.
What is the difference between failover and high availability?
High availability is the goal (a service that stays up); failover is one of the mechanisms that achieves it. HA designs combine redundancy, failover, and health monitoring to survive component failures.
What is failover testing?
Failover testing deliberately fails the primary — pulling a link, stopping a node — to verify the standby actually takes over within the expected time. Untested failover is a hope, not a design.
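A failover test can be sketched as a small timing harness. The Python example below is only an illustration under stated assumptions: fail_primary and service_is_up are hypothetical callables you would supply (for example, one that stops a node and one that probes the service address), and the timeout and polling interval are arbitrary defaults.

```python
import time
from typing import Callable, Optional


def measure_failover(
    fail_primary: Callable[[], None],
    service_is_up: Callable[[], bool],
    timeout_s: float = 30.0,
    interval_s: float = 0.5,
) -> Optional[float]:
    """Fail the primary, then poll until the service answers again.

    Returns the observed recovery time in seconds, or None if the
    service did not come back within timeout_s.
    """
    fail_primary()                      # hypothetical: pull a link, stop a node
    start = time.monotonic()
    deadline = start + timeout_s
    while time.monotonic() < deadline:
        if service_is_up():             # hypothetical probe of the service
            return time.monotonic() - start
        time.sleep(interval_s)
    return None
```

The measured time is only as precise as the polling interval, and a None result is the finding that matters most: the standby did not take over within the expected window.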
What is the difference between active-active and active-passive failover?
Active-passive keeps a standby idle until the primary fails. Active-active runs both nodes, sharing load and taking each other's traffic on failure — better utilization, but requires state synchronization.
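The difference can be shown with two toy routing functions. This Python sketch is a simplification: the node names and the health map are hypothetical, the active-active variant uses simple modulo distribution as a stand-in for a real load balancer, and state synchronization is left out entirely.

```python
from typing import Mapping, Optional, Sequence


def route_active_passive(
    nodes: Sequence[str], healthy: Mapping[str, bool]
) -> Optional[str]:
    """Send all traffic to the first healthy node in priority order."""
    for node in nodes:
        if healthy[node]:
            return node
    return None  # no node left to fail over to


def route_active_active(
    nodes: Sequence[str], healthy: Mapping[str, bool], request_id: int
) -> Optional[str]:
    """Spread requests across every healthy node."""
    live = [node for node in nodes if healthy[node]]
    if not live:
        return None
    return live[request_id % len(live)]


if __name__ == "__main__":
    nodes = ["node-a", "node-b"]
    health = {"node-a": True, "node-b": True}
    print([route_active_active(nodes, health, i) for i in range(4)])
    health["node-a"] = False  # simulate the failure of one node
    print(route_active_passive(nodes, health))
    print([route_active_active(nodes, health, i) for i in range(4)])
```

In the active-passive function the second node receives nothing until the first is unhealthy; in the active-active function both carry traffic, and the survivor absorbs all of it after a failure, which is why it needs enough spare capacity and shared state.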
