When New GPUs Start Vanishing
The company operates a large AI training environment that supports multiple internal R&D teams. Its GPU clusters run continuously, processing deep learning workloads around the clock. The infrastructure was modern, the hardware was new, and the deployment followed vendor best practices.
Yet shortly after a new batch of GPU servers was brought online, strange problems began to appear. A training job would suddenly terminate after running successfully for hours. A node that had been stable the previous day would start reporting CUDA errors. Engineers would log into a server and discover that one of the GPUs had vanished from the system.
An eight-GPU server would suddenly show only seven GPUs. Sometimes a reboot restored the missing card. Sometimes it did not.
The behavior was inconsistent enough to make troubleshooting extremely difficult. As incidents increased, the customer realized they were facing something more serious than an occasional hardware glitch.
The Business Impact
Data scientists were losing training runs that had consumed dozens of GPU hours. Development schedules slipped because experiments could no longer complete reliably. Platform engineers spent increasing amounts of time investigating failures instead of improving the environment.
Everyone agreed there was a problem. Nobody could agree on the cause.
Software Was the First Suspect
The infrastructure team upgraded GPU drivers. CUDA versions were updated. Firmware was refreshed. Kubernetes GPU operators were reinstalled. BIOS settings were adjusted. Entire operating systems were rebuilt.
For a short time, each change seemed promising. Then the failures returned. Some servers ran perfectly for weeks. Others failed multiple times within a few days despite identical hardware and software configurations. That inconsistency made the software theory increasingly difficult to defend.
Attention then shifted to the GPUs themselves. Suspect cards were replaced. The problem remained. Engineers replaced PCIe risers, swapped memory modules, moved GPUs between slots, and exchanged entire servers. Again, the failures continued.
The most frustrating part was that traditional monitoring tools showed nothing unusual before the incidents. CPU utilization looked normal. Memory consumption was stable. Temperatures were within acceptable limits. The GPUs simply disappeared.
Investigating From the Infrastructure Layer
At that point, the customer engaged the Sensaka team. Rather than starting from the operating system or application layer, we approached the investigation from the infrastructure level.
When a GPU disappears from a running system, it is often treated as a driver issue. But a GPU is still a physical device connected to a complex chain of power, cooling, motherboard, PCIe, and management components. We wanted visibility into every layer of that chain.
Using Sensaka's agentless monitoring and out-of-band management capabilities, we began collecting information directly from the underlying hardware environment — power supply status, BMC events, motherboard sensor data, PCIe health, voltage readings, thermal measurements, and hardware event logs.
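As a rough illustration of what out-of-band collection can look like, the sketch below parses pipe-delimited sensor rows of the kind printed by the `ipmitool sensor` command and keeps only the voltage readings. It is not a description of Sensaka's implementation. The sensor names and values in `SAMPLE` are hypothetical, and real BMCs differ in which sensors they expose and how they name them.

```python
def parse_sensor_table(text):
    """Parse pipe-delimited sensor rows into a list of dicts.

    Each row is expected to start with: name | value | unit | status.
    Rows with a non-numeric value (for example "na") are skipped.
    """
    readings = []
    for line in text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) < 4:
            continue
        name, value, unit, status = fields[:4]
        try:
            value = float(value)
        except ValueError:
            continue
        readings.append(
            {"name": name, "value": value, "unit": unit, "status": status}
        )
    return readings


def voltage_readings(readings):
    """Keep only the sensors reported in volts."""
    return [r for r in readings if r["unit"] == "Volts"]


# Hypothetical sample text; sensor names and thresholds are made up.
SAMPLE = """\
P12V_GPU0 | 12.096 | Volts | ok | 10.200 | 10.800 | 11.400 | 12.600 | 13.200 | 13.800
P12V_GPU1 | 11.340 | Volts | ok | 10.200 | 10.800 | 11.400 | 12.600 | 13.200 | 13.800
CPU0_Temp | 54.000 | degrees C | ok | na | na | na | 85.000 | 90.000 | 95.000
FAN1 | na | RPM | na | na | na | na | na | na | na
"""

if __name__ == "__main__":
    for reading in voltage_readings(parse_sensor_table(SAMPLE)):
        print(reading["name"], reading["value"], reading["status"])
```

Storing readings like these with timestamps is what makes later historical analysis possible; a single snapshot would show every sensor as `ok`.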
Individually, none of these data points appeared alarming. When viewed together, a pattern emerged: nearly every server that experienced GPU loss showed abnormal voltage fluctuations shortly before the event.
The fluctuations were brief, often lasting less than a second. They were too short to trigger traditional alerts and too subtle to attract attention during manual troubleshooting. But when historical hardware telemetry was analyzed over time, the correlation became difficult to ignore.
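The kind of correlation described above can be sketched in a few lines. The example below is a minimal illustration, not the analysis actually used in this engagement: it finds runs of samples below a voltage floor and then checks which GPU-loss events were preceded by such a dip. The 12 V nominal rail, the 5% tolerance, and the 30-second window are all assumptions chosen for the example.

```python
def find_dips(samples, nominal=12.0, tolerance=0.05):
    """Return (start, end, min_volts) for each run of samples below the floor.

    samples: list of (timestamp_seconds, volts), sorted by time.
    The floor is nominal * (1 - tolerance); both values are assumptions.
    """
    floor = nominal * (1 - tolerance)
    dips = []
    current = None
    for t, v in samples:
        if v < floor:
            if current is None:
                current = [t, t, v]          # open a new dip
            else:
                current[1] = t               # extend the dip
                current[2] = min(current[2], v)
        elif current is not None:
            dips.append(tuple(current))      # voltage recovered, close the dip
            current = None
    if current is not None:
        dips.append(tuple(current))
    return dips


def events_preceded_by_dip(events, dips, window=30.0):
    """Return the event timestamps that follow a dip by at most `window` seconds."""
    return [
        e for e in events
        if any(0 <= e - end <= window for _, end, _ in dips)
    ]


if __name__ == "__main__":
    samples = [(0.0, 12.0), (0.1, 12.0), (0.2, 11.1),
               (0.3, 11.0), (0.4, 12.0), (50.0, 12.0)]
    dips = find_dips(samples)
    print(dips)
    print(events_preceded_by_dip([10.0, 100.0], dips))
```

A threshold alert evaluated once per minute would miss the dip in this sample entirely, which is why the pattern only becomes visible when finer-grained history is retained and analyzed after the fact.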
The Hidden Power Problem
The customer initially believed power infrastructure could not be the issue because the power supplies were rated well above the calculated system requirements. On paper, everything looked correct.
The problem was that paper calculations only reflected average power consumption. AI workloads rarely behave according to averages.
Modern GPUs can generate very large transient power spikes within extremely short periods of time.
Certain training workloads caused GPUs to swing rapidly between low and full utilization. As utilization changed, power demand changed dramatically. Although the power supplies were technically large enough for steady-state consumption, parts of the power delivery chain struggled to absorb these rapid fluctuations.
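The gap between an average-based budget and a transient-aware one is easy to see with simple arithmetic. The sketch below uses entirely hypothetical wattages (they are not figures from this customer's environment) and a deliberately pessimistic assumption that all GPUs peak at the same instant.

```python
def power_headroom(psu_capacity_w, gpu_count, gpu_avg_w, gpu_peak_w, other_w):
    """Compare PSU headroom under average load and under simultaneous GPU peaks.

    All inputs are in watts. `other_w` covers CPUs, memory, fans, and so on.
    Assumes, as a worst case, that every GPU peaks at the same moment.
    """
    avg_total = gpu_count * gpu_avg_w + other_w
    peak_total = gpu_count * gpu_peak_w + other_w
    return {
        "average_headroom_w": psu_capacity_w - avg_total,
        "peak_headroom_w": psu_capacity_w - peak_total,
    }


if __name__ == "__main__":
    # Hypothetical eight-GPU server; every number here is illustrative.
    result = power_headroom(
        psu_capacity_w=6000,
        gpu_count=8,
        gpu_avg_w=500,
        gpu_peak_w=800,
        other_w=1200,
    )
    print(result)
```

With these made-up numbers the average case shows 800 W of spare capacity while the simultaneous-peak case is 1600 W over budget. Whether a real system tolerates such excursions depends on how long they last and on every component in the delivery path, which is why the sizing looked correct on paper.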
Further inspection uncovered multiple contributing factors: uneven load distribution across riser power circuits, oxidation on some power connectors that had increased electrical resistance over time, and connections that did not provide optimal contact under heavy load. None of these conditions alone was severe enough to cause complete system failure. Together, they created the perfect environment for intermittent instability.
When a workload generated a sudden power spike, voltage would momentarily dip. The GPU's protection mechanisms interpreted the condition as a power fault, and the GPU dropped off the PCIe bus. From the OS perspective, the GPU simply disappeared. From the application perspective, the training job crashed. From the infrastructure perspective, the root cause was hidden inside the power delivery path.
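On NVIDIA systems, a GPU leaving the PCIe bus usually surfaces in the kernel log as an Xid event, with Xid 79 documented as "GPU has fallen off the bus". The sketch below scans log text for that code and returns the affected PCI addresses. The sample log lines are illustrative only; the exact message format can vary by driver version, so the regular expression is an assumption that may need adjusting.

```python
import re

# Assumed log shape: "NVRM: Xid (PCI:<address>): <code>, ..."
XID_PATTERN = re.compile(r"NVRM: Xid \(PCI:([0-9a-fA-F:.]+)\): (\d+),")


def find_xid_events(log_text, xid_code=79):
    """Return the PCI addresses of every log line reporting the given Xid code."""
    hits = []
    for line in log_text.splitlines():
        match = XID_PATTERN.search(line)
        if match and int(match.group(2)) == xid_code:
            hits.append(match.group(1))
    return hits


# Illustrative lines, not captured from a real system.
SAMPLE_LOG = """\
[123456.789] NVRM: Xid (PCI:0000:3b:00): 79, pid=4242, GPU has fallen off the bus.
[123460.001] NVRM: Xid (PCI:0000:5e:00): 31, pid=4243, illustrative unrelated event
"""

if __name__ == "__main__":
    print(find_xid_events(SAMPLE_LOG))
```

A log entry like this records that the GPU vanished, not why. It is the timestamp that matters: it gives a point in time against which power and voltage telemetry can be compared.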
A Second Underlying Cause
A subset of affected servers continued exhibiting occasional instability even after power delivery issues had been addressed. This suggested a second underlying cause.
Using Sensaka's infrastructure visibility and hardware correlation capabilities, we compared healthy nodes against problematic nodes. The comparison revealed subtle differences in motherboard power behavior. Certain systems displayed abnormal voltage regulation patterns during heavy GPU loads — power delivery phases responded more slowly than expected, producing brief instability during rapid load transitions.
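One simple way to frame a healthy-versus-problematic comparison is to measure how long a voltage rail takes to return to its tolerance band after a load step, then compare the medians of the two groups. The sketch below assumes that approach; it is not the method Sensaka used. The 12 V nominal value, the 3% band, and the sample timestamps are hypothetical, and capturing regulator behavior at this resolution generally needs faster sampling than typical sensor polling provides.

```python
from statistics import median


def recovery_time(samples, nominal=12.0, tolerance=0.03):
    """Time from the first sample below the band to the first sample back in it.

    samples: list of (timestamp, volts), sorted by time. The result is in the
    same unit as the timestamps. Returns None if the voltage never leaves the
    band or never recovers within the samples given.
    """
    low = nominal * (1 - tolerance)
    start = None
    for t, v in samples:
        if start is None:
            if v < low:
                start = t
        elif v >= low:
            return t - start
    return None


def compare_groups(healthy, suspect):
    """Return (median healthy recovery, median suspect recovery).

    Each argument maps a node name to a list of measured recovery times.
    """
    def group_median(group):
        return median(x for times in group.values() for x in times)

    return group_median(healthy), group_median(suspect)


if __name__ == "__main__":
    # Hypothetical trace with timestamps in milliseconds.
    trace = [(0, 12.0), (1, 11.2), (2, 11.5), (3, 11.9)]
    print(recovery_time(trace))
    print(compare_groups({"node-a": [1, 2, 3]}, {"node-b": [4, 6]}))
```

A consistent gap between the two medians does not identify the faulty component, but it narrows the search to hardware that differs between the groups.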
The issue was ultimately traced to a specific motherboard power regulation component used in a limited hardware batch. Although the systems passed standard validation testing, the design margin became insufficient under the unique load characteristics generated by large-scale AI training workloads.
Working with the hardware vendor, the customer validated the finding and replaced the affected components.
Remediation and Results
The infrastructure team implemented a broader remediation plan. Power distribution across servers was rebalanced. Aging connectors were replaced. GPU power policies were optimized. Motherboard issues were corrected. Rack-level power planning was reviewed and improved.
At the same time, Sensaka was configured to continuously monitor the critical indicators that had previously gone unnoticed — GPU power consumption trends alongside power supply health, voltage stability, PCIe status, thermal behavior, and hardware events. Instead of waiting for a GPU to disappear, the operations team could now identify warning signs before an outage occurred.
Within two months, recurring GPU loss incidents were eliminated. Training interruptions dropped dramatically. Mean time to identify hardware issues was reduced from hours to minutes.
More importantly, the organization gained a fundamentally different understanding of its AI infrastructure. Before the project, engineers could see failures after they occurred. After the project, they could see the conditions that created those failures.
A Missing GPU Is Not Always a GPU Problem
Modern AI data centers are no longer simple collections of servers. They are highly interconnected systems where GPUs, power infrastructure, cooling, storage, networks, and management tools all influence one another.
A missing GPU is not always a GPU problem. It may be a power problem. It may be a motherboard problem. It may be a visibility problem.
This customer's experience is a reminder that successful AI operations require more than powerful hardware. They require the ability to observe, understand, and manage the entire infrastructure stack. The fastest way to solve a problem is not to react after it happens, but to see it before it becomes an outage.
