Earlier today, your team pushed an update to inference workloads in production. Now it's 2:17 AM - GPUs are throwing errors on a few servers and workloads are failing. Is it something in the update, or are the GPUs themselves going bad? On bare metal, there's no snapshot to restore and no replacement instance to spin up. You need these nodes back, and you need to know whether the issue is your configuration or the hardware underneath it.
Bare metal GPU infrastructure gives you all the benefits of dedicated hardware: full control, privacy, security, and performance that multi-tenant environments can't match. With that additional control comes more to manage. When something breaks - a workload misconfiguration, a firmware or driver incompatibility, a corrupted SSH config, a failed kernel update - that same dedicated hardware stands between you and answers.
One of the hardest parts of running production workloads on bare metal is the blurry line between configuration issues and true hardware failures. A GPU throwing errors could be a driver mismatch from your last update, or it could be a hardware issue. The symptoms look similar from inside your workload.
The traditional path to resolution sends you down a support rabbit hole. File a ticket, describe the symptoms, wait for someone to manually test the server and report back. If the issue turns out to be your configuration, you've wasted hours waiting for an answer you could have found yourself. If it is hardware, you're still waiting - but now you have to convince the provider that the problem is real. And if you just request a new node, the same thing might happen again, because you never actually identified the root cause.
Go from "something's wrong" to "here's exactly what happened" in 60 seconds - and from there to a fix in minutes.
Run a full diagnostic sweep and know immediately whether your hardware is healthy, degraded, or failing. Boot into a clean rescue environment with known-good kernels and drivers to test the server independently of your workload. If the hardware checks out, you know the issue is in your configuration - and you can fix it right there in Rescue Mode. If the hardware needs attention, send the full diagnostic evidence to our support team in one command and we start working on it immediately.
The whole loop - from detection to diagnosis to resolution - gets your production workloads back up and running in minutes instead of hours.
We built two self-service capabilities that work together or independently, depending on what you need.
Rescue Mode reboots your server into a full Linux environment loaded entirely in memory, pre-configured with your SSH key. It's one click from the control plane or one API call from your automation, and console access is typically available in 5-15 minutes.
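From automation, kicking off Rescue Mode is a single authenticated request. The sketch below is illustrative only - the URL and route are placeholders, not the documented API, so check the API reference for the real endpoint:

```bash
# Hypothetical example - the URL and route are placeholders, not the
# documented Hydra API. Consult the API reference for the real endpoint.
curl -X POST \
  -H "Authorization: Bearer $HYDRA_API_TOKEN" \
  "https://api.example.com/v1/deployments/$DEPLOYMENT_ID/rescue-mode"
```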
This environment runs known-good kernels and drivers, which makes it valuable for more than just recovery. Boot into Rescue Mode and you can isolate the underlying hardware from your host OS entirely - run diagnostics, stress tests, or connectivity checks against a clean baseline. If the hardware passes, you know the issue lives in your workload configuration.
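A first-pass hardware sweep from inside the rescue environment can be a handful of standard NVIDIA and Linux commands - generic utilities, not part of our tooling, and the exact checks depend on your topology:

```bash
# Baseline checks from a clean rescue environment, using standard tools.
nvidia-smi                            # are all GPUs enumerated and responsive?
nvidia-smi -q -d TEMPERATURE,POWER    # thermal and power state
nvidia-smi topo -m                    # PCIe/NVLink topology as the driver sees it
dmesg --level=err,warn | grep -iE 'nvidia|xid|pcie'   # kernel-side GPU errors
```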
When you do need to make repairs, you have full access. Mount drives and fix configurations directly. Repair broken netplan configs, firewall rules, or routing tables. Restore SSH access by fixing daemon configs or regenerating keys. Recover from boot failures, resize or repartition disks, reconfigure your OS - all while keeping your data safe.
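As one concrete example, here's a minimal sketch of restoring SSH access on a broken host OS - the partition path is an assumption, so substitute your actual layout:

```bash
# Minimal repair sketch from Rescue Mode. /dev/nvme0n1p2 is an assumed
# root partition - run lsblk to find your actual layout.
mount /dev/nvme0n1p2 /mnt
mount --bind /dev /mnt/dev
mount --bind /proc /mnt/proc
mount --bind /sys /mnt/sys
chroot /mnt /bin/bash

# Inside the chroot: regenerate any missing SSH host keys and validate
# the daemon config before rebooting into the repaired OS.
ssh-keygen -A
sshd -t && echo "sshd config OK"
```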
Your existing data and workloads stay untouched through all of it. Rescue Mode provides access - it doesn't modify anything you haven't explicitly changed.
The Diagnostics Suite gives you direct, on-demand visibility into the full health of your server. A single topology-aware health check detects your machine configuration - PCIe, NVLink, or NVSwitch - and automatically runs the right tests. The result is one clear verdict: HEALTHY, DEGRADED, or UNHEALTHY.
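A full run is a single command - the same one shown in the getting-started note at the end - and the output sketched here is illustrative rather than verbatim:

```bash
# Topology-aware sweep; detects PCIe/NVLink/NVSwitch and runs the right tests.
brokkr-diagnostics --run all
# -> streams per-category results, ending in a single verdict:
#    HEALTHY, DEGRADED, or UNHEALTHY   (illustrative output, not verbatim)
```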
Nine health check categories cover GPU presence, driver status, memory and ECC integrity, PCIe link quality, thermal and power state, kernel errors, peer-to-peer communication, NVLink health, and Fabric Manager status. Fifteen diagnostic commands go deeper when you need them - ECC/RAS analysis, register-level GPU debug, NVMe SMART data, InfiniBand status, and GPU-to-GPU bandwidth tests.
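If you ever want to corroborate these signals by hand, standard utilities expose much of the same data. The commands below are generic Linux and NVIDIA tools, not the suite's own:

```bash
# Generic cross-checks with standard utilities - useful for confirming
# a verdict independently of the suite.
nvidia-smi -q -d ECC         # ECC/RAS error counters
smartctl -a /dev/nvme0       # NVMe SMART health data (smartmontools)
ibstat                       # InfiniBand port state (infiniband-diags)
nvidia-smi nvlink --status   # per-link NVLink state
```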
The full suite runs in 30-60 seconds. When you need Hydra's help, the --send flag transmits the complete results directly to our support team. The ticket arrives pre-loaded with evidence. We start fixing instead of asking questions.
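In practice that can be a single invocation. Treat the flag combination below as a sketch and confirm the exact syntax against the CLI help:

```bash
brokkr-diagnostics --run all --send   # sketch - confirm the exact flag combination in the CLI help
```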
Run diagnostics to get the verdict. If the hardware is healthy, the issue is in your workload or OS configuration - boot into Rescue Mode to test against a clean baseline and make repairs. If the hardware is degraded or failing, --send the diagnostic data and our team picks it up immediately. No ambiguity about who's responsible for what, no wasted time chasing the wrong root cause. Hydra tools give you diagnostic answers fast.
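Scripted end to end, the loop looks something like this sketch - it assumes the verdict strings appear literally in the tool's output, which is an assumption about the format rather than documented behavior:

```bash
#!/usr/bin/env bash
# Triage sketch: run the sweep, branch on the verdict. Assumes the
# verdict strings HEALTHY / DEGRADED / UNHEALTHY appear literally in
# stdout - an assumption, not documented output format.
report=$(brokkr-diagnostics --run all || true)   # tolerate nonzero exit on bad hardware
printf '%s\n' "$report"

case "$report" in
  *UNHEALTHY*|*DEGRADED*)
    echo "Hardware needs attention - sending evidence to support."
    brokkr-diagnostics --run all --send          # assumed flag combination, see above
    ;;
  *HEALTHY*)
    echo "Hardware checks out - boot Rescue Mode and debug the workload/OS config."
    ;;
esac
```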
Log into your Hydra control plane and launch Rescue Mode on any active deployment, or SSH in and run brokkr-diagnostics --run all for an immediate health check. Both tools are available on every node rented on Hydra.
Run Diagnostics | Launch Rescue Mode on a Deployment