GCP Live-Migrates, AWS Reboots: How Cloud Providers Handle Host Maintenance
The following incident is fictional, but the architecture lessons are real.
The incident #
Monday morning. You’re reviewing last week’s metrics and notice a gap in the graphs — your staging environment was down for about 5 minutes on Thursday afternoon. The AWS console shows the instance is healthy, but uptime is only 4 days.
Digging into the EC2 console, you find it: AWS had sent a Scheduled Event notification two weeks prior. The instance was due for host maintenance. The notification sat in the console unread. The instance rebooted, dropping connections and clearing the in-memory cache.
No big deal for staging. But you check production… same setup, same lack of monitoring for scheduled events.
Someone on the team suggests: “We should move to GCP — they do live migration, this wouldn’t have happened.”
But is that the right takeaway?
TL;DR #
Cloud providers handle host maintenance with fundamentally different philosophies:
- GCP: Live migration by default — your VM pauses for <1 second, processes keep running
- AWS: Scheduled Events with advance notice — you decide when and how to handle the reboot
- Azure: Best-effort live migration — works most of the time, falls back to scheduled reboot
The safest multi-cloud baseline: assume any VM can restart at any time. Design for it.
Three philosophies of maintenance #
When a cloud provider needs to patch a hypervisor or replace failing hardware, they have two options:
Live migration: Copy your VM’s memory and state to another physical host while it’s running. Done well, your VM experiences a brief pause (<1 second), but processes keep running, TCP connections stay open, no state is lost.
Reboot: Stop the VM, perform maintenance, start it on new hardware. Processes terminate, memory is wiped, connections drop. Downtime measured in minutes.
Each cloud provider made different trade-offs:
GCP: “You won’t even notice” #
GCP live-migrates by default. Maintenance is largely invisible — you might see a brief blip in latency metrics, but your processes keep running.
“Google Cloud observes a minimum disruption time, which is typically much less than 1 second.”
— GCP Live Migration Process
The exception: GPU instances and some bare-metal configurations still require restarts.
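On GCP you can still observe upcoming maintenance from inside the VM by polling the metadata server's `maintenance-event` key. A minimal sketch, assuming the endpoint and values documented by GCP (`NONE`, `MIGRATE_ON_HOST_MAINTENANCE`, `TERMINATE_ON_HOST_MAINTENANCE`); verify against the current docs before relying on them:

```python
import urllib.request

# GCE metadata key that reports upcoming host maintenance for this VM
# (endpoint and values assumed per GCP's metadata docs).
METADATA_URL = (
    "http://metadata.google.internal/computeMetadata/v1/"
    "instance/maintenance-event"
)

def fetch_maintenance_event() -> str:
    # The metadata server requires this header; only works on a GCE VM.
    req = urllib.request.Request(
        METADATA_URL, headers={"Metadata-Flavor": "Google"}
    )
    with urllib.request.urlopen(req, timeout=2) as resp:
        return resp.read().decode().strip()

def interpret(event: str) -> str:
    """Map the raw metadata value to an operator-friendly message."""
    if event == "NONE":
        return "no maintenance planned"
    if event == "MIGRATE_ON_HOST_MAINTENANCE":
        return "live migration imminent: expect a sub-second pause"
    if event == "TERMINATE_ON_HOST_MAINTENANCE":
        return "VM will be stopped: drain work now"
    return f"unknown event: {event}"
```

Even on GCP this is worth wiring into monitoring, since (as noted above) GPU and some bare-metal instances are terminated rather than migrated.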
AWS: “We’ll tell you, you decide” #
AWS sends Scheduled Event notifications days or weeks in advance via the EC2 console, API, or EventBridge. You can:
- Wait and let AWS reboot during the maintenance window
- Reschedule to a time that works for you
- Proactively stop/start the instance to migrate on your terms
“During the reboot, the instance is migrated to a new host. This is known as a reboot migration. Typically completes in minutes.”
— AWS EC2 Scheduled Events
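Scheduled Events are also queryable via the `DescribeInstanceStatus` API, so you don't have to rely on someone reading the console. A sketch using boto3 (assumed installed and configured); the filtering helper is ours and works on the event shape the API returns:

```python
def pending_events(statuses: list[dict]) -> list[dict]:
    """Flatten DescribeInstanceStatus results into per-instance event
    records, skipping events AWS marks with '[Completed]' in the
    Description."""
    out = []
    for status in statuses:
        for event in status.get("Events", []):
            if "[Completed]" in event.get("Description", ""):
                continue
            out.append({
                "InstanceId": status["InstanceId"],
                "Code": event["Code"],          # e.g. system-reboot
                "NotBefore": event.get("NotBefore"),
            })
    return out

def check_scheduled_events(region: str = "us-east-1") -> list[dict]:
    import boto3  # imported here so the helper above is testable offline
    ec2 = boto3.client("ec2", region_name=region)
    resp = ec2.describe_instance_status(IncludeAllInstances=True)
    return pending_events(resp["InstanceStatuses"])
```

Run something like this on a schedule and alert on a non-empty result, and the fictional incident above never happens silently.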
Azure: “We’ll try” #
Azure attempts live migration for most VM types (pause typically <10 seconds). But it's best-effort: if live migration fails, or for certain instance types that don't support it, you get a scheduled reboot with a ~35-day self-service window.
“Live migration is performed on a best effort basis. In some rare cases, live migration may not succeed and the VM will be scheduled to be Service Healed.”
— Azure Maintenance and Updates
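Azure exposes upcoming maintenance to the VM itself through the Scheduled Events metadata endpoint. A sketch assuming the endpoint URL, `api-version`, and payload shape from the Scheduled Events documentation (confirm the current api-version before use):

```python
import json
import urllib.request

# Azure instance metadata endpoint for Scheduled Events
# (URL and api-version assumed per Azure docs).
SCHEDULED_EVENTS_URL = (
    "http://169.254.169.254/metadata/scheduledevents?api-version=2020-07-01"
)

def fetch_scheduled_events() -> dict:
    # Requires the Metadata header; only works on an Azure VM.
    req = urllib.request.Request(
        SCHEDULED_EVENTS_URL, headers={"Metadata": "true"}
    )
    with urllib.request.urlopen(req, timeout=2) as resp:
        return json.load(resp)

def interrupting_events(doc: dict) -> list[dict]:
    """Return events that will interrupt the VM, as opposed to a
    Freeze (brief pause, i.e. live migration succeeded)."""
    interrupting = {"Reboot", "Redeploy", "Preempt", "Terminate"}
    return [
        e for e in doc.get("Events", [])
        if e.get("EventType") in interrupting
    ]
```

This is the mechanism that gives you warning when Azure's live migration falls back to a reboot.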
What this means for SLAs #
| | GCP | AWS | Azure |
|---|---|---|---|
| Single-VM SLA | 99.9% | 99.5% | 99.9%* |
| Multi-AZ SLA | 99.99% | 99.99% | 99.99% |
| Default maintenance | Live migration | Scheduled reboot | Best-effort live migration |
| Typical disruption | <1 sec pause | Minutes of downtime | <10 sec pause, or reboot |
*Azure’s single-VM SLA depends on storage: 99.9% with Premium SSD, 99.5% with Standard SSD, 95% with Standard HDD. The SLA is determined by the lowest-tier disk attached — mix Premium with HDD and you get 95%.
The single-VM SLA difference is notable: AWS’s 99.5% vs GCP’s 99.9% reflects the reboot-based maintenance model. Interestingly, Azure with Standard SSD matches AWS at 99.5%. But deploy across 2+ AZs and all three converge to 99.99%.
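To make those percentages concrete, here is the allowed downtime per 30-day month at each tier (simple arithmetic, not a statement about how providers actually measure SLA windows):

```python
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes in a 30-day month

def allowed_downtime_minutes(sla_percent: float) -> float:
    """Downtime budget per 30-day month for a given SLA percentage."""
    return MINUTES_PER_MONTH * (1 - sla_percent / 100)

for sla in (99.99, 99.9, 99.5, 95.0):
    print(f"{sla}% -> {allowed_downtime_minutes(sla):.1f} min/month")
```

99.99% allows about 4.3 minutes per month, 99.9% about 43 minutes, 99.5% about 3.6 hours, and 95% a full day and a half. One reboot migration per month can consume most of a 99.9% budget on its own.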
The architecture lesson #
Back to our fictional incident. Was “switch to GCP” the right takeaway?
No. The real lesson is that the team had a single point of failure: one instance, no redundancy, no graceful degradation, and no monitoring of Scheduled Events. On GCP they would have survived this particular maintenance event, but they would still be vulnerable to:
- Instance crashes (hardware failure, kernel panic)
- AZ outages
- Unplanned events
The correct multi-cloud baseline is to assume any VM can restart at any time on any cloud. Then:
- If availability is key, deploy across 2+ AZs (99.99% SLA on all three providers).
- Handle SIGTERM gracefully — planned stops typically give you 30–60 seconds to drain, but unplanned outages may give no warning at all.
- Restarted VMs lose in-memory data — persist state externally (object store, database, etc.).
- Load balancers will route around restarting instances, provided you have spare capacity and correctly implemented health checks.
- Monitor scheduled instance events (EC2 Scheduled Events, Azure Scheduled Events, GCP maintenance notices) so planned maintenance never goes unread.
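The baseline above can be sketched in a few lines: catch SIGTERM, fail health checks so the load balancer drains you, persist in-memory state externally, then exit. `flush_state_to_external_store` is a placeholder for your own persistence logic:

```python
import signal
import sys
import threading

# Flipped on shutdown; a /healthz endpoint should report unhealthy
# once this is set, so the load balancer stops sending traffic.
shutting_down = threading.Event()

def flush_state_to_external_store() -> None:
    # Placeholder: persist anything held only in memory
    # (queues, caches, consumer offsets) to an external store.
    pass

def handle_sigterm(signum, frame):
    shutting_down.set()              # start failing health checks
    flush_state_to_external_store()  # use the warning window to drain
    sys.exit(0)                      # exit cleanly before a hard kill

signal.signal(signal.SIGTERM, handle_sigterm)
```

The same handler covers an AWS reboot migration, an Azure service heal, and a GCP terminate-on-maintenance instance alike — which is exactly the point of the baseline.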
Bottom line #
Design for failure on every cloud. GCP's live migration is impressive engineering, but building on the assumption that VMs never restart is fragile regardless of provider. Design for restarts, and live migration becomes a nice-to-have rather than a dependency.