Recently, I encountered a fascinating incident where some API endpoints received unexpected traffic. Services were getting requests they weren’t configured to handle, which shouldn’t have been possible with our ALB routing rules. Here’s what I learned.
Details of this incident and the service involved have been anonymized so they don't distract from the key learning.
TL;DR #
When all weighted DNS records with weight > 0 become unhealthy, Route53 falls back to considering zero-weighted records. This can unexpectedly activate legacy request flow with different routing behaviors. Always fully decommission legacy infrastructure rather than leaving it in DNS with zero weights.
Infrastructure setup #
Our infrastructure had two generations of load balancers running side by side. The current generation uses an Application Load Balancer (ALB) with routing rules; our primary Route53 record points to it, weighted at 100.
Then there’s the legacy part. It uses Classic Elastic Load Balancers (ELBs) with an application-level proxy handling the routing. This is pointed to by a legacy Route53 record weighted at 0. We kept this around as part of our migration strategy.
Since the migration, the ALB routing rules have drifted from the legacy application-level proxy.
Both records had “Evaluate target health” enabled, which was a crucial detail.
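For illustration, the two weighted alias records looked roughly like this (the domain, zone ID, and load balancer DNS names below are hypothetical, not our actual values); note that both set `EvaluateTargetHealth` to true:

```json
[
  {
    "Name": "api.example.com.",
    "Type": "A",
    "SetIdentifier": "primary-alb",
    "Weight": 100,
    "AliasTarget": {
      "HostedZoneId": "Z35SXDOTRQ7X7K",
      "DNSName": "primary-alb-123456.us-east-1.elb.amazonaws.com.",
      "EvaluateTargetHealth": true
    }
  },
  {
    "Name": "api.example.com.",
    "Type": "A",
    "SetIdentifier": "legacy-elb",
    "Weight": 0,
    "AliasTarget": {
      "HostedZoneId": "Z35SXDOTRQ7X7K",
      "DNSName": "legacy-elb-654321.us-east-1.elb.amazonaws.com.",
      "EvaluateTargetHealth": true
    }
  }
]
```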
The incident chain #
It started with services receiving traffic they weren't configured to handle. Given our ALB routing rules, this shouldn't have been possible – a clear sign something unusual was happening with our traffic flow.
1. Health check behavior #
Route53 marks an ALB record unhealthy if any target group becomes unhealthy. The target group was configured with a minimum healthy target count of 1, which made the ALB quite sensitive to backend health changes.
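A minimal sketch of that aggregation, as we observed it (the record shape and threshold here are my own model of our configuration, not Route53's actual implementation):

```python
def alb_record_healthy(target_groups, min_healthy=1):
    """With "Evaluate target health" enabled, the alias record is treated as
    unhealthy as soon as any target group behind the ALB drops below the
    minimum healthy target count (1, in our case)."""
    return all(tg["healthy_targets"] >= min_healthy for tg in target_groups)

# One empty target group is enough to take the whole record out of rotation,
# even if every other target group is fully healthy.
print(alb_record_healthy([{"healthy_targets": 3}, {"healthy_targets": 0}]))  # False
```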
2. Zero-weight record activation #
When our primary record (weight=100) became unhealthy, something unexpected happened. Route53 started considering our zero-weighted legacy record, and traffic began failing over to the Classic ELBs.
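This fallback can be sketched as a toy model of Route53's weighted record selection. This is my mental model of the documented behavior, not AWS's actual implementation, and the record names are hypothetical:

```python
import random

def select_record(records):
    """Pick a weighted record roughly the way Route53 does.

    records: list of (name, weight, healthy) tuples.
    Healthy records with weight > 0 are chosen in proportion to their weight.
    If none of those are healthy, Route53 falls back to considering the
    remaining records, which is how zero-weighted records become viable.
    """
    candidates = [r for r in records if r[2] and r[1] > 0]
    if not candidates:
        # Fallback: zero-weighted (and, failing that, all) records activate.
        candidates = [r for r in records if r[2]] or records
    total = sum(r[1] for r in candidates)
    if total == 0:
        return random.choice(candidates)[0]
    pick = random.uniform(0, total)
    for name, weight, _ in candidates:
        pick -= weight
        if pick <= 0:
            return name
    return candidates[-1][0]

# Normal operation: all traffic goes to the weight-100 record.
print(select_record([("primary-alb", 100, True), ("legacy-elb", 0, True)]))
# primary-alb

# Primary unhealthy: the zero-weighted legacy record springs into action.
print(select_record([("primary-alb", 100, False), ("legacy-elb", 0, True)]))
# legacy-elb
```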
3. Different routing behavior #
Post-migration, we had added new routing rules to the ALB, so it wasn’t a 1:1 match with our previous application proxy setup. When requests started going through the legacy application-level proxy, services began receiving traffic unintended for them.
4. Auto-recovery #
The service restored itself when backend health recovered and traffic returned to the primary ALB path.
Key learning #
Zero-weighted DNS records remain viable failover targets and spring into action when higher-weighted records become unhealthy.
While this behavior is documented in AWS docs, it’s not immediately obvious and can catch you off guard. The lesson? Don’t keep legacy infrastructure around with zero weights – either maintain it properly or decommission it completely.
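If you do decommission, removing the zero-weighted record is a single Route53 change batch (identifiers below are hypothetical, and a DELETE must match the existing record set exactly):

```json
{
  "Changes": [
    {
      "Action": "DELETE",
      "ResourceRecordSet": {
        "Name": "api.example.com.",
        "Type": "A",
        "SetIdentifier": "legacy-elb",
        "Weight": 0,
        "AliasTarget": {
          "HostedZoneId": "Z35SXDOTRQ7X7K",
          "DNSName": "legacy-elb-654321.us-east-1.elb.amazonaws.com.",
          "EvaluateTargetHealth": true
        }
      }
    }
  ]
}
```

Applied with `aws route53 change-resource-record-sets --hosted-zone-id <zone-id> --change-batch file://delete-legacy.json`.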
Further reading #
Want to dive deeper into how Route53 health checks and failover work? Check out the Route53 Health Checks Documentation, ALB Target Group Documentation, or DNS Failover Configuration Guide.