🧠 AWS US-EAST-1 Outage Full Report: From DNS Resolution to EC2 Load Balancer Recovery

A complete technical report on the October 20 AWS US-EAST-1 outage – from DNS failures and network degradation to the final recovery of EC2 and Lambda.


Date: October 20, 2025
Region: US-EAST-1 (N. Virginia)
Status: 🟡 Degraded – Recovery in Progress
Category: Cloud Infrastructure / Incident Report
Tags: AWS, US-EAST-1, EC2, Lambda, Load Balancer, DNS, Networking, Cloud, DevOps, System Reliability


๐Ÿ“ Summary

In the early hours of October 20, 2025 (PDT), AWS began experiencing widespread service disruptions across the US-EAST-1 (N. Virginia) region.
The incident, which lasted several hours, impacted core services such as DynamoDB, Lambda, EC2, SQS, EventBridge, and CloudTrail.

The root cause evolved through multiple phases:

  • Initial DNS resolution issues affecting DynamoDB,
  • Followed by EC2 internal network faults,
  • And finally, a failure within AWS's internal subsystem responsible for monitoring load balancer health.

AWS engineers worked continuously, applying multiple layers of mitigation to restore connectivity, API functionality, and service reliability.


🧭 Complete Timeline of Events

🕛 12:11–2:01 AM PDT – Outage Detected and DNS Root Cause Identified

The incident began with increased error rates and latency across multiple services in the US-EAST-1 region.
AWS identified DynamoDB API resolution failures caused by a DNS malfunction within the internal routing layer.

"The issue appears to be related to DNS resolution of the DynamoDB API endpoint in US-EAST-1. We are working on multiple parallel paths to accelerate recovery."

This DNS disruption cascaded to other services such as IAM, CloudWatch, and the Support Center, leading to API timeouts and failed support case creation.
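
As a quick illustration of the symptom customers saw, the minimal Python sketch below checks whether the regional DynamoDB endpoint resolves at all. It is a client-side diagnostic only, not AWS's internal tooling; the endpoint name is the standard public one for US-EAST-1.

```python
# Minimal client-side check: can we resolve the regional DynamoDB endpoint?
# This illustrates the symptom customers saw; it is not AWS's internal tooling.
import socket

ENDPOINT = "dynamodb.us-east-1.amazonaws.com"  # standard regional endpoint

def check_dns(host: str, port: int = 443) -> None:
    try:
        # getaddrinfo performs the DNS lookup every SDK call depends on
        results = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
        addresses = sorted({r[4][0] for r in results})
        print(f"{host} resolved to: {addresses}")
    except socket.gaierror as exc:
        # This is the failure mode reported during the first phase of the incident
        print(f"DNS resolution failed for {host}: {exc}")

if __name__ == "__main__":
    check_dns(ENDPOINT)
```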


๐Ÿ• 2:22โ€“3:35 AM PDT โ€“ Partial Recovery and Backlog Management

By 2:22 AM, AWS began observing significant signs of recovery.
While most services resumed, new EC2 instance launches continued to fail, and Lambda was still processing backlogged requests.

At 3:35 AM, AWS confirmed that the DNS issue had been fully mitigated.
However, customers launching EC2 instances saw persistent "Insufficient Capacity" errors.

"The underlying DNS issue has been fully mitigated, and most AWS Service operations are succeeding normally now. Some requests may still be throttled while we work toward full resolution."
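
While throttling was in effect, clients that retried with backoff fared better than those that failed hard. The sketch below shows one way to do that with boto3's built-in retry configuration; the retry mode and attempt count are illustrative choices, not values taken from the incident report.

```python
# Sketch: configure boto3 retries so throttled requests back off instead of failing hard.
# Retry modes "standard" and "adaptive" are documented botocore features; the numbers
# below are illustrative.
import boto3
from botocore.config import Config

retry_config = Config(
    region_name="us-east-1",
    retries={
        "max_attempts": 8,      # total attempts including the first call
        "mode": "adaptive",     # client-side rate limiting on top of exponential backoff
    },
)

dynamodb = boto3.client("dynamodb", config=retry_config)

# Throttled or briefly failing calls are retried with backoff by the SDK itself.
tables = dynamodb.list_tables(Limit=10)
print(tables.get("TableNames", []))
```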

🕓 4:08–5:48 AM PDT – EC2 and Lambda Recovery Underway

Focus shifted to restoring EC2 launch reliability and resolving Lambda polling delays for SQS event source mappings.
By 5:48 AM, AWS confirmed successful EC2 launches in some Availability Zones (AZs) and that EventBridge and CloudTrail had resumed normal operations.

"We continue to recommend that customers launch EC2 Instances that are not targeted to a specific Availability Zone so that EC2 has flexibility in selecting the appropriate AZ."
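
One way to follow that guidance with boto3 is sketched below: the launch request deliberately omits an Availability Zone and subnet so EC2 can pick one with capacity. The AMI ID and instance type are placeholders, and the call assumes a default VPC (otherwise, pass your own list of subnets spread across AZs).

```python
# Sketch: launch an instance without pinning it to a specific Availability Zone,
# in line with AWS's guidance above. The AMI ID and instance type are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",  # placeholder AMI
    InstanceType="t3.micro",
    MinCount=1,
    MaxCount=1,
    # Deliberately no Placement={"AvailabilityZone": ...} and no SubnetId:
    # EC2 is free to pick any AZ (and a default-VPC subnet) with available capacity.
)

instance = response["Instances"][0]
print(instance["InstanceId"], instance["Placement"]["AvailabilityZone"])
```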

🕕 6:42–7:29 AM PDT – Network Connectivity Issues

Even as EC2 began recovering, new network connectivity issues emerged.
AWS reported API errors and connectivity drops across multiple services, especially those reliant on EC2's internal network path.

"We can confirm significant API errors and connectivity issues across multiple services in the US-EAST-1 Region. We are investigating and will provide updates soon."

By 7:29 AM, early signs of network recovery were observed.


🕖 8:04–8:43 AM PDT – Root Cause Narrowed to EC2 Internal Network

AWS engineers traced the ongoing connectivity issues to an internal EC2 subsystem.
The problem originated from the load balancer health monitoring system, a crucial internal service that ensures stable traffic routing and endpoint validation.

"The root cause is an underlying internal subsystem responsible for monitoring the health of our network load balancers."

AWS throttled EC2 instance launches while mitigation measures were deployed.
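
The faulty subsystem was internal to AWS, but customers could watch the knock-on effects on their own Network Load Balancers. The sketch below queries target health through the public Elastic Load Balancing v2 API; the target group ARN is a placeholder.

```python
# Sketch: inspect target health on your own Network Load Balancer target groups.
# This is the customer-visible counterpart of the internal health-monitoring
# subsystem described above; the target group ARN is a placeholder.
import boto3

elbv2 = boto3.client("elbv2", region_name="us-east-1")

TARGET_GROUP_ARN = (
    "arn:aws:elasticloadbalancing:us-east-1:123456789012:"
    "targetgroup/example-tg/0123456789abcdef"  # placeholder
)

health = elbv2.describe_target_health(TargetGroupArn=TARGET_GROUP_ARN)
for desc in health["TargetHealthDescriptions"]:
    target = desc["Target"]["Id"]
    state = desc["TargetHealth"]["State"]        # healthy | unhealthy | draining | ...
    reason = desc["TargetHealth"].get("Reason", "-")
    print(f"{target}: {state} ({reason})")
```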


🕘 9:13 AM PDT – Load Balancer Subsystem Recovery

At 9:13 AM, AWS confirmed that connectivity and API recovery were progressing.
They also noted that EC2 instance launches remained throttled and began applying fixes to restore normal scaling operations.

"We have taken additional mitigation steps to aid the recovery of the underlying internal subsystem responsible for monitoring the health of our network load balancers."

This marked the shift from containment to controlled restoration across key services.


🕙 10:03–10:48 AM PDT – Lambda and EC2 Fix Validation

By 10:03 AM, AWS confirmed that most services were operational, though Lambda continued to experience function invocation errors.

"Lambda is experiencing function invocation errors because an internal subsystem was impacted by the network load balancer health checks."

AWS began validating EC2 fixes in one AZ before region-wide deployment.
By 10:48 AM, additional AZs were showing signs of stability and recovery.
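
During a recovery like this, the practical way to confirm Lambda is healthy again is to watch its error and throttle metrics. The sketch below pulls the last hour of both from CloudWatch; the function name is a placeholder.

```python
# Sketch: pull the last hour of Errors and Throttles for a Lambda function,
# which is how customers could confirm the recovery described above.
# The function name is a placeholder.
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
FUNCTION_NAME = "my-example-function"  # placeholder

end = datetime.now(timezone.utc)
start = end - timedelta(hours=1)

for metric in ("Errors", "Throttles"):
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/Lambda",
        MetricName=metric,
        Dimensions=[{"Name": "FunctionName", "Value": FUNCTION_NAME}],
        StartTime=start,
        EndTime=end,
        Period=300,            # 5-minute buckets
        Statistics=["Sum"],
    )
    total = sum(dp["Sum"] for dp in stats["Datapoints"])
    print(f"{metric} over the last hour: {int(total)}")
```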


โš™๏ธ Root Cause Analysis

🧩 Phase 1: DNS Resolution Failure

The earliest disruption stemmed from internal DNS lookup issues for DynamoDB and IAM APIs.
These failures led to API request backlogs and retries, overwhelming the routing infrastructure.
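
The standard client-side defense against this kind of retry amplification is capped exponential backoff with jitter. The sketch below is a generic illustration of that pattern, not a reconstruction of AWS's internal mechanism.

```python
# Generic sketch of capped exponential backoff with full jitter, the usual way
# clients avoid amplifying an outage with synchronized retries. Not AWS-internal code.
import random
import time

def call_with_backoff(operation, max_attempts=6, base_delay=0.5, max_delay=20.0):
    """Run `operation` (a zero-argument callable), retrying transient failures."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception as exc:  # in real code, catch only retryable error types
            if attempt == max_attempts:
                raise
            # Full jitter: sleep a random amount up to the capped exponential delay,
            # so thousands of clients do not retry in lockstep.
            delay = random.uniform(0, min(max_delay, base_delay * 2 ** (attempt - 1)))
            print(f"Attempt {attempt} failed ({exc!r}); retrying in {delay:.2f}s")
            time.sleep(delay)
```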

🔧 Phase 2: EC2 Internal Network Degradation

After DNS stabilization, the internal EC2 control plane began experiencing communication delays.
This affected core services such as SQS, Lambda, and RDS, which depend on EC2 networking.

💡 Phase 3: Load Balancer Health Subsystem Fault

The final identified root cause was an internal subsystem responsible for network load balancer health monitoring.
This system incorrectly flagged healthy instances as degraded, triggering automatic throttling and impacting Lambda's internal invocation logic.


📊 Impact Summary

| Service | Impact Level | Description |
| --- | --- | --- |
| EC2 | ⚙️ Degraded | Instance launches throttled; fix validation in progress |
| Lambda | ⚠️ Impacted | Invocation delays and internal subsystem errors |
| DynamoDB | ✅ Recovered | DNS resolution issue resolved by 3:35 AM |
| SQS / EventBridge | ⚙️ Delays | Processing backlog after DNS and network recovery |
| CloudTrail / IAM | ✅ Operational | Stable after early morning mitigation |
| RDS / ECS / Glue | ⚙️ Minor Impact | Indirectly affected via EC2 instance creation dependency |

Over 80 AWS services were listed as impacted throughout the event, spanning compute, storage, database, networking, and AI/ML categories.
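
For accounts on a Business or Enterprise support plan, the same per-service status can also be pulled programmatically from the AWS Health API, as sketched below; the filter values are illustrative.

```python
# Sketch: list open AWS Health events for us-east-1. The AWS Health API requires a
# Business or Enterprise support plan; filter values here are illustrative.
import boto3

# AWS Health is a global service whose API is served via us-east-1.
health = boto3.client("health", region_name="us-east-1")

events = health.describe_events(
    filter={
        "regions": ["us-east-1"],
        "eventStatusCodes": ["open", "upcoming"],
    }
)

for event in events["events"]:
    print(event["service"], event["eventTypeCode"], event["statusCode"])
```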


๐Ÿ› ๏ธ AWS Engineering Response

AWS engineers deployed a multi-phase recovery strategy, which included:

  1. Parallel DNS Recovery Paths
    • Restored endpoint resolution through redundant routing.
  2. Network Rate Limiting and Throttling Control
    • Reduced EC2 traffic to prevent overloads.
  3. Subsystem Reboot and Validation
    • Restarted and verified the load balancer health system.
  4. Gradual AZ-based Recovery
    • Validated fixes zone-by-zone to ensure controlled restoration.
  5. Backlog Clearance for Event-driven Services
    • Sequentially processed Lambda, EventBridge, and SQS queues (see the backlog-monitoring sketch after this list).
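
A simple way to observe that last step from the customer side is to track queue depth while the backlog drains, as in the sketch below; the queue URL is a placeholder.

```python
# Sketch: watch an SQS queue drain after an event like this one. The queue URL is a
# placeholder; ApproximateNumberOfMessages is the standard backlog-depth attribute.
import time

import boto3

sqs = boto3.client("sqs", region_name="us-east-1")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/example-queue"  # placeholder

for _ in range(10):  # ten samples, one per minute
    attrs = sqs.get_queue_attributes(
        QueueUrl=QUEUE_URL,
        AttributeNames=[
            "ApproximateNumberOfMessages",
            "ApproximateNumberOfMessagesNotVisible",
        ],
    )["Attributes"]
    visible = attrs["ApproximateNumberOfMessages"]
    in_flight = attrs["ApproximateNumberOfMessagesNotVisible"]
    print(f"backlog={visible} in_flight={in_flight}")
    time.sleep(60)
```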

🧠 Lessons Learned

1. Interdependence Across Layers

Even isolated components like DNS or load balancer health checks can trigger cross-service disruptions.

2. Importance of Health Telemetry

AWS's internal health monitoring systems are critical; errors here cause cascading throttles and false "unhealthy" signals.
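
On the customer side, the equivalent telemetry is an alarm on your own load balancer health metrics. The sketch below creates a CloudWatch alarm on an NLB target group's UnHealthyHostCount; the dimension values and SNS topic ARN are placeholders.

```python
# Sketch: alarm when an NLB target group reports unhealthy targets. Dimension values
# and the SNS topic ARN are placeholders for illustration.
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

cloudwatch.put_metric_alarm(
    AlarmName="nlb-unhealthy-targets",
    Namespace="AWS/NetworkELB",
    MetricName="UnHealthyHostCount",
    Dimensions=[
        {"Name": "LoadBalancer", "Value": "net/example-nlb/0123456789abcdef"},
        {"Name": "TargetGroup", "Value": "targetgroup/example-tg/0123456789abcdef"},
    ],
    Statistic="Maximum",
    Period=60,
    EvaluationPeriods=3,          # three consecutive minutes above threshold
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:example-alerts"],  # placeholder
)
```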

3. Architect for Multi-AZ and Multi-Region Resilience

Workloads distributed across multiple Availability Zones and regions are significantly more resilient during partial outages.
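
A minimal sketch of that principle: an Auto Scaling group whose subnets span three Availability Zones, so capacity problems in one zone do not take down the whole fleet. All names and IDs below are placeholders.

```python
# Sketch: an Auto Scaling group spanning three Availability Zones, so the loss or
# throttling of one AZ does not take out the whole fleet. All IDs are placeholders.
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="web-fleet",                       # placeholder name
    LaunchTemplate={
        "LaunchTemplateName": "web-template",               # placeholder template
        "Version": "$Latest",
    },
    MinSize=3,
    MaxSize=9,
    DesiredCapacity=3,
    # One subnet per AZ; EC2 Auto Scaling rebalances across whatever is healthy.
    VPCZoneIdentifier="subnet-aaa111,subnet-bbb222,subnet-ccc333",
)
```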

4. Transparent Communication

AWS's frequent, timestamped updates allowed customers to track progress clearly – a best practice in cloud reliability management.


🔚 Conclusion

The AWS US-EAST-1 outage on October 20, 2025, was a multi-layered incident that began with DNS resolution errors, evolved into EC2 internal networking issues, and culminated in a load balancer health subsystem failure.

Despite affecting dozens of services, AWS engineers demonstrated rapid diagnostics, transparent communication, and controlled recovery practices.

This incident reinforces a key principle of modern cloud design:

"Redundancy without observability is fragility."

Every system, no matter how large, depends on visibility, telemetry, and layered recovery planning.


📘 Full source: AWS Service Health Dashboard – US-EAST-1 Incident
🔗 Extended analysis: Dargslan Publishing: AWS US-EAST-1 Outage Report
💡 Learn more about AWS architecture and reliability: dargslan.com