AWS US-EAST-1 Outage Full Report: From DNS Resolution to EC2 Load Balancer Recovery
A complete technical report on the October 20 AWS US-EAST-1 outage, from DNS failures and network degradation to the final recovery of EC2 and Lambda.
Date: October 20, 2025
Region: US-EAST-1 (N. Virginia)
Status: Degraded (Recovery in Progress)
Category: Cloud Infrastructure / Incident Report
Tags: AWS, US-EAST-1, EC2, Lambda, Load Balancer, DNS, Networking, Cloud, DevOps, System Reliability
Summary
In the early hours of October 20, 2025 (PDT), AWS began experiencing widespread service disruptions across the US-EAST-1 (N. Virginia) region.
The incident, which lasted several hours, impacted core services such as DynamoDB, Lambda, EC2, SQS, EventBridge, and CloudTrail.
The root cause evolved through multiple phases:
- Initial DNS resolution issues affecting DynamoDB,
- Followed by EC2 internal network faults,
- And finally, a failure within AWS's internal subsystem responsible for monitoring load balancer health.
AWS engineers worked continuously, applying multiple layers of mitigation to restore connectivity, API functionality, and service reliability.
Complete Timeline of Events
12:11–2:01 AM PDT – Outage Detected and DNS Root Cause Identified
The incident began with increased error rates and latency across multiple services in the US-EAST-1 region.
AWS identified DynamoDB API resolution failures caused by a DNS malfunction within the internal routing layer.
"The issue appears to be related to DNS resolution of the DynamoDB API endpoint in US-EAST-1. We are working on multiple parallel paths to accelerate recovery."
This DNS disruption cascaded to other services like IAM, CloudWatch, and Support Center, leading to API timeouts and failed case creations.
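For reference, the endpoint in question follows AWS's public naming convention, dynamodb.us-east-1.amazonaws.com. The following is a minimal diagnostic sketch, not AWS tooling or part of the incident response, that simply checks whether that name resolves from a client's point of view:

```python
# Minimal sketch: check whether the regional DynamoDB endpoint resolves.
# The endpoint name follows the public dynamodb.<region>.amazonaws.com
# convention; this is a client-side diagnostic, not AWS internal tooling.
import socket
import sys

ENDPOINT = "dynamodb.us-east-1.amazonaws.com"

def check_resolution(hostname: str) -> bool:
    try:
        infos = socket.getaddrinfo(hostname, 443, proto=socket.IPPROTO_TCP)
        addresses = sorted({info[4][0] for info in infos})
        print(f"{hostname} resolved to: {', '.join(addresses)}")
        return True
    except socket.gaierror as exc:
        print(f"DNS resolution failed for {hostname}: {exc}", file=sys.stderr)
        return False

if __name__ == "__main__":
    sys.exit(0 if check_resolution(ENDPOINT) else 1)
```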
2:22–3:35 AM PDT – Partial Recovery and Backlog Management
By 2:22 AM, AWS began observing significant signs of recovery.
While most services resumed, new EC2 instance launches continued to fail, and Lambda was still processing backlogged requests.
At 3:35 AM, AWS confirmed that the DNS issue had been fully mitigated.
However, customers launching EC2 instances saw persistent "Insufficient Capacity" errors.
"The underlying DNS issue has been fully mitigated, and most AWS Service operations are succeeding normally now. Some requests may still be throttled while we work toward full resolution."
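For clients hit by the residual throttling the update mentions, one standard mitigation is to lean on boto3's built-in retry configuration. The sketch below is illustrative only; the adaptive retry mode and attempt count are assumptions, not guidance taken from the incident updates:

```python
# Minimal sketch: configure boto3's built-in retries so throttled requests
# back off instead of failing immediately. The "adaptive" mode and
# max_attempts value are illustrative choices.
import boto3
from botocore.config import Config

retry_config = Config(
    region_name="us-east-1",
    retries={
        "mode": "adaptive",   # client-side rate limiting plus exponential backoff
        "max_attempts": 10,   # allow extra attempts while the region is degraded
    },
)

dynamodb = boto3.client("dynamodb", config=retry_config)

# Calls through this client transparently retry throttling and transient errors.
response = dynamodb.list_tables(Limit=10)
print(response.get("TableNames", []))
```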
4:08–5:48 AM PDT – EC2 and Lambda Recovery Underway
Focus shifted to restoring EC2 launch reliability and resolving Lambda polling delays for SQS event source mappings.
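Customers who wanted to confirm whether their own SQS-triggered functions were affected could inspect the state of each event source mapping through the standard Lambda API. A minimal sketch, assuming default credentials and the us-east-1 region:

```python
# Minimal sketch: list Lambda event source mappings and report their state
# (e.g. Enabled, Disabled, Updating) to spot pollers that are not running.
import boto3

lambda_client = boto3.client("lambda", region_name="us-east-1")

paginator = lambda_client.get_paginator("list_event_source_mappings")
for page in paginator.paginate():
    for mapping in page["EventSourceMappings"]:
        print(
            mapping["UUID"],
            mapping.get("EventSourceArn", "<self-managed source>"),
            mapping["State"],
        )
```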
By 5:48 AM, AWS confirmed successful EC2 launches in some Availability Zones (AZs) and that EventBridge and CloudTrail had resumed normal operations.
"We continue to recommend that customers launch EC2 Instances that are not targeted to a specific Availability Zone so that EC2 has flexibility in selecting the appropriate AZ."
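In practice, that recommendation means omitting any explicit Availability Zone from the launch request. A minimal sketch with boto3; the AMI ID and instance type are placeholders, not values from the incident:

```python
# Minimal sketch of the AZ-flexible launch AWS recommended: omit any explicit
# Placement/AvailabilityZone so EC2 can pick a zone with available capacity.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",  # placeholder AMI
    InstanceType="t3.micro",          # placeholder instance type
    MinCount=1,
    MaxCount=1,
    # No Placement={"AvailabilityZone": ...} and no SubnetId: EC2 selects
    # an Availability Zone in the default VPC on the customer's behalf.
)

instance = response["Instances"][0]
print(instance["InstanceId"], instance["Placement"]["AvailabilityZone"])
```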
6:42–7:29 AM PDT – Network Connectivity Issues
Even as EC2 began recovering, new network connectivity issues emerged.
AWS reported API errors and connectivity drops across multiple services, especially those reliant on EC2's internal network path.
"We can confirm significant API errors and connectivity issues across multiple services in the US-EAST-1 Region. We are investigating and will provide updates soon."
By 7:29 AM, early signs of network recovery were observed.
8:04–8:43 AM PDT – Root Cause Narrowed to EC2 Internal Network
AWS engineers traced the ongoing connectivity issues to an internal EC2 subsystem.
The problem originated from the load balancer health monitoring system, a crucial internal service that ensures stable traffic routing and endpoint validation.
"The root cause is an underlying internal subsystem responsible for monitoring the health of our network load balancers."
AWS throttled EC2 instance launches while mitigation measures were deployed.
9:13 AM PDT – Load Balancer Subsystem Recovery
At 9:13 AM, AWS confirmed that connectivity and API recovery were progressing.
Engineers also identified throttling affecting new EC2 instance launches and began applying fixes to restore normal scaling operations.
"We have taken additional mitigation steps to aid the recovery of the underlying internal subsystem responsible for monitoring the health of our network load balancers."
This marked the shift from containment to controlled restoration across key services.
10:03–10:48 AM PDT – Lambda and EC2 Fix Validation
By 10:03 AM, AWS confirmed that most services were operational, though Lambda continued to experience function invocation errors.
"Lambda is experiencing function invocation errors because an internal subsystem was impacted by the network load balancer health checks."
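Customers could gauge their own exposure from the AWS/Lambda Errors metric in CloudWatch. A minimal sketch, with the function name as a placeholder:

```python
# Minimal sketch: pull a function's recent invocation error counts from
# CloudWatch to see whether it was affected during the incident window.
import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

now = datetime.now(timezone.utc)
stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/Lambda",
    MetricName="Errors",
    Dimensions=[{"Name": "FunctionName", "Value": "example-function"}],  # placeholder
    StartTime=now - timedelta(hours=3),
    EndTime=now,
    Period=300,            # 5-minute buckets
    Statistics=["Sum"],
)

for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"].isoformat(), int(point["Sum"]))
```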
AWS began validating EC2 fixes in one AZ before region-wide deployment.
By 10:48 AM, additional AZs were showing signs of stability and recovery.
Root Cause Analysis
Phase 1: DNS Resolution Failure
The earliest disruption stemmed from internal DNS lookup issues for DynamoDB and IAM APIs.
These failures led to API request backlogs and retries, overwhelming the routing infrastructure.
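This is the classic retry-storm failure mode: clients that retry with capped exponential backoff and jitter put far less pressure on an already degraded dependency. A minimal sketch of the pattern, with illustrative parameters rather than anything from AWS's internal configuration:

```python
# Minimal sketch of capped exponential backoff with full jitter, the usual
# pattern for avoiding synchronized retry storms against a degraded service.
import random
import time

def call_with_backoff(operation, max_attempts=6, base_delay=0.5, max_delay=20.0):
    """Retry `operation` with capped exponential backoff and full jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception as exc:  # in real code, catch only retryable errors
            if attempt == max_attempts:
                raise
            delay = random.uniform(0, min(max_delay, base_delay * 2 ** attempt))
            print(f"attempt {attempt} failed ({exc}); sleeping {delay:.2f}s")
            time.sleep(delay)

# Usage example (hypothetical client call):
# call_with_backoff(lambda: dynamodb.describe_table(TableName="orders"))
```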
Phase 2: EC2 Internal Network Degradation
After DNS stabilization, the internal EC2 control plane began experiencing communication delays.
This affected core services such as SQS, Lambda, and RDS, which depend on EC2 networking.
Phase 3: Load Balancer Health Subsystem Fault
The final identified root cause was an internal subsystem responsible for network load balancer health monitoring.
This system incorrectly flagged healthy instances as degraded, triggering automatic throttling and impacting Lambda's internal invocation logic.
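The internal subsystem itself is not customer-visible, but the same symptom, healthy targets reported as unhealthy, can be checked on customer-managed load balancers. A minimal sketch using a placeholder target group ARN:

```python
# Minimal sketch: inspect how a customer-managed load balancer currently
# reports the health of its targets. The target group ARN is a placeholder.
import boto3

elbv2 = boto3.client("elbv2", region_name="us-east-1")

TARGET_GROUP_ARN = (
    "arn:aws:elasticloadbalancing:us-east-1:123456789012:"
    "targetgroup/example/abcdef1234567890"  # placeholder
)

health = elbv2.describe_target_health(TargetGroupArn=TARGET_GROUP_ARN)
for desc in health["TargetHealthDescriptions"]:
    target = desc["Target"]["Id"]
    state = desc["TargetHealth"]["State"]        # healthy | unhealthy | draining | ...
    reason = desc["TargetHealth"].get("Reason", "")
    print(f"{target}: {state} {reason}")
```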
Impact Summary
| Service | Impact Level | Description |
|---|---|---|
| EC2 | Degraded | Instance launches throttled; fix validation in progress |
| Lambda | Impacted | Invocation delays and internal subsystem errors |
| DynamoDB | Recovered | DNS resolution issue resolved by 3:35 AM |
| SQS / EventBridge | Delays | Processing backlog after DNS and network recovery |
| CloudTrail / IAM | Operational | Stable after early-morning mitigation |
| RDS / ECS / Glue | Minor Impact | Indirectly affected via EC2 instance creation dependency |
Over 80 AWS services were listed as impacted throughout the event, spanning compute, storage, database, networking, and AI/ML categories.
AWS Engineering Response
AWS engineers deployed a multi-phase recovery strategy, which included:
- Parallel DNS Recovery Paths: restored endpoint resolution through redundant routing.
- Network Rate Limiting and Throttling Control: reduced EC2 traffic to prevent overloads.
- Subsystem Reboot and Validation: restarted and verified the load balancer health monitoring system.
- Gradual AZ-based Recovery: validated fixes zone by zone to ensure controlled restoration.
- Backlog Clearance for Event-driven Services: sequentially processed Lambda, EventBridge, and SQS queues (a monitoring sketch follows this list).
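As referenced in the last item above, customers could watch their own backlogs drain rather than relying solely on the status page. A minimal sketch that polls an SQS queue's approximate depth; the queue URL and threshold are placeholders:

```python
# Minimal sketch: poll an SQS queue's approximate backlog until it returns
# to a normal level after an event like this one.
import time
import boto3

sqs = boto3.client("sqs", region_name="us-east-1")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/example-queue"  # placeholder

def backlog_depth(queue_url: str) -> int:
    attrs = sqs.get_queue_attributes(
        QueueUrl=queue_url,
        AttributeNames=["ApproximateNumberOfMessages"],
    )
    return int(attrs["Attributes"]["ApproximateNumberOfMessages"])

while (depth := backlog_depth(QUEUE_URL)) > 100:  # 100 is an arbitrary threshold
    print(f"backlog: ~{depth} messages; waiting for processing to catch up")
    time.sleep(30)
print("backlog cleared")
```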
Lessons Learned
1. Interdependence Across Layers
Even isolated components like DNS or load balancer health checks can trigger cross-service disruptions.
2. Importance of Health Telemetry
AWS's internal health monitoring systems are critical; errors here cause cascading throttles and false "unhealthy" signals.
3. Architect for Multi-AZ and Multi-Region Resilience
Workloads distributed across multiple Availability Zones and regions are significantly more resilient during partial outages (a minimal sketch follows this list).
4. Transparent Communication
AWS's frequent, timestamped updates allowed customers to track progress clearly, a best practice in cloud reliability management.
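As a concrete illustration of lesson 3, the sketch below creates an Auto Scaling group spread across several Availability Zones so that a single-AZ impairment does not take the workload down. The launch template name and subnet IDs are placeholders:

```python
# Minimal sketch of the multi-AZ pattern: an Auto Scaling group spanning
# several us-east-1 subnets so capacity survives a single-AZ problem.
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="web-multi-az",  # placeholder name
    LaunchTemplate={"LaunchTemplateName": "web-template", "Version": "$Latest"},  # placeholder
    MinSize=3,
    MaxSize=9,
    DesiredCapacity=3,
    # One subnet per Availability Zone; EC2 Auto Scaling balances instances
    # across them and steers new launches away from impaired zones.
    VPCZoneIdentifier="subnet-aaaa1111,subnet-bbbb2222,subnet-cccc3333",  # placeholders
)
print("created Auto Scaling group web-multi-az across three AZs")
```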
Conclusion
The AWS US-EAST-1 outage on October 20, 2025, was a multi-layered incident that began with DNS resolution errors, evolved into EC2 internal networking issues, and culminated in a load balancer health subsystem failure.
Although the incident affected dozens of services, AWS engineers demonstrated rapid diagnostics, transparent communication, and controlled recovery practices.
This incident reinforces a key principle of modern cloud design:
"Redundancy without observability is fragility."
Every system, no matter how large, depends on visibility, telemetry, and layered recovery planning.
Full source: AWS Service Health Dashboard (US-EAST-1 Incident)
Extended analysis: Dargslan Publishing – AWS US-EAST-1 Outage Report
Learn more about AWS architecture and reliability: dargslan.com