✅ [RESOLVED] AWS US-EAST-1 Outage: Full Post-Incident Report and Timeline

AWS confirms full recovery after the US-EAST-1 outage that began with a DynamoDB DNS failure and escalated into EC2 and load balancer impairments. Full timeline, analysis, and engineering lessons inside.


Date: October 20, 2025
Region: US-EAST-1 (N. Virginia)
Status: 🟢 Fully Recovered as of 3:01 PM PDT
Tags: AWS, US-EAST-1, Outage, EC2, Lambda, Load Balancer, DNS, Cloud, DevOps, Incident Analysis


🧭 Executive Summary

Between 11:49 PM PDT on October 19 and 3:01 PM PDT on October 20, 2025, AWS experienced a multi-layered outage in the US-EAST-1 (N. Virginia) region.
The incident began as a DNS resolution failure for the DynamoDB API endpoints, escalated into impairments of EC2’s internal launch subsystem, and peaked with a failure in the internal system that monitors network load balancer health, disrupting connectivity for multiple AWS services.

More than 80 services were affected, including Lambda, EC2, SQS, DynamoDB, CloudWatch, and IAM.
By 3:01 PM PDT, all AWS services had been restored; only a few (AWS Config, Redshift, and Connect) were still working through message backlogs.


🕒 Full Timeline of the AWS US-EAST-1 Outage

🕛 11:49 PM – 2:24 AM PDT: DNS Resolution Failure

  • The incident started with increased error rates and latency across multiple AWS services.
  • At 12:26 AM, AWS identified the trigger: DNS resolution issues for the regional DynamoDB service endpoints.
  • Global services relying on US-EAST-1 endpoints (e.g., IAM and DynamoDB Global Tables) were also impacted.
“At 12:26 AM, we identified the trigger of the event as DNS resolution issues for the regional DynamoDB service endpoints.”
  • The DNS issue was fully resolved by 2:24 AM PDT, restoring basic API operations.
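As a simple illustration of this failure mode, the sketch below probes whether the regional DynamoDB endpoint resolves at all, the kind of client-side DNS canary that would have started failing at 11:49 PM. The endpoint name and the script are illustrative; this is not AWS tooling.

```python
# Minimal sketch (not AWS tooling): check whether the regional DynamoDB
# endpoint resolves, the way a simple client-side DNS canary might.
import socket

ENDPOINT = "dynamodb.us-east-1.amazonaws.com"

def resolves(hostname):
    """Return True if the hostname resolves to at least one address."""
    try:
        return len(socket.getaddrinfo(hostname, 443, proto=socket.IPPROTO_TCP)) > 0
    except socket.gaierror:   # raised when DNS resolution fails outright
        return False

if __name__ == "__main__":
    print(ENDPOINT, "resolves" if resolves(ENDPOINT) else "DNS resolution FAILED")
```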

🕒 2:24 AM – 8:00 AM PDT: EC2 Internal Network and Load Balancer Failures

After resolving the DNS issue, AWS discovered secondary failures within EC2’s internal launch subsystem — a system dependent on DynamoDB for internal coordination.

  • 4:00–6:00 AM: EC2 launches were throttled; customers experienced “Insufficient Capacity” errors.
  • Lambda and SQS Event Source Mappings were also impacted due to their reliance on EC2’s internal API routes.
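For teams that hit the “Insufficient Capacity” errors described above, a common client-side workaround is to retry the launch in another Availability Zone. The sketch below shows one minimal version of that pattern; the AMI ID, instance type, and AZ list are placeholder assumptions.

```python
# Minimal sketch, assuming placeholder values for AMI_ID, INSTANCE_TYPE, and
# the AZ list: fall back to another Availability Zone when a launch is
# rejected with InsufficientInstanceCapacity.
import boto3
from botocore.exceptions import ClientError

AMI_ID = "ami-0123456789abcdef0"          # placeholder AMI ID
INSTANCE_TYPE = "m5.large"                # placeholder instance type
CANDIDATE_AZS = ["us-east-1a", "us-east-1b", "us-east-1c"]

ec2 = boto3.client("ec2", region_name="us-east-1")

def launch_with_az_fallback():
    """Try each candidate AZ in turn; return the instance ID on success."""
    for az in CANDIDATE_AZS:
        try:
            resp = ec2.run_instances(
                ImageId=AMI_ID,
                InstanceType=INSTANCE_TYPE,
                MinCount=1,
                MaxCount=1,
                Placement={"AvailabilityZone": az},
            )
            return resp["Instances"][0]["InstanceId"]
        except ClientError as err:
            if err.response["Error"]["Code"] == "InsufficientInstanceCapacity":
                continue                  # no capacity here, try the next AZ
            raise                         # anything else is a real error
    raise RuntimeError("no candidate AZ had capacity for this instance type")
```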

By 8:04 AM PDT, AWS engineers confirmed the root cause of the ongoing network issues was an internal subsystem responsible for monitoring load balancer health.

“The root cause is an underlying internal subsystem responsible for monitoring the health of our network load balancers.”

This failure triggered false degradation signals, disrupting network connectivity between AWS services.
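To make the health-check angle concrete, the sketch below reads target health from the ELBv2 API and only treats a target as degraded after several consecutive unhealthy readings, a debouncing pattern that guards against acting on a single false signal. The target group ARN and thresholds are hypothetical; this is not a reconstruction of AWS’s internal monitor.

```python
# Minimal sketch, assuming a hypothetical target group ARN: require several
# consecutive unhealthy readings before treating a target as degraded, so a
# single false signal does not cascade.
import time
import boto3

TARGET_GROUP_ARN = (
    "arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/example/abc123"
)
CONSECUTIVE_FAILURES_REQUIRED = 3

elbv2 = boto3.client("elbv2", region_name="us-east-1")

def unhealthy_targets():
    """Return the IDs of targets currently reported unhealthy."""
    resp = elbv2.describe_target_health(TargetGroupArn=TARGET_GROUP_ARN)
    return {
        desc["Target"]["Id"]
        for desc in resp["TargetHealthDescriptions"]
        if desc["TargetHealth"]["State"] == "unhealthy"
    }

def confirmed_unhealthy(poll_interval=10.0):
    """Only report targets that stay unhealthy across several polls."""
    streaks = {}
    for attempt in range(CONSECUTIVE_FAILURES_REQUIRED):
        bad = unhealthy_targets()
        streaks = {target: streaks.get(target, 0) + 1 for target in bad}
        if attempt < CONSECUTIVE_FAILURES_REQUIRED - 1:
            time.sleep(poll_interval)    # space out the readings
    return {t for t, n in streaks.items() if n >= CONSECUTIVE_FAILURES_REQUIRED}
```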


🕒 8:43 AM – 10:03 AM PDT: Throttling and Network Mitigations

AWS throttled new EC2 instance launches to stabilize the internal load balancer subsystem.
Connectivity recovery began across major services such as DynamoDB, SQS, and Amazon Connect.

At 10:03 AM PDT, AWS confirmed progress:

“We continue to apply mitigation steps for network load balancer health and recovering connectivity for most AWS services. Lambda is experiencing function invocation errors because an internal subsystem was impacted.”
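On the customer side, the most practical response to a throttled control plane is to let the SDK back off rather than hammering the API. A minimal sketch follows; the attempt count is an illustrative tuning choice, not AWS guidance.

```python
# Minimal sketch: enable botocore's adaptive retry mode so control-plane calls
# back off automatically when the API is throttled.
import boto3
from botocore.config import Config

retry_config = Config(
    region_name="us-east-1",
    retries={
        "max_attempts": 10,   # total attempts, including the initial call
        "mode": "adaptive",   # client-side rate limiting plus exponential backoff
    },
)

ec2 = boto3.client("ec2", config=retry_config)

# Reads such as DescribeInstances now retry with backoff instead of failing
# fast when the API returns throttling errors.
reservations = ec2.describe_instances(MaxResults=5).get("Reservations", [])
print(f"described {len(reservations)} reservations")
```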

🕒 10:38 AM – 12:15 PM PDT: Early Recovery Signs

  • EC2’s internal systems showed early recovery in several Availability Zones.
  • AWS gradually reduced throttles for EC2 launches and increased the rate of Lambda SQS polling.
  • Network connectivity improved region-wide, though intermittent function errors persisted for Lambda.

By 12:15 PM PDT, instance launches succeeded across multiple AZs, and Lambda invocations were stabilizing.
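The gradual increase in Lambda’s SQS polling mirrors a knob customers can turn themselves: the maximum concurrency of an SQS event source mapping. A minimal sketch, assuming a hypothetical mapping UUID and an illustrative ramp schedule:

```python
# Minimal sketch, assuming a hypothetical event source mapping UUID and an
# illustrative ramp schedule: cap, then gradually raise, the concurrency an
# SQS-triggered Lambda function may use while a backlog drains.
import time
import boto3

MAPPING_UUID = "11111111-2222-3333-4444-555555555555"   # placeholder UUID
lam = boto3.client("lambda", region_name="us-east-1")

def set_max_concurrency(limit):
    """Cap concurrent invocations driven by the SQS event source."""
    lam.update_event_source_mapping(
        UUID=MAPPING_UUID,
        ScalingConfig={"MaximumConcurrency": limit},   # minimum allowed is 2
    )

# Step the cap back up as downstream health improves.
for limit in (2, 10, 50):
    set_max_concurrency(limit)
    time.sleep(300)   # illustrative pacing between steps
```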


🕒 1:03 PM – 2:48 PM PDT: Broad Recovery Phase

  • EC2 throttles were progressively reduced to near-normal levels.
  • Lambda invocation errors were fully resolved.
  • SQS polling rates were restored to pre-event levels.
  • ECS, Glue, and Redshift resumed processing delayed instance launches and analytics jobs.
  • Amazon Connect returned to normal operations for voice and chat sessions.
“We can confirm that Connect is handling new voice and chat sessions normally. There is a backlog of analytics and reporting data that we must process.”

🕒 3:01 PM PDT – Full Resolution

By 3:01 PM PDT, AWS declared full recovery across all services in the US-EAST-1 Region.
Residual backlogs for AWS Config, Redshift, and Connect were expected to clear within hours.

“By 3:01 PM, all AWS services returned to normal operations. Some services such as AWS Config, Redshift, and Connect continue to have a backlog of messages that they will finish processing over the next few hours.”
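The backlog-processing tail of the incident is worth illustrating: once a dependency recovers, consumers simply have more queued work than usual to chew through. A minimal sketch of draining an SQS backlog, assuming a hypothetical queue URL and a caller-supplied handler:

```python
# Minimal sketch, assuming a hypothetical queue URL and a caller-supplied
# handler: drain a message backlog that built up while a dependency was
# impaired, deleting each message only after it is processed.
import boto3

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/backlog-queue"
sqs = boto3.client("sqs", region_name="us-east-1")

def drain_backlog(handler):
    """Pull batches until the queue is empty; return how many were handled."""
    handled = 0
    while True:
        resp = sqs.receive_message(
            QueueUrl=QUEUE_URL,
            MaxNumberOfMessages=10,   # largest batch SQS allows
            WaitTimeSeconds=5,        # long polling cuts down empty receives
        )
        messages = resp.get("Messages", [])
        if not messages:
            return handled
        for msg in messages:
            handler(msg["Body"])      # process first ...
            sqs.delete_message(       # ... delete only on success
                QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"]
            )
            handled += 1
```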

🧩 Root Cause Analysis

Phase | Subsystem | Description
Phase 1 | DNS (DynamoDB API) | DNS resolution failure prevented DynamoDB and IAM API lookups.
Phase 2 | EC2 Launch Subsystem | The launch subsystem depends on DynamoDB for internal coordination; its impairment delayed instance provisioning.
Phase 3 | Load Balancer Health System | A malfunction in the internal health-monitoring subsystem created false health states, throttling EC2 and Lambda.

The cascading nature of these systems — DNS → EC2 → Load Balancer — resulted in widespread disruption across compute, networking, and serverless layers.


⚙️ AWS Recovery Strategy

AWS engineers followed a multi-phase mitigation process:

  1. DNS Recovery: Restored DynamoDB API resolution and endpoint routing.
  2. EC2 Throttling: Reduced new instance launches to stabilize internal APIs.
  3. Subsystem Isolation: Restarted and revalidated the internal load balancer health monitor.
  4. Network Rebalancing: Sequentially restored connectivity across Availability Zones.
  5. Lambda Backlog Processing: Gradually scaled up SQS polling and invocation rates.
  6. Backlog Clearance: Reprocessed delayed messages for Config, Redshift, and Connect.
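Steps 2 and 5 above both amount to metering work through a recovering dependency and raising the limit as health returns. A minimal, generic sketch of that idea as a token bucket (pure Python; the rates are illustrative assumptions, not AWS’s values):

```python
# Minimal, generic sketch (pure Python, illustrative numbers): a token-bucket
# limiter of the kind used to meter new work while an impaired subsystem
# recovers, with a knob that can be raised as health returns.
import threading
import time

class TokenBucket:
    """Allow roughly `rate` operations per second, with a small burst."""

    def __init__(self, rate, burst):
        self.rate = rate
        self.capacity = burst
        self.tokens = float(burst)
        self.updated = time.monotonic()
        self.lock = threading.Lock()

    def allow(self):
        """Return True if one operation may proceed right now."""
        with self.lock:
            now = time.monotonic()
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.updated) * self.rate)
            self.updated = now
            if self.tokens >= 1.0:
                self.tokens -= 1.0
                return True
            return False

    def raise_rate(self, new_rate):
        """Step the limit up as recovery progresses."""
        with self.lock:
            self.rate = new_rate

launch_limiter = TokenBucket(rate=1.0, burst=5)   # start conservative
if launch_limiter.allow():
    print("launch permitted")
else:
    print("launch deferred by local throttle")
```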

🧠 Engineering Lessons Learned

  1. Internal dependencies magnify impact.
    • EC2’s dependency on DynamoDB triggered secondary failures after DNS recovery.
  2. Load balancer health systems are critical.
    • Health check malfunctions can cascade into global service disruptions.
  3. Controlled throttling accelerates recovery.
    • AWS’s decision to limit EC2 and Lambda operations minimized instability.
  4. Transparent communication matters.
    • Regular updates (every 30–45 minutes) provided customers with critical situational awareness.
  5. Architect for regional fault isolation.
    • Multi-region design can prevent total downtime during single-region instability.
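Lesson 5 is the one customers can act on most directly. A minimal sketch of client-side regional failover for DynamoDB reads, assuming a hypothetical Global Table named "sessions" replicated to us-west-2:

```python
# Minimal sketch, assuming a hypothetical Global Table "sessions" replicated
# to us-west-2: client-side regional failover for DynamoDB reads.
import boto3
from botocore.exceptions import ClientError, EndpointConnectionError

TABLE = "sessions"
REGIONS = ["us-east-1", "us-west-2"]   # primary first, then the fallback

clients = {region: boto3.client("dynamodb", region_name=region) for region in REGIONS}

def get_item_with_failover(key):
    """Read from the primary region; fall back if it is unreachable."""
    last_error = None
    for region in REGIONS:
        try:
            return clients[region].get_item(TableName=TABLE, Key=key).get("Item")
        except (ClientError, EndpointConnectionError) as err:
            last_error = err              # remember the failure, try the next region
    raise last_error

item = get_item_with_failover({"pk": {"S": "user#123"}})
print(item)
```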

📊 Affected Services Overview

  • Compute: EC2, Lambda, ECS, EKS, Batch
  • Databases: DynamoDB, RDS, Redshift, Neptune
  • Networking: Elastic Load Balancing, VPC, CloudFront, PrivateLink
  • Messaging & Logging: SQS, SNS, EventBridge, CloudTrail
  • Security: IAM, Secrets Manager, STS
  • Storage: S3, EFS, FSx
  • Analytics & ML: Glue, SageMaker, Polly, Kendra
  • Customer Services: Connect, WorkSpaces, WorkMail

🔚 Conclusion

The US-EAST-1 outage demonstrated how deep interdependencies within cloud infrastructure can propagate failures beyond their origin.
What began as a localized DNS issue in DynamoDB escalated into EC2 and network-level disruptions, affecting multiple service tiers.

By 3:01 PM PDT, AWS engineers achieved full recovery, marking the end of a 15-hour region-wide incident.
AWS confirmed plans to publish a detailed post-event summary with further technical insights.

This event serves as a valuable case study in cloud resilience, dependency management, and system observability.


📘 Official Status Source: AWS Service Health Dashboard – US-EAST-1
🔗 Full Analysis by Dargslan Publishing: AWS US-EAST-1 Incident Report
💡 Learn Cloud Reliability Engineering: dargslan.com