AWS US-EAST-1 Service Disruption Report – ECS, EC2, and Dependent Services – October 28, 2025
AWS experienced a significant outage in the US-EAST-1 Region on October 28, 2025, affecting ECS, EC2, and several dependent services such as Fargate, EMR Serverless, MWAA, and Glue. The incident lasted nearly 14 hours and was caused by control-plane contention within ECS’s internal infrastructure.
1. Executive Summary
On October 28, 2025, between 9:00 AM and 10:43 PM PDT, Amazon Web Services (AWS) experienced a prolonged service disruption within the US-EAST-1 Region, specifically associated with the use1-az2 Availability Zone.
The issue originated in the Elastic Container Service (ECS) control plane and propagated to a subset of dependent services that rely on ECS or its underlying EC2 and networking infrastructure.
During the incident, customers observed increased launch latencies, task and pod failures, and intermittent API throttling across a range of AWS managed compute and orchestration services. These included ECS, EC2, Fargate, Elastic Kubernetes Service (EKS), AWS Batch, Glue, EMR Serverless, Managed Workflows for Apache Airflow (MWAA), CodeBuild, AppRunner, and DataSync.
AWS engineers identified that the underlying cause was resource contention and synchronization delays in one of the ECS control-plane cells operating in use1-az2. These issues cascaded, resulting in API throttling, delayed task scheduling, and container instance disconnects.
The recovery process involved progressive throttling, refreshing ECS and EMR control planes, and rebalancing workloads across healthy cells. Full resolution was confirmed at 10:43 PM PDT, at which point all affected operations returned to normal latency and success rates.
2. Incident Overview and Scope
2.1 Impacted Region and Services
- Region: US-EAST-1 (N. Virginia)
- Availability Zone: use1-az2 (primary impact zone)
- Primary Services Affected:
  - ECS (Elastic Container Service)
  - EC2 (Elastic Compute Cloud)
  - EKS (Elastic Kubernetes Service)
  - AWS Fargate
  - AWS Batch
  - AWS Glue
  - EMR Serverless
  - MWAA (Managed Workflows for Apache Airflow)
  - AWS CodeBuild
  - AWS AppRunner
  - AWS DataSync
 
 
2.2 Duration and Timeline
| Time (PDT) | Event Summary | 
|---|---|
| 9:00 AM | ECS task launches and EC2 instance creations in use1-az2 start experiencing elevated latencies. | 
| 3:36 PM | EC2 launch latencies resolved, but throttling remains. ECS cells continue to show degraded behavior. | 
| 4:31 PM – 5:31 PM | AWS identifies ECS control-plane cell impairment; dependent services (Glue, EMR, MWAA) showing high failure rates. | 
| 6:50 PM | Mitigation actions deployed; positive signs of ECS task recovery observed. | 
| 8:08 PM | EMR warm pools refreshed; ECS/Fargate task launches partially restored. | 
| 9:52 PM | Two of three impacted ECS cells fully recovered. Throttling removed from recovered cells. | 
| 10:43 PM | Final cell restored. AWS declares full mitigation and service normalization. | 
Total duration of degraded service: ~13 hours 43 minutes
2.3 Scale of Impact
- ECS: Elevated failure rates for new task launches across three impacted control-plane cells.
 - EC2: Increased launch latencies and request throttling limited to use1-az2. Existing instances unaffected.
 - Fargate: Task provisioning failures due to ECS dependency.
 - EMR Serverless: Job failures and execution delays caused by unhealthy ECS clusters in the warm pool.
 - MWAA: Environment creation and update operations stalled or entered unhealthy states.
 - Glue: ETL jobs experienced long queue times or failed on initialization.
 - Batch, AppRunner, DataSync: Job start and container provisioning issues.
 
3. Technical Root Cause Analysis
3.1 Initial Failure Sequence
The ECS control plane operates using independent, fault-isolated “cells” to manage container orchestration and cluster state for customer workloads.
At approximately 8:55 AM PDT, one of the ECS control-plane cells in the use1-az2 Availability Zone experienced an unexpected degradation in its metadata synchronization layer.
A background process responsible for updating and distributing cluster state information between ECS data stores and regional APIs encountered elevated latency and lock contention. This contention increased CPU utilization and delayed processing of container instance heartbeats.
As a result:
- ECS agent connections to the control plane began timing out or disconnecting (see the connectivity-check sketch after this list).
 - The ECS service scheduler experienced backlog growth for new task launch requests.
 - Internal retries amplified API load, compounding control-plane pressure.
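
Customers who suspected agent disconnects could confirm them from their own accounts. The following is a minimal sketch using boto3, not part of AWS's guidance; the cluster name is a placeholder, and it simply reports container instances whose agentConnected flag is false.

```python
# Minimal sketch: find ECS container instances whose agent has lost its
# connection to the control plane. "my-cluster" is a placeholder name.
import boto3

ecs = boto3.client("ecs", region_name="us-east-1")

def find_disconnected_instances(cluster: str = "my-cluster") -> list[str]:
    arns = []
    for page in ecs.get_paginator("list_container_instances").paginate(cluster=cluster):
        arns.extend(page["containerInstanceArns"])
    if not arns:
        return []
    # describe_container_instances accepts up to 100 ARNs per call;
    # chunk the list for larger clusters.
    resp = ecs.describe_container_instances(cluster=cluster, containerInstances=arns[:100])
    return [
        ci["ec2InstanceId"]
        for ci in resp["containerInstances"]
        if not ci["agentConnected"]
    ]

if __name__ == "__main__":
    print(find_disconnected_instances())
```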
 
3.2 Propagation of the Issue
The failure within the ECS control plane propagated through several layers:
- ECS Cluster Health: Impacted cells stopped accepting new task scheduling requests, resulting in high task launch failure rates.
 - Fargate Platform: Fargate tasks rely on ECS orchestration APIs, so the control-plane latency led to widespread provisioning timeouts (illustrated by the RunTask sketch after this list).
 - EMR Serverless Warm Pools: EMR Serverless maintains a set of pre-warmed ECS clusters to execute customer jobs. Several clusters within the impacted ECS cells became unhealthy, leading to job failures and long queue delays.
 - Dependent Services: MWAA, Glue, CodeBuild, and AppRunner—all of which create ECS or Fargate tasks during job initialization—experienced elevated launch errors.
 - EC2 Throttling: To reduce downstream impact, AWS introduced temporary throttles on EC2 launch requests and ECS API operations within use1-az2 to prevent further overload.
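
The Fargate dependency is visible in the public API: even a Fargate task is launched through the ECS RunTask operation, so ECS control-plane impairment surfaces directly as provisioning timeouts. Below is a minimal sketch; the cluster, task definition, subnet, and security group identifiers are placeholders.

```python
# Minimal sketch: a Fargate task is still scheduled by the ECS control plane,
# so run_task is the call that stalls when an ECS cell is impaired.
# Cluster, task definition, subnet, and security group IDs are placeholders.
import boto3

ecs = boto3.client("ecs", region_name="us-east-1")

response = ecs.run_task(
    cluster="my-cluster",
    launchType="FARGATE",
    taskDefinition="my-task:1",
    count=1,
    networkConfiguration={
        "awsvpcConfiguration": {
            "subnets": ["subnet-0123456789abcdef0"],
            "securityGroups": ["sg-0123456789abcdef0"],
            "assignPublicIp": "DISABLED",
        }
    },
)
# During the incident, "failures" was frequently non-empty or the call timed out.
print(response.get("failures", []))
```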
 
3.3 Contributing Factors
- Control-Plane Resource Contention: Excessive metadata replication delays caused by a mis-tuned background process.
 - Cell Inter-Dependencies: Although ECS is designed for fault isolation, the synchronization between multiple cells added recovery complexity.
 - Throttling Side Effects: While throttling prevented further overload, it also delayed visible recovery for some customers.
 - High Regional Utilization: US-EAST-1 is the busiest AWS region, which increased recovery complexity during peak hours.
 
4. Detection and Incident Response
4.1 Detection
Automated internal health monitoring detected a rise in ECS task failure rates and EC2 API error metrics around 9:00 AM PDT.
The anomaly triggered internal alarms for the ECS service team, prompting immediate investigation.
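
AWS's internal monitors are not exposed to customers, but a comparable early-warning signal can be built with a CloudWatch alarm on a customer-published metric. The sketch below assumes a hypothetical custom metric (namespace Custom/ECS, metric TaskLaunchFailures) that an application emits whenever a RunTask call fails; the SNS topic ARN is a placeholder.

```python
# Minimal sketch: alarm on a hypothetical customer-published metric so that a
# spike in failed task launches pages the on-call engineer.
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

cloudwatch.put_metric_alarm(
    AlarmName="ecs-task-launch-failures",
    Namespace="Custom/ECS",           # hypothetical namespace
    MetricName="TaskLaunchFailures",  # hypothetical metric emitted by the app
    Statistic="Sum",
    Period=60,
    EvaluationPeriods=5,
    Threshold=10,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:oncall-topic"],  # placeholder
)
```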
4.2 Initial Response
At 9:20 AM, engineers initiated diagnostic queries to confirm the scope. The ECS team identified the correlation between failing cells and the use1-az2 Availability Zone.
By 10:00 AM, AWS published the first Personal Health Dashboard (PHD) notifications to affected customers indicating elevated ECS and EC2 latencies.
4.3 Communication Timeline
AWS provided periodic updates through the Service Health Dashboard (SHD) at approximately hourly intervals, sharing ETAs and customer guidance (a scripted sketch of these workarounds follows the list):
- Advising creation of new ECS clusters to migrate to healthy cells.
 - Recommending restarts of EMR Serverless applications post-recovery.
 - Providing temporary workarounds for MWAA (performing configuration-neutral environment updates to reset unhealthy states).
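
These workarounds can be scripted. The sketch below is one plausible reading of that guidance rather than an AWS-published procedure: it creates a fresh ECS cluster, stop/start-cycles an EMR Serverless application, and performs a configuration-neutral MWAA update by resubmitting the environment's current MaxWorkers value. All resource names and IDs are placeholders.

```python
# Minimal sketch of the advised workarounds; all resource names are placeholders.
import time
import boto3

region = "us-east-1"
ecs = boto3.client("ecs", region_name=region)
emr = boto3.client("emr-serverless", region_name=region)
mwaa = boto3.client("mwaa", region_name=region)

# 1. Create a new ECS cluster, which may be placed on a healthy control-plane cell.
ecs.create_cluster(clusterName="my-cluster-failover")

# 2. Stop/start-cycle an EMR Serverless application once the service has recovered.
app_id = "00example123456"  # placeholder application ID
emr.stop_application(applicationId=app_id)
for _ in range(60):  # poll for up to ~10 minutes
    if emr.get_application(applicationId=app_id)["application"]["state"] == "STOPPED":
        break
    time.sleep(10)
emr.start_application(applicationId=app_id)

# 3. Configuration-neutral MWAA update: resubmit the current MaxWorkers value
#    to prompt the environment to refresh without changing its configuration.
env_name = "my-airflow-env"  # placeholder environment name
current = mwaa.get_environment(Name=env_name)["Environment"]
mwaa.update_environment(Name=env_name, MaxWorkers=current["MaxWorkers"])
```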
 
4.4 Recovery Actions
AWS engineers executed a structured mitigation plan:
- Cell-Level Isolation: Segmented unhealthy cells to prevent spillover into healthy ECS partitions.
 - Throttling Controls: Applied throttles to ECS and EC2 API calls in use1-az2 to stabilize the system (a conceptual throttling sketch follows this list).
 - Warm Pool Rebuild: Refreshed EMR Serverless warm pools with healthy ECS clusters.
 - Progressive Rehydration: Gradually restored capacity to affected control planes while monitoring latency metrics.
 - Validation and Verification: Once all cells showed stable operations, throttles were lifted sequentially.
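
AWS has not published the mechanics of the throttles it applied. The snippet below is only a conceptual illustration of the general technique, a token bucket that admits a bounded number of requests per second to a stressed control plane; it is not AWS's actual implementation.

```python
# Conceptual illustration of request throttling (not AWS's implementation):
# a token bucket that admits at most `rate` requests per second, so a stressed
# control plane sees a bounded load while recovery proceeds.
import time

class TokenBucket:
    def __init__(self, rate: float, burst: int):
        self.rate = rate           # tokens added per second
        self.capacity = burst      # maximum burst size
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True   # admit the request
        return False      # throttle: caller should back off and retry later

bucket = TokenBucket(rate=5, burst=10)   # illustrative numbers only
admitted = [bucket.allow() for _ in range(20)]
print(admitted.count(True), "of 20 requests admitted")
```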
 
5. Recovery and Resolution Process
5.1 Partial Recovery (4:30 PM – 8:00 PM)
Between 4:30 PM and 8:00 PM PDT, customers began observing partial recovery in ECS task success rates and EC2 instance launches.
AWS confirmed that one ECS cell was fully restored and two remained impaired. During this stage:
- ECS gradually resumed accepting new task launches.
 - EC2 throttles remained in place to prevent overload.
 - EMR Serverless continued refreshing warm pools.
 
5.2 Full Recovery (9:52 PM – 10:43 PM)
At 9:52 PM, AWS reported recovery of two of the three ECS cells, with throttling lifted on these cells.
By 10:43 PM, the final ECS cell was fully operational.
All dependent services reported normal latencies and API success rates. AWS officially declared the incident resolved.
5.3 Post-Recovery Validation
AWS teams performed comprehensive validation across:
- ECS task launch latency metrics
 - EC2 instance launch success rates
 - EMR Serverless job initialization times
 - MWAA environment status checks
 - Glue and Batch job execution metrics
 
All systems returned to baseline by 11:30 PM PDT; an equivalent customer-side check is sketched below.
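
An analogous customer-side validation is straightforward: confirm that every ECS service in a cluster is back at steady state. A minimal sketch, with the cluster name as a placeholder:

```python
# Minimal sketch: confirm each ECS service in a cluster is back at steady state
# (runningCount == desiredCount and no pending tasks). Names are placeholders.
import boto3

ecs = boto3.client("ecs", region_name="us-east-1")
cluster = "my-cluster"

service_arns = []
for page in ecs.get_paginator("list_services").paginate(cluster=cluster):
    service_arns.extend(page["serviceArns"])

# describe_services accepts up to 10 services per call.
for i in range(0, len(service_arns), 10):
    resp = ecs.describe_services(cluster=cluster, services=service_arns[i:i + 10])
    for svc in resp["services"]:
        healthy = svc["runningCount"] == svc["desiredCount"] and svc["pendingCount"] == 0
        print(f'{svc["serviceName"]}: {"OK" if healthy else "DEGRADED"}')
```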
6. Customer Impact Analysis
6.1 ECS and Fargate
Customers running containerized workloads saw:
- Task launch failures and delays of up to 30 minutes.
 - ECS service auto-scaling events that failed due to throttling.
 - ECS agent disconnects causing tasks to stop unexpectedly (stop reasons can be inspected as sketched after this list).
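
For tasks that stopped unexpectedly, ECS records a stop code and reason that can be reviewed after the fact. A minimal sketch, with the cluster name as a placeholder:

```python
# Minimal sketch: review why recent ECS tasks stopped. The stopCode and
# stoppedReason fields recorded by ECS indicate agent or scheduling problems.
import boto3

ecs = boto3.client("ecs", region_name="us-east-1")
cluster = "my-cluster"  # placeholder

stopped = ecs.list_tasks(cluster=cluster, desiredStatus="STOPPED")["taskArns"]
if stopped:
    # describe_tasks accepts up to 100 task ARNs per call.
    tasks = ecs.describe_tasks(cluster=cluster, tasks=stopped[:100])["tasks"]
    for t in tasks:
        print(t["taskArn"], t.get("stopCode"), t.get("stoppedReason"))
```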
 
6.2 EC2
Customers launching new EC2 instances in use1-az2 observed:
- Increased API latencies and RequestLimitExceeded errors.
 - No impact to existing instances or workloads.
 
6.3 EMR Serverless
- Jobs queued for extended periods or failed due to unhealthy warm pools.
 - Some customers required manual restart of EMR applications to restore functionality.
 
6.4 MWAA and Glue
- MWAA environments entered “unhealthy” states; AWS advised re-deploying or updating configurations.
 - Glue ETL jobs were delayed or failed to initialize.
 
6.5 Secondary Services
- CodeBuild and AppRunner builds failed intermittently.
 - DataSync experienced job initialization timeouts.
 - AWS Batch queues stalled until ECS recovery.
 
7. Mitigation Strategies and Lessons Learned
7.1 Immediate Fixes Implemented
- Control-Plane Re-tuning: AWS adjusted the synchronization parameters within the ECS metadata service to prevent recurrence of lock contention.
 - Improved Cell-Level Fault Containment: Isolation boundaries were enhanced to ensure a failure in one cell cannot cascade across dependent services.
 - Faster EMR Warm Pool Refresh Mechanism: Automation was introduced to rebuild EMR Serverless pools more rapidly during similar outages.
 - Enhanced Monitoring: Additional metrics now track background synchronization latency, enabling faster detection of degradation trends.
 
7.2 Long-Term Preventive Actions
- ECS Architecture Enhancements:
  - Introduction of dynamic cell rebalancing, allowing workloads to migrate away from impaired cells automatically.
  - Implementation of adaptive throttling, which maintains regional stability without excessively delaying unaffected customers.
- Service Interdependency Testing:
  - Strengthened chaos-engineering tests simulating partial control-plane failures across services such as Glue, Batch, and Fargate.
- Incident Response Automation:
  - Development of automated mitigation playbooks for ECS, reducing manual coordination time during future outages.
- Improved Customer Communication:
  - Refinement of AWS Health and Service Health Dashboard messaging to provide clearer guidance and real-time recovery indicators.
 
 
8. Customer Recommendations and Best Practices
To minimize impact from similar zonal or cell-level incidents, AWS recommends the following:
- Use Multi-AZ Deployments: Deploy ECS, EKS, and EC2 workloads across multiple Availability Zones to ensure resilience against zonal control-plane issues.
- Enable ECS Capacity Providers: Use capacity providers with multiple instance pools to avoid being tied to a single AZ or ECS cell (a configuration sketch follows this list).
- Leverage Service Auto-Recovery: Configure auto-scaling and task restart policies that can automatically replace failed or stopped tasks.
- Implement Retry Logic and Exponential Backoff: Application-level retry mechanisms smooth out transient throttling or API rate-limit events (see the same sketch below).
- Monitor Using CloudWatch Metrics: Track ECSAgentConnected, TaskLaunchLatency, and ThrottledRequests metrics to detect issues early.
- Regularly Test Failover Procedures: Periodically validate application failover across Regions or AZs to confirm recovery readiness.
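
Two of these recommendations translate directly into configuration. The sketch below enables botocore's adaptive retry mode (exponential backoff with client-side rate limiting) and attaches more than one capacity provider to a cluster. For brevity it uses the built-in FARGATE and FARGATE_SPOT providers; EC2 Auto Scaling group capacity providers would cover the multiple-instance-pool case. Names and weights are illustrative.

```python
# Minimal sketch: (1) SDK-level retries with exponential backoff, and
# (2) an ECS cluster backed by more than one capacity provider.
# Cluster name, provider choice, and weights are illustrative only.
import boto3
from botocore.config import Config

# Adaptive mode retries throttled calls with backoff and client-side rate limiting.
retry_config = Config(retries={"max_attempts": 10, "mode": "adaptive"})
ecs = boto3.client("ecs", region_name="us-east-1", config=retry_config)

ecs.put_cluster_capacity_providers(
    cluster="my-cluster",
    capacityProviders=["FARGATE", "FARGATE_SPOT"],
    defaultCapacityProviderStrategy=[
        {"capacityProvider": "FARGATE", "weight": 1, "base": 1},
        {"capacityProvider": "FARGATE_SPOT", "weight": 4},
    ],
)
```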
9. Communication and Transparency
Throughout the event, AWS provided updates via:
- Service Health Dashboard (SHD): Hourly updates with current status and recovery ETA.
 - Personal Health Dashboard (PHD): Targeted notifications to affected customers.
 - AWS Support Center: Customers with Enterprise Support received direct communication and impact assessments.
 
The final update was published at 10:57 PM PDT, confirming full resolution and normal operation across all impacted services.
10. Metrics Summary
| Metric | Normal Baseline | During Incident (Peak) | After Recovery | 
|---|---|---|---|
| ECS Task Launch Success Rate | >99.99% | <60% | >99.99% | 
| ECS API Latency (P99) | 250 ms | 2,800 ms | 230 ms | 
| EC2 Instance Launch Latency | 45 sec | >300 sec | 42 sec | 
| Fargate Task Failure Rate | <0.1% | 18% | <0.1% | 
| EMR Serverless Job Failure Rate | <0.5% | 35% | <0.5% | 
| Glue Job Start Latency | 60 sec | >600 sec | 58 sec | 
11. Final Resolution and Verification
At 10:43 PM PDT, AWS engineers confirmed that:
- All ECS cells in US-EAST-1 were healthy and fully synchronized.
 - EC2 throttling had been lifted in use1-az2.
 - Fargate, EMR Serverless, MWAA, and dependent services were functioning normally.
 - Customer workloads exhibited no residual latency or error anomalies.
 
AWS declared the incident fully resolved and initiated post-incident analysis immediately thereafter.
12. Post-Incident Review (PIR) Summary
AWS has initiated a Post-Incident Review with the following key outcomes:
- Root Cause: ECS control-plane synchronization contention in a single AZ leading to cell-level degradation.
 - Scope of Impact: 3 ECS control-plane cells, indirectly affecting 10+ dependent services.
 - Resolution Steps: Control-plane rebalancing, EMR warm pool rebuild, throttling, and progressive validation.
 - Total Duration: Approximately 13 hours and 43 minutes from detection to full mitigation.
 - Customer Data Integrity: No data loss or corruption occurred.
 
The PIR emphasizes continued investment in multi-cell resilience, cross-service dependency mapping, and proactive throttling automation to minimize future regional disruptions.
13. Conclusion
The October 28, 2025, US-EAST-1 outage demonstrated how complex interdependencies between AWS compute and orchestration services can amplify a localized control-plane failure.
While AWS’s cell-based architecture provided significant isolation, the shared reliance of higher-level services on ECS and EC2 introduced cascading effects.
AWS successfully mitigated the issue through isolation, throttling, and systematic recovery of control-plane components. The event reaffirmed the importance of robust fault isolation, multi-AZ design, and continuous architecture evolution to ensure service reliability at global scale.
AWS remains committed to transparency and operational excellence and will continue strengthening system resilience through architectural improvements, enhanced observability, and proactive customer communication.