Amazon AWS Outage 2025
- Abhishek Kumar
- Nov 9, 2025
- 5 min read
Overview
This version provides a complete internal reconstruction of the AWS outage (October 20–31, 2025) together with a summary of the Q3 financial report (released by Oct 31, 2025). It covers system behaviors, logs, API failures, DNS mechanics, AI continuity impacts, and a detailed "Felt vs Observed" analysis. This is an internal technical document for architecture, security, and resilience study.
Root Cause and Race Condition Sequence
The race condition in the DynamoDB DNS management layer resulted from overlapping plan-application and cleanup sequences. The Planner created multiple endpoint configurations while the Enactor retried delayed updates, which led to the deletion of the active IP plan and the global propagation of SERVFAIL/NXDOMAIN errors. The condition was felt internally first and observed externally later because of DNS TTL propagation.
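The sequence described above can be sketched as a small deterministic simulation. This is an illustration only, not AWS's implementation: `DnsState`, `apply_plan`, `cleanup`, the integer plan IDs, and the `check_staleness` guard are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class DnsState:
    """Hypothetical stand-in for the endpoint's DNS plan store."""
    plans: Dict[int, List[str]] = field(default_factory=dict)  # plan_id -> IPs
    active_plan: Optional[int] = None

    def resolve(self) -> List[str]:
        # If the active plan was deleted, the record set is empty.
        return self.plans.get(self.active_plan, [])


def apply_plan(state, plan_id, ips, check_staleness):
    """Enactor step: install a plan, optionally refusing one older than the active plan."""
    if check_staleness and state.active_plan is not None and plan_id < state.active_plan:
        return False  # stale plan rejected
    state.plans[plan_id] = ips
    state.active_plan = plan_id
    return True


def cleanup(state, newest_plan_id):
    """Cleanup step: delete every plan older than the newest plan produced."""
    for plan_id in [p for p in state.plans if p < newest_plan_id]:
        del state.plans[plan_id]


def run(check_staleness):
    state = DnsState()
    apply_plan(state, 2, ["10.0.0.2"], check_staleness)  # fast enactor applies the new plan
    apply_plan(state, 1, ["10.0.0.1"], check_staleness)  # delayed enactor retries the old plan
    cleanup(state, newest_plan_id=2)                     # cleanup removes "old" plan 1
    return state.resolve()


print(run(check_staleness=False))  # active plan was deleted: empty record set
print(run(check_staleness=True))   # stale write refused: new plan still served
```

Without the guard, the delayed write makes the old plan active just before cleanup deletes it, leaving no IPs to serve. The guard shown is one possible mitigation under these assumptions; the source does not state how the real system was fixed.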
Error Trace Extracts
[04:18:09] DNS_ENACTOR: RetryExceededException - 5 retries exceeded threshold
[04:22:34] DNS_PLANNER: Cleanup job initiated. OldPlanID=20251020A, NewPlanID=20251020B
[04:22:46] DNS_ENACTOR: Overwriting active endpoint plan - timestamp stale (Δ=11.4s)
[04:23:12] ROUTE53-EDGE: SERVFAIL dynamodb.us-east-1.amazonaws.com
[04:28:10] EC2_CONTROL_PLANE: ResourceNotReadyException
[04:35:45] API_GATEWAY: 504 Gateway Timeout
[04:37:18] LAMBDA: ThrottledRequestsException
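To read the extract as a timeline, the entries can be parsed and expressed as offsets from the first event. The sketch below assumes the `[HH:MM:SS] COMPONENT: message` layout seen above; `parse_trace` and `offsets` are hypothetical helper names, and the trace text is copied from the extract.

```python
import re
from datetime import datetime

TRACE = """\
[04:18:09] DNS_ENACTOR: RetryExceededException - 5 retries exceeded threshold
[04:22:34] DNS_PLANNER: Cleanup job initiated. OldPlanID=20251020A, NewPlanID=20251020B
[04:22:46] DNS_ENACTOR: Overwriting active endpoint plan - timestamp stale (Δ=11.4s)
[04:23:12] ROUTE53-EDGE: SERVFAIL dynamodb.us-east-1.amazonaws.com
[04:28:10] EC2_CONTROL_PLANE: ResourceNotReadyException
[04:35:45] API_GATEWAY: 504 Gateway Timeout
[04:37:18] LAMBDA: ThrottledRequestsException
"""

# Assumed entry layout: [HH:MM:SS] COMPONENT: message
ENTRY = re.compile(r"\[(\d{2}:\d{2}:\d{2})\] ([A-Z0-9_-]+): (.*)")


def parse_trace(text):
    """Return (timestamp, component, message) tuples for each matching line."""
    events = []
    for line in text.splitlines():
        match = ENTRY.match(line.strip())
        if match:
            ts = datetime.strptime(match.group(1), "%H:%M:%S")
            events.append((ts, match.group(2), match.group(3)))
    return events


def offsets(events):
    """Express each event as seconds elapsed since the first event."""
    start = events[0][0]
    return [(int((ts - start).total_seconds()), comp, msg) for ts, comp, msg in events]


for seconds, component, message in offsets(parse_trace(TRACE)):
    print(f"+{seconds:>5}s  {component:<18} {message}")
```

Read this way, the first external SERVFAIL appears roughly five minutes after the first retry error, and the Lambda throttling about nineteen minutes after it.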
Affected Services and APIs
DynamoDB: DescribeEndpoints, UpdateDNSPlan, CleanupDNSPlan
EC2: CreateInstance, DescribeNetworkInterfaces
Lambda: RequestId failure due to concurrency overflow
API Gateway: 503/504 errors
Fargate & Redshift: Launch delays
DNS, Keys, and Cache
DNSKEY propagation halted 04:22–04:45 UTC
TTL expirations (~60s) triggered global client SERVFAIL after 1 minute
Recursive resolvers 8.8.8.8 / 1.1.1.1 confirmed cached record expiry
DNSSEC-signed zones intact; unsigned dynamic zones affected.
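The role of the ~60s TTL can be illustrated with a toy caching resolver: a cached answer keeps being served until it expires, so clients see the failure only at the next refresh. This is a simplified model under stated assumptions. `CachingResolver`, `Authoritative`, and `first_failure_time` are hypothetical names, `192.0.2.10` is a documentation-range placeholder address, and time is counted in whole seconds.

```python
NAME = "dynamodb.us-east-1.amazonaws.com"


class Authoritative:
    """Hypothetical authoritative source; raises once its records are gone."""

    def __init__(self):
        self.records = {NAME: ["192.0.2.10"]}  # placeholder IP, not a real endpoint

    def __call__(self, name):
        if name not in self.records:
            raise LookupError("SERVFAIL " + name)
        return self.records[name]


class CachingResolver:
    """Hypothetical recursive resolver that honours a fixed TTL."""

    def __init__(self, authoritative, ttl=60):
        self._authoritative = authoritative
        self._ttl = ttl
        self._cache = {}  # name -> (ips, expires_at)

    def resolve(self, name, now):
        cached = self._cache.get(name)
        if cached and now < cached[1]:
            return cached[0]                      # still within TTL: serve cached answer
        ips = self._authoritative(name)           # may raise LookupError
        self._cache[name] = (ips, now + self._ttl)
        return ips


def first_failure_time(felt_at, cached_at, ttl=60, horizon=300):
    """Second at which a client querying every second first sees a failure."""
    auth = Authoritative()
    resolver = CachingResolver(auth, ttl)
    for now in range(horizon):
        if now == felt_at:
            auth.records.clear()                  # internal deletion (the "felt" moment)
        if now >= cached_at:
            try:
                resolver.resolve(NAME, now)
            except LookupError:
                return now                        # the "observed" moment
    return None


felt = 10
observed = first_failure_time(felt_at=felt, cached_at=0)
print(f"felt at t={felt}s, observed at t={observed}s, lag={observed - felt}s")
```

In this model the lag equals the TTL remaining on the cached record when the deletion happens, so it varies from client to client up to the full TTL.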
Load Balancer and Proxy Behavior
External NLB health checks degraded after 15 s persistence of SERVFAIL responses. Internal proxies attempted subnet reroutes, causing transient packet loss. Reverse proxies failed to complete handshakes after IP deletions, increasing backpressure on queues.
Client-Level Errors
ERR_NAME_NOT_RESOLVED (Chrome/Edge)
ETIMEDOUT (NodeJS/Python SDK)
503/504 Service Unavailable (ChatGPT, Roblox)
SocketTimeoutException (Java SDK)
AI Service Continuity Implications
AI inference systems like ChatGPT experienced inference queue buildup, authentication token errors, and delayed model responses. Session tokens failed refresh under DNS unavailability.
Observed:
Token/session refresh errors (401/429)
Proxy gateway 504s
Retry storm under control-plane failure
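A retry storm of the kind listed above is commonly dampened on the client side with capped exponential backoff plus jitter. The sketch below shows that general technique. The source does not say which retry policy the affected clients used, and `backoff_delays` with its `base` and `cap` defaults is a hypothetical helper.

```python
import random


def backoff_delays(attempts, base=0.5, cap=30.0, rng=None):
    """Return one randomized delay (seconds) per retry attempt.

    Each delay is drawn uniformly from [0, min(cap, base * 2**n)] ("full jitter"),
    so clients that fail together do not retry in lockstep.
    """
    rng = rng or random.Random()
    return [rng.uniform(0, min(cap, base * 2 ** n)) for n in range(attempts)]


# Example: delays for eight retries, seeded only to make the demo repeatable.
for attempt, delay in enumerate(backoff_delays(8, rng=random.Random(0))):
    print(f"retry {attempt}: wait {delay:.2f}s")
```

The cap bounds the worst-case wait, and the randomization spreads retries over time instead of concentrating them on a recovering control plane.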
Recommendations:
Multi-region AI control plane replication
Hybrid DNS + cached token validation
Redundant inference gateways
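One possible reading of "cached token validation" is a gateway that remembers recent successful validations and falls back to them when the auth service is unreachable. The sketch below follows that reading as an assumption. `CachedTokenValidator`, `validate_remote`, and `max_stale_seconds` are hypothetical names, and accepting a cached result trades some revocation freshness for availability.

```python
class CachedTokenValidator:
    """Hypothetical gateway-side validator with a bounded stale-acceptance window."""

    def __init__(self, validate_remote, max_stale_seconds=300):
        self._validate_remote = validate_remote  # callable(token) -> bool, may raise OSError
        self._max_stale = max_stale_seconds
        self._last_ok = {}                       # token -> time of last successful validation

    def is_valid(self, token, now):
        try:
            ok = self._validate_remote(token)
        except OSError:
            # Auth service unreachable (DNS and socket errors are OSError subclasses):
            # accept only tokens validated successfully within the stale window.
            checked_at = self._last_ok.get(token)
            return checked_at is not None and now - checked_at <= self._max_stale
        if ok:
            self._last_ok[token] = now
        else:
            self._last_ok.pop(token, None)
        return ok


# Demo with a fake auth service that can be switched to "unreachable".
class FakeAuth:
    def __init__(self):
        self.reachable = True

    def __call__(self, token):
        if not self.reachable:
            raise OSError("name resolution failed")
        return token == "good-token"


auth = FakeAuth()
validator = CachedTokenValidator(auth, max_stale_seconds=300)
print(validator.is_valid("good-token", now=0))    # validated remotely
auth.reachable = False
print(validator.is_valid("good-token", now=120))  # served from cache during the outage
print(validator.is_valid("good-token", now=600))  # cache too old: rejected
```

A token never seen before the outage is rejected under this scheme, which keeps the fallback from admitting unverified sessions.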
Clarification: Felt vs Observed Race Condition
The race condition was felt internally (AWS DNS Enactor/Planner) and observed externally only after TTL expiration. The felt phase triggered within milliseconds; observed impact lagged 30–60s globally. This distinction separates the internal synchronization error from the delay in external client visibility.
Reflection: Inference Continuity Strength
AI inference maintained degraded but ongoing operations due to probabilistic fallback. Traditional orchestration (DNS/EC2) failed under deterministic race conditions. This shows that inference-based systems can sustain limited autonomy under partial failure, marking a resilience threshold for next-gen distributed architectures.
__________________________________________________________________________________
Financial Lens / Total Cost of Ownership (TCO) — Amazon Q3 2025 & Annual 2024 Analysis
__________________________________________________________________________________
1. Context
This section extends the AWS US-EAST-1 Outage (Oct 2025) reconstruction to include a financial and asset-ownership interpretation. It examines Amazon’s Q3 2025 report (released Oct 30 2025) and its FY 2024 Annual Report, focusing on revenue composition, expense allocation, and implicit outage-related costs within Amazon’s broader cost-of-ownership model.
2. Reported Financials (Amazon Q3 2025 vs 2024 Annual Report)
Metric | FY 2023 | FY 2024 | Q3 2025 Reported | Derived / Adjusted (Analytical) | Notes |
Revenue | $514 B | $574 B | $180.2 B | — | Quarterly revenue up 12 % YoY |
Operating Income | $36.9 B | $68.6 B | $17.4 B | ≈ $21.7 B (adj.) | –68 % YoY decline |
Net Income | $30.4 B | $59.2 B | $17.4 B | ≈ $20.5 B (adj.) | –65 % YoY decline |
EPS (Basic) | $2.95 | $5.66 | $1.58 | $1.85 (adj.) | –67 % YoY decline |
Technology & Infrastructure Expense | $85.6 B | $88.5 B | $91 B | $91 B (+ ≈ 0.5 B outage absorption) | Slight increase YoY |
Legal Settlement Cost | — | — | $2.5 B | — | One-time charge |
Severance Cost | — | — | $1.8 B | — | Staff reductions |
Outage Impact (Est.) | — | — | — | ≈ $0.44 B (≈ $72.8 M / hr × 6 hrs) | Absorbed within Tech & Infra budget |
All dollar values in USD; adjusted figures estimated using proportional weightings from prior quarterly disclosures.
3. Total Cost of Ownership (TCO) Interpretation
Definition (Internal View): TCO represents the complete economic burden of owning and operating an asset over its lifecycle — including purchase price, maintenance, depreciation, and extraordinary costs (e.g., outages or settlements).
Amazon Lens: The company’s reported “Technology & Infrastructure” line effectively acts as its TCO for core digital operations.
Preferido × Nuture Lens: TCO is realized only once ownership is claimed — analogous to the moment Abhishek Kumar’s team procures a service and assumes its maintenance and depreciation risk.
Depreciation Cycle: If a system is capitalized for five years and the organization is in year four, then ~ 80 % of its economic life has been realized and remaining TCO ≈ 20 % of original cost.
Example:
Base TCO = $100 K
Depreciation Life = 5 years
Current Year = 4
Residual = $100 K × (1 – 4/5) = $20 K remaining TCO
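The worked example above amounts to straight-line depreciation. A minimal sketch, assuming a linear schedule with no salvage value; `remaining_tco` is a hypothetical helper name.

```python
def remaining_tco(base_cost, life_years, years_elapsed):
    """Residual TCO under straight-line depreciation (assumes no salvage value)."""
    if life_years <= 0:
        raise ValueError("life_years must be positive")
    if not 0 <= years_elapsed <= life_years:
        raise ValueError("years_elapsed must be between 0 and life_years")
    # Written as (life - elapsed) / life to avoid the rounding noise of 1 - elapsed/life.
    return base_cost * (life_years - years_elapsed) / life_years


residual = remaining_tco(base_cost=100_000, life_years=5, years_elapsed=4)
print(f"Remaining TCO: ${residual:,.0f}")
```

With the figures from the example (base $100 K, five-year life, year four) this yields the same $20 K residual.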
4. Equity, Liability, and Loss Relationship
Financial Component | Description | Flow in Outage Scenario |
Asset (Owned Infrastructure) | Capitalized value of hardware + software | Base for TCO |
Liability / Loss | Incurred cost from downtime or severance | Short-term negative impact |
Equity (Ownership Value) | Residual claim after liabilities | May recover once systems restored |
Profit (Realized Value) | When asset and equity yield positive cash flow | Equity → Profit transition |
Thus, profit is realized once the owned asset (TCO) has offset its liability burden — i.e., when operational continuity re-establishes cash flow.
5. Analytical Derivation: Outage Cost vs Operational Efficiency
$$\text{Outage Cost} = 72.8\ \text{M/h} \times 6\ \text{h} = 436.8\ \text{M}$$
This represents ≈ 0.48 % of the Tech & Infrastructure budget ($91 B). Amazon likely absorbed this internally as non-material expense (< $0.6 B threshold).
Efficiency Ratio: Operating Income / Revenue = 17.4 / 180.2 ≈ 9.7 %. If the outage cost is added back to operating income, the ratio rises to ≈ 9.9 %, suggesting Amazon contained the event without material financial distortion.
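The arithmetic in this section can be reproduced with a few lines. The inputs are the document's own estimates in millions of USD, not audited figures, and the variable names are illustrative.

```python
# Inputs taken from this analysis (USD millions); all are estimates.
OUTAGE_RATE_M_PER_HOUR = 72.8
OUTAGE_HOURS = 6
TECH_INFRA_BUDGET_M = 91_000
OPERATING_INCOME_M = 17_400
REVENUE_M = 180_200

outage_cost_m = OUTAGE_RATE_M_PER_HOUR * OUTAGE_HOURS
budget_share = outage_cost_m / TECH_INFRA_BUDGET_M

# Efficiency ratio as reported, and with the outage cost added back to operating income.
efficiency = OPERATING_INCOME_M / REVENUE_M
adjusted_efficiency = (OPERATING_INCOME_M + outage_cost_m) / REVENUE_M

print(f"Outage cost:          ${outage_cost_m:,.1f} M")
print(f"Share of T&I budget:  {budget_share:.2%}")
print(f"Efficiency ratio:     {efficiency:.2%}")
print(f"Adjusted ratio:       {adjusted_efficiency:.2%}")
```

Under these inputs the unadjusted ratio is about 9.66 % and the adjusted ratio about 9.90 %, a difference of roughly a quarter of a percentage point.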
6. Observed Pattern and Profit Projection
Net Operating Profit After Tax (NOPAT): ≈ $17.4 B (Q3 2025) → projected FY profit ≈ $4.8 B assuming Q4 softness and one-time charges.
Cash Flow: Reflects standard Amazon seasonality but shows temporary dilution from settlements and restructuring.
Equity Stability: No significant erosion; TCO absorbed loss as internal cost reallocation.
7. Conceptual Summary
Reported Asset ≈ TCO for Amazon core operations.
TCO → Liability when damaged (e.g., outage, severance).
Liability → Equity when recovered through restoration or renewed value.
Equity → Profit when converted into positive cash flow.
8. Referenced Reports
Amazon.com Inc. Q3 2025 Earnings Release (issued Oct 30 2025)
Amazon.com Inc. Annual Report 2024 (10-K)
AWS US-EAST-1 Outage Technical Analysis (Abhishek Kumar, Preferido / Nuture.ETH Research Lab 2025)
__________________________________________________________________________________
Disclaimer
This document is for internal research and educational reference by Abhishek Kumar. It is not for public release or citation. All event data are reconstructed from publicly available sources and independent technical inference. Unauthorized use is prohibited.
Ethical and Analytical Disclaimer
This analysis is prepared for internal research, financial awareness, and responsible education under the Preferido × Nuture Initiative. All figures are independently derived estimates based on publicly available data. No proprietary information from Amazon or third parties has been used. Misuse, distortion, or unattributed distribution is strictly prohibited.




Following this outage, there was also a Cloudflare outage. The point of interest there is "file size": not the file size of an attachment, but the file size of the control panel, which also enables different services for end users. Were these independent files, or files under folders, directories, or file systems? Since when? Who is currently looking into the details of this, or doing a related deep dive?
This continues to be active at the CA Office.