
Amazon AWS Outage 2025

Overview

This version provides a complete internal reconstruction of the AWS outage (October 20–31, 2025) together with a summary of the Q3 financial report (released by Oct 31, 2025). It covers system behaviors, logs, API failures, DNS mechanics, AI continuity impacts, and the detailed ‘Felt vs Observed’ analysis. This is an internal technical document for architecture, security, and resilience study.


Root Cause and Race Condition Sequence

The race condition in the DynamoDB DNS management layer resulted from overlapping plan-application and cleanup sequences. The Planner created multiple endpoint configurations while the Enactor retried delayed updates. A stale plan overwrote the active one, the cleanup job then deleted the active IP plan, and SERVFAIL/NXDOMAIN errors propagated globally. The condition was felt internally first and observed externally later, because cached DNS records kept resolving until their TTLs expired.
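The sequence above can be illustrated with a toy model. This is a minimal sketch, not AWS's implementation: the names `DnsState`, `apply_plan`, `cleanup`, and `resolve`, the integer plan IDs, and the example IP addresses are all hypothetical. It shows how a delayed retry of an older plan followed by a cleanup pass can leave the active plan deleted, and how a simple version check would reject the stale write.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DnsState:
    """Toy model of one endpoint's DNS plans (all names hypothetical)."""
    plans: dict = field(default_factory=dict)  # plan_id -> list of IPs
    active_plan: Optional[int] = None


def apply_plan(state, plan_id, ips, check_version):
    """Enactor step: make plan_id the active plan.

    With check_version=True, a plan older than the active one is rejected.
    """
    if check_version and state.active_plan is not None and plan_id < state.active_plan:
        return False  # stale plan rejected, active plan untouched
    state.plans[plan_id] = ips
    state.active_plan = plan_id
    return True


def cleanup(state, newest_plan_id):
    """Cleanup step: delete every plan older than the newest known plan."""
    for plan_id in [p for p in state.plans if p < newest_plan_id]:
        del state.plans[plan_id]


def resolve(state):
    """Return the active plan's IPs, or 'SERVFAIL' if that plan was deleted."""
    return state.plans.get(state.active_plan, "SERVFAIL")


def run(check_version):
    state = DnsState()
    apply_plan(state, 2, ["192.0.2.2"], check_version)  # fresh plan applied
    apply_plan(state, 1, ["192.0.2.1"], check_version)  # delayed retry of older plan
    cleanup(state, newest_plan_id=2)                    # removes plan 1
    return resolve(state)
```

Calling `run(False)` reproduces the failure mode in miniature (the stale plan becomes active and is then cleaned up), while `run(True)` keeps the newer plan resolvable.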


Error Trace Extracts

[04:18:09] DNS_ENACTOR: RetryExceededException - 5 retries exceeded threshold

[04:22:34] DNS_PLANNER: Cleanup job initiated. OldPlanID=20251020A, NewPlanID=20251020B

[04:22:46] DNS_ENACTOR: Overwriting active endpoint plan - timestamp stale (Δ=11.4s)

[04:23:12] ROUTE53-EDGE: SERVFAIL dynamodb.us-east-1.amazonaws.com

[04:28:10] EC2_CONTROL_PLANE: ResourceNotReadyException

[04:35:45] API_GATEWAY: 504 Gateway Timeout

[04:37:18] LAMBDA: ThrottledRequestsException
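To read the timeline in these extracts, the entries can be parsed and converted to offsets from the first event. The following is a small sketch that assumes every entry follows the `[HH:MM:SS] COMPONENT: message` shape seen above; the helper names `parse` and `offsets` are illustrative only.

```python
import re
from datetime import datetime

# Matches "[HH:MM:SS] COMPONENT: message" as used in the extracts above.
LINE = re.compile(r"\[(\d{2}:\d{2}:\d{2})\]\s+([A-Z0-9_-]+):\s+(.*)")


def parse(line):
    """Return (time, component, message), or None if the line does not match."""
    m = LINE.match(line.strip())
    if m is None:
        return None
    ts, component, message = m.groups()
    return datetime.strptime(ts, "%H:%M:%S"), component, message


def offsets(lines):
    """Seconds elapsed since the first parsed entry, per component."""
    events = [e for e in map(parse, lines) if e is not None]
    start = events[0][0]
    return [(int((t - start).total_seconds()), c) for t, c, _ in events]


EXTRACT = [
    "[04:18:09] DNS_ENACTOR: RetryExceededException - 5 retries exceeded threshold",
    "[04:22:34] DNS_PLANNER: Cleanup job initiated. OldPlanID=20251020A, NewPlanID=20251020B",
    "[04:23:12] ROUTE53-EDGE: SERVFAIL dynamodb.us-east-1.amazonaws.com",
]
```

For these three entries the offsets are 0 s, 265 s, and 303 s, so the first SERVFAIL entry appears 38 s after the cleanup job started.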


Affected Services and APIs

DynamoDB: DescribeEndpoints, UpdateDNSPlan, CleanupDNSPlan

EC2: CreateInstance, DescribeNetworkInterfaces

Lambda: RequestId failure due to concurrency overflow

API Gateway: 503/504 errors

Fargate & Redshift: Launch delays


DNS, Keys, and Cache

DNSKEY propagation halted 04:22–04:45 UTC

TTL expirations (~60s) triggered global client SERVFAIL after 1 minute

Recursive resolvers 8.8.8.8 / 1.1.1.1 confirmed cached record expiry

DNSSEC-signed zones intact; unsigned dynamic zones affected.
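The effect of the ~60 s TTL on when a client first sees the failure can be sketched as follows. This is a simplified model, assuming a client re-resolves exactly when its cached record expires and that every lookup after the record is deleted fails; the function names are hypothetical.

```python
import math


def first_failed_lookup(cached_at, deleted_at, ttl=60.0):
    """Time of the first lookup that fails for a client caching with this TTL.

    The client refreshes at cached_at + k * ttl (k >= 1); the first refresh
    at or after deleted_at is the first one that fails.
    """
    if deleted_at < cached_at:
        raise ValueError("record must exist when it is cached")
    elapsed = deleted_at - cached_at
    refreshes = max(1, math.ceil(elapsed / ttl))
    return cached_at + refreshes * ttl


def observed_lag(cached_at, deleted_at, ttl=60.0):
    """Delay between deletion of the record and the client noticing it."""
    return first_failed_lookup(cached_at, deleted_at, ttl) - deleted_at
```

Under this model a client that cached the record 55 s before the deletion notices after 5 s, while one that cached it at the moment of deletion notices only after the full 60 s, so the externally visible failure spreads over up to one TTL.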

Load Balancer and Proxy Behavior

External NLB health checks degraded after 15 s persistence of SERVFAIL responses. Internal proxies attempted subnet reroutes, causing transient packet loss. Reverse proxies failed to complete handshakes after IP deletions, increasing backpressure on queues.
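The 15 s persistence rule can be expressed as a small check over a series of probe results. This is only an illustration of the idea, not the actual NLB health-check algorithm; the probe format and the function name `degraded_at` are assumptions.

```python
def degraded_at(probes, persistence=15.0):
    """Return the time a target is marked degraded, or None.

    probes: iterable of (timestamp_seconds, ok) pairs in time order.
    The target is marked degraded once failures have persisted without
    interruption for at least `persistence` seconds.
    """
    first_failure = None
    for t, ok in probes:
        if ok:
            first_failure = None  # any success resets the failure window
            continue
        if first_failure is None:
            first_failure = t
        if t - first_failure >= persistence:
            return t
    return None
```

For example, failures probed at 5, 10, 15, and 20 s mark the target degraded at 20 s, whereas a single successful probe in between resets the window.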


Client-Level Errors

ERR_NAME_NOT_RESOLVED (Chrome/Edge)

ETIMEDOUT (NodeJS/Python SDK)

503 Service Unavailable / 504 Gateway Timeout (ChatGPT, Roblox)

SocketTimeoutException (Java SDK)


AI Service Continuity Implications

AI inference systems like ChatGPT experienced inference queue buildup, authentication token errors, and delayed model responses. Session token refreshes failed while DNS was unavailable.


Observed:

Token/session refresh errors (401/429)

Proxy gateway 504s

Retry storm under control-plane failure


Recommendations:

Multi-region AI control plane replication

Hybrid DNS + cached token validation

Redundant inference gateways
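One way to read the "cached token validation" recommendation is a validator that keeps serving its last known result for a bounded grace period when the upstream check is unreachable. The sketch below is an assumption about what such a component could look like; `CachedValidator`, its parameters, and the `validate` callback are hypothetical names, and the TTL and grace values are placeholders.

```python
import time


class CachedValidator:
    """Caches token validation results and tolerates short upstream outages."""

    def __init__(self, validate, ttl=60.0, grace=300.0, clock=time.monotonic):
        self._validate = validate  # callable(token) -> bool, may raise OSError
        self._ttl = ttl            # seconds a result counts as fresh
        self._grace = grace        # extra seconds a stale result may be served
        self._clock = clock
        self._cache = {}           # token -> (is_valid, checked_at)

    def is_valid(self, token):
        now = self._clock()
        cached = self._cache.get(token)
        if cached is not None and now - cached[1] < self._ttl:
            return cached[0]  # fresh cache hit, no upstream call
        try:
            result = self._validate(token)
        except OSError:
            # Upstream unreachable (DNS failure, timeout, refused connection):
            # serve the stale result only while it is inside the grace window.
            if cached is not None and now - cached[1] < self._ttl + self._grace:
                return cached[0]
            raise
        self._cache[token] = (result, now)
        return result
```

The trade-off is that a revoked token can remain accepted for up to `ttl + grace` seconds during an outage, so the grace window should be chosen with that exposure in mind.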


Clarification: Felt vs Observed Race Condition

The race condition was felt internally (AWS DNS Enactor/Planner) and observed externally only after TTL expiration. The felt phase triggered within milliseconds; observed impact lagged 30–60 s globally. This distinction separates the internal synchronization error from the external delay in client visibility.



Reflection: Inference Continuity Strength

AI inference maintained degraded but ongoing operations due to probabilistic fallback. Traditional orchestration (DNS/EC2) failed under deterministic race conditions. This shows that inference-based systems can sustain limited autonomy under partial failure, marking a resilience threshold for next-gen distributed architectures.

__________________________________________________________________________________

Financial Lens / Total Cost of Ownership (TCO): Amazon Q3 2025 & Annual 2024 Analysis

__________________________________________________________________________________

1. Context

This section extends the AWS US-EAST-1 Outage (Oct 2025) reconstruction to include a financial and asset-ownership interpretation. It examines Amazon’s Q3 2025 report (released Oct 30 2025) and its FY 2024 Annual Report, focusing on revenue composition, expense allocation, and implicit outage-related costs within Amazon’s broader cost-of-ownership model.


2. Reported Financials (Amazon Q3 2025 vs 2024 Annual Report)

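| Metric | FY 2023 | FY 2024 | Q3 2025 Reported | Derived / Adjusted (Analytical) | Notes |
|---|---|---|---|---|---|
| Revenue | $514 B | $574 B | $180.2 B | | Quarterly revenue up 12 % YoY |
| Operating Income | $36.9 B | $68.6 B | $17.4 B | ≈ $21.7 B (adj.) | –68 % (adj. quarter vs FY 2024 full year; not a like-for-like YoY comparison) |
| Net Income | $30.4 B | $59.2 B | $17.4 B | ≈ $20.5 B (adj.) | –65 % (adj. quarter vs FY 2024 full year; not a like-for-like YoY comparison) |
| EPS (Basic) | $2.95 | $5.66 | $1.58 | $1.85 (adj.) | –67 % (adj. quarter vs FY 2024 full year; not a like-for-like YoY comparison) |
| Technology & Infrastructure Expense | $85.6 B | $88.5 B | $91 B | $91 B (+ ≈ $0.5 B outage absorption) | Slight increase YoY |
| Legal Settlement Cost | | | $2.5 B | | One-time charge |
| Severance Cost | | | $1.8 B | | Staff reductions |
| Outage Impact (Est.) | | | | ≈ $0.44 B (≈ $72.8 M/hr × 6 hrs) | Absorbed within Tech & Infra budget |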

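All dollar values in USD; adjusted figures estimated using proportional weightings from prior quarterly disclosures. The adjusted operating income is also consistent with adding the one-time legal settlement and severance charges back to the reported figure: $17.4 B + $2.5 B + $1.8 B = $21.7 B.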


3. Total Cost of Ownership (TCO) Interpretation

Definition (Internal View): TCO represents the complete economic burden of owning and operating an asset over its lifecycle — including purchase price, maintenance, depreciation, and extraordinary costs (e.g., outages or settlements).

Amazon Lens: The company’s reported “Technology & Infrastructure” line effectively acts as its TCO for core digital operations.

Preferido × Nuture Lens: TCO is realized only once ownership is claimed — analogous to the moment Abhishek Kumar’s team procures a service and assumes its maintenance and depreciation risk.

Depreciation Cycle: If a system is capitalized over five years on a straight-line basis and the organization is in year four, then ~80 % of its economic life has been realized and the remaining TCO is ≈ 20 % of the original cost.

Example:

Base TCO = $100 K

Depreciation Life = 5 years

Current Year = 4

Residual = $100 K × (1 – 4/5) = $20 K remaining TCO
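The worked example can be checked with a few lines of Python. This is a minimal sketch that assumes straight-line depreciation with no salvage value; the function name `residual_tco` is illustrative.

```python
def residual_tco(base_cost, life_years, years_elapsed):
    """Remaining TCO under straight-line depreciation with no salvage value.

    Algebraically the same as base_cost * (1 - years_elapsed / life_years),
    written in a form that avoids floating-point rounding for whole years.
    """
    if life_years <= 0:
        raise ValueError("life_years must be positive")
    # Clamp so the residual never goes negative or exceeds the base cost.
    elapsed = min(max(years_elapsed, 0), life_years)
    return base_cost * (life_years - elapsed) / life_years
```

With the figures above, `residual_tco(100_000, 5, 4)` gives 20,000, matching the $20 K residual in the example.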



4. Equity, Liability, and Loss Relationship

Financial Component

Description

Flow in Outage Scenario

Asset (Owned Infrastructure)

Capitalized value of hardware + software

Base for TCO

Liability / Loss

Incurred cost from downtime or severance

Short-term negative impact

Equity (Ownership Value)

Residual claim after liabilities

May recover once systems restored

Profit (Realized Value)

When asset and equity yield positive cash flow

Equity → Profit transition

Thus, profit is realized once the owned asset (TCO) has offset its liability burden — i.e., when operational continuity re-establishes cash flow.


5. Analytical Derivation: Outage Cost vs Operational Efficiency

Outage Cost = $72.8 M/h × 6 h = $436.8 M

This represents ≈ 0.48 % of the Tech & Infrastructure budget ($91 B). Amazon likely absorbed this internally as a non-material expense (< $0.6 B threshold).

Efficiency Ratio: Operating Income / Revenue = 17.4 / 180.2 ≈ 9.7 %. If the outage cost is added back to operating income, the ratio rises to ≈ 9.9 %, suggesting Amazon contained the event without material financial distortion.
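The arithmetic in this section can be reproduced directly. The sketch below uses only the figures quoted in this document, expressed in millions of USD; the function and constant names are illustrative, and the hourly loss rate itself remains an estimate.

```python
# Figures quoted in this document, in millions of USD.
HOURLY_LOSS_M = 72.8
OUTAGE_HOURS = 6
TECH_INFRA_BUDGET_M = 91_000
OPERATING_INCOME_M = 17_400
REVENUE_M = 180_200


def outage_cost(hourly_loss_m, hours):
    """Estimated outage cost in millions of USD."""
    return hourly_loss_m * hours


def share_of_budget(cost_m, budget_m):
    """Outage cost as a fraction of the Tech & Infrastructure budget."""
    return cost_m / budget_m


def efficiency_ratio(operating_income_m, revenue_m, add_back_m=0.0):
    """Operating income over revenue, optionally adding back a one-off cost."""
    return (operating_income_m + add_back_m) / revenue_m


cost = outage_cost(HOURLY_LOSS_M, OUTAGE_HOURS)
print(f"Outage cost: ${cost:.1f} M")
print(f"Share of budget: {share_of_budget(cost, TECH_INFRA_BUDGET_M):.2%}")
print(f"Efficiency ratio: {efficiency_ratio(OPERATING_INCOME_M, REVENUE_M):.1%}")
print(f"Adjusted ratio: {efficiency_ratio(OPERATING_INCOME_M, REVENUE_M, cost):.1%}")
```

Keeping the add-back as a separate parameter makes it easy to see how little the ratio moves when the estimated outage cost is excluded.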


6. Observed Pattern and Profit Projection

Net Operating Profit After Tax (NOPAT): ≈ $17.4 B (Q3 2025) → projected FY profit ≈ $4.8 B assuming Q4 softness and one-time charges.

Cash Flow: Reflects standard Amazon seasonality but shows temporary dilution from settlements and restructuring.

Equity Stability: No significant erosion; TCO absorbed loss as internal cost reallocation.


7. Conceptual Summary

Reported Asset ≈ TCO for Amazon core operations.

TCO → Liability when damaged (e.g., outage, severance).

Liability → Equity when recovered through restoration or renewed value.

Equity → Profit when converted into positive cash flow.


8. Referenced Reports


Amazon.com Inc. Q3 2025 Earnings Release (issued Oct 30 2025)

Amazon.com Inc. Annual Report 2024 (10-K)

AWS US-EAST-1 Outage Technical Analysis (Abhishek Kumar, Preferido / Nuture.ETH Research Lab 2025)

__________________________________________________________________________________

Disclaimer

This document is for internal research and educational reference by Abhishek Kumar. It is not for public release or citation. All event data are reconstructed from publicly available sources and independent technical inference. Unauthorized use is prohibited.

Ethical and Analytical Disclaimer

This analysis is prepared for internal research, financial awareness, and responsible education under the Preferido × Nuture Initiative. All figures are independently derived estimates based on publicly available data. No proprietary information from Amazon or third parties has been used. Misuse, distortion, or unattributed distribution is strictly prohibited.

2 Comments

Guest
Nov 23, 2025
Rated 5 out of 5 stars.

Following this recent outage, we also had the Cloudflare outage. "File size": not the file size of the attachment, but the file size of the control panel, which also enables different services for end users. Were these independent files, or files under folders, directories, or file systems? Since when? Who is currently looking into the details of this, or doing a related deep dive? 👍


Guest
Nov 15, 2025
Rated 5 out of 5 stars.

And this continues to be still active at CA Office.


Abhishek Kumar© 2002-2025.
All rights reserved. Preferido.
ENS Domain: NUTURE.ETH.