AWS overheating in Virginia is a reminder that cloud reliability is now a physical problem

Amazon Web Services reported overheating at a data center in Northern Virginia on Thursday. The resulting power loss disrupted services in the US-East-1 region and affected customers including Coinbase, which reported degraded performance while assuring users that funds remained safe.

The incident was confined to a single data center within US-East-1, AWS's largest and most critical region. The overheating caused a power outage that affected specific hardware, with AWS working to restore normal temperatures. Services that depend on the affected facility were impaired, and Coinbase said some users may have seen degraded performance as a result. AWS has not provided a timeline for full recovery, but the company said engineers were automatically engaged and were investigating mitigations.

Northern Virginia is not just any cloud region. It is the world's largest concentration of data center capacity, home to AWS's US-East-1 region and a significant share of the internet's backbone infrastructure. US-East-1 handles more traffic than any other AWS region, so even short disruptions are felt widely. Coinbase, which relies on AWS for core trading, wallet, and API services, confirmed the impact but said the issue did not compromise customer funds. That distinction matters. The outage was operational, not a security event, but it still interrupted the customer experience in a business where crypto exchanges have little tolerance for downtime.

This is not an isolated event. US-East-1 has been the source of multiple outages over the past two years, including a 2025 incident that took down ChatGPT, Signal, Coinbase, and Fortnite after a DNS failure tied to DynamoDB, and earlier power and DNS disruptions that cascaded globally. The pattern is not random. As AI workloads fill data centers with high-density GPU racks, the thermal load rises sharply. Traditional cloud servers generate heat too, but AI accelerators running inference or training workloads draw more power per rack and need more sophisticated cooling to handle the concentrated thermal output. That density is now spilling into mainstream cloud operations.

For SF readers, the AWS overheating incident is a reminder that cloud reliability is becoming a physical infrastructure problem, not just a software one. Founders building AI, crypto, or high-availability applications cannot fully abstract away the constraints of power, cooling, and physical hardware density. US-East-1's centrality makes it a single point of failure for teams that choose convenience over resilience. When overheating disrupts a single data center, it exposes how much of the internet still routes through a narrow set of facilities in one geography.

The AI data center density angle is the part that should concern infrastructure-dependent startups most. High-density GPU racks can draw 100 kilowatts or more, compared with 10 to 20 kilowatts for a rack of traditional servers. Liquid cooling, backup power, and thermal management systems are now as critical as software redundancy. When AWS reports overheating, it is a sign that even the largest operator is pushing the limits of what existing facilities can handle under mixed workloads. That creates fragility. A cooling failure in one building can cascade to power distribution, network routing, and API availability across an entire region. Startups that treat cloud uptime as a given need to price that assumption more conservatively.
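To put the rack figures above in cooling terms, here is a rough back-of-the-envelope sketch. It assumes that essentially all electrical power drawn by a rack ends up as heat the facility must remove, and it uses the standard conversion of about 3.517 kilowatts per ton of refrigeration. The 15 kilowatt midpoint for a traditional rack and the 1,000 kilowatt hall are illustrative assumptions, not numbers from AWS.

```python
# Rough cooling-load arithmetic. Assumes essentially all electrical power
# drawn by IT equipment is released as heat that the facility must remove.

KW_PER_TON = 3.517  # 1 ton of refrigeration is about 3.517 kW of heat removal


def cooling_tons(rack_kw: float) -> float:
    """Return the approximate tons of refrigeration needed for one rack."""
    return rack_kw / KW_PER_TON


def racks_per_budget(budget_kw: float, rack_kw: float) -> int:
    """Return how many racks fit inside a fixed power and cooling budget."""
    return int(budget_kw // rack_kw)


if __name__ == "__main__":
    # 15 kW is an assumed midpoint of the 10 to 20 kW range cited above.
    for label, kw in [("traditional rack", 15.0), ("GPU rack", 100.0)]:
        print(f"{label}: {kw:.0f} kW -> {cooling_tons(kw):.1f} tons of cooling")

    # A hypothetical 1,000 kW data hall, chosen only for illustration.
    print(
        racks_per_budget(1000.0, 15.0), "traditional racks vs",
        racks_per_budget(1000.0, 100.0), "GPU racks",
    )
```

Under these assumptions a single GPU rack needs roughly as much heat removal as six or seven traditional racks, which is why a hall designed for conventional servers has far less thermal headroom once GPU racks move in.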

Coinbase's exposure is particularly instructive for crypto founders. The exchange has spent years building multi-cloud and multi-region redundancy, but US-East-1 remains a core dependency. When AWS has an issue there, Coinbase feels it. The same is true for most crypto infrastructure, from wallets to DEXs to custody providers. The lesson is not to abandon the cloud. It is to treat physical infrastructure as a first-order risk factor and build failover capacity, data replication, and regional diversity into the architecture from the beginning. Coinbase's quick communication that funds were safe shows good crisis management. But degraded performance during trading hours still costs real money.
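As a minimal sketch of what "failover capacity" can mean in code, the snippet below picks the first healthy region from a priority list. The function name `pick_region`, the health-check callback, and the status table are all hypothetical and used only for illustration; this is not how Coinbase or AWS implement failover, and a real system would also need data replication and traffic steering behind it.

```python
from typing import Callable, Optional, Sequence


def pick_region(
    regions: Sequence[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first region in priority order that passes its health check."""
    for region in regions:
        try:
            if is_healthy(region):
                return region
        except Exception:
            # Treat a failing or timing-out probe as an unhealthy region.
            continue
    # No region passed its check; the caller decides how to degrade.
    return None


if __name__ == "__main__":
    # Hypothetical status table standing in for real health probes.
    status = {"us-east-1": False, "us-west-2": True, "eu-west-1": True}
    chosen = pick_region(
        ["us-east-1", "us-west-2", "eu-west-1"],
        lambda region: status[region],
    )
    print("serving from:", chosen)
```

The design point is that the fallback order is decided ahead of time, so an outage in the primary region triggers a tested path instead of an improvised one.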

The bigger question is whether these incidents will accelerate a shift away from cloud concentration. Multi-region deployments are expensive and complex, but they reduce the blast radius of a single data center failure. On-premises and edge deployments are gaining traction for AI workloads precisely because they avoid the shared failure modes of mega-regions like US-East-1. Founders who can deliver reliable inference without depending on Virginia's power grid and cooling capacity may find themselves with a real advantage as these incidents multiply.
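The trade-off behind multi-region deployments can be sketched with simple availability arithmetic. The example assumes each region is up 99.9 percent of the time and that regions fail independently; both are illustrative assumptions, and the second is optimistic, since the cascading failures described earlier show that real outages are often correlated.

```python
HOURS_PER_YEAR = 24 * 365


def combined_availability(per_region: float, regions: int) -> float:
    """Probability that at least one of `regions` independent regions is up."""
    return 1.0 - (1.0 - per_region) ** regions


def downtime_hours(availability: float) -> float:
    """Expected hours of downtime per year at the given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    per_region = 0.999  # assumed figure, not a published AWS number
    for count in (1, 2):
        availability = combined_availability(per_region, count)
        print(
            f"{count} region(s): {availability:.6f} available, "
            f"about {downtime_hours(availability):.3f} hours down per year"
        )
```

The independence assumption is where the estimate is weakest: any dependency shared across regions, such as a control plane hosted in one of them, pulls the real number back toward the single-region case.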
