How a Load Balancing error brought down thousands of services

On October 20, 2025, the internet witnessed one of the largest digital outages in recent history. For over eight hours, thousands of services went offline or experienced severe instability. The culprit? A problem in the load balancing system of Amazon Web Services (AWS), the world's largest cloud infrastructure provider.
Snapchat, Fortnite, PicPay, iFood, Mercado Livre, and countless other services were affected. But what exactly happened? And why can a single problem at AWS cause such a devastating domino effect?
The outage
At 4:12 AM (Brasília time), the first problem reports began. AWS would later trace the incident to a critical failure in an internal subsystem responsible for monitoring the health of network load balancers in the US-EAST-1 region, located in Northern Virginia, United States.
The surrounding Northern Virginia area is the largest concentration of data centers in the world, with nearly 400 facilities across all providers. Because US-EAST-1 is also among AWS's cheapest regions (helped in part by local tax incentives), it is extremely popular among Brazilian and international companies. It's estimated that a large portion of the data processed by Brazilian services passes through there.
The initial problem affected DynamoDB, the managed database service that many other AWS services rely on internally, and quickly spread to other critical services like EC2 (virtual servers) and Lambda (serverless code execution). Since these services are the foundation for thousands of applications, the impact was immediate and global.
Timeline
The first problem reports started at 4:12 AM. Less than 40 minutes later, at 4:51 AM, AWS confirmed increased errors and latency in its systems. At 5:26 AM, the problem was identified in DynamoDB, one of the platform's core database services. The first corrections began at 6:22 AM, but the problem was far from resolved.
The situation worsened drastically at 11:14 AM, when the system status was changed to "degrading." Only at 12:43 PM did AWS manage to identify the root cause: the load balancer monitoring subsystem. Additional mitigation measures were applied at 1:13 PM, but the damage was already done.
More than 6.5 million outage reports were registered on DownDetector throughout the day. According to Amazon, 91 internal AWS services were impacted simultaneously, creating a cascading effect that spread across the entire internet.
What is Load Balancing?
Imagine a restaurant with only one cashier. If 50 people arrive at the same time, a huge line forms and service becomes slow. The solution? Open more cashiers and distribute customers among them intelligently.
Load balancing is exactly that, but for servers. It's a fundamental technique that distributes network traffic or application requests across multiple servers, ensuring that no server becomes overloaded while others remain idle.
How it works
A load balancer acts as an "intelligent doorman" that sits between users and servers. When you access a website or app, your request doesn't go directly to a specific server—it first passes through the load balancer, which decides which server is best positioned to handle it.
[User] → [Load Balancer] → [Server 1]
                         → [Server 2]
                         → [Server 3]
                         → [Server 4]
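To make this concrete, here is a minimal sketch in Python. Everything in it (the Server class, pick_server, the server names) is illustrative rather than any real AWS API: a balancer keeps a pool of servers and, for each incoming request, picks one of them and forwards the request to it.

import random

class Server:
    def __init__(self, name):
        self.name = name
        self.healthy = True            # updated by health checks (covered below)
        self.active_connections = 0

    def handle(self, request):
        # A real server would do actual work; here we just echo the request.
        return f"{self.name} handled {request}"

class LoadBalancer:
    def __init__(self, servers):
        self.servers = servers

    def healthy_servers(self):
        return [s for s in self.servers if s.healthy]

    def pick_server(self):
        # Simplest possible strategy: any healthy server, chosen at random.
        return random.choice(self.healthy_servers())

    def route(self, request):
        server = self.pick_server()
        server.active_connections += 1
        try:
            return server.handle(request)
        finally:
            server.active_connections -= 1

pool = LoadBalancer([Server("server-1"), Server("server-2"), Server("server-3")])
print(pool.route("GET /index.html"))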
Distribution strategies
There are different algorithms to decide which server should receive each request. The Round Robin method, for example, distributes requests in a circular fashion, sending one to each server in sequence. The Least Connections algorithm sends each new request to the server with the fewest active connections at the moment, better balancing the actual load.
Other strategies include IP Hash, which uses the client's IP address to consistently determine which server will handle it, and the Weighted method, which distributes traffic based on each server's capacity. There's also Geographic routing, which directs users to geographically closer servers, reducing latency.
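These strategies boil down to different ways of choosing a server from the pool. The functions below are simplified sketches built on the hypothetical Server class from the previous example; real load balancers implement the same ideas in highly optimized network code.

from itertools import cycle

def round_robin_picker(servers):
    # Hand out servers in a fixed circular order: 1, 2, 3, 1, 2, 3, ...
    rotation = cycle(servers)
    return lambda: next(rotation)

def least_connections(servers):
    # Choose the server with the fewest active connections right now,
    # which follows the actual load more closely than a fixed rotation.
    return min(servers, key=lambda s: s.active_connections)

def ip_hash(servers, client_ip):
    # The same client IP always lands on the same server, useful when
    # session state lives on a specific machine.
    return servers[hash(client_ip) % len(servers)]

Weighted and geographic routing follow the same pattern, just with extra inputs (each server's capacity, the client's location).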
Health Checks
A critical aspect of load balancers is health monitoring, known as health checks. The balancer constantly verifies if each server is healthy and ready to receive traffic. When a server is responding quickly, it receives the normal load of requests. If the server starts to become slow or present errors, the load balancer automatically reduces the amount of traffic directed to it. And when a server goes completely offline, it is immediately removed from rotation, ensuring no user is affected.
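In practice, a health check is often just a periodic, lightweight HTTP request to each server, with a server taken out of rotation after a few consecutive failures and put back once it recovers. The sketch below assumes each backend exposes a /health endpoint, which is a common convention, not anything AWS-specific.

import time
import urllib.request

FAILURE_THRESHOLD = 3         # consecutive failures before removal
CHECK_INTERVAL_SECONDS = 10   # how often every server is probed

def check_once(pool, failures):
    for server in pool.servers:
        try:
            # Probe a lightweight endpoint; an HTTP 200 means "healthy".
            with urllib.request.urlopen(f"http://{server.name}/health", timeout=2) as resp:
                ok = resp.status == 200
        except Exception:
            ok = False

        if ok:
            failures[server.name] = 0
            server.healthy = True        # back in rotation
        else:
            failures[server.name] = failures.get(server.name, 0) + 1
            if failures[server.name] >= FAILURE_THRESHOLD:
                server.healthy = False   # removed from rotation

def health_check_loop(pool):
    failures = {}
    while True:
        check_once(pool, failures)
        time.sleep(CHECK_INTERVAL_SECONDS)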
It was precisely in this monitoring system that the AWS failure occurred.
Why is it critical?
Load balancing is fundamental to keeping the internet running reliably and efficiently. First, it ensures high availability: if a server goes down, the load balancer automatically redirects traffic to healthy servers, and users don't even notice there was a problem. This automatic recovery capability is essential for services that cannot stop.
Scalability is another crucial benefit. When more users need to be served, you simply add more servers to the pool and the load balancer automatically distributes traffic to them. There's no need to reconfigure the entire infrastructure or make complex changes.
Additionally, distributing the load among multiple servers prevents any of them from becoming overloaded, keeping response times fast and consistent for all users. This directly impacts the end user experience, who perceives the service as fast and responsive.
Finally, load balancing offers flexibility for maintenance. It's possible to remove servers from the pool for updates, fixes, or improvements without taking down the entire service. The load balancer simply stops sending traffic to those servers temporarily, allowing maintenance without downtime.
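With the earlier sketch in mind, both scaling and maintenance come down to changing the pool: scaling out appends a server, and maintenance takes one out of rotation and puts it back afterwards (a simplified stand-in for the connection draining real load balancers perform).

# Scale out: add capacity and the balancer starts using it automatically.
pool.servers.append(Server("server-4"))

# Maintenance without downtime: drain traffic away, update, re-enable.
pool.servers[0].healthy = False   # server-1 stops receiving new requests
# ... apply the update to server-1 ...
pool.servers[0].healthy = True    # server-1 rejoins the rotation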
What went wrong
According to Amazon, the problem was in an internal subsystem responsible for monitoring the health of network load balancers.
In simple terms: the system that checked if the load balancers were working correctly started having problems. This created a devastating cascade effect. First, the monitoring system failed, causing load balancers to start receiving incorrect information about server health. With wrong data, requests were sent to servers that couldn't handle them properly.
The situation worsened when new EC2 instances could no longer be launched: AWS intentionally throttled instance creation to keep the problem from deteriorating further. Services that depended on those resources began failing in sequence. DynamoDB, Lambda, and other critical services became unstable, and since thousands of applications depend directly on these fundamental AWS services, they stopped working too, creating the widespread outage that affected users worldwide.
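As a rough illustration of why a broken monitor is so dangerous (a toy model, not AWS's actual architecture): if the monitor keeps feeding the balancer stale or wrong health data, servers that have already failed stay in rotation and keep receiving traffic they cannot handle.

last_known_status = {"server-1": True, "server-2": True, "server-3": True}

def broken_monitor(pool):
    # The monitor has failed: it never refreshes its view of the world,
    # so every server keeps the status it had before the incident.
    for server in pool.servers:
        server.healthy = last_known_status.get(server.name, True)   # stale data

pool.servers[1].healthy = False   # server-2 actually goes down...
broken_monitor(pool)              # ...but the stale monitor marks it healthy again
print(pool.pick_server().name)    # requests may still be routed to server-2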
The domino effect
AWS has 37% of the global cloud market. When it fails, it's not just "one website" that goes down—it's an infrastructure that supports a large part of the modern internet.
Think of it this way: if AWS were a power company, it would be like a problem at one power plant causing a blackout across an entire metropolitan region. It doesn't matter if your house has good wiring or modern equipment—without power from the source, nothing works.
Conclusion
The AWS outage of October 2025 was a reminder that even the most sophisticated systems can fail—and when they fail in critical components like load balancers, the impact is massive.
Load balancing isn't just an optimization technique; it's the backbone of the modern internet. It's what allows billions of people to access their favorite services simultaneously without everything collapsing.
Need help architecting resilient systems? At Tucupy, we help companies build robust infrastructures that withstand failures and scale with confidence. Get in touch to discuss your project.