Fairness in multi-tenant systems
Architecture | LEVEL 400
Introduction
This article looks at a few approaches Amazon has taken to manage API requests to its systems and avoid overload by implementing API rate limiting (also referred to as “throttling” or “admission control”). Without these kinds of protections, a system becomes overloaded when more traffic flows into it than it is scaled to handle at that time. API rate limiting lets us shape incoming traffic in different ways, such as prioritizing client workloads that remain within their planned usage while applying backpressure to a client workload that spikes unpredictably. In this article, I’ll cover topics ranging from admission control algorithms to operational considerations for safely updating quota values in a production system. I also focus on ways Amazon has designed APIs with fairness in mind, both to provide predictable performance and availability and to help customers avoid workload patterns that could lead to rate limiting.
However, before I get into the details of how Amazon systems implement rate limiting, I’ll first describe why Amazon has found rate-based quotas to be important, and why they lead to higher availability for clients of multi-tenant systems. Although you can read this article on its own, if you want some background on how systems behave when they are overloaded, check out the article Using load shedding to avoid overload.
The case for multitenancy
Service-oriented architecture (SOA) is a core part of the Amazon culture of strong ownership and loose coupling of teams and systems. This architecture comes with another important benefit—improved hardware efficiency through resource sharing. If another application wants to make use of an existing service, the service owner doesn’t need to do much work to take on a new tenant. After reviewing the use case and performing a security review, the service owner grants the new client system access to call particular APIs or access particular data. No additional infrastructure setup or installation needs to happen for the service owner to create and operate a new copy of the service for this new use case—it just reuses the existing one.
Resource sharing is a central benefit in multi-tenant systems. A multi-tenant system handles multiple workloads, such as work from multiple customers at once. This system can also handle low-priority, non-urgent workloads along with high-priority, urgent workloads. A single-tenant system, on the other hand, handles workloads from a single customer.
Resource sharing is a powerful concept behind SOA and cloud architecture, leading to savings in infrastructure cost and human operational cost. In addition, resource sharing can lead to reduced environmental costs because higher utilization means fewer servers, and therefore less energy is needed to power and cool the infrastructure.
The evolution of multi-tenant databases
When I compare single-tenant and multi-tenant systems, I like to think of the differences between types of database systems. Nearly all of the systems I’ve worked on as a software engineer have needed some sort of database to store state. Some systems are low-traffic tools to make someone’s life easier occasionally, and other systems are mission-critical and serve enormous amounts of traffic.
Early in my career at Amazon, I worked on a team that was responsible for automating the operations of the Amazon.com web server fleet. My coworkers and I built systems that forecasted how many servers we would need to provision over time for each website fleet, monitored the health of the servers, and performed automatic remediation and replacement of broken servers. We also built systems that helped teams deploy software to the website.
To manage state for many of these tools, we needed a database. At Amazon today, the default choice would be a NoSQL database because these databases are designed for manageability, scalability, availability, and predictable performance. However, this story took place before those technologies were widely available, so we set up some servers running MySQL, with replication pairs for high availability and redundancy. We also implemented backup and restore, and tested for failure to ensure that we could rely on this otherwise single point of failure.
When we built new services and tools that needed new databases, we were often tempted to reuse our existing databases by simply adding more tables to them. Why was that temptation so strong? For one, we looked at database server utilization and saw that the servers were not heavily loaded. Remember, these particular databases were sized to the number of developers deploying code to the website back in the mid-2000s and the number of web servers that we had at the time. In addition, the operational overhead of configuring new databases, monitoring them, upgrading them, and trying to automate every aspect of their operation was painful.
The following diagram shows an example architecture for the use of multiple website tools by Amazon.com website fleet operations a few years back. The diagram indicates that both the deployment service and the periodic fleet operator shared the same database, which our team operated.
However, when we gave in to the temptation to use the same set of database servers for multiple applications, we regretted it. I remember being on call and getting paged for a performance degradation in the deployment tool. I scratched my head trying to figure out what was wrong. Then I realized that a completely separate tool that we owned (a fleet auditor of some kind) was running its nightly state synchronization cron job. This put a lot of extra load on the shared database. The fleet auditor tool didn’t mind that the database was slow, but the deployment tool (and its users) sure did!
This constant tension between the desire to share databases (for lower costs for infrastructure and some operations) versus the need to separate databases (for better workload isolation and lower operational costs in other ways) felt like a no-win scenario—a Kobayashi Maru of sorts. The types of databases we used were meant for a single tenant, so, unsurprisingly, we ran into pain when we tried to use them in a multi-tenant way.
When Amazon Relational Database Service (RDS) was launched, it made our lives easier by automating much of that operational work. It was easier for us to run single-tenant systems as separate databases rather than sharing the same database across multiple applications. However, some workloads were quite small and other workloads varied in size, so we still needed to pay attention to the utilization of each database to get the instance sizes just right. Plus, we needed enough headroom to handle periodic swings in load.
Later at Amazon, when I was seeking out a new challenge, I learned about a new type of database we were building in AWS. The goals of the database were to be highly elastic, scalable, available, low latency, and fully managed. These goals were compelling to me because as a software engineer, I really dislike doing the same thing over and over again, especially if it’s hard to do. So, I expend a great deal of effort trying to automate those repeatable tasks (see also https://xkcd.com/1319/). This new database seemed like the perfect opportunity to finally fully automate every aspect of database maintenance that I found painful, so I joined the team that launched Amazon DynamoDB in 2012. Just like Captain Kirk, we used programming to pass the Kobayashi Maru test!
DynamoDB leverages multitenancy to provide a database that is highly elastic, durable, and available. Unlike with Amazon RDS, when I create resources in DynamoDB, I don’t even provision a whole Amazon Elastic Compute Cloud (EC2) instance. I simply interact with my database through an API, and behind the scenes DynamoDB figures out the fraction of a server that’s required for the workload. (Actually, it uses a fraction of multiple servers in multiple Availability Zones for high availability and durability.) As my workload grows and shrinks, DynamoDB adjusts that fraction and enlists more servers or fewer, as needed.
Just as with databases, there are degrees of multitenancy for general-purpose compute servers. With AWS Lambda, compute resource sharing happens at a sub-second interval, using Firecracker lightweight virtualization for resource isolation. With Amazon API Gateway, resource sharing is at the API request level. Customers of these services benefit from the advantages of multi-tenant systems: elasticity, efficiency, and ease of use. Under the hood, these services work to address challenges that come with multitenancy. Of these challenges, the one I find the most interesting is fairness.

Fairness in multi-tenant systems
Any multi-tenant service works in concert with systems that ensure fairness. Fairness means that every client in a multi-tenant system is provided with a single-tenant experience. The systems that ensure fairness in multi-tenant systems are similar to systems that perform bin packing, a classic class of algorithms in computer science. These fairness systems do the following things:
Perform placement algorithms to find a spot in the fleet for a new workload. (Similar to finding a bin with room for the workload; see the sketch after this list.)
Continuously monitor the utilization of each workload and each server to move workloads around. (Similar to moving workloads between bins to ensure that no bin is too full.)
Monitor the overall fleet utilization, and add or remove capacity as needed. (Similar to adding more bins when they’re all getting full, and removing bins when they’re empty.)
Allow workloads to stretch beyond hard-allocated performance boundaries as long as the underlying system isn’t being fully utilized, and hold workloads to their boundaries when the system is fully utilized. (Similar to allowing workloads to stretch within each bin as long as they’re not crowding out other workloads.)
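To make the bin-packing analogy concrete, here’s a minimal first-fit placement sketch in Java. The Server type, the capacity units, and the 80 percent target utilization are illustrative assumptions for the example, not a description of an actual Amazon placement system.

```java
import java.util.List;
import java.util.Optional;

// Illustrative only: a first-fit placement pass over a fleet, treating each
// server as a "bin" with a target utilization ceiling.
class PlacementSketch {
    static class Server {
        final double capacity;      // normalized capacity units (assumed)
        double allocated;           // sum of workload demand placed on this server

        Server(double capacity) { this.capacity = capacity; }

        boolean fits(double demand, double targetUtilization) {
            return allocated + demand <= capacity * targetUtilization;
        }
    }

    // Find the first server with room for the workload, leaving headroom.
    // The 0.8 target utilization is an assumed, tunable value.
    static Optional<Server> place(List<Server> fleet, double demand) {
        final double targetUtilization = 0.8;
        for (Server server : fleet) {
            if (server.fits(demand, targetUtilization)) {
                server.allocated += demand;
                return Optional.of(server);
            }
        }
        return Optional.empty(); // no room: a real system would add capacity ("more bins")
    }

    public static void main(String[] args) {
        List<Server> fleet = List.of(new Server(100), new Server(100));
        System.out.println(place(fleet, 70).isPresent());  // true, fits on the first server
        System.out.println(place(fleet, 70).isPresent());  // true, spills to the second server
        System.out.println(place(fleet, 70).isPresent());  // false, the fleet needs another "bin"
    }
}
```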
Advanced fairness systems combine these techniques in interesting ways. For example, a fairness system can monitor the utilization of each workload, estimate how likely it is that any two workloads will play nicely together, and move them together into the same bin. As long as one workload isn’t fully utilizing its provisioned resources, another workload in the same bin can borrow those resources.
For this resource sharing to work, the borrowing needs to go unnoticed by the workloads. If a workload needs to use all of its provisioned resources, the time to return those borrowed resources needs to be nearly instantaneous. In addition, workloads need to be moved quickly between bins. If a busy workload grows accustomed to exceeding its provisioned resources by borrowing from a neighbor, but its neighbor changes its behavior and starts using more of its provisioned resources, the busy workload needs to be moved to another bin.
Load shedding plus fairness
As load increases on a system, it should automatically scale up. The simplest way to do this is to add more capacity and scale horizontally. For services that adopt a serverless architecture, such as those built on AWS Lambda, horizontal scaling happens nearly instantaneously as capacity is spun up on demand to handle work. For non-serverless services, it takes longer to auto scale.
Typically, even scaling within minutes is fine. However, what happens if load increases on a service faster than Auto Scaling can add capacity? The service has two options: It can become overloaded and slow for all requests, or it can shed excess load and maintain consistent performance for the requests that it does take on. At Amazon, we greatly prefer maintaining consistent, predictable performance (option two)—especially during any overload situation. Adding increased latency to an overload scenario (option one) can cause ripple effects in a distributed system that spread the impact to other systems. The fail-fast strategy of option two helps overloaded systems continue to make progress and do useful work.
We’ve found that load shedding is one useful tool for dropping excess load in an overload scenario. Load shedding is the act of cheaply rejecting work instead of spending limited resources on it. For an HTTP service, load shedding means returning an immediate error. An example of this is returning an HTTP 503 error code. This buys time for Auto Scaling to kick in and add the necessary capacity, since the alternative to returning the error is to become slow for all requests. Because returning a load shed response to a request is significantly cheaper than fully processing the request, this approach lets the server continue to offer predictable performance for the requests it decides to fully process.
Typically, we design our services to return load shed responses to the client as quickly as possible to minimize the amount of work performed by the server. However, in some situations, we deliberately slow down these responses slightly. For example, when a load balancer is used with a least outstanding requests algorithm, we slow down fast error responses to match the latency of successful responses. This way we avoid having the load balancer send extra traffic to a server that might already be overloaded.
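To illustrate the mechanics, here’s a deliberately simplified load-shedding sketch using the JDK’s built-in HTTP server: it returns an immediate 503 once the number of in-flight requests passes a fixed limit. The limit, port, and thread pool size are assumed example values rather than recommendations; a real service would derive its limit from measured capacity.

```java
import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative load-shedding handler: cheaply reject with a 503 once the number
// of in-flight requests exceeds a fixed limit. The limit and port are assumed.
public class LoadSheddingServer {
    private static final int MAX_IN_FLIGHT = 100;
    private static final AtomicInteger inFlight = new AtomicInteger();

    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.setExecutor(Executors.newFixedThreadPool(32));
        server.createContext("/", LoadSheddingServer::handle);
        server.start();
    }

    private static void handle(HttpExchange exchange) throws IOException {
        if (inFlight.incrementAndGet() > MAX_IN_FLIGHT) {
            inFlight.decrementAndGet();
            reply(exchange, 503, "server overloaded, please retry");  // shed load cheaply
            return;
        }
        try {
            reply(exchange, 200, doExpensiveWork());  // normal processing path
        } finally {
            inFlight.decrementAndGet();
        }
    }

    private static String doExpensiveWork() {
        return "ok";  // placeholder for real request processing
    }

    private static void reply(HttpExchange exchange, int status, String body) throws IOException {
        byte[] bytes = body.getBytes(StandardCharsets.UTF_8);
        exchange.sendResponseHeaders(status, bytes.length);
        exchange.getResponseBody().write(bytes);
        exchange.close();
    }
}
```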
In a multi-tenant service, however, load shedding is not enough to make the service appear as a single-tenant service to each customer. Typically, load from multiple tenants is uncorrelated (that is, each customer has its own use case and request rate). Therefore, if the overall load increases abruptly on a service, that increase is most likely driven by a single tenant. With fairness in mind, we want to avoid failing some number of requests across all tenants in response to an unplanned load increase from a single tenant.
To add fairness to multi-tenant systems, we use rate limiting to shape unplanned increases in traffic, but we enforce quotas (maximum values for resources and actions) at a per-tenant or per-workload granularity. This way, if the multi-tenant service experiences an unplanned increase in load, the unplanned portion of that workload is rejected, and the other workloads continue operating with predictable performance.
However, the use of quotas paradoxically both increases and decreases the availability of a service. When one tenant’s workload exceeds its quota, it will start to see its excess requests fail—this can be perceived as a drop in availability. However, in reality, the service might have had plenty of capacity to serve those requests. API rate limiting is a useful technique for protecting our services' availability, but we also spend effort to help our callers avoid exceeding their quota unnecessarily.
Like load shedding, enforcing rate-based quotas involves cheaply sending an error response instead of handling the request. However, this response indicates that the client has exceeded its quota, not that the server is out of capacity. Therefore “API rate limit exceeded” responses generally are returned with a 429 status code. In general, status codes in the 500 range mean that the server failed for some reason, but status codes in the 400 range mean that the client is doing something unexpected, or, in this case, unplanned.
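To make the distinction concrete, here’s a minimal per-tenant admission check that returns a 429 when a single tenant exceeds its own rate, independent of whether the server is overloaded. It uses a simple fixed one-second window per tenant rather than the token buckets discussed later in this article, and the quota table and default value are hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative per-tenant admission check using a fixed one-second window.
// A production system would more likely use token buckets (see later section);
// this sketch only shows where a 429 ("you exceeded your quota") differs from
// a 503 ("the server is out of capacity").
class PerTenantQuota {
    private static class Window {
        long windowStartMillis;
        long count;
    }

    private final Map<String, Window> windows = new ConcurrentHashMap<>();
    private final Map<String, Long> quotasPerSecond;  // hypothetical per-tenant configuration

    PerTenantQuota(Map<String, Long> quotasPerSecond) {
        this.quotasPerSecond = quotasPerSecond;
    }

    /** Returns 200 to admit the request, or 429 when this tenant is over its own quota. */
    int admit(String tenantId) {
        long quota = quotasPerSecond.getOrDefault(tenantId, 100L);  // assumed default quota
        Window w = windows.computeIfAbsent(tenantId, id -> new Window());
        long now = System.currentTimeMillis();
        synchronized (w) {
            if (now - w.windowStartMillis >= 1000) {   // start a new one-second window
                w.windowStartMillis = now;
                w.count = 0;
            }
            w.count++;
            return w.count <= quota ? 200 : 429;
        }
    }

    public static void main(String[] args) {
        PerTenantQuota limiter = new PerTenantQuota(Map.of("tenant-blue", 2L));
        for (int i = 0; i < 4; i++) {
            System.out.println("tenant-blue -> " + limiter.admit("tenant-blue"));  // 200, 200, 429, 429
        }
    }
}
```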
Note: You might notice that some AWS services actually return 503 status codes when a rate is exceeded. The 429 status code was not officially added to the HTTP specification until 2012 in RFC 6585. Many AWS services were created before that, starting with the release of Amazon Simple Queue Service (SQS) in 2004. AWS focuses heavily on backward compatibility, so we haven’t changed the behavior of pre-existing services to avoid breaking clients unnecessarily.
Quota visibility and flexibility
Service owners often configure a quota per client. For example, for AWS services, a client is typically an AWS account. Sometimes quotas are placed on something more fine-grained than a client, such as a particular resource owned by a client, like a DynamoDB table. Service owners define rules that give each caller a default quota. If a client grows its usage in the normal course of business and is approaching its limit, or if it anticipates an upcoming load increase, it often asks the service to raise its quota.
There are a few types of quotas, each measured with its own units. One type of quota is measured as “the number of things the client can have running at the same time.” For example, Amazon EC2 implements quotas for the number of instances that can be launched by a particular AWS account. Another type is a rate-based quota, often measured in units like “requests per second.” Although this article focuses on the nuances of rate-based quotas, many of the concepts also apply to other types, so throughout this article I’ll just use the word “quota.”
The following graph demonstrates the use of quotas. It shows a service with finite capacity (total provisioned capacity is represented by the maximum of the y-axis). The service has three clients: Blue, Orange, and Gray. The service has hard-allocated each client one-third of its total capacity. The graph shows that client Blue is attempting to exceed its hard-allocated throughput, but it’s not able to do so.

Per client utilization
For this quota allocation to scale operationally, services expose information to clients about their quota and how close they are to hitting it. After all, when a client exceeds its quota, its requests begin to fail, and it likely ends up returning errors to its own callers in response. Therefore, the services provide clients with metrics that they can see and use to alarm on when their utilization is approaching the maximum quota value. For example, DynamoDB publishes Amazon CloudWatch metrics that show the throughput that is provisioned for a table, and how much of that throughput is consumed over time.
Some APIs are far more expensive for a service than others. Because of this, services might give each client a lower quota for expensive APIs. Similarly, the cost of an operation is not always known up front. For example, a query that returns a single 1 KB row is less expensive than one that returns up to 1 MB of rows. Pagination prevents this expense from getting too far out of control, but there still can be enough of a cost difference between the minimum and maximum page size to make setting the right threshold challenging. To handle this issue, some services simply count larger responses as multiple requests. One implementation of this technique is to first treat every request as the cheapest request, and then, after the API call is completed, go back and debit the client's quota based on the true request cost, possibly even pushing their quota negative until enough time passes to account for the actual usage.
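One way to implement this debit-after-the-fact technique is sketched below: each request draws one unit up front, and once the true cost is known, the remainder is charged, which can push the balance negative until the refill catches up. The cost units and refill behavior here are illustrative assumptions.

```java
// Illustrative cost-based debiting: admit a request for one unit up front, then
// charge the remaining (actual - 1) units once the response size is known. The
// balance may go negative, which blocks further requests until refill catches up.
class CostDebitingBucket {
    private final double refillPerSecond;
    private final double maxTokens;
    private double tokens;
    private long lastRefillNanos = System.nanoTime();

    CostDebitingBucket(double refillPerSecond, double maxTokens) {
        this.refillPerSecond = refillPerSecond;
        this.maxTokens = maxTokens;
        this.tokens = maxTokens;
    }

    private void refill() {
        long now = System.nanoTime();
        tokens = Math.min(maxTokens, tokens + (now - lastRefillNanos) / 1e9 * refillPerSecond);
        lastRefillNanos = now;
    }

    /** Charge the minimum (one unit) up front; reject if the balance is exhausted. */
    synchronized boolean tryAdmit() {
        refill();
        if (tokens < 1) {
            return false;   // still paying off earlier expensive requests
        }
        tokens -= 1;
        return true;
    }

    /** After the call completes, debit the rest of its true cost (may push the balance negative). */
    synchronized void settle(double actualCostUnits) {
        tokens -= Math.max(0, actualCostUnits - 1);
    }
}
```

A caller would invoke tryAdmit() before doing the work, then settle() with the measured cost once the response size is known.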
There can be some flexibility in implementing quotas. Consider the case where client A has a limit of 1,000 transactions per second (TPS), but the service is scaled to handle 10,000 TPS, and the service is currently serving 5,000 TPS across all of its clients. If client A spikes from 500 TPS to 3,000 TPS, only its 1,000 TPS would be allowed, and the other 2,000 TPS would be rejected. However, instead of rejecting those requests, the service could allow them. If other clients simultaneously use more of their quotas, the service can begin to drop client A’s "over quota" requests. Dipping into this "unplanned capacity" should also act as a signal to operators of the client and/or the service. The client should know that it’s exceeding its quota and is at risk of seeing errors in the future. The service should know that it might need to scale its fleet and increase the client's quota automatically.
To demonstrate this situation, we created a graph similar to the one used previously to show a service that hard-allocated capacity to its clients. However, in the following graph the service stacked capacity for its clients, instead of hard-allocating it. Stacking allows the clients to use the unutilized service capacity. Since Orange and Gray aren't using their capacity, Blue is allowed to exceed its provisioned thresholds and tap into (or burst into) the unused capacity. If Orange or Gray were to decide to use their capacity, their traffic should take priority over Blue’s burst traffic.
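Here’s one way that fallback might be sketched, using Guava’s RateLimiter (mentioned later in this article as an off-the-shelf local limiter) as a stand-in for a token bucket: within-quota requests are always admitted, and over-quota requests are admitted only while a shared spare-capacity pool has room. The rates are example numbers, not recommendations.

```java
import com.google.common.util.concurrent.RateLimiter;

// Illustrative "burst into unplanned capacity": admit against the tenant's own
// quota first, and only fall back to a shared spare-capacity limiter when the
// tenant is over its quota.
class BurstingAdmission {
    private final RateLimiter tenantQuota;   // e.g., client A's 1,000 TPS quota
    private final RateLimiter spareCapacity; // shared pool representing unused fleet capacity

    BurstingAdmission(double tenantTps, double spareTps) {
        this.tenantQuota = RateLimiter.create(tenantTps);
        this.spareCapacity = RateLimiter.create(spareTps);
    }

    /**
     * Within-quota traffic is admitted unconditionally; over-quota ("burst") traffic
     * is admitted only while the shared spare pool has room, so other tenants'
     * within-quota requests are not crowded out by someone else's burst.
     */
    boolean tryAdmit() {
        if (tenantQuota.tryAcquire()) {
            return true;                    // within the tenant's own quota
        }
        return spareCapacity.tryAcquire();  // over quota: best effort, may be rejected
    }
}
```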
At Amazon, we also look at flexibility and bursting by considering typical customer use case traffic patterns. For example, we found that EC2 instances (and their attached Amazon Elastic Block Store (EBS) volumes) are frequently busier at the time the instance is launched than they are later on. This is because when an instance launches, its operating system and application code need to be downloaded and started up. When we considered this traffic pattern, we found that we could be more generous with up-front burst quotas. This results in reduced boot times, and still provides the long-term capacity planning tools that we need to provide fairness between workloads.
We also look for ways to allow quotas to be flexible over time and adjust to the increase in a client’s traffic that happens as their business grows. For example, some services automatically increase quotas for clients over time as they grow. However, there are some cases where clients want and depend on fixed quotas, for example, quotas used for cost control. Note that this type of quota is likely to be exposed as a feature of a service rather than a protection mechanism that happens behind the scenes.

Implementing layers of admission control
The systems that shape traffic, shed load, and implement rate-based quotas are known as admission control systems.
Services at Amazon are architected with multiple layers of admission control to protect against large volumes of rejected requests. At Amazon, we often use Amazon API Gateway in front of our services and let it handle some dimensions of quotas and rate limiting. API Gateway can absorb a surge in traffic with its large fleets. This means that our service fleets remain unburdened and free to serve real traffic. We also configure Application Load Balancer, API Gateway, or Amazon CloudFront to use the web application firewall service AWS WAF to offload the act of admission control even further. For a layer of protection beyond that, AWS Shield offers DDoS protection services.
We’ve used a number of techniques to implement these layers of admission control in systems at Amazon over the years. In this section, we explore some of those techniques, including how we build server-side admission control, how we implement clients that react gracefully to backpressure from services they call, and how we think about accuracy in these systems.
Local admission control
One common method for implementing admission control is to use a token bucket algorithm. A token bucket holds tokens, and whenever a request is admitted, a token is taken out of the bucket. If there aren't any tokens available, the request is rejected, and the bucket remains empty. Tokens are added to the bucket at a configured rate, up to a maximum capacity. This maximum capacity is known as the burst capacity because these tokens can be consumed instantly, supporting a burst in traffic.
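A minimal token bucket in Java might look like the following sketch; the lazy refill on every check and the cost of one token per request are simplifications chosen for readability.

```java
// Minimal token bucket sketch: tokens refill continuously at `ratePerSecond`,
// up to `burstCapacity`; each admitted request consumes one token.
class TokenBucket {
    private final double ratePerSecond;   // steady-state refill rate
    private final double burstCapacity;   // maximum tokens that can accumulate
    private double tokens;
    private long lastRefillNanos = System.nanoTime();

    TokenBucket(double ratePerSecond, double burstCapacity) {
        this.ratePerSecond = ratePerSecond;
        this.burstCapacity = burstCapacity;
        this.tokens = burstCapacity;       // start full, so an initial burst is allowed
    }

    synchronized boolean tryAcquire() {
        long now = System.nanoTime();
        tokens = Math.min(burstCapacity, tokens + (now - lastRefillNanos) / 1e9 * ratePerSecond);
        lastRefillNanos = now;
        if (tokens >= 1) {
            tokens -= 1;
            return true;    // request admitted
        }
        return false;       // bucket empty: reject (for example, with a 429)
    }
}
```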
This instantaneous burst consumption of tokens is a double-edged sword. It allows for some natural non-uniformity in traffic, but if the burst capacity is too large it defeats the protections of rate limiting.
Alternatively, token buckets can be composed together to prevent unbounded burst. One token bucket can have a relatively low rate and a high burst capacity, and a second bucket can have a high rate and a low burst capacity. By checking the first bucket and then the second bucket, we allow for high burst, but with a bounded burst rate.
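Building on the TokenBucket sketch above, composing the two buckets might look like this; the rates and capacities are illustrative example values.

```java
// Composing two buckets (using the TokenBucket sketch above): the first bucket
// allows a large burst at a low sustained rate, and the second bucket caps how
// fast that burst can actually be drained. A request must pass both checks.
class ComposedRateLimiter {
    private final TokenBucket sustained = new TokenBucket(100, 10_000); // low rate, high burst capacity
    private final TokenBucket burstRate = new TokenBucket(1_000, 100);  // high rate, low burst capacity

    boolean tryAcquire() {
        // Simplification: if the second check fails, the token taken from the
        // first bucket is not returned; a production limiter might refund it.
        return sustained.tryAcquire() && burstRate.tryAcquire();
    }
}
```

With these example values, the net effect is a sustained rate of 100 requests per second, an accumulated burst of up to 10,000 requests, and a burst drain rate capped at roughly 1,000 requests per second.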
For a traditional service (one that doesn’t have a serverless architecture), we also consider how uniform or non-uniform the requests are across our servers for a given customer. If requests are non-uniform, we use more relaxed bursting values or distributed admission control techniques.
There are many off-the-shelf implementations of local rate limiting available, including Google Guava’s RateLimiter class.
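For example, Guava’s RateLimiter can serve as the local limiter; the 1,000 permits per second below is just an example value.

```java
import com.google.common.util.concurrent.RateLimiter;

public class GuavaExample {
    public static void main(String[] args) {
        // Smooth limiter that issues up to roughly 1,000 permits per second.
        RateLimiter limiter = RateLimiter.create(1000.0);

        if (limiter.tryAcquire()) {
            // handle the request
        } else {
            // reject cheaply, for example with an HTTP 429 response
        }
    }
}
```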
Distributed admission control
Local admission control is useful for protecting a local resource, but quota enforcement or fairness often needs to be enforced across a horizontally scaled fleet. Teams at Amazon have taken many different approaches to solve this problem of distributed admission control, including:
Computing rates locally and dividing the quota by the number of servers. Using this approach, servers perform admission control based on traffic rates they observe locally, but they divide the quota for each key by the number of servers that are serving traffic for that throttle key. This approach assumes that requests are relatively uniformly distributed across servers. When an Elastic Load Balancing load balancer spreads requests across servers in a round-robin fashion, this generally holds true.
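A sketch of this divide-by-server-count approach follows, again using Guava’s RateLimiter as the local limiter. The currentServerCount() lookup and the default quota are hypothetical placeholders for values a real service would fetch from its configuration and service discovery.

```java
import com.google.common.util.concurrent.RateLimiter;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative local enforcement of a fleet-wide quota: each server limits a
// throttle key to (fleet-wide quota / number of servers), assuming the load
// balancer spreads that key's requests roughly evenly across servers.
class DividedQuotaLimiter {
    private final Map<String, RateLimiter> limitersByKey = new ConcurrentHashMap<>();
    private final Map<String, Double> fleetQuotasByKey;  // fleet-wide TPS per throttle key

    DividedQuotaLimiter(Map<String, Double> fleetQuotasByKey) {
        this.fleetQuotasByKey = fleetQuotasByKey;
    }

    boolean tryAdmit(String throttleKey) {
        double fleetQuota = fleetQuotasByKey.getOrDefault(throttleKey, 100.0);  // assumed default
        double localRate = fleetQuota / currentServerCount();
        RateLimiter limiter =
                limitersByKey.computeIfAbsent(throttleKey, k -> RateLimiter.create(localRate));
        limiter.setRate(localRate);   // keep the local share current as the fleet scales
        return limiter.tryAcquire();
    }

    private int currentServerCount() {
        return 10;  // hypothetical: a real service would fetch this dynamically
    }
}
```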
The following diagram shows a service architecture that assumes that traffic is relatively uniform across instances and can be handled using a single logical load balancer.

However, assumptions about uniformity across servers might not always be true in some fleet configurations. For example, when a load balancer is used in a connection balancing mode instead of a request balancing mode, clients with few enough connections will send their requests to a subset of servers at a time. This could be fine in practice when the per-key quota is high enough. Assumptions about uniformity across servers can also break down when there is a very large fleet with multiple load balancers. In cases like this, a customer could connect through a subset of the load balancers, resulting in only a subset of the service instances serving the requests. Again, this could be fine in practice if the quotas are high enough, or if clients are unlikely to get close enough to their quota maximum for this case to apply.
The following diagram illustrates a situation where services that are fronted by multiple load balancers find that traffic from a given client is not uniformly spread across all servers due to DNS caching. This tends to be less of an issue at scale, when clients are opening and closing connections over time.

Non-uniform traffic from multiple endpoints
Using consistent hashing for distributed admission control
Some service owners run a separate fleet, such as an Amazon ElastiCache for Redis fleet. They use consistent hashing to map each throttle key to a particular rate tracker server, and then have the rate tracker servers perform admission control based on their local information. This solution scales well even when key cardinality is high, because each rate tracker server only needs to know about a subset of the keys. However, a basic implementation would create a “hot spot” in the cache fleet when a particular throttle key is requested at a high enough rate, so intelligence needs to be added to the service to gradually rely more on local admission control for a particular key as its throughput increases.
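Here’s a minimal consistent-hash ring sketch for mapping throttle keys to rate tracker servers; the virtual node count, hash choice, and tracker addresses are assumptions for the example.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

// Illustrative consistent-hash ring that maps each throttle key to one rate
// tracker server, so that server sees all traffic for the key and can make a
// local admission decision.
class RateTrackerRing {
    private final SortedMap<Long, String> ring = new TreeMap<>();

    RateTrackerRing(List<String> trackerAddresses) {
        for (String tracker : trackerAddresses) {
            for (int vnode = 0; vnode < 128; vnode++) {       // virtual nodes smooth the distribution
                ring.put(hash(tracker + "#" + vnode), tracker);
            }
        }
    }

    /** Every front-end server routes a given throttle key to the same tracker. */
    String trackerFor(String throttleKey) {
        long h = hash(throttleKey);
        SortedMap<Long, String> tail = ring.tailMap(h);
        return tail.isEmpty() ? ring.get(ring.firstKey()) : tail.get(tail.firstKey());
    }

    private static long hash(String s) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(s.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) {
                h = (h << 8) | (digest[i] & 0xff);
            }
            return h;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Each front-end server would call trackerFor(throttleKey) and ask that tracker whether to admit the request, so all of the traffic for a given key is counted in one place.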
The following diagram illustrates the use of consistent hashing to route throttle keys to a data store. Even where traffic isn’t uniform, using consistent hashing to count traffic in some sort of data store, such as a cache, can solve the distributed admission control problem. However, this architecture introduces scaling challenges of its own.

Taking other approaches
At Amazon, we’ve implemented many distributed admission control algorithms in web services over time, with varying degrees of overhead and accuracy depending on the specific use case. These approaches involve periodically sharing the observed rates for each throttle key across a fleet of servers. There are many tradeoffs between scalability, accuracy, and operational simplicity in these approaches, but they would need their own articles to explain and compare in depth. For some starting points, check out the research papers on fairness and admission control that I link to at the end of this article, many of which come from the networking space.
The following diagram illustrates the use of asynchronous information sharing between servers to account for non-uniform traffic. This approach comes with its own scaling and accuracy challenges.

Architectural approaches to avoid quotas
It might seem easy to add admission control to a service to improve its server-side availability, protect customers from each other, and declare victory. However, we also view quotas as an inconvenience for customers. Quotas slow customers down when they are trying to get something done. As we build fairness mechanisms into our services, we also look for ways to help customers get their work done quickly, without having their throughput limited by their quotas.
Our approach for helping clients avoid exceeding their rate-based quota varies depending on whether the API is a control plane API or a data plane API. A data plane API action is one that clients are expected to call at a higher and higher rate over time. Examples of data plane API actions include Amazon Simple Storage Service (S3) GetObject, Amazon DynamoDB GetItem, and Amazon SQS ReceiveMessage. A control plane API action, on the other hand, is intended for occasional, low-volume use that doesn’t grow with the customer’s data plane usage. Examples of control plane API actions include Amazon S3 CreateBucket, Amazon DynamoDB DescribeTable, and Amazon EC2 DescribeInstances.
Conclusion
Multi-tenant services have resource-sharing properties that enable them to operate with lower infrastructure cost and higher operational efficiency. At Amazon, we build fairness into our multi-tenant systems to provide our customers with predictable performance and availability.
Service quotas are an important tool for implementing fairness. Rate-based quotas make multi-tenant web services more reliable for customers by preventing an unpredictable increase in one workload from affecting others. However, implementing rate-based quotas is not always enough to provide a good customer experience. Customer visibility, controls, burst sharing, and different flavors of APIs all help customers avoid exceeding their quotas.
The implementation of admission control in a distributed system is complex. Fortunately, over time we have found ways to generalize this functionality and expose it in various AWS services. For API rate limiting, API Gateway offers multiple varieties of throttling as a feature. AWS WAF offers another layer of service protection, and it integrates into Application Load Balancer and API Gateway. DynamoDB offers provisioned throughput controls at the individual index level, letting customers isolate the throughput requirements of different workloads. Similarly, AWS Lambda exposes per-function concurrency isolation to isolate workloads from each other.
At Amazon, we find that admission control using quotas is an important approach for building highly resilient services with predictable performance. However, admission control is not enough. We make sure to solve the easy things too, like using Auto Scaling, so that if there is unintended load shedding, our systems automatically respond to the increased demand by adding capacity.
On the surface, it seems like there is an inherent tradeoff between cost and workload isolation when choosing whether to expose a service as a single-tenant or a multi-tenant service. However, we have found that by implementing fairness in multi-tenant systems, our customers get the best of both the multi-tenant and single-tenant worlds. As a result, they can have their cake and eat it too.
Links
Zhang, et al., Online Identification of Hierarchical Heavy Hitters: Algorithms, Evaluation, and Applications, AT&T Labs, 2004
Floyd, et al., Random Early Detection Gateways for Congestion Avoidance, IEEE/ACM, 1993
Charny, An Algorithm for Rate Allocation in a Packet-Switching Network with Feedback, MIT, 1994
Feng, et al., BLUE: A New Class of Active Queue Management Algorithms, University of Michigan, 1999
McKenney, Stochastic Fairness Queueing, IEEE, 1990
Kempe, et al., Gossip-Based Computation of Aggregate Information, IEEE, 2003