AWS Database Blog
How to design Amazon DynamoDB global secondary indexes
Back in college, I created entity-relationship diagrams to model the system requirements of a relational database. The process involved finding all of the entities of the software system and defining relationships among them. I then modeled the relationships and entities into database tables before deciding which queries the database had to support. This method of designing database schemas worked well until I started using nonrelational databases in order to take advantage of their scalability and more consistent performance.
With nonrelational databases, the approach for designing a schema proceeds in reverse. You use a “query first” approach to identify the queries the applications need before designing the database schema. Data is therefore explicitly stored the way that the application needs to use it, increasing query efficiency.
If you also want to add flexibility to your queries, you can use global secondary indexes with Amazon DynamoDB. When you use global secondary indexes on a DynamoDB table, you can query data flexibly in other dimensions, using nonkey attributes.
However, in order to maintain efficient query performance, you have to design the schema of your global secondary indexes carefully, in the same way that you designed the schema for the DynamoDB table. In this blog post, I show an approach for designing the schema of a global secondary index, explain how to avoid common pitfalls in the design process, and provide tips for reducing costs.
The schema design process for global secondary indexes
The following diagram summarizes this post’s approach for how to design the schema of global secondary indexes.

Identify the query patterns
Application-specific query patterns (the types of queries your table supports) drive the design of global secondary indexes. The main question driving your design should be “What questions do I need the global secondary index to answer?”
When you’ve determined the questions that have to be answered, map the questions to queries of your table data. Use data filters that are based on range queries such as “greater than,” “less than,” “between,” and “starts with.” You should also think about other data that must be accessed but that does not require filtering or sorting. For example, to display product information on an online shopping website, you filter your data by the ProductId of the product. However, other data that would need to be accessed in the query could include product description, price, weight, product color, and so on. Identify as many queries as possible in advance. Accounting for them in the schema design helps to optimize the cost and performance of your global secondary indexes.
Let’s look at an example to see how an application-specific query translates to table queries. Let’s say an online shopping website stores all of the orders of a customer in an Orders table with OrderId as the partition key. The table also stores other data about the order such as OrderDate, CustomerId, and Status. The following table shows some of the common application-specific questions and their corresponding table queries.
| Application specific question | Table query | 
| Find all orders for a customer sorted by order date | Filter all orders in the Orders table by CustomerIdand then sort byOrderDate | 
| Get orders for a given customer within a given date range | Filter all orders in the Orders table by CustomerIdand then filter by range query onOrderDate | 
| Find all pending orders for a customer | Filter all orders in the Orders table by CustomerIdand then filter byStatusas Pending | 
| Find all pending orders for a customer that are more than five days old | Filter all orders in the Orders table by CustomerIdand then filter byStatusasPendingand range query ofOrderDate < CurrentDate-5 | 
| Get OrderId,OrderDate, andStatusfor all orders of a customer | Filter all orders in the Orders table by CustomerIdand get theirOrderId,OrderDate, andStatus | 
Identify the candidate fields
After you identify the query patterns, you identify the global secondary index candidate fields that are required to query data for these query patterns. First, let’s see how candidate fields are formed and how they should be used.
Candidate fields of a global secondary index include the candidate partition key, candidate sort key, and attribute projections. Global secondary indexes use the value of the partition key as input to an internal hash function, which is similar to how a partition key is used in a DynamoDB table. The output value from the hash function determines the partition in which the item is stored. A partition key value is always required to query a global secondary index.
A sort key is an optional key that can be used to store all of the items with the same partition key value so that they are physically close together, ordered by sort key value. When you query a global secondary index, you can apply a condition to the sort key so that it returns only items within a certain range of values.
Attribute projections include the set of attributes that are copied from a table into the global secondary index. The partition key and sort key of the table are always projected into the global secondary index. You can access the projected attributes by querying the global secondary index.
The following table shows which candidate fields to choose for some of the more common query patterns.
| Query | Candidate partition key | Candidate sort key | Attribute projections | 
| Filter table records on A | A | ||
| Filter table records on Aand sort onB | A | B | |
| Filter table records on AandB | A | B | |
| Filter table records on AandB, and sort onC | A:B (composite partition key) | C | |
| Filter table records on A, and sort onBandC | A | B:C (composite sort key) | |
| Filter table records on Aand then filter onBby a range query such asGREATER THAN,LESS THAN,STARTS WITH, orBETWEEN | A | B | |
| Filter table records on Aand access a subset of attributesB,C, andD, but do not filter or sort by these attributes | A | B, C, D | 
Global secondary indexes do not allow you to query data in fields other than primary keys. This means that all fields that require data to be filtered on them should be part of either the candidate partition key or the candidate sort key.
Answering the following questions can help you determine if the field should be a candidate partition key or a candidate sort key:
- Is data sorting required on this field? If so, the field should be part of candidate sort key.
- Is a range-based query needed on this field? If so, the field should be part of candidate sort key.
- Is data sorting on more than one field required? If so, combine these fields into a single composite field and use it as the candidate sort key.
- Is data filtering required on more than one field? If so, combine these fields into a single composite field and use it as the candidate partition key.
- Is data sorting required on a field along with a range query on a different field? If so, combine these fields into a single composite attribute and use it as the candidate sort key.
- Is filtering on an attribute and then subsequent filtering on another attribute required? If so, the field for initial filtering should be part of candidate partition key and the field for subsequent filtering should be part of candidate sort key.
All other fields in the query that must be accessed but don’t need data to be filtered on should be included in attribute projections.
The following table shows an example, using the online shopping website stores data used earlier.
| Table query | Candidate partition key | Candidate sort key | Attribute projections | 
| Filter all orders in the Orders table by CustomerIdand then sort byOrderDate | CustomerId | OrderDate | |
| Filter all orders in the Orders table by CustomerIdand then filter by range query onOrderDate | CustomerId | OrderDate | |
| Filter all orders in the Orders table by CustomerIdand then filter byStatusasPending | CustomerId | Status | |
| Filter all orders in the Orders table by CustomerIdand then filter byStatusasPendingand range query ofOrderDate < CurrentDate-5 | CustomerId | Status:OrderDate (composite sort key) | |
| Filter all orders in the Orders table by CustomerIdand get theirOrderId,OrderDate, andStatus | CustomerId | OrderId, OrderDate, Status | 
Optimize candidate fields for cost and performance
You should optimize the candidate fields you identify for each query for cost and performance, before you use them in a global secondary index schema.
Optimize candidate fields for cost
Every global secondary index is provisioned independently and maintains its own copy of the data separately from the base table. As a result, sharing indexes to answer more than one query helps reduce the cost of maintaining an index.
Answering the following questions can help you design the best schema and optimize for cost.
- Can I use a single global secondary index to answer multiple queries?
 Sometimes multiple queries can be answered by using a single global secondary index. Some of the common use cases in which global secondary indexes can be reused are:- Using global secondary indexes with a composite primary key (partition key and sort key): A composite primary key gives you additional flexibility when querying data. When you use a composite primary key, you can query your data by either specifying the partition key or by specifying both the partition key and sort key. Let’s go back to the online shopping website example used earlier to understand how we can reuse a global secondary index to answer multiple queries using this technique. The following table shows queries that can be answered by using a single global secondary index.
 Query Table query Candidate partition key Candidate sort key Attribute projections 1 Filter all orders in the Orders table by CustomerIdand then sort byOrderDateCustomerId OrderDate OrderId, OrderDate, Status 2 Filter all orders in the Orders table by CustomerIdand then filter by range query onOrderDate3 Filter all orders in the Orders table by CustomerIdand get theirOrderId,OrderDate,andStatusFor query 1 and 2 in the preceding table, you can query the global secondary index with both the partition key and sort key. For query 3, the same global secondary index can be queried with just the partition key. As a result, a single global secondary index with CustomerId as the partition key and OrderDate as the sort key can be reused to answer all three queries – there’s no need to create a separate global secondary index for each query. 
- Using composite sort keys: Composite sort keys are the sort keys that are created by combining more than one attribute. When you use composite sort keys as sort keys of a global secondary index, they enable powerful querying capabilities by using various key conditions such as BeginsWith,Greater Than,Less Than, andBetween. For example, let’s say a local search engine application allows location-aware search queries of local businesses. Each business listing is stored in a DynamoDB table called Business withBusinessIdas the partition key. The table does not have a sort key. Each business listing hasBusinessType,Country,State,City,BusinessName,Business Address, andPhoneNumberattributes associated with it.The application requires support for the following query patterns:- Filter businesses of Business Type AandCountry B.
- Filter businesses of Business Type A,Country B, andState C.
- Filter businesses of Business Type A,Country B,State C, andCity D.
 The following table shows the candidate fields of the global secondary index for each of these queries. Query Table query Candidate partition key Candidate sort key Attribute projections 1 Filter businesses of BusinessType AandCountry BBusinessType Country 2 Filter businesses of BusinessType A,Country B, andState CBusinessType Country:State 3 Filter businesses of BusinessType A,Country B,State C, andCity DBusinessType Country:State: 
 CityA single composite global secondary index, with BusinessTypeas the partition key andCountry:State:Cityas the sort key can support all three queries. There is no need to create a separate global secondary index for each query. The idea here is to use theBeginsWithcondition for the sort key condition while querying the data from the global secondary index.The following table shows some of the records in this global secondary index. Business type (partition key) Country:State:City (sort key) Business name Business address Phone number FitnessCenter USA:Washington:Seattle Business1 Address1 111-111-1111 CoffeeShop USA:California:PaloAlto Business2 Address2 222-222-2222 Restaurant India:Maharashtra:Mumbai Business3 Address3 123-456-7890 For this example, let’s look at some application queries that use a composite global secondary index to fetch the required data. Query Partition key condition Sort key condition Find all coffee shops in the United States BusinessType equals CoffeeShop Country:State:City begins with USA Find all restaurants in the State of Washington BusinessType equals Restaurant Country:State:City begins with USA:Washington Find all fitness centers in Seattle BusinessType equals FitnessCenter Country:State:City equals USA:Washington:Seattle 
- Filter businesses of 
 
- Using global secondary indexes with a composite primary key (partition key and sort key): A composite primary key gives you additional flexibility when querying data. When you use a composite primary key, you can query your data by either specifying the partition key or by specifying both the partition key and sort key. Let’s go back to the online shopping website example used earlier to understand how we can reuse a global secondary index to answer multiple queries using this technique. The following table shows queries that can be answered by using a single global secondary index.
- Am I projecting the right data?You can reduce storage and read costs by creating global secondary indexes on attributes that store filtered records, rather than attributes that require you to filter the output of your query. Don’t project attributes that you know will never be queried. It’s also best to avoid projecting attributes that you think will rarely be needed in queries. Every time you update an attribute that is projected in an index, you incur the extra cost of also updating the index. Remember that you can still retrieve nonprojected attributes by querying your table.
Optimize for provisioned throughput use
You can achieve a better use of provisioned throughput for a global secondary index by distributing read and write requests across its partitions. This means you should design your global secondary index schema so that the partition key is accessed uniformly. You don’t have to access all partition key values to achieve an efficient throughput level, and the percentage of accessed partition key values doesn’t have to be high. However, your goal should be to maximize the number of distinct partition key values that your workload accesses, because that will increase the number of requests that are spread across the partitioned space. In general, you use your provisioned throughput more efficiently as the ratio of partition key values accessed to the total number of partition key values increases.
Choose global secondary index partition key attributes that have a large number of distinct values. Some common ways to achieve such well-distributed partition keys include:
- Unique attributes: Use attributes such as emailId,phoneNumber,orderId,customerId, andsessionIdthat have distinct values for each item (high cardinality attributes).
- Composite attributes: Try to combine more than one attribute to form a unique key, if that works with your access pattern. For example, an Orders table might have customerid+productid+countrycodeas the partition key andorder_dateas the sort key.
- Add a random number: From a predetermined range (a fixed number from, say, 1 to 10), choose a random number and add it to the partition This applies if you need a global secondary index on a field with few values, such as a Boolean flag or an enum value.
For example, an Orders table might have OrderId as the partition key and Status as one of the attributes. The Status field of an order can be Ordered, In Transit, or Delivered. The application requires querying on all of the orders in a specific status. A global secondary index on the Status field results in low cardinality of the global secondary index’s partition key. Because a low cardinality partition key results in a skewed key distribution, we recommend instead that you append a randomly generated suffix from 1 to 10 to each partition key. The application can then query the global secondary index with each of the 10 partition keys in parallel and merge the results to get all of the orders in a specific status.
Best practices for global secondary indexes
Consider the following recommended practices when using global secondary indexes.
- Provision sufficient capacity: Provisioning sufficient capacity for your global secondary indexes is crucial. Failure to do so can result in provisioned throughput exceptions to your base table writes. Updates to the table require corresponding global secondary index updates, and they consume the provisioned write capacity of the global secondary index and not the table. Under-provisioning your global secondary index not only slows down the updates to your index, but it also ultimately results in failed writes to your table.The simplest way to prevent under-provisioning is to select on-demand capacity mode for the base table (which is then applied to all global secondary indexes associated with the base table).
- Handle eventual consistency: Global secondary indexes are eventually consistent because attributes are projected asynchronously. Updates are usually propagated to global secondary indexes within a fraction of a second. Most of the applications do not really need strong consistency guarantees for their use cases, as long as the propagation to your index is fast. Take a moment to consider whether your use case actually demands a strong consistency guarantee. If it does, consider using DynamoDB transactions instead as a way to enforce strong consistency across tables.
Additional tips for reducing costs when using global secondary indexes
The following additional tips can be helpful in reducing costs when using global secondary indexes:
- Choose attribute names wisely for your global secondary indexes
 The size of an item is calculated by adding the lengths of its attribute names and values. Having shorter attribute names helps reduce the storage and write costs. For example, instead of usingCustomer_Ageas an attribute name, consider using Age.
- Use sparse indexes
 DynamoDB writes a corresponding global secondary index item only if the global secondary index key value is present in the item. If the key doesn’t appear in every table item, the global secondary index is said to be sparse. Avoid storing null and empty values for the global secondary index key attributes. If the value of a global secondary index key attribute is null or empty, it is better to just skip the attribute when writing it. Because global secondary indexes are stored separately, if you skip writing null or empty attributes they are not projected to the global secondary index, saving storage and write cost. Use sparse indexes whenever possible.
- Consider item size
 The total number of write capacity units required to write an item to DynamoDB depends on the item size. Therefore, it’s cost effective to keep the size of item small, regardless of whether your table uses provisioned capacity mode or on-demand capacity mode. When storing large attributes in your DynamoDB table, consider:- Compressing the attributes before storing them in DynamoDB: Compressing large attribute values can help reduce the size of DynamoDB items and reduce your storage costs.
- Storing the attributes in Amazon S3 and having the Amazon S3 mapping in your DynamoDB table: You can store attributes as an object in Amazon S3 and then store the object identifier in your DynamoDB item. Anytime you need to read the full item, you can use the object identifier stored in the item and read it from Amazon S3.
 
For more information about reducing item size, see Best Practices for Storing Large Items and Attributes.
Summary
In this post, I walked through the process of designing a schema for global secondary indexes. A good schema design helps in maximizing the performance and minimizing the cost of querying your data from global secondary indexes. Good schema design includes reusing your global secondary indexes to support multiple queries, and optimizing provisioned throughput usage. I also described best practices that can help you avoid pitfalls when using global secondary indexes.
About the author
 Shubham Sethi is an AWS software development engineer.
Shubham Sethi is an AWS software development engineer.