My first experience with DynamoDB was in a code base that had a number of issues with the technology. This piqued my interest, and I set myself the goal of learning the technology properly and fixing our usage. I came away very impressed by how elegant and powerful DynamoDB is when used correctly. I also got stuck a few times on problems that didn’t have an obvious solution, or for which I had difficulty finding useful information online. The goal of this blog post is to present some of the tricks and aha moments I had along the way.
Before diving in, I want to emphasize that this blog post targets readers who already have a fair amount of knowledge about DynamoDB. There are great resources that do a much better job than I would at presenting the technology and giving you the tools necessary to start being productive. If you aren’t familiar with it yet, you can start by looking at the following links.
One of the simplest improvements you can make is to add an attribute containing the type of the item. This will save you a ton of gymnastics on the primary key, and simply allow you to switch/if on the type of the item in order to deserialize it correctly. For instance,
pk | sk | type |
---|---|---|
USER#1 | #METADATA | User |
TEAM#1 | #METADATA | Team |
USER#1 | TEAM#1 | TeamMembership |
It might seem more intuitive to rely on the prefixes already present in the primary key, but it gets quite convoluted and isn’t worth the hassle.
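As a minimal sketch of what that dispatch can look like (the wrapper classes are hypothetical and would expose proper fields in real code):

```python
from dataclasses import dataclass

# Hypothetical item wrappers around the raw DynamoDB item.
@dataclass
class User:
    raw: dict

@dataclass
class Team:
    raw: dict

@dataclass
class TeamMembership:
    raw: dict

def deserialize(item: dict):
    # Dispatch on the type attribute rather than inspecting pk/sk prefixes.
    match item["type"]:
        case "User":
            return User(item)
        case "Team":
            return Team(item)
        case "TeamMembership":
            return TeamMembership(item)
        case _:
            raise ValueError(f"unknown item type: {item['type']}")
```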
Initially, it wasn’t so clear to me how to support multiple login methods. Take a simple example where the user is uniquely identified by an arbitrary id (e.g., a UUID) and can log in with an email or with a username. Naively, you might create the following structure:
pk | sk | |||
---|---|---|---|---|
USER#1 | #METADATA | email: “user1@example.com” | username: “John” | password: bcrypt(“Passw0rd!”) |
You can create a global secondary index with the partition key set to `email`, and similarly create another index with the partition key set to `username`. This isn’t a scalable design: you’re limited to 20 global secondary indexes (by default), and there are many more login methods to support: Facebook, Google, Steam, etc.
Instead, you can structure your data in the following way:
pk | sk | ||
---|---|---|---|
USER#1 | #METADATA | email: “user1@example.com” | username: “John” |
USER#1 | EMAIL#user1@example.com | password: bcrypt(“Passw0rd!”) | confirmed: true |
USER#1 | USERNAME#John | password: bcrypt(“Passw0rd!”) |
If you combine that with the so-called “inverted index”, that is, a global secondary index where the partition key is `sk` and the sort key is `pk`, you can easily log a user in by email with the following query: `sk = EMAIL#{email}`. Similarly, a user can log in with a username with the query `sk = USERNAME#{username}`.
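Here is a sketch of the email lookup with boto3; the table name, index name, and helper function are assumptions:

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("app-table")  # hypothetical table name

def find_login_item(email: str) -> dict | None:
    # Query the inverted index (sk as partition key, pk as sort key).
    resp = table.query(
        IndexName="inverted-index",  # assumed index name
        KeyConditionExpression=Key("sk").eq(f"EMAIL#{email}"),
    )
    items = resp["Items"]
    return items[0] if items else None
```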
One of the mistakes I initially made was to use the primary key to store some of the data. Consider the following:
pk | sk | ||
---|---|---|---|
USER#1 | #METADATA | email: “user1@example.com” | username: “John” |
TEAM#1 | #METADATA | name: “John Team” | |
USER#1 | TEAM#1 | role: Admin |
If I wanted to query all the teams user 1 is in, I would query for `pk = "USER#1" and begins_with(sk, "TEAM#")`. This would give me all the team memberships, and I could extract the team id from `sk` by stripping the `TEAM#` prefix. This adds a ton of complexity and isn’t worth the effort.
Instead, duplicate the data in the following way.
pk | sk | |||
---|---|---|---|---|
USER#1 | #METADATA | userId: “1” | email: “user1@example.com” | username: “John” |
TEAM#1 | #METADATA | teamId: “1” | name: “John Team” | |
USER#1 | TEAM#1 | userId: “1” | teamId: “1” | role: Admin |
It’s easy to build the `pk` and `sk` in order to run a query, but by duplicating the data, you don’t need to care about parsing the key format: you can simply throw those values away. This becomes especially apparent when using a clever `sk` such as:
pk | sk | chatId | msgId | timestamp |
---|---|---|---|---|
CHAT#1 | MSG#2024-10-14T01:01:01.0Z#1 | 1 | 1 | 2024-10-14T01:01:01.0Z |
CHAT#1 | MSG#2024-10-14T01:01:01.0Z#2 | 1 | 2 | 2024-10-14T01:01:01.0Z |
CHAT#1 | MSG#2024-10-14T01:01:02.0Z#3 | 1 | 3 | 2024-10-14T01:01:02.0Z |
This approach is especially great when you have the ability to automatically deserialize the DynamoDB result into a structure. Think Rust serde, Go marshaling, Java annotations, or any other. You may, in fact, not even want to store the `pk` and `sk`, but simply have a function to build them from the other fields.
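A minimal sketch of that idea, with field names taken from the chat table above:

```python
from dataclasses import dataclass

@dataclass
class Message:
    chat_id: str
    msg_id: str
    timestamp: str  # ISO 8601

    # The keys are derived from the attributes, never parsed back out of them.
    @property
    def pk(self) -> str:
        return f"CHAT#{self.chat_id}"

    @property
    def sk(self) -> str:
        return f"MSG#{self.timestamp}#{self.msg_id}"

msg = Message(chat_id="1", msg_id="1", timestamp="2024-10-14T01:01:01.0Z")
assert msg.pk == "CHAT#1"
assert msg.sk == "MSG#2024-10-14T01:01:01.0Z#1"
```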
When you learn about structuring many-to-many relationships, one of the first “aha moments” should be the introduction of the inverted index. It seems like you should always have one. Don’t do that; instead, set the partition key of the index to a new field `gs1pk` and the sort key to `gs1sk`.
If we go back to the chat example, with the following structure:
pk | sk | chatId | msgId | timestamp |
---|---|---|---|---|
CHAT#1 | MSG#2024-10-14T01:01:01.0Z#1 | 1 | 1 | 2024-10-14T01:01:01.0Z |
CHAT#1 | MSG#2024-10-14T01:01:01.0Z#2 | 1 | 2 | 2024-10-14T01:01:01.0Z |
CHAT#1 | MSG#2024-10-14T01:01:02.0Z#3 | 1 | 3 | 2024-10-14T01:01:02.0Z |
the inverted index would be somewhat awkward to query. Instead, you could have the following (simplified):
pk | sk | gs1pk | gs1sk |
---|---|---|---|
CHAT#1 | MSG#2024-10-14T01:01:01.0Z#1 | MSG#1 | #METADATA |
CHAT#1 | MSG#2024-10-14T01:01:01.0Z#2 | MSG#2 | #METADATA |
CHAT#1 | MSG#2024-10-14T01:01:02.0Z#3 | MSG#3 | #METADATA |
USER#1 | #METADATA | ||
TEAM#1 | #METADATA | ||
USER#1 | TEAM#1 | TEAM#1 | USER#1 |
You can notice that:

- the messages set a `gs1pk`, but don’t really use the `gs1sk`;
- the user and team metadata items are simply absent from the index, because using `#METADATA` as `gs1pk` would not serve much purpose other than listing all users.

In this very simple example, it might seem useless, but this can avoid having to create too many indexes. Indeed, in more complex scenarios, we may want to use the `gs1pk` and `gs1sk` for an access pattern that isn’t covered by the inverted index.
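For instance, fetching a message by its id, regardless of which chat it belongs to, becomes a query on the generic index; a sketch, assuming the index is named `GS1` and the table `app-table`:

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("app-table")  # hypothetical table name

# Fetch message 2 by id, without knowing which chat it belongs to.
resp = table.query(
    IndexName="GS1",  # assumed name for the gs1pk/gs1sk index
    KeyConditionExpression=Key("gs1pk").eq("MSG#2") & Key("gs1sk").eq("#METADATA"),
)
message = resp["Items"][0] if resp["Items"] else None
```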
When using a SQL database, you rarely stop to consider how many items typically sit on each side of a many-to-many relationship. For instance, in Discord, a user can be in many servers, but that number is certainly much lower than the number of users that can be in a single server. In this relationship, you have “some servers”-to-“a lot of users”.
This can make a pretty big difference if you need to de-normalize your data. Let’s start with this very simple example.
pk | sk | |
---|---|---|
USER#1 | #METADATA | username: “John” |
SERVER#1 | #METADATA | name: “Gaming” |
SERVER#1 | USER#1 | role: Member |
You can easily list all the server ids a given user is in, but that’s probably not what you want: when you query all the servers a user is in, you probably also want the name of each server. Similarly, querying all the user ids in a server won’t allow you to show a nice list. You would need to follow up with N queries in order to get the username of every user.
This problem is particularly annoying, because if you duplicate the data, e.g.,
pk | sk | |||
---|---|---|---|---|
USER#1 | #METADATA | username: “John” | ||
SERVER#1 | #METADATA | name: “Gaming” | ||
SERVER#1 | USER#1 | role: Member | username: “John” | name: “Gaming” |
and you modify the server name, you may need to update hundreds, thousands, or maybe even millions of items. On the other hand, if you have thousands of users in a server, you don’t want to query thousands of usernames in order to list them. Paging may seem to help, but only partially.
But knowing it’s a “some servers”-to-“a lot of users” relationship lets you structure it this way:
pk | sk | ||
---|---|---|---|
USER#1 | #METADATA | username: “John” | |
SERVER#1 | #METADATA | name: “Gaming” | |
SERVER#1 | USER#1 | role: Member | username: “John” |
You can list all users in a server with a single query, including their usernames. You can query the ids of all the servers a user is in, and follow up with a handful of reads to get each server’s name. But since the assumption is that a user is never in that many servers, usually 5-10, maybe sometimes 50, it’s not that high of a cost, especially with BatchGetItem.
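A sketch of that follow-up read with boto3, assuming the table is named `app-table` (the low-level client uses DynamoDB’s JSON shape):

```python
import boto3

client = boto3.client("dynamodb")

# Assume we already have the server ids from the inverted-index query.
server_ids = ["1", "2", "3"]

resp = client.batch_get_item(
    RequestItems={
        "app-table": {  # hypothetical table name
            "Keys": [
                {"pk": {"S": f"SERVER#{sid}"}, "sk": {"S": "#METADATA"}}
                for sid in server_ids
            ]
        }
    }
)
names = [item["name"]["S"] for item in resp["Responses"]["app-table"]]
```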
If you edit the username of a user, you also need to update their server memberships, but similarly, it’s a small number of items, especially with BatchWriteItem or even TransactWriteItems.
On the other hand, changing the name of a server could have required updating thousands of items or more.
This is one of the considerations you should keep in mind when deciding which data to de-normalize in a many-to-many relationship.
One idea I had, which I’m still not sure whether it makes sense, though nowadays I tend to think it doesn’t, is querying many different item types in a single query. Consider the following:
pk | sk | |
---|---|---|
USER#1 | #METADATA | |
USER#1 | #EMAIL#user1@example.com | email: user1@example.com |
USER#1 | #USERNAME#John | username: John |
USER#1 | #GOOGLE#DEAD | google: DEAD |
USER#1 | #META#F00D | meta: F00D |
USER#1 | TEAM#1 |
In this example, the idea is that every item whose sort key starts with `#` is just an “extension” of the metadata. The query `pk = USER#1 and begins_with(sk, '#')` would return all of those items, except the team membership, because a user can be in multiple teams.
Now, this might seem optimized, and it probably saves some space, but it’s a bit of a mess to deal with from code. You end up building a bigger structure that merges all those fields, and especially if you have automated serialization/deserialization, it’s a pain to deal with. Instead, I now prefer:
pk | sk | ||||
---|---|---|---|---|---|
USER#1 | #METADATA | email: user1@example.com | username: John | google: DEAD | meta: F00D |
USER#1 | #EMAIL#user1@example.com | email: user1@example.com | |||
USER#1 | #USERNAME#John | username: John | |||
USER#1 | #GOOGLE#DEAD | google: DEAD | |||
USER#1 | #META#F00D | meta: F00D | |||
USER#1 | TEAM#1 |
With this setup, I can query the single item I care about, and in the code I deserialize it automatically into a struct/class with annotations. Updating one of those fields does require updating two items, but this is really not a problem with BatchWriteItem or even TransactWriteItems.
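As a sketch, here is what a username change could look like with a transaction; the table name and the new value are made up:

```python
import boto3

client = boto3.client("dynamodb")

# Changing user 1's username means touching both the metadata item and the
# dedicated lookup item, atomically.
client.transact_write_items(
    TransactItems=[
        {
            "Update": {
                "TableName": "app-table",  # hypothetical table name
                "Key": {"pk": {"S": "USER#1"}, "sk": {"S": "#METADATA"}},
                "UpdateExpression": "SET username = :u",
                "ExpressionAttributeValues": {":u": {"S": "Johnny"}},
            }
        },
        {
            "Delete": {
                "TableName": "app-table",
                "Key": {"pk": {"S": "USER#1"}, "sk": {"S": "#USERNAME#John"}},
            }
        },
        {
            "Put": {
                "TableName": "app-table",
                "Item": {
                    "pk": {"S": "USER#1"},
                    "sk": {"S": "#USERNAME#Johnny"},
                    "username": {"S": "Johnny"},
                },
            }
        },
    ]
)
```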
I mentioned in the previous point that I usually prefer to duplicate the data instead of querying multiple different item types. There can be exceptions to this rule. For instance, consider the following:
pk | sk | ||
---|---|---|---|
USER#1 | #METADATA | state: <4.3 KB blob> | click: 24600 |
Imagine a cookie-clicker kind of game where, every time a user clicks the cookie, you increase the `click` attribute. Since updating an item costs 1 WCU per KB, rounded up, you would pay 5 WCUs for every click. This further increases if you have GSIs: if you have two in which the user item is replicated, you now pay 15 WCUs per click.
Instead, organize your data like this.
pk | sk | |
---|---|---|
USER#1 | #METADATA | state: <4.3 KB blob> |
USER#1 | #METADATA#STATS | click: 0 |
Now, on every click, you can update just the relevant item, which costs only 1 WCU, and this item is very unlikely to be replicated in any GSI. In this example, that’s 15 times cheaper.
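The increment itself can be a single atomic update; a sketch with boto3 (table name assumed):

```python
import boto3

table = boto3.resource("dynamodb").Table("app-table")  # hypothetical table name

# Increment the counter on the small stats item; the 4.3 KB state blob is
# untouched, so the write only consumes 1 WCU.
table.update_item(
    Key={"pk": "USER#1", "sk": "#METADATA#STATS"},
    UpdateExpression="ADD click :one",
    ExpressionAttributeValues={":one": 1},
)
```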
DynamoDB is a very impressive technology, and extremely cheap when you look at AWS pricing. It’s also extremely rigid, because changing access patterns may require you to move your data around instead of just updating a query in the code. I don’t think I would ever recommend it to a startup, but I also think that most CRUD apps have fairly stable and obvious access patterns. The risk of drastically changing those access patterns is overstated, in my opinion.