Does your data model still fit? Or how we uncovered a 24x cost optimisation in our data retrieval

Written by Jan Ciechowski and Jon Soul

Photo by Joshua Sortino on Unsplash

The Trust and Safety team at Depop is constantly working to create a safe and seamless experience for all our buyers and sellers. On top of the wide range of internal detection and moderation capabilities we support, we also give users the ability to self-moderate on the platform by blocking users whose content they do not want to see, or who they do not want seeing their own content (e.g. in features like recommendations and chat messages).

The Depop app when viewing the profile of a user you have blocked

Blocking is a feature that has existed for several years on Depop, but the underlying service that powers it has seen very few changes in that time. On the other hand, the usage of the service has been far less static, as new use cases for blocking emerge and more users join the platform. Recently this growth in usage reached a tipping point and we decided to revisit some of the original service design choices. In this blog post, we will share the details of that work, the sizeable efficiencies we were able to achieve, and our wider reflections.

Tipping point

Fundamentally we are a peer-to-peer marketplace, and that means nearly every aspect of the Depop experience involves another user. A search result, a chat message, an offer, a purchase, a recommendation, a like: all of these are legitimate use cases for blocking.

Over the past year, blocking has most notably been applied more widely to recommendations, which has contributed to a five-fold increase in requests to our internal service. As the owners of the service, our team was supportive of this and at various times made the necessary adjustments to the underlying infrastructure to enable it.

Underneath, the service uses AWS DynamoDB for persistence, with a DynamoDB Accelerator (DAX) cluster for caching in front of the database. The DynamoDB table is configured with auto-scaling provisioned read/write capacity. In practice, this meant that if we began to approach the limits of either DAX capacity (memory/CPU) or maximum read capacity, we would scale those components. In the case of read capacity, this is simply a configuration value, whereas scaling DAX may require switching to a new cluster (if tweaking the time-to-live settings is not sufficient).
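To illustrate what "simply a configuration value" means in practice, here is a minimal sketch of how a table's read capacity auto-scaling can be configured with boto3 and the Application Auto Scaling API. The table name and capacity numbers are made up for this example, not our production values.

```python
import boto3

autoscaling = boto3.client("application-autoscaling")

# Hypothetical table name and capacity limits, purely for illustration.
TABLE = "table/user-blocks"

# Register the table's read capacity as a scalable target...
autoscaling.register_scalable_target(
    ServiceNamespace="dynamodb",
    ResourceId=TABLE,
    ScalableDimension="dynamodb:table:ReadCapacityUnits",
    MinCapacity=100,
    MaxCapacity=2000,  # the ceiling that has to keep rising as traffic grows
)

# ...and ask DynamoDB to keep consumed RCU at roughly 70% of whatever is provisioned.
autoscaling.put_scaling_policy(
    PolicyName="user-blocks-read-scaling",
    ServiceNamespace="dynamodb",
    ResourceId=TABLE,
    ScalableDimension="dynamodb:table:ReadCapacityUnits",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 70.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "DynamoDBReadCapacityUtilization"
        },
    },
)
```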

Request flow for retrieving blocking information

There is a cost to this tuning both in terms of developer-hours and AWS billing, but for most of 2023, we were satisfied with paying the price despite some early warning signs that the increase in costs/usage was disproportionate to the increase in traffic we were serving.

That is, until a particular incident overwhelmed the DAX cluster and exceeded the maximum read capacity, causing a temporary outage. While there were other exacerbating factors at play (particularly high traffic and a retry storm), the incident pushed a closer look at the scaling challenges of the service itself to the top of our priorities.

How data was stored and fetched

We store blocked user data in DynamoDB in a very simple way. For each block created by a user, we store a new item in the table with an attribute for the blocker's user ID and a second attribute for the target user's ID. There is also some limited metadata stored with each item.

An example of an item stored in the DynamoDB table
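To make that concrete, here is a sketch of what such an item (and the write that creates it) might look like. The table name, key layout and attribute names are illustrative assumptions, not necessarily our real schema.

```python
import boto3

# Hypothetical table name; one plausible key layout is a composite primary key
# of blocker (partition key) and target (sort key).
table = boto3.resource("dynamodb").Table("user-blocks")

table.put_item(
    Item={
        "blocker_id": "1234567",               # the user creating the block
        "blocked_id": "7654321",               # the user being blocked
        "created_at": "2023-06-01T12:00:00Z",  # limited metadata
    }
)
```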

The set of all possible combinations of blocker and target is very large; we have approximately 30 million registered users, meaning there are trillions of possible combinations. However, only a small fraction of those combinations actually exist, so the whole dataset fits in less than 1GB. This relative sparsity of data is relevant elsewhere, as we will discuss later.

When upstream services want to fetch blocking information, they can ask whether a user blocks or is blocked by a specific batch of users, with a maximum batch size of 100. It is no coincidence that this maximum corresponds to the maximum batch size of DynamoDB’s BatchGetItem operation, as that was the method used to fetch the data. Because blocking is typically considered a bidirectional operation, each request to the service performed two BatchGetItem operations underneath.
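A minimal sketch of that batch lookup with boto3 might look like the following, using the same illustrative table and attribute names as above. Note the two BatchGetItem calls, one per direction of the relationship.

```python
import boto3

dynamodb = boto3.resource("dynamodb")


def batch_block_check(user_id: str, other_ids: list[str]) -> dict[str, set[str]]:
    """Does `user_id` block, or get blocked by, any of `other_ids` (max 100)?"""
    forward_keys = [{"blocker_id": user_id, "blocked_id": o} for o in other_ids]
    reverse_keys = [{"blocker_id": o, "blocked_id": user_id} for o in other_ids]

    # Two BatchGetItem operations per request; UnprocessedKeys handling and the
    # DAX client (which would normally sit in front) are omitted for brevity.
    blocks = dynamodb.batch_get_item(
        RequestItems={"user-blocks": {"Keys": forward_keys}}
    )["Responses"]["user-blocks"]
    blocked_by = dynamodb.batch_get_item(
        RequestItems={"user-blocks": {"Keys": reverse_keys}}
    )["Responses"]["user-blocks"]

    return {
        "blocks": {item["blocked_id"] for item in blocks},
        "blocked_by": {item["blocker_id"] for item in blocked_by},
    }
```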

There is also an endpoint for answering whether one user blocks one other user, which, as you may be able to guess, is fulfilled by the GetItem operation underneath. This powers many of the primary use cases for blocking (sending a chat message, viewing a shop, etc.), and probably made up the majority of traffic to the internal service for a long time. However, as already mentioned, over the course of 2023 we observed a major shift in usage, putting far more emphasis on the batch endpoint and making the design choices there significantly more important.
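A sketch of that one-to-one check, again with the illustrative schema from above, is a single key lookup:

```python
import boto3

table = boto3.resource("dynamodb").Table("user-blocks")  # hypothetical name


def blocks(blocker_id: str, blocked_id: str) -> bool:
    """Does one user block one other user? A single GetItem underneath."""
    response = table.get_item(Key={"blocker_id": blocker_id, "blocked_id": blocked_id})
    return "Item" in response  # a missing item means no block exists
```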

Representation of the DynamoDB operations performed by our service

Does it still fit?

When assessing the suitability of the architecture there were two main warning signs that we noticed quite early on. Firstly, the DAX cluster we were using for caching item results from DynamoDB was very large and quite inefficient at serving cache hits. For an underlying data set of under 1GB, we were easily reaching the capacity of a DAX cluster where each node had 64GB RAM. Despite this volume of cached results, we were still only able to reach a cache hit ratio of around 50%.

To-scale comparison of total DAX capacity to underlying DynamoDB table size

Why was this the case? It's all about the sparsity of the data; the likelihood of any one user blocking another is tiny, and yet upstream services request blocking information on a much larger set of user combinations. DAX maintains negative cache entries for items that don't exist, so even though the actual blocking data set is small, the cache is overflowing with pairs of users who don't block each other. On top of that, typical traffic patterns (and some upstream caching) mean that the same blocking information is not frequently requested, hence the low hit ratio.

This is not to say that DAX is generally a bad choice for this type of application, just that our particular requirements made it a pretty poor fit. As already mentioned, the typical traffic pattern meant that no sensible TTL or node size could effectively cache results. Also, our latency requirements do not necessitate the microsecond latency that DAX can achieve. On reflection, even without optimisations elsewhere, using DAX was probably a marginal net-negative, given that it cannot auto-scale in the way DynamoDB on its own can. We were spending roughly the same amount on the DAX cluster as we were saving in DynamoDB read capacity units (RCU).

The second warning sign came from DynamoDB itself. With GetItem and BatchGetItem you are charged RCU per item request, whether that item exists or not. As discussed, we’re working with a sparse data set, where most requested items don’t exist. This contributed to high RCU consumption, where the maximum provisioned RCU needed to be adjusted frequently just to support relatively small increases in traffic. What if we could find a solution where cost scaled relative to the number of items returned, instead of the number of items requested?

Overall it was clear that the service was geared toward working with individual items: does user A block user B? In reality, the questions we most often wanted answered were about collections of items: who does user A block?

Finding a better fit

Since our underlying dataset was small, suggestions were made to replace DynamoDB and DAX with a technology that would allow fast lookups across the whole dataset, even if that meant moving away from a serverless model. Some notable candidates included:

  • Storing the dataset in our backend application’s memory directly and handling persistence with a relational database
  • Using a separate in-memory cache like ElastiCache, with configuration to ensure we do not lose data during an outage

We shared these ideas and a recommendation in a design document, which was generally well received by the wider engineering group. The solutions showed promising scaling and cost traits but were quite costly in terms of engineering effort. While we explored them fairly comprehensively, there are always unknowns when you make changes as drastic as this.

Still, the sense that clients were interested in collections of items rather than individual items felt contrary to the underlying service architecture. We evaluated what our access patterns might look like if instead of storing pairs of users, we stored users and their block lists. In this case, in order to maintain O(1) lookups, we would have had to duplicate the data — one copy for users I block and another for the users who block me.

It became apparent that this is largely the same as building an index on the original dataset. As a low-effort first step, we decided to rework our batch endpoint to use a DynamoDB Query operation under the hood, with client-side filtering (a sketch of this approach follows the list below). For this, we needed a DynamoDB Global Secondary Index to handle the reverse lookup in O(1) time. To build confidence in the accuracy and efficiency of the new implementation, we decided to roll out our change in the following way:

  • For an easily configurable percentage of requests, fetch the result of the original BatchGetItem call and the Query call in parallel
  • Compare the two results and log any discrepancies. If the Query operation fails, proceed as normal
  • Run the Query operation bypassing DAX altogether for simplicity
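Under the same illustrative schema assumptions as earlier (partition key blocker_id, sort key blocked_id, plus a Global Secondary Index keyed on blocked_id; all names hypothetical), the Query-based batch lookup might look roughly like this:

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("user-blocks")  # hypothetical name


def block_lists(user_id: str) -> tuple[set[str], set[str]]:
    """One Query per direction: who `user_id` blocks, and who blocks them."""
    # Forward direction: Query the base table's partition key.
    forward = table.query(KeyConditionExpression=Key("blocker_id").eq(user_id))
    # Reverse direction: Query the Global Secondary Index keyed on blocked_id.
    reverse = table.query(
        IndexName="blocked_id-index",  # hypothetical GSI name
        KeyConditionExpression=Key("blocked_id").eq(user_id),
    )
    # Pagination (LastEvaluatedKey) is omitted; per-user block lists are small.
    return (
        {item["blocked_id"] for item in forward["Items"]},
        {item["blocker_id"] for item in reverse["Items"]},
    )


def batch_block_check(user_id: str, other_ids: list[str]) -> dict[str, set[str]]:
    """Client-side filtering: intersect the full block lists with the batch."""
    blocks, blocked_by = block_lists(user_id)
    return {
        "blocks": blocks & set(other_ids),
        "blocked_by": blocked_by & set(other_ids),
    }
```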

Results

With the new implementation, we found:

  • No discrepancies between the Query responses and the original BatchGetItem responses over ~2M requests
  • A huge 24x cost saving based on our average batch size and cache hit ratio
  • 3x lower p95 latency

To put that cost saving into context, it’s roughly equivalent to the yearly salary of a junior engineer. In simple terms, the reason for this is that BatchGetItem expends RCU for each requested item, whereas Query expends RCU relative to the size of the result set returned.

Specifically, when performing an eventually consistent BatchGetItem for N items, we expend N/2 RCU per call. For each item already stored in DAX, we save ½ RCU. By contrast, a Query call expends ½ RCU in total per 4KB of results. Since our dataset is small and sparse, we rarely need more than a single 4KB read unit to fulfil a request.
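As a purely illustrative back-of-the-envelope comparison (the 24x figure above reflects our real average batch size and cache hit ratio; the numbers below are not those values):

```python
batch_size = 100     # items requested per BatchGetItem (the maximum, not our average)
rcu_per_item = 0.5   # eventually consistent read of an item up to 4KB
dax_hit_ratio = 0.5  # roughly the hit ratio we observed

# BatchGetItem: pay per item requested, minus whatever DAX absorbs.
batch_get_rcu = batch_size * rcu_per_item * (1 - dax_hit_ratio)  # 25.0 RCU

# Query: pay per 4KB of results returned; a sparse result set fits in one unit.
query_rcu = 0.5

print(batch_get_rcu / query_rcu)  # 50.0x fewer RCU per call in this example
```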

Comparing RCU consumption of DynamoDB BatchGetItem and Query operations

We weren't aiming for, or expecting, the observed latency improvement, but it can be explained by the underlying behaviour of BatchGetItem. When performing a BatchGetItem operation that results in N retrievals from the underlying table, the API returns the result only when all individual retrievals are complete. This means that the latency of the overall operation is equal to the latency of the slowest of the individual retrievals, which on average slows down the whole operation, especially for larger batch sizes. This is not the case for a Query call, which is a single operation. So even if a single Query is slower on average than a single GetItem, we would still observe better overall performance than a large BatchGetItem.

Reflections

A lot of what you read sounds straightforward. That is because it is. The change we made and its impact is quite obvious in hindsight. We found that it is difficult, however, to read between the lines of someone else’s solution to a problem and understand the original assumptions that may no longer hold. It’s difficult in the sense that it requires meticulousness and circumspection rather than complex engineering, which is sometimes contrary to what appeals to software engineers.

It's also difficult to know if a project like this is going to deliver a net benefit when you consider the full opportunity cost. We felt confident we would deliver an improvement to the service's overall availability and reliability in addition to saving money. That alone is sometimes not enough to justify prioritisation above other impactful projects. We prioritised it because we were not happy with the level of service provided by our application to its clients and could see the potential for a relatively low-effort, high-impact change. We also wanted to better understand an application that hadn't seen much active development recently. At some point, the holistic argument that a piece of work is simply important to pick up becomes convincing. Identifying that moment is hard, and we're happy with the job we did in this regard.

The outcome of our work, and its outsized impact for a relatively small change, are to an extent signs that this work was overdue. We would not expect a relatively easy cost saving of this level from similar projects, so please do not read this blog post as an endorsement of refactoring older services that are delivering value but "could be better". Regardless of expectations, the outcomes speak to the value that can be found in revisiting old choices and challenging the underlying assumptions to find a solution that better fits the current and near-future operating context.
