Feature-driven ML @ Depop

Nikos Varelas
Engineering at Depop
10 min read · Aug 22, 2023


At Depop, we leverage machine learning in a wide range of impactful use cases including search, recommendations, fraud detection and numerous others. The machine learning cycle involves several steps:

  1. Thoroughly understanding the business problem at hand.
  2. Carefully engineering ML features to power our models.
  3. Evaluating the model’s performance on historical data to ensure its accuracy.
  4. Deploying successful models into production.

It turns out that implementing production-grade ML features is not easy! That’s why we partner with Tecton to provide a feature store for both storing the features developed by our data scientists and serving those features to our production models.

Compared to the open-source tool Feast, Tecton implements a range of value-added functionality (Tecton vs Feast). It handles the operational work involved in computing these features, as well as managing the jobs that keep them regularly updated with the most recent data.

In this blog post we will:

  • Dive deeper into the challenges addressed by feature stores.
  • Have a quick look at how a feature is actually defined in Tecton.
  • Talk through one of the challenges we came up against in our implementation of Tecton and how we solved it.

Why do we need a feature store?

ML features are challenging to productionise because they need to be used in two very different contexts. On the one hand, features need to be available in the form of large datasets for offline training. At training time, our features need to be historical — in other words, they need to represent what we knew about the world (products, users, etc) at some point in the past.

On the other hand, those same features need to be available when we serve the model in production, in real time. At serving time, our features only need to represent the world as it is now, but we need access to them at a much lower latency than we can afford at offline training time. Given this complex set of requirements, feature stores try to solve a number of challenges.

The first challenge that feature stores address is training-serving skew, whereby features computed for offline training differ from those computed at serving time, leading to degraded model performance. The key to avoiding training-serving skew is point-in-time correctness: at training time, each feature value must be computed using only the data that would have been available at that point in the past.
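
To make point-in-time correctness concrete, here is a minimal, Tecton-free sketch using pandas: each training example is joined to the most recent feature value known at or before that example’s timestamp. The table and column names are made up for illustration.

import pandas as pd

# Hypothetical training events (the "spine") and a feature's change history.
events = pd.DataFrame({
    "user_id": [1, 1, 2],
    "event_ts": pd.to_datetime(["2023-01-05", "2023-01-20", "2023-01-10"]),
}).sort_values("event_ts")

feature_history = pd.DataFrame({
    "user_id": [1, 1, 2],
    "feature_ts": pd.to_datetime(["2023-01-01", "2023-01-15", "2023-01-08"]),
    "likes_7d": [3, 9, 5],
}).sort_values("feature_ts")

# merge_asof picks, per event, the latest feature row with feature_ts <= event_ts,
# so no future information leaks into the training set.
training_set = pd.merge_asof(
    events,
    feature_history,
    left_on="event_ts",
    right_on="feature_ts",
    by="user_id",
)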

Tecton effectively manages the intricacies involved in calculating point-in-time correctness, both when saving features for model training (in the offline store) and model serving (in the online store).

By handling this complexity, Tecton minimises the potential for training-serving skew and significantly reduces the risk of introducing data leakage into our training data (for more details, see this blog post). This ensures that the models trained and deployed with Tecton features perform optimally, offering greater reliability and accuracy in production environments. The ability to address training-serving skew is a key factor that has significantly improved the overall quality and efficiency of our machine learning models.

Another demanding challenge faced by feature stores is the need to serve features at a large scale while maintaining low latency (<100 ms). At Depop, Tecton is configured to use Redis as the backend for the online store, allowing low-latency reads. Even with highly-performant backends like Redis, storing and updating features is tricky.
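
As an illustration of the read path, the snippet below sketches how a production service might fetch features at request time via the Tecton SDK. The workspace name, feature service name and join key are placeholders, and the exact SDK calls can vary between Tecton versions.

import tecton

# Placeholder workspace and feature service names.
ws = tecton.get_workspace("prod")
feature_service = ws.get_feature_service("product_ranking_service")

# A single low-latency lookup against the Redis-backed online store.
feature_vector = feature_service.get_online_features(
    join_keys={"user_id": 1234},
)
features = feature_vector.to_dict()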

Tecton addresses some of this complexity by introducing an innovative approach known as tiled time window aggregations. This technique involves storing partially aggregated data, known as tiles. For example, if we have a feature view that counts the number of impressions over a 7 day period and is updated daily, Tecton stores tiles counting impressions for each day. When feature data for a specific key is requested, Tecton retrieves the 7 most recent tiles, before calculating the total count across them. The advantage of this approach over storing the final aggregate is that features can be updated by pushing new tiles, rather than having to perform a full recalculation. This is particularly beneficial if you have features that are the same aggregate but over different window lengths, e.g. 7-day and 30-day. Crucially, these sets of tiles can be sparse, so that if events to be aggregated are only observed on a subset of days, then only the corresponding tiles need to be present, reducing overall memory usage.
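
The following Tecton-free sketch illustrates the idea: daily tiles are stored per key (and only for days that actually have events), and the 7-day count is assembled at read time by summing whichever tiles fall inside the window.

from datetime import date, timedelta

# Sparse daily tiles for one key: days with no impressions have no tile at all.
tiles = {
    date(2023, 8, 14): 3,
    date(2023, 8, 15): 4,
    date(2023, 8, 17): 2,
    date(2023, 8, 21): 7,
}

def seven_day_count(tiles: dict, as_of: date) -> int:
    """Sum the tiles inside the trailing 7-day window ending at as_of."""
    window_start = as_of - timedelta(days=6)
    return sum(
        count for day, count in tiles.items() if window_start <= day <= as_of
    )

# Updating the feature only requires pushing a new tile; a 30-day count
# would reuse the same tiles with a wider window at read time.
print(seven_day_count(tiles, as_of=date(2023, 8, 21)))  # 13 (the Aug 14 tile is excluded)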

Another crucial challenge involves effectively managing the process of materialisation — that is, backfilling and incrementally updating features over time as new data becomes available. Out of the box, Tecton is responsible for the orchestration of all the materialisation jobs within the feature store. Tecton operates a well-defined process for triggering backfill jobs and scheduling batch jobs in relation to the feature’s attributes. When a new feature is added, Tecton initiates backfill jobs that ensure values are populated from the feature start time up to the current date. Additionally, Tecton intelligently schedules incremental materialisation jobs based on the feature’s batch schedule.
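
As a rough mental model (not Tecton’s actual implementation), the set of intervals that needs materialising for a daily feature looks something like this:

from datetime import datetime, timedelta

def materialisation_windows(feature_start_time, batch_schedule, now):
    """Yield (start, end) intervals: everything before `now` is backfill,
    after which one incremental job runs per batch_schedule period."""
    start = feature_start_time
    while start + batch_schedule <= now:
        yield start, start + batch_schedule
        start += batch_schedule

windows = list(materialisation_windows(
    feature_start_time=datetime(2018, 10, 10),
    batch_schedule=timedelta(days=1),
    now=datetime(2023, 8, 22),
))
# Tecton groups the historical intervals into a handful of backfill jobs,
# then schedules one incremental job per day, offset by the source's data_delay.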

In the next section, we will explore how we define features in Tecton and specify materialisation jobs, providing a comprehensive understanding of Tecton’s data management process.

How are features defined in Tecton?

The process of creating a new feature follows a structured workflow. Take a look at the following example:

# Tecton SDK imports; get_batch_config, add_seller_id, user, develop_tags and
# DATA_SCIENCE_TEAM_EMAIL are internal Depop helpers and constants.
from datetime import datetime, timedelta

from tecton import Aggregation, BatchSource, HiveConfig, batch_feature_view

# Batch source built via an internal helper that wraps the Hive metastore config.
product_create_batch_config = get_batch_config(
    table="product_create",
    timestamp_key="created_date",
    data_delay=timedelta(hours=7),
)

product_create_batch = BatchSource(
    name="product_create_batch_v2_2_0",
    batch_config=product_create_batch_config,
    owner=DATA_SCIENCE_TEAM_EMAIL,
    tags=develop_tags,
)

# Batch source defined directly against a data lake table.
product_likes_config = HiveConfig(
    database="datalake",
    table="product_likes",
    data_delay=timedelta(hours=7),
)

product_likes = BatchSource(
    name="product_likes",
    batch_config=product_likes_config,
    tags=develop_tags,
)


@batch_feature_view(
    sources=[product_likes, product_create_batch],
    entities=[user],
    mode="pyspark",
    online=True,
    batch_schedule=timedelta(days=1),
    feature_start_time=datetime(2018, 10, 10),
    aggregations=[
        Aggregation(
            time_window=timedelta(days=7),
            column="product_like",
            function="count",
        ),
    ],
)
def product_like(product_likes, product_create):
    # Attach the seller id from product_create to each like event; seller_id
    # is the join key for the user entity.
    product_create = product_create.withColumnRenamed("user", "seller_id")
    product_likes = add_seller_id(
        df_to_add_seller_to=product_likes,
        df_with_seller_id=product_create,
    )
    product_likes = product_likes.withColumnRenamed("sender_id", "product_like")
    return product_likes.select(
        "event_timestamp",
        "seller_id",
        "product_id",
        "product_like",
    )

The first thing we have to do is provide a set of sources. These sources are where data will be read from when calculating a feature. At Depop we use a Hive metastore to track the location and schemas of our data lake on S3, which is where we can find most of the data we need for our features. You’ll also notice that we provide a `data_delay` parameter, which we’ll come back to later.

Once we have our input data sources, we need to define a feature view. A feature view is essentially a function that produces one or more features. At Depop, the majority of our feature views use pyspark as an engine to process input data and derive features. You’ll notice that there is lots of configuration provided within the `batch_feature_view` decorator — the most important argument for our example is the `aggregations` parameter, which defines the aggregations that will be computed.

So once we have our feature definition, how do we put it into action? Tecton implements a declarative feature repository pattern, where all feature definitions are stored as code in a versioned repo (e.g. git). This repo is the ‘source of truth’ for our expectation of the state of our feature store.

Using CI/CD processes integrated into our feature repo, Tecton can plan and apply the new state produced by any code changes in the repo (if you’ve ever used Terraform, this pattern should be very familiar!). If the code change includes a new feature, Tecton automates the subsequent steps of orchestrating backfill materialisation jobs based on the feature’s attributes. Additionally, incremental materialisation jobs are scheduled according to the feature’s batch schedule plus the data delay specified by the data source. This is how we initially implemented materialisation at Depop, but we soon ran into issues, which we’ll discuss in the next section.

When materialisation goes wrong

As mentioned earlier, most of our data sources come from Depop’s data lake. Data lake tables are updated daily, meaning that we will, for example, have access to yesterday’s data at some point this morning. To account for this in Tecton, we defined a data delay for each upstream data source. This delay attempts to guarantee that, when our job starts, all the necessary upstream dependencies have been populated with the freshest data. Upstream tables are not updated at exactly the same time each day, so we initially added 2 hours to these data delays to account for the variation.

However, we found that even with generous data delays we were still running into problems. Sometimes upstream data lake processes would only complete in the afternoon, or even the next day. We initially assumed the only consequence was that our features would not update and would therefore be stale, but the behaviour was actually more complex than first anticipated, because Tecton is designed to handle sparse updates, as discussed in the previous section. Let’s break this down to understand the consequences:

  • The Tecton materialisation job commences despite yesterday’s data not yet being available.
  • The materialisation runs and produces no new tile records, because there is no data available.
  • Tecton doesn’t raise an error; instead it updates the feature so that its aggregation window ends at the start of today. The window slides forward, so the oldest tile falls out of the window and a tile with a count of 0 for yesterday is effectively added to the aggregation (see the sketch after this list).
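
Reusing the earlier tile sketch, the failure mode looks like this:

from datetime import date

# Tiles for one key, where the upstream job for Aug 21 ran late,
# so no tile was written for that day.
tiles = {
    date(2023, 8, 14): 3,  # the oldest tile, about to leave the window
    date(2023, 8, 15): 4,
    date(2023, 8, 17): 2,
    # date(2023, 8, 21): missing (upstream data never arrived)
}

# 7-day count as of Aug 20 (before the slide): Aug 14..Aug 20 -> 3 + 4 + 2 = 9
# 7-day count as of Aug 21 (after the slide):  Aug 15..Aug 21 -> 4 + 2     = 6
# The feature silently drops from 9 to 6: the missing day counts as zero
# while the Aug 14 tile falls out of the window.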

Not only were our features stale, they had also become inaccurate. Although we did have visibility of when upstream dependencies failed to produce any data (via our data quality visualisations), we had no alerting on it, which led to silent failures. Even when we did spot an issue, recovering was a fiddly, manual process: we had to figure out which features were sourced from the late upstream data, wait until the upstream data was ready, and then manually rerun materialisation. As a team, one of our primary objectives is to promote automation and minimise manual data science workflows, and we recognised that this scenario was a risk to our ability to scale Tecton usage as more data sources and features were added.

Ultimately we lacked the ability to delay our operations until all upstream dependencies were ready. So we decided to see if we could do something about it …

Airflow Integration

At Depop, Airflow is used extensively for the orchestration and execution of data workflows. Over time, we have developed extensive functionality around Airflow to enhance its capabilities. This includes leveraging its features to effectively orchestrate jobs, monitor the completion of upstream dependencies, and validate our processes using tools like Great Expectations. Airflow emerged as the ideal candidate for orchestrating our Tecton materialisation jobs.

Tecton provides us with the flexibility to manually trigger all the materialisation jobs associated with a feature view. However, if you choose to trigger materialisation jobs manually, you also take on responsibility for initiating the backfill process. To streamline backfills, we implemented an additional step in the CI/CD pipeline of our feature repository. This step automatically checks for newly created features or updates to existing ones and, based on this analysis, determines whether we need to trigger backfills for the respective features, ensuring timely and accurate data population.
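
As a sketch of what that CI step might do (the diffing logic is elided, and trigger_materialization_job, together with its arguments, is our assumption about the Tecton SDK surface rather than a confirmed signature):

from datetime import datetime, timezone

import tecton

ws = tecton.get_workspace("prod")  # placeholder workspace name

# `changed_feature_views` would come from diffing the feature repo in CI:
# a list of (feature view name, feature start time) pairs needing a backfill.
for name, feature_start_time in changed_feature_views:
    fv = ws.get_feature_view(name)
    # Assumed SDK call: trigger a backfill covering the full feature history.
    fv.trigger_materialization_job(
        start_time=feature_start_time,
        end_time=datetime.now(timezone.utc),
    )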

For the orchestration of our daily jobs, we decided to leverage Airflow’s capabilities. We construct a dynamic DAG that queries all our Tecton features on a daily basis. Using the Tecton SDK, the DAG retrieves the upstream dependencies for each feature and dynamically creates Airflow sensors to monitor their completion. This approach ensures that our daily jobs are orchestrated efficiently, taking into account the readiness of their upstream dependencies.

In our Airflow DAG we leverage two types of sensor. The first is dedicated to monitoring tables in Redshift (where some data is stored outside of the data lake), which are all updated simultaneously. The second is an AWS Glue sensor, responsible for monitoring a given table in the data lake. We also rely on a Kubernetes pod operator to trigger materialisation jobs.

Using the upstream dependencies of our feature views, we dynamically generate operators in our DAG. For our example feature view, this results in three operators: two sensors for its upstream tables and one operator to trigger materialisation.
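
To make this concrete, here is a trimmed-down sketch of those operators: one Glue sensor per data lake table, plus a pod operator that triggers materialisation once both sensors succeed. In production the DAG is generated dynamically from each feature view’s upstream dependencies via the Tecton SDK; the partition expressions, image name and exact provider import paths (which vary by Airflow version) are illustrative placeholders.

from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.sensors.glue_catalog_partition import (
    AwsGlueCatalogPartitionSensor,
)
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
    KubernetesPodOperator,
)

with DAG(
    dag_id="tecton_materialisation_product_like",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
):
    # Wait for yesterday's partitions to land in the data lake.
    wait_for_product_likes = AwsGlueCatalogPartitionSensor(
        task_id="wait_for_product_likes",
        database_name="datalake",
        table_name="product_likes",
        expression="created_date = '{{ ds }}'",
    )
    wait_for_product_create = AwsGlueCatalogPartitionSensor(
        task_id="wait_for_product_create",
        database_name="datalake",
        table_name="product_create",
        expression="created_date = '{{ ds }}'",
    )

    # Run the (hypothetical) container that triggers the materialisation job.
    materialise = KubernetesPodOperator(
        task_id="materialise_product_like",
        name="materialise-product-like",
        image="depop/tecton-materialisation:latest",  # placeholder image
        arguments=["--feature-view", "product_like", "--date", "{{ ds }}"],
    )

    [wait_for_product_likes, wait_for_product_create] >> materialise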

Results

Since integrating Airflow with Tecton, we have had no incidents where features were incorrectly updated in the absence of upstream data. This integration has proven to be a crucial step in ensuring the reliability and freshness of our feature views.

By triggering materialisation jobs as soon as the upstream dependencies complete, we also eliminate the need for an additional 2-hour delay that was previously added to ensure data availability. This streamlined approach allows us to rapidly incorporate the latest data into our feature views, significantly reducing the time gap between data updates and model utilisation.

As a result, our models now operate on the most up-to-date and reliable information, and we have eliminated incorrect feature values in our system, leading to improved performance and accuracy. It also simplifies the operational complexity of materialisation, enabling us to experiment with increasing the frequency of feature updates (e.g. twice daily) without worrying about the manual support that would otherwise be required.

In summary, the integration of Tecton with Airflow has enabled us to ensure the reliability and freshness of our feature views, mitigate incidents of outdated data, and serve models with fresher information. The combined power of Tecton and Airflow has strengthened our data infrastructure, allowing for efficient orchestration and timely insights for our production models.
