Data Drift Detection

[Enterprise-Only] Data Drift Detection is currently available to Enterprise users only.
Stay tuned for future announcements.
When deploying production models, a key concern is model and data freshness. Because the real world constantly changes, an important question is whether the model can still make trustworthy predictions. At the heart of this question is the training data: does the data used to train the model still capture the current state of the world, or, more importantly, are we seeing new types of data in production that were not seen during training?

What is Data Drift?

When production data begins to change or look different from training data, we call this dataset drift. Many factors can lead to dataset drift, and it can manifest in many ways. Dataset drift falls into two main categories: 1) virtual drift (covariate shift) and 2) concept drift.

Virtual Drift

Virtual drift refers to a change in the underlying data distribution P(x) without a change in P(y|x). In other words, virtual drift is a change in the type of data seen without a change in the relationship between a given data sample and the label it is assigned. Virtual drift can take many forms, such as changing syntactic structure and style (e.g. new ways of asking a particular question to a QA system) or the appearance of novel words and concepts (e.g. Covid).
Virtual drift usually manifests when there is insufficient training data coverage and/or new concepts appear in the real world. Virtual drift in production data can reveal incorrectly learned decision boundaries, leading to incorrect, untrustworthy predictions (especially in the case of an overfit model).
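The definition above can be illustrated with a small simulation. This is a minimal sketch, not part of the product: the one-dimensional feature, the Gaussian distributions, and the `label` rule are all hypothetical stand-ins chosen so that P(x) moves while P(y|x) stays fixed.

```python
import random
import statistics


def label(x: float) -> int:
    """Fixed labeling rule standing in for P(y|x); it is the same before and after the shift."""
    return 1 if x > 0.0 else 0


rng = random.Random(0)  # seeded so the sketch is reproducible

# Hypothetical 1-D feature: training inputs centered at 0, production inputs centered at 3.
train_x = [rng.gauss(0.0, 1.0) for _ in range(1000)]
prod_x = [rng.gauss(3.0, 1.0) for _ in range(1000)]

# P(x) has moved: the production mean sits well away from the training mean.
mean_shift = statistics.mean(prod_x) - statistics.mean(train_x)

# Share of production inputs larger than anything seen in training,
# i.e. inputs the training data never covered.
train_max = max(train_x)
uncovered = sum(x > train_max for x in prod_x) / len(prod_x)

# The label balance changes only because the inputs changed, not because the rule did.
train_pos_rate = statistics.mean(label(x) for x in train_x)
prod_pos_rate = statistics.mean(label(x) for x in prod_x)

print(f"mean shift: {mean_shift:.2f}, uncovered share: {uncovered:.2%}")
print(f"positive rate: train {train_pos_rate:.2f}, production {prod_pos_rate:.2f}")
```

The `uncovered` share is the part that matters for a deployed model: those inputs fall in a region where the model never saw training examples, so its predictions there are extrapolations.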

Concept Drift

In contrast to virtual drift, concept drift happens when there is a change in P(y|x) without a change in P(x). That is, the way labels are assigned to a given data sample changes. This typically manifests as the label for a given data sample changing over time. For example, concept drift occurs if there is a change in labeling criteria or guidelines: certain samples previously labeled Class A should now be labeled Class B.
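The labeling-guideline example can be sketched in a few lines. This is an illustrative toy under stated assumptions: the score feature, the thresholds 0.5 and 0.7, and the functions `label_old` and `label_new` are hypothetical. The inputs are identical in both cases, so P(x) is unchanged; only the mapping from input to label moves.

```python
import random


def label_old(score: float) -> str:
    """Hypothetical original guideline: scores above 0.5 are Class A."""
    return "A" if score > 0.5 else "B"


def label_new(score: float) -> str:
    """Hypothetical revised guideline: the Class A threshold is raised to 0.7."""
    return "A" if score > 0.7 else "B"


rng = random.Random(0)  # seeded so the sketch is reproducible

# The same samples are labeled under both guidelines, so P(x) does not change.
samples = [rng.random() for _ in range(1000)]

# Samples whose label differs between the two guidelines.
relabeled = [s for s in samples if label_old(s) != label_new(s)]
flip_rate = len(relabeled) / len(samples)

print(f"{flip_rate:.1%} of samples moved from Class A to Class B")
```

Nothing about the inputs reveals this change, which is why it cannot be detected from features alone.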

Figure: Comparing virtual and concept drift.

Surfacing Virtual Data Drift

Without access to ground truth labels and underlying labeling criteria, surfacing concept drift is intractable. Therefore, we focus on detecting virtual drift - detecting when inference data / features are sufficiently different from the data used to train the model.
When viewing an inference run in the Galileo Console, Drifted appears as a dataset selection tab. Since drift is all about detecting changes in the data, it is always computed and shown with respect to a reference dataset: the training data. Therefore, the Data Drift tab is only active when comparing against the training data run. Selecting Drifted filters for the inference data that is not properly covered by the training dataset and may therefore be confusing or unknown to the model.
Figure: Inference data compared with training data.
Figure: Drifted inference data compared with training data.

Computing Virtual Data Drift

When analyzing a dataset for virtual drift (covariate shift), we focus on the model's data embedding space. Since drift is computed with respect to the training dataset, our algorithm works by comparing the embedding spaces of a given inference dataset and the reference training dataset. Our goal is to identify areas in the embedding space where inference data is not sufficiently covered or represented by training data.
We break down our definition of drift into two categories:
Type 1 Drift: Inference data that are sufficiently far from dense regions in the training data embedding space.
Type 2 Drift: Dense regions of inference data overlapping sparse regions of training data.
When computing Type 1 and Type 2 Drift, we first leverage dynamic density-based clustering algorithms to identify dense regions within the training and inference data spaces. Moreover, by building upon the underlying mechanism of these density-based methods, we can map the relative densities within the embedding space so that we can 1) identify inference data samples that do not meet the density "requirements" of dense regions in the training data space (Type 1 Drift), and 2) discover dense regions of inference data within low-density regions of the training space (Type 2 Drift).
Overall, the combination of type 1 and 2 drift highlights inference data that exist within "holes" or "blind spots" in the model's embedding representation of the original training data.
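The two drift types can be made concrete with a toy example. The sketch below is not the production algorithm: it swaps the density-based clustering for a plain k-nearest-neighbor distance as a density proxy. The function names, the 2-D "embeddings", and the two radii (`dense_radius` and `far_radius`) are assumptions made purely for illustration.

```python
import math
import statistics
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def kth_neighbor_distance(point: Point, data: Sequence[Point], k: int) -> float:
    """Distance from `point` to its k-th nearest neighbor in `data`."""
    return sorted(math.dist(point, other) for other in data)[k - 1]


def core_distances(data: Sequence[Point], k: int) -> List[float]:
    """k-th neighbor distance of each point within its own dataset.

    Neighbor k + 1 is requested because each point is its own nearest neighbor.
    """
    return [kth_neighbor_distance(p, data, k + 1) for p in data]


def classify_drift(train: Sequence[Point], inference: Sequence[Point],
                   k: int = 3, dense_scale: float = 2.0) -> List[str]:
    """Label each inference point as 'type1', 'type2', or 'covered' (assumed thresholds)."""
    train_core = core_distances(train, k)
    # Assumption: a region counts as dense if neighbors lie within this radius.
    dense_radius = dense_scale * statistics.median(train_core)
    # Assumption: beyond this radius a point is outside the training data entirely.
    far_radius = max(train_core)
    labels = []
    for point, d_inf in zip(inference, core_distances(inference, k)):
        d_train = kth_neighbor_distance(point, train, k)
        if d_train > far_radius:
            labels.append("type1")  # far from the training data
        elif d_train > dense_radius and d_inf <= dense_radius:
            labels.append("type2")  # dense inference region, sparse training region
        else:
            labels.append("covered")
    return labels


# Toy training set: a dense 10 x 10 grid plus four widely spaced points.
train = [(i / 10, j / 10) for i in range(10) for j in range(10)]
train += [(5.0, 5.0), (6.0, 5.0), (5.0, 6.0), (6.0, 6.0)]

# Toy inference set: three covered points, two isolated points, and a tight
# 3 x 3 cluster sitting inside the sparse training region.
inference = [(0.45, 0.45), (0.25, 0.55), (0.65, 0.35), (-10.0, -10.0), (20.0, 0.0)]
inference += [(5.5 + dx, 5.5 + dy) for dx in (-0.05, 0.0, 0.05) for dy in (-0.05, 0.0, 0.05)]

labels = classify_drift(train, inference)
print(labels)
```

On this toy data the three points inside the grid come back as covered, the two isolated points as Type 1, and the nine clustered points as Type 2. A real implementation would work on high-dimensional model embeddings and derive its density thresholds from the clustering step rather than from fixed multipliers.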