Monitoring Data Drift

DMM detects and monitors data drift for the input features and output predictions of your model. When you register a model, DMM ingests the training dataset to calculate the probability distributions of all feature and prediction columns. It discretizes each column by creating bins and counting the frequency of values in each bin; these binned distributions act as the reference pattern.

When prediction data is ingested into DMM, it calculates the probability distribution of each column using the same bins and then applies the specified statistical divergence or distance test to quantify the dissimilarity, or drift, between the training and prediction distributions for each column.
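
As a minimal illustration of this approach (a sketch, not DMM's internal code), the binning step for a numerical column might look like the following, assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)
training_col = rng.normal(50, 10, size=10_000)   # feature values at training time
prediction_col = rng.normal(53, 10, size=2_000)  # feature values in live predictions

# Reference pattern: bin the training column and normalize the counts.
ref_counts, bin_edges = np.histogram(training_col, bins=20)
ref_dist = ref_counts / ref_counts.sum()

# Prediction distribution computed over the *same* bin edges.
pred_counts, _ = np.histogram(prediction_col, bins=bin_edges)
pred_dist = pred_counts / pred_counts.sum()

# One of the divergence or distance tests below is then applied to
# ref_dist and pred_dist to produce a single drift value for the column.
```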

DMM supports multiple statistical tests out of the box, and each data column can use a different test. Along with the test type, the user chooses a passing condition (greater than, less than, etc.) and a threshold. When the drift value for a feature fails the test criteria, the feature status is marked red (i.e., the feature has drifted beyond safe conditions). In the case of Scheduled Checks, the feature is listed as one of the failing features. A larger drift value signifies a larger drift; note, however, that drift values from two different test types cannot be directly compared.
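
The exact evaluation logic is internal to DMM, but a hypothetical sketch of checking a drift value against a passing condition and threshold (names here are illustrative only) could look like this:

```python
import operator

# Hypothetical mapping of passing conditions to comparison operators.
CONDITIONS = {"less than": operator.lt, "greater than": operator.gt}

def feature_status(drift_value, condition, threshold):
    """Return 'green' if the drift value meets the passing condition,
    otherwise 'red' (drifted beyond safe conditions)."""
    passed = CONDITIONS[condition](drift_value, threshold)
    return "green" if passed else "red"

print(feature_status(0.31, "less than", 0.25))  # -> 'red' (failing feature)
```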

When a new model is registered, it inherits DMM's global default test settings (Settings > Test Defaults). After ingesting the first prediction dataset, the user can change the test settings to values suitable for the model. Saved test settings are used for all subsequent automated checks, such as when a new prediction data file is uploaded (through the UI or API) or when Scheduled Checks run for that model.

DMM supports the following statistical tests for data drift:

1. Kullback–Leibler Divergence

Kullback–Leibler divergence (also called relative entropy) is a measure of how one probability distribution differs from a second, reference probability distribution. A Kullback–Leibler divergence of 0 indicates that the two distributions in question are identical. It is a robust test that works for different distributions and hence is the test most commonly used for detecting drift.
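
For illustration, KL divergence over the binned distributions can be computed with SciPy (a sketch, not DMM's implementation); a small epsilon guards against empty bins, which would otherwise make the divergence infinite:

```python
import numpy as np
from scipy.stats import entropy

def kl_divergence(ref_dist, pred_dist, eps=1e-10):
    ref = np.asarray(ref_dist) + eps
    pred = np.asarray(pred_dist) + eps
    # entropy(pk, qk) computes KL(pk || qk); SciPy renormalizes both inputs.
    return entropy(pred, ref)

print(kl_divergence([0.2, 0.5, 0.3], [0.1, 0.4, 0.5]))  # ~0.097; identical -> 0.0
```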

2. Population Stability Index

Population Stability Index (PSI) is a popular metric in the finance industry for measuring changes in distribution between two datasets. It tends to produce less noise and has the advantage of a generally accepted threshold of 0.2-0.25, above which the distributions are considered to have drifted.
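
PSI can be computed directly from the binned frequencies; this is a sketch assuming both distributions use the same bins:

```python
import numpy as np

def psi(ref_dist, pred_dist, eps=1e-6):
    ref = np.asarray(ref_dist) + eps    # epsilon avoids log(0) and division by zero
    pred = np.asarray(pred_dist) + eps
    return float(np.sum((pred - ref) * np.log(pred / ref)))

print(psi([0.2, 0.5, 0.3], [0.1, 0.4, 0.5]))  # ~0.19, approaching the 0.2 threshold
```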

3. Energy Distance

Energy distance is a statistical distance between the distributions of random vectors, which characterizes equality of distributions. Energy distance is zero if and only if the distributions are identical.
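
SciPy ships a two-sample energy distance that operates on raw (unbinned) samples; a sketch:

```python
import numpy as np
from scipy.stats import energy_distance

rng = np.random.default_rng(0)
train = rng.normal(0, 1, 5_000)    # training samples
preds = rng.normal(0.5, 1, 1_000)  # prediction samples with a shifted mean

# Near 0 for samples drawn from the same distribution; grows with drift.
print(energy_distance(train, preds))
```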

4. Wasserstein Distance

Wasserstein distance is the minimum amount of work needed to transform a reference distribution into the target distribution, which gives a natural measure of the distance between two distributions. In contrast to a number of common divergences such as Kullback–Leibler, it is (weakly) continuous, which makes it well suited for analyzing corrupted data.
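
A sketch of the first Wasserstein (earth mover's) distance via SciPy:

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
train = rng.normal(0, 1, 5_000)
preds = rng.normal(0.5, 1, 1_000)

# Roughly the "work" to move one distribution onto the other; for two
# equal-shape normals this approximates the mean shift of 0.5.
print(wasserstein_distance(train, preds))
```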

5. Kolmogorov–Smirnov Divergence

The two-sample K–S test is a useful and general nonparametric method for comparing two samples, as it is sensitive to differences in both the location and shape of the empirical cumulative distribution functions of the two samples. It is well suited for numerical data.
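
A sketch of the two-sample K–S test via SciPy:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
train = rng.normal(0, 1, 5_000)
preds = rng.normal(0.5, 1, 1_000)

# The statistic is the maximum gap between the two empirical CDFs.
result = ks_2samp(train, preds)
print(result.statistic, result.pvalue)  # large statistic / tiny p-value -> drift
```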

6. Chi-square Divergence

The chi-square test is another popular divergence test, well suited for categorical data.
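
A sketch of a chi-square test on categorical bin counts; the expected counts come from the training frequencies, scaled to the prediction total so that the observed and expected sums match, as SciPy's chisquare requires:

```python
import numpy as np
from scipy.stats import chisquare

ref_counts = np.array([400, 350, 250])  # training counts per category
pred_counts = np.array([120, 40, 40])   # prediction counts per category

# Expected counts under the training distribution, scaled to the
# prediction total (200 here).
expected = ref_counts / ref_counts.sum() * pred_counts.sum()
stat, pvalue = chisquare(pred_counts, f_exp=expected)
print(stat, pvalue)  # a small p-value suggests the categorical mix has drifted
```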