What is data drift and how to detect it?

Florian Malbranque · 02 August 2022

Outlier detection and drift detection are two common techniques in data analytics. Outlier detection identifies individual data points that are unusual, whereas drift detection identifies changes in the distribution of the data over time. In this article, you'll learn about the differences between outlier and drift detection, how each one works, and when to use each approach.

What is Outlier Detection?

In statistics, outlier detection is the identification of observations that are far from the rest of the data. These outliers can be caused by errors in the data collection process, or they can simply be unusual values that are not indicative of the general population.

Outlier detection is important because outliers can have a significant impact on the results of statistical analyses. For example, an outlier pulls the mean toward itself: if a dataset used to calculate a mean includes one unusually large value, the resulting mean will be inflated and will no longer reflect the typical observation.
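As a rough illustration with made-up numbers (the readings below are hypothetical, not from any real dataset), this sketch compares the mean and the median of a small sample before and after a single extreme value is added.

```python
from statistics import mean, median

# Hypothetical sample: five typical readings, then the same sample plus one extreme value.
typical = [10, 11, 9, 10, 10]
with_outlier = typical + [100]

# The mean is sensitive to the extreme value; the median barely reacts.
print("mean:  ", mean(typical), "->", mean(with_outlier))
print("median:", median(typical), "->", median(with_outlier))
```

Here the mean moves from 10 to 25 while the median stays at 10, which is why a single outlier can distort any analysis built on the mean.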

There are a variety of methods for detecting outliers, including visual inspection, simple statistical tests, and more sophisticated machine learning algorithms.
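One simple statistical test of this kind is the interquartile range (IQR) rule, often attributed to Tukey. The sketch below is a minimal version of it, assuming one-dimensional numeric data; the function name `iqr_outliers`, the multiplier `k=1.5`, and the sample readings are illustrative choices, not something prescribed by this article.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return the values that fall outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)  # quartiles of the sample
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical readings with one obviously unusual value.
readings = [10, 11, 9, 10, 10, 12, 9, 11, 100]
print(iqr_outliers(readings))
```

Because the rule relies on quartiles rather than the mean and standard deviation, the outlier itself has little influence on the thresholds used to catch it.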

What is Drift Detection?

Outlier detection and drift detection may appear similar, but there are some key differences between the two. So, what is drift detection?

Drift detection is the process of determining whether the distribution of a data set has shifted. The shift can move values upward or downward, and it can be either temporary or permanent. Drift detection is important because it can help you identify changes in your data that could impact your analysis or lead to inaccurate results.

There are several methods for drift detection. A common one is to split the data set into two parts, for example an older reference portion and a more recent portion, and compare their distributions. If the difference between the two distributions is statistically significant, then it is likely that there has been a drift.
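The split-and-compare idea can be sketched with a two-sample Kolmogorov-Smirnov (KS) comparison, which measures the largest gap between the empirical cumulative distributions of the two parts. This is only one possible choice of test, written in plain Python under some assumptions: the data is one-dimensional, the samples are large enough for the usual large-sample critical value to apply, and the coefficient 1.36 corresponds to a significance level of roughly 0.05. The function names and the synthetic series are hypothetical.

```python
import math
import random
from bisect import bisect_right

def ks_statistic(a, b):
    """Largest gap between the empirical CDFs of samples a and b."""
    a, b = sorted(a), sorted(b)
    gap = 0.0
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)  # share of a that is <= x
        cdf_b = bisect_right(b, x) / len(b)  # share of b that is <= x
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap

def drift_detected(reference, current, coeff=1.36):
    """Flag drift when the KS statistic exceeds the large-sample critical value."""
    n, m = len(reference), len(current)
    critical = coeff * math.sqrt((n + m) / (n * m))
    return ks_statistic(reference, current) > critical

# Synthetic series whose second half is shifted upward by one unit.
random.seed(0)
series = [random.gauss(0, 1) for _ in range(500)]
series += [random.gauss(1, 1) for _ in range(500)]

half = len(series) // 2
print(drift_detected(series[:half], series[half:]))
```

In practice a statistics library would usually provide this test along with a p-value; the hand-written version is only meant to show what "comparing the distributions" involves.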

Drift detection is not always easy, and it can be tricky to know when a drift has occurred. However, it is an important tool for ensuring accuracy in data analysis.

When to use Outlier Detection

There are many different ways to detect outliers in data, but sometimes it can be difficult to know when to use outlier detection. Here are some guidelines to help you decide when to use outlier detection:

- If you want to find unusual data points that could be indicative of errors or problems, outlier detection is a good option.
- If you want to find data points that are significantly different from the rest of the data, outlier detection is a good option.
- If you want to find data points that are potentially interesting and worth further investigation, outlier detection is a good option.

When to use Drift Detection

If you're working with streaming data, it's important to be able to detect when the distribution of that data changes over time. This is known as drift detection, and it's a vital part of keeping your models accurate.

There are a few different ways to approach drift detection. One option is a technique called streaming k-means. This approach works by continually updating a k-means model as new data comes in, and then comparing the current model to a reference model. If the difference between the two models is greater than a certain threshold, then it's considered an indicator of drift.
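A minimal sketch of this streaming k-means idea follows, assuming one-dimensional data, two clusters, and a fixed step size so that recent points keep influencing the centroids. The class `StreamingKMeans`, the helper `centroid_shift`, the reference centroids, and the threshold of 1.5 are all hypothetical choices made for illustration; a production setup would need to pick a distance measure and threshold that suit its own data.

```python
import random

class StreamingKMeans:
    """1-D online k-means: each new point nudges its nearest centroid."""

    def __init__(self, centroids, lr=0.05):
        self.centroids = list(centroids)
        self.lr = lr  # fixed step size (an assumption of this sketch)

    def update(self, x):
        # Find the nearest centroid and move it a small step toward x.
        i = min(range(len(self.centroids)),
                key=lambda j: abs(x - self.centroids[j]))
        self.centroids[i] += self.lr * (x - self.centroids[i])

def centroid_shift(reference, current):
    """Largest distance between centroids, matched by sorted order (valid in 1-D)."""
    return max(abs(r - c) for r, c in zip(sorted(reference), sorted(current)))

random.seed(1)
reference = [0.0, 10.0]   # centroids of a hypothetical reference model
model = StreamingKMeans(reference)
threshold = 1.5           # hypothetical tolerance, to be tuned per application

# Synthetic stream: two clusters near 0 and 10, then the first cluster moves to 3.
stream = [random.gauss(random.choice([0, 10]), 1) for _ in range(500)]
stream += [random.gauss(random.choice([3, 10]), 1) for _ in range(500)]

for t, x in enumerate(stream):
    model.update(x)
    if centroid_shift(reference, model.centroids) > threshold:
        print(f"possible drift at step {t}")
        break
```

The step size controls the trade-off: a larger value reacts to drift faster but makes the centroids noisier, which may call for a higher threshold.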

Another common approach is to use a sliding window technique. This works by dividing the data into a series of fixed-size windows, and then fitting a model to each window. The reference model is then created by averaging the individual window models. Again, if the difference between the model for the current window and the reference model is greater than a certain threshold, then it's an indicator of drift.
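The windowing steps above can be sketched as follows, assuming the simplest possible "model" for a window: its mean and standard deviation. The reference is the average of the models fitted to all earlier windows, and the difference is measured as the gap between means in units of the reference standard deviation. The function names, the window size, and the threshold are hypothetical and would need tuning on real data.

```python
from statistics import mean, stdev

def fit_window(window):
    """The 'model' of a window is simply its mean and standard deviation."""
    return mean(window), stdev(window)

def windowed_drift(data, size, threshold=1.0):
    """Return (window_index, score) for each window that departs from the reference."""
    # Cut the data into consecutive, non-overlapping windows of equal size.
    windows = [data[i:i + size] for i in range(0, len(data) - size + 1, size)]
    models = [fit_window(w) for w in windows]
    flagged = []
    for k in range(1, len(models)):
        # Reference model: average of the models fitted to all earlier windows.
        ref_mean = mean([m for m, _ in models[:k]])
        ref_std = mean([s for _, s in models[:k]])
        if ref_std == 0:
            continue  # no spread in the reference, so the score is undefined
        score = abs(models[k][0] - ref_mean) / ref_std
        if score > threshold:
            flagged.append((k, score))
    return flagged

# Hypothetical data: two stable windows followed by a shifted one.
data = [1, 2, 3, 1, 2, 3, 11, 12, 13]
print(windowed_drift(data, size=3))
```

One limitation of this sketch is that a drifted window is folded into the reference for later windows, so a long-lasting shift gradually becomes the new normal.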

There are pros and cons to both of these approaches, so it's important to choose the one that makes the most sense for your particular application. But whichever approach you choose, drift detection can be a valuable tool.

Conclusion

There you have it: a quick guide to the difference between outlier and drift detection. As you can see, each method has its own strengths and weaknesses, so it's important to choose the right one for your data and your needs. In general, outlier detection is the better fit when you care about individual unusual observations, while drift detection is the better fit when you care about changes in the overall distribution over time, which requires enough data in each period to compare distributions reliably. Whichever method you choose, just make sure you test it thoroughly before using it in production.