Outlier detection and removal:

What are outliers?

Outliers are values that are abnormal or differ markedly from the rest of the values. They may be caused by measurement error, manual entry error, and so on. Outliers distort the analysis and make its results less accurate.


Outlier detection and removal, together with handling missing values, are among the most important steps in data preprocessing.

Methods for outlier detection and removal:

1) Outlier detection and removal using IQR

Any numeric dataset can be described by a five-number summary. The data is first arranged in ascending order, then the five values are calculated: the minimum value; the first quartile (Q1), which lies a quarter of the way through the data; the median, which lies midway through the data; the third quartile (Q3), which lies three quarters of the way through the data; and the maximum, which is the highest value in the data.
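The five-number summary can be sketched in a few lines of Python. This is only an illustration: the sample values are made up, and it assumes NumPy's default quartile convention (linear interpolation between neighbouring values); other tools may use slightly different conventions and give slightly different quartiles.

```python
import numpy as np

# Hypothetical sample data, chosen only for illustration.
data = np.array([41, 7, 49, 36, 40, 15, 42, 43, 47, 39])

# np.percentile sorts the data internally, so the input order does not matter.
# The 0th, 25th, 50th, 75th and 100th percentiles are the five numbers.
minimum, q1, median, q3, maximum = np.percentile(data, [0, 25, 50, 75, 100])

print("min:", minimum)
print("Q1:", q1)
print("median:", median)
print("Q3:", q3)
print("max:", maximum)
```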

The interquartile range (IQR) is calculated by subtracting the first quartile from the third quartile.

IQR = Q3 - Q1

The interquartile range is then used to set a lower limit and an upper limit.

Lower limit = Q1 - 1.5 × IQR

Upper limit = Q3 + 1.5 × IQR

Data points below the lower limit or above the upper limit are considered to be outliers.
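A minimal sketch of the IQR rule in Python follows. The function name `remove_outliers_iqr`, the parameter `k` and the sample values are assumptions made for illustration, and the quartiles use NumPy's default interpolation.

```python
import numpy as np

def remove_outliers_iqr(values, k=1.5):
    """Split values into (kept, outliers) using the Q1 - k*IQR / Q3 + k*IQR limits."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    # True for points inside the limits, False for outliers.
    inside = (values >= lower) & (values <= upper)
    return values[inside], values[~inside]

# Hypothetical data: nine similar readings and one extreme value.
data = [10, 12, 12, 13, 12, 11, 14, 13, 15, 102]
kept, outliers = remove_outliers_iqr(data)
print("kept:", kept)
print("outliers:", outliers)
```

Here only the value 102 falls outside the limits, so it is the only point reported as an outlier.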

2) Outlier detection and removal using Percentile:

When the data is widely spread, removing outliers with the interquartile range is not always effective, because it may end up deleting essential data.

The percentile method is an extension of the IQR approach in which you set the cutoffs according to the data. For example, you can set custom cutoffs at the 0.1 and 99.0 percentiles.
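The percentile approach might look like the sketch below. The function name `remove_outliers_percentile` and the sample data are hypothetical; the default cutoffs reuse the 0.1 and 99.0 example from the text and should be tuned to the dataset at hand.

```python
import numpy as np

def remove_outliers_percentile(values, lower_pct=0.1, upper_pct=99.0):
    """Split values into (kept, outliers) using custom percentile cutoffs."""
    values = np.asarray(values, dtype=float)
    lower, upper = np.percentile(values, [lower_pct, upper_pct])
    # Points between the two cutoffs are kept; the rest are flagged.
    inside = (values >= lower) & (values <= upper)
    return values[inside], values[~inside]

# Hypothetical data: the integers 1 to 1000.
data = np.arange(1, 1001)
kept, outliers = remove_outliers_percentile(data)
print("kept:", len(kept), "outliers:", len(outliers))
```

Because the cutoffs are not symmetric, more points are trimmed from the top of the range than from the bottom. Unlike the IQR rule, this method always removes a fixed share of the data, whether or not those points are truly abnormal.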

3) Outlier detection and removal using Z-score:

This method is suited to data that is approximately normally distributed and is not reliable for skewed distributions. The z-score is a standard score that tells how many standard deviations a data point lies from the mean.

In a normal distribution, approximately

68% of the data lies between -1 and +1 standard deviations of the mean,

95% of the data lies between -2 and +2 standard deviations,

99.7% of the data lies between -3 and +3 standard deviations.

So when the z-score of a data point falls outside the range -3 to +3, the point is considered to be an outlier.
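The z-score rule can be sketched as follows. The function name `remove_outliers_zscore` and the sample data are assumptions for illustration, and the population standard deviation (NumPy's default) is used here. Note that with very small samples no point can reach a z-score of 3, so the method needs a reasonable number of observations.

```python
import numpy as np

def remove_outliers_zscore(values, threshold=3.0):
    """Split values into (kept, outliers) where |z| > threshold marks an outlier."""
    values = np.asarray(values, dtype=float)
    std = values.std()  # population standard deviation
    if std == 0:
        # All values are identical, so nothing can be an outlier.
        return values, values[:0]
    z = (values - values.mean()) / std
    inside = np.abs(z) <= threshold
    return values[inside], values[~inside]

# Hypothetical data: twenty readings near 50 and one extreme value.
data = [48, 49, 50, 51, 52] * 4 + [120]
kept, outliers = remove_outliers_zscore(data)
print("kept:", len(kept))
print("outliers:", outliers)
```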

4) Outlier detection using DBSCAN Clustering:

DBSCAN is a density-based clustering algorithm. A radius (epsilon) is set, together with a minimum number of neighbouring points, and data points that lie within this distance of enough other points are grouped together as a cluster. Data points that do not end up in any cluster are considered to be outliers.
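A minimal sketch using scikit-learn's `DBSCAN` is shown below, assuming scikit-learn is installed. The points and the values of `eps` (the radius) and `min_samples` (the minimum number of points needed to form a dense region) are made up for illustration; in practice both need tuning for the data. scikit-learn labels points that belong to no cluster with -1.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical 2-D points: two tight groups and one isolated point.
points = np.array([
    [1.0, 1.0], [1.1, 1.0], [0.9, 1.1], [1.0, 0.9],
    [5.0, 5.0], [5.1, 5.0], [4.9, 5.1], [5.0, 4.9],
    [10.0, 0.0],  # far away from both groups
])

# eps is the neighbourhood radius; min_samples counts the point itself.
model = DBSCAN(eps=0.5, min_samples=3)
labels = model.fit_predict(points)

# Points labelled -1 were not assigned to any cluster.
outliers = points[labels == -1]
kept = points[labels != -1]
print("labels:", labels)
print("outliers:", outliers)
```

Unlike the three methods above, this approach works directly on multi-dimensional data, but its result depends strongly on the chosen radius and on the scale of the features.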

