Handling missing values in data science:

What are missing values?

When data is collected by web scraping or from other sources, some values may be absent. These are called missing values.

Missing values can arise from data corruption, observation errors, or users failing to provide the data, among other causes.

Types of missing values

1) Missing completely at random (MCAR):

Here the chance that a value is missing is unrelated to any other data, observed or unobserved, so the pattern of missing data cannot be predicted.

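2) Missing at random (MAR):

Here a pattern in the missing values can be identified from the other observed variables, so the values are not missing completely at random. The missingness does not depend on the missing value itself.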

3) Missing not at random (MNAR):

Here the missingness depends on the missing value itself. For example, the user might have purposefully avoided giving the data.

Methods to handle missing values:

1) Deleting the rows with missing values

This is a simple and commonly used method. Entire rows or columns with missing data are deleted (dropped), at the cost of losing the other values they contain.
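As a minimal sketch of this method, assuming the data sits in a pandas DataFrame, `dropna` can remove either rows or columns. The toy data and column names below are hypothetical and only serve the illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({
    "age": [25.0, np.nan, 31.0, 40.0],
    "city": ["Pune", "Delhi", None, "Mumbai"],
    "salary": [50000, 62000, 58000, 71000],
})

# Drop every row that contains at least one missing value.
rows_dropped = df.dropna(axis=0)

# Drop every column that contains at least one missing value.
cols_dropped = df.dropna(axis=1)

print(rows_dropped)
print(cols_dropped)
```

Dropping rows keeps all columns but fewer observations, while dropping columns keeps all observations but fewer features, so which one fits depends on where the gaps are concentrated.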

2) Filling with mean or median or mode value

When only a few values are missing, the mean, median or mode of the column can be calculated and used to replace the missing values.
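A minimal pandas sketch of this idea follows. The DataFrame and its column names are made up for illustration; the mean and median are used for numeric columns and the mode for a categorical one.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({
    "age": [20.0, np.nan, 30.0, 40.0],
    "income": [10.0, 20.0, np.nan, 90.0],
    "city": ["Pune", None, "Pune", "Delhi"],
})

filled = df.copy()

# Numeric column: fill with the mean of the observed values.
filled["age"] = df["age"].fillna(df["age"].mean())

# Numeric column with an outlier: the median is less sensitive to it.
filled["income"] = df["income"].fillna(df["income"].median())

# Categorical column: fill with the most frequent value (the mode).
filled["city"] = df["city"].fillna(df["city"].mode()[0])

print(filled)
```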

3) Replacing with previous (forward fill) or next value (backward fill)

The missing values can be filled from neighbouring values. Forward fill replaces each missing value with the previous non-missing value, and backward fill replaces it with the next non-missing value.
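The sketch below shows both directions on a small, made-up pandas Series. Note that a gap at the very end has no next value, so backward fill leaves it missing (and likewise forward fill for a gap at the start).

```python
import numpy as np
import pandas as pd

# Hypothetical ordered series with gaps.
s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan])

# Forward fill: carry the previous non-missing value forward.
forward = s.ffill()

# Backward fill: pull the next non-missing value backward.
backward = s.bfill()

print(forward.tolist())
print(backward.tolist())
```

These fills make the most sense when the row order is meaningful, for example in time series.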

4) Filling with a single value

In this method all the missing values are replaced by a single chosen value, such as 0 or a placeholder label.
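A minimal sketch with pandas `fillna`, assuming a different constant per column is wanted. The data and the chosen fill values (0 and "Unknown") are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({
    "score": [7.0, np.nan, 9.0],
    "grade": ["A", None, "B"],
})

# Map each column to the constant that should replace its missing values.
filled = df.fillna({"score": 0, "grade": "Unknown"})

print(filled)
```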

5) Interpolation method

Here it is assumed that there is a linear relationship between adjacent values, and the missing values are calculated from the adjacent non-missing values.
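As an illustration, pandas offers `Series.interpolate`, whose linear method places missing points on a straight line between the surrounding observed values. The series below is made up.

```python
import numpy as np
import pandas as pd

# Hypothetical evenly spaced series with two missing points in the middle.
s = pd.Series([10.0, np.nan, np.nan, 40.0])

# Linear interpolation: missing points are spaced evenly between 10 and 40.
interpolated = s.interpolate(method="linear")

print(interpolated.tolist())
```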

6) Handling missing values using KNN Imputer

Here the number of nearest neighbors (and the distance measure) is chosen. For each missing value, the KNN imputer finds the nearest neighbors based on the other features and replaces the missing value with the mean of their values for that feature.
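A small sketch using `KNNImputer` from scikit-learn follows. The array is made up, and `n_neighbors=2` is an arbitrary choice for the example rather than a recommended setting.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical data: the second row is missing its second feature.
X = np.array([
    [1.0, 2.0],
    [2.0, np.nan],
    [3.0, 6.0],
    [8.0, 16.0],
])

# Use the 2 nearest rows, measured on the features that are present.
imputer = KNNImputer(n_neighbors=2)
X_imputed = imputer.fit_transform(X)

# The rows nearest to [2.0, nan] are [1.0, 2.0] and [3.0, 6.0],
# so the gap is filled with the mean of 2.0 and 6.0.
print(X_imputed)
```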

7) Handling missing values using Simple Imputer:

Here SimpleImputer from the scikit-learn library is used. The missing values are replaced by a measure of the variable's central tendency: mean, median or mode.
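The sketch below applies `SimpleImputer` with three of its strategies to a made-up numeric array. In scikit-learn the mode corresponds to `strategy="most_frequent"`.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical data with one gap in each column.
X = np.array([
    [2.0, 10.0],
    [np.nan, 20.0],
    [2.0, np.nan],
    [5.0, 60.0],
])

# Each strategy is computed per column from the observed values only.
X_mean = SimpleImputer(strategy="mean").fit_transform(X)
X_median = SimpleImputer(strategy="median").fit_transform(X)
X_mode = SimpleImputer(strategy="most_frequent").fit_transform(X)

print(X_mean)
print(X_median)
print(X_mode)
```

Unlike a one-off `fillna`, a fitted imputer remembers the statistics it learned, so the same values can later be applied to new data with `transform`.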

8) Handling missing values using machine learning:

Filling missing values using Linear Regression

Step 1: The rows where the target column (here, age) is missing are taken as the test data.

Step 2: Drop the rows with null values and consider the remaining rows as the train data.

Check that no null values remain in the train data.

Step 3: Create x_train and y_train from the train data.

y_train is the age column of the rows with non-null values.

x_train is the dataset except the age column, for the same rows with non-null values.

Step 4: Build the model.

Step 5: Create X_test from the test data by dropping the age column.

Step 6: Apply the model and predict the missing values.

Step 7: Replace the missing values with the predicted values.
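The seven steps can be sketched as follows with pandas and scikit-learn. This is only an illustration under stated assumptions: the toy dataset and the predictor columns `fare` and `siblings` are hypothetical, and the predictors are assumed to contain no missing values themselves.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: "age" has gaps; "fare" and "siblings" are
# made-up, fully observed predictor columns.
df = pd.DataFrame({
    "fare": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
    "siblings": [0, 1, 0, 2, 1, 0],
    "age": [22.0, 35.0, np.nan, 58.0, np.nan, 72.0],
})

# Step 1: rows where age is missing form the test data.
missing_mask = df["age"].isna()
test_data = df[missing_mask]

# Step 2: rows where age is present form the train data.
train_data = df.dropna(subset=["age"])
print(train_data.isna().sum())  # check for remaining nulls

# Step 3: split the train data into predictors and target.
x_train = train_data.drop(columns="age")
y_train = train_data["age"]

# Step 4: build (fit) the model.
model = LinearRegression()
model.fit(x_train, y_train)

# Step 5: predictors of the rows whose age is missing.
X_test = test_data.drop(columns="age")

# Step 6: predict the missing ages.
predicted = model.predict(X_test)

# Step 7: write the predictions back into a copy of the dataset.
df_filled = df.copy()
df_filled.loc[missing_mask, "age"] = predicted

print(df_filled)
```

The toy ages were generated from an exact linear rule so that the result is easy to inspect. On real data the predictions are estimates, and their quality depends on how well the other columns explain the target.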
