Checking Data¶

Checking the quality of data and cleaning it¶

Before extracting information from data, we need to check its quality along several dimensions:

  1. Data type and format: remove values that violate the expected type or format. For example, a person's name cannot be a number.

  2. Duplicate data: duplicated records can bias a Machine Learning model.

  3. Missing data: one or more fields may be empty, and the missing information might not be recoverable.

  4. Balance in data: an uneven distribution of observations can lead to biased or unreliable predictions. In fraud detection, for example, if only 5% of records are fraudulent, a model that always predicts "not fraud" reaches 95% accuracy. That sounds good, but the model is unreliable.
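The four checks above can be sketched with pandas. The toy DataFrame below is hypothetical; only the checks themselves matter:

```python
import pandas as pd

# Hypothetical data exhibiting the four issues above
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Bob", None, "123"],
    "age":  [34.0, 29.0, 29.0, 41.0, None],
})

# 1. Type/format: flag "names" that are purely numeric
bad_names = df["name"].str.fullmatch(r"\d+").fillna(False)

# 2. Duplicates: count fully duplicated rows
n_duplicates = df.duplicated().sum()

# 3. Missing data: count empty fields per column
missing = df.isna().sum()

# 4. Balance: check the class distribution of a label column
labels = pd.Series([0] * 95 + [1] * 5)   # e.g. 5% fraud
balance = labels.value_counts(normalize=True)
```

Each check returns a count or distribution that you can inspect before deciding how to clean.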

When we prepare and process the data, we either remove the problematic records or fill them with substitute values (e.g., a mean value). Either way, the altered data will affect the final prediction, and we might need to investigate how large that impact is.

There are other issues to watch for when working with unknown data:¶

  1. A subset of the data, or some data fields, may have been labeled differently.

  2. Repetitive errors from which the original information can still be recovered.

  3. Damage that occurs when moving data: the original data is still good, but the working copy was damaged, lost, or corrupted.

  4. Not enough data to generalize the problem; conversely, too much data can dramatically slow down training.

  5. Noise and outliers: if you already know which records are noise or rare outliers, it is safe to remove them.

  6. Any abnormal or unusual data you find should be reviewed carefully.
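For item 5, one common way to flag rare outliers is the interquartile-range (IQR) rule. This is a sketch of that one technique, not the only option, and the readings are made up:

```python
import pandas as pd

# Hypothetical readings with one obvious outlier (99.0)
s = pd.Series([10.1, 9.8, 10.3, 9.9, 10.0, 99.0])

# IQR rule: keep points inside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
mask = s.between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
cleaned = s[mask]
```

The outlier 99.0 falls far above the upper fence and is dropped, while the five ordinary readings survive.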

Cleaning data¶

Removing the missing records¶

We can delete rows or columns where

  1. Any value is missing
  2. All (or most) values are missing
  3. A specific condition for removal is met

By reducing the size of the data, we can train the model faster. However, the model can become less accurate, and when the number of deleted records is too large, the prediction becomes unreliable.
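The three deletion strategies map directly onto pandas' `dropna` and boolean filtering; the DataFrame here is hypothetical:

```python
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, None, 3.0, None],
    "b": [None, None, 30.0, 40.0],
})

any_missing = df.dropna(how="any")   # 1. drop rows with ANY missing value
all_missing = df.dropna(how="all")   # 2. drop rows where ALL values are missing
conditional = df[df["b"].notna()]    # 3. drop rows under a condition (here: "b" is missing)
```

`how="any"` is the most aggressive (only one row survives here), while `how="all"` removes only entirely empty rows.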

Filling data¶

Besides removing the whole record where the issue occurs, we can fill or replace the missing or bad data with:

Numeric field¶

  1. Mean or median values (they have less impact on the overall distribution).
  2. Mode (or most frequent) values.
  3. Zero (0).
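A minimal pandas sketch of these numeric strategies; note how the made-up outlier 100.0 pulls the mean far from the median:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 2.0, None, 100.0])

filled_mean   = s.fillna(s.mean())     # mean: sensitive to the outlier 100.0
filled_median = s.fillna(s.median())   # median: robust to outliers
filled_mode   = s.fillna(s.mode()[0])  # mode: most frequent value (2.0)
filled_zero   = s.fillna(0)            # constant zero
```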

Categorical field¶

  1. The most frequent item.
  2. A new category, when a large portion of the values is missing.
  3. N/A.
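The same idea for a categorical field, sketched with pandas; the `"Unknown"` category name is an arbitrary choice:

```python
import pandas as pd

s = pd.Series(["red", "blue", "red", None, None])

filled_mode    = s.fillna(s.mode()[0])  # 1. most frequent item ("red")
filled_unknown = s.fillna("Unknown")    # 2. new category for the missing values
filled_na      = s.fillna("N/A")        # 3. explicit N/A marker
```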

Although filling has little impact overall, data leakage might occur because the model is learning from external information (i.e., artificial data).

Prediction¶

  1. Based on a pattern: use other available fields to guess a suitable value.
  2. Interpolation and extrapolation: for example, an event time between two neighboring events, or a value following a known distribution or function.
  3. Constraints from the nature of the data type: for example, a person's age is hardly over 100 years, and a count of people must be an integer.
  4. Python libraries: DataWig or Impyute can handle missing values well. Both packages can be installed easily via pip.
  5. Randomly select a value from the available values.
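Items 2 and 3 in the list above can be sketched with pandas' built-in `interpolate` and `clip`; the data here is made up:

```python
import pandas as pd

# 2. Interpolation: fill a gap between two neighboring observations
events = pd.Series([10.0, None, 30.0])
events_filled = events.interpolate()   # gap filled linearly -> 20.0

# 3. Constraint from the data's nature: ages above 100 are suspect
ages = pd.Series([25, 34, 250, 41])
ages_capped = ages.clip(upper=100)     # enforce age <= 100
```

Clipping enforces the constraint but replaces the bad value rather than recovering the true one, so flagging capped rows for review is often wise.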

Multiple Imputation¶

Although filled values may look accurate, they are not actual observations, and every artificial filling method has its own bias. By combining many imputations, we can reduce that bias and obtain a better data analysis.
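A minimal hand-rolled sketch of the idea: fill the same gap many times by random sampling from the observed values, then pool the imputed datasets by averaging. Real tools (e.g. scikit-learn's `IterativeImputer`) do this far more rigorously; this is only an illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
s = pd.Series([1.0, 2.0, None, 4.0, 5.0])

# Draw each imputation from the observed values, then pool by averaging
observed = s.dropna().to_numpy()
imputations = []
for _ in range(100):
    filled = s.copy()
    filled[s.isna()] = rng.choice(observed, size=s.isna().sum())
    imputations.append(filled)

pooled = pd.concat(imputations, axis=1).mean(axis=1)
```

No single random fill is trustworthy, but the pooled estimate converges toward the center of the observed distribution, averaging out the bias of any one draw.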

Conclusion¶

Although the model can learn better with more accurate information, the filling process takes more time and effort to predict the true value carefully, and it introduces bias if the guess is totally wrong.

That's why we want to use external tools to speed up the process while still producing high accuracy.

Tools for imputation¶

On GitHub, there are several tools which support the process of imputing data. Some of the most popular are:

  1. DataWig (on GitHub)

  2. Impyute (on GitHub)

Importing all data¶

Check the first 5 rows and the overall info of the data.
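A typical first look with pandas might be the following; the inline CSV content is a stand-in for a real file path:

```python
import io
import pandas as pd

# Hypothetical CSV content; in practice pass a file path to read_csv
csv = io.StringIO("name,age\nAlice,34\nBob,29\nCara,41")
df = pd.read_csv(csv)

print(df.head())  # first 5 rows
df.info()         # column dtypes and non-null counts
```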

To be continued¶