Classification: Missing Data Imputation
Multivariate data sets frequently have missing observations scattered throughout. Many machine learning algorithms assume that there is no particular significance in the fact that a particular observation has a missing attribute value. A common approach to coping with these missing values is to replace each missing value with some plausible value; the resulting completed data set is then analyzed using standard methods.
The methods I will present perform in a similar manner regardless of the amount of missing data and achieve the highest mean percentage of correctly classified observations.
The procedure that replaces missing values with some substitute value is known as imputation. Some algorithms treat a missing attribute value as a value in its own right, whilst other classifiers, such as Naïve Bayes, simply ignore the missing data (the likelihood is calculated on the observed attributes only).
Understanding Missing Data Mechanisms:
1. Missing Completely At Random (MCAR): if the missingness does not depend on the values of the data, either missing or observed, the data are MCAR (for example, a sensor that drops readings at random, independently of what is being measured).
2. Missing At Random (MAR): if the missingness depends only on the data that are observed, but not on the components that are missing, the data are MAR (for example, one age group skips a question more often, but within each age group the missingness is unrelated to the answer itself).
3. Not Missing At Random (NMAR): if the distribution of the missingness depends on the missing values themselves, the mechanism is NMAR (for example, high earners declining to report their income).
Knowledge of the mechanism that led to the values being missing is important in choosing an appropriate analysis for the data. It is therefore important to consider how a classifier handles missing data, to avoid bias being introduced into the knowledge induced from that classifier.
Strategies for Handling Missing Data in Classification:
- Complete Case Analysis (CCA)
Complete case analysis is the elimination approach, in which observations that have any missing attribute values are deleted from the data set.
It can be satisfactory when the amount of missing data is small.
With larger amounts of missing data, however, it is possible to lose considerable sample size.
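In pandas, complete case analysis is simply dropping the incomplete rows; a minimal sketch with illustrative data:

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, 3.0, 4.0],
                   'b': [5.0, 6.0, np.nan, 8.0]})

# complete case analysis: drop every observation with any missing attribute
complete_df = df.dropna()
print(f'{len(df)} observations -> {len(complete_df)} complete cases')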
The critical concern with this strategy is that it can lead to biased estimates, as it requires the assumption that the complete cases are a random sub-sample of the original observations.
- Available Case Analysis (ACA)
As this procedure uses all observations that have values for a particular attribute, there is no loss of information as all cases are used.
However, the sample base changes from attribute to attribute depending on the pattern of missing data, and hence any statistics calculated can be based on different numbers of observations.
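pandas computes covariances in exactly this available-case fashion, using the pairwise-complete observations for each pair of attributes; a minimal sketch with illustrative data:

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, 2.0, np.nan, 4.0],
                   'b': [2.0, np.nan, 5.0, 3.0],
                   'c': [1.0, 1.5, 2.0, np.nan]})

# each pairwise covariance uses only the rows where both attributes are observed,
# so each entry of the matrix can be based on a different number of observations
print(df.cov())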
A disadvantage is that this procedure can lead to covariance and correlation matrices that are not positive definite. This approach is used, for example, by Bayesian classifiers.
- Weighting Procedures (WP)
Weighting procedures are another approach to dealing with missing data, frequently used in the analysis of survey data. In survey data, each sampled unit is weighted by its design weight, which is inversely proportional to its probability of selection (a unit sampled with probability π receives design weight 1/π).
Weighting procedures for non-response modify these weights in an attempt to adjust for non-response as if it were part of the sample design.
- Imputation Procedures (IP)
Imputation procedures, in which the missing values are replaced with some substitute value, are another commonly used strategy for dealing with missing values.
These procedures result in a hypothetical 'complete' data set that causes no problems for the analysis.
Many machine learning algorithms are designed to use either a complete case analysis or an imputation procedure.
Dealing with Missing Values
1. Mean, Median and Mode Imputation
Imputing the missing value with the mean, median or mode of the attribute is a commonly used approach.
These types of imputation ignore any relationships between the variables.
It is well known that mean imputation underestimates the variance-covariance matrix of the data.
With mean imputation, the distribution of the 'new values' is an incorrect representation of the population values, as the shape of the distribution is distorted by adding values at the mean. Both mean and median imputation can only be used on continuous attributes.
Mode imputation is usually used for categorical attributes, whilst mean or median imputation is used for continuous ones.
Now let's look at how much missingness there is:
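A quick way to inspect this with pandas (a minimal sketch; data.csv is a placeholder for your data set):

import pandas as pd

df = pd.read_csv('data.csv')  # placeholder file name

# count missing values per attribute
print(df.isnull().sum())
# attribute types and non-null counts
df.info()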
Now we can choose the mean or median for quantitative attributes, and the mode for qualitative attributes. But we see in the data info that all the missingness is in numerical attributes, so we choose mean or median:
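One way to do this is with sklearn's SimpleImputer (a sketch, continuing from the DataFrame above; strategy='most_frequent' would give mode imputation for categorical attributes instead):

import numpy as np
from sklearn.impute import SimpleImputer

# impute the numerical columns with the column median (or strategy='mean')
num_cols = df.select_dtypes(include=np.number).columns
imputer = SimpleImputer(strategy='median')
df[num_cols] = imputer.fit_transform(df[num_cols])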
There are many techniques for doing this on a data set (see the next section, where I use sklearn's KNN imputer).
2. Hot Deck Imputation
Hot deck imputation involves replacing the missing values using values from one or more similar instances that are in the same classification group.
There are various forms of hot deck imputation commonly used:
- Random hot deck imputation involves replacing the missing value with a randomly selected value from the pool of potential donor values.
- Deterministic hot deck imputation involves replacing the missing values with those from a single donor, often the nearest neighbour, determined using some distance measure.
Note that Hot deck imputation does not rely on model fitting for the missing value that is to be imputed and thus is potentially less sensitive to model misspecification than an imputation method based on a parametric model.
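The core of a deterministic hot deck step looks like the fragment below (adapted from the GitHub repository linked after it; euclidean is assumed to hold (distance, donor index) pairs, kHD is the number of donors, and imported is the data matrix being filled):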
from collections import Counter

euclidean.sort(key=lambda l: l[0])  # order candidate donors by distance, nearest first
lst = [imported[euclidean[r][1]][j] for r in range(kHD)]  # attribute j of the kHD nearest donors
imported[i][j] = Counter(lst).most_common(1)[0][0]  # impute the most frequent donor value
A nice example implementation of this technique can be found in this GitHub repository: https://github.com/tarikbir/missing_data_imputation.
3. k-th Nearest Neighbour Imputation
This approach can predict both categorical and continuous attributes and can easily handle observations that have multiple missing values. It takes into account the correlation structure of the data, and requires specifying the number of neighbours, k, and the distance function to be used.
The algorithm searches through the entire data set looking for the most similar instances.
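In sklearn this is available as KNNImputer; a minimal sketch on illustrative data:

import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0, np.nan],
              [3.0, 4.0, 3.0],
              [np.nan, 6.0, 5.0],
              [8.0, 8.0, 7.0]])

# each missing entry is imputed from the k nearest rows,
# using a euclidean distance computed over the observed values
imputer = KNNImputer(n_neighbors=2)
X_imputed = imputer.fit_transform(X)
print(X_imputed)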
Here we can see that rows 15 and 20 have changed and there is no missingness left. Note that attribute 7 is non-numerical, so to use the KNN imputer we would need some technique to convert it to a numerical representation (such as OneHotEncoder); for this explanation I simply left this attribute out when running the KNN imputer and added it back afterwards.
4. Iterative Model-Based Imputation
This technique for coping with missing values is iterative robust model-based imputation (IRMI), which uses standard and robust methods.
This algorithm has the advantage that it can cope with mixed data. In the first step of the algorithm, the missing values are initialized using either mean or KNN imputation.
In the sklearn framework we can see a similar iterative approach and an example: https://scikit-learn.org/stable/modules/generated/sklearn.impute.IterativeImputer.html.
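A minimal sketch of sklearn's IterativeImputer (note that sklearn implements a chained-equations scheme in the spirit of MICE rather than IRMI exactly; the data are illustrative):

import numpy as np
from sklearn.experimental import enable_iterative_imputer  # this import enables IterativeImputer
from sklearn.impute import IterativeImputer

X = np.array([[1.0, 2.0],
              [3.0, np.nan],
              [np.nan, 6.0],
              [8.0, 8.0]])

# each feature with missing values is regressed on the other features,
# and the predict-and-update cycle is repeated for up to max_iter rounds
imputer = IterativeImputer(max_iter=10, random_state=0)
X_imputed = imputer.fit_transform(X)
print(X_imputed)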
5. Factorial Analysis for Mixed Data Imputation
This algorithm imputes the missing values using an iterative FAMD algorithm based on the EM algorithm, or a regularized FAMD algorithm when the regularized method is chosen.
An implementation (the imputeFAMD function from the R package missMDA) and usage examples can be found here: https://rdrr.io/cran/missMDA/man/imputeFAMD.html.
6. Random Forest Imputation
This approach uses an iterative imputation scheme: a random forest is trained on the observed values in the first stage, the missing values are predicted, and the algorithm then proceeds iteratively.
The algorithm begins by making an initial guess for the missing values in the data matrix; this completed matrix is the initial imputed data matrix.
The guesses for the missing values could be obtained using mean imputation or some other imputation method.
In the first stage, a random forest is trained on the observed values. The missing values are then predicted using the random forest that was trained on the observed values, and the imputed matrix is updated. This procedure is repeated until the difference between the updated imputed data matrix and the previous imputed data matrix increases for the first time for both the categorical and the continuous types of variables.
We can see how to use Random Forest imputation here: https://towardsdatascience.com/how-to-use-python-and-missforest-algorithm-to-impute-missing-data-ed45eb47cb9a.
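This missForest-style scheme can also be sketched in sklearn by plugging a random forest into IterativeImputer (my own assumption of a reasonable setup; the linked article uses the dedicated missForest package instead):

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.experimental import enable_iterative_imputer  # this import enables IterativeImputer
from sklearn.impute import IterativeImputer

X = np.array([[7.0, 2.0, 3.0],
              [4.0, np.nan, 6.0],
              [10.0, 5.0, 9.0],
              [np.nan, 8.0, 1.0]])

# iterative imputation in which each feature is predicted by a random forest
rf_imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=100, random_state=0),
    max_iter=10,
    random_state=0,
)
X_imputed = rf_imputer.fit_transform(X)
print(X_imputed)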
Summary
The purpose of this article is mainly my own progress in the field of data science: I try to write the content as short as possible, in a way that I see fit, so that I can come back and refresh my personal learning. If you liked it, or want to add to the subject or comment, you are really welcome; I am here to learn, and maybe even to teach others. Thanks.
References
- https://github.com/tarikbir/missing_data_imputation
- Francesco Palumbo, Angela Montanari, Maurizio Vichi (Eds.), Data Science: Innovative Developments in Data Analysis and Clustering.
- https://towardsdatascience.com/how-to-use-python-and-missforest-algorithm-to-impute-missing-data-ed45eb47cb9a
- https://rdrr.io/cran/missMDA/man/imputeFAMD.html
- https://www.youtube.com/watch?v=ttBs_wfw_6U
- https://scikit-learn.org/stable/modules/generated/sklearn.impute.IterativeImputer.html