The purpose of this paper is to explore the mechanisms of data missingness and evaluate various imputation
techniques used to handle missing data. Missing data is a common issue in data analysis, and its treatment is crucial for
accurate modeling and analysis. This paper assesses prevalent imputation methods, including mean imputation, median
imputation, K-Nearest Neighbor imputation (KNN), Classification and Regression Trees (CART), and Random Forest (RF).
These techniques were chosen for their widespread use and varying levels of complexity and accuracy. Simple methods like
mean and median imputation are computationally efficient but may introduce bias, especially when the missingness is not
random. In contrast, more advanced methods like KNN, CART, and RF offer better handling of complex missingness patterns
by considering relationships among variables. This paper aims to provide guidance for data scientists and analysts in selecting
the most appropriate imputation methods based on their data characteristics and analysis objectives. By understanding the
strengths and weaknesses of each technique, practitioners can improve the quality and reliability of their analyses.
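
To make the two simplest methods concrete, the following is a minimal sketch of mean and median imputation on a single column. It is an illustration under stated assumptions, not the implementation evaluated in this paper: the helper `impute` and the `incomes` values are hypothetical, and missing entries are assumed to be encoded as `None`.

```python
from statistics import mean, median


def impute(values, strategy="mean"):
    """Replace None entries with the mean or median of the observed entries.

    Hypothetical helper for illustration only.
    """
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to impute from")
    if strategy == "mean":
        fill = mean(observed)
    elif strategy == "median":
        fill = median(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")
    # Observed values are kept unchanged; only the gaps are filled.
    return [fill if v is None else v for v in values]


# Made-up column with two missing entries and one extreme value.
incomes = [30, 32, None, 35, 31, None, 200]

mean_filled = impute(incomes, "mean")
median_filled = impute(incomes, "median")

print(mean_filled)
print(median_filled)
```

In this made-up column the single extreme value pulls the mean fill well above the typical observation, while the median fill stays close to it. Both methods ignore every other variable in the data set, which is the limitation that KNN, CART, and RF imputation are meant to address.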