Partition-Based Technique to Enhance Missing Data Prediction

Mohammad Mahdi Barati Jozan, Hamed Tabesh

Abstract


Handling missing data is a critical preprocessing step in data mining, and it significantly affects output accuracy during both model development and model use. This study introduces a novel approach to predicting missing values that partitions the data into disjoint subsets according to a partitioning measure. The rationale is that partitioning removes unrelated data from each subset, thereby improving the accuracy of missing value prediction within that subset. The partitioning measure was determined through a combination of expert panel insights and statistical tests (including the Chi-square test and Cramér's V coefficient), using operational data from the Mashhad Fire and Safe Services Organization. A model was constructed for each partition, records with missing values were assigned to partitions in the same way, and each record's missing values were predicted with the model of its partition. In 44% of cases, models built on partitioned data outperformed those built on the entire dataset. The evaluation indicates that the method can predict missing values with higher accuracy. Notably, the approach is independent of the method used for missing value prediction, so it can be added to existing methods as an extra step to improve prediction accuracy.
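The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the field names (`region`, `incident_type`) are hypothetical, and the per-partition "model" is a simple most-frequent-value predictor standing in for whatever prediction method is actually used. `cramers_v` computes the standard Cramér's V from the Chi-square statistic of a contingency table, which is one way a candidate partitioning attribute could be screened for association with the attribute to be predicted.

```python
from collections import Counter, defaultdict
from math import sqrt


def cramers_v(xs, ys):
    """Cramér's V between two categorical sequences of equal length."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    row = Counter(xs)
    col = Counter(ys)
    k = min(len(row), len(col)) - 1
    if n == 0 or k == 0:
        return 0.0
    # Pearson Chi-square statistic over every cell of the contingency table.
    chi2 = 0.0
    for r, r_count in row.items():
        for c, c_count in col.items():
            expected = r_count * c_count / n
            observed = joint.get((r, c), 0)
            chi2 += (observed - expected) ** 2 / expected
    return sqrt(chi2 / (n * k))


def fit_mode_models(rows, partition_key, target):
    """Fit one stand-in model (most frequent target value) per partition.

    Also returns a model fitted on all rows, used as a fallback for
    partitions that have no observed target values.
    """
    by_partition = defaultdict(Counter)
    overall = Counter()
    for row in rows:
        value = row.get(target)
        if value is None:
            continue  # rows with a missing target do not contribute to fitting
        by_partition[row[partition_key]][value] += 1
        overall[value] += 1
    models = {p: c.most_common(1)[0][0] for p, c in by_partition.items()}
    fallback = overall.most_common(1)[0][0] if overall else None
    return models, fallback


def impute(rows, partition_key, target):
    """Return copies of rows with missing targets filled by the partition's model."""
    models, fallback = fit_mode_models(rows, partition_key, target)
    filled = []
    for row in rows:
        new_row = dict(row)
        if new_row.get(target) is None:
            new_row[target] = models.get(new_row[partition_key], fallback)
        filled.append(new_row)
    return filled


if __name__ == "__main__":
    # Hypothetical records; None marks a missing value.
    records = [
        {"region": "A", "incident_type": "fire"},
        {"region": "A", "incident_type": "fire"},
        {"region": "A", "incident_type": None},
        {"region": "B", "incident_type": "rescue"},
        {"region": "B", "incident_type": "rescue"},
        {"region": "B", "incident_type": "rescue"},
        {"region": "B", "incident_type": None},
    ]
    observed = [r for r in records if r["incident_type"] is not None]
    v = cramers_v([r["region"] for r in observed],
                  [r["incident_type"] for r in observed])
    print("Cramér's V:", v)
    for r in impute(records, "region", "incident_type"):
        print(r)
```

In this toy data, a single model fitted on all rows would fill both gaps with the globally most frequent value, whereas the per-partition models fill each gap with the value typical of its own partition. Because the partitioning step is independent of the predictor, `fit_mode_models` could be swapped for any other prediction method.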

 


Keywords


Preprocessing, Missing value imputation, Text mining, Expert panel



Indonesian Journal of Electrical Engineering and Informatics (IJEEI)
ISSN 2089-3272

This work is licensed under a Creative Commons Attribution 4.0 International License.
