Handling Imbalanced Data through Re-sampling: Systematic Review

Razan Eltayeb, Abdelrahman Elsharif Karrar, Waleed Ibrahim Osman, Moez Mutasim

Abstract


Class imbalance is an important issue that can affect the validity and reliability of a model's results. One common way to address it is to re-sample the data. Re-sampling balances the class distribution of a dataset by over-sampling the minority class, under-sampling the majority class, or both. Over-sampling adds minority class examples to the dataset, for instance by duplicating existing ones, until the class distribution is balanced. Under-sampling instead removes some of the majority class examples to reach the same balance. Combining the two is also common and is usually called hybrid sampling. Re-sampling can change a model's performance, so it is essential to evaluate the model with several evaluation metrics and to consider alternative techniques such as cost-sensitive learning and anomaly detection. Where more data can be collected, increasing the sample size generally helps model performance as well. In this systematic review, we provide an overview of existing methods for re-sampling imbalanced data. We focus on methods proposed in the literature and assess their effectiveness by examining the reported experimental results. The goal is to give practitioners a comprehensive understanding of the available re-sampling methods, along with their strengths and weaknesses, so that they can make informed decisions when dealing with imbalanced data.
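As a minimal illustration of the two basic strategies described above, the following Python sketch implements random over-sampling (duplicating minority class examples) and random under-sampling (discarding majority class examples) using only the standard library. The function names `random_oversample` and `random_undersample`, the `seed` parameter, and the toy dataset are assumptions made for this example; they are not taken from the reviewed literature, and methods such as SMOTE generate synthetic examples rather than plain copies.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen rows of each smaller class until
    every class has as many rows as the largest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        rows = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - n):
            X_out.append(rng.choice(rows))  # copy of an existing example
            y_out.append(label)
    return X_out, y_out


def random_undersample(X, y, seed=0):
    """Keep a random subset of each class so that every class has
    as many rows as the smallest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = min(counts.values())
    X_out, y_out = [], []
    for label in counts:
        rows = [x for x, lab in zip(X, y) if lab == label]
        for x in rng.sample(rows, target):  # sampling without replacement
            X_out.append(x)
            y_out.append(label)
    return X_out, y_out


# Hypothetical toy dataset: 8 majority (label 0) and 2 minority (label 1) rows.
X = [[i] for i in range(10)]
y = [0] * 8 + [1] * 2

X_over, y_over = random_oversample(X, y)
X_under, y_under = random_undersample(X, y)

print(Counter(y_over))   # both classes now have 8 rows
print(Counter(y_under))  # both classes now have 2 rows
```

In practice, re-sampling of this kind is normally applied to the training split only, so that the evaluation data keeps the original class distribution.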


Keywords


Data Mining, Imbalanced Data, Re-sampling, Over-sampling, Under-sampling, Hybrid Sampling, SMOTE



Indonesian Journal of Electrical Engineering and Informatics (IJEEI)
ISSN 2089-3272

This work is licensed under a Creative Commons Attribution 4.0 International License.
