Malware Classification Using Machine Learning and Dimension Reduction Techniques on PE File Data

Arif Harsa Pradipta, Lili Ayu Wulandhari

Abstract


Digital transformation has improved efficiency, transparency, and accessibility, but it has also led to a notable increase in cyber incidents, including malware attacks. According to the 2022 annual report of the Honeynet Project by the National Cyber and Encryption Agency, Indonesia experienced over 370 million cyber attacks, 800,000 of which were malware attacks. The increasing complexity of Portable Executable (PE) files further complicates accurate classification by machine learning models. This research aims to develop an effective malware detection approach using three machine learning classifiers (Random Forest, XGBoost, and AdaBoost) on a raw feature dataset and an integrated feature dataset. Two dimension reduction techniques, Principal Component Analysis and Linear Discriminant Analysis, were applied to improve classification efficiency. The results show that Random Forest and XGBoost consistently outperformed AdaBoost, particularly in classifying ransomware, where they achieved recall values ranging from 0.72 to 0.85 and F1-scores from 0.74 to 0.81. For the trojan class, both Random Forest and XGBoost achieved recall values ranging from 0.96 to 0.97, with corresponding F1-scores between 0.95 and 0.97. Both classifiers maintained high precision, recall, and F1-scores across all malware classes, even with reduced feature sets.
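The per-class figures quoted above (precision, recall, F1-score) follow from counts of true positives, false positives, and false negatives for each class. The following is a minimal sketch of that calculation in plain Python; the function name `per_class_metrics` and the toy labels are hypothetical and chosen for illustration only, and they are not the authors' code or data.

```python
from collections import Counter


def per_class_metrics(y_true, y_pred):
    """Return {label: (precision, recall, f1)} for every label seen."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1   # correctly predicted this class
        else:
            fp[pred] += 1    # predicted class was wrong
            fn[truth] += 1   # true class was missed

    metrics = {}
    for label in sorted(set(y_true) | set(y_pred)):
        predicted = tp[label] + fp[label]   # samples assigned to this class
        actual = tp[label] + fn[label]      # samples truly in this class
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[label] = (precision, recall, f1)
    return metrics


# Toy labels, invented for illustration (not results from the paper).
y_true = ["trojan", "trojan", "trojan", "ransomware", "ransomware", "benign"]
y_pred = ["trojan", "trojan", "ransomware", "ransomware", "trojan", "benign"]

for label, (p, r, f1) in per_class_metrics(y_true, y_pred).items():
    print(f"{label:<11} precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Reporting these values per class, rather than a single overall accuracy, is what makes it visible when a minority class such as ransomware is classified less reliably than a majority class such as trojan.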

Keywords


Cyber Security; Machine Learning; Data Dimension Reduction; Malware Classification; Portable Executable Format


Indonesian Journal of Electrical Engineering and Informatics (IJEEI)
ISSN 2089-3272

This work is licensed under a Creative Commons Attribution 4.0 International License.
