BERT-BiLSTM model for hierarchical Arabic text classification

Benamar Hamzaoui, Djelloul Bouchiha, Abdelghani Bouziane, Noureddine Doumi

Abstract


Text classification is a fundamental task in natural language processing (NLP) that assigns text documents to predefined categories or labels. Artificial intelligence (AI) techniques, particularly machine learning and deep learning, have significantly improved text classification. The task remains harder for Arabic, however, because the language lacks comprehensive resources in this domain. Hierarchical text classification, which organizes categories into a tree-like structure, adds further complexity because categories can resemble one another and are connected across different levels of the hierarchy. To address this challenge, we propose a deep learning model based on BERT (Bidirectional Encoder Representations from Transformers) and BiLSTM (Bidirectional Long Short-Term Memory). Experimental evaluations show that our approach is effective compared with existing methods. Our study contributes to advancing text classification methodologies, particularly for Arabic language processing.
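To make the tree-like category structure concrete, the following is a minimal sketch of how hierarchical labels might be represented and compared. It is not the proposed model: the category names, the `PARENT` mapping, and the helper functions `label_path` and `hierarchical_agreement` are illustrative assumptions that do not come from the paper or its dataset.

```python
# Hypothetical category tree: child -> parent (None marks a top-level category).
# These category names are illustrative only.
PARENT = {
    "Science": None,
    "Sports": None,
    "Physics": "Science",
    "Chemistry": "Science",
    "Football": "Sports",
}


def label_path(label, parent=PARENT):
    """Return the labels from the top-level category down to `label`."""
    path = []
    while label is not None:
        path.append(label)
        label = parent[label]
    return path[::-1]


def hierarchical_agreement(true_label, predicted_label, parent=PARENT):
    """Count how many levels, starting from the top, two label paths share."""
    shared = 0
    true_path = label_path(true_label, parent)
    predicted_path = label_path(predicted_label, parent)
    for t, p in zip(true_path, predicted_path):
        if t != p:
            break
        shared += 1
    return shared
```

Under these assumptions, confusing "Physics" with "Chemistry" still agrees at the top level, whereas confusing "Physics" with "Football" agrees at no level. This is one simple way to see why errors between sibling categories differ from errors across branches.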

Keywords


Text classification; Hierarchical classification; Deep learning; BERT (Bidirectional Encoder Representations from Transformers); BiLSTM (Bidirectional Long Short-Term Memory); Arabic language

Indonesian Journal of Electrical Engineering and Informatics (IJEEI)
ISSN 2089-3272

This work is licensed under a Creative Commons Attribution 4.0 International License.
