Classification of Human Emotions Based on Javanese Speech Using Convolutional Neural Network and Multilayer Perceptron

Muji Ernawati, Dwiza Riana


Emotions in speech are a basic element of human interaction and play an important role in decision-making, learning, and everyday communication. Speech emotion recognition remains an active research area aimed at developing models with better performance. In this research, we combine data augmentation techniques (Add Noise, Time Stretch, and Pitch Shift) to increase the size of the Javanese Speech Emotion Database (Java-SED). Mel-Frequency Cepstral Coefficients (MFCCs) are used for feature extraction; we then build a Convolutional Neural Network (CNN) model and apply a Multilayer Perceptron (MLP) to classify human emotions from speech. We produced eight experimental models with different combinations of augmentation techniques. The CNN model parameters include 40 input neurons, four hidden layers with varying neuron counts, ReLU activation functions, L2 regularization, dropout, the Adam optimizer, and a ModelCheckpoint callback that saves the best model based on validation loss. In the evaluation, the CNN model achieved the highest performance, with an accuracy of 96.43%, recall of 96.43%, precision of 96.57%, F1-score of 96.48%, and a kappa of 95.71%, when Add Noise, Time Stretch, and Pitch Shift were applied together.
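The augmentation pipeline described above can be sketched as follows. This is a minimal illustration, not the authors' exact code: the Add Noise step is implemented directly with NumPy, while the Time Stretch, Pitch Shift, and 40-coefficient MFCC steps are indicated in comments using the `librosa` API commonly used for such work (the specific noise factor and parameter values here are assumptions for illustration).

```python
import numpy as np

def add_noise(signal, noise_factor=0.005, rng=None):
    """Add Noise augmentation: mix additive white Gaussian noise
    into the waveform at a small, fixed amplitude."""
    rng = np.random.default_rng(rng)
    noise = rng.standard_normal(len(signal))
    return signal + noise_factor * noise

# The remaining augmentations and feature extraction are typically
# done with librosa (parameter values illustrative only):
#   y_stretch = librosa.effects.time_stretch(y, rate=0.9)
#   y_shift   = librosa.effects.pitch_shift(y, sr=sr, n_steps=2)
#   mfcc      = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)

# Demo on a synthetic 440 Hz tone instead of a Java-SED recording.
sr = 16000
t = np.linspace(0, 1, sr, endpoint=False)
clean = 0.5 * np.sin(2 * np.pi * 440 * t)
noisy = add_noise(clean, noise_factor=0.005, rng=0)
print(noisy.shape)  # (16000,) -- augmentation preserves length
```

Averaging each MFCC over time yields a 40-dimensional vector per utterance, matching the 40 input neurons of the classifier described in the abstract.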


Speech Emotion Recognition; Convolutional Neural Network; Multilayer Perceptron; Data Augmentation; Mel Frequency Cepstral Coefficients







Indonesian Journal of Electrical Engineering and Informatics (IJEEI)
ISSN 2089-3272


This work is licensed under a Creative Commons Attribution 4.0 International License.
