Comparative Analysis of Spectrogram and MFCC Representations for Speech Emotion Recognition Using Machine Learning

Authors

  • Rexcharles Enyinna Donatus, Department of Computer Science and Engineering, Mewar University, Rajasthan, India & Africa Centre of Excellence on Technology Enhanced Learning (ACETEL), National Open University of Nigeria, Nigeria
  • B. L. Pal, Department of Computer Science and Engineering, Mewar University, Rajasthan, India
  • Ifeyinwa Happiness Donatus, Department of Computer Science, Kaduna State University, Kaduna, Nigeria
  • Ubadike Osichinaka Chiedu, Africa Centre of Excellence on Technology Enhanced Learning (ACETEL), National Open University of Nigeria, Nigeria & Department of Aerospace Engineering, Air Force Institute of Technology, Kaduna, Nigeria

DOI:

https://doi.org/10.70112/ajcst-2024.13.2.4284

Keywords:

Emotion Recognition, Human-Computer Interaction, Mel-Frequency Cepstral Coefficients (MFCC), Support Vector Machine (SVM), Random Forest (RF)

Abstract

Emotion recognition is a key area of research within human-computer interaction, addressing the growing need for systems that can respond to human emotional states. While advancements have been made, challenges remain, particularly in selecting appropriate datasets, identifying effective audio features, and optimizing classification models. This study examines how two audio feature representations, Mel-Frequency Cepstral Coefficients (MFCC) and spectrograms, influence the accuracy of emotion classification. By extracting these features from the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) and applying Random Forest (RF) and Support Vector Machine (SVM) classifiers, the study compares the performance of each feature-classifier pairing. With MFCC features, both the RF and SVM classifiers achieved 50% accuracy, while spectrogram features yielded 45% accuracy with RF and 54% with SVM. These findings suggest that simple models, when paired with appropriate features, can offer promising performance, contributing to more responsive and adaptive human-computer interaction applications.
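To make the compared pipelines concrete, the sketch below pairs each feature representation with both classifiers, using librosa for feature extraction and scikit-learn for the RF and SVM models. This is a minimal illustration, not the authors' exact setup: the dataset layout (ravdess/Actor_*/*.wav), the feature dimensionalities (n_mfcc, n_mels), the time-averaging of frames into one fixed-length vector per utterance, and the classifier hyperparameters are all assumptions.

```python
# Minimal sketch: MFCC vs. spectrogram features with RF and SVM on RAVDESS.
# Paths, feature sizes, and hyperparameters are illustrative assumptions.
import glob
import os

import librosa
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC


def mfcc_features(path, n_mfcc=40):
    """Average the MFCC matrix over time into one vector per utterance."""
    y, sr = librosa.load(path, sr=None)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).mean(axis=1)


def spectrogram_features(path, n_mels=64):
    """Average a log-mel spectrogram over time into one vector per utterance."""
    y, sr = librosa.load(path, sr=None)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel).mean(axis=1)


# RAVDESS encodes the emotion as the third field of each filename,
# e.g. 03-01-05-01-02-01-12.wav -> emotion code "05".
files = sorted(glob.glob(os.path.join("ravdess", "Actor_*", "*.wav")))
labels = [os.path.basename(f).split("-")[2] for f in files]

for name, extract in [("MFCC", mfcc_features), ("Spectrogram", spectrogram_features)]:
    X = np.array([extract(f) for f in files])
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, labels, test_size=0.2, stratify=labels, random_state=42)
    for clf in (RandomForestClassifier(n_estimators=200, random_state=42),
                SVC(kernel="rbf", C=1.0)):
        acc = clf.fit(X_tr, y_tr).score(X_te, y_te)
        print(f"{name} + {type(clf).__name__}: {acc:.2%}")
```

Averaging frames into a fixed-length vector is one common way to feed variable-length audio to RF and SVM classifiers; the paper's exact aggregation and train/test protocol may differ.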

References

M. A. H. Akhand, S. Roy, N. Siddique, M. A. S. Kamal, and T. Shimamura, “Facial emotion recognition using transfer learning in the deep CNN,” Electronics, vol. 10, no. 9, p. 1036, 2021, doi: 10.3390/electronics10091036.

J. de Lope and M. Graña, “An ongoing review of speech emotion recognition,” Neurocomputing, vol. 528, pp. 1-11, 2023, doi: 10.1016/j.neucom.2023.01.002.

H. S. Kumbhar and S. U. Bhandari, “Speech emotion recognition using MFCC features and LSTM network,” in Proc. 2019 5th Int. Conf. Comput. Commun. Control Autom. ICCUBEA 2019, vol. 1, pp. 1-3, 2019, doi: 10.1109/ICCUBEA47591.2019.9129067.

R. E. Donatus, I. H. Donatus, and U. O. Chiedu, “Exploring the impact of convolutional neural networks on facial emotion detection and recognition,” Asian Journal of Electrical Sciences, vol. 13, no. 1, pp. 35-45, 2024.

S. Mp and S. A. Hariprasad, “Facial emotion recognition using a modified deep convolutional neural network based on the concatenation of XCEPTION and RESNET50 V2,” Electronics, vol. 10, no. 6, pp. 94-105, 2023.

D. Shin, D. Shin, and D. Shin, “Development of emotion recognition interface using complex EEG/ECG bio-signal for interactive contents,” Multimed. Tools Appl., vol. 76, no. 9, pp. 11449-11470, 2017, doi: 10.1007/s11042-016-4203-7.

Q. Wang, M. Wang, Y. Yang, and X. Zhang, “Multi-modal emotion recognition using EEG and speech signals,” Comput. Biol. Med., vol. 149, p. 105907, 2022, doi: 10.1016/j.compbiomed.2022.105907.

Z. Yang, Z. Li, S. Zhou, L. Zhang, and S. Serikawa, “Speech emotion recognition based on multi-feature speed rate and LSTM,” Neurocomputing, vol. 601, p. 128177, 2024, doi: 10.1016/j.neucom.2024.128177.

A. Bhavan, P. Chauhan, Hitkul, and R. R. Shah, “Bagged support vector machines for emotion recognition from speech,” Knowledge-Based Syst., vol. 184, p. 104886, 2019, doi: 10.1016/j.knosys.2019.104886.

S. S. Chandurkar, S. V. Pede, and S. A. Chandurkar, “System for prediction of human emotions and depression level with recommendation of suitable therapy,” Asian Journal of Computer Science and Technology, vol. 6, no. 2, pp. 5-12, 2017, doi: 10.51983/ajcst-2017.6.2.1787.

M. Ghai, S. Lal, S. D. L., and S. Manik, “Emotion recognition on speech attributes using machine learning,” in Proc. 2024 IEEE Int. Conf. Inf. Technol. Electron. Intell. Commun. Syst. ICITEICS 2024, pp. 22-27, 2024, doi: 10.1109/ICITEICS61368.2024.10624904.

S. Madanian et al., “Speech emotion recognition using machine learning — A systematic review,” Intell. Syst. with Appl., vol. 20, p. 200266, 2023, doi: 10.1016/j.iswa.2023.200266.

M. Hao, W. Cao, Z. Liu, M. Wu, and P. Xiao, “Visual-audio emotion recognition based on multi-task and ensemble learning with multiple features,” Neurocomputing, vol. 391, pp. 42-51, 2020, doi: 10.1016/j.neucom.2020.01.048.

F. Noroozi, T. Sapiński, D. Kamińska, and G. Anbarjafari, “Vocal-based emotion recognition using random forests and decision tree,” Int. J. Speech Technol., vol. 20, no. 2, pp. 239-246, 2017, doi: 10.1007/s10772-017-9396-2.

A. V. Geetha, T. Mala, D. Priyanka, and E. Uma, “Multimodal emotion recognition with deep learning: Advancements, challenges, and future directions,” Inf. Fusion, vol. 105, p. 102218, 2024, doi: 10.1016/j.inffus.2023.102218.

D. Issa, M. F. Demirci, and A. Yazici, “Speech emotion recognition with deep convolutional neural networks,” Biomed. Signal Process. Control, vol. 59, p. 101894, 2020, doi: 10.1016/j.bspc.2020.101894.

M. M. R. Mashhadi and K. Osei-Bonsu, “Speech emotion recognition using machine learning techniques: Feature extraction and comparison of convolutional neural network and random forest,” PLoS One, vol. 18, no. 11, pp. 1-13, 2023, doi: 10.1371/journal.pone.0291500.

S. B. Jagtap, K. R. Desai, and M. J. K. Patil, “A survey on speech emotion recognition using MFCC and different classifiers,” in 8th Natl. Conf. Emerg. Trends Engg. Technol., pp. 502-509, 2018.

T. Arikrishnan and C. P. Darani, “A pathological voices assessment using classification,” Asian Journal of Engineering and Applied Technology, vol. 3, no. 1, pp. 5-8, 2014, doi: 10.51983/ajeat-2014.3.1.710.

S. Suke et al., “Speech emotion recognition system,” Int. J. Adv. Res. Sci. Commun. Technol., vol. 4, no. 3, pp. 156-159, 2021, doi: 10.48175/ijarsct-v4-i3-024.

R. Panda, R. Malheiro, and R. P. Paiva, “Audio features for music emotion recognition: A survey,” IEEE Trans. Affect. Comput., vol. 14, no. 1, pp. 68-88, 2023, doi: 10.1109/TAFFC.2020.3032373.

N. C. Ristea, L. C. Dutu, and A. Radoi, “Emotion recognition system from speech and visual information based on convolutional neural networks,” in Proc. 2019 10th Int. Conf. Speech Technol. Human-Computer Dialogue, SpeD 2019, pp. 1-6, 2019, doi: 10.1109/SPED.2019.8906538.

G. C. Jyothi, C. Prakash, G. A. Babitha, and G. H. Kiran Kumar, “Comparison analysis of CNN, SVC and random forest algorithms in segmentation of teeth X-ray images,” Asian Journal of Computer Science and Technology, vol. 11, no. 1, pp. 40-47, 2022, doi: 10.51983/ajcst-2022.11.1.3283.

R. S. Agrawal and U. N. Agrawal, “A review on emotion recognition using hybrid classifier,” in Spec. Issue Natl. Conf. Recent Adv. Technol. Manag. Integr. Growth 2013 (RATMIG 2013), 2017.

A. Hussain, N. Saikia, and C. Dev, “Advancements in Indian sign language recognition systems: Enhancing communication and accessibility for the deaf and hearing impaired,” Asian Journal of Electrical Sciences, vol. 12, no. 2, pp. 37-49, 2023.

S. Shankaracharya, S. S. S. K. R. Kumar, S. L. Y. G. Varma, and D. S. R. Reddy, “The accuracy analysis of different machine learning classifiers for detecting suicidal ideation and content,” Int. J. Intell. Eng. Syst., vol. 12, no. 1, pp. 46-56, 2023.

S. Sathurthi, R. Kamalakannan, and T. Rameshkumar, “Study of ensemble classifier for prediction in health care data,” Asian Journal of Computer Science and Technology, vol. 8, no. S1, pp. 36-37, 2019, doi: 10.51983/ajcst-2019.8.s1.1963.

J. Wei, X. Yang, and Y. Dong, “User-generated video emotion recognition based on key frames,” Multimed. Tools Appl., vol. 80, no. 9, pp. 14343-14361, 2021, doi: 10.1007/s11042-020-10203-1.

Published

10-11-2024

How to Cite

Donatus, R. E., Pal, B. L., Donatus, I. H., & Chiedu, U. O. (2024). Comparative Analysis of Spectrogram and MFCC Representations for Speech Emotion Recognition Using Machine Learning. Asian Journal of Computer Science and Technology, 13(2), 41–47. https://doi.org/10.70112/ajcst-2024.13.2.4284