Bengali Word Detection from Lip Movements Using Mask RCNN and Generalized Linear Model

Abul Bashar Bhuiyan, Jia Uddin

Abstract


Speech processing based on lip detection and lip reading is an advancing field. It requires algorithms and techniques that detect the lips and their movements accurately. Lip detection and lip configuration are key parts of this kind of speech recognition. In this paper, we focus on detecting the lip segment properly. Mask R-CNN (Mask Region-based Convolutional Neural Network) performs object detection and instance segmentation on each video frame to detect the lip segment. Mask R-CNN adds only a small overhead to Faster R-CNN and is simple to train, running at 5 frames per second. Mask R-CNN also supports keypoint detection, which helps to extract the location of the lip landmarks pixel by pixel. Once the lip region is extracted and the landmarks are highlighted, we observe how the lip landmarks change over time as the speaker's lips move for each Bengali word. The keypoint changes observed during each millisecond are then used to train the GLM (Generalized Linear Model). In addition, we compare the performance of the GLM with Naive Bayes, Logistic Regression, and Decision Tree. The GLM achieves the highest accuracy, 91.8%, whereas Naive Bayes, Logistic Regression, and Decision Tree reach 87.1%, 38.3%, and 82.2%, respectively.
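The abstract does not specify how the keypoint changes are encoded before they are passed to the GLM. As a minimal sketch, assuming each frame yields the same ordered list of (x, y) lip landmarks, the changes could be represented as frame-to-frame displacements flattened into one feature vector per word clip. The names `landmark_deltas`, `Frame`, and the sample coordinates below are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple

Point = Tuple[float, float]   # (x, y) pixel position of one lip landmark
Frame = List[Point]           # all lip landmarks detected in one video frame


def landmark_deltas(frames: List[Frame]) -> List[float]:
    """Flatten frame-to-frame landmark displacements into one feature vector.

    Assumption (not from the paper): every frame holds the same landmarks
    in the same order, so the i-th point can be compared across frames.
    """
    if len(frames) < 2:
        raise ValueError("need at least two frames to measure movement")
    n_landmarks = len(frames[0])
    features: List[float] = []
    for prev, curr in zip(frames, frames[1:]):
        if len(prev) != n_landmarks or len(curr) != n_landmarks:
            raise ValueError("every frame must have the same number of landmarks")
        for (x0, y0), (x1, y1) in zip(prev, curr):
            # Displacement of one landmark between consecutive frames.
            features.extend([x1 - x0, y1 - y0])
    return features


# Hypothetical clip: 3 frames, 2 landmarks per frame.
frames = [
    [(10.0, 20.0), (30.0, 20.0)],
    [(10.0, 22.0), (30.0, 18.0)],
    [(11.0, 25.0), (29.0, 15.0)],
]
features = landmark_deltas(frames)
print(len(features))  # (3 - 1) frame pairs * 2 landmarks * 2 coordinates
```

A vector like `features`, paired with the word label for the clip, is the kind of input a GLM or any of the compared classifiers could be trained on. Clips of different lengths would need resampling or padding first, which this sketch leaves out.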

Keywords


Word Detection; Lip Movements; Machine Learning; Image Segmentation; Accuracy




Indonesian Journal of Electrical Engineering and Informatics (IJEEI)
ISSN 2089-3272


This work is licensed under a Creative Commons Attribution 4.0 International License.
