Improving Bi-LSTM for High Accuracy Protein Sequence Family Classifier

Roslidar Roslidar, Novia Brilianty, Muhammad Jurej Alhamdi, Cut Nanda Nurbadriani, Essy Harnelly, Zulkarnain Zulkarnain

Abstract


Proteins are primary nutrients that are crucial for understanding biochemical processes and biological norms in living cells. A protein typically performs one or a few functions, which are defined by its family type. Identification and classification are therefore needed to separate proteins according to their structure and family. In this work, we built a model to classify protein sequences by family. We used a protein sequence dataset consisting of various macromolecules of biological significance. We began by collecting the dataset from the Protein Data Bank of the Research Collaboratory for Structural Bioinformatics, then pre-processed the data by tokenizing the sequences, and finally modeled the classifier as a deep learning Bi-LSTM network. After obtaining the trained model with the best accuracy rate, we assessed its performance using three evaluation metrics: learning curve, accuracy rate, and loss. The results show that the Deep Bi-LSTM provides excellent performance, with a well-fitted learning curve, a 99% accuracy rate, and a loss of 0.042.
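The tokenizing step mentioned above can be illustrated with a minimal sketch. The abstract does not state the tokenization scheme, so this example assumes a character-level encoding with one integer per amino-acid residue, with 0 reserved for padding and 1 for any non-standard residue. The function names, the alphabet ordering, and the `max_len` value are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Optional

# Assumption: the 20 standard amino acids, one token per residue.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAD_ID = 0  # reserved for padding short sequences
UNK_ID = 1  # any residue outside the standard alphabet (e.g. X, B, Z)


def build_vocab(alphabet: str = AMINO_ACIDS) -> Dict[str, int]:
    """Map each residue letter to an integer ID, starting at 2."""
    return {aa: i + 2 for i, aa in enumerate(alphabet)}


def tokenize(seq: str, vocab: Dict[str, int]) -> List[int]:
    """Convert a protein sequence string into a list of integer IDs."""
    return [vocab.get(ch, UNK_ID) for ch in seq.upper()]


def pad(ids: List[int], max_len: int) -> List[int]:
    """Truncate to max_len, then right-pad with PAD_ID."""
    ids = ids[:max_len]
    return ids + [PAD_ID] * (max_len - len(ids))


def encode_batch(
    seqs: List[str], max_len: int, vocab: Optional[Dict[str, int]] = None
) -> List[List[int]]:
    """Tokenize and pad a batch so every row has the same length."""
    vocab = vocab or build_vocab()
    return [pad(tokenize(s, vocab), max_len) for s in seqs]


if __name__ == "__main__":
    # Hypothetical toy sequences, not from the dataset used in the paper.
    print(encode_batch(["MKV", "ACDXE"], max_len=4))
```

Fixed-length integer rows of this kind are the usual input to an embedding layer followed by a bidirectional LSTM. The actual layer configuration used by the authors is not given in the abstract.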


Keywords


protein sequence, classification, Bi-LSTM, deep learning


Indonesian Journal of Electrical Engineering and Informatics (IJEEI)
ISSN 2089-3272

This work is licensed under a Creative Commons Attribution 4.0 International License.
