A Translation Framework for Cross Language Information Retrieval in Tamil and Malayalam

SAKTHI VEL S., PRIYA R

Abstract


Cross Language Information Retrieval (CLIR) is an essential element of multilingual information access, enabling users to obtain relevant information even when the query language differs from the language of the documents. This paper proposes a translation framework for CLIR in Tamil and Malayalam, two Dravidian languages widely spoken in South India. CLIR for these languages faces several challenges: linguistic differences, translation equivalence, mapping from source to target language, semantic equivalence, and the limited datasets and tools available for research in this domain. The proposed methodology addresses some of these issues by training a Long Short-Term Memory (LSTM) based encoder-decoder translation model on a corpus. The study uses two bilingual parallel corpora of 373 sentence pairs each. The model's accuracy is evaluated by comparing its translations against reference translations using the Bilingual Evaluation Understudy (BLEU) score. Furthermore, the BLEU scores obtained from the proposed LSTM-based encoder-decoder model are compared with those from Google Translate. The findings show that the LSTM model attains an average BLEU score of 0.933, whereas Google Translate achieves a score of 0.813. Finally, the study presents a comparative analysis with selected CLIR models in different languages to evaluate the overall performance of the proposed approach.
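As an illustration of the evaluation step, the following is a minimal sketch of a sentence-level BLEU score computed from clipped n-gram precisions and a brevity penalty. It is not the authors' implementation: the function names (`ngrams`, `sentence_bleu`), the choice of uniform weights over 1- to 4-grams without smoothing, and the English placeholder tokens are all assumptions made for the example.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(reference, candidate, max_n=4):
    """BLEU of one candidate against one reference.

    Assumptions for this sketch: uniform weights over 1..max_n grams,
    no smoothing (any zero precision yields a score of 0.0).
    """
    if not candidate:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(candidate, n)
        ref_counts = ngrams(reference, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or clipped == 0:
            return 0.0
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty: penalise candidates shorter than the reference.
    c_len, r_len = len(candidate), len(reference)
    bp = 1.0 if c_len > r_len else math.exp(1 - r_len / c_len)
    return bp * math.exp(sum(log_precisions) / max_n)


# Hypothetical placeholder sentences, not data from the study.
reference = "the cat sat on the mat".split()
exact = "the cat sat on the mat".split()
partial = "the cat sat on a mat".split()

score_exact = sentence_bleu(reference, exact)
score_partial = sentence_bleu(reference, partial)
```

With this definition, a candidate identical to its reference scores 1.0 and a candidate that differs in one word scores strictly between 0 and 1. An average score such as the ones reported above would be the mean of such values over all test sentence pairs, assuming sentence-level averaging.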

Keywords


Cross Language IR; LSTM encoder-decoder; Machine Translation; Text Processing; Query expansion; BLEU Score



Indonesian Journal of Electrical Engineering and Informatics (IJEEI)
ISSN 2089-3272

This work is licensed under a Creative Commons Attribution 4.0 International License.
