Semantic Similarity Measure Using a Combination of Word2Vec and WordNet Models
Abstract
The cognitive effort required for humans to perceive similarities and relationships between words is considerable. Measuring similarity and relatedness between text components such as words, texts, or documents is challenging, and the complexity of language, together with the diverse factors that influence similarity and relatedness, keeps it an active area of research across various domains. Researchers are exploring diverse approaches to improve the accuracy and effectiveness of these measurements. Knowledge sources such as WordNet have been a popular basis for modeling semantic relationships between words. Recently, however, distributional semantic models such as Word2Vec have demonstrated their ability to capture semantic information effectively and to outperform lexicon-based methods in terms of unidirectional contextual similarity outcomes. In contrast to lexicon-based approaches, which rely on the structure of a knowledge source, distributional models leverage context to capture semantics. This study proposes a novel approach that linearly combines the lexical database WordNet with the distributional model Word2Vec to measure semantic similarity, with the aim of improving upon previous techniques. The proposed approach is described in detail and evaluated on popular datasets to determine its effectiveness. The experimental results indicate that the proposed approach achieves highly satisfactory results and surpasses the performance of the individual methods.
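As a rough illustration of what a linear combination of a lexicon-based score and an embedding-based score can look like, the sketch below blends two similarity values with a single weight. The abstract does not state the exact formula, so the function names, the weight `alpha`, its default of 0.5, and the toy vectors are all assumptions made for illustration; in practice the lexicon score might come from a WordNet measure and the embedding score from the cosine similarity of Word2Vec vectors.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (plain lists)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def combined_similarity(lexicon_score, embedding_score, alpha=0.5):
    """Weighted linear blend of two similarity scores.

    alpha is a hypothetical mixing weight in [0, 1]: alpha = 1 uses only
    the lexicon-based score, alpha = 0 only the embedding-based score.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * lexicon_score + (1.0 - alpha) * embedding_score


# Toy example: made-up 3-dimensional "embeddings" and a made-up lexicon score.
vec_car = [0.9, 0.1, 0.3]
vec_automobile = [0.8, 0.2, 0.4]
embedding_score = cosine_similarity(vec_car, vec_automobile)
lexicon_score = 1.0  # e.g. two words sharing a synset; assumed value
score = combined_similarity(lexicon_score, embedding_score, alpha=0.5)
```

With both inputs in the range [0, 1], the blended score stays in that range, which keeps it directly comparable with either individual measure when correlating against human ratings.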
Indonesian Journal of Electrical Engineering and Informatics (IJEEI)
ISSN 2089-3272
This work is licensed under a Creative Commons Attribution 4.0 International License.