ABSTRACT
Computing semantic similarity between two terms is essential for a variety of text analytics and understanding applications. However, existing approaches are better suited to measuring similarity between single words than between the more general multi-word expressions (MWEs), and they do not scale well. We therefore propose a lightweight and effective approach to semantic similarity that uses a large-scale semantic network automatically acquired from billions of web documents. Given two terms, we map them into the concept space and compare them there. Furthermore, we introduce a clustering approach that orthogonalizes the concept space in order to improve the accuracy of the similarity measure. Extensive studies demonstrate that our approach accurately computes the semantic similarity between terms, including MWEs and ambiguous terms, and significantly outperforms 12 competing methods.
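The mapping-and-comparison step described in the abstract can be illustrated with a minimal sketch. The abstract does not name the similarity function or the network's data format, so everything here is an assumption: the toy `ISA_COUNTS` table, the helper names, and the choice of cosine similarity are all hypothetical, and the clustering step that orthogonalizes the concept space is left out.

```python
import math

# Hypothetical toy isA counts: term -> {concept: count}.
# These numbers are invented for illustration; they do not come from
# the semantic network described in the abstract.
ISA_COUNTS = {
    "apple": {"fruit": 60, "company": 40},
    "microsoft": {"company": 90, "software vendor": 10},
    "pear": {"fruit": 95, "tree": 5},
}


def concept_vector(term):
    """Map a term to a normalized distribution over its concepts."""
    counts = ISA_COUNTS.get(term.lower(), {})
    total = sum(counts.values())
    if total == 0:
        return {}
    return {concept: n / total for concept, n in counts.items()}


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(concept, 0.0) for concept, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def term_similarity(a, b):
    """Compare two terms in concept space (cosine is an assumed choice)."""
    return cosine(concept_vector(a), concept_vector(b))
```

With this toy table, `term_similarity("apple", "pear")` is positive because both terms share the concept "fruit", while `term_similarity("microsoft", "pear")` is zero because they share no concept. Terms missing from the table map to an empty vector and get similarity zero.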
Index Terms
- Computing term similarity by large probabilistic isA knowledge