Abstract
Semantic similarity has typically been measured across items of approximately similar size. As a result, similarity measures have largely ignored the fact that different types of linguistic item can have similar or even identical meanings, and have therefore been designed to compare only one type of item. Moreover, nearly all current similarity benchmarks in NLP contain pairs of approximately the same size, such as word or sentence pairs, preventing the evaluation of methods capable of comparing items of different sizes. To address this, we introduce a new semantic evaluation called cross-level semantic similarity (CLSS), which measures the degree to which the meaning of a larger linguistic item, such as a paragraph, is captured by a smaller item, such as a sentence. Our pilot CLSS task was presented as part of SemEval-2014, attracting 19 teams who submitted 38 systems. The CLSS data contains a rich mixture of pairs, spanning from paragraphs to word senses, in order to fully evaluate similarity measures that are capable of comparing items of any type. In addition, the data was drawn from diverse corpora beyond just newswire, including domain-specific texts and social media. We describe the annotation process and its challenges, including a comparison with crowdsourcing, and identify the factors that make the dataset a rigorous assessment of a method's quality. We also examine in detail the systems participating in the SemEval task to identify the common factors associated with high performance and the aspects that proved difficult for all systems. Our findings demonstrate that CLSS poses a significant challenge for similarity methods and provide clear directions for future work on universal similarity methods that can compare any pair of items.
Notes
A notable exception is the set of benchmarks in Information Retrieval, where a relatively short query is paired with full documents. Although these items are often compared using a common representation such as a vector space, the comparison is interpreted not as similarity but as relatedness.
We calculate an item’s length in terms of the number of its content words, i.e., nouns, verbs, adjectives, and adverbs.
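As a minimal sketch of this length measure (not the task's actual code), the count can be taken over part-of-speech-tagged tokens; the tags below are hand-assigned using the Universal tagset, and in practice any POS tagger could supply them:

```python
# Illustrative sketch: an item's length counted as its number of content
# words (nouns, verbs, adjectives, adverbs), per the task's definition.
# Tags are hand-assigned Universal POS tags; any tagger could supply them.

CONTENT_POS = {"NOUN", "VERB", "ADJ", "ADV"}

def content_length(tagged_tokens):
    """Number of content words in a sequence of (token, pos) pairs."""
    return sum(1 for _, pos in tagged_tokens if pos in CONTENT_POS)

sentence = [("The", "DET"), ("quick", "ADJ"), ("brown", "ADJ"),
            ("fox", "NOUN"), ("jumps", "VERB"), ("over", "ADP"),
            ("the", "DET"), ("lazy", "ADJ"), ("dog", "NOUN")]

print(content_length(sentence))  # 6
```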
Annotation materials along with all training and test data are available on the task website http://alt.qcri.org/semeval2014/task3/.
Defined as “large electrical home appliances (refrigerators or washing machines etc.) that are typically finished in white enamel”.
Defined as “the form of a word that is used to denote more than one”.
We consider only words with one of the four parts of speech: noun, verb, adjective, and adverb.
We note that the α for unadjudicated items is higher than that for all items, since the former set includes only those items for which annotators’ scores differed by at most one point on the rating scale and thus the ratings had high agreement.
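The effect described above can be illustrated with a small sketch (not the paper's code) of Krippendorff's α for interval data with two annotators and complete ratings; the ratings below are hypothetical, chosen only to show that restricting to items whose two scores differ by at most one point raises agreement:

```python
# Illustrative sketch: Krippendorff's alpha (interval metric) for two
# raters with no missing data, showing why agreement is higher on the
# subset of items whose two ratings differ by at most one point.

def interval_alpha(pairs):
    """Krippendorff's alpha for a list of (rating1, rating2) items."""
    n = len(pairs)
    # Observed disagreement: mean squared difference within items.
    d_obs = sum((a - b) ** 2 for a, b in pairs) / n
    # Expected disagreement: mean squared difference over all ordered
    # pairs of distinct pooled values.
    values = [v for pair in pairs for v in pair]
    total = len(values)
    s, s2 = sum(values), sum(v * v for v in values)
    d_exp = (2 * total * s2 - 2 * s * s) / (total * (total - 1))
    return 1.0 - d_obs / d_exp

# Hypothetical ratings on a 0-4 scale: (annotator1, annotator2) per item.
all_items = [(4, 4), (3, 3), (2, 2), (3, 4), (1, 0), (0, 4), (4, 0)]
# Keep only items whose ratings differ by at most one point.
close_items = [(a, b) for a, b in all_items if abs(a - b) <= 1]

print(round(interval_alpha(all_items), 3))    # lower agreement
print(round(interval_alpha(close_items), 3))  # higher agreement
```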
We observed that working with WordNet senses in the crowdsourced setting proved too complex, e.g., due to the need to easily view a sense’s hypernyms; hence, word-to-sense data was not replicated.
A similar setup was used in SemEval-2013 Task 11 (Navigli and Vannella 2013) that evaluated Word Sense Induction and Disambiguation within an end-user application of search results clustering.
References
Agirre, E., Alfonseca, E., Hall, K., Kravalova, J., Paşca, M., & Soroa, A. (2009). A study on similarity and relatedness using distributional and WordNet-based approaches. In Proceedings of NAACL, Boulder, CO (pp. 19–27).
Agirre, E., Cer, D., Diab, M., & Gonzalez-Agirre, A. (2012). SemEval-2012 task 6: A pilot on semantic textual similarity. In Proceedings of the 6th international workshop on semantic evaluation (SemEval-2012), Montréal, Canada (pp. 385–393).
Agirre, E., Cer, D., Diab, M., Gonzalez-Agirre, A., & Guo, W. (2013). *SEM 2013 shared task: Semantic textual similarity, including a pilot on typed-similarity. In Proceedings of the second joint conference on lexical and computational semantics (*SEM), Atlanta, GA (pp. 32–43).
Agirre, E., Banea, C., Cardie, C., Cer, D., Diab, M., Gonzalez-Agirre, A., et al. (2014). SemEval-2014 task 10: Multilingual semantic textual similarity. In Proceedings of the 8th international workshop on semantic evaluation (SemEval 2014), Dublin, Ireland (pp. 81–91).
Artstein, R., & Poesio, M. (2008). Inter-coder agreement for computational linguistics. Computational Linguistics, 34(4), 555–596.
Bär, D., Biemann, C., Gurevych, I., & Zesch, T. (2012). UKP: Computing semantic textual similarity by combining multiple content similarity measures. In Proceedings of SemEval-2012, Montréal, Canada (pp. 435–440).
Clough, P., & Stevenson, M. (2011). Developing a corpus of plagiarised short answers. Language Resources and Evaluation, 45(1), 5–24.
Diab, M. (2013). Semantic textual similarity: Past, present and future. Keynote address at the Joint symposium on semantic processing. http://jssp2013.fbk.eu/sites/jssp2013.fbk.eu/files/Mona.pdf.
Dolan, B., Quirk, C., & Brockett, C. (2004). Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In Proceedings of the 20th international conference on computational linguistics, Geneva, Switzerland (pp. 350–356).
Erk, K., & McCarthy, D. (2009). Graded word sense assignment. In Proceedings of the 2009 conference on empirical methods in natural language processing (EMNLP), Singapore (pp. 440–449).
Erk, K., McCarthy, D., & Gaylord, N. (2013). Measuring word meaning in context. Computational Linguistics, 39(3), 511–554.
Fellbaum, C. (Ed.). (1998). WordNet: An electronic lexical database. Cambridge, MA: MIT Press.
Finkelstein, L., Gabrilovich, E., Matias, Y., Rivlin, E., Solan, Z., Wolfman, G., et al. (2002). Placing search in context: The concept revisited. ACM Transactions on Information Systems, 20(1), 116–131.
Ganitkevitch, J., Van Durme, B., & Callison-Burch, C. (2013). PPDB: The paraphrase database. In Proceedings of NAACL, Atlanta, GA (pp. 758–764).
Hill, F., Reichart, R., & Korhonen, A. (2014). SimLex-999: Evaluating semantic models with (genuine) similarity estimation. arXiv:1408.3456.
Ide, N., & Suderman, K. (2004). The American National Corpus first release. In Proceedings of the 4th language resources and evaluation conference (LREC), Lisbon, Portugal (pp. 1681–1684).
Jimenez, S., Gonzalez, F., & Gelbukh, A. (2010). Text comparison using soft cardinality. In Proceedings of the 17th international conference on string processing and information retrieval (pp. 297–302). Berlin: Springer.
Jurgens, D., & Klapaftis, I. (2013). SemEval-2013 task 13: Word sense induction for graded and non-graded senses. In Second joint conference on lexical and computational semantics (*SEM). Proceedings of the seventh international workshop on semantic evaluation (SemEval 2013), Atlanta, GA, USA (Vol. 2, pp. 290–299).
Jurgens, D., & Navigli, R. (2014). It’s all fun and games until someone annotates: Video games with a purpose for linguistic annotation. Transactions of the Association for Computational Linguistics (TACL), 2, 449–464.
Jurgens, D., & Pilehvar, M. T. (2015). Reserating the awesometastic: An automatic extension of the WordNet taxonomy for novel terms. In Proceedings of the 2015 conference of the North American chapter of the association for computational linguistics: Human language technologies, Denver, CO (pp. 1459–1465).
Jurgens, D., Pilehvar, M. T., & Navigli, R. (2014). SemEval-2014 task 3: Cross-level semantic similarity. In Proceedings of the 8th international workshop on semantic evaluation, Dublin, Ireland (pp. 17–26).
Jurgens, D., Mohammad, S., Turney, P., & Holyoak, K. (2012). SemEval-2012 task 2: Measuring degrees of relational similarity. In Proceedings of the 6th international workshop on semantic evaluation (SemEval-2012), Montréal, Canada (pp. 356–364).
Kilgarriff, A. (2001). English lexical sample task description. In The proceedings of the second international workshop on evaluating word sense disambiguation systems (SENSEVAL-2), Toulouse, France (pp. 17–20).
Kim, S. N., Medelyan, O., Kan, M. Y., & Baldwin, T. (2010). SemEval-2010 task 5: Automatic keyphrase extraction from scientific articles. In Proceedings of the 5th international workshop on semantic evaluation (SemEval-2010), Los Angeles, CA (pp. 21–26).
Koehn, P. (2005). Europarl: A parallel corpus for statistical machine translation. In Proceedings of machine translation summit X, Phuket, Thailand (pp. 79–86).
Krippendorff, K. (2004). Content analysis: An introduction to its methodology (2nd ed.). Thousand Oaks, CA: Sage.
Landauer, T. K., & Dumais, S. T. (1997). A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 104(2), 211–240.
Li, Y., McLean, D., Bandar, Z. A., O’shea, J. D., & Crockett, K. (2006). Sentence similarity based on semantic nets and corpus statistics. IEEE Transactions on Knowledge and Data Engineering, 18(8), 1138–1150.
Lin, D. (1998). An information-theoretic definition of similarity. In Proceedings of the fifteenth international conference on machine learning, San Francisco, CA (pp. 296–304).
Marelli, M., Menini, S., Baroni, M., Bentivogli, L., Bernardi, R., & Zamparelli, R. (2014). SemEval-2014 task 1: Evaluation of compositional distributional semantic models on full sentences through semantic relatedness and textual entailment. In Proceedings of SemEval-2014, Dublin, Ireland (pp. 1–8).
McAuley, J. J., & Leskovec, J. (2013). From amateurs to connoisseurs: Modeling the evolution of user expertise through online reviews. In Proceedings of the 22nd international conference on World Wide Web, Rio de Janeiro, Brazil (pp. 897–908).
McCarthy, D., & Navigli, R. (2009). The English lexical substitution task. Language Resources and Evaluation, 43(2), 139–159.
Mikolov, T., Yih, W. T., & Zweig, G. (2013). Linguistic regularities in continuous space word representations. In Proceedings of the conference of the North American chapter of the association for computational linguistics (NAACL), Atlanta, GA (pp. 746–751).
Navigli, R. (2006). Meaningful clustering of senses helps boost word sense disambiguation performance. In Proceedings of the 21st international conference on computational linguistics and the 44th annual meeting of the association for computational linguistics (COLING-ACL), Sydney, Australia (pp. 105–112).
Navigli, R. (2009). Word sense disambiguation: A survey. ACM Computing Surveys, 41(2), 1–69.
Navigli, R., & Vannella, D. (2013). SemEval-2013 task 11: Evaluating word sense induction and disambiguation within an end-user application. In Proceedings of the 7th international workshop on semantic evaluation (SemEval 2013), in conjunction with the second joint conference on lexical and computational semantics (*SEM 2013), Atlanta, USA (pp. 193–201).
Pavlick, E., Post, M., Irvine, A., Kachaev, D., & Callison-Burch, C. (2014). The language demographics of Amazon Mechanical Turk. Transactions of the Association for Computational Linguistics, 2, 79–92.
Pilehvar, M. T., & Navigli, R. (2014a). A large-scale pseudoword-based evaluation framework for state-of-the-art word sense disambiguation. Computational Linguistics, 40(4), 837–881.
Pilehvar, M. T., & Navigli, R. (2014b). A robust approach to aligning heterogeneous lexical resources. In Proceedings of the 52nd annual meeting of the association for computational linguistics, Baltimore, USA (pp. 468–478).
Pilehvar, M. T., & Navigli, R. (2015). From senses to texts: An all-in-one graph-based approach for measuring semantic similarity. Artificial Intelligence, 228, 95–128.
Rubenstein, H., & Goodenough, J. B. (1965). Contextual correlates of synonymy. Communications of the ACM, 8(10), 627–633.
Šarić, F., Glavaš, G., Karan, M., Šnajder, J., & Dalbelo Bašić, B. (2012). Takelab: Systems for measuring semantic text similarity. In Proceedings of SemEval-2012, Montréal, Canada (pp. 441–448).
Snow, R., Prakash, S., Jurafsky, D., & Ng, A. Y. (2007). Learning to merge word senses. In Proceedings of the 2007 joint conference on empirical methods in natural language processing and computational natural language learning (EMNLP-CoNLL), Prague, Czech Republic (pp. 1005–1014).
Spärck Jones, K. (2007). Automatic summarising: The state of the art. Information Processing and Management, 43(6), 1449–1481.
Specia, L., Jauhar, S. K., & Mihalcea, R. (2012). SemEval-2012 task 1: English lexical simplification. In Proceedings of the sixth international workshop on semantic evaluation (SemEval-2012), Montréal, Canada (pp. 347–355).
Sultan, M. A., Bethard, S., & Sumner, T. (2014). Back to basics for monolingual alignment: Exploiting word similarity and contextual evidence. Transactions of the Association for Computational Linguistics, 2, 219–230.
Sultan, M. A., Bethard, S., & Sumner, T. (2015). DLS@CU: Sentence similarity from word alignment and semantic vector composition. In Proceedings of the 9th international workshop on semantic evaluation (SemEval 2015), Denver, CO (pp. 148–153).
Vannella, D., Jurgens, D., Scarfini, D., Toscani, D., & Navigli, R. (2014). Validating and extending semantic knowledge bases using video games with a purpose. In Proceedings of the 52nd annual meeting of the association for computational linguistics (ACL 2014), Baltimore, MD (pp. 1294–1304).
Wise, M. J. (1996). YAP3: Improved detection of similarities in computer program and other texts. In Proceedings of the twenty-seventh SIGCSE technical symposium on computer science education, Philadelphia, PA, USA (pp. 130–134).
Acknowledgments
The authors gratefully acknowledge the support of the ERC Starting Grant MultiJEDI No. 259234.
Cite this article
Jurgens, D., Pilehvar, M.T. & Navigli, R. Cross level semantic similarity: an evaluation framework for universal measures of similarity. Lang Resources & Evaluation 50, 5–33 (2016). https://doi.org/10.1007/s10579-015-9318-3