f: Phrase Relatedness Function Using Overlapping Bi-gram Context

  • Conference paper
Advances in Artificial Intelligence (Canadian AI 2016)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9673)

Abstract

We present an unsupervised phrase relatedness function (f) that has been applied in a Semantic Textual Similarity system (TrWP) at SemEval-2015, whose best run was ranked 33rd among 73 runs. f measures the relatedness strength between two phrases using overlapping bi-gram contexts extracted from the Google-n-gram corpus; the relatedness strength is the degree of association that captures how similar or dissimilar two phrases are. To compute it, f applies a sum-ratio (SR) technique to the statistics of the overlapping n-grams associated with the two input phrases. Experimental results show that f improves over existing phrase relatedness methods on two standard datasets totalling 216 phrase pairs. f requires no human-annotated resources and is independent of the syntactic structure of the phrases.
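
The following is a minimal sketch of the general idea, not the paper's exact method: the bi-gram context counts, the helper names, and the equal sum-ratio weights are all illustrative assumptions, and the paper's actual extraction from the Google-n-gram corpus and its weighting scheme are not reproduced here.

```python
from collections import Counter

def sum_ratio(x, y, w1=1.0, w2=1.0):
    """Weighted mean of two numbers (footnote 2); the weights are an assumption here."""
    return (w1 * x + w2 * y) / (w1 + w2)

def phrase_relatedness(contexts_a, contexts_b):
    """Toy relatedness score from overlapping bi-gram contexts.

    Each argument maps a bi-gram context to its (hypothetical) corpus frequency
    for one phrase.  The score aggregates the sum-ratio of the frequencies of the
    shared contexts and normalises by the sum-ratio of the total context mass.
    """
    shared = set(contexts_a) & set(contexts_b)   # overlapping bi-gram contexts
    if not shared:
        return 0.0
    overlap = sum(sum_ratio(contexts_a[c], contexts_b[c]) for c in shared)
    total = sum_ratio(sum(contexts_a.values()), sum(contexts_b.values()))
    return min(1.0, overlap / total)

# Hypothetical bi-gram context counts for two phrases (not real corpus statistics).
contexts_a = Counter({("of", "the"): 120, ("in", "a"): 45, ("new", "york"): 10})
contexts_b = Counter({("of", "the"): 90, ("in", "a"): 60, ("hot", "dog"): 5})
print(phrase_relatedness(contexts_a, contexts_b))  # about 0.95 for this toy data
```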

Notes

  1. We use ‘relatedness’ and ‘similarity’ interchangeably in this paper, although ‘similarity’ is a special case, or subset, of ‘relatedness’.

  2. We use the term Sum-Ratio to denote a weighted mean of two numbers.

  3. Pruning the bi-gram contexts implies pruning the Google-n-grams from which those contexts are extracted.

  4. We prefer Pearson’s r to Spearman’s \(\rho \) because Agirre et al. [28] state that Pearson’s r is more informative than Spearman’s \(\rho \): Spearman’s \(\rho \) considers only rank differences, whereas Pearson’s r also takes value differences into account (see the sketch after these notes). Moreover, SemEval-2013 [28] used Pearson’s r for its evaluation task.

  5. Pearson’s r is not computed for Mitchell and Lapata’s [7] system because their individual phrase-pair scores are unavailable. Moreover, in an attempt to reproduce Mitchell and Lapata’s [7] method, Hartung and Frank [6] obtained Spearman’s \(\rho = 0.34\) instead of \(\rho =0.46\) on 108 adjective-noun pairs.
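
A minimal sketch of the comparison referenced in footnote 4, using hypothetical phrase-pair scores (not data from the paper or its datasets): the two score lists below rank the pairs identically, so Spearman's \(\rho \) is exactly 1.0, while Pearson's r is lower because the score values are not linearly aligned.

```python
from scipy.stats import pearsonr, spearmanr

# Hypothetical gold and system scores for five phrase pairs
# (illustrative values only, not taken from the paper).
gold = [0.10, 0.30, 0.50, 0.70, 0.90]
system = [0.05, 0.10, 0.20, 0.80, 0.85]  # same ranking as gold, different values

r, _ = pearsonr(gold, system)     # sensitive to value differences (about 0.93 here)
rho, _ = spearmanr(gold, system)  # depends only on rank order (exactly 1.0 here)

print(f"Pearson's r    = {r:.3f}")
print(f"Spearman's rho = {rho:.3f}")
```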

References

  1. Zamir, O., Etzioni, O.: Grouper: a dynamic clustering interface to web search results. In: Proceedings of the Eighth International Conference on World Wide Web, WWW 1999, New York, USA, pp. 1361–1374 (1999)

  2. Chim, H., Deng, X.: Efficient phrase-based document similarity for clustering. IEEE Trans. Knowl. Data Eng. 20(9), 1217–1229 (2008)

  3. Charniak, E.: Statistical Language Learning. MIT Press, Cambridge (1993)

  4. Hammouda, K., Kamel, M.: Efficient phrase-based document indexing for web document clustering. IEEE Trans. Knowl. Data Eng. 16(10), 1279–1296 (2004)

  5. Pera, M.S., Ng, Y.K.: Spamed: a spam e-mail detection approach based on phrase similarity. J. Am. Soc. Inf. Sci. Technol. 60(2), 393–409 (2009)

  6. Hartung, M., Frank, A.: Assessing interpretable, attribute-related meaning representations for adjective-noun phrases in a similarity prediction task. In: Proceedings of the GEMS 2011 Workshop, Stroudsburg, PA, USA, pp. 52–61 (2011)

  7. Mitchell, J., Lapata, M.: Composition in distributional models of semantics. Cogn. Sci. 34(8), 1388–1429 (2010)

  8. Baroni, M.: Composition in distributional semantics. Lang. Linguist. Compass 7(10), 511–522 (2013)

  9. Annesi, P., Storch, V., Basili, R.: Space projections as distributional models for semantic composition. In: Gelbukh, A. (ed.) CICLing 2012, Part I. LNCS, vol. 7181, pp. 323–335. Springer, Heidelberg (2012)

  10. Han, L., Kashyap, A.L., Finin, T., Mayfield, J., Weese, J.: UMBC_EBIQUITY-CORE: semantic textual similarity systems. In: Proceedings of the Second Joint Conference on Lexical and Computational Semantics, June 2013

  11. Tsatsaronis, G., Varlamis, I., Vazirgiannis, M., Nørvåg, K.: Omiotis: a thesaurus-based measure of text relatedness. In: Buntine, W., Grobelnik, M., Mladenić, D., Shawe-Taylor, J. (eds.) ECML PKDD 2009, Part II. LNCS, vol. 5782, pp. 742–745. Springer, Heidelberg (2009)

  12. Bollegala, D., Matsuo, Y., Ishizuka, M.: A web search engine-based approach to measure semantic similarity between words. IEEE Trans. Knowl. Data Eng. 23(7), 977–990 (2011)

  13. Cilibrasi, R.L., Vitanyi, P.M.B.: The Google similarity distance. IEEE Trans. Knowl. Data Eng. 19(3), 370–383 (2007)

  14. Lin, D.: An information-theoretic definition of similarity. In: Proceedings of the Fifteenth ICML, ICML 1998, San Francisco, CA, USA, pp. 296–304 (1998)

  15. Salton, G., McGill, M.J.: Introduction to Modern Information Retrieval. McGraw-Hill Inc., New York (1986)

  16. Turney, P.D.: Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In: Flach, P.A., De Raedt, L. (eds.) ECML 2001. LNCS (LNAI), vol. 2167, p. 491. Springer, Heidelberg (2001)

  17. Rakib, M.R.H., Islam, A., Milios, E.: TrWP: text relatedness using word and phrase relatedness. In: Proceedings of SemEval 2015, Colorado, pp. 90–95 (2015)

  18. Turney, P.D., Pantel, P.: From frequency to meaning: vector space models of semantics. J. Artif. Int. Res. 37(1), 141–188 (2010)

  19. Lin, D.: Automatic retrieval and clustering of similar words. In: Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics, ACL 1998, pp. 768–774 (1998)

  20. Brants, T., Franz, A.: Web 1T 5-gram corpus version 1.1. Linguistic Data Consortium (2006)

  21. Reddy, S., Klapaftis, I., McCarthy, D., Manandhar, S.: Dynamic and static prototype vectors for semantic composition. In: Proceedings of the 5th International Joint Conference on Natural Language Processing, Thailand, pp. 705–713, November 2011

  22. Lund, K., Burgess, C.: Producing high-dimensional semantic spaces from lexical co-occurrence. Behav. Res. Methods Instrum. Comput. 28(2), 203–208 (1996)

  23. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003)

  24. Vilares, M., Ribadas, F.J., Vilares, J.: Phrase similarity through the edit distance. In: Galindo, F., Takizawa, M., Traunmüller, R. (eds.) DEXA 2004. LNCS, vol. 3180, pp. 306–317. Springer, Heidelberg (2004)

  25. Islam, A., Milios, E., Kešelj, V.: Comparing word relatedness measures based on Google-n-grams. In: COLING (Posters), pp. 495–506 (2012)

  26. Gracia, J., Trillo, R., Espinoza, M., Mena, E.: Querying the web: a multiontology disambiguation method. In: Proceedings of the 6th International Conference on Web Engineering, ICWE 2006, pp. 241–248. ACM, New York (2006)

  27. Bohm, G., Zech, G.: Introduction to statistics and data analysis for physicists. DESY (2010)

  28. Agirre, E., Cer, D., Diab, M., Gonzalez-Agirre, A., Guo, W.: *SEM 2013 shared task: semantic textual similarity. In: Second Joint Conference on Lexical and Computational Semantics, Atlanta, Georgia, USA, pp. 32–43, June 2013

  29. Zou, G.Y.: Toward using confidence intervals to compare correlations. Psychol. Methods 12(4), 399–413 (2007)

Author information

Corresponding author

Correspondence to Md. Rashadul Hasan Rakib.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Rakib, M.R.H., Islam, A., Milios, E. (2016). f: Phrase Relatedness Function Using Overlapping Bi-gram Context. In: Khoury, R., Drummond, C. (eds) Advances in Artificial Intelligence. Canadian AI 2016. Lecture Notes in Computer Science (LNAI), vol 9673. Springer, Cham. https://doi.org/10.1007/978-3-319-34111-8_19

  • DOI: https://doi.org/10.1007/978-3-319-34111-8_19

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-34110-1

  • Online ISBN: 978-3-319-34111-8

  • eBook Packages: Computer Science, Computer Science (R0)
