
Similarity Measure for Polish Short Texts Based on Wordnet-Enhanced Bag-of-words Representation

  • Conference paper
Human Language Technology. Challenges for Computer Science and Linguistics (LTC 2015)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10930)


Abstract

We present a method for computing the semantic similarity of Polish texts, with the main focus on short texts. We have taken into account the limited set of language tools available for Polish, in particular the fact that syntactic and semantic parsers are not yet accurate and robust enough to serve as a stable basis for similarity computation. A very large wordnet of Polish, namely plWordNet, is used to construct meaning representations for words in such a way that different words of similar meaning receive similar representations. The use of a Word Sense Disambiguation (WSD) tool for Polish brought positive results in one of the method variants, despite the limited accuracy of the WSD tool. The proposed measures have been compared with manual evaluation of sentence pairs. The measures were also applied as part of a Question Answering system. Improved performance of answer finding was achieved in several types of tests.
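
The general idea of a wordnet-enhanced bag of words can be illustrated with a minimal sketch. It is not the authors' implementation: the toy lexicon `TOY_RELATED`, the `weight` parameter and the function names are assumptions made for illustration, and a real system would look up related lemmas in plWordNet instead. The sketch shows only that two texts with no lemma in common can still receive a non-zero cosine similarity once each lemma also contributes to its related lemmas.

```python
import math
from collections import Counter

# Hypothetical toy lexicon standing in for wordnet lookups (not plWordNet data):
# each lemma maps to lemmas related to it, e.g. synonyms and hypernyms.
TOY_RELATED = {
    "samochód": ["auto", "pojazd"],
    "auto": ["samochód", "pojazd"],
    "kot": ["zwierzę"],
    "pies": ["zwierzę"],
}


def expand(lemmas, related=TOY_RELATED, weight=0.5):
    """Build a bag of words in which each lemma also adds a reduced
    weight (an assumed constant) to every lemma related to it."""
    bag = Counter()
    for lemma in lemmas:
        bag[lemma] += 1.0
        for neighbour in related.get(lemma, []):
            bag[neighbour] += weight
    return bag


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as Counters."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Plain bags of words share no lemma; the expanded ones overlap.
plain = cosine(Counter(["samochód"]), Counter(["auto"]))
enhanced = cosine(expand(["samochód"]), expand(["auto"]))
print(plain, enhanced)
```

In this toy setting the plain similarity is zero, while the expanded representations overlap on all three lemmas. How the weights are actually assigned in the paper is not stated in the abstract, so the constant used here should be read as a placeholder.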


Notes

  1. A lemma is the basic morphological form (or entry form) that represents a set of word forms differing only in the values of their grammatical categories, such as case, gender or person.

  2. plWordNet also includes relations linking lexical units (i.e. triples of lemma, Part of Speech and sense identifier), and some of them were also used in the proposed methods. plWordNet 3.0 emo is the largest wordnet in the world, including 197,721 synsets, 179,125 lemmas and 260,214 lexical units described by more than 40 types of lexico-semantic relations (more than 90 together with subtypes) and more than 600,000 relation links, cf. [13]. The vast majority of plWordNet has been manually mapped onto Princeton WordNet.

  3. Web application and Web Service: http://ws.clarin-pl.eu/wsd.shtml.

  4. The balanced test set is a more difficult test case, as even very infrequent senses are represented in it, while the average text sample has the character of running text and does not include many rare senses.

  5. The total score in the whole graph is equal to 1.

  6. For Proper Names, the morpho-syntactic tagger used often returns the word form as the lemma.

  7. Synonymy is expressed by synsets. Hypernymy links more general synsets with more specific ones. However, in plWordNet all relations are in fact defined for lexical units, cf. [14]. Synset relations are notational abbreviations: a link between two synsets means that all pairs of their members are linked by the given relation. Hyponymy is the reverse of hypernymy. Meronymy links a part, element, portion etc. to the whole, e.g. [18], and holonymy links a whole to its part, element or portion, but it is not necessarily the reverse of meronymy.

  8. Instance links Proper Names with the common nouns that represent their most specific characterisation, and type is the reverse relation. Inter-register synonymy links synsets whose members are very close in meaning but differ in how they are used in language practice, i.e. they differ in stylistic register, e.g. one synset belongs to the general language and the other one to the vulgar register.

  9. These relations are not shared among lexical units, which is why they link only selected lexical units, not synsets. Femininity links lexical units of feminine forms with their masculine counterparts. Markedness is a general relation covering several semantic associations signalled by derivational relations, i.e. diminutive, augmentative and young being. Antonymy expresses binary opposition in meaning. Converse is also a binary semantic opposition, but one in which the two lexical units have argument structures with arguments in opposite roles (verbs) or fill opposite roles in some predicates (nouns).

  10. https://www.wikinews.org.

  11. The opposition relations resulted in an improvement when applied in the word expansion algorithm, as lemmas linked by opposition relations are often very similar according to Distributional Semantics methods, e.g. word embeddings.

References

  1. Achananuparp, P., Hu, X., Shen, X.: The evaluation of sentence similarity measures. In: Song, I.-Y., Eder, J., Nguyen, T.M. (eds.) DaWaK 2008. LNCS, vol. 5182, pp. 305–316. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-85836-2_29


  2. Agirre, E., Cer, D., Diab, M., Gonzalez-Agirre, A., Guo, W.: *SEM 2013 shared task: semantic textual similarity, including a pilot on typed-similarity. In: Second Joint Conference on Lexical and Computational Semantics (*SEM) - Proceedings of the Main Conference and the Shared Task, vol. 1, pp. 32–43. Association for Computational Linguistics, Atlanta (2013)


  3. Bär, D., Biemann, C., Gurevych, I., Zesch, T.: UKP: computing semantic textual similarity by combining multiple content similarity measures. In: Proceedings of the First Joint Conference on Lexical and Computational Semantics, pp. 435–440. ACL (2012)


  4. Corley, C., Mihalcea, R.: Measuring the semantic similarity of texts. In: Proceedings of the ACL Workshop on Empirical Modeling of Semantic Equivalence and Entailment, pp. 13–18. ACL (2005)


  5. Fellbaum, C. (ed.): WordNet - An Electronic Lexical Database. The MIT Press, Cambridge (1998)


  6. Kędzia, P., Piasecki, M., Orlińska, M.J.: Word sense disambiguation based on large scale Polish CLARIN heterogeneous lexical resources. Cognitive Studies 14 (2015). To appear

  7. Kouylekov, M., Magnini, B.: Recognizing textual entailment with tree edit distance. In: Proceedings of the PASCAL RTE Challenge, pp. 17–20 (2005)


  8. Li, Y., McLean, D., Bandar, Z.A., O'Shea, J.D., Crockett, K.: Sentence similarity based on semantic nets and corpus statistics. IEEE Trans. Knowl. Data Eng. 18(8), 1138–1150 (2006)

  9. Liu, H., Wang, P.: Assessing text semantic similarity using ontology. J. Softw. 9(2), 490–497 (2014)


  10. Marcińczuk, M., Radziszewski, A., Piasecki, M., Piasecki, D., Ptak, M.: Evaluation of baseline information retrieval for Polish open-domain question answering system. In: Angelova, G., Bontcheva, K., Mitkov, R. (eds.) Recent Advances in Natural Language Processing, RANLP 2013, 9–11 September 2013, Hissar, Bulgaria, pp. 428–435. RANLP 2011 Organising Committee/ACL (2013). http://aclweb.org/anthology/R/R13/R13-1056.pdf

  11. Marcińczuk, M., Radziszewski, A., Piasecki, M., Piasecki, D., Ptak, M.: Open dataset for development of Polish question answering systems. In: Proceedings of 6th Language & Technology Conference LTC 2013, Poznań (2013)


  12. Maziarz, M., Piasecki, M., Rudnicka, E., Szpakowicz, S.: plWordNet as the cornerstone of a toolkit of lexico-semantic resources. In: Proceedings of the Seventh Global Wordnet Conference, pp. 304–312 (2014). http://aclweb.org/anthology/W14-0142

  13. Maziarz, M., Piasecki, M., Rudnicka, E., Szpakowicz, S., Kędzia, P.: plWordNet 3.0 - a comprehensive lexical-semantic resource. In: Calzolari, N., Matsumoto, Y., Prasad, R. (eds.) COLING 2016, 26th International Conference on Computational Linguistics, Proceedings of the Conference: Technical Papers, 11–16 December 2016, Osaka, Japan, pp. 2259–2268. ACL (2016). http://aclweb.org/anthology/C/C16/

  14. Maziarz, M., Piasecki, M., Szpakowicz, S.: The chicken-and-egg problem in wordnet design: synonymy, synsets and constitutive relations. Lang. Resour. Eval. 47(3), 769–796 (2013). http://link.springer.com/article/10.1007

  15. McCandless, M., Hatcher, E., Gospodnetic, O.: Lucene in Action. Manning Publications, Greenwich (2010)


  16. Mihalcea, R., Corley, C., Strapparava, C.: Corpus-based and knowledge-based measures of text semantic similarity. In: Proceedings of the 21st National Conference on Artificial Intelligence, AAAI 2006, vol. 1. pp. 775–780. AAAI (2006). http://dl.acm.org/citation.cfm?id=1597538.1597662

  17. Piasecki, M., Ramocki, R., Kaliński, M.: Information spreading in expanding wordnet hypernymy structure. In: Proceedings of the International Conference Recent Advances in Natural Language Processing RANLP 2013, pp. 553–561. INCOMA Ltd., Hissar, September 2013. http://www.aclweb.org/anthology/R13-1073

  18. Piasecki, M., Szpakowicz, S., Broda, B.: A Wordnet from the Ground Up. Wrocław University of Technology Press, Wrocław (2009)


  19. Pourgholamali, F., Kahani, M.: Semantic role based sentence compression. In: 2nd International eConference on Computer and Knowledge Engineering (ICCKE), pp. 210–214. IEEE Computer Society (2012)


  20. Przepiórkowski, A., Wróblewska, A.: Supporting LFG parsing with dependency parsing. In: Dickinson, M., Hinrichs, E., Patejuk, A., Przepiórkowski, A. (eds.) Proceedings of the Fourteenth International Workshop on Treebanks and Linguistic Theories (TLT 14), pp. 168–178. Institute of Computer Science, Polish Academy of Sciences, Warsaw (2015). http://nlp.ipipan.waw.pl/Bib/prz:wro:15.pdf

  21. Punyakanok, V., Roth, D., Yih, W.: Mapping dependencies trees: an application to question answering. In: Proceedings of AI&Math, pp. 1–10, January 2004. http://cogcomp.cs.illinois.edu/papers/PunyakanokRoYi04a.pdf

  22. Radziszewski, A.: A tiered CRF tagger for Polish. In: Bembenik, R., Skonieczny, Ł., Rybiński, H., Kryszkiewicz, M., Niezgódka, M. (eds.) Intelligent Tools for Building a Scientific Information Platform: Advanced Architectures and Solutions. SCI, vol. 467, pp. 215–230. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-35647-6_16


  23. Radziszewski, A., Śniatowski, T.: Maca – a configurable tool to integrate Polish morphological data. In: Proceedings of the Second International Workshop on Free/Open-Source Rule-Based Machine Translation (2011)


  24. Salton, G., McGill, M.J.: Introduction to Modern Information Retrieval. McGraw-Hill Inc., New York (1986)


  25. Wilcoxon, F.: Individual comparisons by ranking methods. Biom. Bull. 1(6), 80–83 (1945)



Acknowledgments

This work was co-financed as a part of the investment in the CLARIN-PL research infrastructure (www.clarin-pl.eu) funded by the Polish Ministry of Science and Higher Education.

Author information


Corresponding author

Correspondence to Maciej Piasecki.


Copyright information

© 2018 Springer International Publishing AG, part of Springer Nature

About this paper


Cite this paper

Piasecki, M., Gut, A. (2018). Similarity Measure for Polish Short Texts Based on Wordnet-Enhanced Bag-of-words Representation. In: Vetulani, Z., Mariani, J., Kubis, M. (eds) Human Language Technology. Challenges for Computer Science and Linguistics. LTC 2015. Lecture Notes in Computer Science, vol 10930. Springer, Cham. https://doi.org/10.1007/978-3-319-93782-3_13


  • DOI: https://doi.org/10.1007/978-3-319-93782-3_13


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-93781-6

  • Online ISBN: 978-3-319-93782-3

  • eBook Packages: Computer Science (R0)
