Similarity Measure for Polish Short Texts Based on Wordnet-Enhanced Bag-of-words Representation

Piasecki, Maciej; Gut, Anna

doi:10.1007/978-3-319-93782-3_13

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10930)

Included in the following conference series:

Language and Technology Conference

Abstract

We present a method for computing the semantic similarity of Polish texts, with the main focus on short texts. We have taken into account the limited set of language tools for Polish, and especially the fact that syntactic and semantic parsers are not accurate and robust enough to become a stable basis for similarity computation. A very large wordnet of Polish, namely plWordNet, is used to construct meaning representations for words in such a way that different words of similar meaning receive similar representations. The use of a Word Sense Disambiguation (WSD) tool for Polish brought positive results in one of the method variants, despite the limited accuracy of the WSD tool. The proposed measures have been compared with manual evaluations of sentence pairs. The measures were also applied as part of a Question Answering system. Improved answer-finding performance was achieved in several types of tests.
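The general idea of a wordnet-enhanced bag-of-words representation can be illustrated with a minimal sketch. It is not the authors' method: the abstract does not specify the weighting scheme or the similarity function, so the toy lexicon `TOY_WORDNET`, the weight `RELATED_WEIGHT`, and the choice of cosine similarity are all assumptions made for illustration, and the input is assumed to be already lemmatised.

```python
from collections import Counter
from math import sqrt

# Hypothetical toy lexicon standing in for a wordnet: lemma -> related lemmas.
# These entries are illustrative only and are not taken from plWordNet.
TOY_WORDNET = {
    "samochód": {"auto", "pojazd"},
    "auto": {"samochód", "pojazd"},
    "kupić": {"nabyć"},
    "nabyć": {"kupić"},
}

RELATED_WEIGHT = 0.5  # assumed weight for lemmas added through the lexicon


def expand(lemmas, lexicon=TOY_WORDNET, related_weight=RELATED_WEIGHT):
    """Build a weighted bag of words: each original lemma counts 1.0,
    and each lexicon neighbour is added with a reduced weight."""
    bag = Counter()
    for lemma in lemmas:
        bag[lemma] += 1.0
        for related in lexicon.get(lemma, ()):
            bag[related] += related_weight
    return bag


def cosine(a, b):
    """Cosine similarity between two sparse weighted bags."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = sqrt(sum(w * w for w in a.values()))
    norm_b = sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity(lemmas_a, lemmas_b, lexicon=TOY_WORDNET):
    """Similarity of two lemmatised texts after lexicon-based expansion."""
    return cosine(expand(lemmas_a, lexicon), expand(lemmas_b, lexicon))
```

With an empty lexicon the two texts `["kupić", "samochód"]` and `["nabyć", "auto"]` share no lemmas and score zero; with the toy lexicon the expanded bags overlap, so texts that use different words of similar meaning receive similar representations.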

Notes

1. A lemma is a basic morphological form (or entry form) that represents a set of word forms differing only in the values of their grammatical categories, such as case, gender, or person.
2. plWordNet also includes relations linking lexical units (i.e. triples of lemma, Part of Speech and sense identifier), and some of them were also used in the proposed methods. plWordNet 3.0 emo is the largest wordnet in the world, including 197,721 synsets, 179,125 lemmas and 260,214 lexical units described by more than 40 types of lexico-semantic relations (more than 90 together with subtypes) and more than 600,000 relation links, cf. [13]. The vast majority of plWordNet has been manually mapped onto Princeton WordNet.
3. Web application and Web Service: http://ws.clarin-pl.eu/wsd.shtml.
4. The balanced test set is a more difficult test case, as even very infrequent senses are represented in it, while the average text sample has the character of running text and does not include many rare senses.
5. The total score in the whole graph is equal to 1.
6. For Proper Names, the morpho-syntactic tagger used often returns the word form as a lemma.
7. Synonymy is expressed by synsets. Hypernymy links more general synsets with more specific ones. In fact, however, all relations in plWordNet are defined for lexical units, cf. [14]. Synset relations are notational abbreviations: a link between two synsets means that all pairs of their members are linked by the given relation. Hyponymy is the reverse relation of hypernymy. Meronymy links a part/element/portion etc. to the whole, e.g. [18], and holonymy links a whole to its part/element/portion, but it is not necessarily the reverse of meronymy.
8. Instance links Proper Names with common nouns that represent their most specific characterisation, and type is the reverse relation. Inter-register synonymy links synsets whose members are very close in meaning but differ in use in language practice, i.e. they differ in stylistic register, e.g. one synset belongs to the general language and the other to the vulgar register.
9. These relations are not shared among lexical units, which is why they link only selected lexical units, not synsets. Femininity links lexical units of feminine forms with their masculine counterparts. Markedness is a general relation covering several semantic associations signalled by derivational relations, i.e. diminutive, augmentative, young being. Antonymy expresses binary opposition in meaning. Converse is also a binary semantic opposition, but one in which the two lexical units have argument structures with arguments in opposite roles (verbs) or fill opposite roles in some predicates (nouns).
10. https://www.wikinews.org.
11. The opposition relations improved the results of the word expansion algorithm, as lemmas linked by opposition relations are often very similar according to Distributional Semantics methods, e.g. word embeddings.
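
Notes 2 and 7 describe lexical units as triples (lemma, Part of Speech, sense identifier) and synset relations as abbreviations for links between all pairs of synset members. A minimal sketch of that reading follows; the `LexicalUnit` type, the `expand_synset_link` helper, and the example synsets with their sense identifiers are hypothetical and are not taken from plWordNet or its API.

```python
from itertools import product
from typing import NamedTuple


class LexicalUnit(NamedTuple):
    """A lexical unit as a (lemma, part of speech, sense identifier) triple."""
    lemma: str
    pos: str
    sense: int


def expand_synset_link(source, target):
    """Return every (source unit, target unit) pair implied by one
    synset-level link, i.e. the Cartesian product of the two synsets."""
    return set(product(source, target))


# Hypothetical synsets; the sense identifiers are made up for illustration.
vehicle = frozenset({LexicalUnit("pojazd", "noun", 1)})
car = frozenset({
    LexicalUnit("samochód", "noun", 1),
    LexicalUnit("auto", "noun", 1),
})

# One hypernymy link from the more general synset to the more specific one
# stands for a link between each pair of their members.
hypernymy_pairs = expand_synset_link(vehicle, car)
```

Relations such as femininity or antonymy (note 9) would not be expanded this way, since they hold only between selected lexical units rather than whole synsets.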

References

1. Achananuparp, P., Hu, X., Shen, X.: The evaluation of sentence similarity measures. In: Song, I.-Y., Eder, J., Nguyen, T.M. (eds.) DaWaK 2008. LNCS, vol. 5182, pp. 305–316. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-85836-2_29
2. Agirre, E., Cer, D., Diab, M., Gonzalez-Agirre, A., Guo, W.: *SEM 2013 shared task: semantic textual similarity, including a pilot on typed-similarity. In: Second Joint Conference on Lexical and Computational Semantics (*SEM) - Proceedings of the Main Conference and the Shared Task, vol. 1, pp. 32–43. Association for Computational Linguistics, Atlanta (2013)
3. Bär, D., Biemann, C., Gurevych, I., Zesch, T.: UKP: computing semantic textual similarity by combining multiple content similarity measures. In: Proceedings of the First Joint Conference on Lexical and Computational Semantics, pp. 435–440. ACL (2012)
4. Corley, C., Mihalcea, R.: Measuring the semantic similarity of texts. In: Proceedings of the ACL Workshop on Empirical Modeling of Semantic Equivalence and Entailment, pp. 13–18. ACL (2005)
5. Fellbaum, C. (ed.): WordNet - An Electronic Lexical Database. The MIT Press, Cambridge (1998)
6. Kędzia, P., Piasecki, M., Orlińska, M.J.: Word sense disambiguation based on large scale Polish CLARIN heterogeneous lexical resources. Cognitive Studies 14 (2015). To appear
7. Kouylekov, M., Magnini, B.: Recognizing textual entailment with tree edit distance. In: Proceedings of the PASCAL RTE Challenge, pp. 17–20 (2005)
8. Li, Y., McLean, D., Bandar, Z.A., O'Shea, J.D., Crockett, K.: Sentence similarity based on semantic nets and corpus statistics. IEEE Trans. Knowl. Data Eng. 18(8), 1138–1150 (2006)
9. Liu, H., Wang, P.: Assessing text semantic similarity using ontology. J. Softw. 9(2), 490–497 (2014)
10. Marcińczuk, M., Radziszewski, A., Piasecki, M., Piasecki, D., Ptak, M.: Evaluation of baseline information retrieval for Polish open-domain question answering system. In: Angelova, G., Bontcheva, K., Mitkov, R. (eds.) Recent Advances in Natural Language Processing, RANLP 2013, 9–11 September 2013, Hissar, Bulgaria, pp. 428–435. RANLP 2011 Organising Committee/ACL (2013). http://aclweb.org/anthology/R/R13/R13-1056.pdf
11. Marcińczuk, M., Radziszewski, A., Piasecki, M., Piasecki, D., Ptak, M.: Open dataset for development of Polish question answering systems. In: Proceedings of 6th Language & Technology Conference LTC 2013, Poznań (2013)
12. Maziarz, M., Piasecki, M., Rudnicka, E., Szpakowicz, S.: plWordNet as the cornerstone of a toolkit of lexico-semantic resources. In: Proceedings of the Seventh Global Wordnet Conference, pp. 304–312 (2014). http://aclweb.org/anthology/W14-0142
13. Maziarz, M., Piasecki, M., Rudnicka, E., Szpakowicz, S., Kędzia, P.: plWordNet 3.0 - a comprehensive lexical-semantic resource. In: Calzolari, N., Matsumoto, Y., Prasad, R. (eds.) COLING 2016, 26th International Conference on Computational Linguistics, Proceedings of the Conference: Technical Papers, 11–16 December 2016, Osaka, Japan, pp. 2259–2268. ACL (2016). http://aclweb.org/anthology/C/C16/
14. Maziarz, M., Piasecki, M., Szpakowicz, S.: The chicken-and-egg problem in wordnet design: synonymy, synsets and constitutive relations. Lang. Resour. Eval. 47(3), 769–796 (2013). http://link.springer.com/article/10.1007
15. McCandless, M., Hatcher, E., Gospodnetic, O.: Lucene in Action. Manning Publications, Greenwich (2010)
16. Mihalcea, R., Corley, C., Strapparava, C.: Corpus-based and knowledge-based measures of text semantic similarity. In: Proceedings of the 21st National Conference on Artificial Intelligence, AAAI 2006, vol. 1, pp. 775–780. AAAI (2006). http://dl.acm.org/citation.cfm?id=1597538.1597662
17. Piasecki, M., Ramocki, R., Kaliński, M.: Information spreading in expanding wordnet hypernymy structure. In: Proceedings of the International Conference Recent Advances in Natural Language Processing RANLP 2013, pp. 553–561. INCOMA Ltd., Hissar, September 2013. http://www.aclweb.org/anthology/R13-1073
18. Piasecki, M., Szpakowicz, S., Broda, B.: A Wordnet from the Ground Up. Wrocław University of Technology Press, Wrocław (2009)
19. Pourgholamali, F., Kahani, M.: Semantic role based sentence compression. In: 2nd International eConference on Computer and Knowledge Engineering (ICCKE), pp. 210–214. IEEE Computer Society (2012)
20. Przepiórkowski, A., Wróblewska, A.: Supporting LFG parsing with dependency parsing. In: Dickinson, M., Hinrichs, E., Patejuk, A., Przepiórkowski, A. (eds.) Proceedings of the Fourteenth International Workshop on Treebanks and Linguistic Theories (TLT 14), pp. 168–178. Institute of Computer Science, Polish Academy of Sciences, Warsaw (2015). http://nlp.ipipan.waw.pl/Bib/prz:wro:15.pdf
21. Punyakanok, V., Roth, D., Yih, W.: Mapping dependencies trees: an application to question answering. In: Proceedings of AI&Math, pp. 1–10, January 2004. http://cogcomp.cs.illinois.edu/papers/PunyakanokRoYi04a.pdf
22. Radziszewski, A.: A tiered CRF tagger for Polish. In: Bembenik, R., Skonieczny, Ł., Rybiński, H., Kryszkiewicz, M., Niezgódka, M. (eds.) Intelligent Tools for Building a Scientific Information Platform: Advanced Architectures and Solutions. SCI, vol. 467, pp. 215–230. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-35647-6_16
23. Radziszewski, A., Śniatowski, T.: Maca – a configurable tool to integrate Polish morphological data. In: Proceedings of the Second International Workshop on Free/Open-Source Rule-Based Machine Translation (2011)
24. Salton, G., McGill, M.J.: Introduction to Modern Information Retrieval. McGraw-Hill Inc., New York (1986)
25. Wilcoxon, F.: Individual comparisons by ranking methods. Biom. Bull. 1(6), 80–83 (1945)

Acknowledgments

This work was co-financed as a part of the investment in the CLARIN-PL research infrastructure (www.clarin-pl.eu) funded by the Polish Ministry of Science and Higher Education.

Author information

Authors and Affiliations

Faculty of Computer Science and Management, Wrocław University of Technology, Wrocław, Poland
Maciej Piasecki & Anna Gut

Corresponding author

Correspondence to Maciej Piasecki.

Editor information

Editors and Affiliations

Adam Mickiewicz University, Poznań, Poland
Zygmunt Vetulani
LIMSI-CNRS, Orsay Cedex, France
Joseph Mariani
Adam Mickiewicz University, Poznań, Poland
Marek Kubis

About this paper

Cite this paper

Piasecki, M., Gut, A. (2018). Similarity Measure for Polish Short Texts Based on Wordnet-Enhanced Bag-of-words Representation. In: Vetulani, Z., Mariani, J., Kubis, M. (eds) Human Language Technology. Challenges for Computer Science and Linguistics. LTC 2015. Lecture Notes in Computer Science, vol 10930. Springer, Cham. https://doi.org/10.1007/978-3-319-93782-3_13

DOI: https://doi.org/10.1007/978-3-319-93782-3_13
Published: 16 June 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-93781-6
Online ISBN: 978-3-319-93782-3
eBook Packages: Computer Science, Computer Science (R0)
