Semantic Compression for Text Document Processing

Ceglarek, Dariusz

doi:10.1007/978-3-662-44509-9_2

Part of the book series: Lecture Notes in Computer Science (TCCI, volume 8615)

Abstract

Ongoing research on novel methods and tools for Natural Language Processing tasks has resulted in the design of a semantic compression mechanism. Semantic compression is a technique that allows for the correct generalization of terms in a given context. Thanks to this generalization, a common thought can be detected. The rules governing the generalization process are based on a data structure referred to as a domain frequency dictionary. Once the domain of a given text fragment has been established, disambiguating among possibly many hypernyms becomes a feasible task. Semantic compression, understood as informed generalization, is made possible by the use of semantic networks as a knowledge representation structure. Semantic compression allows for a number of improvements over already established Natural Language Processing techniques. These improvements, along with a detailed discussion of the algorithms and data structures necessary to make semantic compression a viable solution, are the core of this work. Semantic compression can be applied in a variety of scenarios, e.g. in the detection of plagiarism. With increasing effort being spent on developing semantic compression, new domains of application have been discovered. What is more, semantic compression itself has evolved and has been refined by the introduction of new solutions that boost the level of disambiguation efficiency. Thanks to the remodeling of existing data sources to suit the algorithms that enable semantic compression, it has become possible to use semantic compression as a base for automata that explore hypernym-hyponym and synonym relations to discover new concepts that may be included in the knowledge representation structures.
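
The abstract describes the mechanism only at a high level, so the following is a minimal sketch of the idea and not the chapter's actual algorithm. It assumes a toy semantic network (`HYPERNYMS`), a toy domain frequency dictionary (`DOMAIN_FREQ`), and a frequency `threshold`; all names, entries, and numbers are invented for illustration. A rare term is replaced by a hypernym that is frequent enough in the detected domain, and the domain is also used to choose among several candidate hypernyms.

```python
# Toy semantic network: term -> candidate hypernyms (invented entries).
HYPERNYMS = {
    "jaguar": ["feline", "car"],
    "lynx": ["feline"],
    "feline": ["animal"],
    "sedan": ["car"],
    "car": ["vehicle"],
}

# Toy domain frequency dictionary: domain -> term -> relative frequency
# (invented numbers, assumed to come from a domain-labelled corpus).
DOMAIN_FREQ = {
    "zoology": {"animal": 0.9, "feline": 0.6, "lynx": 0.1, "jaguar": 0.1},
    "automotive": {"vehicle": 0.9, "car": 0.8, "sedan": 0.2, "jaguar": 0.1},
}


def detect_domain(tokens):
    """Pick the domain whose dictionary gives the tokens the highest total frequency."""
    return max(
        DOMAIN_FREQ,
        key=lambda d: sum(DOMAIN_FREQ[d].get(t, 0.0) for t in tokens),
    )


def generalize(term, domain, threshold):
    """Climb hypernym links until the term is frequent enough in the domain."""
    freq = DOMAIN_FREQ[domain]
    seen = set()  # guards against cycles in the network
    while freq.get(term, 0.0) < threshold and term in HYPERNYMS and term not in seen:
        seen.add(term)
        # Disambiguate among several hypernyms using the domain frequencies.
        term = max(HYPERNYMS[term], key=lambda h: freq.get(h, 0.0))
    return term


def compress(tokens, threshold=0.5):
    """Replace each token with its domain-appropriate generalization."""
    domain = detect_domain(tokens)
    return [generalize(t, domain, threshold) for t in tokens]
```

In this sketch the ambiguous term "jaguar" generalizes to "feline" when it appears next to "lynx" and "animal", and to "car" when it appears next to "sedan" and "vehicle", which is the kind of domain-driven hypernym disambiguation the abstract refers to.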

Author information

Authors and Affiliations

Poznan School of Banking, Poznan, Poland
Dariusz Ceglarek

Corresponding author

Correspondence to Dariusz Ceglarek.

Editor information

Editors and Affiliations

Institute of Informatics, Wroclaw University of Technology, Wroclaw, Poland
Ngoc Thanh Nguyen

About this chapter

Cite this chapter

Ceglarek, D. (2014). Semantic Compression for Text Document Processing. In: Nguyen, N.T. (ed.) Transactions on Computational Collective Intelligence XIV. Lecture Notes in Computer Science, vol 8615. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-44509-9_2

DOI: https://doi.org/10.1007/978-3-662-44509-9_2
Published: 17 August 2014
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-44508-2
Online ISBN: 978-3-662-44509-9
eBook Packages: Computer Science (R0)