Skip to main content

Semantic Compression for Text Document Processing

  • Chapter
  • First Online:
Transactions on Computational Collective Intelligence XIV

Part of the book series: Lecture Notes in Computer Science ((TCCI,volume 8615))

  • 360 Accesses

Abstract

Ongoing research on novel methods and tools that can be applied in Natural Language Processing tasks has resulted in the design of a semantic compression mechanism. Semantic compression is a technique that allows for correct generalization of terms in some given context. Thanks to this generalization a common thought can be detected. The rules governing the generalization process are based on a data structure which is referred to as a domain frequency dictionary. Having established the domain for a given text fragment the disambiguation of possibly many hypernyms becomes a feasible task. Semantic compression, thus an informed generalization, is possible through the use of semantic networks as a knowledge representation structure. In the given overview, it is worth noting that the semantic compression allows for a number of improvements in comparison to already established Natural Language Processing techniques. These improvements, along with a detailed discussion of the various elements of algorithms and data structures that are necessary to make semantic compression a viable solution, are the core of this work. Semantic compression can be applied in a variety of scenarios, e.g. in detection of plagiarism. With increasing effort being spent on developing semantic compression, new domains of application have been discovered. What is more, semantic compression itself has evolved and has been refined by the introduction of new solutions that boost the level of disambiguation efficiency. Thanks to the remodeling of already existing data sources to suit algorithms enabling semantic compression, it has become possible to use semantic compression as a base for automata that, thanks to the exploration of hypernym-hyponym and synonym relations, new concepts that may be included in the knowledge representation structures can now be discovered.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 39.99
Price excludes VAT (USA)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 54.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

References

  1. Baeza-Yates, R.A., Ribeiro-Neto, B.: Modern Information Retrieval. Addison-Wesley Longman Publishing Co. Inc., Boston (1999)

    Google Scholar 

  2. Boyd-Graber, J., Blei, D.M., Zhu, X.: A topic model for word sense disambiguation. In: EMNLP (2007)

    Google Scholar 

  3. Burrows, S., Tahaghoghi, S.M.M., Zobel, J.: Efficient plagiarism detection for large code repositories. Softw.: Pract. Exper. 37(2), 151–175 (2007)

    Google Scholar 

  4. Ceglarek, D., Haniewicz, K., Rutkowski, W.: Quality of semantic compression in classification. In: Pan, J.-S., Chen, S.-M., Nguyen, N.T. (eds.) ICCCI 2010, Part I. LNCS, vol. 6421, pp. 162–171. Springer, Heidelberg (2010)

    Google Scholar 

  5. Ceglarek, D., Haniewicz, K., Rutkowski, W.: Semantic compression for specialised information retrieval systems. In: Nguyen, N.T., Katarzyniak, R., Chen, S.-M. (eds.) Advances in Intelligent Information and Database Systems. SCI, vol. 283, pp. 111–121. Springer, Heidelberg (2010)

    Chapter  Google Scholar 

  6. Ceglarek, D., Haniewicz, K., Rutkowski, W.: Domain based semantic compression for automatic text comprehension augmentation and recommendation. In: Jędrzejowicz, P., Nguyen, N.T., Hoang, K. (eds.) ICCCI 2011, Part II. LNCS, vol. 6923, pp. 40–49. Springer, Heidelberg (2011)

    Google Scholar 

  7. Ceglarek, D., Haniewicz, K., Rutkowski, W.: Towards knowledge acquisition with WiSENet. In: Nguyen, N.T., Trawiński, B., Jung, J.J. (eds.) New Challenges for Intelligent Information and Database Systems. SCI, vol. 351, pp. 75–84. Springer, Heidelberg (2011)

    Chapter  Google Scholar 

  8. Erk, K., Padó, S.: A structured vector space model for word meaning in context. In: EMNLP, pp. 897–906. ACL (2008)

    Google Scholar 

  9. Frakes, W.B., Baeza-Yates, R.A. (eds.): Information Retrieval: Data Structures & Algorithms. Prentice-Hall, Upper Saddle River (1992)

    Google Scholar 

  10. Hotho, A., Staab, S., Stumme, G.: Explaining text clustering results using semantic structures. In: Lavrač, N., Gamberger, D., Todorovski, L., Blockeel, H. (eds.) PKDD 2003. LNCS (LNAI), vol. 2838, pp. 217–228. Springer, Heidelberg (2003)

    Google Scholar 

  11. Khan, L., McLeod, D., Hovy, E.: Retrieval effectiveness of an ontology-based model for information selection. VLDB J. 13, 71–85 (2004)

    Article  Google Scholar 

  12. Krovetz, R., Croft, W.B.: Lexical ambiguity and information retrieval. ACM Trans. Inf. Syst. 10, 115–141 (1992)

    Article  Google Scholar 

  13. Lukashenko, R., Graudina, V., Grundspenkis, J.: Computer-based plagiarism detection methods and tools: an overview. In: Proceedings of the 2007 International Conference on Computer Systems and Technologies, CompSysTech ’07, New York, NY, USA, pp. 40:1–40:6. ACM (2007)

    Google Scholar 

  14. Mikowski, M.: Automated building of error corpora of polish. In: Lewandowska-Tomaszczyk, B. (ed.) Corpus Linguistics, Computer Tools, and Applications State of the Art, PALC 2007, pp. 631–639. Peter Lang, Frankfurt am Main, Berlin, Bern, Bruxelles, New York, Oxford, Wien, (2008)

    Google Scholar 

  15. Miller, G.A.: WordNet: a lexical database for english. Commun. ACM 38, 39–41 (1995)

    Article  Google Scholar 

  16. Nock, R., Nielsen, F.: On weighting clustering. IEEE Trans. Pattern Anal. Mach. Intell. 28(8), 1223–1235 (2006)

    Article  Google Scholar 

  17. Ota, T., Masuyama, S.: Automatic plagiarism detection among term papers. In: Proceedings of the 3rd International Universal Communication Symposium, IUCS ’09, pp. 395–399, New York, NY, USA. ACM (2009)

    Google Scholar 

  18. Sanderson, M.: Word sense disambiguation and information retrieval. In: Croft, W.B., van Rijsbergen, C.J. (eds.) SIGIR ’94, pp. 142–151. ACM/Springer, London (1994)

    Google Scholar 

  19. Sinha, R., Mihalcea, R.: Unsupervised graph-based word sense disambiguation using measures of word semantic similarity. In: ICSC, pp. 363–369. IEEE Computer Society (2007)

    Google Scholar 

  20. Snow, R., Jurafsky, D., Ng, A.Y.: Learning syntactic patterns for automatic hypernym discovery. In: Advances in Neural Information Processing Systems (NIPS 2004), November 2004. This is a draft version from the NIPS preproceedings; the final version will be published by April 2005

    Google Scholar 

  21. Staab, S., Hotho, A.: Ontology-based text document clustering. In: Klopotek, M.A., Wierzchon, S.T., Trojanowski, K. (eds.) Intelligent Information Processing and Web Mining. Advances in Soft Computing, vol. 22, pp. 451–452. Springer, Heidelberg (2003)

    Chapter  Google Scholar 

  22. Ceglarek, D.: Architecture of the semantically enhanced intellectual property protection system. In: Burduk, R., Jackowski, K., Kurzynski, M., Wozniak, M., Zolnierek, A. (eds.) CORES 2013. AISC, vol. 226, pp. 711–720. Springer, Heidelberg (2013)

    Chapter  Google Scholar 

  23. Ceglarek, D.: Single-pass corpus to corpus comparison by sentence hashing. In: Badica, A., Trawinski, B., Nguyen, N.T. (eds.) Recent Developments in Computational Collective Intelligence - Concepts. Applications and Systems, volume 7092 of Studies in Computational Intelligence, pp. 167–177. Springer, Heidelberg (2013)

    Google Scholar 

  24. Hoad, T.C., Zobel, J.: Methods for identifying versioned and plagiarized documents. J. Am. Soc. Inf. Sci. Technol. 54(3), 203–215 (2003)

    Article  Google Scholar 

  25. Charikar, M.S.: Similarity estimation techniques from rounding algorithms. In: Proceedings of the Thirty-Fourth Annual ACM Symposium on Theory of Computing, STOC ’02, pp. 380–388. ACM (2002)

    Google Scholar 

  26. Manber, U.: Finding similar files in a large file system. In: Proceedings of the USENIX Winter 1994 Technical Conference on USENIX Winter 1994 Technical Conference, WTEC’94, Berkeley, CA, USA, p. 2. USENIX Association (1994)

    Google Scholar 

  27. Stein, B., Lipka, N., Prettenhoferr, P.: Intrinsic plagiarism analysis. Lang. Resour. Eval. 45(1), 63–82 (2010). Springer, Netherlands

    Article  Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Dariusz Ceglarek .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2014 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Ceglarek, D. (2014). Semantic Compression for Text Document Processing. In: Nguyen, N. (eds) Transactions on Computational Collective Intelligence XIV. Lecture Notes in Computer Science(), vol 8615. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-44509-9_2

Download citation

  • DOI: https://doi.org/10.1007/978-3-662-44509-9_2

  • Published:

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-662-44508-2

  • Online ISBN: 978-3-662-44509-9

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics