Abstract

This chapter presents the architecture of a system for fast text search and document comparison, with the main focus on an N-gram-based algorithm and its parallel implementation. The algorithm, one of several computational procedures implemented in the system, generates a fingerprint of each analyzed document as a set of hashes that represent the file. This work examines the performance of the system in terms of both file comparison quality and fingerprint generation. Several tests of the N-gram-based algorithm were conducted on an Intel Xeon E5645 (2.40 GHz); they show an approximately 8x speedup of the multi-core over the single-core implementation.
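
As a rough illustration of what "a fingerprint as a set of hashes" can look like, the following Python sketch hashes every word-level N-gram of a text and compares two fingerprints by set overlap. It is not the chapter's implementation: the whitespace tokenization, the window size `n`, the CRC32 hash, the Jaccard similarity measure, and the function names `ngram_fingerprint` and `similarity` are all assumptions chosen for brevity.

```python
from __future__ import annotations

import zlib


def ngram_fingerprint(text: str, n: int = 4) -> set[int]:
    """Return the set of CRC32 hashes of all word-level n-grams in `text`.

    Hypothetical helper: tokenization and hash function are assumptions,
    not the method described in the chapter.
    """
    words = text.lower().split()
    # Slide a window of n words over the text and hash each window.
    return {
        zlib.crc32(" ".join(words[i:i + n]).encode("utf-8"))
        for i in range(len(words) - n + 1)
    }


def similarity(a: set[int], b: set[int]) -> float:
    """Jaccard similarity of two fingerprints (assumed comparison measure)."""
    if not a and not b:
        return 1.0  # two empty fingerprints are treated as identical
    return len(a & b) / len(a | b)
```

Two documents could then be compared with `similarity(ngram_fingerprint(doc_a), ngram_fingerprint(doc_b))`. Because each document's fingerprint is computed independently of the others, this kind of workload can in principle be spread across cores (for example with `multiprocessing.Pool.map`); how the chapter's system actually parallelizes the computation is not stated in the abstract.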

References

  1. Jalil, Z., Mirza, A.M., Iqbal, T.: A zero-watermarking algorithm for text documents based on structural components. In: International Conference on Information and Emerging Technologies (ICIET). vol. 1(5), pp. 14–16, June 2010

  2. Schleimer, S., Wilkerson, D.S., Aiken, A.: Winnowing: local algorithms for document fingerprinting. In: Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data (SIGMOD’03). pp. 76–85

  3. Miller, E., Shen, D., Liu, J., Nicholas, Ch., Chen, T.: Techniques for gigabyte-scale N-gram based information retrieval on personal computers. In: Proceedings of the 1999 International Conference on Parallel and Distributed Processing Techniques and applications—PDPTA’99

  4. Heintze, N.: Scalable document fingerprinting. In: Proceedings USENIX Workshop on Electronic Commerce. 1996

  5. Forner, P., Karlgren, J., Womser-Hacker, Ch., Potthast, M., Gollub, T., Hagen, M., Graßegger, J., Kiesel, J., Michel, M., Oberländer, A., Barrón-Cedeño, A., Gupta, P., Rosso, P., Stein, B.: Overview of the 4th international competition on plagiarism detection. In: Forner, P., Karlgren, J., Womser-Hacker, C. (eds.) Notebook Papers of CLEF 2012 LABs and Workshops, CLEF-2012 17–20 September. Rome, Italy

  6. Chin, O.S., Kulathuramaiyer, N., Yeo, A.W.: Automatic discovery of concepts from text. In: IEEE/WIC/ACM International Conference on Web Intelligence. pp. 1046–1049. 18–22 Dec 2006

  7. Amine, A., Elberrichi, Z., Simonet, M., Malki, M.: WordNet-based and N-grams-based document clustering: a comparative study, broadband communications. In: 3rd International Conference on Information Technology and Biomedical Applications. pp. 394–401. 23–26 Nov 2008

  8. Synat NCBiR Project: SP/I/1/77065/10. http://synat.pl

  9. Stevenson, M., Greenwood, M.A.: Learning information extraction patterns using WordNet. In: Proceedings of the 5th International Conference on Language Resources and Evaluations, LREC 2006. pp. 95–102. 22–28 May 2006

  10. Natural Language Toolkit. http://nltk.org/

Acknowledgments

The work presented in this chapter was financed through the Synat research program [8].

Author information

Corresponding author

Correspondence to Maciej Wielgosz.

Copyright information

© 2014 Springer International Publishing Switzerland

About this chapter

Cite this chapter

Wielgosz, M., Janiszewski, M., Russek, P., Pietron, M., Jamro, E., Wiatr, K. (2014). Implementation of a System for Fast Text Search and Document Comparison. In: Bembenik, R., Skonieczny, Ł., Rybiński, H., Kryszkiewicz, M., Niezgódka, M. (eds) Intelligent Tools for Building a Scientific Information Platform: From Research to Implementation. Studies in Computational Intelligence, vol 541. Springer, Cham. https://doi.org/10.1007/978-3-319-04714-0_11

  • DOI: https://doi.org/10.1007/978-3-319-04714-0_11

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-04713-3

  • Online ISBN: 978-3-319-04714-0

  • eBook Packages: Engineering, Engineering (R0)
