Implementation of a System for Fast Text Search and Document Comparison

Wielgosz, Maciej; Janiszewski, Marcin; Russek, Pawel; Pietron, Marcin; Jamro, Ernest; Wiatr, Kazimierz

doi:10.1007/978-3-319-04714-0_11

Part of the book series: Studies in Computational Intelligence (SCI, volume 541)

Abstract

This chapter presents the architecture of a system for fast text search and document comparison, with the main focus on an N-gram-based algorithm and its parallel implementation. The algorithm, one of several computational procedures implemented in the system, generates a fingerprint of each analyzed document as a set of hashes that represent the file. This work examines the performance of the system in terms of both file comparison quality and fingerprint generation. Several tests of the N-gram-based algorithm on an Intel Xeon E5645 (2.40 GHz) show an approximately 8x speedup of the multi-core implementation over the single-core one.
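The fingerprinting idea in the abstract can be illustrated with a minimal sketch. It is not the chapter's implementation: the word-level N-grams, the MD5-based hash truncated to 64 bits, the Jaccard overlap used for comparison, and the function names `ngram_fingerprint` and `similarity` are all assumptions chosen for illustration.

```python
import hashlib


def ngram_fingerprint(text: str, n: int = 4) -> set:
    """Return the set of hashes of all word-level N-grams in `text`.

    Assumption: N-grams are built from lowercased, whitespace-separated
    words; the chapter may tokenize and hash differently.
    """
    words = text.lower().split()
    fingerprint = set()
    for i in range(len(words) - n + 1):
        gram = " ".join(words[i:i + n])
        digest = hashlib.md5(gram.encode("utf-8")).digest()
        # Keep the first 8 bytes so each hash fits in a 64-bit integer.
        fingerprint.add(int.from_bytes(digest[:8], "big"))
    return fingerprint


def similarity(fp_a: set, fp_b: set) -> float:
    """Jaccard overlap of two fingerprints (assumed comparison measure)."""
    union = fp_a | fp_b
    if not union:
        return 0.0
    return len(fp_a & fp_b) / len(union)


if __name__ == "__main__":
    doc_a = ngram_fingerprint("the quick brown fox jumps", n=2)
    doc_b = ngram_fingerprint("the quick brown cat jumps", n=2)
    print(similarity(doc_a, doc_b))
```

Because each document is fingerprinted independently of the others, one call per document is a natural unit of work to distribute across cores.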

References

1. Jalil, Z., Mirza, A.M., Iqbal, T.: A zero-watermarking algorithm for text documents based on structural components. In: International Conference on Information and Emerging Technologies (ICIET). vol. 1(5), pp. 14–16, June 2010
2. Schleimer, S., Wilkerson, D.S., Aiken, A.: Winnowing: local algorithms for document fingerprinting. In: Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data (SIGMOD’03). pp. 76–85
3. Miller, E., Shen, D., Liu, J., Nicholas, Ch., Chen, T.: Techniques for gigabyte-scale N-gram based information retrieval on personal computers. In: Proceedings of the 1999 International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA’99)
4. Heintze, N.: Scalable document fingerprinting. In: Proceedings USENIX Workshop on Electronic Commerce. 1996
5. Forner, P., Karlgren, J., Womser-Hacker, Ch., Potthast, M., Gollub, T., Hagen, M., Graßegger, J., Kiesel, J., Michel, M., Oberländer, A., Barrón-Cedeño, A., Gupta, P., Rosso, P., Stein, B.: Overview of the 4th international competition on plagiarism detection. In: Forner, P., Karlgren, J., Womser-Hacker, C. (eds.) Notebook Papers of CLEF 2012 LABs and Workshops, CLEF-2012 17–20 September. Rome, Italy
6. Chin, O.S., Kulathuramaiyer, N., Yeo, A.W.: Automatic discovery of concepts from text. In: IEEE/WIC/ACM International Conference on Web Intelligence. pp. 1046–1049. 18–22 Dec 2006
7. Amine, A., Elberrichi, Z., Simonet, M., Malki, M.: WordNet-based and N-grams-based document clustering: a comparative study, broadband communications. In: 3rd International Conference on Information Technology and Biomedical Applications. pp. 394–401. 23–26 Nov 2008
8. Synat NCBiR Project: SP/I/1/77065/10. http://synat.pl
9. Stevenson, M., Greenwood, M.A.: Learning information extraction patterns using WordNet. In: Proceedings of the 5th International Conference on Language Resources and Evaluations, LREC 2006. pp. 95–102. 22–28 May 2006
10. Natural Language Toolkit. http://nltk.org/

Acknowledgments

The work presented in this chapter was financed through the Synat research program [8].

Author information

Authors and Affiliations

AGH University of Science and Technology, Al. Mickiewicza 30, 30-059, Krakow, Poland
Maciej Wielgosz, Pawel Russek, Ernest Jamro & Kazimierz Wiatr
ACK Cyfronet AGH, Ul. Nawojki 11, 30-950, Krakow, Poland
Marcin Janiszewski & Marcin Pietron

Corresponding author

Correspondence to Maciej Wielgosz.

Editor information

Editors and Affiliations

Faculty of Electronics and Information Technology, Warsaw University of Technology, Institute of Computer Science, Warsaw, Poland
Robert Bembenik
Faculty of Electronics and Information Technology, Warsaw University of Technology, Institute of Computer Science, Warsaw, Poland
Łukasz Skonieczny
Faculty of Electronics and Information Technology, Warsaw University of Technology, Institute of Computer Science, Warsaw, Poland
Henryk Rybiński
Faculty of Electronics and Information Technology, Warsaw University of Technology, Institute of Computer Science, Warsaw, Poland
Marzena Kryszkiewicz
Interdisciplinary Centre for Mathematical and Computational Modelling (ICM), University of Warsaw, Warsaw, Poland
Marek Niezgódka

About this chapter

Cite this chapter

Wielgosz, M., Janiszewski, M., Russek, P., Pietron, M., Jamro, E., Wiatr, K. (2014). Implementation of a System for Fast Text Search and Document Comparison. In: Bembenik, R., Skonieczny, Ł., Rybiński, H., Kryszkiewicz, M., Niezgódka, M. (eds) Intelligent Tools for Building a Scientific Information Platform: From Research to Implementation. Studies in Computational Intelligence, vol 541. Springer, Cham. https://doi.org/10.1007/978-3-319-04714-0_11

DOI: https://doi.org/10.1007/978-3-319-04714-0_11
Published: 27 February 2014
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-04713-3
Online ISBN: 978-3-319-04714-0
eBook Packages: Engineering, Engineering (R0)
