Abstract
This chapter presents the architecture of a system for fast text search and document comparison, with the main focus on an N-gram-based algorithm and its parallel implementation. The algorithm, one of several computational procedures implemented in the system, generates a fingerprint of each analyzed document as a set of hashes that represents the file. The chapter examines the performance of the system in terms of both file comparison quality and fingerprint generation. Several tests of the N-gram-based algorithm were conducted on an Intel Xeon E5645 (2.40 GHz); they show an approximately 8x speedup of the multi-core implementation over the single-core one.
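To make the idea of a fingerprint as a set of hashes concrete, the following is a minimal sketch and not the chapter's implementation. It assumes word-level N-grams with N = 4, a 64-bit BLAKE2b hash, and Jaccard similarity as the comparison measure; the abstract specifies none of these choices, and the function names `ngram_fingerprint` and `similarity` are hypothetical.

```python
import hashlib
import re


def ngram_fingerprint(text, n=4):
    """Return the set of 64-bit hashes of all word n-grams in `text`.

    Assumption: word-level n-grams and BLAKE2b are illustrative choices,
    not taken from the chapter.
    """
    words = re.findall(r"\w+", text.lower())
    hashes = set()
    # Slide a window of n words over the text and hash each window.
    for i in range(len(words) - n + 1):
        gram = " ".join(words[i:i + n])
        digest = hashlib.blake2b(gram.encode("utf-8"), digest_size=8).digest()
        hashes.add(int.from_bytes(digest, "big"))
    return hashes


def similarity(fp_a, fp_b):
    """Jaccard similarity of two fingerprints (0.0 when both are empty)."""
    if not fp_a and not fp_b:
        return 0.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)


doc_a = "the quick brown fox jumps over the lazy dog"
doc_b = "the quick brown fox jumps over the sleepy cat"
score = similarity(ngram_fingerprint(doc_a), ngram_fingerprint(doc_b))
```

Because each document's fingerprint is computed independently of the others, this step could in principle be distributed across cores, one document or one chunk per worker. How the chapter's own implementation partitions the work is not stated in the abstract.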
References
Jalil, Z., Mirza, A.M., Iqbal, T.: A zero-watermarking algorithm for text documents based on structural components. In: International Conference on Information and Emerging Technologies (ICIET). vol. 1(5), pp. 14–16, June 2010
Schleimer, S., Wilkerson, D.S., Aiken, A.: Winnowing: local algorithms for document fingerprinting. In: Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data (SIGMOD’03). pp. 76–85
Miller, E., Shen, D., Liu, J., Nicholas, Ch., Chen, T.: Techniques for gigabyte-scale N-gram based information retrieval on personal computers. In: Proceedings of the 1999 International Conference on Parallel and Distributed Processing Techniques and applications—PDPTA’99
Heintze, N.: Scalable document fingerprinting. In: Proceedings USENIX Workshop on Electronic Commerce. 1996
Forner, P., Karlgren, J., Womser-Hacker, Ch., Potthast, M., Gollub, T., Hagen, M., Graßegger, J., Kiesel, J., Michel, M., Oberländer, A., Barrón-Cedeño, A., Gupta, P., Rosso, P., Stein, B.: Overview of the 4th international competition on plagiarism detection. In: Forner, P., Karlgren, J., Womser-Hacker, C. (eds.) Notebook Papers of CLEF 2012 LABs and Workshops, CLEF-2012 17–20 September. Rome, Italy
Chin, O.S., Kulathuramaiyer, N., Yeo, A.W.: Automatic discovery of concepts from text. In: IEEE/WIC/ACM International Conference on Web Intelligence. pp. 1046–1049. 18–22 Dec 2006
Amine, A., Elberrichi, Z., Simonet, M., Malki, M.: WordNet-based and N-grams-based document clustering: a comparative study, broadband communications. In: 3rd International Conference on Information Technology and Biomedical Applications. pp. 394–401. 23–26 Nov 2008
Synat NCBiR Project: SP/I/1/77065/10. http://synat.pl
Stevenson, M., Greenwood, M.A.: Learning information extraction patterns using WordNet. In: Proceedings of the 5th International Conference on Language Resources and Evaluations, LREC 2006. pp. 95–102. 22–28 May 2006
Natural Language Toolkit. http://nltk.org/
Acknowledgments
The work presented in this chapter was financed through the Synat research program [8].
Copyright information
© 2014 Springer International Publishing Switzerland
About this chapter
Cite this chapter
Wielgosz, M., Janiszewski, M., Russek, P., Pietron, M., Jamro, E., Wiatr, K. (2014). Implementation of a System for Fast Text Search and Document Comparison. In: Bembenik, R., Skonieczny, Ł., Rybiński, H., Kryszkiewicz, M., Niezgódka, M. (eds) Intelligent Tools for Building a Scientific Information Platform: From Research to Implementation. Studies in Computational Intelligence, vol 541. Springer, Cham. https://doi.org/10.1007/978-3-319-04714-0_11
DOI: https://doi.org/10.1007/978-3-319-04714-0_11
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-04713-3
Online ISBN: 978-3-319-04714-0
eBook Packages: Engineering, Engineering (R0)