Abstract
This paper proposes a method of deleting duplicate web pages through tf-idf and splay tree. According to the keywords which are extracted by TextRank, those pages which may be duplicate copies will be sent to a group. Then these pages will be judged by the method above. We use three Map-Reduce tasks to ensure the method of calculating tf-idf and deleting duplicate web pages. The experiment result shows that the algorithm can remove duplicate web pages efficiently and accurately.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
References
Lopresti, D.P.: Models and algorithms for duplicate document detection. In: Proceedings of the Fifth International Conference on Document Analysis and Recognition, ICDAR 1999, pp. 297–300. IEEE (1999)
Jianyong, W., Zhengmao, X., Ming, L., et al.: Research and evaluation of near-replicas of Web pages detection algorithms. Chin. J. Electron. (2000)
Liu, S., Zhang, Y., Xia, Y., et al.: Duplicate web page elimination based on HTML and extraction of long sentence. Microcomput. Appl. (2009)
Salton, G., McGill, M.J.: Introduction to modern information retrieval (1986)
Salton, G., Fox, E.A., Wu, H.: Extended Boolean information retrieval. Commun. ACM 26(11), 1022–1036 (1983)
Salton, G., Buckley, C.: Term-weighting approaches in automatic text retrieval. Inf. Process. Manage. 24(5), 513–523 (1988)
Wan, J., Yu, W., Xu, X.: Design and implement of distributed document clustering based on MapReduce. In: Proceedings of the Second Symposium International Computer Science and Computational Technology (ISCSCT), Huangshan, PR China, pp. 278–280 (2009)
Mihalcea, R., Tarau, P.: TextRank: bringing order into texts. Association for Computational Linguistics (2004)
Page, L., Brin, S., Motwani, R., et al.: The PageRank citation ranking: bringing order to the web. Stanford InfoLab (1999)
Sleator, D.D., Tarjan, R.E.: Self-adjusting binary search trees. J. ACM (JACM) 32(3), 652–686 (1985)
Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. Commun. ACM 51(1), 107–113 (2008)
Broder, A.Z., Glassman, S.C., Manasse, M.S., et al.: Syntactic clustering of the web. Comput. Netw. ISDN Syst. 29(8–13), 1157–1166 (1997)
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2018 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Song, J., Liu, J., Zheng, Y. (2018). Hadoop Based Parallel Deduplication Method for Web Documents. In: Park, J., Loia, V., Yi, G., Sung, Y. (eds) Advances in Computer Science and Ubiquitous Computing. CUTE CSA 2017 2017. Lecture Notes in Electrical Engineering, vol 474. Springer, Singapore. https://doi.org/10.1007/978-981-10-7605-3_82
Download citation
DOI: https://doi.org/10.1007/978-981-10-7605-3_82
Published:
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-7604-6
Online ISBN: 978-981-10-7605-3
eBook Packages: EngineeringEngineering (R0)