Hadoop Based Parallel Deduplication Method for Web Documents

Song, Junjie; Liu, Jin; Zheng, Yuhui

doi:10.1007/978-981-10-7605-3_82

Part of the book series: Lecture Notes in Electrical Engineering (LNEE, volume 474)

Abstract

This paper proposes a method for removing duplicate web pages using tf-idf and a splay tree. Pages that may be duplicates of one another are first placed in the same group according to keywords extracted by TextRank. The pages in each group are then compared using the tf-idf and splay tree method. Three MapReduce tasks are used to calculate tf-idf and to delete the duplicate pages. The experimental results show that the algorithm removes duplicate web pages efficiently and accurately.
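
As a rough illustration of the tf-idf comparison mentioned in the abstract, the following Python sketch scores toy documents and flags near-identical pairs. It is not the authors' implementation: the whitespace tokenizer, the smoothed idf formula, the cosine-similarity test, the 0.9 threshold, and the function names `tfidf_vectors`, `cosine`, and `find_duplicates` are all assumptions made for this example, and the TextRank grouping, splay tree, and MapReduce stages are omitted.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document.

    Assumption: whitespace tokenization and a smoothed idf,
    log((1 + N) / (1 + df)) + 1, chosen only for this sketch.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append({
            term: (count / total)
            * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def find_duplicates(docs, threshold=0.9):
    """Return index pairs (i, j), i < j, whose similarity meets the threshold.

    The threshold value is a hypothetical choice, not taken from the paper.
    """
    vectors = tfidf_vectors(docs)
    pairs = []
    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                pairs.append((i, j))
    return pairs


pages = [
    "hadoop splay tree page",
    "hadoop splay tree page",
    "weather report today",
]
duplicate_pairs = find_duplicates(pages)
```

Here the first two toy pages are identical and share no terms with the third, so only that pair is flagged. The pairwise loop is quadratic in the number of pages, which is one reason a method like the one described would first group candidate pages by keyword before comparing them.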

Author information

Authors and Affiliations

College of Information, Shanghai Maritime University, Shanghai, China
Junjie Song & Jin Liu
School of Computer and Software, Nanjing University of Information Science and Technology, Nanjing, 210044, China
Yuhui Zheng

Corresponding author

Correspondence to Yuhui Zheng.

Editor information

Editors and Affiliations

Department of Computer Science and Engineering, Seoul University of Science and Technology, Seoul, Korea (Republic of)
James J. Park
Department of Business Science, University of Salerno, Salerno, Italy
Vincenzo Loia
Department of Multimedia Engineering, Dongguk University, Seoul, Soul-t’ukpyolsi, Korea (Republic of)
Gangman Yi
Department of Multimedia Engineering, Dongguk University, Seoul, Soul-t’ukpyolsi, Korea (Republic of)
Yunsick Sung

Cite this paper

Song, J., Liu, J., Zheng, Y. (2018). Hadoop Based Parallel Deduplication Method for Web Documents. In: Park, J., Loia, V., Yi, G., Sung, Y. (eds) Advances in Computer Science and Ubiquitous Computing. CUTE CSA 2017. Lecture Notes in Electrical Engineering, vol 474. Springer, Singapore. https://doi.org/10.1007/978-981-10-7605-3_82

DOI: https://doi.org/10.1007/978-981-10-7605-3_82
Published: 20 December 2017
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-7604-6
Online ISBN: 978-981-10-7605-3
eBook Packages: Engineering, Engineering (R0)
