Remove-Duplicate Algorithm Based on Meta Search Result

Wang, Hongbin; He, Ming; Zhou, Lianke; Li, Zijin; Zhan, Haomin; Wang, Rang

doi:10.1007/978-3-030-00009-7_4

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11064)

Included in the following conference series:

International Conference on Cloud Computing and Security

Abstract

Based on the characteristics of duplicate web pages in meta search engine results, this paper proposes a duplicate web page detection algorithm that uses each page's URL, title, and abstract, applying a different similarity measure to each according to its characteristics. First, the algorithm normalizes the page URL. For title detection, it improves the title string fuzzy matching algorithm and calculates similarity based on the word frequency of each item in the query. For the abstract, similarity is computed sentence by sentence: the algorithm assigns three weights to each sentence and calculates a weighted similarity over the sentences of each abstract. Experiments verify that the algorithm outperforms the traditional algorithm in both precision and recall.
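
The three-stage pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the abstract does not specify the normalization rules, the improved fuzzy matching, or the three sentence weights, so the sketch substitutes `difflib.SequenceMatcher` for the title and sentence comparison and a single length-based weight per sentence. All function names, dictionary keys, and thresholds are hypothetical.

```python
import re
from difflib import SequenceMatcher
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Assumed rules: lowercase scheme/host, drop 'www.', default port,
    fragment, and trailing slash."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    # Keep the port only when it differs from the scheme's default.
    default_port = {"http": 80, "https": 443}.get(scheme)
    if parts.port is not None and parts.port != default_port:
        host = f"{host}:{parts.port}"
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((scheme, host, path, parts.query, ""))


def title_similarity(a: str, b: str) -> float:
    """Stand-in for the paper's improved fuzzy matching (not specified here)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def split_sentences(text: str) -> list[str]:
    """Split on common Latin and CJK sentence terminators."""
    return [s.strip() for s in re.split(r"[.!?。！？]+", text) if s.strip()]


def abstract_similarity(a: str, b: str) -> float:
    """Sentence-level similarity of a against b.

    Assumption: each sentence gets one weight proportional to its length;
    the paper assigns three weights per sentence, which are not given.
    """
    sents_a, sents_b = split_sentences(a), split_sentences(b)
    if not sents_a or not sents_b:
        return 0.0
    total_len = sum(len(s) for s in sents_a)
    score = 0.0
    for s in sents_a:
        # Best-matching sentence in the other abstract.
        best = max(SequenceMatcher(None, s.lower(), t.lower()).ratio()
                   for t in sents_b)
        score += (len(s) / total_len) * best
    return score


def is_duplicate(r1: dict, r2: dict,
                 title_threshold: float = 0.9,
                 abstract_threshold: float = 0.8) -> bool:
    """Two results are duplicates if their normalized URLs match, or if
    both title and abstract similarity reach the (hypothetical) thresholds."""
    if normalize_url(r1["url"]) == normalize_url(r2["url"]):
        return True
    return (title_similarity(r1["title"], r2["title"]) >= title_threshold
            and abstract_similarity(r1["abstract"], r2["abstract"])
            >= abstract_threshold)
```

Checking the URL first keeps the cheap exact comparison ahead of the string matching, so the costlier title and abstract comparisons run only for results whose normalized URLs differ.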

Acknowledgments

This work was funded by the National Natural Science Foundation of China (Grants No. 61772152 and No. 61502037), the Basic Research Project (No. JCKY2016206B001, JCKY2014206C002 and JCKY2017604C010), and the Technical Foundation Project (No. JSQB2017206C002).

Author information

Authors and Affiliations

College of Computer Science and Technology, Harbin Engineering University, Harbin, 150001, China
Hongbin Wang, Ming He, Lianke Zhou, Zijin Li & Rang Wang
College of Computer and Information Engineering, Heilongjiang University of Science and Technology, Harbin, 150022, China
Ming He
Beijing General Institute of Electronic Engineering, Beijing, 100854, China
Haomin Zhan

Corresponding author

Correspondence to Lianke Zhou.

Editor information

Editors and Affiliations

Nanjing University of Information Science and Technology, Nanjing, China
Xingming Sun
Nanjing University of Information Science and Technology, Nanjing, China
Zhaoqing Pan
Department of Computer Science, Purdue University, West Lafayette, IN, USA
Elisa Bertino

About this paper

Cite this paper

Wang, H., He, M., Zhou, L., Li, Z., Zhan, H., Wang, R. (2018). Remove-Duplicate Algorithm Based on Meta Search Result. In: Sun, X., Pan, Z., Bertino, E. (eds) Cloud Computing and Security. ICCCS 2018. Lecture Notes in Computer Science, vol 11064. Springer, Cham. https://doi.org/10.1007/978-3-030-00009-7_4

DOI: https://doi.org/10.1007/978-3-030-00009-7_4
Published: 21 September 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-00008-0
Online ISBN: 978-3-030-00009-7
eBook Packages: Computer Science, Computer Science (R0)