Duplicate Page Detection Algorithm Based on the Field Characteristic Clustering

Ye, Feiyue; Liu, Junlei; Liu, Bing; Chai, Kun

doi:10.1007/978-3-642-20539-2_9

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 6537)

Included in the following conference series:

International Conference on Web-Based Learning

Abstract

Speed and accuracy are crucial for cognition-based interactive computing in a web-wisdom information retrieval system. In this paper, we analyze the common duplicate detection algorithms, identify their drawbacks, and propose a new duplicate detection algorithm based on field characteristic clustering. By using field knowledge to build a characteristic string for each page and applying an improved k-means clustering algorithm, we shorten the comparison step of duplicate detection. We then compare the algorithm experimentally with the traditional SCAM and DSC algorithms in terms of time consumption, precision, and recall. The results show that the algorithm reduces the time and storage consumption of the traditional SCAM algorithm. Compared with the DSC algorithm, it reduces the inaccuracy introduced by using shingles to represent a page during duplicate detection. We conclude that the duplicate detection algorithm based on field characteristic clustering improves precision and recall in web duplicate page detection and will improve the speed and accuracy of a web-wisdom information retrieval system.
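
The abstract does not give the construction of the characteristic string or the details of the improved k-means step, so the following is only a minimal sketch of the general idea: describe each page by counts of domain terms, cluster the pages, and compare pages only within the same cluster. The vocabulary `FIELD_TERMS`, the cosine threshold, and all function names are assumptions made for illustration, and the clustering shown is plain k-means rather than the improved variant the paper refers to.

```python
import math
from itertools import combinations

# Hypothetical domain vocabulary standing in for the paper's "field knowledge".
FIELD_TERMS = ["course", "lecture", "exam", "price", "shipping", "cart"]

def feature_vector(text):
    """Count each field term in the page (a stand-in for the characteristic string)."""
    words = text.lower().split()
    return [words.count(term) for term in FIELD_TERMS]

def cosine(a, b):
    """Cosine similarity of two count vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def kmeans(vectors, k, iterations=10):
    """Plain k-means seeded with the first k vectors (not the improved variant)."""
    centers = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assign every vector to its nearest center.
        labels = [min(range(k), key=lambda c: math.dist(v, centers[c]))
                  for v in vectors]
        # Move each center to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def find_duplicates(pages, k=2, threshold=0.9):
    """Return index pairs that share a cluster and exceed the similarity threshold."""
    vectors = [feature_vector(p) for p in pages]
    labels = kmeans(vectors, k)
    return [(i, j) for i, j in combinations(range(len(pages)), 2)
            if labels[i] == labels[j] and cosine(vectors[i], vectors[j]) >= threshold]

pages = [
    "course lecture exam course schedule",
    "buy now price shipping cart price",
    "course lecture exam timetable",
    "price shipping cart checkout price",
]
print(find_duplicates(pages))  # [(0, 2), (1, 3)]
```

Restricting the pairwise comparison to pages in the same cluster is what saves time in this sketch: pairs that fall in different clusters are never scored.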

References

Bai, G.: Research and Application on Automatic Detect Duplication Technology in Internet. Chinese Academy of Science graduate school (2006)
Li, X., Yan, H., Wang, W.: The Principle, Technology and System of the Search Engine. Science Press, Beijing (2005)
Xie, H., Qin, J.: Study on the Duplicated Web Pages Detection Algorithm with Meta Search Engine. Computer System and Application 17(8) (2005)
Yao, M.: Research on Duplicate Page Detection Technology in Internet Based on Document Clustering. Beijing Jiaotong University (2008)
Shivakumar, N., Garcia-Molina, H.: SCAM: A Copy Detection Mechanism for Digital Documents. In: Proceedings of the Second Annual Conference on the Theory and Practice of Digital Libraries (1995)
Elhadi, M., Al-Tobi, A.: Use of Text Syntactical Structures in Detection of Document Duplicates. In: 3rd International Conference on Digital Information Management, ICDIM 2008, pp. 520–525 (2008)
Broder, A.Z., Glassman, S.C., Manasse, M.S.: Syntactic Clustering of the Web. In: Proceedings of the 6th International World Wide Web Conference (1997)
Zhang, Z., Chen, J., Li, X.: An Approach to Reduce Noise in HTML Pages. Journal of the China Society for Scientific and Technical Information 23(4), 387–393 (2004)
Xin, C., Wang, X.: Method of Parallel Removing Duplicates in Large Scale Chinese Web Pages Based on Feature Code. Harbin Institute of Technology (2008)
Zhang, C., Xia, S.: K-means Clustering Algorithm with Improved Initial Center. In: Second International Workshop on Knowledge Discovery and Data Mining, WKDD (2009)

Author information

Authors and Affiliations

Dept. of Computer Engineering and Science, Shanghai University, Shanghai, China
Feiyue Ye, Junlei Liu, Bing Liu & Kun Chai

Editor information

Editors and Affiliations

Shanghai University, Xingjian Building, No. 149 Yanchang Road, 200072, Shanghai, China
Xiangfeng Luo
Information Systems and Databases, RWTH Aachen University, Ahornstr. 55, 52056, Aachen, Germany
Yiwei Cao
School of Computer Science and Engineering, University of Electronic Science and Technology of China, No. 2006 Xiyuan Avenue, High-Tech Zone (West), 611731, Chengdu, China
Bo Yang
Knowledge Grid Lab, Hunan University of Science and Technology, 411202, Xiangtan, Hunan, China
Jianxun Liu
School of Computer Engineering and Science, Shanghai University, Xingjian Building, No. 149, Yanchang Road, 200072, Shanghai, China
Feiyue Ye

About this paper

Cite this paper

Ye, F., Liu, J., Liu, B., Chai, K. (2011). Duplicate Page Detection Algorithm Based on the Field Characteristic Clustering. In: Luo, X., Cao, Y., Yang, B., Liu, J., Ye, F. (eds) New Horizons in Web-Based Learning - ICWL 2010 Workshops. ICWL 2010. Lecture Notes in Computer Science, vol 6537. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-20539-2_9

DOI: https://doi.org/10.1007/978-3-642-20539-2_9
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-20538-5
Online ISBN: 978-3-642-20539-2
eBook Packages: Computer Science, Computer Science (R0)
