Abstract
The speed and accuracy of cognition-based interactive computing are crucial to a web wisdom information retrieval system. In this paper, we analyze common duplicate detection algorithms, identify their drawbacks, and propose a new duplicate detection algorithm based on field characteristic clustering. By using field knowledge to build a characteristic string and applying an improved k-means clustering algorithm, we shorten the comparison step of duplicate detection. Finally, we experimentally compare the proposed algorithm with the traditional SCAM and DSC algorithms on time consumption, precision, and recall. The results show that the proposed algorithm reduces the time and storage consumption of the traditional SCAM algorithm. Compared with the DSC algorithm, it reduces the inaccuracy introduced by using shingles to represent a page during duplicate detection. We conclude that the duplicate detection algorithm based on field characteristic clustering raises precision and recall in web duplicate page detection and will improve the speed and accuracy of a web wisdom information retrieval system.
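The abstract does not specify how the characteristic string is built or how k-means is improved, so the following is only a minimal sketch of the general idea: cluster pages by field-term features first, then compare pages only within the same cluster. The vocabulary `FIELD_TERMS`, the farthest-first seeding in `initial_centers`, the Jaccard word-overlap measure, and the threshold of 0.8 are all assumptions made for illustration, not the authors' method.

```python
from itertools import combinations

# Hypothetical field vocabulary; the paper's actual field knowledge is not given.
FIELD_TERMS = ["python", "java", "compiler", "football", "goal", "league"]


def characteristic_vector(text):
    """Count how often each field term occurs in the page text."""
    words = text.lower().split()
    return [words.count(term) for term in FIELD_TERMS]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def initial_centers(vectors, k):
    """Farthest-first seeding: an assumed, deterministic stand-in for the
    'improved' initial-center selection mentioned in the abstract."""
    centers = [vectors[0]]
    while len(centers) < k:
        centers.append(
            max(vectors, key=lambda v: min(sq_dist(v, c) for c in centers))
        )
    return centers


def kmeans(vectors, k, iterations=10):
    """Plain k-means; returns one cluster label per input vector."""
    centers = initial_centers(vectors, k)
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assign each vector to its nearest center.
        labels = [
            min(range(k), key=lambda j: sq_dist(v, centers[j])) for v in vectors
        ]
        # Recompute each center as the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def jaccard(a, b):
    """Jaccard similarity of the word sets of two texts."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0


def find_duplicates(pages, k=2, threshold=0.8):
    """Return index pairs of likely duplicates, comparing only within clusters."""
    labels = kmeans([characteristic_vector(p) for p in pages], k)
    return [
        (i, j)
        for i, j in combinations(range(len(pages)), 2)
        if labels[i] == labels[j] and jaccard(pages[i], pages[j]) >= threshold
    ]
```

Restricting the pairwise comparison to pages that share a cluster is what cuts the comparison cost: pages in different clusters are never compared, at the risk of missing duplicates that the clustering step separates.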
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Ye, F., Liu, J., Liu, B., Chai, K. (2011). Duplicate Page Detection Algorithm Based on the Field Characteristic Clustering. In: Luo, X., Cao, Y., Yang, B., Liu, J., Ye, F. (eds) New Horizons in Web-Based Learning - ICWL 2010 Workshops. ICWL 2010. Lecture Notes in Computer Science, vol 6537. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-20539-2_9
DOI: https://doi.org/10.1007/978-3-642-20539-2_9
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-20538-5
Online ISBN: 978-3-642-20539-2
eBook Packages: Computer Science, Computer Science (R0)