Duplicate Page Detection Algorithm Based on the Field Characteristic Clustering

  • Conference paper
New Horizons in Web-Based Learning - ICWL 2010 Workshops (ICWL 2010)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 6537)

Abstract

The speed and accuracy of cognitive-based interactive computing are crucial in an information retrieval system for web wisdom. In this paper, we analyze common duplicate detection algorithms, identify their drawbacks, and propose a new duplicate detection algorithm based on field characteristic clustering. By using field knowledge to build the characteristic string and applying an improved k-means clustering algorithm, we shorten the comparison step of duplicate detection. Finally, we run an experiment that compares this algorithm with the traditional SCAM and DSC algorithms on time consumption, accuracy, and recall. The results show that this algorithm reduces time and storage consumption compared with the traditional SCAM algorithm. Compared with the DSC algorithm, it reduces the inaccuracy introduced by using shingles to represent a page during duplicate detection. We conclude that the duplicate detection algorithm based on field characteristic clustering raises precision and recall in web duplicate page detection and will improve the speed and accuracy of an information retrieval system for web wisdom.
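The abstract does not spell out how the characteristic string is built or how k-means is improved, so the following is only a minimal sketch of the general idea of clustering pages first and comparing them only within a cluster. Everything in it is an assumption made for illustration: the field vocabulary `FIELD_TERMS`, the term-count vector standing in for the characteristic string, plain Lloyd's k-means with farthest-point seeding in place of the improved variant, and the cosine threshold of 0.95.

```python
from collections import Counter
from itertools import combinations
import math
import re

# Hypothetical field vocabulary; the paper's actual field knowledge is not given.
FIELD_TERMS = ["course", "lecture", "exam", "student", "teacher"]

def characteristic_vector(text):
    """Count how often each field term occurs in the page text."""
    words = Counter(re.findall(r"[a-z]+", text.lower()))
    return [float(words[term]) for term in FIELD_TERMS]

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, k, iterations=20):
    """Plain Lloyd's k-means with farthest-point seeding (not the paper's variant)."""
    centers = [vectors[0]]
    while len(centers) < k:
        # Seed the next center with the vector farthest from all current centers.
        centers.append(max(vectors, key=lambda v: min(sq_dist(v, c) for c in centers)))
    labels = [0] * len(vectors)
    for _ in range(iterations):
        labels = [min(range(k), key=lambda j: sq_dist(v, centers[j])) for v in vectors]
        for j in range(k):
            members = [v for v, label in zip(vectors, labels) if label == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def cosine(a, b):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def find_duplicates(pages, k=2, threshold=0.95):
    """Return index pairs of likely duplicates, comparing only within a cluster."""
    vectors = [characteristic_vector(p) for p in pages]
    labels = kmeans(vectors, k)
    pairs = []
    for i, j in combinations(range(len(pages)), 2):
        # Pages in different clusters are never compared in detail.
        if labels[i] == labels[j] and cosine(vectors[i], vectors[j]) >= threshold:
            pairs.append((i, j))
    return pairs

pages = [
    "The course has one lecture and one exam for each student.",
    "The course has one lecture and one exam for every student!",
    "A teacher met another teacher; the teacher left.",
    "Teacher and student discuss the exam.",
]
print(find_duplicates(pages))  # prints [(0, 1)]
```

In this toy run the third page falls into its own cluster, so it is never compared against the others; the saving in pairwise comparisons is what the clustering step is meant to provide. The term-count vector is deliberately crude and would treat any two pages with identical field-term counts as duplicates.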


Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Ye, F., Liu, J., Liu, B., Chai, K. (2011). Duplicate Page Detection Algorithm Based on the Field Characteristic Clustering. In: Luo, X., Cao, Y., Yang, B., Liu, J., Ye, F. (eds) New Horizons in Web-Based Learning - ICWL 2010 Workshops. ICWL 2010. Lecture Notes in Computer Science, vol 6537. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-20539-2_9

  • DOI: https://doi.org/10.1007/978-3-642-20539-2_9

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-20538-5

  • Online ISBN: 978-3-642-20539-2

  • eBook Packages: Computer Science, Computer Science (R0)
