An FW-BF Based Approach on Elimination of Duplicated Web Pages

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9937)

Abstract

With the rapid growth of social networks, the Internet has become the most widely used information source. However, it contains a large number of duplicated web pages, most of which result from reprinting. Broder et al. ran an experiment on a collection of 30,000,000 HTML and text documents and found that nearly 18% of the pages were exactly the same and 41% of the pages shared 51% similarity. These replicas place a major burden on search engines and badly degrade their performance, so the elimination of duplicated web pages has become a hot topic in information retrieval in recent years. In this paper, we propose a function word (FW) based approach that uses the Bloom filter (BF) to eliminate duplicated web pages without extracting the main text of the page. Our approach has three separate stages. Stage 1 extracts sample text according to the function-word features of the web page. Stage 2 extracts the feature code using function words. Stage 3 eliminates duplicated web pages by calculating the similarity of their Bloom filters.
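The three stages can be illustrated with a minimal Python sketch. The abstract does not specify how feature codes are built or which similarity measure is applied to the Bloom filters, so everything below is an assumption made for illustration: the `feature_codes` helper (each function word joined with its neighbouring tokens), the filter size and hash count, and the use of Jaccard similarity over set bits are hypothetical choices, not the authors' method.

```python
import hashlib


def feature_codes(text, function_words):
    """Hypothetical feature extraction: each function word plus its neighbours."""
    tokens = text.split()
    codes = []
    for i, tok in enumerate(tokens):
        if tok.lower() in function_words:
            codes.append(" ".join(tokens[max(0, i - 1):i + 2]))
    return codes


class BloomFilter:
    """Bloom filter backed by a Python int used as a bit array."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits      # assumed size, not taken from the paper
        self.num_hashes = num_hashes  # assumed hash count, not taken from the paper
        self.bits = 0

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


def page_filter(codes):
    """Build one Bloom filter from the feature codes of a page."""
    bf = BloomFilter()
    for code in codes:
        bf.add(code)
    return bf


def bloom_similarity(a, b):
    """Jaccard similarity of the set bits (an assumed similarity measure)."""
    if (a.num_bits, a.num_hashes) != (b.num_bits, b.num_hashes):
        raise ValueError("filters must share size and hash count")
    union = bin(a.bits | b.bits).count("1")
    if union == 0:
        return 1.0  # two empty filters are treated as identical
    return bin(a.bits & b.bits).count("1") / union
```

Under these assumptions, two pages would be flagged as duplicates when `bloom_similarity` exceeds a chosen threshold; the threshold itself would have to be tuned on real data.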

Corresponding author

Correspondence to Zhengyou Xia.

Copyright information

© 2016 Springer International Publishing AG

About this paper

Cite this paper

Ma, L., Xia, Z. (2016). An FW-BF Based Approach on Elimination of Duplicated Web Pages. In: Yin, H., et al. Intelligent Data Engineering and Automated Learning – IDEAL 2016. IDEAL 2016. Lecture Notes in Computer Science, vol 9937. Springer, Cham. https://doi.org/10.1007/978-3-319-46257-8_20

  • DOI: https://doi.org/10.1007/978-3-319-46257-8_20

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-46256-1

  • Online ISBN: 978-3-319-46257-8

  • eBook Packages: Computer Science, Computer Science (R0)
