Learning URL Normalization Rules Using Multiple Alignment of Sequences

  • Conference paper
String Processing and Information Retrieval (SPIRE 2013)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 8214)

Abstract

In this work, we present DUSTER, a new approach to detecting and eliminating redundant content when crawling the web. DUSTER uses a multiple sequence alignment strategy to learn rewriting rules that transform URLs into other URLs likely to have similar content, where such URLs exist. We show that the alignment strategy can lead to a reduction in the number of duplicate URLs that is 54% larger than the one achieved by our best baseline.
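The abstract does not describe the algorithm in detail, so the following is only a rough illustration of the general idea of aligning duplicate URLs to obtain a generalized pattern. It is not DUSTER itself. It assumes a simple tokenizer and uses Python's `difflib.SequenceMatcher` folded pairwise over the cluster as a stand-in for a real multiple sequence alignment. The function names (`tokenize`, `align_pair`, `consensus_pattern`) and the example URLs are hypothetical.

```python
import re
from difflib import SequenceMatcher


def tokenize(url):
    """Split a URL into alphanumeric runs and single delimiter characters."""
    return re.findall(r"[A-Za-z0-9]+|[^A-Za-z0-9]", url)


def align_pair(a, b):
    """Align two token lists; every region where they differ becomes one '*'."""
    pattern = []
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag == "equal":
            pattern.extend(a[i1:i2])
        elif not pattern or pattern[-1] != "*":
            # Collapse any insert/delete/replace region into a single wildcard.
            pattern.append("*")
    return pattern


def consensus_pattern(urls):
    """Fold pairwise alignments over a cluster of URLs assumed to be duplicates.

    This is a simplification: a true multiple sequence alignment would
    typically align all sequences together rather than one after another.
    """
    token_lists = [tokenize(u) for u in urls]
    pattern = token_lists[0]
    for tokens in token_lists[1:]:
        pattern = align_pair(pattern, tokens)
    return "".join(pattern)


# Hypothetical cluster of URLs assumed to point at the same content.
cluster = [
    "http://example.com/a/1/x.html",
    "http://example.com/a/2/x.html",
]
print(consensus_pattern(cluster))  # http://example.com/a/*/x.html
```

A pattern like this only marks which URL components vary inside a duplicate cluster. Turning it into a rewriting rule would also require choosing a canonical target URL, which this sketch does not attempt.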

The original version of this chapter was revised: The copyright line was incorrect. This has been corrected. The Erratum to this chapter is available at DOI: 10.1007/978-3-319-02432-5_33

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Lima Rodrigues, K.W., Cristo, M., de Moura, E.S., da Silva, A.S. (2013). Learning URL Normalization Rules Using Multiple Alignment of Sequences. In: Kurland, O., Lewenstein, M., Porat, E. (eds) String Processing and Information Retrieval. SPIRE 2013. Lecture Notes in Computer Science, vol 8214. Springer, Cham. https://doi.org/10.1007/978-3-319-02432-5_23

  • DOI: https://doi.org/10.1007/978-3-319-02432-5_23

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-02431-8

  • Online ISBN: 978-3-319-02432-5

  • eBook Packages: Computer Science, Computer Science (R0)
