Learning URL Normalization Rules Using Multiple Alignment of Sequences

Lima Rodrigues, Kaio Wagner; Cristo, Marco; de Moura, Edleno Silva; da Silva, Altigran Soares

doi:10.1007/978-3-319-02432-5_23

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 8214)

Included in the following conference series:

International Symposium on String Processing and Information Retrieval

Abstract

In this work, we present DUSTER, a new approach to detecting and eliminating redundant content when crawling the web. DUSTER uses a multi-sequence alignment strategy to learn rewriting rules that transform URLs into other URLs likely to have similar content, whenever such URLs exist. We show that the alignment strategy can lead to a reduction in the number of duplicate URLs that is 54% larger than the one achieved by our best baseline.
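To make the idea of aligning URLs concrete, the following is a minimal sketch and not DUSTER itself, whose algorithm the abstract does not detail. It assumes a cluster of URLs already known to share content, tokenizes each URL, and aligns the token sequences progressively with Python's `difflib.SequenceMatcher` as a stand-in aligner. The helper names (`tokenize`, `align_pair`, `consensus_of`, `to_regex`) and the `example.com` URLs are hypothetical.

```python
import re
from difflib import SequenceMatcher


def tokenize(url):
    """Split a URL into alphanumeric runs and single punctuation characters."""
    return re.findall(r"[A-Za-z0-9]+|[^A-Za-z0-9]", url)


def align_pair(a, b):
    """Align two token lists; None marks a position where they disagree."""
    consensus = []
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag == "equal":
            consensus.extend(a[i1:i2])
        elif not consensus or consensus[-1] is not None:
            # Mismatch or gap: collapse the whole region into one wildcard.
            consensus.append(None)
    return consensus


def consensus_of(urls):
    """Progressively fold every URL of a cluster into a single consensus."""
    token_lists = [tokenize(u) for u in urls]
    consensus = token_lists[0]
    for tokens in token_lists[1:]:
        consensus = align_pair(consensus, tokens)
    return consensus


def to_regex(consensus):
    """Turn a consensus into a regex; wildcards match any substring (an assumption)."""
    return "".join(".*" if tok is None else re.escape(tok) for tok in consensus)


if __name__ == "__main__":
    # Hypothetical cluster: same page, differing only in a session parameter.
    cluster = [
        "http://example.com/story?id=42&sid=abc",
        "http://example.com/story?id=42&sid=xyz",
        "http://example.com/story?id=42&sid=q7",
    ]
    pattern = to_regex(consensus_of(cluster))
    print(pattern)
    print(bool(re.fullmatch(pattern, "http://example.com/story?id=42&sid=new")))
```

In this toy cluster only the value of the session parameter varies, so the consensus keeps every other token and marks that one position as variable. A rewriting rule could then map any URL matching the pattern to a single canonical form. How DUSTER derives and validates its rules is described in the paper, not here.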

The original version of this chapter was revised: The copyright line was incorrect. This has been corrected. The Erratum to this chapter is available at DOI: 10.1007/978-3-319-02432-5_33

Author information

Authors and Affiliations

Department of Computer Science, Universidade Federal do Amazonas, Manaus, Brazil
Kaio Wagner Lima Rodrigues, Marco Cristo, Edleno Silva de Moura & Altigran Soares da Silva

Editor information

Editors and Affiliations

Faculty of Industrial Engineering and Management Technion, Technion Institute of Technology, Bloomfield Hall 308, 32000, Haifa, Israel
Oren Kurland
Bar-Ilan University, Israel
Moshe Lewenstein
Department of Computer Science, Bar-Ilan University, 52900, Ramat-Gan, Israel
Ely Porat

About this paper

Cite this paper

Lima Rodrigues, K.W., Cristo, M., de Moura, E.S., da Silva, A.S. (2013). Learning URL Normalization Rules Using Multiple Alignment of Sequences. In: Kurland, O., Lewenstein, M., Porat, E. (eds) String Processing and Information Retrieval. SPIRE 2013. Lecture Notes in Computer Science, vol 8214. Springer, Cham. https://doi.org/10.1007/978-3-319-02432-5_23

DOI: https://doi.org/10.1007/978-3-319-02432-5_23
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-02431-8
Online ISBN: 978-3-319-02432-5
eBook Packages: Computer Science, Computer Science (R0)
