Abstract
In this work, we present DUSTER, a new approach to detecting and eliminating redundant content when crawling the web. DUSTER uses a multi-sequence alignment strategy to learn rewriting rules that transform URLs into other URLs likely to have similar content, when such URLs exist. We show that the alignment strategy can lead to a reduction in the number of duplicate URLs 54% larger than the one achieved by our best baseline.
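To make the idea of learning a rewriting rule from aligned URLs concrete, the following is a minimal sketch and not the DUSTER algorithm itself. It assumes a simple tokenizer and replaces true multi-sequence alignment with repeated pairwise alignment from Python's standard `difflib`; the function names (`tokenize`, `align_pair`, `generalize`) and the `example.com` URLs are hypothetical and chosen only for illustration.

```python
import re
from difflib import SequenceMatcher


def tokenize(url):
    # Assumption: tokens are alphanumeric runs or single punctuation characters.
    return re.findall(r"[A-Za-z0-9]+|[^A-Za-z0-9]", url)


def align_pair(a, b):
    # Align two token lists; regions where they differ collapse into one "*".
    pattern = []
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    for op, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if op == "equal":
            pattern.extend(a[i1:i2])
        elif not pattern or pattern[-1] != "*":
            pattern.append("*")
    return pattern


def generalize(urls):
    # Fold the URLs into one pattern by aligning them one at a time.
    # This is a pairwise simplification, not a full multiple alignment.
    token_lists = [tokenize(u) for u in urls]
    pattern = token_lists[0]
    for tokens in token_lists[1:]:
        pattern = align_pair(pattern, tokens)
    return "".join(pattern)


# Hypothetical URLs assumed to point to the same content.
duplicates = [
    "http://example.com/story?id=1&sid=abc",
    "http://example.com/story?id=1&sid=xyz",
    "http://example.com/story?id=1&sid=q7",
]
rule = generalize(duplicates)
```

Under these assumptions, the wildcard in the resulting pattern marks the part of the URL that varies without changing the content, which is the kind of component a normalization rule could rewrite to a canonical value.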
The original version of this chapter was revised: The copyright line was incorrect. This has been corrected. The Erratum to this chapter is available at DOI: 10.1007/978-3-319-02432-5_33
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Lima Rodrigues, K.W., Cristo, M., de Moura, E.S., da Silva, A.S. (2013). Learning URL Normalization Rules Using Multiple Alignment of Sequences. In: Kurland, O., Lewenstein, M., Porat, E. (eds) String Processing and Information Retrieval. SPIRE 2013. Lecture Notes in Computer Science, vol 8214. Springer, Cham. https://doi.org/10.1007/978-3-319-02432-5_23
DOI: https://doi.org/10.1007/978-3-319-02432-5_23
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-02431-8
Online ISBN: 978-3-319-02432-5
eBook Packages: Computer Science, Computer Science (R0)