Comparison of Text-Similarity Metrics for the Purpose of Identifying Identical Web Pages During Automated Web Application Testing

Zachara, Marek; Pałka, Dariusz

doi:10.1007/978-3-319-28561-0_3

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 430)

Abstract

This paper evaluates the effectiveness of several algorithms for assessing text similarity. The purpose of the evaluation is to determine the best methods for comparing and identifying near-identical web pages. Such comparison is in turn a prerequisite for building new automated testing tools and security scanners. The goal is to build scanners that can automatically test web application behavior for a wide range of supplied parameters (a technique known as fuzzing). Such testing requires generating and processing requests on a massive scale, which in turn requires fast page comparison methods. The similarity comparison is performed on a shortened, tokenized version of each web page, using a test set of pages downloaded from popular websites. A methodology for evaluating similarity metrics is proposed, together with a quality metric for the intended task. Several tokenization strategies are also tested, and their impact on the final result is assessed.
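
The general idea of comparing tokenized pages can be illustrated with a minimal sketch. It is not the authors' implementation: the tokenization rule (keep only opening tag names), the choice of Jaccard similarity over token bigrams, and the names `TagTokenizer`, `tokenize`, and `jaccard` are all assumptions made for illustration. The abstract does not say which tokenization strategies or metrics performed best.

```python
from html.parser import HTMLParser


class TagTokenizer(HTMLParser):
    """Hypothetical tokenizer: keeps opening tag names, drops text and attributes."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # Each opening tag becomes one token; attribute values are ignored.
        self.tokens.append(tag)


def tokenize(html):
    """Reduce an HTML page to a short sequence of tag-name tokens."""
    parser = TagTokenizer()
    parser.feed(html)
    parser.close()
    return parser.tokens


def jaccard(tokens_a, tokens_b, q=2):
    """Jaccard similarity between the sets of token q-grams of two pages."""
    grams_a = {tuple(tokens_a[i:i + q]) for i in range(len(tokens_a) - q + 1)}
    grams_b = {tuple(tokens_b[i:i + q]) for i in range(len(tokens_b) - q + 1)}
    if not grams_a and not grams_b:
        return 1.0  # two pages with no q-grams are treated as identical
    return len(grams_a & grams_b) / len(grams_a | grams_b)


if __name__ == "__main__":
    page_a = "<html><body><h1>Hello</h1><p>first</p></body></html>"
    page_b = "<html><body><h1>Bye</h1><p>second</p></body></html>"
    page_c = "<html><body><ul><li>item</li></ul></body></html>"
    print(jaccard(tokenize(page_a), tokenize(page_b)))  # same structure
    print(jaccard(tokenize(page_a), tokenize(page_c)))  # different structure
```

In this sketch, two pages that differ only in their text content reduce to the same token sequence and score 1.0, while structurally different pages score lower. A fuzzing tool would still need a threshold for "near-identical", which this example leaves open.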

Author information

Authors and Affiliations

AGH University of Science and Technology, 30 Mickiewicza Avenue, Krakow, Poland
Marek Zachara & Dariusz Pałka

Corresponding author

Correspondence to Marek Zachara.

Editor information

Editors and Affiliations

Faculty of Computer Science and Management, Wrocław University of Technology, Wrocław, Poland
Adam Grzech
Faculty of Computer Science and Management, Wrocław University of Technology, Wrocław, Poland
Leszek Borzemski
Faculty of Computer Science and Management, Wrocław University of Technology, Wrocław, Poland
Jerzy Świątek
Faculty of Computer Science and Management, Wrocław University of Technology, Wrocław, Poland
Zofia Wilimowska

Cite this paper

Zachara, M., Pałka, D. (2016). Comparison of Text-Similarity Metrics for the Purpose of Identifying Identical Web Pages During Automated Web Application Testing. In: Grzech, A., Borzemski, L., Świątek, J., Wilimowska, Z. (eds) Information Systems Architecture and Technology: Proceedings of 36th International Conference on Information Systems Architecture and Technology – ISAT 2015 – Part II. Advances in Intelligent Systems and Computing, vol 430. Springer, Cham. https://doi.org/10.1007/978-3-319-28561-0_3

DOI: https://doi.org/10.1007/978-3-319-28561-0_3
Published: 24 February 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-28559-7
Online ISBN: 978-3-319-28561-0
eBook Packages: Engineering, Engineering (R0)
