Abstract
This paper evaluates the effectiveness of several algorithms for assessing text similarity. The purpose of the evaluation is to determine the best methods for comparing web pages and identifying near-identical ones. Such comparison is in turn a prerequisite for building new automated testing tools and security scanners. The goal is to build scanners that can automatically test web application behavior across a large range of supplied parameters (a technique known as fuzzing). Such testing requires generating and processing requests on a massive scale, which in turn requires fast page comparison methods. The similarity comparison is performed on a shortened, tokenized version of each web page, using a test set of pages downloaded from popular websites. A methodology for evaluating similarity metrics is proposed, together with a quality metric for the intended task. Several tokenization strategies are also tested, and their impact on the final result is assessed.
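To make the idea of comparing shortened, tokenized pages concrete, the following is a minimal sketch rather than the authors' implementation. It assumes one possible tokenization strategy (keeping only HTML tag names and discarding text and attributes) and two illustrative similarity measures: the Jaccard index on token sets and the `difflib.SequenceMatcher` ratio on token sequences. The names `TagTokenizer`, `tokenize`, `jaccard`, and `sequence_ratio` are hypothetical and chosen only for this illustration.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class TagTokenizer(HTMLParser):
    """Collects opening and closing tag names; text and attributes are dropped.

    This is an assumed tokenization strategy, used here only for illustration.
    """

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(tag)

    def handle_endtag(self, tag):
        self.tokens.append("/" + tag)


def tokenize(html):
    """Return the page as a list of tag tokens, e.g. ['html', 'body', '/body', '/html']."""
    parser = TagTokenizer()
    parser.feed(html)
    parser.close()
    return parser.tokens


def jaccard(a, b):
    """Jaccard index of two token lists, treated as sets (order and counts ignored)."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0  # two empty pages are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


def sequence_ratio(a, b):
    """Order-sensitive similarity of two token lists, in the range [0, 1]."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


if __name__ == "__main__":
    page_a = "<html><body><h1>Hi</h1><p>one</p></body></html>"
    page_b = "<html><body><h1>Hello</h1><p>two</p></body></html>"
    page_c = "<html><body><ul><li>x</li></ul></body></html>"

    tokens_a, tokens_b, tokens_c = map(tokenize, (page_a, page_b, page_c))
    print("a vs b:", jaccard(tokens_a, tokens_b), sequence_ratio(tokens_a, tokens_b))
    print("a vs c:", jaccard(tokens_a, tokens_c), sequence_ratio(tokens_a, tokens_c))
```

In this sketch, pages that differ only in their text content produce identical token streams and therefore score 1.0 under both measures, while pages with a different tag structure score lower. A set-based measure is cheap but ignores ordering; a sequence-based measure is order-sensitive but more expensive, which matters when many responses must be compared during fuzzing.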
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Zachara, M., Pałka, D. (2016). Comparison of Text-Similarity Metrics for the Purpose of Identifying Identical Web Pages During Automated Web Application Testing. In: Grzech, A., Borzemski, L., Świątek, J., Wilimowska, Z. (eds) Information Systems Architecture and Technology: Proceedings of 36th International Conference on Information Systems Architecture and Technology – ISAT 2015 – Part II. Advances in Intelligent Systems and Computing, vol 430. Springer, Cham. https://doi.org/10.1007/978-3-319-28561-0_3
DOI: https://doi.org/10.1007/978-3-319-28561-0_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-28559-7
Online ISBN: 978-3-319-28561-0
eBook Packages: Engineering, Engineering (R0)