Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 430)

Abstract

The paper evaluates the effectiveness of several algorithms used to assess text similarity. The purpose of this evaluation is to determine the best methods for comparing web pages and identifying near-identical ones. Such comparison is in turn a prerequisite for building new automated testing tools and security scanners. The goal is to build scanners that can automatically test web application behavior across a large range of supplied parameters (known as fuzzing). Such testing requires generating and processing requests on a massive scale, which in turn requires fast page comparison methods. The similarity comparison is performed on a shortened, tokenized version of the web pages, using a test set of pages downloaded from popular websites. A methodology for evaluating similarity metrics is proposed, together with a quality metric for the intended task. Several tokenization strategies are also tested, and their impact on the final result is assessed.
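
As an illustration of comparing pages in a shortened, tokenized form, the following is a minimal sketch and not the method used in the paper. It assumes one possible tokenization strategy (keeping only the names of opening HTML tags) and one possible metric (Jaccard similarity over the resulting token sets). The names `TagTokenizer`, `tokenize`, and `jaccard` are hypothetical and chosen only for this example.

```python
from html.parser import HTMLParser


class TagTokenizer(HTMLParser):
    """Collects opening-tag names, discarding text content and attributes.

    This is an assumed tokenization strategy, used only for illustration.
    """

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # Keep only the tag name; the page text itself is ignored.
        self.tokens.append(tag)


def tokenize(html):
    """Return the list of opening-tag names found in an HTML string."""
    parser = TagTokenizer()
    parser.feed(html)
    parser.close()
    return parser.tokens


def jaccard(tokens_a, tokens_b):
    """Jaccard similarity of two token collections, treated as sets."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    if not set_a and not set_b:
        # Two empty pages are treated as identical by convention.
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)
```

Under this sketch, two pages that share the same tag structure but differ only in their text score 1.0. Because the comparison uses sets, it ignores token order and repetition; a sequence-based metric such as an edit distance would be sensitive to both, at a higher computational cost.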



Author information

Corresponding author

Correspondence to Marek Zachara.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Zachara, M., Pałka, D. (2016). Comparison of Text-Similarity Metrics for the Purpose of Identifying Identical Web Pages During Automated Web Application Testing. In: Grzech, A., Borzemski, L., Świątek, J., Wilimowska, Z. (eds) Information Systems Architecture and Technology: Proceedings of 36th International Conference on Information Systems Architecture and Technology – ISAT 2015 – Part II. Advances in Intelligent Systems and Computing, vol 430. Springer, Cham. https://doi.org/10.1007/978-3-319-28561-0_3

  • DOI: https://doi.org/10.1007/978-3-319-28561-0_3

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-28559-7

  • Online ISBN: 978-3-319-28561-0

  • eBook Packages: Engineering, Engineering (R0)
