An investigation of implicit features in compression-based learning for comparing webpages

  • Theoretical Advances
  • Pattern Analysis and Applications

Abstract

We investigate compression-based learning for image classification tasks. These algorithms are claimed to approximate the Kolmogorov complexity of the difference between two object descriptions, but in practice they are a measure over an induced feature space. We investigate whether these algorithms can be improved via feature selection. Our experiments cover a corpus of legitimate websites and phishing websites impersonating them; the task is to classify a webpage as either legitimate or a phish. We perform feature selection in the feature space induced by a well-known compression algorithm (specifically, the entries of the compression dictionary). We then apply four well-known classification algorithms to the reduced feature sets and conduct a Receiver Operating Characteristic (ROC) analysis of the resulting classifiers. We find that a subset of the features is sufficient for near-perfect classification of these webpages.

Notes

  1. http://www.7-zip.org/7z.html.

  2. Note that for any pair of objects, there are two NCD values, as C(xy) ≠ C(yx) in general. Chen et al. [12] explored aggregating the values by their maximum or their mean; both yielded similar results. There is no way to simply pick one value, as we have no rationale to favour the xy or yx concatenation. Thus, after aggregation, there are 120 of these pairwise NCDs.
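The aggregation described in note 2 can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the standard normalized compression distance, NCD(x, y) = (C(xy) − min(C(x), C(y))) / max(C(x), C(y)), and uses Python's `lzma` module as a stand-in for the 7-Zip compressor mentioned in note 1. The function names (`compressed_size`, `ncd`, `symmetric_ncd`, `pairwise_ncds`) are hypothetical.

```python
import lzma
from itertools import combinations
from statistics import mean


def compressed_size(data: bytes) -> int:
    """C(.): length in bytes of the LZMA-compressed input (stand-in compressor)."""
    return len(lzma.compress(data))


def ncd(x: bytes, y: bytes) -> float:
    """One-directional NCD, computed from the concatenation xy."""
    cx, cy = compressed_size(x), compressed_size(y)
    return (compressed_size(x + y) - min(cx, cy)) / max(cx, cy)


def symmetric_ncd(x: bytes, y: bytes, aggregate=max) -> float:
    """Combine the xy and yx values with max (default) or, e.g., mean."""
    return aggregate([ncd(x, y), ncd(y, x)])


def pairwise_ncds(objects, aggregate=max):
    """One aggregated NCD per unordered pair of objects, keyed by index pair."""
    return {
        (i, j): symmetric_ncd(objects[i], objects[j], aggregate)
        for i, j in combinations(range(len(objects)), 2)
    }
```

Passing `aggregate=mean` gives the mean-based variant. A collection of 16 objects would produce 16·15/2 = 120 unordered pairs, which matches the count in note 2, although the note itself does not state the collection size.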

References

  1. Aks DJ, Sprott JC (1996) Quantifying aesthetic preference for chaotic patterns. Empir Stud Arts 14(1):1–16

  2. Bell AJ, Sejnowski TJ (1997) The independent components of natural scenes are edge filters. Vis Res 37:3327–3338

  3. Billock VA (2000) Neural acclimation to 1/f spatial frequency spectra in natural images transduced by the human visual system. Phys D 137:379–391

  4. Broder A, Glassman S, Manasse M, Zweig G (1997) Syntactic clustering of the web. Paper presented at the Proceedings of the 6th international World Wide Web conference, Santa Clara, CA

  5. Brown WRJ (1952) Statistics of color-matching data. J Opt Soc Am 42:252

  6. Burton GJ, Moorhead IR (1987) Color and spatial structure in natural scenes. Appl Opt 26(1):157–170

  7. Cai D, Yu S, Wen JR, Ma WY (2003) Extracting content structure for web pages based on visual representation. Lect Notes Comput Sci 2642:406–417

  8. Cebrián M, Alfonseca M, Ortega A (2005) Common pitfalls using the normalized compression distance: what to watch out for in a compressor. Commun Inf Syst 5(4):367–384

  9. Chaitin GJ (1987) Algorithmic information theory. Cambridge University Press, Cambridge

  10. Charikar MS (2002) Similarity estimation techniques from rounding algorithms. Paper presented at the ACM Symposium on theory of computing, Montreal, QC, Canada

  11. Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002) SMOTE: synthetic minority over-sampling technique. J Artif Intell Res 16:321–357

  12. Chen TC, Dick S, Miller J (2010) Detecting visually similar web pages: application to phishing detection. ACM Trans Internet Technol 10(2):5:1–5:38

  13. Chen T-C, Dick S, Miller J (2014) An anti-phishing system employing compression-based similarity measures. ACM Trans Inf Syst Secur 16(4):16:11–16:31

  14. Chen X, Kwong S, Li M (1999) A compression algorithm for DNA sequences and its applications in genome comparison. Paper presented at the proceedings of international conference on computational molecular biology, Tokyo, Japan

  15. Cilibrasi R, Vitanyi PMB (2005) Clustering by compression. IEEE Trans Inf Theory 51(4):1523–1545

  16. Dorner D (ed) (1996) The logic of failure: recognizing and avoiding error in complex situations. Metropolitan Books, New York, NY

  17. Duda RO, Hart PE, Stork DG (2001) Pattern classification. Wiley, New York, NY

  18. Fawcett T (2006) An introduction to ROC analysis. Pattern Recogn Lett 27(8):861–874

  19. Field DJ (1987) Relations between the statistics of natural images and the response properties of cortical cells. J Opt Soc Am A 4(12):2379–2394

  20. Field DJ (1994) What is the goal of sensory coding? Neural Comput 6:559–601

  21. Field DJ, Brady N (1997) Visual sensitivity, blur and the sources of variability in the amplitude spectra of natural scenes. Vis Res 37(23):3367–3383

  22. Frazor RA, Geisler WS (2006) Local luminance and contrast in natural images. Vis Res 46:1585–1598

  23. Gilks WR, Richardson S, Spiegelhalter DJ (1996) Markov chain Monte Carlo in practice. Chapman & Hall, London

  24. Gordon IE (2004) Theories of visual perception, 3rd edn. Psychology Press, New York

  25. Graham DJ, Chandler DM, Field DJ (2006) Can the theory of “whitening” explain the center-surround properties of retinal ganglion cell receptive fields? Vis Res 46:2901–2913

  26. Graham DJ, Field DJ (2007) Statistical regularities of art images and natural scenes: spectra, sparseness and nonlinearities. Spat Vis 21(1–2):149–164

  27. Graham DJ, Field DJ (2008) Variations in intensity statistics for representational and abstract art, and for art from the Eastern and Western hemispheres. Perception 37:1341–1352

  28. Graham L (2008) Gestalt theory in interactive media design. J Humanit Soc Sci 2(1):1–12

  29. Hagerhall CM, Purcell T, Taylor R (2004) Fractal dimension of landscape silhouette outlines as a predictor of landscape preference. J Environ Psychol 24:247–255

  30. Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH (2009) The WEKA data mining software: an update. SIGKDD Explor 11(1):10–18

  31. Haveliwala TH, Gionis A, Klein D, Indyk P (2002) Evaluating strategies for similarity search on the web. Paper presented at the proceedings of the international conference on World Wide Web, Honolulu, Hawaii, USA

  32. Heintze N (1996) Scalable document fingerprinting. Paper presented at the USENIX workshop on electronic commerce, Oakland, CA, USA

  33. Henzinger M (2006) Finding near-duplicate web pages: a large-scale evaluation of algorithms. Paper presented at the proceedings of the international ACM SIGIR conference on research and development in information retrieval, Seattle, Washington, USA

  34. Hescott B, Koulomzin D (2007) On clustering images using compression. B. U. Computer Science Department, Trans., Boston University, Boston

  35. Jackowski K (2012) Evolutionary adapted ensemble for reoccurring context. Lect Notes Comput Sci 7209:550–557

  36. Kalviainen M (2007) The role of sign elements in holistic product meaning. Paper presented at the Proceedings of the SeFun international seminar: design semiotics in use, Helsinki, Finland

  37. Keogh E, Lonardi S, Ratanamahatana C (2004) Toward parameter-free data mining. Paper presented at the ACM SIGKDD international conference on knowledge discovery and data mining, Seattle, WA, USA

  38. Kira K, Rendell LA (1992) The feature selection problem: traditional methods and a new algorithm. Paper presented at the Proceedings of the AAAI, San Jose, CA, USA

  39. Klir GJ, Yuan B (1995) Fuzzy sets and fuzzy logic: theory and applications. Prentice Hall, Upper Saddle River

  40. Knill DC, Field D, Kersten D (1990) Human discrimination of fractal images. J Opt Soc Am A 7(6):1113–1123

  41. Kocsor A, Kertész-Farkas A, Kaján L, Pongor S (2006) Application of compression-based distance measures to protein sequence classification: a methodological study. Bioinformatics 22(4):407–412

  42. Kononenko I (1994) Estimating attributes: analysis and extensions of Relief. Paper presented at the European conference on machine learning, Catania, Italy

  43. Krawczyk B, Wozniak M, Cyganek B (2014) Clustering-based ensembles for one-class classification. Inf Sci 264:182–195

  44. Li M, Chen X, Li X, Ma B, Vitanyi PMB (2004) The similarity metric. IEEE Trans Inf Theory 50(12):3250–3264

  45. Li M, Zhu Y (2006) Image classification via LZ78 based string kernel: a comparative study. Lect Notes Comput Sci 3918:704–712

  46. Macedonas A, Besiris D, Economou G, Fotopoulos S (2008) Dictionary based color image retrieval. J Vis Commun Image Retr 19:464–470

  47. Marpe D, Schwarz H, Wiegand T (2003) Context-based adaptive binary arithmetic coding in the H.264/AVC video compression standard. IEEE Trans Circuits Syst Video Technol 13(7):620–636

  48. Mitchell TM (1997) Machine learning. McGraw-Hill, New York

  49. Nigel G, Martin N (1979) Range encoding: an algorithm for removing redundancy from a digitized message. Paper presented at the proceedings of the video and data recording conference, Southampton, UK

  50. Olshausen BA, Field DJ (1996) Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature 381:607–609

  51. Párraga CA, Troscianko T, Tolhurst DJ (2000) The human visual system is optimized for processing the spatial information in natural visual images. Curr Biol 10:35–38

  52. Provost F, Fawcett T, Kohavi R (1998) The case against accuracy estimation for comparing induction algorithms. Paper presented at the proceedings of the International conference on machine learning, Madison, WI, USA

  53. Qi X, Davison BD (2009) Web page classification: features and algorithms. ACM Comput Surv 41(2):12:11–12:31

  54. Redies C (2007) A universal model of esthetic perception based on the sensory coding of natural stimuli. Spat Vis 21:97–117

  55. Redies C, Hasenstein J, Denzler J (2007) Fractal-like image statistics in visual art: similarity to natural scenes. Spat Vis 21:137–148

  56. Rice JA (1995) Mathematical statistics and data analysis, 2nd edn. Duxbury Press, Belmont

  57. Rogowitz BE, Voss RF (1990) Shape perception and low-dimensional fractal boundary contours. Proc SPIE 1249:387–394

  58. Rosen BE, Goodwin JM, Vidal JJ (1990) Adaptive range coding. Paper presented at the Proceedings of the conference on advances in neural information processing systems, Denver, CO, USA

  59. Sculley D, Brodley CE (2006) Compression and machine learning: a new perspective on feature space vectors. Paper presented at the proceedings of the data compression conference, Snowbird, UT, USA

  60. Spehar B, Clifford CWG, Newell BR, Taylor RP (2003) Universal aesthetic of fractals. Comput Graph 27:813–820

  61. Sprott JC (1993) Automatic generation of strange attractors. Comput Graph 17(3):325–332

  62. Staff (2009) Convert HTML to image. http://www.converthtmltoimage.com/

  63. Staff (2011) PhishTank Retrieved July 5. http://www.phishtank.com/

  64. Staff (2013) Welcome to eBay—sign in. https://signin.ebay.com/ws/eBayISAPI.dll?SignIn&ru=http%3A%2F%2Fwww.ebay.com

  65. Taylor R, Micolich A, Jonas D (1999) Fractal expressionism. Phys World 12(10):25–28

  66. Telles GP, Minghim R, Paulovich FV (2007) Normalized compression distance for visual analysis of document collections. Comput Graph 31(3):327–337

  67. Tolhurst DJ, Tadmor Y, Chao T (1992) Amplitude spectra of natural images. Ophthal Physiol Opt 12(2):229–232

  68. Weckström M, Laughlin SB (1995) Visual ecology and voltage-gated ion channels in insect photoreceptors. Trends Neurosci 18(1):17–21

  69. Ziv J, Lempel A (1977) A universal algorithm for sequential data compression. IEEE Trans Inf Theory 23(3):337–343

Acknowledgments

This work was supported in part by the Natural Sciences and Engineering Research Council of Canada under Grant No. G121210906.

Author information

Corresponding author

Correspondence to Scott Dick.

About this article

Cite this article

Chen, TC., Stepan, T., Dick, S. et al. An investigation of implicit features in compression-based learning for comparing webpages. Pattern Anal Applic 19, 397–410 (2016). https://doi.org/10.1007/s10044-014-0432-4

  • DOI: https://doi.org/10.1007/s10044-014-0432-4
