An investigation of implicit features in compression-based learning for comparing webpages

Chen, Teh-Chung; Stepan, Torin; Dick, Scott; Miller, James

doi:10.1007/s10044-014-0432-4

Theoretical Advances
Published: 29 November 2014

Pattern Analysis and Applications, Volume 19, pages 397–410 (2016)

Abstract

We investigate compression-based learning for image classification tasks. These algorithms are claimed to approximate the Kolmogorov complexity of the difference between two object descriptions, but in practice they act as a measure over an induced feature space. We investigate whether these algorithms can be improved via feature selection. Our experiments cover a corpus of legitimate websites and phishing websites impersonating them; the task is to classify a webpage as either legitimate or a phish. We perform feature selection in the feature space induced by a well-known compression algorithm (specifically, the features are the entries of the compression dictionary). We then apply four well-known classification algorithms to the reduced feature sets and conduct a Receiver Operating Characteristic analysis on them. We find that a subset of the features is sufficient for near-perfect classification of these webpages.
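To make the idea of a dictionary-induced feature space concrete, the following is a minimal sketch under stated assumptions, not the procedure used in the article. It uses a simple LZ78-style parse as a stand-in compressor (the abstract does not name the compressor), treats each dictionary phrase as a binary feature, and ranks features with a hypothetical filter score (the absolute difference in how often a phrase occurs in each class). The function names `lz78_phrases`, `feature_matrix`, and `select_features` are illustrative only.

```python
from __future__ import annotations


def lz78_phrases(data: bytes) -> set[bytes]:
    """Return the phrases an LZ78-style parse would add to its dictionary."""
    phrases: set[bytes] = set()
    current = b""
    for byte in data:
        current += bytes([byte])
        if current not in phrases:
            # A new phrase: store it and start parsing the next one.
            phrases.add(current)
            current = b""
    # A trailing partial phrase that is already in the dictionary is dropped.
    return phrases


def feature_matrix(pages: list[bytes]) -> tuple[list[bytes], list[list[int]]]:
    """Build binary presence/absence vectors over the union of all phrases."""
    per_page = [lz78_phrases(page) for page in pages]
    vocabulary = sorted(set().union(*per_page))
    rows = [[int(phrase in found) for phrase in vocabulary] for found in per_page]
    return vocabulary, rows


def select_features(rows: list[list[int]], labels: list[int], k: int) -> list[int]:
    """Rank features by |P(feature | class 1) - P(feature | class 0)|.

    Assumption: labels are 0 (legitimate) or 1 (phish) and both classes
    are non-empty. This filter is a stand-in, not the article's method.
    """
    pos = [row for row, label in zip(rows, labels) if label == 1]
    neg = [row for row, label in zip(rows, labels) if label == 0]

    def score(j: int) -> float:
        return abs(sum(r[j] for r in pos) / len(pos) - sum(r[j] for r in neg) / len(neg))

    n_features = len(rows[0]) if rows else 0
    return sorted(range(n_features), key=score, reverse=True)[:k]
```

The reduced vectors (the selected columns of `rows`) could then be passed to any standard classifier; which classifiers and which selection method the authors used is not stated in this excerpt.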

Notes

http://www.7-zip.org/7z.html.
Note that for any pair of objects, there are two NCD values, as C(xy) ≠ C(yx) in general. Chen et al. [12] explored aggregating the values by their maximum or their mean; both yielded similar results. There is no way to simply pick one value, as we have no rationale to favour the xy or yx concatenation. Thus, after aggregation, there are 120 of these pairwise NCDs.
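The aggregation described in the note above can be sketched as follows. This is an illustrative example rather than the authors' code: it assumes the standard NCD definition from Cilibrasi and Vitanyi, NCD(x, y) = (C(xy) − min(C(x), C(y))) / max(C(x), C(y)), and uses Python's `lzma` module as the compressor C because the first note points to the 7z format. The helper names are hypothetical.

```python
import lzma
from itertools import combinations
from statistics import fmean


def compressed_size(data: bytes) -> int:
    """C(data): length in bytes of the LZMA-compressed input.

    The default .xz container adds a fixed overhead, which matters
    for very small inputs.
    """
    return len(lzma.compress(data))


def ncd(x: bytes, y: bytes) -> float:
    """NCD for the concatenation order x followed by y."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def symmetric_ncd(x: bytes, y: bytes, aggregate=max) -> float:
    """Combine the xy and yx values; pass aggregate=fmean for the mean."""
    return aggregate((ncd(x, y), ncd(y, x)))


def pairwise_ncd(objects: list[bytes], aggregate=max) -> dict[tuple[int, int], float]:
    """One aggregated NCD per unordered pair of objects."""
    return {
        (i, j): symmetric_ncd(objects[i], objects[j], aggregate)
        for i, j in combinations(range(len(objects)), 2)
    }
```

With n objects there are n(n − 1)/2 unordered pairs, so 120 aggregated values would correspond to n = 16; that object count is inferred from the arithmetic and is not stated in this excerpt.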

References

1. Aks DJ, Sprott JC (1996) Quantifying aesthetic preference for chaotic patterns. Empir Stud Arts 14(1):1–16
2. Bell AJ, Sejnowski TJ (1997) The independent components of natural scenes are edge filters. Vis Res 37:3327–3338
3. Billock VA (2000) Neural acclimation to 1/f spatial frequency spectra in natural images transduced by the human visual system. Phys D 137:379–391
4. Broder A, Glassman S, Manasse M, Zweig G (1997) Syntactic clustering of the web. Paper presented at the Proceedings of the 6th international World Wide Web conference, Santa Clara, CA
5. Brown WRJ (1952) Statistics of color-matching data. J Opt Soc Am 42:252
6. Burton GJ, Moorhead IR (1987) Color and spatial structure in natural scenes. Appl Opt 26(1):157–170
7. Cai D, Yu S, Wen JR, Ma WY (2003) Extracting content structure for web pages based on visual representation. Lect Notes Comput Sci 2642:406–417
8. Cebrián M, Alfonseca M, Ortega A (2005) Common pitfalls using the normalized compression distance: what to watch out for in a compressor. Commun Inf Syst 5(4):367–384
9. Chaitin GJ (1987) Algorithmic information theory. Cambridge University Press, Cambridge
10. Charikar MS (2002) Similarity estimation techniques from rounding algorithms. Paper presented at the ACM Symposium on theory of computing, Montreal, QC, Canada
11. Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002) SMOTE: synthetic minority over-sampling technique. J Artif Intell Res 16:321–357
12. Chen TC, Dick S, Miller J (2010) Detecting visually similar web pages: application to phishing detection. ACM Trans Internet Technol 10(2):5:1–5:38
13. Chen T-C, Dick S, Miller J (2014) An anti-phishing system employing compression-based similarity measures. ACM Trans Inf Syst Secur 16(4):16:11–16:31
14. Chen X, Kwong S, Li M (1999) A compression algorithm for DNA sequences and its applications in genome comparison. Paper presented at the proceedings of international conference on computational molecular biology, Tokyo, Japan
15. Cilibrasi R, Vitanyi PMB (2005) Clustering by compression. IEEE Trans Inf Theory 51(4):1523–1545
16. Dorner D (ed) (1996) The logic of failure: recognizing and avoiding error in complex situations. Metropolitan Books, New York, NY
17. Duda RO, Hart PE, Stork DG (2001) Pattern classification. Wiley, New York, NY
18. Fawcett T (2006) An introduction to ROC analysis. Pattern Recogn Lett 27(8):861–874
19. Field DJ (1987) Relations between the statistics of natural images and the response properties of cortical cells. J Opt Soc Am A 4(12):2379–2394
20. Field DJ (1994) What is the goal of sensory coding? Neural Comput 6:559–601
21. Field DJ, Brady N (1997) Visual sensitivity, blur and the sources of variability in the amplitude spectra of natural scenes. Vis Res 37(23):3367–3383
22. Frazor RA, Geisler WS (2006) Local luminance and contrast in natural images. Vis Res 46:1585–1598
23. Gilks WR, Richardson S, Spiegelhalter DJ (1996) Markov chain Monte Carlo in practice. Chapman & Hall, London
24. Gordon IE (2004) Theories of visual perception, 3rd edn. Psychology Press, New York
25. Graham DJ, Chandler DM, Field DJ (2006) Can the theory of "whitening" explain the center-surround properties of retinal ganglion cell receptive fields? Vis Res 46:2901–2913
26. Graham DJ, Field DJ (2007) Statistical regularities of art images and natural scenes: spectra, sparseness and nonlinearities. Spat Vis 21(1–2):149–164
27. Graham DJ, Field DJ (2008) Variations in intensity statistics for representational and abstract art, and for art from the Eastern and Western hemispheres. Perception 37:1341–1352
28. Graham L (2008) Gestalt theory in interactive media design. J Humanit Soc Sci 2(1):1–12
29. Hagerhall CM, Purcell T, Taylor R (2004) Fractal dimension of landscape silhouette outlines as a predictor of landscape preference. J Environ Psychol 24:247–255
30. Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH (2009) The WEKA data mining software: an update. SIGKDD Explor 11(1):10–18
31. Haveliwala TH, Gionis A, Klein D, Indyk P (2002) Evaluating strategies for similarity search on the web. Paper presented at the proceedings of the international conference on World Wide Web, Honolulu, Hawaii, USA
32. Heintze N (1996) Scalable document fingerprinting. Paper presented at the USENIX workshop on electronic commerce, Oakland, CA, USA
33. Henzinger M (2006) Finding near-duplicate web pages: a large-scale evaluation of algorithms. Paper presented at the Proceedings of the international ACM SIGIR conference on research and development in information retrieval, Seattle, Washington, USA
34. Hescott B, Koulomzin D (2007) On clustering images using compression. B. U. Computer Science Department, Trans., Boston University, Boston
35. Jackowski K (2012) Evolutionary adapted ensemble for reoccurring context. Lect Notes Comput Sci 7209:550–557
36. Kalviainen M (2007) The role of sign elements in holistic product meaning. Paper presented at the Proceedings of the SeFun international seminar: design semiotics in use, Helsinki, Finland
37. Keogh E, Lonardi S, Ratanamahatana C (2004) Toward parameter-free data mining. Paper presented at the ACM SIGKDD international conference on knowledge discovery and data mining, Seattle, WA, USA
38. Kira K, Rendell LA (1992) The feature selection problem: traditional methods and a new algorithm. Paper presented at the Proceedings of the AAAI, San Jose, CA, USA
39. Klir GJ, Yuan B (1995) Fuzzy sets and fuzzy logic: theory and applications. Prentice Hall, Upper Saddle River
40. Knill DC, Field D, Kersten D (1990) Human discrimination of fractal images. J Opt Soc Am A 7(6):1113–1123
41. Kocsor A, Kertész-Farkas A, Kaján L, Pongor S (2006) Application of compression-based distance measures to protein sequence classification: a methodological study. Bioinformatics 22(4):407–412
42. Kononenko I (1994) Estimating attributes: analysis and extensions of Relief. Paper presented at the European conference on machine learning, Catania, Italy
43. Krawczyk B, Wozniak M, Cyganek B (2014) Clustering-based ensembles for one-class classification. Inf Sci 264:182–195
44. Li M, Chen X, Li X, Ma B, Vitanyi PMB (2004) The similarity metric. IEEE Trans Inf Theory 50(12):3250–3264
45. Li M, Zhu Y (2006) Image classification via LZ78 based string kernel: a comparative study. Lect Notes Comput Sci 3918:704–712
46. Macedonas A, Besiris D, Economou G, Fotopoulos S (2008) Dictionary based color image retrieval. J Vis Commun Image Retr 19:464–470
47. Marpe D, Schwarz H, Wiegand T (2003) Context-based adaptive binary arithmetic coding in the H.264/AVC video compression. IEEE Trans Circuits Syst Video Technol 13(7):620–636
48. Mitchell TM (1997) Machine learning. McGraw-Hill, New York
49. Nigel G, Martin N (1979) Range encoding: an algorithm for removing redundancy from a digitized message. Paper presented at the proceedings of the video and data recording conference, Southampton, UK
50. Olshausen BA, Field DJ (1996) Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature 381:607–609
51. Párraga CA, Troscianko T, Tolhurst DJ (2000) The human visual system is optimized for processing the spatial information in natural visual images. Curr Biol 10:35–38
52. Provost F, Fawcett T, Kohavi R (1998) The case against accuracy estimation for comparing induction algorithms. Paper presented at the proceedings of the International conference on machine learning, Madison, WI, USA
53. Qi X, Davison BD (2009) Web page classification: features and algorithms. ACM Comput Surv 41(2):12:11–12:31
54. Redies C (2007) A universal model of esthetic perception based on the sensory coding of natural stimuli. Spat Vis 21:97–117
55. Redies C, Hasenstein J, Denzler J (2007) Fractal-like image statistics in visual art: similarity to natural scenes. Spat Vis 21:137–148
56. Rice JA (1995) Mathematical statistics and data analysis, 2nd edn. Duxbury Press, Belmont
57. Rogowitz BE, Voss RF (1990) Shape perception and low-dimensional fractal boundary contours. Proc SPIE 1249:387–394
58. Rosen BE, Goodwin JM, Vidal JJ (1990) Adaptive range coding. Paper presented at the Proceedings of the conference on advances in neural information processing systems, Denver, CO, USA
59. Sculley D, Brodley CE (2006) Compression and machine learning: a new perspective on feature space vectors. Paper presented at the proceedings of the data compression conference, Snowbird, UT, USA
60. Spehar B, Clifford CWG, Newell BR, Taylor RP (2003) Universal aesthetic of fractals. Comput Graph 27:813–820
61. Sprott JC (1993) Automatic generation of strange attractors. Comput Graph 17(3):325–332
62. Staff (2009) Convert HTML to image. http://www.converthtmltoimage.com/
63. Staff (2011) PhishTank. Retrieved July 5. http://www.phishtank.com/
64. Staff (2013) Welcome to eBay - sign in. https://signin.ebay.com/ws/eBayISAPI.dll?SignIn&ru=http%3A%2F%2Fwww.ebay.com
65. Taylor R, Micolich A, Jonas D (1999) Fractal expressionism. Phys World 12(10):25–28
66. Telles GP, Minghim R, Paulovich FV (2007) Normalized compression distance for visual analysis of document collections. Comput Graph 31(3):327–337
67. Tolhurst DJ, Tadmor Y, Chao T (1992) Amplitude spectra of natural images. Ophthal Physiol Opt 12(2):229–232
68. Weckström M, Laughlin SB (1995) Visual ecology and voltage-gated ion channels in insect photoreceptors. Trends Neurosci 18(1):17–21
69. Ziv J, Lempel A (1977) A universal algorithm for sequential data compression. IEEE Trans Inf Theory 23(3):337–343

Acknowledgments

This work was supported in part by the Natural Sciences and Engineering Research Council of Canada under Grant No. G121210906.

Author information

Authors and Affiliations

Department of Electrical and Computer Engineering, University of Alberta, 2nd Flr ECERF Bldg., Edmonton, AB, T6G 2V4, Canada
Teh-Chung Chen, Torin Stepan, Scott Dick & James Miller

Corresponding author

Correspondence to Scott Dick.

About this article

Cite this article

Chen, TC., Stepan, T., Dick, S. et al. An investigation of implicit features in compression-based learning for comparing webpages. Pattern Anal Applic 19, 397–410 (2016). https://doi.org/10.1007/s10044-014-0432-4

Received: 27 September 2012
Accepted: 16 November 2014
Published: 29 November 2014
Issue Date: May 2016
DOI: https://doi.org/10.1007/s10044-014-0432-4
