Extreme-value-theoretic estimation of local intrinsic dimensionality

Data Mining and Knowledge Discovery

Abstract

This paper is concerned with the estimation of a local measure of intrinsic dimensionality (ID) recently proposed by Houle. The local model can be regarded as an extension of Karger and Ruhl’s expansion dimension to a statistical setting in which the distribution of distances to a query point is modeled in terms of a continuous random variable. This form of intrinsic dimensionality can be particularly useful in search, classification, outlier detection, and other contexts in machine learning, databases, and data mining, as it has been shown to be equivalent to a measure of the discriminative power of similarity functions. Several estimators of local ID are proposed and analyzed based on extreme value theory, using maximum likelihood estimation, the method of moments, probability weighted moments, and regularly varying functions. An experimental evaluation is also provided, using both real and artificial data.
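As a rough illustration of the maximum likelihood approach named in the abstract, the sketch below implements a Hill-type estimator of local ID from the distances between a query point and its nearest neighbors. The formula used, the negative reciprocal of the mean log ratio of each distance to the largest one, is the commonly cited form of this estimator and is an assumption here, since the full text is not reproduced on this page. The names `lid_mle` and `sample_ball_distances` are hypothetical helpers chosen for this example.

```python
import math
import random


def lid_mle(distances, k=None):
    """Hill-type maximum likelihood estimate of local ID (assumed form).

    distances: positive distances from a query point to its neighbors.
    k: number of nearest neighbors to use (default: all of them).
    """
    dists = sorted(d for d in distances if d > 0)
    if k is not None:
        dists = dists[:k]
    if len(dists) < 2:
        raise ValueError("need at least two positive distances")
    w = dists[-1]  # largest distance in the neighborhood
    mean_log_ratio = sum(math.log(d / w) for d in dists) / len(dists)
    if mean_log_ratio == 0.0:
        raise ValueError("all distances are equal; estimate is undefined")
    return -1.0 / mean_log_ratio


def sample_ball_distances(dim, n, rng):
    """Distances to the center for n points uniform in a unit ball of R^dim.

    The radius has CDF r**dim, so r = U**(1/dim) for uniform U.
    """
    return [rng.random() ** (1.0 / dim) for _ in range(n)]


if __name__ == "__main__":
    rng = random.Random(0)
    dists = sample_ball_distances(dim=5, n=20000, rng=rng)
    print(lid_mle(dists, k=2000))  # should be close to 5
```

The synthetic check uses points drawn uniformly from a ball, where the distance distribution around the center grows as the radius raised to the dimension, so the estimate should approach the true dimension as the neighborhood size grows.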

References

  • Alves MF, de Haan L, Lin T (2003a) Estimation of the parameter controlling the speed of convergence in extreme value theory. Math Method Stat 12(2):155–176

  • Alves MIF, Gomes MI, de Haan L (2003b) A new class of semi-parametric estimators of the second order parameter. Port Math 60(2):193–214

  • Amsaleg L, Chelly O, Furon T, Girard S, Houle ME, Kawarabayashi K, Nett M (2015) Estimating local intrinsic dimensionality. In: Proceedings of the 21st ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 29–38

  • Balkema AA, De Haan L (1974) Residual life time at great age. Ann Probab 2(5):792–804

  • Beyer K, Goldstein J, Ramakrishnan R, Shaft U (1999) When is nearest neighbor meaningful? In: International conference on database theory. Springer, Berlin, pp 217–235

  • Beygelzimer A, Kakade S, Langford J (2006) Cover trees for nearest neighbor. In: International conference on machine learning. ACM, pp 97–104

  • Bingham NH, Goldie CM, Teugels JL (1989) Regular variation, vol 27. Cambridge University Press, Cambridge

  • Boujemaa N, Fauqueur J, Ferecatu M, Fleuret F, Gouet V, LeSaux B, Sahbi H (2001) IKONA for interactive specific and generic image retrieval. In: Proceedings of international workshop on multimedia content-based indexing and retrieval

  • Bouveyron C, Celeux G, Girard S (2011) Intrinsic dimension estimation by maximum likelihood in isotropic probabilistic PCA. Pattern Recogn Lett 32(14):1706–1713

  • Bruske J, Sommer G (1998) Intrinsic dimensionality estimation with optimally topology preserving maps. IEEE Trans Pattern Anal Mach Intell 20(5):572–575

  • Camastra F, Vinciarelli A (2002) Estimating the intrinsic dimension of data with a fractal-based method. IEEE Trans Pattern Anal Mach Intell 24(10):1404–1407

  • Cole R, Fanty M (1990) Spoken letter recognition. In: Proceedings of the third DARPA speech and natural language workshop, pp 385–390

  • Coles S, Bawa J, Trenner L, Dorazio P (2001) An introduction to statistical modeling of extreme values, vol 208. Springer, Berlin

  • Costa JA, Hero AO III (2004) Entropic graphs for manifold learning. In: Asilomar conference on signals, systems and computers, vol 1. IEEE, pp 316–320

  • Dahan E, Mendelson H (2001) An extreme-value model of concept testing. Manag Sci 47(1):102–116

  • de Vries T, Chawla S, Houle ME (2012) Density-preserving projections for large-scale local anomaly detection. Knowl Inf Syst 32(1):25–52

  • Dong W, Moses C, Li K (2011) Efficient k-nearest neighbor graph construction for generic similarity measures. In: International World Wide Web conference. ACM, pp 577–586

  • Donoho DL, Grimes C (2003) Hessian eigenmaps: locally linear embedding techniques for high-dimensional data. Proc Natl Acad Sci 100(10):5591–5596

  • Faloutsos C, Kamel I (1994) Beyond uniformity and independence: analysis of R-trees using the concept of fractal dimension. In: Proceedings of the 13th ACM SIGACT–SIGMOD–SIGART symposium on principles of database systems. ACM, pp 4–13

  • Fisher RA, Tippett LHC (1928) Limiting forms of the frequency distribution of the largest or smallest member of a sample. Math Proc Camb Philos Soc 24(02):180–190

  • Fukunaga K, Olsen DR (1971) An algorithm for finding intrinsic dimensionality of data. IEEE Trans Comput 100(2):176–183

  • Furon T, Jégou H (2013) Using extreme value theory for image detection. Research Report RR-8244, INRIA

  • Gnedenko B (1943) Sur la distribution limite du terme maximum d’une série aléatoire. Ann Math 44(3):423–453

  • Grimshaw SD (1993) Computing maximum likelihood estimates for the Generalized Pareto Distribution. Technometrics 35(2):185–191

  • Gupta A, Krauthgamer R, Lee JR (2003) Bounded geometries, fractals, and low-distortion embeddings. In: Proceedings of the 44th annual IEEE symposium on foundations of computer science. IEEE, pp 534–543

  • Guyon I, Gunn S, Ben-Hur A, Dror G (2004) Result analysis of the NIPS 2003 feature selection challenge. In: Neural information processing systems, pp 545–552

  • Harris R (2001) The accuracy of design values predicted from extreme value analysis. J Wind Eng Ind Aerodyn 89(2):153–164

  • Hein M, Audibert JY (2005) Intrinsic dimensionality estimation of submanifolds in \({R}^d\). In: International conference on machine learning. ACM, pp 289–296

  • Hill BM (1975) A simple general approach to inference about the tail of a distribution. Ann Stat 3(5):1163–1174

  • Hosking JR, Wallis JR (1987) Parameter and quantile estimation for the Generalized Pareto Distribution. Technometrics 29(3):339–349

  • Houle ME (2013) Dimensionality, discriminability, density & distance distributions. In: 13th International conference on data mining workshops. IEEE, pp 468–473

  • Houle ME (2015) Inlierness, outlierness, hubness and discriminability: an extreme-value-theoretic foundation. Tech. Rep. 2015-002E, National Institute of Informatics

  • Houle ME, Kashima H, Nett M (2012a) Generalized expansion dimension. In: 12th international conference on data mining workshops, IEEE, pp 587–594

  • Houle ME, Ma X, Nett M, Oria V (2012b) Dimensional testing for multi-step similarity search. In: 12th International conference on data mining. IEEE, pp 299–308

  • Houle ME, Ma X, Oria V, Sun J (2014) Efficient algorithms for similarity search in axis-aligned subspaces. In: International conference on similarity search and applications. Springer, Berlin, pp 1–12

  • Jégou H, Tavenard R, Douze M, Amsaleg L (2011) Searching in one billion vectors: re-rank with source coding. In: International conference on acoustics, speech and signal processing. IEEE, pp 861–864

  • Jolliffe IT (1986) Principal component analysis and factor analysis. In: Principal component analysis. Springer, Berlin, pp 115–128

  • Karger DR, Ruhl M (2002) Finding nearest neighbors in growth-restricted metrics. In: ACM symposium on theory of computing. ACM, pp 741–750

  • Karhunen J, Joutsensalo J (1994) Representation and separation of signals using nonlinear PCA type learning. IEEE Trans Neural Netw 7(1):113–127

  • Landwehr JM, Matalas N, Wallis J (1979) Probability weighted moments compared with some traditional techniques in estimating Gumbel parameters and quantiles. Water Resour Res 15(5):1055–1064

  • Larrañaga P, Lozano JA (2002) Estimation of distribution algorithms: a new tool for evolutionary computation, vol 2. Springer, Berlin

  • Lavenda BH, Cipollone E (2000) Extreme value statistics and thermodynamics of earthquakes: aftershock sequences. Ann Geophys 43(5):967–982

  • LeCun Y, Bottou L, Bengio Y, Haffner P (1998) Gradient-based learning applied to document recognition. Proc IEEE 86(11):2278–2324

  • Levina E, Bickel PJ (2004) Maximum likelihood estimation of intrinsic dimension. In: Neural information processing systems, pp 777–784

  • McNulty PJ, Scheick LZ, Roth DR, Davis MG, Tortora MR (2000) First failure predictions for EPROMs of the type flown on the MPTB satellite. IEEE Trans Nucl Sci 47(6):2237–2243

  • Millán JdR (2004) On the need for on-line learning in brain-computer interfaces. In: Proceedings of the IEEE international joint conference on neural networks, vol 4. IEEE, pp 2877–2882

  • Nett M (2014) Intrinsic dimensional design and analysis of similarity search. Ph.D. thesis, University of Tokyo

  • Pestov V (2000) On the geometry of similarity search: dimensionality curse and concentration of measure. Inf Process Lett 73(1):47–51

  • Pettis KW, Bailey TA, Jain AK, Dubes RC (1979) An intrinsic dimensionality estimator from near-neighbor information. IEEE Trans Pattern Anal Mach Intell 1:25–37

  • Pickands J III (1975) Statistical inference using extreme order statistics. Ann Stat 3(1):119–131

  • Radovanović M, Nanopoulos A, Ivanović M (2010a) Hubs in space: popular nearest neighbors in high-dimensional data. J Mach Learn Res 11:2487–2531

  • Radovanović M, Nanopoulos A, Ivanović M (2010b) Time-series classification in many intrinsic dimensions. In: Proceedings of the 2010 SIAM international conference on data mining. Citeseer, pp 677–688

  • Rao CR (2009) Linear statistical inference and its applications, vol 22. Wiley, New York

  • Roberts SJ (2000) Extreme value statistics for novelty detection in biomedical data processing. Proc Sci Meas Technol 147:363–367

  • Roweis ST, Saul LK (2000) Nonlinear dimensionality reduction by locally linear embedding. Science 290(5500):2323–2326

  • Rozza A, Lombardi G, Ceruti C, Casiraghi E, Campadelli P (2012) Novel high intrinsic dimensionality estimators. Mach Learn J 89(1–2):37–65

  • Schölkopf B, Smola A, Müller KR (1998) Nonlinear component analysis as a kernel eigenvalue problem. Neural Comput 10(5):1299–1319

  • Shaft U, Ramakrishnan R (2006) Theory of nearest neighbors indexability. ACM Trans Database Syst 31(3):814–838

  • Takens F (1985) On the numerical determination of the dimension of an attractor. Springer, Berlin

  • Tenenbaum JB, De Silva V, Langford JC (2000) A global geometric framework for nonlinear dimensionality reduction. Science 290(5500):2319–2323

  • Tryon RG, Cruse TA (2000) Probabilistic mesomechanics for high cycle fatigue life prediction. J Eng Mater Technol 122(2):209–214

  • Venna J, Kaski S (2006) Local multidimensional scaling. IEEE Trans Neural Netw 19(6):889–899

  • Verveer PJ, Duin RPW (1995) An evaluation of intrinsic dimensionality estimators. IEEE Trans Pattern Anal Mach Intell 17(1):81–86

  • von Brünken J, Houle ME, Zimek A (2015) Intrinsic dimensional outlier detection in high-dimensional data. Tech. Rep. 2015-003E, National Institute of Informatics

  • Weber R, Schek HJ, Blott S (1998) A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. In: 24th international conference on very large data bases, vol 98, pp 194–205

  • Zhang Z, Zha H (2004) Principal manifolds and nonlinear dimensionality reduction via tangent space alignment. J Shanghai Univ (English Edition) 8(4):406–424

Author information

Corresponding author

Correspondence to Michael E. Houle.

Additional information

Responsible editor: Jieping Ye.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

L.A. and T.F. supported by French Project Secular ANR-12-CORD-0014. O.C., M.E.H. and K.K. supported by JST ERATO Kawarabayashi Project. M.E.H. supported by JSPS Kakenhi Kiban (A) Research Grant 25240036. O.C. and M.E.H. supported by JSPS Kakenhi Kiban (B) Research Grant 15H02753.

About this article

Cite this article

Amsaleg, L., Chelly, O., Furon, T. et al. Extreme-value-theoretic estimation of local intrinsic dimensionality. Data Min Knowl Disc 32, 1768–1805 (2018). https://doi.org/10.1007/s10618-018-0578-6

  • DOI: https://doi.org/10.1007/s10618-018-0578-6
