Abstract
This paper is concerned with the estimation of a local measure of intrinsic dimensionality (ID) recently proposed by Houle. The local model can be regarded as an extension of Karger and Ruhl’s expansion dimension to a statistical setting in which the distribution of distances to a query point is modeled in terms of a continuous random variable. This form of intrinsic dimensionality can be particularly useful in search, classification, outlier detection, and other contexts in machine learning, databases, and data mining, as it has been shown to be equivalent to a measure of the discriminative power of similarity functions. Several estimators of local ID are proposed and analyzed based on extreme value theory, using maximum likelihood estimation, the method of moments, probability weighted moments, and regularly varying functions. An experimental evaluation is also provided, using both real and artificial data.
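As a rough illustration of the maximum likelihood strategy named in the abstract, the sketch below computes a Hill-type estimate of local ID from the distances between a query point and its nearest neighbors. The formula used, the negative reciprocal of the mean log-ratio of each distance to the largest distance, is assumed here as the usual form of such an estimator and is not quoted from the text above. The function name `lid_mle` and the synthetic data are hypothetical and serve only as an example.

```python
import math
import random


def lid_mle(distances):
    """Hill-type maximum likelihood estimate of local intrinsic dimensionality.

    `distances` holds the distances from a query point to its neighbors.
    Assumed form (not taken from the abstract): -1 / mean(log(x_i / w)),
    where w is the largest distance in the neighborhood.
    """
    # Drop zero distances (duplicates of the query), whose logarithm is undefined.
    d = sorted(x for x in distances if x > 0)
    if len(d) < 2:
        raise ValueError("need at least two positive distances")
    w = d[-1]  # radius of the neighborhood
    mean_log_ratio = sum(math.log(x / w) for x in d) / len(d)
    if mean_log_ratio == 0:
        raise ValueError("all distances are equal; the estimate is undefined")
    return -1.0 / mean_log_ratio


# Synthetic check: for points uniform in an m-dimensional ball centered on
# the query, distances satisfy F(r) = r^m, so r can be drawn as U^(1/m).
random.seed(0)
m = 5
sample = [random.random() ** (1.0 / m) for _ in range(5000)]
estimate = lid_mle(sample)
print(f"estimated local ID: {estimate:.2f} (true dimension {m})")
```

With several thousand neighbors the estimate should land close to the true dimension. Real neighborhoods are much smaller, which is where the variance and bias of the different estimators start to matter.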
References
Alves MF, de Haan L, Lin T (2003a) Estimation of the parameter controlling the speed of convergence in extreme value theory. Math Method Stat 12(2):155–176
Alves MIF, Gomes MI, de Haan L (2003b) A new class of semi-parametric estimators of the second order parameter. Port Math 60(2):193–214
Amsaleg L, Chelly O, Furon T, Girard S, Houle ME, Kawarabayashi K, Nett M (2015) Estimating local intrinsic dimensionality. In: Proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 29–38
Balkema AA, De Haan L (1974) Residual life time at great age. Ann Probab 2(5):792–804
Beyer K, Goldstein J, Ramakrishnan R, Shaft U (1999) When is nearest neighbor meaningful? In: International conference on database theory. Springer, Berlin, pp 217–235
Beygelzimer A, Kakade S, Langford J (2006) Cover trees for nearest neighbor. In: International conference on machine learning. ACM, pp 97–104
Bingham NH, Goldie CM, Teugels JL (1989) Regular variation, vol 27. Cambridge University Press, Cambridge
Boujemaa N, Fauqueur J, Ferecatu M, Fleuret F, Gouet V, LeSaux B, Sahbi H (2001) IKONA for interactive specific and generic image retrieval. In: Proceedings of international workshop on multimedia content-based indexing and retrieval
Bouveyron C, Celeux G, Girard S (2011) Intrinsic dimension estimation by maximum likelihood in isotropic probabilistic PCA. Pattern Recogn Lett 32(14):1706–1713
Bruske J, Sommer G (1998) Intrinsic dimensionality estimation with optimally topology preserving maps. IEEE Trans Pattern Anal Mach Intell 20(5):572–575
Camastra F, Vinciarelli A (2002) Estimating the intrinsic dimension of data with a fractal-based method. IEEE Trans Pattern Anal Mach Intell 24(10):1404–1407
Cole R, Fanty M (1990) Spoken letter recognition. In: Proceedings of the third DARPA speech and natural language workshop, pp 385–390
Coles S (2001) An introduction to statistical modeling of extreme values, vol 208. Springer, Berlin
Costa JA, Hero AO III (2004) Entropic graphs for manifold learning. In: Asilomar conference on signals, systems and computers, vol 1. IEEE, pp 316–320
Dahan E, Mendelson H (2001) An extreme-value model of concept testing. Manag Sci 47(1):102–116
de Vries T, Chawla S, Houle ME (2012) Density-preserving projections for large-scale local anomaly detection. Knowl Inf Syst 32(1):25–52
Dong W, Moses C, Li K (2011) Efficient k-nearest neighbor graph construction for generic similarity measures. In: International World Wide Web conference. ACM, pp 577–586
Donoho DL, Grimes C (2003) Hessian eigenmaps: locally linear embedding techniques for high-dimensional data. Proc Natl Acad Sci 100(10):5591–5596
Faloutsos C, Kamel I (1994) Beyond uniformity and independence: analysis of R-trees using the concept of fractal dimension. In: Proceedings of the 13th ACM SIGACT–SIGMOD–SIGART symposium on principles of database systems. ACM, pp 4–13
Fisher RA, Tippett LHC (1928) Limiting forms of the frequency distribution of the largest or smallest member of a sample. Math Proc Camb Philos Soc 24(02):180–190
Fukunaga K, Olsen DR (1971) An algorithm for finding intrinsic dimensionality of data. IEEE Trans Comput 100(2):176–183
Furon T, Jégou H (2013) Using extreme value theory for image detection. Research Report RR-8244, INRIA
Gnedenko B (1943) Sur la distribution limite du terme maximum d’une série aléatoire. Ann Math 44(3):423–453
Grimshaw SD (1993) Computing maximum likelihood estimates for the Generalized Pareto Distribution. Technometrics 35(2):185–191
Gupta A, Krauthgamer R, Lee JR (2003) Bounded geometries, fractals, and low-distortion embeddings. In: Proceedings of the 44th annual IEEE symposium on foundations of computer science. IEEE, pp 534–543
Guyon I, Gunn S, Ben-Hur A, Dror G (2004) Result analysis of the NIPS 2003 feature selection challenge. In: Neural information processing systems, pp 545–552
Harris R (2001) The accuracy of design values predicted from extreme value analysis. J Wind Eng Ind Aerodyn 89(2):153–164
Hein M, Audibert JY (2005) Intrinsic dimensionality estimation of submanifolds in \({R}^d\). In: International conference on machine learning. ACM, pp 289–296
Hill BM (1975) A simple general approach to inference about the tail of a distribution. Ann Stat 3(5):1163–1174
Hosking JR, Wallis JR (1987) Parameter and quantile estimation for the Generalized Pareto Distribution. Technometrics 29(3):339–349
Houle ME (2013) Dimensionality, discriminability, density & distance distributions. In: 13th International conference on data mining workshops. IEEE, pp 468–473
Houle ME (2015) Inlierness, outlierness, hubness and discriminability: an extreme-value-theoretic foundation. Tech. Rep. 2015-002E, National Institute of Informatics
Houle ME, Kashima H, Nett M (2012a) Generalized expansion dimension. In: 12th International conference on data mining workshops. IEEE, pp 587–594
Houle ME, Ma X, Nett M, Oria V (2012b) Dimensional testing for multi-step similarity search. In: 12th International conference on data mining. IEEE, pp 299–308
Houle ME, Ma X, Oria V, Sun J (2014) Efficient algorithms for similarity search in axis-aligned subspaces. In: International conference on similarity search and applications. Springer, Berlin, pp 1–12
Jégou H, Tavenard R, Douze M, Amsaleg L (2011) Searching in one billion vectors: re-rank with source coding. In: International conference on acoustics, speech and signal processing. IEEE, pp 861–864
Jolliffe IT (1986) Principal component analysis and factor analysis. In: Principal component analysis. Springer, Berlin, pp 115–128
Karger DR, Ruhl M (2002) Finding nearest neighbors in growth-restricted metrics. In: ACM symposium on theory of computing. ACM, pp 741–750
Karhunen J, Joutsensalo J (1994) Representation and separation of signals using nonlinear PCA type learning. IEEE Trans Neural Netw 7(1):113–127
Landwehr JM, Matalas N, Wallis J (1979) Probability weighted moments compared with some traditional techniques in estimating Gumbel parameters and quantiles. Water Resour Res 15(5):1055–1064
Larrañaga P, Lozano JA (2002) Estimation of distribution algorithms: a new tool for evolutionary computation, vol 2. Springer, Berlin
Lavenda BH, Cipollone E (2000) Extreme value statistics and thermodynamics of earthquakes: aftershock sequences. Ann Geophys 43(5):967–982
LeCun Y, Bottou L, Bengio Y, Haffner P (1998) Gradient-based learning applied to document recognition. Proc IEEE 86(11):2278–2324
Levina E, Bickel PJ (2004) Maximum likelihood estimation of intrinsic dimension. In: Neural information processing systems, pp 777–784
McNulty PJ, Scheick LZ, Roth DR, Davis MG, Tortora MR (2000) First failure predictions for EPROMs of the type flown on the MPTB satellite. IEEE Trans Nucl Sci 47(6):2237–2243
Millán JdR (2004) On the need for on-line learning in brain-computer interfaces. In: Proceedings of the IEEE international joint conference on neural networks, vol 4. IEEE, pp 2877–2882
Nett M (2014) Intrinsic dimensional design and analysis of similarity search. Ph.D. thesis, University of Tokyo
Pestov V (2000) On the geometry of similarity search: dimensionality curse and concentration of measure. Inf Process Lett 73(1):47–51
Pettis KW, Bailey TA, Jain AK, Dubes RC (1979) An intrinsic dimensionality estimator from near-neighbor information. IEEE Trans Pattern Anal Mach Intell 1:25–37
Pickands J III (1975) Statistical inference using extreme order statistics. Ann Stat 3(1):119–131
Radovanović M, Nanopoulos A, Ivanović M (2010a) Hubs in space: popular nearest neighbors in high-dimensional data. J Mach Learn Res 11:2487–2531
Radovanović M, Nanopoulos A, Ivanović M (2010b) Time-series classification in many intrinsic dimensions. In: Proceedings of the 2010 SIAM international conference on data mining. Citeseer, pp 677–688
Rao CR (2009) Linear statistical inference and its applications, vol 22. Wiley, New York
Roberts SJ (2000) Extreme value statistics for novelty detection in biomedical data processing. Proc Sci Meas Technol 147:363–367
Roweis ST, Saul LK (2000) Nonlinear dimensionality reduction by locally linear embedding. Science 290(5500):2323–2326
Rozza A, Lombardi G, Ceruti C, Casiraghi E, Campadelli P (2012) Novel high intrinsic dimensionality estimators. Mach Learn J 89(1–2):37–65
Schölkopf B, Smola A, Müller KR (1998) Nonlinear component analysis as a kernel eigenvalue problem. Neural Comput 10(5):1299–1319
Shaft U, Ramakrishnan R (2006) Theory of nearest neighbors indexability. ACM Trans Database Syst 31(3):814–838
Takens F (1985) On the numerical determination of the dimension of an attractor. Springer, Berlin
Tenenbaum JB, De Silva V, Langford JC (2000) A global geometric framework for nonlinear dimensionality reduction. Science 290(5500):2319–2323
Tryon RG, Cruse TA (2000) Probabilistic mesomechanics for high cycle fatigue life prediction. J Eng Mater Technol 122(2):209–214
Venna J, Kaski S (2006) Local multidimensional scaling. IEEE Trans Neural Netw 19(6):889–899
Verveer PJ, Duin RPW (1995) An evaluation of intrinsic dimensionality estimators. IEEE Trans Pattern Anal Mach Intell 17(1):81–86
von Brünken J, Houle ME, Zimek A (2015) Intrinsic dimensional outlier detection in high-dimensional data. Tech. Rep. 2015-003E, National Institute of Informatics
Weber R, Schek HJ, Blott S (1998) A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. In: 24th international conference on very large data bases, vol 98, pp 194–205
Zhang Z, Zha H (2004) Principal manifolds and nonlinear dimensionality reduction via tangent space alignment. J Shanghai Univ (English Edition) 8(4):406–424
Additional information
Responsible editor: Jieping Ye.
L.A. and T.F. supported by French Project Secular ANR-12-CORD-0014. O.C., M.E.H. and K.K. supported by JST ERATO Kawarabayashi Project. M.E.H. supported by JSPS Kakenhi Kiban (A) Research Grant 25240036. O.C. and M.E.H. supported by JSPS Kakenhi Kiban (B) Research Grant 15H02753.
Cite this article
Amsaleg, L., Chelly, O., Furon, T. et al. Extreme-value-theoretic estimation of local intrinsic dimensionality. Data Min Knowl Disc 32, 1768–1805 (2018). https://doi.org/10.1007/s10618-018-0578-6
DOI: https://doi.org/10.1007/s10618-018-0578-6