Extreme-value-theoretic estimation of local intrinsic dimensionality

Amsaleg, Laurent; Chelly, Oussama; Furon, Teddy; Girard, Stéphane; Houle, Michael E.; Kawarabayashi, Ken-ichi; Nett, Michael

doi:10.1007/s10618-018-0578-6

Published: 27 July 2018

Data Mining and Knowledge Discovery, volume 32, pages 1768–1805 (2018)

Laurent Amsaleg¹,
Oussama Chelly ORCID: orcid.org/0000-0001-6330-3971²,
Teddy Furon³,
Stéphane Girard⁴,
Michael E. Houle²,
Ken-ichi Kawarabayashi² &
Michael Nett⁵

Abstract

This paper is concerned with the estimation of a local measure of intrinsic dimensionality (ID) recently proposed by Houle. The local model can be regarded as an extension of Karger and Ruhl’s expansion dimension to a statistical setting in which the distribution of distances to a query point is modeled in terms of a continuous random variable. This form of intrinsic dimensionality can be particularly useful in search, classification, outlier detection, and other contexts in machine learning, databases, and data mining, as it has been shown to be equivalent to a measure of the discriminative power of similarity functions. Several estimators of local ID are proposed and analyzed based on extreme value theory, using maximum likelihood estimation, the method of moments, probability weighted moments, and regularly varying functions. An experimental evaluation is also provided, using both real and artificial data.
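As an illustration of the maximum likelihood approach named in the abstract, the following is a minimal sketch of a Hill-type estimator of local ID computed from the distances between a query point and its neighbors. It assumes the commonly used form in which the estimate is the negative reciprocal of the mean log-ratio of each distance to the largest distance in the sample; the function name `lid_mle` and the synthetic sample are hypothetical, and the article itself should be consulted for the exact estimators and their analysis.

```python
import math
import random


def lid_mle(distances):
    """Hill-type maximum likelihood estimate of local ID (illustrative sketch).

    `distances` holds the distances from a query point to its neighbors.
    The largest distance is used as the neighborhood radius w (an assumption
    of this sketch).
    """
    # Zero distances (exact duplicates of the query) have no defined logarithm.
    d = [x for x in distances if x > 0]
    if len(d) < 2:
        raise ValueError("need at least two positive distances")
    w = max(d)
    # Mean of ln(x_i / w); every term is <= 0 because x_i <= w.
    mean_log = sum(math.log(x / w) for x in d) / len(d)
    if mean_log == 0.0:
        raise ValueError("all distances are equal; the estimate is undefined")
    return -1.0 / mean_log


# Synthetic check: if distances follow F(r) = r^m on [0, 1], as for points
# drawn uniformly from an m-dimensional unit ball around the query, the
# estimate should be close to m.
random.seed(0)
m = 5
sample = [random.random() ** (1.0 / m) for _ in range(2000)]
print(lid_mle(sample))
```

In practice the input would be the distances to the k nearest neighbors of the query, and the choice of k trades bias against variance.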

References

Alves MF, de Haan L, Lin T (2003a) Estimation of the parameter controlling the speed of convergence in extreme value theory. Math Method Stat 12(2):155–176
Alves MIF, Gomes MI, de Haan L (2003b) A new class of semi-parametric estimators of the second order parameter. Port Math 60(2):193–214
Amsaleg L, Chelly O, Furon T, Girard S, Houle ME, Kawarabayashi K, Nett M (2015) Estimating local intrinsic dimensionality. In: Proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 29–38
Balkema AA, De Haan L (1974) Residual life time at great age. Ann Probab 2(5):792–804
Beyer K, Goldstein J, Ramakrishnan R, Shaft U (1999) When is nearest neighbor meaningful? In: International conference on database theory. Springer, Berlin, pp 217–235
Beygelzimer A, Kakade S, Langford J (2006) Cover trees for nearest neighbor. In: International conference on machine learning. ACM, pp 97–104
Bingham NH, Goldie CM, Teugels JL (1989) Regular variation, vol 27. Cambridge University Press, Cambridge
Boujemaa N, Fauqueur J, Ferecatu M, Fleuret F, Gouet V, LeSaux B, Sahbi H (2001) IKONA for interactive specific and generic image retrieval. In: Proceedings of international workshop on multimedia content-based indexing and retrieval
Bouveyron C, Celeux G, Girard S (2011) Intrinsic dimension estimation by maximum likelihood in isotropic probabilistic PCA. Pattern Recogn Lett 32(14):1706–1713
Bruske J, Sommer G (1998) Intrinsic dimensionality estimation with optimally topology preserving maps. IEEE Trans Pattern Anal Mach Intell 20(5):572–575
Camastra F, Vinciarelli A (2002) Estimating the intrinsic dimension of data with a fractal-based method. IEEE Trans Pattern Anal Mach Intell 24(10):1404–1407
Cole R, Fanty M (1990) Spoken letter recognition. In: Proceedings of the third DARPA speech and natural language workshop, pp 385–390
Coles S, Bawa J, Trenner L, Dorazio P (2001) An introduction to statistical modeling of extreme values, vol 208. Springer, Berlin
Costa JA, Hero AO III (2004) Entropic graphs for manifold learning. In: Asilomar conference on signals, systems and computers, vol 1. IEEE, pp 316–320
Dahan E, Mendelson H (2001) An extreme-value model of concept testing. Manag Sci 47(1):102–116
de Vries T, Chawla S, Houle ME (2012) Density-preserving projections for large-scale local anomaly detection. Knowl Inf Syst 32(1):25–52
Dong W, Moses C, Li K (2011) Efficient k-nearest neighbor graph construction for generic similarity measures. In: International World Wide Web conference. ACM, pp 577–586
Donoho DL, Grimes C (2003) Hessian eigenmaps: locally linear embedding techniques for high-dimensional data. Proc Natl Acad Sci 100(10):5591–5596
Faloutsos C, Kamel I (1994) Beyond uniformity and independence: analysis of R-trees using the concept of fractal dimension. In: Proceedings of the 13th ACM SIGACT–SIGMOD–SIGART symposium on principles of database systems. ACM, pp 4–13
Fisher RA, Tippett LHC (1928) Limiting forms of the frequency distribution of the largest or smallest member of a sample. Math Proc Camb Philos Soc 24(02):180–190
Fukunaga K, Olsen DR (1971) An algorithm for finding intrinsic dimensionality of data. IEEE Trans Comput 100(2):176–183
Furon T, Jégou H (2013) Using extreme value theory for image detection. Research Report RR-8244, INRIA
Gnedenko B (1943) Sur la distribution limite du terme maximum d’une série aléatoire. Ann Math 44(3):423–453
Grimshaw SD (1993) Computing maximum likelihood estimates for the Generalized Pareto Distribution. Technometrics 35(2):185–191
Gupta A, Krauthgamer R, Lee JR (2003) Bounded geometries, fractals, and low-distortion embeddings. In: Proceedings of the 44th annual IEEE symposium on foundations of computer science. IEEE, pp 534–543
Guyon I, Gunn S, Ben-Hur A, Dror G (2004) Result analysis of the NIPS 2003 feature selection challenge. In: Neural information processing systems, pp 545–552
Harris R (2001) The accuracy of design values predicted from extreme value analysis. J Wind Eng Ind Aerodyn 89(2):153–164
Hein M, Audibert JY (2005) Intrinsic dimensionality estimation of submanifolds in \({R}^d\). In: International conference on machine learning. ACM, pp 289–296
Hill BM et al (1975) A simple general approach to inference about the tail of a distribution. Ann Stat 3(5):1163–1174
Hosking JR, Wallis JR (1987) Parameter and quantile estimation for the Generalized Pareto Distribution. Technometrics 29(3):339–349
Houle ME (2013) Dimensionality, discriminability, density & distance distributions. In: 13th International conference on data mining workshops. IEEE, pp 468–473
Houle ME (2015) Inlierness, outlierness, hubness and discriminability: an extreme-value-theoretic foundation. Tech. Rep. 2015-002E, National Institute of Informatics
Houle ME, Kashima H, Nett M (2012a) Generalized expansion dimension. In: 12th international conference on data mining workshops, IEEE, pp 587–594
Houle ME, Ma X, Nett M, Oria V (2012b) Dimensional testing for multi-step similarity search. In: 12th International conference on data mining. IEEE, pp 299–308
Houle ME, Ma X, Oria V, Sun J (2014) Efficient algorithms for similarity search in axis-aligned subspaces. In: International conference on similarity search and applications. Springer, Berlin, pp 1–12
Jégou H, Tavenard R, Douze M, Amsaleg L (2011) Searching in one billion vectors: re-rank with source coding. In: International conference on acoustics, speech and signal processing. IEEE, pp 861–864
Jolliffe IT (1986) Principal component analysis and factor analysis. In: Principal component analysis. Springer, Berlin, pp 115–128
Karger DR, Ruhl M (2002) Finding nearest neighbors in growth-restricted metrics. In: ACM symposium on theory of computing. ACM, pp 741–750
Karhunen J, Joutsensalo J (1994) Representation and separation of signals using nonlinear PCA type learning. IEEE Trans Neural Netw 7(1):113–127
Landwehr JM, Matalas N, Wallis J (1979) Probability weighted moments compared with some traditional techniques in estimating Gumbel parameters and quantiles. Water Resour Res 15(5):1055–1064
Larrañaga P, Lozano JA (2002) Estimation of distribution algorithms: a new tool for evolutionary computation, vol 2. Springer, Berlin
Lavenda BH, Cipollone E (2000) Extreme value statistics and thermodynamics of earthquakes: aftershock sequences. Ann Geophys 43(5):967–982
LeCun Y, Bottou L, Bengio Y, Haffner P (1998) Gradient-based learning applied to document recognition. Proc IEEE 86(11):2278–2324
Levina E, Bickel PJ (2004) Maximum likelihood estimation of intrinsic dimension. In: Neural information processing systems, pp 777–784
McNulty PJ, Scheick LZ, Roth DR, Davis MG, Tortora MR (2000) First failure predictions for EPROMs of the type flown on the MPTB satellite. IEEE Trans Nucl Sci 47(6):2237–2243
Millán JdR (2004) On the need for on-line learning in brain-computer interfaces. In: Proceedings of the IEEE international joint conference on neural networks, vol 4. IEEE, pp 2877–2882
Nett M (2014) Intrinsic dimensional design and analysis of similarity search. Ph.D. thesis, University of Tokyo
Pestov V (2000) On the geometry of similarity search: dimensionality curse and concentration of measure. Inf Process Lett 73(1):47–51
Pettis KW, Bailey TA, Jain AK, Dubes RC (1979) An intrinsic dimensionality estimator from near-neighbor information. IEEE Trans Pattern Anal Mach Intell 1:25–37
Pickands J III (1975) Statistical inference using extreme order statistics. Ann Stat 3(1):119–131
Radovanović M, Nanopoulos A, Ivanović M (2010a) Hubs in space: popular nearest neighbors in high-dimensional data. J Mach Learn Res 11:2487–2531
Radovanović M, Nanopoulos A, Ivanović M (2010b) Time-series classification in many intrinsic dimensions. In: Proceedings of the 2010 SIAM international conference on data mining. Citeseer, pp 677–688
Rao CR (2009) Linear statistical inference and its applications, vol 22. Wiley, New York
Roberts SJ (2000) Extreme value statistics for novelty detection in biomedical data processing. Proc Sci Meas Technol 147:363–367
Roweis ST, Saul LK (2000) Nonlinear dimensionality reduction by locally linear embedding. Science 290(5500):2323–2326
Rozza A, Lombardi G, Ceruti C, Casiraghi E, Campadelli P (2012) Novel high intrinsic dimensionality estimators. Mach Learn J 89(1–2):37–65
Schölkopf B, Smola A, Müller KR (1998) Nonlinear component analysis as a kernel eigenvalue problem. Neural Comput 10(5):1299–1319
Shaft U, Ramakrishnan R (2006) Theory of nearest neighbors indexability. ACM Trans Database Syst 31(3):814–838
Takens F (1985) On the numerical determination of the dimension of an attractor. Springer, Berlin
Tenenbaum JB, De Silva V, Langford JC (2000) A global geometric framework for nonlinear dimensionality reduction. Science 290(5500):2319–2323
Tryon RG, Cruse TA (2000) Probabilistic mesomechanics for high cycle fatigue life prediction. J Eng Mater Technol 122(2):209–214
Venna J, Kaski S (2006) Local multidimensional scaling. IEEE Trans Neural Netw 19(6):889–899
Verveer PJ, Duin RPW (1995) An evaluation of intrinsic dimensionality estimators. IEEE Trans Pattern Anal Mach Intell 17(1):81–86
von Brünken J, Houle ME, Zimek A (2015) Intrinsic dimensional outlier detection in high-dimensional data. Tech. Rep. 2015-003E, National Institute of Informatics
Weber R, Schek HJ, Blott S (1998) A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. In: 24th international conference on very large data bases, vol 98, pp 194–205
Zhang Z, Zha H (2004) Principal manifolds and nonlinear dimensionality reduction via tangent space alignment. J Shanghai Univ (English Edition) 8(4):406–424

Author information

Authors and Affiliations

Equipe LINKMEDIA - CNRS/IRISA Rennes, Campus Universitaire de Beaulieu, 35042, Rennes Cedex, France
Laurent Amsaleg
National Institute of Informatics, 2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo, 101-8430, Japan
Oussama Chelly, Michael E. Houle & Ken-ichi Kawarabayashi
Equipe LINKMEDIA - INRIA Rennes, Campus Universitaire de Beaulieu, 35042, Rennes Cedex, France
Teddy Furon
Equipe MISTIS - INRIA Grenoble, Inovallée, 655, Montbonnot, 38334, Saint-Ismier Cedex, France
Stéphane Girard
Google Japan, 6-10-1 Roppongi, Minato-ku, Tokyo, 106-6126, Japan
Michael Nett

Corresponding author

Correspondence to Michael E. Houle.

Additional information

Responsible editor: Jieping Ye.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

L.A. and T.F. supported by French Project Secular ANR-12-CORD-0014. O.C., M.E.H. and K.K. supported by JST ERATO Kawarabayashi Project. M.E.H. supported by JSPS Kakenhi Kiban (A) Research Grant 25240036. O.C. and M.E.H. supported by JSPS Kakenhi Kiban (B) Research Grant 15H02753.

About this article

Cite this article

Amsaleg, L., Chelly, O., Furon, T. et al. Extreme-value-theoretic estimation of local intrinsic dimensionality. Data Min Knowl Disc 32, 1768–1805 (2018). https://doi.org/10.1007/s10618-018-0578-6

Received: 16 January 2016
Accepted: 14 March 2018
Published: 27 July 2018
Issue Date: November 2018
DOI: https://doi.org/10.1007/s10618-018-0578-6
