Abstract
Recently, it has been shown that, under a broad set of conditions, commonly used distance functions become unstable in high-dimensional data spaces; that is, as the dimensionality increases, the distance from a given query point to its farthest data point approaches the distance to its nearest data point. In particular, instability has been shown to occur when the dimensions are independently distributed and normalized to zero mean and unit variance. In this paper, it is shown that the normalization condition is not necessary; it suffices that all appropriate moments are finite. Furthermore, a new distance function, the multiplicative distance, is introduced, and it is theoretically proved that this function is stable for data with independent dimensions (whether identically or non-identically distributed). In contrast to the usual distance functions, which sum the distance components over all dimensions, the multiplicative distance multiplies the distance components. Experimental results show that the multiplicative distance remains stable in high-dimensional spaces for data with both independent and correlated dimensions, and that it outperforms norm-based distances on high-dimensional data.
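To make the contrast between additive (norm-based) and multiplicative aggregation concrete, the following is a minimal sketch that compares the relative contrast (Dmax − Dmin)/Dmin of an L2 distance with that of a product-based distance on uniform random data. The specific multiplicative form used here, the product of (1 + |x_i − y_i|) over the dimensions, is an assumption of this sketch for illustration only; the exact definition of the multiplicative distance is given in the paper itself.

```python
import numpy as np

def norm_distance(x, y, p=2):
    """Standard L_p (norm) distance: sums per-dimension distance components."""
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

def multiplicative_distance(x, y):
    """Illustrative multiplicative distance: multiplies per-dimension components.
    The form prod(1 + |x_i - y_i|) is an assumption of this sketch, not
    necessarily the exact definition used in the paper."""
    return np.prod(1.0 + np.abs(x - y))

def relative_contrast(dist_fn, dim, n_points=1000, seed=0):
    """(Dmax - Dmin) / Dmin for a random query against uniform random data.
    Values approaching 0 indicate distance instability (concentration)."""
    rng = np.random.default_rng(seed)
    data = rng.random((n_points, dim))
    query = rng.random(dim)
    dists = np.array([dist_fn(query, row) for row in data])
    return (dists.max() - dists.min()) / dists.min()

if __name__ == "__main__":
    for dim in (10, 100, 1000):
        print(f"dim={dim:5d}  "
              f"L2 contrast={relative_contrast(norm_distance, dim):.4f}  "
              f"multiplicative contrast={relative_contrast(multiplicative_distance, dim):.3e}")
```

As the dimensionality grows, the L2 relative contrast should shrink toward zero (instability), while the product-based contrast should not, mirroring the stability behaviour described above.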




Cite this article
Mansouri, J., Khademi, M. Multiplicative distance: a method to alleviate distance instability for high-dimensional data. Knowl Inf Syst 45, 783–805 (2015). https://doi.org/10.1007/s10115-014-0813-4