Is the Distance Compression Effect Overstated? Some Theory and Experimentation

France, Stephen; Carroll, Douglas

doi:10.1007/978-3-642-03070-3_21

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 5632)

Abstract

Previous work in the document clustering literature has shown that the Minkowski-p distance metrics are unsuitable for clustering very high dimensional document data. This unsuitability is attributed to the “compression” of distances that the Minkowski-p metrics produce on high dimensional data. Previous experimental work on distance compression has generally used the performance of clustering algorithms, run on distances created by the different metrics, as a proxy for the quality of the distance representations those metrics create. To separate the effects of the distances from the performance of the clustering algorithms, we tested the homogeneity of the latent classes with respect to item neighborhoods rather than the homogeneity of clustering solutions with respect to latent classes. We show the theoretical relationships between the cosine, correlation, and Euclidean metrics. We posit that some of the performance differential between the cosine and correlation metrics and the Minkowski-p metrics is due to the inbuilt normalization of the cosine and correlation metrics. The normalization effect decreases with increasing dimensionality, whereas the distance compression effect increases with it. For document datasets with dimensionality up to 20,000, the normalization effect dominates the distance compression effect. We propose a methodology for measuring the relative normalization and distance compression effects.
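
The relationships mentioned in the abstract can be illustrated numerically. The sketch below is not taken from the paper. It assumes two standard identities: the squared Euclidean distance between unit-normalized vectors equals 2(1 - cosine similarity), and the Pearson correlation is the cosine similarity of the mean-centered vectors. It also computes a relative contrast, (max - min) / min over the distances from a query point, as one common way to observe distance compression on uniform random data. All function names, the random data, and the chosen dimensions are illustrative assumptions.

```python
import math
import random


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def norm(x):
    return math.sqrt(dot(x, x))


def unit(x):
    """Scale x to unit Euclidean length."""
    n = norm(x)
    return [a / n for a in x]


def center(x):
    """Subtract the mean of x from every component."""
    m = sum(x) / len(x)
    return [a - m for a in x]


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def cosine_similarity(x, y):
    return dot(x, y) / (norm(x) * norm(y))


def pearson_correlation(x, y):
    # Correlation is the cosine similarity of the mean-centered vectors.
    return cosine_similarity(center(x), center(y))


def relative_contrast(dim, n_points=200, seed=0):
    """(max - min) / min of Euclidean distances from one random query point
    to n_points uniform random points in the unit hypercube (assumed setup)."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = [
        euclidean(query, [rng.random() for _ in range(dim)])
        for _ in range(n_points)
    ]
    return (max(dists) - min(dists)) / min(dists)


rng = random.Random(42)
x = [rng.random() for _ in range(50)]
y = [rng.random() for _ in range(50)]

# Identity 1: squared Euclidean distance on unit vectors vs. cosine similarity.
sq_dist_unit = euclidean(unit(x), unit(y)) ** 2
from_cosine = 2.0 * (1.0 - cosine_similarity(x, y))

# Identity 2: the same relation after mean-centering gives the correlation.
sq_dist_centered = euclidean(unit(center(x)), unit(center(y))) ** 2
from_correlation = 2.0 * (1.0 - pearson_correlation(x, y))

# Distance compression: relative contrast in low vs. high dimension.
contrast_low = relative_contrast(2)
contrast_high = relative_contrast(2000)

print(sq_dist_unit, from_cosine)
print(sq_dist_centered, from_correlation)
print(contrast_low, contrast_high)
```

On this synthetic data the two identities hold to floating-point precision, and the relative contrast at 2,000 dimensions is smaller than at 2 dimensions. Real document data is sparse and far from uniform, so this only illustrates the mechanism and says nothing about the relative size of the two effects reported in the paper.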

Author information

Authors and Affiliations

Lubar School of Business, UW – Milwaukee, 3202 N. Maryland Avenue, Milwaukee, Wisconsin, 53201-0742
Stephen France
Graduate School of Management, Newark, Rutgers University, New Jersey, 07102-3027
Douglas Carroll

Editor information

Editors and Affiliations

Institut für Bildverarbeitung und angewandte Informatik, Körnerstr. 10, 04107, Leipzig, Germany
Petra Perner

About this paper

Cite this paper

France, S., Carroll, D. (2009). Is the Distance Compression Effect Overstated? Some Theory and Experimentation. In: Perner, P. (eds) Machine Learning and Data Mining in Pattern Recognition. MLDM 2009. Lecture Notes in Computer Science, vol 5632. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-03070-3_21

DOI: https://doi.org/10.1007/978-3-642-03070-3_21
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-03069-7
Online ISBN: 978-3-642-03070-3
eBook Packages: Computer Science, Computer Science (R0)
