Abstract
Exploratory data analysis is a fundamental stage in the data mining of high-dimensional datasets. Several algorithms have been implemented to give a general idea of the geometry and patterns present in high-dimensional data. Here, we present a methodology based on the distance matrix of the input data. The algorithm is based on the number of points considered to be neighbors of each input vector. Neighborhood is defined in terms of a hypersphere of varying radius, and the probability density function of the number of neighbor vectors is computed from the distance matrix. We show that when the radius of the hypersphere is systematically increased, a detailed analysis of the probability density function of the number of neighbors reveals relevant aspects of the overall features that describe the high-dimensional data. The algorithm is tested on several datasets, and we show its pertinence as an exploratory data analysis tool.
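The procedure outlined in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes Euclidean distance, treats a normalized histogram of neighbor counts as the estimate of the probability density function, and uses a hypothetical function name (`neighbor_count_distributions`) and a hypothetical list of radii.

```python
import numpy as np

def neighbor_count_distributions(X, radii):
    """Return, for each radius, the empirical distribution of neighbor counts.

    Assumptions (not taken from the paper): Euclidean distance, non-negative
    radii, and a normalized histogram as the density estimate.
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    # Pairwise Euclidean distance matrix of shape (n, n).
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=-1))
    distributions = {}
    for r in radii:
        # Points inside the hypersphere of radius r centered on each vector,
        # minus one so that a vector is not counted as its own neighbor.
        counts = (D <= r).sum(axis=1) - 1
        # Fraction of vectors having exactly k neighbors, for k = 0..n-1.
        distributions[r] = np.bincount(counts, minlength=n) / n
    return distributions

# Toy data: two nearby points and one distant point.
X_demo = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0]]
demo = neighbor_count_distributions(X_demo, radii=[0.5, 1.5, 100.0])
```

Comparing the distributions across increasing radii is where the exploratory step would take place; in the toy data, the mass moves from zero neighbors to the maximum count as the radius grows.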
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Neme, A., Nido, A. (2013). Exploratory Data Analysis through the Inspection of the Probability Density Function of the Number of Neighbors. In: Tucker, A., Höppner, F., Siebes, A., Swift, S. (eds) Advances in Intelligent Data Analysis XII. IDA 2013. Lecture Notes in Computer Science, vol 8207. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-41398-8_27
DOI: https://doi.org/10.1007/978-3-642-41398-8_27
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-41397-1
Online ISBN: 978-3-642-41398-8
eBook Packages: Computer Science, Computer Science (R0)