Abstract
In data mining, measuring the similarity between different subsets of a dataset is an important problem that has received little attention so far. In this paper, a novel method based on unsupervised learning is proposed. Each subset of a dataset is characterized by a model that implicitly corresponds to a set of prototypes, each capturing a different modality of the data. Structural differences between two subsets are thus reflected in their respective models, and differences between models are detected with a similarity measure based on data density. Experiments on synthetic and real datasets illustrate the effectiveness, efficiency, and insight provided by our approach.
This work was supported in part by the CADI project (No. ANR-07-TLOG-003), financed by the ANR (Agence Nationale de la Recherche).
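To make the idea in the abstract concrete, the following is a minimal sketch, not the authors' actual algorithm: they learn prototypes with a self-organizing map and a density-based two-level clustering, whereas this sketch substitutes plain k-means for the prototype step and a Gaussian kernel for the local density estimate. All function names, the bandwidth `h`, and the use of total variation distance between density profiles are illustrative assumptions.

```python
import numpy as np

def prototypes(X, k=8, iters=50, seed=0):
    # Plain k-means as a simplified stand-in for the prototype
    # learning stage (the paper uses a self-organizing map).
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), k, replace=False)]  # fancy indexing copies
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - C) ** 2).sum(-1), axis=1)
        for j in range(k):
            pts = X[labels == j]
            if len(pts):
                C[j] = pts.mean(axis=0)
    return C

def density_profile(X, C, h=0.5):
    # Gaussian-kernel density of X, evaluated at each prototype in C.
    d2 = ((C[:, None] - X) ** 2).sum(-1)
    return np.exp(-d2 / (2 * h * h)).mean(axis=1)

def dissimilarity(A, B, k=8, h=0.5):
    # Learn shared prototypes on the union of both subsets, then
    # compare the subsets' normalized density profiles at those
    # prototypes (total variation distance, in [0, 1]).
    C = prototypes(np.vstack([A, B]), k)
    pa = density_profile(A, C, h)
    pb = density_profile(B, C, h)
    pa, pb = pa / pa.sum(), pb / pb.sum()
    return 0.5 * np.abs(pa - pb).sum()
```

Two subsets drawn from the same distribution yield a small dissimilarity, while a shifted subset concentrates its density mass on different prototypes and yields a value close to 1.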
© 2009 Springer-Verlag Berlin Heidelberg
Cabanes, G., Bennani, Y. (2009). Comparing Large Datasets Structures through Unsupervised Learning. In: Leung, C.S., Lee, M., Chan, J.H. (eds) Neural Information Processing. ICONIP 2009. Lecture Notes in Computer Science, vol 5863. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-10677-4_62
Print ISBN: 978-3-642-10676-7
Online ISBN: 978-3-642-10677-4