Abstract
Identifying clusters in large databases through unsupervised learning, now in common use, is computationally very costly. When the volume of data is large, it cannot be held in the computer's main memory. In this paper we propose a methodology (henceforth "FDM", for fast data mining) to determine the optimal sample from a database according to the relevant information in the data, based on concepts drawn from the statistical theory of communication and L∞ approximation theory. The methodology achieves significant data reduction on real databases and yields cluster models equivalent to those obtained from the original database. Data reduction is accomplished by determining the number of instances required to preserve the information present in the population. The distribution of the resulting sample is then validated with classical nonparametric statistical tests and with further tests based on minimizing the approximation error of polynomial models.
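The abstract does not spell out the FDM procedure, so the following is only a minimal sketch of the general idea it describes, not the authors' method. It assumes one numeric attribute, grows a random sample until a histogram-based Shannon entropy estimate stops changing, and then checks the sample against the population with a two-sample Kolmogorov-Smirnov statistic as one example of a classical nonparametric test. The function names and the parameters `bins`, `step`, `tol`, and `patience` are hypothetical choices made for illustration.

```python
import math
import random
from collections import Counter


def shannon_entropy(values, bins, lo, hi):
    """Entropy (in bits) of an equal-width histogram of `values` over [lo, hi]."""
    width = (hi - lo) / bins or 1.0  # guard against a constant attribute
    counts = Counter(min(int((v - lo) / width), bins - 1) for v in values)
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def entropy_stable_sample(population, bins=20, step=200, tol=0.01,
                          patience=3, seed=0):
    """Grow a random sample until its entropy estimate stops changing.

    Stops once `patience` consecutive increments of `step` items each change
    the entropy by less than `tol` bits. All defaults are hypothetical.
    """
    rng = random.Random(seed)
    order = list(population)
    rng.shuffle(order)
    # Assumption: the attribute range is known (here via one full pass).
    lo, hi = min(order), max(order)
    previous, calm = None, 0
    for size in range(step, len(order) + 1, step):
        current = shannon_entropy(order[:size], bins, lo, hi)
        if previous is not None and abs(current - previous) < tol:
            calm += 1
        else:
            calm = 0
        if calm >= patience:
            return order[:size]
        previous = current
    return order  # entropy never stabilized: keep everything


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: max gap between empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    gap = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        gap = max(gap, abs(i / len(a) - j / len(b)))
    return gap


if __name__ == "__main__":
    rng = random.Random(42)
    # Synthetic two-cluster attribute, used only to exercise the sketch.
    population = ([rng.gauss(0, 1) for _ in range(10_000)]
                  + [rng.gauss(5, 1) for _ in range(10_000)])
    sample = entropy_stable_sample(population)
    print(len(sample), ks_statistic(sample, population))
```

A small Kolmogorov-Smirnov statistic suggests that the sample and the population follow similar distributions. A real application would have to handle several attributes jointly and pick the stopping thresholds with more care than this sketch does.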
© 2010 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Kuri-Morales, A., Lozano, A. (2010). Sampling for Information and Structure Preservation When Mining Large Data Bases. In: Kuri-Morales, A., Simari, G.R. (eds) Advances in Artificial Intelligence – IBERAMIA 2010. IBERAMIA 2010. Lecture Notes in Computer Science, vol 6433. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-16952-6_18
DOI: https://doi.org/10.1007/978-3-642-16952-6_18
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-16951-9
Online ISBN: 978-3-642-16952-6
eBook Packages: Computer Science, Computer Science (R0)