Sampling for Information and Structure Preservation When Mining Large Data Bases

Kuri-Morales, Angel; Lozano, Alexis

doi:10.1007/978-3-642-16952-6_18

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 6433)

Included in the following conference series:

Ibero-American Conference on Artificial Intelligence

Abstract

Identifying data clusters in large databases through unsupervised learning, now in common use, is computationally very costly. When the volume of data is large, it cannot be held in the computer's main memory. In this paper we propose a methodology (henceforth "FDM", for fast data mining) to determine the optimal sample from a database according to the relevant information in the data, based on concepts drawn from the statistical theory of communication and L_∞ approximation theory. The methodology achieves significant data reduction on real databases and yields cluster models equivalent to those obtained from the original database. Data reduction is accomplished by determining the number of instances required to preserve the information present in the population. The distribution of the resulting sample is then validated with classical nonparametric statistical tests and with other tests based on minimizing the approximation error of polynomial models.
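The abstract does not spell out how the number of instances is determined, so the following is only a minimal sketch of the general idea of sampling until the information content stops changing. It is not the authors' FDM procedure. It assumes a single numeric attribute, measures information as the Shannon entropy of an equal-width histogram, and grows a random sample until the entropy changes by less than a tolerance. The function names, the bin count, the step size, and the tolerance are all hypothetical choices made for illustration.

```python
import math
import random
from collections import Counter


def shannon_entropy(values, lo, hi, bins=16):
    """Entropy in bits of values histogrammed into equal-width bins on [lo, hi]."""
    # Assumption: a fixed number of equal-width bins; guard against a zero range.
    width = (hi - lo) / bins or 1.0
    counts = Counter(min(int((v - lo) / width), bins - 1) for v in values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def entropy_stable_sample(data, step=100, tol=0.01, seed=0):
    """Grow a random sample until its entropy changes by less than tol per step."""
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)  # a prefix of a shuffled list is a simple random sample
    lo, hi = min(shuffled), max(shuffled)
    previous = None
    for size in range(step, len(shuffled) + 1, step):
        current = shannon_entropy(shuffled[:size], lo, hi)
        if previous is not None and abs(current - previous) < tol:
            return shuffled[:size]
        previous = current
    return shuffled  # no stable point found: fall back to the full data set


if __name__ == "__main__":
    rng = random.Random(1)
    population = [rng.gauss(0, 1) for _ in range(5000)]  # synthetic illustration data
    sample = entropy_stable_sample(population)
    print(len(sample), "of", len(population), "instances kept")
```

A sample chosen this way would still need the validation step the abstract describes, for example by comparing the sample and population distributions with a nonparametric test, before being used for clustering.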

Author information

Authors and Affiliations

Departamento de Computación, Instituto Tecnológico Autónomo de México, Rio Hondo No. 1, Mexico City, Mexico
Angel Kuri-Morales
Instituto de Investigaciones en Matemáticas Aplicadas y Sistemas, Universidad Nacional Autónoma de México, Ciudad Universitaria, Mexico City, Mexico
Alexis Lozano

Editor information

Editors and Affiliations

Departamento Académico de Computación, Instituto Tecnológico Autónomo de México, Río Hondo No. 1, 01000, Mexico, D.F., México
Angel Kuri-Morales
Department of Computer Science and Engineering, Universidad Nacional del Sur, Alem 1253, 8000, Bahía Blanca, Argentina
Guillermo R. Simari

About this paper

Cite this paper

Kuri-Morales, A., Lozano, A. (2010). Sampling for Information and Structure Preservation When Mining Large Data Bases. In: Kuri-Morales, A., Simari, G.R. (eds) Advances in Artificial Intelligence – IBERAMIA 2010. IBERAMIA 2010. Lecture Notes in Computer Science, vol 6433. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-16952-6_18

DOI: https://doi.org/10.1007/978-3-642-16952-6_18
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-16951-9
Online ISBN: 978-3-642-16952-6
eBook Packages: Computer Science, Computer Science (R0)
