Sampling for Information and Structure Preservation When Mining Large Data Bases

  • Conference paper
Advances in Artificial Intelligence – IBERAMIA 2010 (IBERAMIA 2010)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 6433)

Abstract

The unsupervised learning process of identifying data clusters in large databases, in common use nowadays, is computationally very expensive. The volume of data to be analyzed is often too large to fit in the computer's main memory. In this paper we propose a methodology (henceforth "FDM", for fast data mining) to determine the optimal sample from a database according to the relevant information in the data, based on concepts drawn from the statistical theory of communication and L∞ approximation theory. The methodology achieves significant data reduction on real databases and yields cluster models equivalent to those obtained from the original database. Data reduction is accomplished by determining the number of instances required to preserve the information present in the population. Special effort is then put into validating the distribution of the obtained sample through classical nonparametric statistical tests and other tests based on minimizing the approximation error of polynomial models.
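
The abstract does not spell out the FDM procedure, so the following is only a minimal sketch of the general idea it describes, not the authors' method. It assumes that "preserving information" can be approximated by growing a random sample until its Shannon entropy stops changing, and it uses the two-sample Kolmogorov-Smirnov statistic (the largest gap between two empirical distribution functions) as one example of a classical nonparametric check. All function names, the bin count, the step size, the tolerance, and the synthetic Beta-distributed population are hypothetical choices made for illustration.

```python
import math
import random
from collections import Counter


def shannon_entropy(values, bins=20, lo=0.0, hi=1.0):
    """Shannon entropy (in bits) of values discretized into equal-width bins."""
    width = (hi - lo) / bins
    # Clamp the top edge so that v == hi falls into the last bin.
    counts = Counter(min(int((v - lo) / width), bins - 1) for v in values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def entropy_stable_sample(population, step=500, tol=0.01, seed=0):
    """Grow a random sample until its entropy changes by less than tol bits.

    Assumption for illustration: entropy stabilization stands in for the
    paper's criterion for the 'adequate number of instances'.
    """
    rng = random.Random(seed)
    shuffled = population[:]
    rng.shuffle(shuffled)
    previous = None
    for size in range(step, len(shuffled) + 1, step):
        current = shannon_entropy(shuffled[:size])
        if previous is not None and abs(current - previous) < tol:
            return shuffled[:size]
        previous = current
    return shuffled  # entropy never stabilized: fall back to all the data


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: max gap between empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both pointers past every value <= x, then compare the CDFs.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


# Hypothetical synthetic "database": 20,000 values in [0, 1].
_rng = random.Random(42)
population = [_rng.betavariate(2, 5) for _ in range(20000)]
sample = entropy_stable_sample(population)
reduction = 1 - len(sample) / len(population)
gap = ks_statistic(sample, population)
```

Here `reduction` is the fraction of instances discarded and `gap` measures how far the sample's distribution is from the population's; a small `gap` suggests the sample is a reasonable stand-in for clustering, under the assumptions stated above.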

Copyright information

© 2010 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Kuri-Morales, A., Lozano, A. (2010). Sampling for Information and Structure Preservation When Mining Large Data Bases. In: Kuri-Morales, A., Simari, G.R. (eds) Advances in Artificial Intelligence – IBERAMIA 2010. IBERAMIA 2010. Lecture Notes in Computer Science, vol 6433. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-16952-6_18

  • DOI: https://doi.org/10.1007/978-3-642-16952-6_18

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-16951-9

  • Online ISBN: 978-3-642-16952-6

  • eBook Packages: Computer Science, Computer Science (R0)
