Distributed Data Mining vs. Sampling Techniques: A Comparison

  • Conference paper
Advances in Artificial Intelligence (Canadian AI 2004)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3060)

Abstract

To address the problem of mining a huge volume of geographically distributed databases, we propose two approaches. The first is to download only a sample of each database. The second is to mine each distributed database remotely, download the resulting models to a central site, and then aggregate these models. In this paper, we present an overview of the most common sampling techniques. We then present a new distributed data-mining technique based on rule set models, in which the aggregation relies on a confidence coefficient associated with each rule and on very small samples from each database. Finally, we compare the best sampling techniques that we found in the literature with our model aggregation approach.
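The second approach can be illustrated with a minimal sketch. It is not the authors' algorithm: the abstract does not define the confidence coefficient, so the code below substitutes the plain empirical accuracy of a rule on the records it covers, and the names `Rule`, `draw_sample`, `rule_confidence`, `aggregate`, and the `threshold` parameter are all hypothetical. It assumes each site has already mined its own rule set and ships that rule set plus a very small sample to the central site.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Record = Dict[str, object]


@dataclass
class Rule:
    """Hypothetical rule: if condition(record) holds, predict `label`."""
    name: str
    condition: Callable[[Record], bool]
    label: str


def draw_sample(db: List[Record], fraction: float, seed: int = 0) -> List[Record]:
    """Simple random sample without replacement (at least one record)."""
    rng = random.Random(seed)
    k = max(1, int(len(db) * fraction))
    return rng.sample(db, k)


def rule_confidence(rule: Rule, sample: List[Record], class_key: str = "class") -> float:
    """Stand-in confidence: fraction of covered records the rule labels correctly."""
    covered = [r for r in sample if rule.condition(r)]
    if not covered:
        return 0.0  # the rule covers nothing in the sample
    correct = sum(1 for r in covered if r[class_key] == rule.label)
    return correct / len(covered)


def aggregate(rule_sets: List[List[Rule]],
              samples: List[List[Record]],
              threshold: float = 0.5) -> List[Tuple[Rule, float]]:
    """Pool the small samples, score every remote rule, keep the confident ones."""
    pooled = [record for sample in samples for record in sample]
    kept: List[Tuple[Rule, float]] = []
    seen = set()
    for rules in rule_sets:
        for rule in rules:
            if rule.name in seen:  # skip rules reported by several sites
                continue
            seen.add(rule.name)
            conf = rule_confidence(rule, pooled)
            if conf >= threshold:
                kept.append((rule, conf))
    # Most confident rules first, so they can be tried first at prediction time.
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept
```

In this sketch only the rule sets and the small samples travel over the network, which is the trade-off the abstract contrasts with downloading a larger sample of each database and mining it centrally.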

This work is sponsored by NSERC.

Copyright information

© 2004 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Aounallah, M., Quirion, S., Mineau, G.W. (2004). Distributed Data Mining vs. Sampling Techniques: A Comparison. In: Tawfik, A.Y., Goodwin, S.D. (eds) Advances in Artificial Intelligence. Canadian AI 2004. Lecture Notes in Computer Science, vol 3060. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24840-8_37

  • DOI: https://doi.org/10.1007/978-3-540-24840-8_37

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-22004-6

  • Online ISBN: 978-3-540-24840-8

  • eBook Packages: Springer Book Archive
