Distributed Data Mining vs. Sampling Techniques: A Comparison

Aounallah, Mohamed; Quirion, Sébastien; Mineau, Guy W.

doi:10.1007/978-3-540-24840-8_37

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3060)

Included in the following conference series:

Conference of the Canadian Society for Computational Studies of Intelligence

Abstract

To address the problem of mining a huge volume of geographically distributed databases, we propose two approaches. The first is to download only a sample of each database. The second is to mine each distributed database remotely, download the resulting models to a central site, and aggregate these models there. In this paper, we present an overview of the most common sampling techniques. We then present a new distributed data-mining technique based on rule set models, in which aggregation relies on a confidence coefficient associated with each rule and on very small samples from each database. Finally, we compare the best sampling techniques that we found in the literature with our model aggregation approach.
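
The two approaches can be contrasted with a minimal Python sketch. It is an illustration under stated assumptions, not the authors' algorithm: the names `Rule`, `draw_sample`, `confidence`, and `aggregate`, as well as the `threshold` parameter, are hypothetical, and the confidence coefficient is approximated here as the fraction of covered sample records that a rule labels correctly. The paper may define the coefficient and the aggregation step differently.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Tuple

# A record is (attribute values, class label). Hypothetical representation.
Record = Tuple[Dict[str, float], str]


@dataclass(frozen=True)
class Rule:
    """A classification rule mined at one site (hypothetical structure)."""
    name: str
    condition: Callable[[Dict[str, float]], bool]
    label: str


def draw_sample(db: Sequence[Record], size: int, rng: random.Random) -> List[Record]:
    """First approach: simple random sample, without replacement, of one database."""
    return rng.sample(list(db), min(size, len(db)))


def confidence(rule: Rule, sample: Sequence[Record]) -> float:
    """Assumed coefficient: share of covered sample records the rule labels correctly."""
    covered = [label for attrs, label in sample if rule.condition(attrs)]
    if not covered:
        return 0.0  # the rule covers nothing in the sample
    return sum(1 for label in covered if label == rule.label) / len(covered)


def aggregate(
    rule_sets: Sequence[Sequence[Rule]],
    samples: Sequence[Sequence[Record]],
    threshold: float,
) -> List[Tuple[Rule, float]]:
    """Second approach: pool the rules mined at each site and keep those whose
    confidence on the pooled small samples reaches the (assumed) threshold."""
    pooled = [record for sample in samples for record in sample]
    kept = []
    for rules in rule_sets:
        for rule in rules:
            coeff = confidence(rule, pooled)
            if coeff >= threshold:
                kept.append((rule, coeff))
    # Highest-confidence rules first.
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept
```

In this sketch only the rule sets and a small sample per site travel to the central site, whereas the pure sampling approach would ship a larger sample and mine it centrally.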

This work is sponsored by NSERC.

Author information

Authors and Affiliations

Laboratory of Computational Intelligence, Computer Science and Software Engineering Department, Laval University, Sainte-Foy, Québec, G1K 7P4, Canada
Mohamed Aounallah, Sébastien Quirion & Guy W. Mineau

Editor information

Editors and Affiliations

University of Windsor, 401 Sunset Avenue, N9B 3P4, Windsor, Ontario, Canada
Ahmed Y. Tawfik
School of Computer Science, University of Windsor, Windsor, Ontario
Scott D. Goodwin

About this paper

Cite this paper

Aounallah, M., Quirion, S., Mineau, G.W. (2004). Distributed Data Mining vs. Sampling Techniques: A Comparison. In: Tawfik, A.Y., Goodwin, S.D. (eds) Advances in Artificial Intelligence. Canadian AI 2004. Lecture Notes in Computer Science, vol 3060. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24840-8_37

DOI: https://doi.org/10.1007/978-3-540-24840-8_37
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-22004-6
Online ISBN: 978-3-540-24840-8
eBook Packages: Springer Book Archive