ML-DS: A Novel Deterministic Sampling Algorithm for Association Rules Mining

Elsayed, Samir A. Mohamed; Rajasekaran, Sanguthevar; Ammar, Reda A.

doi:10.1007/978-3-642-31488-9_18

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 7377)

Included in the following conference series:

Industrial Conference on Data Mining

Abstract

Due to the explosive growth of data in every aspect of our lives, data mining algorithms often suffer from scalability issues. One effective way to tackle this problem is to employ sampling techniques. This paper introduces ML-DS, a novel deterministic sampling algorithm for mining association rules in large datasets. Unlike most algorithms in the literature, which rely on randomness in sampling, our algorithm is fully deterministic. Sampling proceeds in stages, and the sample in each stage is half the size of the sample in the previous stage. In any given stage, the data is partitioned into disjoint groups of equal size. A distance measure is used to determine the importance of each group in identifying accurate association rules, and the groups are sorted by this measure. Only the best 50% of the groups move to the next stage. We perform as many stages of sampling as needed to produce a sample of the desired target size. The resulting sample is then used to identify association rules. Empirical results show that our approach outperforms simple randomized sampling in accuracy and is competitive with state-of-the-art sampling algorithms in terms of both time and accuracy.
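The staged halving procedure described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. The abstract does not specify the distance measure, so the sketch assumes a Euclidean distance between a group's single-item frequency vector and that of the full dataset. The names `item_frequencies`, `group_distance`, and `halving_sample`, and the `group_size` parameter, are hypothetical and chosen only for illustration.

```python
from collections import Counter
from typing import FrozenSet, List, Sequence

Transaction = FrozenSet[str]


def item_frequencies(transactions: Sequence[Transaction],
                     items: Sequence[str]) -> List[float]:
    """Relative frequency of each item, in the order given by `items`."""
    counts = Counter(item for t in transactions for item in t)
    n = len(transactions)
    return [counts[item] / n for item in items]


def group_distance(group: Sequence[Transaction],
                   reference: Sequence[float],
                   items: Sequence[str]) -> float:
    """ASSUMED measure: Euclidean distance between the group's item
    frequencies and the reference frequencies of the full dataset."""
    freqs = item_frequencies(group, items)
    return sum((f - r) ** 2 for f, r in zip(freqs, reference)) ** 0.5


def halving_sample(transactions: Sequence[Transaction],
                   target_size: int,
                   group_size: int) -> List[Transaction]:
    """Keep the better half of the groups at each stage until the
    sample is no larger than `target_size`. No randomness is used."""
    items = sorted({item for t in transactions for item in t})
    reference = item_frequencies(transactions, items)
    current = list(transactions)
    while len(current) > target_size:
        # Partition the current data into disjoint consecutive groups.
        groups = [current[k:k + group_size]
                  for k in range(0, len(current), group_size)]
        if len(groups) < 2:
            break  # a single group cannot be halved further
        # Rank groups by distance; ties are broken by position so the
        # outcome is fully deterministic.
        ranked = sorted(
            range(len(groups)),
            key=lambda g: (group_distance(groups[g], reference, items), g),
        )
        keep = sorted(ranked[:len(groups) // 2])  # best 50%, original order
        current = [t for g in keep for t in groups[g]]
    return current
```

When the data size is a multiple of `group_size` and the number of groups is even, each stage exactly halves the sample, as in the abstract. Other sizes are handled only approximately in this sketch, with a shorter final group and a rounded-down half.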

This work is partially supported by the following grants: NSF0829916 and NIH-R01-LM010101

Author information

Authors and Affiliations

Computer Science Department, University of Connecticut, USA
Samir A. Mohamed Elsayed, Sanguthevar Rajasekaran & Reda A. Ammar

Editor information

Editors and Affiliations

Institute of Computer Vision and Applied Computer Sciences, IBaI, Kohlenstraße 2, 04107, Leipzig, Germany
Petra Perner

About this paper

Cite this paper

Elsayed, S.A.M., Rajasekaran, S., Ammar, R.A. (2012). ML-DS: A Novel Deterministic Sampling Algorithm for Association Rules Mining. In: Perner, P. (ed.) Advances in Data Mining. Applications and Theoretical Aspects. ICDM 2012. Lecture Notes in Computer Science, vol 7377. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-31488-9_18

DOI: https://doi.org/10.1007/978-3-642-31488-9_18
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-31487-2
Online ISBN: 978-3-642-31488-9
eBook Packages: Computer Science, Computer Science (R0)
