Items2Data: Generating Synthetic Boolean Datasets from Itemsets

Wong, Ian Shane; Dobbie, Gillian; Koh, Yun Sing

doi:10.1007/978-3-030-12079-5_6

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11393)

Included in the following conference series:

Australasian Database Conference

Abstract

Boolean data is a core data type in machine learning, used to represent categorical and transactional data. Unlike with real-valued data, it is notoriously difficult to efficiently design boolean datasets that satisfy particular constraints. Inverse Frequent Itemset Mining (IFM) is the problem of constructing a boolean dataset that satisfies given support constraints for some itemsets. Previous work focuses mainly on the theoretical complexity of IFM, and practical solutions either scale poorly or do not satisfy all the constraints. We propose Items2Data, a practical algorithm for generating boolean datasets that is efficient under specific conditions. We introduce global closure to describe the condition under which a dataset can be constructed efficiently. We evaluate the use of Items2Data in designing synthetic datasets and analyze its accuracy, scalability and speed on real-world datasets. The results indicate that Items2Data is practical and efficient for generating synthetic boolean data when the pre-defined itemsets are globally closed.
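To make the IFM problem statement concrete, the following minimal sketch checks whether a candidate boolean dataset meets a set of support constraints. It is not the Items2Data algorithm, which the abstract does not describe in detail. It assumes that support means the absolute number of rows containing every item of an itemset, and the names `support`, `satisfies`, `dataset` and `constraints` are hypothetical, chosen only for illustration.

```python
from typing import Dict, FrozenSet, List

Row = List[bool]  # one transaction: row[i] is True when item i is present


def support(dataset: List[Row], itemset: FrozenSet[int]) -> int:
    """Count the rows in which every item (column index) of the itemset is True.

    Assumption: support is an absolute row count, not a relative frequency.
    """
    return sum(all(row[i] for i in itemset) for row in dataset)


def satisfies(dataset: List[Row], constraints: Dict[FrozenSet[int], int]) -> bool:
    """Return True when each constrained itemset has exactly its required support."""
    return all(support(dataset, items) == target
               for items, target in constraints.items())


# Hypothetical toy instance: 4 transactions over 3 items (columns 0, 1, 2).
dataset = [
    [True, True, False],
    [True, True, True],
    [False, True, True],
    [True, False, False],
]

# Hypothetical support constraints that a generated dataset should meet.
constraints = {
    frozenset({0, 1}): 2,
    frozenset({1, 2}): 2,
    frozenset({0, 1, 2}): 1,
}

if __name__ == "__main__":
    print(satisfies(dataset, constraints))
```

Checking constraints in this way is the easy direction. IFM asks for the inverse: given only `constraints`, construct a `dataset` for which the check succeeds.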

Author information

Authors and Affiliations

Department of Computer Science, University of Auckland, 38 Princes Street, Auckland, New Zealand
Ian Shane Wong, Gillian Dobbie & Yun Sing Koh

Corresponding author

Correspondence to Ian Shane Wong.

Editor information

Editors and Affiliations

University of Sydney, Sydney, NSW, Australia
Lijun Chang
University of Melbourne, Parkville, VIC, Australia
Junhao Gan
University of New South Wales, Sydney, NSW, Australia
Xin Cao

About this paper

Cite this paper

Wong, I.S., Dobbie, G., Koh, Y.S. (2019). Items2Data: Generating Synthetic Boolean Datasets from Itemsets. In: Chang, L., Gan, J., Cao, X. (eds) Databases Theory and Applications. ADC 2019. Lecture Notes in Computer Science, vol 11393. Springer, Cham. https://doi.org/10.1007/978-3-030-12079-5_6

DOI: https://doi.org/10.1007/978-3-030-12079-5_6
Published: 23 January 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-12078-8
Online ISBN: 978-3-030-12079-5
eBook Packages: Computer Science, Computer Science (R0)
