Items2Data: Generating Synthetic Boolean Datasets from Itemsets

  • Conference paper
Databases Theory and Applications (ADC 2019)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11393)

Abstract

Boolean data is a core data type in machine learning, used to represent categorical and transactional data. Unlike real-valued data, boolean datasets that satisfy particular constraints are notoriously difficult to design efficiently. Inverse Frequent Itemset Mining (IFM) is the problem of constructing a boolean dataset that satisfies given support constraints for some itemsets. Previous work focuses mainly on the theoretical complexity of IFM, and practical solutions either scale poorly or do not satisfy all the constraints. We propose Items2Data, a practical algorithm for generating boolean datasets that is efficient under specific conditions. We introduce global closure to describe the condition under which a dataset can be constructed efficiently. We evaluate Items2Data for designing synthetic datasets and analyze its accuracy, scalability, and speed on real-world datasets. The results indicate that Items2Data is practical and efficient for generating synthetic boolean data when the pre-defined itemsets are globally closed.
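
To make the IFM problem statement concrete, the following minimal sketch checks whether a candidate boolean dataset meets a set of support constraints. It is not the Items2Data algorithm, which the abstract does not describe in detail; it only illustrates what satisfying the constraints means. The function names `support` and `satisfies`, the toy items, and the assumption that each constraint fixes an exact absolute support count are all illustrative choices, not taken from the paper.

```python
from typing import Dict, FrozenSet, List

# A boolean dataset is modelled here as a list of transactions;
# each transaction is the set of items whose column is True in that row.
Transaction = FrozenSet[str]


def support(dataset: List[Transaction], itemset: FrozenSet[str]) -> int:
    """Count the transactions that contain every item in `itemset`."""
    return sum(1 for row in dataset if itemset <= row)


def satisfies(dataset: List[Transaction],
              constraints: Dict[FrozenSet[str], int]) -> bool:
    """Return True if every itemset has exactly its required support.

    Assumption: constraints are exact absolute counts (hypothetical choice).
    """
    return all(support(dataset, itemset) == target
               for itemset, target in constraints.items())


# Toy candidate dataset with three transactions over items a, b, c.
dataset = [
    frozenset({"a", "b"}),
    frozenset({"a", "b", "c"}),
    frozenset({"c"}),
]

# Toy support constraints for three itemsets.
constraints = {
    frozenset({"a", "b"}): 2,
    frozenset({"c"}): 2,
    frozenset({"a", "c"}): 1,
}

print(satisfies(dataset, constraints))
```

Verifying a candidate dataset in this way is cheap; the difficulty the abstract refers to lies in the inverse direction, namely constructing a dataset that passes the check for all given itemsets at once.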

Author information

Corresponding author

Correspondence to Ian Shane Wong.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Wong, I.S., Dobbie, G., Koh, Y.S. (2019). Items2Data: Generating Synthetic Boolean Datasets from Itemsets. In: Chang, L., Gan, J., Cao, X. (eds) Databases Theory and Applications. ADC 2019. Lecture Notes in Computer Science, vol 11393. Springer, Cham. https://doi.org/10.1007/978-3-030-12079-5_6

  • DOI: https://doi.org/10.1007/978-3-030-12079-5_6

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-12078-8

  • Online ISBN: 978-3-030-12079-5

  • eBook Packages: Computer Science, Computer Science (R0)
