Preserving the Confidentiality of Categorical Statistical Data Bases When Releasing Information for Association Rules*

Fienberg, Stephen E.; Slavkovic, Aleksandra B.

doi:10.1007/s10618-005-0010-x


Published: 14 September 2005

Data Mining and Knowledge Discovery, Volume 11, pages 155–180 (2005)

Stephen E. Fienberg¹ &
Aleksandra B. Slavkovic²


Abstract

In the statistical literature, there has been considerable development of methods for releasing multivariate categorical data sets, where the releases take the form of marginal tables corresponding to subsets of the categorical variables. Very recently, some of these ideas have been extended to allow the release of combinations of marginal tables and conditional tables for subsets of variables. Association rules can be viewed as conditional tables. In this paper, we consider the inferences an intruder can make about confidential categorical data following the release of information on one or more association rules. We illustrate this with several examples.
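To make the link between association rules and conditional tables concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the method developed in the paper: the function names, the 2×2 layout, and the example counts are all hypothetical. For two binary items A and B, the confidence of the rule A ⇒ B is one entry of the conditional table P(B | A), and the support is one entry of the joint table. If an intruder is additionally assumed to know the total count n, these two released numbers are enough to recover exact cell counts.

```python
from fractions import Fraction


def rule_measures(table):
    """Support and confidence of the rule A => B from a 2x2 table of counts.

    Hypothetical layout: table[i][j] is the number of records with A == i
    and B == j, where index 1 means the item is present.
    """
    n = sum(sum(row) for row in table)
    n_ab = table[1][1]                 # records containing both A and B
    n_a = table[1][0] + table[1][1]    # records containing A
    support = Fraction(n_ab, n)        # joint relative frequency P(A, B)
    confidence = Fraction(n_ab, n_a)   # conditional relative frequency P(B | A)
    return support, confidence


def recover_counts(support, confidence, n):
    """What an intruder could reconstruct, ASSUMING the total n is known.

    Returns the counts (A and B, A and not B).
    """
    n_ab = support * n
    n_a = n_ab / confidence
    return int(n_ab), int(n_a - n_ab)


# Made-up counts, used only for illustration.
table = [[50, 30],
         [15, 5]]
support, confidence = rule_measures(table)
print(support, confidence)
print(recover_counts(support, confidence, 100))
```

Exact fractions are used instead of floats so that the recovered counts are not affected by rounding. When n is not known, the released values only constrain the cells rather than pin them down, which is the setting where bounds on cell entries become the relevant question.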



Acknowledgments

We owe special thanks to Alan Karr for drawing our attention to the close correspondence between the confidentiality problems we have been working on and those associated with association rule mining. We are indebted to the referees for comments, references, and suggestions that helped to emphasize the complementary nature of the statistical and data mining literatures. This research is part of several larger efforts focused on privacy and confidentiality, including a project coordinated by the National Institute of Statistical Sciences involving several U.S. federal statistical agencies.

Author information

Authors and Affiliations

Department of Statistics, CyLab, and Center for Automated Learning and Discovery, Carnegie Mellon University, Pittsburgh, PA, 15213-3890, USA
Stephen E. Fienberg
Department of Statistics, Pennsylvania State University, University Park, PA, 16802, USA
Aleksandra B. Slavkovic


Corresponding author

Correspondence to Stephen E. Fienberg.

Additional information

^*The research reported here was supported in part by NSF grants EIA–9876619 and IIS–0131884 to the National Institute of Statistical Sciences, as well as by Grant R01-AG023141 from the NIH to the Department of Statistics and by Army contract DAAD19-02-1-3-0389 to CyLab, both at Carnegie Mellon University.

Editor: Geoff Webb


About this article

Cite this article

Fienberg, S.E., Slavkovic, A.B. Preserving the Confidentiality of Categorical Statistical Data Bases When Releasing Information for Association Rules. Data Min Knowl Disc 11, 155–180 (2005). https://doi.org/10.1007/s10618-005-0010-x

Received: 27 October 2004
Accepted: 14 March 2005
Published: 14 September 2005
Issue Date: September 2005
DOI: https://doi.org/10.1007/s10618-005-0010-x
