Abstract
In multi-relational databases, a view, which is a context- and content-dependent subset of one or more tables (or other views), is often used to preserve privacy by hiding sensitive information. However, recent developments in data mining present a new challenge for database security, even when traditional techniques such as database access control are employed. This paper presents a data mining framework based on semi-supervised learning that demonstrates the potential for privacy leakage in multi-relational databases. Many semi-supervised learning techniques, such as the K-nearest neighbor (KNN) method, can be used to demonstrate privacy leakage. We also introduce a new approach, hyperclique pattern-based semi-supervised learning (HPSL), which differs from traditional semi-supervised learning approaches in that it considers the similarity among groups of objects instead of only pairs of objects. Our experimental results show that both KNN and HPSL can compromise database security, although HPSL achieves higher prediction accuracy than KNN and is therefore the more effective attack. Finally, we provide a principle for avoiding privacy leakage in multi-relational databases via semi-supervised learning and illustrate it with a simple preventive technique whose effectiveness is demonstrated by experiments.
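To make the contrast between pairwise and group similarity concrete, here is a minimal Python sketch of h-confidence, the measure commonly used to define hyperclique patterns: the support of an itemset divided by the largest support of any single item in it. The toy transactions and the function names `support` and `h_confidence` are illustrative assumptions, not taken from the paper, and the sketch does not reproduce the HPSL algorithm itself.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= t) / len(transactions)


def h_confidence(itemset, transactions):
    """h-confidence: supp(itemset) / max over items i of supp({i}).

    A value near 1 means the items almost always occur together,
    so the whole group (not just each pair) is strongly associated.
    """
    max_single = max(support([i], transactions) for i in itemset)
    if max_single == 0:
        return 0.0
    return support(itemset, transactions) / max_single


# Hypothetical toy data: each transaction is a set of co-occurring items.
TRANSACTIONS = [
    {"a", "b"},
    {"b", "c"},
    {"a", "c"},
    {"a", "b", "c"},
]

pair_score = h_confidence(["a", "b"], TRANSACTIONS)
group_score = h_confidence(["a", "b", "c"], TRANSACTIONS)
```

In this toy data every pair of items has the same h-confidence (2/3), yet the three items together score only 1/3. A method that looks only at pairs would treat the three items as mutually similar, while a group-level measure shows that they rarely all occur together.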
Additional information
A preliminary version of this work was published as a two-page short paper in the Proceedings of the ACM Conference on Information and Knowledge Management (CIKM), 2005.
About this article
Cite this article
Xiong, H., Steinbach, M. & Kumar, V. Privacy leakage in multi-relational databases: a semi-supervised learning perspective. The VLDB Journal 15, 388–402 (2006). https://doi.org/10.1007/s00778-006-0011-4
DOI: https://doi.org/10.1007/s00778-006-0011-4