Privacy preserving via interval covering based subclass division and manifold learning based bi-directional obfuscation for effort estimation

Authors: Xiao-Yuan Jing, Li Cheng

ASE '16: Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering

Pages 75 - 86

https://doi.org/10.1145/2970276.2970302

Published: 25 August 2016

Abstract

When a company lacks local data, engineers can build an effort estimation model for a new project using training data shared by other companies. However, one of the most important obstacles to data sharing is the privacy concerns of software development organizations. In software engineering, most existing privacy-preserving work focuses on defect prediction or on debugging and testing, while privacy-preserving data sharing has not been well studied for effort estimation. In this paper, we aim to provide data owners with an effective approach for privatizing their data before release. We first design an Interval Covering based Subclass Division (ICSD) strategy. ICSD divides the target data into several subclasses by deriving a new attribute (i.e., a class label) from the effort data. The obtained class label helps maintain the distribution of the target data after obfuscation. Then, we propose a manifold learning based bi-directional data obfuscation (MLBDO) algorithm. For each target sample, MLBDO selects two nearest neighbors, one from the previous subclass and one from the next subclass, using a manifold learning based nearest neighbor selector, and uses them as disturbances to obfuscate the sample. We call the entire approach ICSD&MLBDO. Experimental results on seven public effort datasets show that: 1) ICSD&MLBDO can guarantee the privacy and maintain the utility of obfuscated data; 2) ICSD&MLBDO achieves better privacy and utility than the compared privacy-preserving methods.
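The abstract describes the two steps only at a high level, so the following is a rough illustrative sketch of the general idea and not the authors' algorithm. It assumes equal-width effort intervals for the subclass division, plain Euclidean distance in place of the manifold learning based neighbor selector, and a simple random shift toward each selected neighbor as the disturbance. The function names `divide_into_subclasses`, `nearest_in_class`, and `obfuscate`, the interval count `n_intervals`, and the strength `alpha` are all hypothetical.

```python
import math
import random


def divide_into_subclasses(efforts, n_intervals=3):
    """Assign each effort value a class label by equal-width interval (an assumption)."""
    lo, hi = min(efforts), max(efforts)
    width = (hi - lo) / n_intervals or 1.0  # fall back to 1.0 when all efforts are equal
    # Clamp so the maximum effort lands in the last interval.
    return [min(int((e - lo) / width), n_intervals - 1) for e in efforts]


def nearest_in_class(sample, features, labels, target_label):
    """Return the Euclidean-nearest sample carrying target_label, or None if there is none."""
    candidates = [f for f, lab in zip(features, labels) if lab == target_label]
    if not candidates:
        return None
    return min(candidates, key=lambda f: math.dist(sample, f))


def obfuscate(features, efforts, n_intervals=3, alpha=0.2, seed=0):
    """Perturb each sample using one neighbor from the previous and one from the next subclass."""
    rng = random.Random(seed)
    labels = divide_into_subclasses(efforts, n_intervals)
    private = []
    for sample, label in zip(features, labels):
        new = list(sample)
        for neighbor_label in (label - 1, label + 1):
            nb = nearest_in_class(sample, features, labels, neighbor_label)
            if nb is None:
                continue  # no such subclass (boundary label or empty interval)
            r = rng.uniform(0.0, alpha)  # hypothetical disturbance strength
            new = [x + r * (n - s) for x, n, s in zip(new, nb, sample)]
        private.append(new)
    return labels, private


# Toy data: five projects with two features each and their recorded efforts.
features = [[1.0, 2.0], [2.0, 3.0], [8.0, 9.0], [9.0, 9.5], [15.0, 20.0]]
efforts = [10, 12, 50, 55, 100]
labels, private = obfuscate(features, efforts)
```

Samples in the lowest and highest subclasses are perturbed by a single neighbor here, since only one adjacent subclass exists for them; how the published method handles those boundary cases is not stated in the abstract.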

Index Terms

Privacy preserving via interval covering based subclass division and manifold learning based bi-directional obfuscation for effort estimation
1. Security and privacy
  1. Security services
    1. Privacy-preserving protocols
2. Software and its engineering
  1. Software creation and management
    1. Software verification and validation
      1. Empirical software validation

Published In

ASE '16: Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering

August 2016

899 pages

ISBN:9781450338455

DOI:10.1145/2970276

General Chair: David Lo, Singapore Management University, Singapore

Program Chairs: Sven Apel, University of Passau, Germany; Sarfraz Khurshid, University of Texas at Austin, USA

Copyright © 2016 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGAI: ACM Special Interest Group on Artificial Intelligence
SIGSOFT: ACM Special Interest Group on Software Engineering
IEEE-CS: Computer Society

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

ASE'16

ASE'16: ACM/IEEE International Conference on Automated Software Engineering

September 3 - 7, 2016

Singapore, Singapore

Bibliometrics & Citations

Total Citations: 4
Total Downloads: 271
Downloads (Last 12 months): 7
Downloads (Last 6 weeks): 1

Reflects downloads up to 15 Feb 2025

Cited By

Saifan A, Lataifeh Z (2021). Privacy preserving defect prediction using generalization and entropy-based data reduction. Intelligent Data Analysis, 25:6 (1369-1405). Online publication date: 1-Jan-2021.
https://dl.acm.org/doi/10.3233/IDA-205504

Zhu X, Jing X, Wu D, He Z, Cao J, Yue D, Wang L (2021). Similarity-Maintaining Privacy Preservation and Location-Aware Low-Rank Matrix Factorization for QoS Prediction Based Web Service Recommendation. IEEE Transactions on Services Computing, 14:3 (889-902). Online publication date: 1-May-2021.
https://doi.org/10.1109/TSC.2018.2839741

Li Z, Jing X, Zhu X, Zhang H, Xu B, Ying S (2019). On the Multiple Sources and Privacy Preservation Issues for Heterogeneous Defect Prediction. IEEE Transactions on Software Engineering, 45:4 (391-411). Online publication date: 1-Apr-2019.
https://doi.org/10.1109/TSE.2017.2780222

Li Z, Jing X, Zhu X (2018). Progress on approaches to software defect prediction. IET Software, 12:3 (161-175). Online publication date: Jun-2018.
https://doi.org/10.1049/iet-sen.2017.0148
