DOI: 10.1145/2970276.2970302
research-article

Privacy preserving via interval covering based subclass division and manifold learning based bi-directional obfuscation for effort estimation

Published: 25 August 2016

Abstract

When a company lacks local data, engineers can build an effort estimation model for a new project from training data shared by other companies. However, one of the main obstacles to data sharing is the privacy concerns of software development organizations. In software engineering, most existing privacy-preserving work focuses on defect prediction or on debugging and testing, while privacy-preserving data sharing has not been well studied for effort estimation. In this paper, we aim to provide data owners with an effective approach to privatizing their data before release. We first design an Interval Covering based Subclass Division (ICSD) strategy. ICSD divides the target data into several subclasses by deriving a new attribute (i.e., a class label) from the effort data. The resulting class label helps maintain the distribution of the target data after obfuscation. We then propose a manifold learning based bi-directional data obfuscation (MLBDO) algorithm. For each target sample, MLBDO selects two nearest neighbors, one from the previous subclass and one from the next subclass, using a manifold learning based nearest neighbor selector, and uses them as disturbances to obfuscate the sample. We call the entire approach ICSD&MLBDO. Experimental results on seven public effort datasets show that: 1) ICSD&MLBDO can guarantee privacy and maintain the utility of the obfuscated data; 2) ICSD&MLBDO achieves better privacy and utility than the compared privacy-preserving methods.
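
The two-stage idea in the abstract can be illustrated with a rough sketch. It is not the authors' implementation: the abstract does not say how the effort intervals are built, how the neighbors perturb a sample, or how the manifold learning based selector works. The sketch therefore assumes equal-sized effort intervals, uses plain Euclidean distance as a stand-in for the manifold learning based selector, and introduces a hypothetical mixing rate `r`. All function names are illustrative.

```python
import numpy as np


def divide_into_subclasses(effort, n_subclasses):
    """Label each project by cutting the effort-sorted order into equal-sized
    intervals (an assumption; the abstract does not define the intervals)."""
    order = np.argsort(effort, kind="stable")
    labels = np.empty(len(effort), dtype=int)
    for label, chunk in enumerate(np.array_split(order, n_subclasses)):
        labels[chunk] = label
    return labels


def nearest_in(X, i, candidates):
    """Return the candidate index closest to sample i.
    Euclidean distance stands in for the manifold learning based selector."""
    distances = np.linalg.norm(X[candidates] - X[i], axis=1)
    return candidates[np.argmin(distances)]


def obfuscate(X, labels, r=0.15):
    """Shift each sample toward its nearest neighbor in the previous and in the
    next subclass. The update rule and the rate r are hypothetical."""
    n_sub = int(labels.max()) + 1
    out = X.astype(float).copy()
    for i in range(len(X)):
        for lab in (labels[i] - 1, labels[i] + 1):
            if 0 <= lab < n_sub:  # first/last subclass has only one side
                candidates = np.flatnonzero(labels == lab)
                j = nearest_in(X, i, candidates)
                out[i] += r * (X[j] - X[i])
    return out


if __name__ == "__main__":
    effort = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    X = effort.reshape(-1, 1)  # toy data: a single feature column
    labels = divide_into_subclasses(effort, n_subclasses=3)
    print(labels)
    print(obfuscate(X, labels, r=0.5).ravel())
```

Samples in the first and last subclasses have only one adjacent subclass, so in this sketch they receive a single disturbance; how the paper handles those boundary cases is not stated in the abstract.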

      Published In

      ASE '16: Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering
      August 2016
      899 pages
      ISBN: 9781450338455
      DOI: 10.1145/2970276
      General Chair: David Lo
      Program Chairs: Sven Apel, Sarfraz Khurshid
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. Effort estimation
      2. locality preserving projection
      3. privacy-preserving
      4. subclass division

      Qualifiers

      • Research-article

      Conference

      ASE'16

      Cited By

      • (2021) Privacy preserving defect prediction using generalization and entropy-based data reduction. Intelligent Data Analysis, 25:6 (1369-1405). DOI: 10.3233/IDA-205504. Online publication date: 1-Jan-2021
      • (2021) Similarity-Maintaining Privacy Preservation and Location-Aware Low-Rank Matrix Factorization for QoS Prediction Based Web Service Recommendation. IEEE Transactions on Services Computing, 14:3 (889-902). DOI: 10.1109/TSC.2018.2839741. Online publication date: 1-May-2021
      • (2019) On the Multiple Sources and Privacy Preservation Issues for Heterogeneous Defect Prediction. IEEE Transactions on Software Engineering, 45:4 (391-411). DOI: 10.1109/TSE.2017.2780222. Online publication date: 1-Apr-2019
      • (2018) Progress on approaches to software defect prediction. IET Software, 12:3 (161-175). DOI: 10.1049/iet-sen.2017.0148. Online publication date: Jun-2018
