Abstract
Cross-project defect prediction (CPDP) predicts defects in a target project using models trained on historical data from other (source) projects. When the source and target projects have different metric sets, CPDP is called heterogeneous defect prediction (HDP). HDP has recently received much research interest. Existing HDP methods consider only linear correlations among the features (metrics) of the source and target projects, so they cannot capture nonlinear correlations among those features. As a result, these methods may suffer from the linearly inseparable problem in the linear feature space. Furthermore, existing HDP methods do not take class imbalance into account, even though the imbalanced nature of software defect datasets makes learning harder for the predictors. In this paper, we propose a new cost-sensitive transfer kernel canonical correlation analysis (CTKCCA) approach for HDP. CTKCCA not only makes the data distributions of the source and target projects much more similar in a nonlinear feature space, where the learned features have favorable separability, but also uses different misclassification costs for the defective and defect-free classes to alleviate the class imbalance problem. For the evaluation, we perform the Friedman test with Nemenyi’s post-hoc test and the Cliff’s delta effect size test. Extensive experiments on 28 public projects from five data sources indicate that (1) CTKCCA performs significantly better than the related CPDP methods and (2) CTKCCA performs better than the related state-of-the-art HDP methods.
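As an illustration of the effect size measure named above, the following is a minimal Python sketch of Cliff's delta in its standard textbook form: the probability that a value from one group exceeds a value from the other, minus the reverse probability. The function name `cliffs_delta` and the sample scores are hypothetical and are not taken from the paper's experiments.

```python
from itertools import product


def cliffs_delta(xs, ys):
    """Return Cliff's delta for two samples: P(x > y) - P(x < y).

    The value lies in [-1, 1]; 0 means the two groups overlap completely,
    while +1 or -1 means every value in one group exceeds every value in
    the other.
    """
    if not xs or not ys:
        raise ValueError("both samples must be non-empty")
    # Compare every pair (x, y) drawn from the two samples.
    greater = sum(1 for x, y in product(xs, ys) if x > y)
    less = sum(1 for x, y in product(xs, ys) if x < y)
    return (greater - less) / (len(xs) * len(ys))


if __name__ == "__main__":
    # Hypothetical performance scores of two methods (illustrative only).
    method_a = [0.71, 0.68, 0.75, 0.80]
    method_b = [0.60, 0.66, 0.70, 0.58]
    print(cliffs_delta(method_a, method_b))
```

The pairwise loop is quadratic in the sample sizes, which is adequate for the small per-project result sets that such comparisons usually involve.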
Notes
The left side of “\( \Rightarrow \)” denotes the source project and the right side of “\( \Rightarrow \)” denotes the target project
Acknowledgements
The authors would like to thank the editors and anonymous reviewers for their constructive comments and suggestions. This work was supported by the National Key Research and Development Program of China under Grant No. 2017YFB0202001, the National Natural Science Foundation of China under Grant Nos. 61272273, 61373038, 61672392, 61472178, 61672208, U1404618, the National Basic Research 973 Program of China under Project No. 2014CB340702, the Program of State Key Laboratory of Software Engineering under Grant No. SKLSE-1216-14, the Natural Science Foundation of Jiangsu Province under Grant No. BK20170900, the Scientific Research Starting Foundation for Introduced Talents in NJUPT under NUPTSF No. NY217009, the Science and Technology Program in Henan Province under Grant No. 1721102410064, the Science and Technique Development Program of Henan under Grant No. 172102210186, and the Province-School-Region Project of Henan University under Grant No. 2016S11.
About this article
Cite this article
Li, Z., Jing, XY., Wu, F. et al. Cost-sensitive transfer kernel canonical correlation analysis for heterogeneous defect prediction. Autom Softw Eng 25, 201–245 (2018). https://doi.org/10.1007/s10515-017-0220-7
DOI: https://doi.org/10.1007/s10515-017-0220-7