Abstract
Software effort estimation plays an important role in software project management. Accurate estimation helps reduce cost overruns and eventual project failure. Unfortunately, many existing estimation techniques rely on the total project effort, which is often determined from the project life cycle. As the project moves on, the course of action deviates from what was originally planned, despite close monitoring and control. This calls for re-estimating software effort so as to improve project operating costs and budgeting. Recent research endeavors explore phase-level estimation, which uses known information from prior development phases to predict the effort of the next phase by means of different learning techniques. This study investigates the influence of preprocessing prior-phase data on the learning techniques used to re-estimate the effort of the next phase. The proposed re-estimation approach preprocesses prior-phase effort by means of statistical techniques to select a set of input features for learning, which in turn are exploited to generate the estimation models. These models are then used to re-estimate next-phase effort through four processing steps, namely data transformation, outlier detection, feature selection, and learning. An empirical study is conducted on 440 estimation models generated from combinations of 5 data transformation, 5 outlier detection, 5 feature selection, and 5 learning techniques. The experimental results show that suitable preprocessing is significantly useful for building proper learning models that boost re-estimation accuracy. However, no single learning technique outperforms the others across all phases. The proposed re-estimation approach yields more accurate estimates than the proportion-based estimation approach.
It is envisioned that the proposed re-estimation approach can help researchers and project managers re-estimate software effort so that projects finish on time and within the allotted budget.
References
Wang Y, Song Q, MacDonell S, Shepperd M, Junyi S (2009) Integrate the GM (1,1) and Verhulst models to predict software stage effort. IEEE Trans Syst Man Cybern Part C 39(6):647–658
Zia Z, Rashid A, uz Zaman K (2011) Software cost estimation for component-based fourth-generation-language software applications. IET Softw 5(1):103–110
Menzies T, Chen Z, Hihn J, Lum K (2006) Selecting best practices for effort estimation. IEEE Trans Softw Eng 32(11):883–895
Jorgensen M, Boehm B, Rifkin S (2009) Software development effort estimation: Formal models or expert judgment? IEEE Softw 26(2):14–19
MacDonell SG, Shepperd MJ (2003) Using prior-phase effort records for re-estimation during software projects. In: Proceedings of the ninth international software metrics symposium (METRICS’03), pp 73–86
Azzeh M, Cowling PI, Neagu D (2010) Software stage-effort estimation based on association rule mining and fuzzy set theory. In: Proceedings of 2010 IEEE 10th international conference on computer and information technology (CIT), pp 249–256
Ferrucci F, Gravino C, Sarro F (2014) Exploiting prior-phase effort data to estimate the effort for the subsequent phases: a further assessment. In: Proceedings of the 10th international conference on predictive models in software engineering, PROMISE ’14, pp 42–51. ACM, New York, NY, USA
Kantardzic M (2011) Data Mining: Concepts, Models, Methods, and Algorithms. Wiley, Piscataway
Kocaguneli E, Menzies T, Keung JW (2012) On the value of ensemble effort estimation. IEEE Trans Softw Eng 38(6):1403–1416
Boehm BW (1981) Software engineering economics. Prentice Hall PTR, Upper Saddle River
Yucalar F, Kilinc D, Borandag E, Ozcift A (2016) Regression analysis based software effort estimation method. Int J Softw Eng Knowl Eng 26(05):807–826
Huang SJ, Chiu NH, Liu YJ (2008) A comparative evaluation on the accuracies of software effort estimates from clustered data. Inf Softw Technol 50(9–10):879–888
Putnam LH (1978) A general empirical solution to the macro software sizing and estimating problem. IEEE Trans Softw Eng SE–4(4):345–361
Boehm BW, Abts C, Brown AW, Chulani S, Clark BK, Horowitz E, Madachy R, Reifer D, Steece B (2000) Software cost estimation with COCOMO II. Prentice Hall PTR, Upper Saddle River
Liu Q, Qin W, Mintram R, Ross M (2008) Evaluation of preliminary data analysis framework in software cost estimation based on ISBSG R9 data. Softw Qual J 16:411–458
Kocaguneli E, Menzies T, Bener A, Keung JW (2012) Exploiting the essential assumptions of analogy-based effort estimation. IEEE Trans Softw Eng 38(2):425–438
Idri A, Amazal F, Abran A (2015) Analogy-based software development effort estimation: a systematic mapping and review. Inf Softw Technol 58:206–230
Kumar KV, Ravi V, Carr M, Kiran NR (2008) Software development cost estimation using wavelet neural networks. J Syst Softw 81(11):1853–1867
Huang SJ, Chiu NH (2009) Applying fuzzy neural network to estimate software development effort. Appl Intell 30:73–83
Oliveira ALI (2006) Estimation of software project effort with support vector regression. Neurocomputing 69(13–15):1749–1753
Corazza A, Martino SD, Ferrucci F, Gravino C, Mendes E (2011) Investigating the use of support vector regression for web effort estimation. Empir Softw Eng 16:211–243
Mittal A, Parkash K, Mittal H (2010) Software cost estimation using fuzzy logic. SIGSOFT Softw Eng Notes 35(1):1–7
Muzaffar Z, Ahmed MA (2010) Software development effort prediction: a study on the factors impacting the accuracy of fuzzy logic systems. Inf Softw Technol 52(1):92–109
Oliveira AL, Braga PL, Lima RM, Cornélio ML (2010) GA-based method for feature selection and parameters optimization for machine learning regression applied to software effort estimation. Inf Softw Technol 52(11):1155–1166 (Special Section on Best Papers PROMISE 2009)
Minku LL, Yao X (2013) Software effort estimation as a multiobjective learning problem. ACM Trans Softw Eng Methodol 22(4):35:1–35:32
Jorgensen M (2004) A review of studies on expert estimation of software development effort. J Syst Softw 70(1–2):37–60
Polikar R (2006) Ensemble based systems in decision making. IEEE Circuits Syst Mag 6(3):21–45
Tan HBK, Zhao Y, Zhang H (2009) Conceptual data model-based software size estimation for information systems. ACM Trans Softw Eng Methodol 19(2):4:1–4:37
Malik AA, Boehm BW (2011) Quantifying requirements elaboration to improve early software cost estimation. Inf Sci 181(13):2747–2760
Yang Y, He M, Li M, Wang Q, Boehm BW (2008) Phase distribution of software development effort. In: Proceedings of the second ACM-IEEE international symposium on empirical software engineering and measurement, ESEM ’08, pp 61–69. ACM, New York, NY, USA
Strike K, Emam KE, Madhavji N (2001) Software cost estimation with incomplete data. IEEE Trans Softw Eng 27(10):890–908
Azzeh M, Neagu D, Cowling P (2008) Improving analogy software effort estimation using fuzzy feature subset selection algorithm. In: Proceedings of the 4th international workshop on predictor models in software engineering, PROMISE ’08, pp 71–78. ACM, New York, NY, USA
Pai DR, McFall KS, Subramanian GH (2013) Software effort estimation using a neural network ensemble. J Comput Inf Syst 53(4):49–58
Dejaeger K, Verbeke W, Martens D, Baesens B (2012) Data mining techniques for software effort estimation: A comparative study. IEEE Trans Softw Eng 38(2):375–397
Sakia R (1992) The Box-Cox transformation technique: a review. J R Stat Soc Ser D 41(2):169–178
Junling R (2006) A pattern selection algorithm based on the generalized confidence. In: Proceedings of 18th international conference on pattern recognition (ICPR’06), vol. 2, pp. 824–827
Huang SJ, Chiu NH, Chen LW (2008) Integration of the grey relational analysis with genetic algorithm for software effort estimation. Eur J Oper Res 188(3):898–909
Jarque CM (2011) Jarque–Bera test. In: International encyclopedia of statistical science. Springer, Berlin
Cook RD (1977) Detection of influential observation in linear regression. Technometrics 19(1):15–18
Menzies T, Greenwald J, Frank A (2007) Data mining static code attributes to learn defect predictors. IEEE Trans Softw Eng 33(1):2–13
Malhotra R, Kaur A, Singh Y (2010) Application of machine learning methods for software effort prediction. SIGSOFT Softw Eng Notes 35(3):1–6
Chen Z, Menzies T, Port D, Boehm BW (2005) Finding the right data for software cost modeling. IEEE Softw 22(6):38–46
Hall MA (1999) Correlation-based feature selection for machine learning. Ph.D. thesis, Department of Computer Science, University of Waikato
Unified code count. http://sunset.usc.edu/ucc/. Accessed 9 November 2015
Backfiring table conversion guidelines. http://www.qsm.com/resources/function-point-languages-table/. Accessed 9 November 2015
Conte SD, Dunsmore HE, Shen VY (1981) Software engineering metrics and models. Benjamin-Cummings, Menlo Park
Foss T, Stensrud E, Kitchenham B, Myrtveit I (2003) A simulation study of the model evaluation criterion MMRE. IEEE Trans Softw Eng 29(11):985–995
Shepperd M, MacDonell S (2012) Evaluating prediction systems in software project estimation. Inf Softw Technol 54(8):820–827
Miyazaki Y, Terakado M, Ozada K, Nozaki H (1994) Robust regression for developing software estimation models. J Syst Softw 27(1):3–16
Jorgensen M (2010) Selection of strategies in judgment-based effort estimation. J Syst Softw 83(6):1039–1050
Kocaguneli E, Menzies T, Keung J, Cok D, Madachy R (2013) Active learning and effort estimation: finding the essential content of software effort estimation data. IEEE Trans Softw Eng 39(8):1040–1053
Refaeilzadeh P, Tang L, Liu H (2009) Cross-validation. In: Encyclopedia of database systems. Springer, New York
Menzies T, Caglayan B, He Z, Kocaguneli E, Krall J, Peters F, Turhan B (2012) The promise repository of empirical software engineering data, http://promisedata.googlecode.com
Acknowledgements
The authors wish to thank the Electronic Government Agency (Public Organization) and VP Advance Company for providing software project data and administrative support, without which this work would never have been completed.
Appendices
A data set
This section provides details on all 38 software projects collected from two software development organizations. The project data, collected in compliance with the COCOMO II data collection guidelines [14], are shown in Table 6. Only KSLOC is quantitative data on a ratio scale that can be used directly as a feature for estimation. All remaining features are qualitative data on an ordinal, six-point Likert scale, i.e., very low (VL), low (L), nominal (N), high (H), very high (VH), and extra high (XH). These features are transformed according to COCOMO II rules [14] before participating in the estimation. For example, the rule for CPLX assigns 0.73, 0.87, 1.00, 1.17, 1.34, and 1.74 to VL, L, N, H, VH, and XH, respectively. As a result, when projects 1–5 (see Table 6) carry the ordinal values L, L, L, N, and N, their transformed values are 0.87, 0.87, 0.87, 1.00, and 1.00, respectively. Although one software project was smaller than 1 KSLOC (0.27 KSLOC), which is arguably too small for calibrating the COCOMO II model, we decided to keep it since neither global nor local COCOMO II calibration was involved in our study. Instead, we built the proportion-based estimation model with the help of learning techniques, where all collected projects were used. However, when outlier detection was performed, this small project was often flagged as an outlier and excluded from the training set. Software efforts of each individual phase and of the total project, measured in man-days, are summarized in Table 7; these are used to build the re-estimation models. Characteristics of individual projects are shown in Table 8; these are subsequently used to drill down by project category for better estimation accuracy of both the re-estimation and proportion-based models.
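The ordinal-to-ratio transformation above can be sketched as a simple table lookup. The multiplier values for CPLX are the ones quoted in the text; other features would use their own COCOMO II tables.

```python
# COCOMO II multipliers for CPLX, as quoted in the text above.
CPLX_MULTIPLIER = {
    "VL": 0.73, "L": 0.87, "N": 1.00, "H": 1.17, "VH": 1.34, "XH": 1.74,
}

def transform_cplx(ratings):
    """Map ordinal CPLX ratings to their COCOMO II effort multipliers."""
    return [CPLX_MULTIPLIER[r] for r in ratings]

# Projects 1-5 from Table 6 carry CPLX ratings L, L, L, N, N:
print(transform_cplx(["L", "L", "L", "N", "N"]))  # [0.87, 0.87, 0.87, 1.0, 1.0]
```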
B parameter setup
Ten-fold cross-validation is applied to determine the optimal parameters for the five learning techniques, as discussed below.
Regression analysis (RA) RA is a statistical technique for estimating the relationships among variables. Ordinary least squares (OLS) is the traditional regression analysis that approximates the target values with a linear regression model, under the assumption that the data are normally distributed. To find the optimal parameters, the first step is to generate multiple choices from the combinations of parameter values. In this case there is only one parameter, the regression constant, so there are two choices: including the regression constant or omitting it. The second step is to select the best choice by applying 10-fold cross-validation to the training set. The choice yielding the lowest sum of MdBRE, MBRE, MIBRE, and MdIBRE is selected as the optimal parameter setting.
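The selection criterion above sums four accuracy measures. The text does not spell out their formulas; the sketch below assumes the common definitions, BRE = |actual − estimate| / min(actual, estimate) and IBRE = |actual − estimate| / max(actual, estimate), with M and Md denoting mean and median.

```python
import statistics

def bre(actual, predicted):
    # Balanced Relative Error: absolute error over the smaller of the two values
    return abs(actual - predicted) / min(actual, predicted)

def ibre(actual, predicted):
    # Inverted Balanced Relative Error: absolute error over the larger value
    return abs(actual - predicted) / max(actual, predicted)

def selection_score(actuals, predictions):
    """Sum of MdBRE, MBRE, MIBRE, and MdIBRE; lower is better."""
    bres = [bre(a, p) for a, p in zip(actuals, predictions)]
    ibres = [ibre(a, p) for a, p in zip(actuals, predictions)]
    return (statistics.median(bres) + statistics.mean(bres)
            + statistics.mean(ibres) + statistics.median(ibres))
```

The parameter choice whose cross-validated `selection_score` is lowest would be retained.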
Support vector regression (SVR) SVR applies the ideas of SVM to the regression task [21]. SVR defines an \(\epsilon \)-insensitive loss function to establish a band around the true outputs [20]. Four parameters are considered: the kernel function (linear or radial basis), the \(\epsilon \) value in the loss function (0.0001, 0.001, 0.01, or 0.1), the regularization parameter in the loss function (ranging from 1 to 10 with a step of 2), and the gamma value, i.e., the width of the radial basis function (ranging from 0.1 to 1 with a step of 0.2). The linear kernel yields 20 choices (4 \(\epsilon \) values \(\times \) 5 regularization parameters) and the radial basis kernel yields 100 choices (4 \(\epsilon \) values \(\times \) 5 regularization parameters \(\times \) 5 gamma values), for a total of 120 choices.
Radial basis function (RBF) RBF is a feed-forward neural network that generally uses the radial basis function as its activation function. RBF performs a sequence of two mappings: a nonlinear mapping of the input data via the basis functions, followed by a linear mapping of the basis-function outputs to the output. Two parameters are considered: the number of basis functions, ranging from 1 to the number of projects with a step of s, where s is the number of projects divided by 10, and the width of the basis function, ranging from 0.1 to 1 with a step of 0.2. The total becomes 50 choices (10 numbers of basis functions \(\times \) 5 widths).
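The two-stage mapping can be illustrated at prediction time, assuming Gaussian basis functions and a one-dimensional input; the centers and linear weights are taken as given here, whereas in practice they are fitted to the training set.

```python
import math

def rbf_predict(x, centers, weights, width):
    """Two-stage RBF mapping: nonlinear Gaussian basis activations,
    then a linear combination of the activations (a sketch, not the
    authors' trained network)."""
    phis = [math.exp(-((x - c) ** 2) / (2 * width ** 2)) for c in centers]
    return sum(w * p for w, p in zip(weights, phis))
```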
Classification and regression tree (CART) CART builds a decision tree for prediction and works for both classification and regression problems [9]. In this experiment, CART is applied to the regression problem. Two parameters are considered: whether to prune the tree and the stopping criterion. Pruning reduces a tree by removing some leaf nodes from the original branches. For the stopping criterion, tree growth terminates when the number of projects in a leaf node is less than a threshold value, where the threshold ranges from 1 to 10 with a step of 1. Hence, the combinations of the two parameters generate 20 choices (2 pruning options \(\times \) 10 threshold values).
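The stopping criterion can be sketched with a minimal regression tree on a single feature (splitting at the mean, no pruning); this illustrates only the threshold behavior described above and is not the authors' CART implementation.

```python
def build_tree(points, threshold):
    """points: list of (feature, effort) pairs. Growth stops when a node
    holds fewer than `threshold` projects, which becomes a leaf
    predicting the mean effort of its projects."""
    efforts = [e for _, e in points]
    if len(points) < threshold:
        return ("leaf", sum(efforts) / len(efforts))
    split = sum(x for x, _ in points) / len(points)
    left = [p for p in points if p[0] <= split]
    right = [p for p in points if p[0] > split]
    if not left or not right:  # degenerate split: stop here
        return ("leaf", sum(efforts) / len(efforts))
    return ("node", split, build_tree(left, threshold), build_tree(right, threshold))

def predict(tree, x):
    if tree[0] == "leaf":
        return tree[1]
    _, split, left, right = tree
    return predict(left if x <= split else right, x)
```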
\(\mathbf{K}\)-nearest neighbor (KNN) KNN uses local neighborhood data points to obtain the prediction, as in analogy-based estimation. KNN finds the most similar projects in the training set by measuring the distance between the test and training data points, where each point denotes a project in the computation space. KNN selects the k points in the training set with the smallest distances from the given test point. The efforts of these projects can be weighted in proportion to the measured distances, and their average yields the predicted effort of the test project. Thus, three parameters are considered: the distance measure (Euclidean or Minkowski), the number of neighbors k (ranging from 1 to 10), and the effort aggregation (plain or distance-weighted average of the k nearest neighbors' efforts). The total becomes 40 choices (2 distances \(\times \) 10 neighbors \(\times \) 2 aggregations).
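The KNN prediction step can be sketched as follows, assuming Euclidean distance and inverse-distance weighting for the weighted variant; the authors' exact weighting scheme and the Minkowski option are omitted.

```python
import math

def knn_estimate(test_point, training, k, weighted=False):
    """training: list of (feature_vector, effort) pairs. Returns the
    (optionally distance-weighted) average effort of the k nearest projects."""
    nearest = sorted((math.dist(test_point, x), e) for x, e in training)[:k]
    if not weighted:
        return sum(e for _, e in nearest) / k
    # inverse-distance weights; small constant avoids division by zero
    w = [1.0 / (d + 1e-9) for d, _ in nearest]
    return sum(wi * e for wi, (_, e) in zip(w, nearest)) / sum(w)
```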
Jodpimai, P., Sophatsathit, P. & Lursinsap, C. Re-estimating software effort using prior phase efforts and data mining techniques. Innovations Syst Softw Eng 14, 209–228 (2018). https://doi.org/10.1007/s11334-018-0311-z