Abstract
Software effort estimation plays an important role in software project management. Accurate estimation helps reduce cost overruns and eventual project failure. Unfortunately, many existing estimation techniques rely on the total project effort, which is often determined from the project life cycle. As the project moves on, the course of action deviates from what was originally planned, despite close monitoring and control. This calls for re-estimating software effort so as to improve project operating costs and budgeting. Recent research endeavors explore phase-level estimation, which uses known information from prior development phases to predict the effort of the next phase by means of different learning techniques. This study investigates the influence of preprocessing prior-phase data on the learning techniques used to re-estimate the effort of the next phase. The proposed re-estimation approach preprocesses prior-phase effort by means of statistical techniques to select a set of input features for learning, which in turn are exploited to generate the estimation models. These models are then used to re-estimate next-phase effort through four processing steps, namely data transformation, outlier detection, feature selection, and learning. An empirical study is conducted on 440 estimation models generated from combinations of 5 data transformation, 5 outlier detection, 5 feature selection, and 5 learning techniques. The experimental results show that suitable preprocessing is significantly useful for building proper learning models that boost re-estimation accuracy. However, no single learning technique outperforms the others across all phases. The proposed re-estimation approach yields more accurate estimates than the proportion-based estimation approach.
It is envisioned that the proposed re-estimation approach can help researchers and project managers re-estimate software effort so that projects finish on time and within the allotted budget.
References
Wang Y, Song Q, MacDonell S, Shepperd M, Junyi S (2009) Integrate the GM (1,1) and Verhulst models to predict software stage effort. IEEE Trans Syst Man Cybern Part C 39(6):647–658
Zia Z, Rashid A, uz Zaman K (2011) Software cost estimation for component-based fourth-generation-language software applications. IET Softw 5(1):103–110
Menzies T, Chen Z, Hihn J, Lum K (2006) Selecting best practices for effort estimation. IEEE Trans Softw Eng 32(11):883–895
Jorgensen M, Boehm B, Rifkin S (2009) Software development effort estimation: Formal models or expert judgment? IEEE Softw 26(2):14–19
MacDonell SG, Shepperd MJ (2003) Using prior-phase effort records for re-estimation during software projects. In: Proceedings of the ninth international software metrics symposium (METRICS’03), pp 73–86
Azzeh M, Cowling PI, Neagu D (2010) Software stage-effort estimation based on association rule mining and fuzzy set theory. In: Proceedings of 2010 IEEE 10th international conference on computer and information technology (CIT), pp 249–256
Ferrucci F, Gravino C, Sarro F (2014) Exploiting prior-phase effort data to estimate the effort for the subsequent phases: a further assessment. In: Proceedings of the 10th international conference on predictive models in software engineering, PROMISE ’14, pp 42–51. ACM, New York, NY, USA
Kantardzic M (2011) Data Mining: Concepts, Models, Methods, and Algorithms. Wiley, Piscataway
Kocaguneli E, Menzies T, Keung JW (2012) On the value of ensemble effort estimation. IEEE Trans Softw Eng 38(6):1403–1416
Boehm BW (1981) Software engineering economics. Prentice Hall PTR, Upper Saddle River
Yucalar F, Kilinc D, Borandag E, Ozcift A (2016) Regression analysis based software effort estimation method. Int J Softw Eng Knowl Eng 26(05):807–826
Huang SJ, Chiu NH, Liu YJ (2008) A comparative evaluation on the accuracies of software effort estimates from clustered data. Inf Softw Technol 50(9–10):879–888
Putnam LH (1978) A general empirical solution to the macro software sizing and estimating problem. IEEE Trans Softw Eng SE–4(4):345–361
Boehm BW, Abts C, Brown AW, Chulani S, Clark BK, Horowitz E, Madachy R, Reifer D, Steece B (2000) Software cost estimation with COCOMO II. Prentice Hall PTR, Upper Saddle River
Liu Q, Qin W, Mintram R, Ross M (2008) Evaluation of preliminary data analysis framework in software cost estimation based on ISBSG R9 data. Softw Qual J 16:411–458
Kocaguneli E, Menzies T, Bener A, Keung JW (2012) Exploiting the essential assumptions of analogy-based effort estimation. IEEE Trans Softw Eng 38(2):425–438
Idri A, Amazal F, Abran A (2015) Analogy-based software development effort estimation: a systematic mapping and review. Inf Softw Technol 58:206–230
Kumar KV, Ravi V, Carr M, Kiran NR (2008) Software development cost estimation using wavelet neural networks. J Syst Softw 81(11):1853–1867
Huang SJ, Chiu NH (2009) Applying fuzzy neural network to estimate software development effort. Appl Intell 30:73–83
Oliveira ALI (2006) Estimation of software project effort with support vector regression. Neurocomputing 69(13–15):1749–1753
Corazza A, Martino SD, Ferrucci F, Gravino C, Mendes E (2011) Investigating the use of support vector regression for web effort estimation. Empir Softw Eng 16:211–243
Mittal A, Parkash K, Mittal H (2010) Software cost estimation using fuzzy logic. SIGSOFT Softw Eng Notes 35(1):1–7
Muzaffar Z, Ahmed MA (2010) Software development effort prediction: a study on the factors impacting the accuracy of fuzzy logic systems. Inf Softw Technol 52(1):92–109
Oliveira AL, Braga PL, Lima RM, Cornélio ML (2010) GA-based method for feature selection and parameters optimization for machine learning regression applied to software effort estimation. Inf Softw Technol 52(11):1155–1166 (Special Section on Best Papers PROMISE 2009)
Minku LL, Yao X (2013) Software effort estimation as a multiobjective learning problem. ACM Trans Softw Eng Methodol 22(4):35:1–35:32
Jorgensen M (2004) A review of studies on expert estimation of software development effort. J Syst Softw 70(1–2):37–60
Polikar R (2006) Ensemble based systems in decision making. IEEE Circuits Syst Mag 6(3):21–45
Tan HBK, Zhao Y, Zhang H (2009) Conceptual data model-based software size estimation for information systems. ACM Trans Softw Eng Methodol 19(2):4:1–4:37
Malik AA, Boehm BW (2011) Quantifying requirements elaboration to improve early software cost estimation. Inf Sci 181(13):2747–2760
Yang Y, He M, Li M, Wang Q, Boehm BW (2008) Phase distribution of software development effort. In: Proceedings of the second ACM-IEEE international symposium on empirical software engineering and measurement, ESEM ’08, pp 61–69. ACM, New York, NY, USA
Strike K, Emam KE, Madhavji N (2001) Software cost estimation with incomplete data. IEEE Trans Softw Eng 27(10):890–908
Azzeh M, Neagu D, Cowling P (2008) Improving analogy software effort estimation using fuzzy feature subset selection algorithm. In: Proceedings of the 4th international workshop on predictor models in software engineering, PROMISE ’08, pp 71–78. ACM, New York, NY, USA
Pai DR, McFall KS, Subramanian GH (2013) Software effort estimation using a neural network ensemble. J Comput Inf Syst 53(4):49–58
Dejaeger K, Verbeke W, Martens D, Baesens B (2012) Data mining techniques for software effort estimation: A comparative study. IEEE Trans Softw Eng 38(2):375–397
Sakia R (1992) The Box-Cox transformation technique: a review. J R Stat Soc Ser D 41(2):169–178
Junling R (2006) A pattern selection algorithm based on the generalized confidence. In: Proceedings of 18th international conference on pattern recognition (ICPR’06), vol. 2, pp. 824–827
Huang SJ, Chiu NH, Chen LW (2008) Integration of the grey relational analysis with genetic algorithm for software effort estimation. Eur J Oper Res 188(3):898–909
Jarque CM (2011) Jarque–Bera test. In: International encyclopedia of statistical science. Springer, Berlin
Cook RD (1977) Detection of influential observation in linear regression. Technometrics 19(1):15–18
Menzies T, Greenwald J, Frank A (2007) Data mining static code attributes to learn defect predictors. IEEE Trans Softw Eng 33(1):2–13
Malhotra R, Kaur A, Singh Y (2010) Application of machine learning methods for software effort prediction. SIGSOFT Softw Eng Notes 35(3):1–6
Chen Z, Menzies T, Port D, Boehm BW (2005) Finding the right data for software cost modeling. IEEE Softw 22(6):38–46
Hall MA (1999) Correlation-based feature selection for machine learning. Ph.D. thesis, Department of Computer Science, University of Waikato
Unified code count. http://sunset.usc.edu/ucc/. Accessed 9 November 2015
Backfiring table conversion guidelines. http://www.qsm.com/resources/function-point-languages-table/. Accessed 9 November 2015
Conte SD, Dunsmore HE, Shen VY (1981) Software engineering metrics and models. Benjamin-Cummings, Menlo Park
Foss T, Stensrud E, Kitchenham B, Myrtveit I (2003) A simulation study of the model evaluation criterion MMRE. IEEE Trans Softw Eng 29(11):985–995
Shepperd M, MacDonell S (2012) Evaluating prediction systems in software project estimation. Inf Softw Technol 54(8):820–827
Miyazaki Y, Terakado M, Ozada K, Nozaki H (1994) Robust regression for developing software estimation models. J Syst Softw 27(1):3–16
Jorgensen M (2010) Selection of strategies in judgment-based effort estimation. J Syst Softw 83(6):1039–1050
Kocaguneli E, Menzies T, Keung J, Cok D, Madachy R (2013) Active learning and effort estimation: finding the essential content of software effort estimation data. IEEE Trans Softw Eng 39(8):1040–1053
Refaeilzadeh P, Tang L, Liu H (2009) Cross-validation. In: Encyclopedia of database systems. Springer, New York
Menzies T, Caglayan B, He Z, Kocaguneli E, Krall J, Peters F, Turhan B (2012) The promise repository of empirical software engineering data, http://promisedata.googlecode.com
Acknowledgements
The authors wish to thank the Electronic Government Agency (Public Organization) and VP Advance Company for providing software project data and administrative support, without which this work would never have been completed.
Appendices
A data set
This section provides details on all 38 software projects collected from two software development organizations. The project data, collected in compliance with the COCOMO II data collection guidelines [14], are shown in Table 6. Only KSLOC is quantitative data on a ratio scale that can be used directly as a feature for estimation. All remaining features are qualitative data on an ordinal, six-point Likert scale, i.e., very low (VL), low (L), nominal (N), high (H), very high (VH), and extra high (XH). These features are transformed according to COCOMO II rules [14] before participating in the estimation. For example, the rule for CPLX assigns 0.73, 0.87, 1.00, 1.17, 1.34, and 1.74 to VL, L, N, H, VH, and XH, respectively. As a result, when projects 1–5 (see Table 6) carry the ordinal values L, L, L, N, and N, their transformed values are 0.87, 0.87, 0.87, 1.00, and 1.00, respectively. Although one software project was smaller than 1 KSLOC (0.27 KSLOC), which is arguably too small for calibrating the COCOMO II model, we decided to keep it since neither global nor local COCOMO II calibration was involved in our study. Instead, we built the proportion-based estimation model with the help of learning techniques, where all collected projects were used. However, when outlier detection was performed, this small project was often flagged as an outlier and excluded from the training set. Software efforts of each individual phase and of the total project, measured in man-days, are summarized in Table 7; these are used to build the re-estimation models. Characteristics of individual projects are shown in Table 8; these are subsequently used to drill down by project category for better estimation accuracy of both the re-estimation and proportion-based models.
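The ordinal-to-ratio transformation above can be sketched as a simple table lookup. The multiplier values for CPLX are the ones quoted in the text; other features would use their own COCOMO II tables.

```python
# COCOMO II multipliers for CPLX, as quoted in the text above.
CPLX_MULTIPLIER = {
    "VL": 0.73, "L": 0.87, "N": 1.00, "H": 1.17, "VH": 1.34, "XH": 1.74,
}

def transform_cplx(ratings):
    """Map ordinal CPLX ratings to their COCOMO II effort multipliers."""
    return [CPLX_MULTIPLIER[r] for r in ratings]

# Projects 1-5 from Table 6 carry CPLX ratings L, L, L, N, N:
print(transform_cplx(["L", "L", "L", "N", "N"]))  # [0.87, 0.87, 0.87, 1.0, 1.0]
```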
B parameter setup
Ten-fold cross-validation is applied to determine the optimal parameters for the five learning techniques, as discussed below.
Regression analysis (RA) RA is a statistical technique for estimating the relationships among variables. Ordinary least squares (OLS) is the traditional regression analysis that approximates the target values with a linear regression model, under the assumption that the data are normally distributed. To find the optimal parameters, the first step is to generate multiple choices from the combinations of parameter values. In this case there is only one parameter, the regression constant, so there are two choices: including the regression constant or omitting it. The second step is to select the best choice by applying 10-fold cross-validation to the training set. The choice yielding the lowest sum of MdBRE, MBRE, MIBRE, and MdIBRE is selected as the optimal parameter setting.
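The selection criterion above sums four accuracy measures. The text does not spell out their formulas; the sketch below assumes the common definitions, BRE = |actual − estimate| / min(actual, estimate) and IBRE = |actual − estimate| / max(actual, estimate), with M and Md denoting mean and median.

```python
import statistics

def bre(actual, predicted):
    # Balanced Relative Error: absolute error over the smaller of the two values
    return abs(actual - predicted) / min(actual, predicted)

def ibre(actual, predicted):
    # Inverted Balanced Relative Error: absolute error over the larger value
    return abs(actual - predicted) / max(actual, predicted)

def selection_score(actuals, predictions):
    """Sum of MdBRE, MBRE, MIBRE, and MdIBRE; lower is better."""
    bres = [bre(a, p) for a, p in zip(actuals, predictions)]
    ibres = [ibre(a, p) for a, p in zip(actuals, predictions)]
    return (statistics.median(bres) + statistics.mean(bres)
            + statistics.mean(ibres) + statistics.median(ibres))
```

The parameter choice whose cross-validated `selection_score` is lowest would be retained.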
Support vector regression (SVR) SVR applies the ideas of SVM to the regression task [21]. SVR defines an \(\epsilon \)-insensitive loss function to establish a band around the true outputs [20]. Four parameters are considered: the kernel function (linear or radial basis), the \(\epsilon \) value in the loss function (0.0001, 0.001, 0.01, or 0.1), the regularization parameter in the loss function (ranging from 1 to 10 with a step of 2), and the gamma value, i.e., the width of the radial basis function (ranging from 0.1 to 1 with a step of 0.2). The linear kernel yields 20 choices (4 \(\epsilon \) values \(\times \) 5 regularization parameters) and the radial basis kernel yields 100 choices (4 \(\epsilon \) values \(\times \) 5 regularization parameters \(\times \) 5 gamma values), for a total of 120 choices.
Radial basis function (RBF) RBF is a feed-forward neural network that generally uses the radial basis function as its activation function. RBF performs a sequence of two mappings: a nonlinear mapping of the input data via the basis functions, followed by a linear mapping of the basis-function outputs to the output. Two parameters are considered: the number of basis functions, ranging from 1 to the number of projects with a step of s, where s is the number of projects divided by 10, and the width of the basis function, ranging from 0.1 to 1 with a step of 0.2. The total becomes 50 choices (10 numbers of basis functions \(\times \) 5 widths).
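The two-stage mapping can be illustrated at prediction time, assuming Gaussian basis functions and a one-dimensional input; the centers and linear weights are taken as given here, whereas in practice they are fitted to the training set.

```python
import math

def rbf_predict(x, centers, weights, width):
    """Two-stage RBF mapping: nonlinear Gaussian basis activations,
    then a linear combination of the activations (a sketch, not the
    authors' trained network)."""
    phis = [math.exp(-((x - c) ** 2) / (2 * width ** 2)) for c in centers]
    return sum(w * p for w, p in zip(weights, phis))
```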
Classification and regression tree (CART) CART builds a decision tree for prediction and works for both classification and regression problems [9]. In this experiment, CART is applied to the regression problem. Two parameters are considered: whether to prune the tree and the stopping criterion. Pruning reduces a tree by removing some leaf nodes from the original branches. For the stopping criterion, tree growth terminates when the number of projects in a leaf node is less than a threshold value, where the threshold ranges from 1 to 10 with a step of 1. Hence, the combinations of the two parameters generate 20 choices (2 pruning options \(\times \) 10 threshold values).
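The stopping criterion can be sketched with a minimal regression tree on a single feature (splitting at the mean, no pruning); this illustrates only the threshold behavior described above and is not the authors' CART implementation.

```python
def build_tree(points, threshold):
    """points: list of (feature, effort) pairs. Growth stops when a node
    holds fewer than `threshold` projects, which becomes a leaf
    predicting the mean effort of its projects."""
    efforts = [e for _, e in points]
    if len(points) < threshold:
        return ("leaf", sum(efforts) / len(efforts))
    split = sum(x for x, _ in points) / len(points)
    left = [p for p in points if p[0] <= split]
    right = [p for p in points if p[0] > split]
    if not left or not right:  # degenerate split: stop here
        return ("leaf", sum(efforts) / len(efforts))
    return ("node", split, build_tree(left, threshold), build_tree(right, threshold))

def predict(tree, x):
    if tree[0] == "leaf":
        return tree[1]
    _, split, left, right = tree
    return predict(left if x <= split else right, x)
```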
\(\mathbf{K}\)-nearest neighbor (KNN) KNN uses local neighborhood data points to obtain the prediction, as in analogy-based estimation. KNN finds the most similar projects in the training set by measuring the distance between the test and training data points, where each point denotes a project in the computation space. KNN selects the k points in the training set with the smallest distances from the given test point. The efforts of these projects can be weighted in proportion to the measured distances, and their average yields the predicted effort of the test project. Thus, three parameters are considered: the distance measure (Euclidean or Minkowski), the number of neighbors k (ranging from 1 to 10), and the effort aggregation (plain or distance-weighted average of the k nearest neighbors' efforts). The total becomes 40 choices (2 distances \(\times \) 10 neighbors \(\times \) 2 aggregations).
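The KNN prediction step can be sketched as follows, assuming Euclidean distance and inverse-distance weighting for the weighted variant; the authors' exact weighting scheme and the Minkowski option are omitted.

```python
import math

def knn_estimate(test_point, training, k, weighted=False):
    """training: list of (feature_vector, effort) pairs. Returns the
    (optionally distance-weighted) average effort of the k nearest projects."""
    nearest = sorted((math.dist(test_point, x), e) for x, e in training)[:k]
    if not weighted:
        return sum(e for _, e in nearest) / k
    # inverse-distance weights; small constant avoids division by zero
    w = [1.0 / (d + 1e-9) for d, _ in nearest]
    return sum(wi * e for wi, (_, e) in zip(w, nearest)) / sum(w)
```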
Jodpimai, P., Sophatsathit, P. & Lursinsap, C. Re-estimating software effort using prior phase efforts and data mining techniques. Innovations Syst Softw Eng 14, 209–228 (2018). https://doi.org/10.1007/s11334-018-0311-z