
Re-estimating software effort using prior phase efforts and data mining techniques

  • Original Paper
  • Published in: Innovations in Systems and Software Engineering

Abstract

Software effort estimation plays an important role in software project management. An accurate estimate helps reduce cost overruns and eventual project failure. Unfortunately, many existing estimation techniques rely on the total project effort, which is often determined from the project life cycle. As the project moves on, the course of action deviates from what was originally planned, despite close monitoring and control. This calls for re-estimating software effort so as to improve project cost control and budgeting. Recent research endeavors explore phase-level estimation, which uses known information from prior development phases to predict the effort of the next phase by means of different learning techniques. This study investigates the influence of preprocessing prior-phase data on the learning techniques used to re-estimate the effort of the next phase. The proposed re-estimation approach preprocesses prior-phase effort by means of statistical techniques to select a set of input features for learning, which in turn are exploited to generate the estimation models. These models are then used to re-estimate next-phase effort through four processing steps, namely data transformation, outlier detection, feature selection, and learning. An empirical study is conducted on 440 estimation models generated from combinations of five data transformation, five outlier detection, five feature selection, and five learning techniques. The experimental results show that suitable preprocessing is significantly useful for building proper learning models and boosting re-estimation accuracy. However, no single learning technique outperforms the others across all phases. The proposed re-estimation approach yields more accurate estimates than the proportion-based estimation approach. It is envisioned that the proposed approach can help researchers and project managers re-estimate software effort so as to finish the project on time and within the allotted budget.


References

  1. Wang Y, Song Q, MacDonell S, Shepperd M, Junyi S (2009) Integrate the GM(1,1) and Verhulst models to predict software stage effort. IEEE Trans Syst Man Cybern Part C 39(6):647–658

  2. Zia Z, Rashid A, uz Zaman K (2011) Software cost estimation for component-based fourth-generation-language software applications. IET Softw 5(1):103–110

  3. Menzies T, Chen Z, Hihn J, Lum K (2006) Selecting best practices for effort estimation. IEEE Trans Softw Eng 32(11):883–895

  4. Jorgensen M, Boehm B, Rifkin S (2009) Software development effort estimation: formal models or expert judgment? IEEE Softw 26(2):14–19

  5. MacDonell SG, Shepperd MJ (2003) Using prior-phase effort records for re-estimation during software projects. In: Proceedings of the ninth international software metrics symposium (METRICS'03), pp 73–86

  6. Azzeh M, Cowling PI, Neagu D (2010) Software stage-effort estimation based on association rule mining and fuzzy set theory. In: Proceedings of the 2010 IEEE 10th international conference on computer and information technology (CIT), pp 249–256

  7. Ferrucci F, Gravino C, Sarro F (2014) Exploiting prior-phase effort data to estimate the effort for the subsequent phases: a further assessment. In: Proceedings of the 10th international conference on predictive models in software engineering, PROMISE '14. ACM, New York, pp 42–51

  8. Kantardzic M (2011) Data mining: concepts, models, methods, and algorithms. Wiley, Piscataway

  9. Kocaguneli E, Menzies T, Keung JW (2012) On the value of ensemble effort estimation. IEEE Trans Softw Eng 38(6):1403–1416

  10. Boehm BW (1981) Software engineering economics. Prentice Hall PTR, Upper Saddle River

  11. Yucalar F, Kilinc D, Borandag E, Ozcift A (2016) Regression analysis based software effort estimation method. Int J Softw Eng Knowl Eng 26(05):807–826

  12. Huang SJ, Chiu NH, Liu YJ (2008) A comparative evaluation on the accuracies of software effort estimates from clustered data. Inf Softw Technol 50(9–10):879–888

  13. Putnam LH (1978) A general empirical solution to the macro software sizing and estimating problem. IEEE Trans Softw Eng SE-4(4):345–361

  14. Boehm BW, Abts C, Brown AW, Chulani S, Clark BK, Horowitz E, Madachy R, Reifer D, Steece B (2000) Software cost estimation with COCOMO II. Prentice Hall PTR, Upper Saddle River

  15. Liu Q, Qin W, Mintram R, Ross M (2008) Evaluation of preliminary data analysis framework in software cost estimation based on ISBSG R9 data. Softw Qual J 16:411–458

  16. Kocaguneli E, Menzies T, Bener A, Keung JW (2012) Exploiting the essential assumptions of analogy-based effort estimation. IEEE Trans Softw Eng 38(2):425–438

  17. Idri A, Amazal F, Abran A (2015) Analogy-based software development effort estimation: a systematic mapping and review. Inf Softw Technol 58:206–230

  18. Kumar KV, Ravi V, Carr M, Kiran NR (2008) Software development cost estimation using wavelet neural networks. J Syst Softw 81(11):1853–1867

  19. Huang SJ, Chiu NH (2009) Applying fuzzy neural network to estimate software development effort. Appl Intell 30:73–83

  20. Oliveira ALI (2006) Estimation of software project effort with support vector regression. Neurocomputing 69(13–15):1749–1753

  21. Corazza A, Martino SD, Ferrucci F, Gravino C, Mendes E (2011) Investigating the use of support vector regression for web effort estimation. Empir Softw Eng 16:211–243

  22. Mittal A, Parkash K, Mittal H (2010) Software cost estimation using fuzzy logic. SIGSOFT Softw Eng Notes 35(1):1–7

  23. Muzaffar Z, Ahmed MA (2010) Software development effort prediction: a study on the factors impacting the accuracy of fuzzy logic systems. Inf Softw Technol 52(1):92–109

  24. Oliveira AL, Braga PL, Lima RM, Cornélio ML (2010) GA-based method for feature selection and parameters optimization for machine learning regression applied to software effort estimation. Inf Softw Technol 52(11):1155–1166

  25. Minku LL, Yao X (2013) Software effort estimation as a multiobjective learning problem. ACM Trans Softw Eng Methodol 22(4):35:1–35:32

  26. Jorgensen M (2004) A review of studies on expert estimation of software development effort. J Syst Softw 70(1–2):37–60

  27. Polikar R (2006) Ensemble based systems in decision making. IEEE Circuits Syst Mag 6(3):21–45

  28. Tan HBK, Zhao Y, Zhang H (2009) Conceptual data model-based software size estimation for information systems. ACM Trans Softw Eng Methodol 19(2):4:1–4:37

  29. Malik AA, Boehm BW (2011) Quantifying requirements elaboration to improve early software cost estimation. Inf Sci 181(13):2747–2760

  30. Yang Y, He M, Li M, Wang Q, Boehm BW (2008) Phase distribution of software development effort. In: Proceedings of the second ACM-IEEE international symposium on empirical software engineering and measurement, ESEM '08. ACM, New York, pp 61–69

  31. Strike K, Emam KE, Madhavji N (2001) Software cost estimation with incomplete data. IEEE Trans Softw Eng 27(10):890–908

  32. Azzeh M, Neagu D, Cowling P (2008) Improving analogy software effort estimation using fuzzy feature subset selection algorithm. In: Proceedings of the 4th international workshop on predictor models in software engineering, PROMISE '08. ACM, New York, pp 71–78

  33. Pai DR, McFall KS, Subramanian GH (2013) Software effort estimation using a neural network ensemble. J Comput Inf Syst 53(4):49–58

  34. Dejaeger K, Verbeke W, Martens D, Baesens B (2012) Data mining techniques for software effort estimation: a comparative study. IEEE Trans Softw Eng 38(2):375–397

  35. Sakia R (1992) The Box–Cox transformation technique: a review. J R Stat Soc Ser D 41(2):169–178

  36. Junling R (2006) A pattern selection algorithm based on the generalized confidence. In: Proceedings of the 18th international conference on pattern recognition (ICPR'06), vol 2, pp 824–827

  37. Huang SJ, Chiu NH, Chen LW (2008) Integration of the grey relational analysis with genetic algorithm for software effort estimation. Eur J Oper Res 188(3):898–909

  38. Jarque CM (2011) Jarque–Bera test. In: International encyclopedia of statistical science. Springer, Berlin

  39. Cook RD (1977) Detection of influential observation in linear regression. Technometrics 19(1):15–18

  40. Menzies T, Greenwald J, Frank A (2007) Data mining static code attributes to learn defect predictors. IEEE Trans Softw Eng 33(1):2–13

  41. Malhotra R, Kaur A, Singh Y (2010) Application of machine learning methods for software effort prediction. SIGSOFT Softw Eng Notes 35(3):1–6

  42. Chen Z, Menzies T, Port D, Boehm BW (2005) Finding the right data for software cost modeling. IEEE Softw 22(6):38–46

  43. Hall MA (1999) Correlation-based feature selection for machine learning. Ph.D. thesis, Department of Computer Science, University of Waikato

  44. Unified code count. http://sunset.usc.edu/ucc/. Accessed 9 November 2015

  45. Backfiring table conversion guidelines. http://www.qsm.com/resources/function-point-languages-table/. Accessed 9 November 2015

  46. Conte SD, Dunsmore HE, Shen VY (1981) Software engineering metrics and models. Benjamin-Cummings, Menlo Park

  47. Foss T, Stensrud E, Kitchenham B, Myrtveit I (2003) A simulation study of the model evaluation criterion MMRE. IEEE Trans Softw Eng 29(11):985–995

  48. Shepperd M, MacDonell S (2012) Evaluating prediction systems in software project estimation. Inf Softw Technol 54(8):820–827

  49. Miyazaki Y, Terakado M, Ozada K, Nozaki H (1994) Robust regression for developing software estimation models. J Syst Softw 27(1):3–16

  50. Jorgensen M (2010) Selection of strategies in judgment-based effort estimation. J Syst Softw 83(6):1039–1050

  51. Kocaguneli E, Menzies T, Keung J, Cok D, Madachy R (2013) Active learning and effort estimation: finding the essential content of software effort estimation data. IEEE Trans Softw Eng 39(8):1040–1053

  52. Refaeilzadeh P, Tang L, Liu L (2009) Cross-validation. In: Encyclopedia of database systems. Springer, New York

  53. Menzies T, Caglayan B, He Z, Kocaguneli E, Krall J, Peters F, Turhan B (2012) The PROMISE repository of empirical software engineering data. http://promisedata.googlecode.com


Acknowledgements

The authors wish to thank the Electronic Government Agency (Public Organization) and VP Advance Company for providing software project data and administrative support, without which this work would never have been completed.

Author information

Correspondence to Pichai Jodpimai.

Appendices

A data set

This section provides details on all 38 software projects collected from two software development organizations. The project data, collected in compliance with the COCOMO II data collection guidelines [14], are shown in Table 6. Only KSLOC is quantitative data on a ratio scale that can be used directly as a feature for estimation. All remaining features are qualitative data on an ordinal, six-level Likert scale, i.e., very low (VL), low (L), nominal (N), high (H), very high (VH), and extra high (XH). These features are transformed according to the COCOMO II rules [14] before being used in the estimation. For example, the rule for CPLX assigns 0.73, 0.87, 1.00, 1.17, 1.34, and 1.74 to VL, L, N, H, VH, and XH, respectively. Thus, when projects 1–5 (see Table 6) are rated L, L, L, N, and N, their transformed values are 0.87, 0.87, 0.87, 1.00, and 1.00, respectively. Even though one software project is smaller than 1 KSLOC (0.27 KSLOC) and is therefore considered inappropriate for calibrating the COCOMO II model, we decided to keep it, since neither global nor local COCOMO II calibration was involved in our study. Instead, we built the proportion-based estimation model with the help of learning techniques, using all collected projects. However, when outlier detection was performed, this small project was often identified as an outlier and excluded from the training set. The software efforts of the individual phases and the total project, measured in man-days, are summarized in Table 7 and are used to build the re-estimation models. Characteristics of the individual projects are shown in Table 8; these are subsequently used to drill down into project categories for better estimation accuracy of both the re-estimation and proportion-based models.
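
As an illustration, the CPLX transformation above can be expressed as a simple lookup. This is a minimal sketch; it uses only the CPLX multipliers quoted in this section, and the function name is ours:

```python
# Map COCOMO II ordinal CPLX ratings to numeric effort multipliers.
# Other cost drivers would need their own COCOMO II multiplier tables.
CPLX_MULTIPLIERS = {
    "VL": 0.73, "L": 0.87, "N": 1.00, "H": 1.17, "VH": 1.34, "XH": 1.74,
}

def transform_ratings(ratings):
    """Convert a list of ordinal CPLX ratings into numeric multipliers."""
    return [CPLX_MULTIPLIERS[r] for r in ratings]

# Projects 1-5 (Table 6) are rated L, L, L, N, N:
print(transform_ratings(["L", "L", "L", "N", "N"]))
# [0.87, 0.87, 0.87, 1.0, 1.0]
```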

Table 6 Software project data collected according to COCOMO II
Table 7 Software effort of different phases and total project measured in man-days

B parameter setup

Ten-fold cross-validation is applied to determine the optimal parameters for the five learning techniques, as discussed below.
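
A minimal sketch of such a 10-fold split follows. The paper does not specify how projects are assigned to folds, so the round-robin assignment here is our assumption:

```python
def kfold_indices(n, k=10):
    """Yield (train, validation) index lists for k-fold cross-validation,
    assigning projects to folds round-robin."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i, val in enumerate(folds):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, val

# With the paper's 38 projects, each validation fold holds 3 or 4 projects:
splits = list(kfold_indices(38))
print(len(splits), len(splits[0][1]), len(splits[-1][1]))  # 10 4 3
```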

Regression analysis (RA): RA is a statistical technique for estimating the relationships among variables. Ordinary least squares (OLS) is the traditional regression analysis that approximates the target values with a linear regression model. It rests on the assumption that the data are normally distributed in order to give favorable estimation results. To find the optimal parameters, the first step is to generate multiple choices by combining parameter values. In this case, there is only one parameter, the regression constant, so there are two choices: including the regression constant or omitting it. The second step is to select the best choice by applying 10-fold cross-validation to the training set. The choice yielding the lowest sum of MdBRE, MBRE, MIBRE, and MdIBRE is selected as the optimal parameter setting.
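
The selection criterion can be sketched as follows, assuming the standard definitions BRE = |y − ŷ|/min(y, ŷ) and IBRE = |y − ŷ|/max(y, ŷ) from the accuracy-metrics literature [47, 48]; the validation numbers below are made up for illustration:

```python
from statistics import mean, median

def bre(actual, pred):
    """Balanced relative error: |y - yhat| / min(y, yhat)."""
    return abs(actual - pred) / min(actual, pred)

def ibre(actual, pred):
    """Inverted balanced relative error: |y - yhat| / max(y, yhat)."""
    return abs(actual - pred) / max(actual, pred)

def selection_score(actuals, preds):
    """Sum of MdBRE, MBRE, MIBRE, and MdIBRE; lower is better."""
    bres = [bre(a, p) for a, p in zip(actuals, preds)]
    ibres = [ibre(a, p) for a, p in zip(actuals, preds)]
    return median(bres) + mean(bres) + mean(ibres) + median(ibres)

# Illustrative validation results for the two regression-constant choices:
actuals = [10.0, 20.0, 30.0]
candidates = {"with constant": [12.0, 18.0, 33.0],
              "without constant": [15.0, 28.0, 45.0]}
best = min(candidates, key=lambda c: selection_score(actuals, candidates[c]))
print(best)  # with constant
```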

Table 8 Characteristics of software projects

Support vector regression (SVR) SVR employs the ideas of SVM for the regression task [21]. SVR defines an \(\epsilon \)-insensitive loss function to establish a band around the true outputs [20]. Four parameters are considered, namely the kernel function (linear or radial basis), the \(\epsilon \) value in the loss function (0.0001, 0.001, 0.01, and 0.1), the regularization parameter in the loss function (ranging from 1 to 10 in steps of 2), and the gamma value, or width of the radial basis function (ranging from 0.1 to 1 in steps of 0.2). There are 20 choices for the linear kernel (4 \(\epsilon \) values \(\times \) 5 regularization parameters) and 100 choices for the radial basis kernel (4 \(\epsilon \) values \(\times \) 5 regularization parameters \(\times \) 5 gamma values), for a total of 120 choices.
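
The 120 parameter choices can be enumerated directly. This is a sketch; the exact grid values follow our reading of the ranges above:

```python
from itertools import product

epsilons = [0.0001, 0.001, 0.01, 0.1]   # epsilon of the loss function
regs = list(range(1, 11, 2))            # regularization: 1..10, step 2 -> 1,3,5,7,9
gammas = [0.1, 0.3, 0.5, 0.7, 0.9]      # RBF width: 0.1..1, step 0.2

linear_choices = list(product(epsilons, regs))          # linear kernel grid
rbf_choices = list(product(epsilons, regs, gammas))     # radial basis kernel grid
print(len(linear_choices), len(rbf_choices),
      len(linear_choices) + len(rbf_choices))  # 20 100 120
```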

Radial basis function (RBF) RBF is a feed-forward neural network that generally uses a radial basis function as the activation function. RBF applies a sequence of two mappings: a nonlinear mapping of the input data via the basis functions, followed by a linear mapping of the basis-function outputs to the network output. Two parameters are considered: the number of basis functions, ranging from 1 to the number of projects in steps of s, where s is the number of projects divided by 10 (yielding 10 values), and the width of the basis function, ranging from 0.1 to 1 in steps of 0.2 (yielding 5 values). The total becomes 50 choices (10 basis-function counts \(\times \) 5 widths).
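
The 50 choices follow the same counting, sketched here with an illustrative project count chosen so that s is an integer:

```python
n_projects = 30                       # illustrative; chosen so that s is an integer
s = n_projects // 10                  # step s = number of projects / 10
basis_counts = list(range(1, n_projects + 1, s))   # 1, 4, ..., 28 -> 10 values
widths = [0.1, 0.3, 0.5, 0.7, 0.9]                 # 0.1..1, step 0.2 -> 5 values
choices = [(m, w) for m in basis_counts for w in widths]
print(len(basis_counts), len(widths), len(choices))  # 10 5 50
```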

Classification and regression tree (CART) CART builds a decision tree for prediction and works for both classification and regression problems [9]; in this experiment, CART is applied to the regression problem. Two parameters are considered, namely tree pruning and the stopping criterion. Pruning reduces a tree by removing some leaf nodes from the original branches. For the stopping criterion, tree growth terminates when the number of projects in a leaf node falls below a threshold value, where the threshold ranges from 1 to 10 in steps of 1. Hence, the combinations of the two parameters generate 20 choices (2 pruning options \(\times \) 10 threshold values).
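
The 20 combinations can be enumerated in the same way. This is a sketch; how pruning is toggled in the actual implementation is not specified, so a simple on/off flag stands in for it:

```python
pruning_options = [False, True]       # grow the full tree vs. prune it
leaf_thresholds = list(range(1, 11))  # stop splitting below this many projects
choices = [(p, t) for p in pruning_options for t in leaf_thresholds]
print(len(choices))  # 20
```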

\(\mathbf{K}\)-nearest neighbor (KNN) KNN uses local neighborhood data points to obtain the prediction, as in analogy-based estimation. KNN finds the most similar projects in the training set by measuring the distance between the test and training data points, where each point denotes a project in the computation space. KNN selects the k points in the training set with the smallest distance from the given test point. The efforts of these neighbors, optionally weighted according to the measured distances, are averaged to yield the predicted effort of the test project. Thus, there are three parameters to be chosen, namely the distance measure (Euclidean or Minkowski), the number of neighbors k (ranging from 1 to 10), and the effort combination (simple average or distance-weighted average of the k nearest projects). The total becomes 40 choices (2 distance measures \(\times \) 10 neighbor counts \(\times \) 2 effort combinations).
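
A minimal, self-contained sketch of the KNN estimator with Euclidean distance. The inverse-distance weighting scheme is our assumption; the text only states that the neighbors' efforts can be weighted by distance:

```python
import math

def euclidean(a, b):
    """Distance between two projects in the feature space."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_effort(test_features, training, k=3, weighted=False):
    """Average (optionally inverse-distance-weighted) effort of the k
    training projects nearest to the test project.
    `training` is a list of (features, effort) pairs."""
    ranked = sorted(training, key=lambda p: euclidean(test_features, p[0]))[:k]
    if not weighted:
        return sum(effort for _, effort in ranked) / k
    weights = [1.0 / (euclidean(test_features, f) + 1e-9) for f, _ in ranked]
    total = sum(w * effort for w, (_, effort) in zip(weights, ranked))
    return total / sum(weights)

# Two near neighbors (efforts 10 and 12) dominate the estimate:
training = [([1.0, 2.0], 10.0), ([2.0, 2.0], 12.0), ([8.0, 9.0], 40.0)]
print(knn_effort([1.5, 2.0], training, k=2))  # 11.0
```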

About this article

Cite this article

Jodpimai, P., Sophatsathit, P. & Lursinsap, C. Re-estimating software effort using prior phase efforts and data mining techniques. Innovations Syst Softw Eng 14, 209–228 (2018). https://doi.org/10.1007/s11334-018-0311-z