A comparative study of two software development cost modeling techniques using multi-organizational and company-specific data

https://doi.org/10.1016/S0950-5849(00)00153-1

Abstract

This research examined the use of the International Software Benchmarking Standards Group (ISBSG) repository for estimating effort for software projects in an organization not involved in ISBSG. The study investigated two questions: (1) What are the differences in accuracy between ordinary least-squares (OLS) regression and Analogy-based estimation? (2) Is there a difference in accuracy between estimates derived from the multi-company ISBSG data and estimates derived from company-specific data? Regarding the first question, we found that OLS regression performed as well as Analogy-based estimation when using company-specific data for model building. Using multi-company data, the OLS regression model provided significantly more accurate results than Analogy-based predictions. Addressing the second question, we found that, in general, models based on the company-specific data resulted in significantly more accurate estimates.

Introduction

Delivering a software product on time, within budget, and to an agreed level of quality is a critical concern for many software organizations. Underestimating software costs can have detrimental effects on software quality and thus on a company's business reputation. On the other hand, overestimation of software cost can result in missed opportunities to fund other projects. In response to industry demand, many estimation techniques have been proposed during the last three decades. In order to assess the suitability of a cost modeling technique, its performance and relative merits must be compared with those of other techniques. Normally, homogeneous company-specific data are believed to form a better basis for accurate estimates. However, those data sets are typically small, and cost driver data are tailored and specific, such that comparison with other organizations or across the industry is difficult. Moreover, data collection is an expensive and time-consuming process for individual organizations. Industry bodies have addressed the problem of software data collection in the past few years with the advent of multi-organizational data sets. The collaboration of organizations (such as the International Software Benchmarking Standards Group, ISBSG) to form multi-organizational data sets offers the possibility of reduced data collection costs, faster data accumulation and shared information benefits. Therefore, the pertinent question is whether multi-organizational data are valuable for estimation. Previous studies [1], [2] have shown this to be the case for organizations participating in these data repositories. The question addressed here is whether this is also the case if the organization has not participated in such defined data collection processes.

In this study, we used public domain data from the ISBSG. We compared the estimates derived from those data with estimates derived from company-specific data from an Australian company (Megatec). At the time of our analysis, Megatec did not contribute data to the ISBSG repository.

The two modeling techniques selected were: (1) OLS regression, as it is one of the most commonly applied techniques, and (2) Analogy-based estimation, whose popularity increased in the 1990s [5], [16], [18]. We applied different variants of the Analogical and Algorithmic Cost Estimator (ACE) algorithm to our data sets. This algorithm calculates the difference between the target project and each completed project in a database for a set of search metrics. ACE ranks the completed projects in the database according to their similarity to the target project. The effort of the most similar project(s) is used to predict the effort for the target project. In addition, size adjustments are applied to address differences between projects.
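
The general shape of such an analogy-based estimate can be sketched as follows. This is a minimal illustration and not the ACE algorithm itself: the range-normalized Euclidean distance, the linear size adjustment, the function names (`rank_by_similarity`, `estimate_effort`), and the project data are all assumptions made for the example.

```python
from math import sqrt


def rank_by_similarity(target, completed, metrics):
    """Return completed projects sorted from most to least similar to target."""
    # Assumption: scale each metric by its observed range so that metrics
    # measured in different units contribute comparably to the distance.
    ranges = {}
    for m in metrics:
        values = [p[m] for p in completed] + [target[m]]
        ranges[m] = (max(values) - min(values)) or 1.0

    def distance(project):
        # Assumption: Euclidean distance over the normalized search metrics.
        return sqrt(sum(((project[m] - target[m]) / ranges[m]) ** 2
                        for m in metrics))

    return sorted(completed, key=distance)


def estimate_effort(target, completed, metrics, k=1, size_key="size"):
    """Average the size-adjusted effort of the k most similar projects."""
    nearest = rank_by_similarity(target, completed, metrics)[:k]
    # Assumption: effort scales linearly with size, so each analogue's effort
    # is multiplied by the ratio of target size to analogue size.
    adjusted = [p["effort"] * target[size_key] / p[size_key] for p in nearest]
    return sum(adjusted) / len(adjusted)


# Hypothetical completed projects (size in function points, effort in hours).
completed = [
    {"size": 100, "team": 4, "effort": 1000},
    {"size": 400, "team": 10, "effort": 5200},
    {"size": 220, "team": 5, "effort": 2300},
]
target = {"size": 200, "team": 5}
estimate = estimate_effort(target, completed, ["size", "team"], k=1)
```

With `k=1` the estimate is taken from the single closest project; a larger `k` averages over several analogues, which trades sensitivity to one unusual project against dilution by less similar ones.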

This study is motivated by the challenge of assessing the feasibility of using multi-organizational data to build cost models for organizations and the benefits gained from company-specific data collection. The study looks at the prediction accuracy of two different estimation techniques and examines their performance based on both multi-organizational and company-specific data sets. Thus, two important questions are addressed: (1) What are the differences in estimation accuracy between a traditional technique such as ordinary least-squares (OLS) regression and Analogy-based estimation? (2) Is there a difference between estimates derived from multi-company data and estimates derived from company-specific data?

This research uses an organizational data set which is not part of the multi-organizational set, and therefore provides a possibly more stringent test of the use of these types of data sets than has been carried out in the past [1], [2]. Furthermore, there is a difference in the quality of data collection. Researchers were involved in collecting the Megatec project data from the outset and carried out extensive prior analysis [8]. Therefore, more detailed knowledge of the data context, relationships and accuracy was available for Megatec than could be expected for any public data set. On the other hand, more characteristics are measured in the public data set.

This paper starts with a discussion of related work in Section 2 followed by the presentation of the research method in Section 3 that includes a description of the data sets, the data preparation, and the estimation techniques. The results of the analysis are described in Section 4. Section 5 presents the conclusions and discussion of practical implications.

Related work

There have been two previous studies that utilized the ISBSG data set. The first one was a descriptive study done by the ISBSG itself [17]. Examples of the areas analyzed in this report are system size, project effort, and other descriptive metrics, e.g. their range, distribution, and relationships. In the second study, Lokan [9] investigated the relationship between the five elements in function point analysis. The present study is the first application of this data set to the issue of cost estimation.

Two

Research method

In this section, we provide descriptive statistics for the data sets used in our analysis, summarize the data preparation activities, explain the approach followed in model building and application, and introduce the reader to the estimation techniques and evaluation criteria applied.
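
This preview does not name the evaluation criteria. As an illustration only, the sketch below computes two measures that are widely used in this literature and are commonly attributed to Conte et al. (listed in the references): the mean magnitude of relative error (MMRE) and the share of estimates within a tolerance, Pred(l). That the study used exactly these measures is an assumption, and the function names and numbers are hypothetical.

```python
from statistics import mean


def mre(actual, predicted):
    """Magnitude of relative error for one project."""
    return abs(actual - predicted) / actual


def mmre(actuals, predictions):
    """Mean MRE over all projects; lower is better."""
    return mean(mre(a, p) for a, p in zip(actuals, predictions))


def pred(actuals, predictions, level=0.25):
    """Fraction of projects whose MRE is at most `level`; higher is better."""
    hits = sum(1 for a, p in zip(actuals, predictions) if mre(a, p) <= level)
    return hits / len(actuals)


# Hypothetical actual and estimated efforts for three projects.
actuals = [100, 200, 400]
predictions = [110, 140, 400]
overall_mmre = mmre(actuals, predictions)
within_25 = pred(actuals, predictions, level=0.25)
```

Computing the same measures for each technique on each data set gives a like-for-like basis for comparing, for example, OLS regression against Analogy-based estimates.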

Analysis and results

Sections 4.1 (Results based on Megatec data) and 4.2 (Results based on ISBSG data) briefly present the results of applying the two estimation techniques to the two data sets. Section 4.3 then compares the results and thus builds the basis for discussing the questions stated in Section 1.

Discussion and conclusions

To summarize, we can conclude that for Megatec: (1) Estimates using their own data are in general much more accurate than estimates using the ISBSG repository. (2) Using their own data, both OLS regression and Analogy should be considered for predictions of new Megatec projects.

Analogy-based estimates are slightly improved on average (but not significantly) when adjusted for the expected size of the target project.

For the ISBSG repository we can conclude: (1) The estimation accuracy when

Acknowledgements

This work was supported by grants from Megatec, the Fraunhofer Institute for Experimental Software Engineering, the Centre for Advanced Empirical Software Research (CAESAR) at UNSW, and the CSIRO. Thanks also go to the International Software Benchmarking Standards Group (ISBSG) for the repository used in the study.

References (19)

  • L.C. Briand, K. El Emam, K. Maxwell, D. Surmann, I. Wieczorek, An Assessment and Comparison of Common Software Cost...
  • L.C. Briand, T. Langley, I. Wieczorek, A replicated Assessment of Common Software Cost Estimation Techniques. In:...
  • S.D. Conte et al., Software Engineering Metrics and Models (1986)
  • S.J. Delany, P. Cunningham, W. Wilke, The limits of CBR in software project estimation, Proceedings of the 6th German...
  • G.R. Finnie et al., A comparison of software effort estimation techniques: using function points with neural networks, case-based reasoning and regression models, Journal of Systems and Software (2000)
  • A.R. Gray et al., A comparison of techniques for developing predictive models of software metrics, Information and Software Technology (2000)
  • M. Hardy, Regression with Dummy Variables (1993)
  • R. Jeffery et al., Function point sizing: structure, validity and applicability, Empirical Software Engineering (1996)
  • C.J. Lokan, An empirical study of the correlations between function point elements, Proceedings of the 6th...
There are more references available in the full text version of this article.
