Estimation support by lexical analysis of requirements documents

https://doi.org/10.1016/S0164-1212(99)00114-4

Abstract

Estimation of the effort required for a software project is difficult. Various means are used, but most rely on some expert assessment of the individual requirements and their implications. A method of supporting this assessment for object-oriented developments is described. Lexical analysis of a draft requirements specification can be used to identify individual objects which will translate directly into the final implementation. These object counts can then be used to provide ‘first-cut’ effort estimates, using historical information from previous projects. Experiments were conducted on a problem implemented by student project teams. The results show that the untrained domain-independent automated noun and technical term finding programs used were no worse than the typical student group in deriving problem-space objects, and that these object counts provided a reasonable indicator to the effort required. Further work in this area is discussed.

Introduction

The publicity associated with projects that run over budget has been embarrassing to the software industry. However, the problem is not confined to software. The Channel Tunnel between the UK and France was originally estimated (in 1985) to cost £2000 M, and on completion cost £9000 M (Channel Tunnel Publications, 1995). Any contract where there is some novelty runs the risk of a miscalculation, and software projects are always novel since identical copies cost no development effort. It is therefore worth trying to target the novelty factors for a software project in an attempt to assist the early estimation process, since the commitment to delivering at the bid price (and to specification, schedule, quality, etc.) is critical to commercial success.

Section snippets

Current estimation methods

The best-known algorithmic estimation methods are probably COCOMO (Boehm, 1981) and Function Point Analysis (FPA) (Albrecht, 1979). In COCOMO, a raw ‘size’ for the development is determined in terms of source lines of code (SLOC), and this is combined with cost drivers relating to the ‘technical difficulty’ of the project and the non-technical environment in which it is carried out. Discussions of COCOMO have related to the need to calibrate it to local company conditions and process maturity level (Clark, 1996), ...
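
As a rough illustration of the shape of such a model, the basic COCOMO effort equation is sketched below. The coefficients are Boehm's published ‘organic mode’ values; the eaf argument stands in for the product of the model's cost-driver multipliers, and the usage figures are invented.

    # Sketch of the basic COCOMO effort relationship (Boehm, 1981).
    def cocomo_effort(kloc, eaf=1.0, a=2.4, b=1.05):
        """Effort in person-months for 'kloc' thousand source lines of code.
        a and b are Boehm's 'organic mode' coefficients; eaf stands in for
        the product of cost-driver multipliers (1.0 = nominal)."""
        return a * (kloc ** b) * eaf

    # e.g. a 32 KLOC project in a slightly unfavourable environment:
    print(cocomo_effort(32, eaf=1.2))  # roughly 110 person-months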

Object oriented development and estimating

In order to try to assess the critical elements of an estimate, it is worth examining the project life-cycle for an object oriented system.

The Coding phase has in many cases been almost entirely automated by visual programming languages. Bespoke code generation has been replaced by screen and dialogue design, with selection (rather than construction) of appropriate elements, for which the tool then supplies the code. For example, application generation via a tool such as Borland C++ will ...

Related work

The CISAU project analysed construction industry documents in order to cross-check for inconsistency, for example between a bill of quantities and a job specification (Quinn, 1996). This type of cross-checking between documents or sections of a document has also been used in other contexts, such as biblical analysis (Hahne, 1994). The problem for such enterprises is that domain specific information may be needed in order to support the task. For example, in the CISAU project, the term ‘wall’ ...

Analysis programs used

The first consideration was to determine whether a completely domain-independent program could provide useful information for estimation. Three different programs were used to perform lexical analysis. The programs were tested on three small problem specifications, taken from textbooks and a local teaching module (see Table 1).

NounFinder

This takes a very simple approach to finding candidate objects (Hargreaves, 1998). Each unique word and its frequency of occurrence in the text are recorded. Then the words are examined (by comparing their first sections) to try to find those that relate to the same abstraction (e.g. dispatcher, dispatchers or agitate, agitator). Finally, general ‘noise’ words (e.g. ‘the’, ‘and’) are eliminated by comparing against a ‘stop-list’ dictionary. A frequency count of 2 is currently used as the threshold ...
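
The description suggests a three-stage pipeline: count word frequencies, conflate words sharing a leading section, and filter against a stop-list before applying the threshold. A minimal sketch under those assumptions follows; the prefix length, the stop-list contents and the grouping rule are illustrative, not the program's actual code.

    import re
    from collections import Counter

    STOP_WORDS = {"the", "and", "a", "an", "of", "to", "is", "in", "for"}  # stand-in stop-list

    def candidate_objects(text, threshold=2, prefix_len=6):
        """Return candidate problem-space objects and their frequencies."""
        freq = Counter(re.findall(r"[a-z]+", text.lower()))
        groups = {}  # leading section -> (representative word, combined count)
        for word, count in freq.items():
            if word in STOP_WORDS:
                continue
            key = word[:prefix_len]  # crude 'first section' comparison
            rep, total = groups.get(key, (word, 0))
            # keep the shorter form (e.g. 'dispatcher' over 'dispatchers')
            groups[key] = (min(rep, word, key=len), total + count)
        return {rep: n for rep, n in groups.values() if n >= threshold}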

KEP (KEP1)

KEP (Knowledge Extraction Program; Bowden et al., 1996a) is a domain-independent program designed to extract facts from explanatory and informative texts using shallow methods. Amongst other things, KEP attempts to build glossaries automatically, i.e. without human intervention save post-editing. These glossaries are composed of three-column entries where the first field may be an acronym (or empty), the second field is a technical term, and the third field an explanation of that term (if one is found).
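
The three-column glossary rows described here are easy to picture as records; the field names and the sample entry below are illustrative, not KEP's actual output format.

    from typing import NamedTuple, Optional

    class GlossaryEntry(NamedTuple):
        acronym: Optional[str]      # first field: acronym, or None if empty
        term: str                   # second field: the technical term
        explanation: Optional[str]  # third field: explanation, if one was found

    # e.g. a row such a glossary might contain:
    row = GlossaryEntry("FPA", "function point analysis",
                        "an estimation method based on counting user functions")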

KEP + single term finder (KEP2)

In addition to J+K terms, KEP finds single-word hypernym terms arising from two-word J+K terms having common elements. Thus, for example, the terms ‘mainframe computer’ and ‘personal computer’ would give rise to the term ‘computer’. These terms are added to the list of J+K terms they were derived from. KEP also attempts to filter out so-called ‘duff’ terms, i.e. those that are clearly not terms due to their generic nature, e.g. ‘recent year’, ‘common problem’, etc. The resulting list of terms is ...
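
The hypernym rule can be sketched directly from the example, assuming the shared element is the head (final) word of each two-word term and the input list holds only two-word terms; the ‘duff’ modifier list below is likewise an invented stand-in for KEP's filter.

    from collections import Counter

    DUFF_MODIFIERS = {"recent", "common", "new", "various"}  # invented stand-in

    def extend_terms(two_word_terms):
        """Add single-word hypernyms derived from two-word terms sharing a head
        word, e.g. 'mainframe computer' + 'personal computer' -> 'computer'."""
        heads = Counter(t.split()[1] for t in two_word_terms)
        hypernyms = {h for h, n in heads.items() if n >= 2}
        return set(two_word_terms) | hypernyms

    def drop_duff(terms):
        """Filter out generic 'duff' terms such as 'recent year'."""
        return {t for t in terms if t.split()[0] not in DUFF_MODIFIERS}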

Results

The results for NounFinder are indicated in Table 2, which shows the frequency of occurrence of the candidate objects. The plant control system looks promising, in that the words identified appear to relate strongly to problem-space objects. Unfortunately, for the traffic management system and the weather-monitoring system, the identified words do not appear to match so closely, and the threshold has led to a number of relevant entities (e.g. failure prediction, drawbar force, analogue sensors) ...

Conclusions and further work

These experiments hold some promise that automated class detection can assist in the estimation process for object oriented systems – particularly at the bid stage, where time and effort rule out detailed analysis of the problem. But there are many issues to be addressed. In order for these counts to be of use for estimation, companies need detailed historical information from previous projects of the effort per class, and variances. Effort and costs to completion for projects may not have been ...
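
A minimal sketch of the calibration step this implies, assuming a company has recorded effort per class on completed projects; all names and figures below are invented.

    from statistics import mean, stdev

    def first_cut_effort(class_count, effort_per_class_history):
        """First-cut effort from an automated class count, calibrated against
        historical effort-per-class figures from completed projects."""
        m = mean(effort_per_class_history)
        s = stdev(effort_per_class_history)
        # central estimate plus a crude spread taken from historical variance
        return class_count * m, class_count * s

    # e.g. 24 candidate classes against four past projects (person-days/class):
    estimate, spread = first_cut_effort(24, [10.5, 13.0, 12.2, 11.8])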

References (33)

  • Channel Tunnel Publications, 1995. The Official Channel Tunnel...
  • Clark, B.K., 1996. Cost modeling process maturity – COCOMO 2.0. In: IEEE Aerospace Applications Conference Proceedings,...
  • Edwards, M.L., Flanzer, M., Terry, M., Landa, J., 1995. RECAP: a requirements elicitation, capture and analysis process...
  • Gamma, E., Helm, R., Johnson, R., Vlissides, J., 1995. Design Patterns. Addison-Wesley, Reading, MA.
  • Goldin, L., Berry, D.M., 1994. AbstFinder: A prototype abstraction finder for natural language text for use in...
  • Gray, A., MacDonell, S., 1997. Applications of fuzzy logic to software metric models for development effort estimation....

Paul Bowden gained a B.Sc. in Electrical and Electronic Engineering from the University of Nottingham, a B.Sc. in Physics from the University of London (Queen Mary College) and an M.Sc. in IT from the University of Nottingham. He recently completed his doctorate in the Department of Computing at Nottingham Trent University. Paul has worked for many years as an analyst/programmer in the software industry, both on the traditional DP side and the engineering (microprocessor) side. Paul's research interests are centred on Natural Language Processing (NLP). His Ph.D. thesis concerns Knowledge Extraction from Text (KE), but he is also interested in Information Extraction (IE) and Text Summarisation.

Mark Hargreaves gained a B.Sc. (Hons) in Computing at Nottingham Trent University in 1998. He now works at Reuters as a systems engineer, troubleshooting problems for clients in the City. The clients are mostly financial and media companies, and the advent of the Euro is causing more work in this area.

Caroline Langensiepen has a Ph.D. in theoretical particle physics, but spent 15 years working in the computing industry. She specialised in real-time systems and object oriented methods, and was software design authority on a number of large military and mission-critical projects. She joined Nottingham Trent University two years ago and lectures in OOD and software quality. Her research interests centre on design methodologies and the system life-cycle.
