Empirical evaluation of tools for hairy requirements engineering tasks

Published in Empirical Software Engineering

Abstract

Context

A hairy requirements engineering (RE) task involving natural language (NL) documents is (1) a non-algorithmic task to find all relevant answers in a set of documents, that is (2) not inherently difficult for NL-understanding humans on a small scale, but is (3) unmanageable in the large scale. In performing a hairy RE task, humans need more help finding all the relevant answers than they do in recognizing that an answer is irrelevant. Therefore, a hairy RE task demands the assistance of a tool that focuses more on achieving high recall, i.e., finding all relevant answers, than on achieving high precision, i.e., finding only relevant answers. As close to 100% recall as possible is needed, particularly when the task is applied to the development of a high-dependability system. In this case, a hairy-RE-task tool that falls short of close to 100% recall may even be useless, because to find the missing information, a human has to do the entire task manually anyway. On the other hand, too much imprecision, too many irrelevant answers in the tool’s output, means that manually vetting the tool’s output to eliminate the irrelevant answers may be too burdensome. The reality is that all that can be realistically expected and validated is that the recall of a hairy-RE-task tool is higher than the recall of a human doing the task manually.

Objective

Therefore, the evaluation of any hairy-RE-task tool must consider the context in which the tool is used, and it must compare the performance of a human applying the tool to do the task with the performance of a human doing the task entirely manually, in the same context. The context of a hairy-RE-task tool includes characteristics of the documents being subjected to the task and the purposes of subjecting the documents to the task. However, traditionally, many a hairy-RE-task tool has been evaluated by considering only (1) how high its precision is or (2) how high its F-measure is, a measure that weights recall and precision equally. Such an evaluation ignores the context and can lead to incorrect, and often underestimated, conclusions about how effective the tool is.

Method

To evaluate a hairy-RE-task tool, this article offers an empirical procedure that takes into account not only (1) the performance of the tool, but also (2) the context in which the task is being done, (3) the performance of humans doing the task manually, and (4) the performance of those vetting the tool’s output. The empirical procedure uses (I) on one hand, (1) the recall and precision of the tool, (2) a contextually weighted F-measure for the tool, (3) a new measure called summarization of the tool, and (4) the time required for vetting the tool’s output, and (II) on the other hand, (1) the recall and precision achievable by and (2) the time required by a human doing the task.

Results

The use of the procedure is shown for a variety of different contexts, including that of successive attempts to improve the recall of an imagined hairy-RE-task tool. The procedure is shown to be context dependent, in that the actual next step of the procedure followed in any context depends on the values that have emerged in previous steps.

Conclusion

Any recommendation for a hairy-RE-task tool to achieve close to 100% recall comes with caveats and may be required only in specific high-dependability contexts. Appendix C applies parts of this procedure to several hairy-RE-task tools reported in the literature using data published about them. The surprising finding is that some of the previously evaluated tools are actually better than they were thought to be when they were evaluated using mainly precision or an unweighted F-measure.

Notes

  1. I chose the word “hairy” to evoke the metaphor of the hairy theorem or proof.

  2. An algorithmic task is one for which there exists an algorithm whose output is guaranteed logically, as opposed to empirically, to be 100% correct. For example, searching for every occurrence of the word “only” is algorithmic; think grep. However, searching for each occurrence of “only” that is in the wrong position for its intended meaning is non-algorithmic.

  3. The phrase “this article” is used to refer to only the document that you are reading right now, to distinguish it from the document of any related work, for which the word “paper” is used.

  4. Later, this article defines “close to” in terms of what humans can actually achieve. Until the background has been laid for this definition, a vague understanding suffices, but “close to 100%” is definitely less than 100%. Nowhere does this article insist on achieving 100% correctness, recall, or precision!

  5. “Recalling” is to “recall” as “precise” is to “precision”.

  6. This used to be called the “humanly achievable high recall (HAHR)”, expressing the hope that it is close to 100%. However, actual values have proved to be quite low, sometimes as low as 32.95%.

  7. Of course, one can argue that such a tool is useful as a defense against a human’s less-than-100% recall when the tool is run as a double check after the human has done the tool’s task manually. However, it seems to me that if the human knows that the tool will be run, he or she might be lazy in carrying out the manual task and not do as well as possible. Empirical studies are needed to see if this effect is real, and if so, how destructive it is of the human’s recall. Perhaps, the effect is technology-induced inattentional blindness (Mack and Rock 1998), which has been studied in depth elsewhere. Winkler and Vogelsang may have found evidence for this effect (Winkler and Vogelsang 2018).

  8. The word “I” is used when describing activities that required thinking and a decision on my part to advance the research. The data speak for themselves. “I” is less pretentious than “this author”. Also “this author” can be ambiguous in the context of a discussion about a related work’s authors.

  9. This sentence says only “probably”. Section 7 explains how to empirically decide how accurate this prediction is.

  10. If a tool can be shown to always have 100% precision, it is never necessary to vet the tool’s output.

  11. Of course, any t for which this property does not hold could be considered to be poorly designed!

  12. Think: “speedup” is roughly “acceleration”; hence “a”.

  13. The HAR is realistically the best that can be expected. However, see Item 13 in Section 7.2 for an idea about increasing HAR beyond what is possible for one human being.

  14. A discussion of which formula for the F-measure is more appropriate to evaluate tools for hairy tasks is outside the scope of this article.

  15. on the assumption that the time required for a run of t is negligible or other work can be done while t is running on its own.

  16. Section 7 explains how to determine empirically if this claim is true for any T and t.

  17. Whenever “X et al” would be ambiguous w.r.t. the bibliography of this article, the full list of author family names starting with X is given. If in addition, there would be several occurrences of that long list in the following text, an acronym is introduced to be used in place of those occurrences. If on the other hand, the intent of “X et al” is to refer to all works, with possibly different author lists, that are referred to by “X et al”, then “X et al” is used. There are occasional local, footnote-marked-and-explained exceptions to these rules.

  18. The minimum possible precision is λ, which is positive if the document has at least one relevant answer.

  19. There are many other NLP measures that are called by the same name, but this measure is not any of them.

  20. As pointed out by an anonymous reviewer, it may not be helpful for the tool to include, with a candidate link, information about why its IR method decided that the link is a candidate, particularly if the method, as does LSI (Latent Semantic Indexing), computes the similarity of the link’s source and target in the space of concepts rather than in the more traditional space of terms.

  21. I learned from reviews of previous versions of this article, that presenting the formula-laden description of the evaluation procedure of Section 7 without prefacing it with example evaluations using the procedure loses too many readers, including me. On the other hand, some readers understand examples applying formulae only after reading and understanding the formulae. If you are one of the latter kind of reader, please read Section 7 before reading Section 6.

  22. I had to concoct a fictitious analysis, because I have no actual complete analysis. I myself realized what is needed to do a complete context-aware evaluation of a tool for an NL RE task only after the last evaluation in which I participated was completed, and in general, it is impossible to obtain several key data after the evaluation steps are completed. In addition, to date, to my knowledge, no one has published a full analysis of the kind prescribed in this article, probably because doing this kind of analysis is very much not traditional.

  23. The reason that these recalls are only approximate is that the number of answers in a tool’s output must be a whole number!

  24. For a list of all identifications used throughout this article, see Appendix A.

  25. The comparison of \(R_{\textit{vet},D,T,t_{1}}\) and the HAR, \(R_{\textit{aveExpert},D,T}\), should be accompanied by the calculation of a p-value, which allows determining the statistical significance of whatever difference there is between the two values, each of which is, after all, the average of the R values of the members of a group of domain experts. The same holds for all comparisons like this one. See Section 7.4.

  26. All the standard issues concerning the selection of a representative document with which to test a tool for a task T and concerning the internal and external validity of an evaluation based on a gold standard are outside the scope of this article, for no other reason than that these issues have been studied thoroughly elsewhere (Manning et al. 2008). Nevertheless, the threats arising from these issues are discussed in Section 7.5.

  27. “Vel cetera” is to “inclusive or” as “et cetera” is to “and”.

  28. Here and here alone, “Maalej et al.” is used because Maalej is the one person who is an author in all of the cited work.

  29. “Nocuous” is the opposite of “innocuous”.

  30. “Hayes, Dekhtyar, et al.” is used to refer to all the tracing work done by Hayes, Dekhtyar, and various colleagues and students over the years, as a single body of work, while an acronym, e.g., “HDLG”, consisting of the ordered initials of the family names of the authors of a paper, is used to refer to the work of the paper featured in a section, e.g., that cited by the reference “(Hayes et al. 2018)”.

  31. This observation calls to mind the famous joke: “A policeman sees a drunk man searching for something under a streetlight and asks what the drunk has lost. He says he lost his keys and they both look under the streetlight together. After a few minutes the policeman asks if he is sure he lost them here, and the drunk replies, no, and that he lost them in the park. The policeman asks why he is searching here, and the drunk replies, ‘this is where the light is’.” (Freedman 2010)

  32. The formula for F1 that they give in their paper lacks the multiplier 2 in the numerator. So, one would expect the reported values to be one half of what they should be, but the reported values lie, correctly, between those of recall and precision. Moreover, they match the values that I computed for building Table 16. Therefore, they probably used the correct formula to compute the F1 values.

  33. There are not enough data in the paper to estimate either β from the construction of the gold standard.

  34. In the quotation, “accuracy” seems to mean “average F2”, because average F2 is the only value that is maintained in moving from the fourth to the fifth row. Each has an average F2 of 0.82.

References

  • Anish PR, Ghaisas S (2014) Product knowledge configurator for requirements gap analysis and customizations. In: Proceedings of the IEEE international requirements engineering conference (RE), pp 437–443

  • Antoniol G, Canfora G, Casazza G, De Lucia A (2000) Identifying the starting impact set of a maintenance request: a case study. In: Proceedings of the fourth European conference on software maintenance and reengineering, pp 227–230

  • Antoniol G, Canfora G, Casazza G, De Lucia A, Merlo E (2002) Recovering traceability links between code and documentation. IEEE Trans Softw Eng 28(10):970–983

  • Arora C, Sabetzadeh M, Briand L, Zimmer F (2015) Automated checking of conformance to requirements templates using natural language processing. IEEE Trans Softw Eng 41(10):944–968

  • Beard M, Kraft N, Etzkorn L, Lukins S (2011) Measuring the accuracy of information retrieval based bug localization techniques. In: 18th working conference on reverse engineering (WCRE), pp 124–128

  • Berry DM (2017) Evaluation of tools for hairy requirements and software engineering tasks. In: Proceedings of workshop on empirical requirements engineering (EmpirRE) in IEEE 25th international requirements engineering conference workshops, pp 284–291

  • Berry DM (2017) Evaluation of tools for hairy requirements engineering and software engineering tasks. Technical report, School of Computer Science, University of Waterloo. https://cs.uwaterloo.ca/dberry/FTP_SITE/tech.reports/EvalPaper.pdf

  • Berry DM, Cleland-Huang J, Ferrari A, Maalej W, Mylopoulos J, Zowghi D (2017) Panel: context-dependent evaluation of tools for NL RE tasks: recall vs. precision, and beyond. In: IEEE 25th international requirements engineering conference (RE), pp 570–573

  • Berry DM, Ferrari A, Gnesi S (2017) Assessing tools for defect detection in natural language requirements: recall vs precision. Technical report, School of Computer Science, University of Waterloo. https://cs.uwaterloo.ca/dberry/FTP_SITE/tech.reports/BFGpaper.pdf

  • Berry DM, Gacitua R, Sawyer P, Tjong SF (2012) The case for dumb requirements engineering tools. In: Proceedings of the international working conference on requirements engineering: foundation of software quality (REFSQ), pp 211–217

  • Binkley D, Lawrie D (2010) Information retrieval applications in software maintenance and evolution. In: Laplante PA (ed) Encyclopedia of software engineering, pp 454–463. Taylor & Francis

  • Breaux TD, Gordon DG (2013) Regulatory requirements traceability and analysis using semi-formal specifications. In: Proceedings of the international working conference on requirements engineering: foundation of software quality (REFSQ), pp 141–157

  • Breaux TD, Schaub F (2014) Scaling requirements extraction to the crowd: experiments with privacy policies. In: Proceedings of the IEEE international requirements engineering conference (RE), pp 163–172

  • Bucchiarone A, Gnesi S, Pierini P (2005) Quality analysis of NL requirements: an industrial case study. In: Proc. 13th IEEE international requirements engineering conference (RE), pp 390–394

  • Casamayor A, Godoy D, Campo M (2012) Functional grouping of natural language requirements for assistance in architectural software design. Knowledge-Based Systs 30:78–86

  • Cavalcanti G, Borba P, Accioly P (2017) Should we replace our merge tools?. In: Proceedings of the 39th international conference on software engineering companion (ICSE-C), pp 325–327

  • Chantree F, Nuseibeh B, de Roeck A, Willis A (2006) Identifying nocuous ambiguities in natural language requirements. In: Proceedings of the IEEE international requirements engineering conference (RE), pp 56–65

  • Cleland-Huang J, Czauderna A, Gibiec M, Emenecker J (2010) A machine learning approach for tracing regulatory codes to product specific requirements. In: Proceedings of the international conference on software engineering (ICSE), pp 155–164

  • Cleland-Huang J, Gotel O, Zisman A (eds) (2012) Software and systems traceability. Springer, London

  • Cleland-Huang J, Zemont G, Lukasik W (2004) A heterogeneous solution for improving the return on investment of requirements traceability. In: Proceedings of the 12th IEEE international requirements engineering conference (RE), pp 230–239

  • Cuddeback D, Dekhtyar A, Hayes JH (2010) Automated requirements traceability: the study of human analysts. In: Proceedings of the IEEE international requirements engineering conference (RE), pp 231–240

  • De Lucia A, Fasano F, Oliveto R, Tortora G (2007) Recovering traceability links in software artifact management systems using information retrieval methods. ACM Trans Softw Eng Methodol 16(4):13:1–13:50

  • De Lucia A, Marcus A, Oliveto R, Poshyvanyk D (2012) Information retrieval methods for automated traceability recovery. In: Cleland-Huang J, Gotel O, Zisman A (eds) Software and systems traceability. Springer, pp 71–98

  • De Lucia A, Oliveto R, Tortora G (2008) IR-based traceability recovery processes: an empirical comparison of “one-shot” and incremental processes. In: 26th IEEE/ACM international conference on automated software engineering (ASE), pp 39–48, Los Alamitos, CA, USA. IEEE Computer Society

  • De Lucia A, Oliveto R, Tortora G (2009) Assessing IR-based traceability recovery tools through controlled experiments. Empir Softw Eng 14(1):57–92

  • De Lucia A, Oliveto R, Tortora G (2009) The role of the coverage analysis during IR-based traceability recovery: a controlled experiment. In: IEEE international conference on software maintenance (ICSM), pp 371–380

  • Dekhtyar A, Hayes JH, Smith M (2011) Towards a model of analyst effort for traceability research. In: Proceedings of the 6th international workshop on traceability in emerging forms of software engineering (TEFSE), pp 58–62

  • Delater A, Paech B (2013) Tracing requirements and source code during software development: an empirical study. In: Proceedings of the international symposium on empirical software engineering and measurement (ESEM), pp 25–34

  • Dwarakanath A, Ramnani RR, Sengupta S (2013) Automatic extraction of glossary terms from natural language requirements. In: Proceedings of the IEEE international requirements engineering conference (RE), pp 314–319

  • Fabbrini F, Fusani M, Gnesi S, Lami G (2001) An automatic quality evaluation for natural language requirements. In: Requirements engineering: foundation for software quality (REFSQ), pp 1–18

  • Ferrari A, Dell’Orletta F, Spagnolo GO, Gnesi S (2014) Measuring and improving the completeness of natural language requirements. In: Proceedings of the international working conference on requirements engineering: foundation of software quality (REFSQ), pp 23–38

  • Freedman DH (2010) Wrong: why experts keep failing us — and how to know when not to trust them. Little, Brown and Company, New York

  • Gacitua R, Sawyer P (2008) Ensemble methods for ontology learning - an empirical experiment to evaluate combinations of concept acquisition techniques. In: ICIS’2008, pp 328–333

  • Gacitua R, Sawyer P, Gervasi V (2010) On the effectiveness of abstraction identification in requirements engineering. In: Proceedings of the 18th IEEE international requirements engineering conference (RE), pp 5–14

  • Gervasi V, Zowghi D (2014) Supporting traceability through affinity mining. In: Proceedings of the IEEE international requirements engineering conference (RE), pp 143–152

  • Gleich B, Creighton O, Kof L (2010) Ambiguity detection: towards a tool explaining ambiguity sources. In: Proceedings of the international working conference on requirements engineering: foundation of software quality (REFSQ), pp 218–232

  • Goldin L, Berry DM (1997) Abstfinder: a prototype abstraction finder for natural language text for use in requirements elicitation. Autom Softw Eng 4:375–412

  • Gotel O, Cleland-Huang J, Hayes JH, Zisman A, Egyed A, Grünbacher P, Antoniol G (2012) The quest for ubiquity: a roadmap for software and systems traceability research. In: 20th IEEE international requirements engineering conference (RE), pp 71–80

  • Groen EC, Schowalter J, Kopczyńska S, Polst S, Alvani S (2018) Is there really a need for using NLP to elicit requirements? A benchmarking study to assess scalability of manual analysis. In: Schmid K, Spoletini P (eds) Joint proceedings of the REFSQ 2018 Co-located events: the workshop on natural language processing for RE (NLP4RE), pp 1–11. CEUR Workshop Proceedings 2075. http://ceur-ws.org/Vol-2075/NLP4RE_paper11.pdf

  • Grossman MR, Cormack GV, Roegiest A (2016) TREC 2016 total recall track overview. http://trec.nist.gov/pubs/trec25/trec2016.html

  • Guzman E, Maalej W (2014) How do users like this feature? A fine grained sentiment analysis of app reviews. In: Proceedings of the IEEE international requirements engineering conference (RE), pp 153–162

  • Hayes J (2019) E-mail communication

  • Hayes JH, Dekhtyar A (2005) Humans in the traceability loop: can’t live with ’em, can’t live without ’em. In: Proceedings of the 3rd international workshop on traceability in emerging forms of software engineering (TEFSE), pp 20–23

  • Hayes JH, Dekhtyar A, Larsen J, Guéhéneuc Y-G (2018) Effective use of analysts’ effort in automated tracing. Requirements Engineering Journal 23(1):119–143

  • Hayes JH, Dekhtyar A, Osborne J (2003) Improving requirements tracing via information retrieval. In: Proceedings of the 11th IEEE international requirements engineering conference (RE), pp 138–147

  • Hayes JH, Dekhtyar A, Sundaram SK (2006) Advancing candidate link generation for requirements tracing: the study of methods. IEEE Trans Softw Eng 32(1):4–19

  • Hayes JH, Dekhtyar A, Sundaram SK, Howard S (2004) Helping analysts trace requirements: an objective look. In: Proceedings of the 12th IEEE international requirements engineering conference (RE), pp 249–259

  • Heindl M, Biffl S (2005) A case study on value-based requirements tracing. In: Proceedings of the 10th European software engineering conference held jointly with 13th ACM SIGSOFT international symposium on foundation for software engineering ESEC/FSE, pp 60–69

  • Hübner P (2016) Quality improvements for trace links between source code and requirements. In: Joint proceedings of the REFSQ 2016 Co-located events: the REFSQ 2016 doctoral symposium. http://ceur-ws.org/Vol-1564/paper29.pdf, pp 1–7

  • Hübner P, Paech B (2017) Using interaction data for continuous creation of trace links between source code and requirements in issue tracking systems. In: Proceedings of the 23rd international working conference on requirements engineering: foundation for software quality (REFSQ), pp TBD

  • Ingram C, Riddle S (2012) Cost-benefits of traceability. In: Cleland-Huang J, Gotel O, Zisman A (eds) Software and systems traceability, pp 23–42. Springer

  • Jha N, Mahmoud A (2017) Mining user requirements from application store reviews using frame semantics. In: Grünbacher P, Perini A (eds) Proceedings of the 23rd international working conference on requirements engineering: foundation for software quality (REFSQ), pp 273–287. Springer

  • Jha N, Mahmoud A (2018) Using frame semantics for classifying and summarizing application store reviews. Empir Softw Eng 23(6):3734–3767

  • Jha N, Mahmoud A (2019) Mining non-functional requirements from App store reviews. Empir Softw Eng 24(6):3659–3695

  • Juristo N, Moreno A (2001) Basics of software engineering experimentation. Kluwer Academic Publishers, Norwell

  • Knauss E, Ott D (2014) Semi-automatic categorization of natural language requirements. In: Proceedings of the international working conference on requirements engineering: foundation of software quality (REFSQ), pp 39–54

  • Kohavi R, Provost F (1998) Glossary of terms. Mach Learn 30 (2):271–274

  • Kong W, Hayes JH, Dekhtyar A, Dekhtyar O (2012) Process improvement for traceability: a study of human fallibility. In: 20th IEEE international requirements engineering conference (RE), pp 31–40

  • Kong W, Hayes JH, Dekhtyar A, Holden J (2011) How do we trace requirements: an initial study of analyst behavior in trace validation tasks. In: Proceedings of the international workshop on cooperative and human aspects of software engineering (CHASE), pp 32–39

  • Li H (2011) Learning to rank for information retrieval and natural language processing. Morgan & Claypool Publishers

  • Maalej W (2017) In-person, verbal communication

  • Maalej W, Kurtanović Z, Nabil H, Stanik C (2016) On the automatic classification of app reviews. Requirements Engineering Journal 21(3):311–331

  • Mack A, Rock I (1998) Inattentional blindness. MIT Press, Cambridge

  • Mäder P, Egyed A (2015) Do developers benefit from requirements traceability when evolving and maintaining a software system? Empir Softw Eng 20 (2):413–441

  • Mahmoud A, Niu N (2014) Supporting requirements to code traceability through refactoring. Requirements Engineering Journal 19(3):309–329

  • Manning CD, Raghavan P, Schütze H (2008) Chapter 8: evaluation in information retrieval. In: Introduction to information retrieval. https://nlp.stanford.edu/IR-book/pdf/08eval.pdf. Cambridge University Press, Cambridge

  • Marcus A, Haiduc S (2013) Text retrieval approaches for concept location in source code. In: De Lucia A, Ferrucci F (eds) Software engineering: international summer schools (ISSSE), 2009–2011, Salerno, Italy. Revised Tutorial Lectures, pp 126–158. Springer, Berlin

  • Maro S (2019) E-mail communication

  • Maro S, Steghöfer J, Hayes J, Cleland-Huang J, Staron M (2018) Vetting automatically generated trace links: what information is useful to human analysts?. In: IEEE 26th international requirements engineering conference (RE), pp 52–63

  • Menzies T, Dekhtyar A, Distefano J, Greenwald J (2007) Problems with precision: a response to comments on data mining static code attributes to learn defect predictors. IEEE Trans Softw Eng 33(9):637–640

  • Merten T, Krämer D, Mager B, Schell P, Bürsner S, Paech B (2016) Do information retrieval algorithms for automated traceability perform effectively on issue tracking system data?. In: Proceedings of the 22nd international working conference on requirements engineering: foundation for software quality (REFSQ), pp 45–62

  • Montgomery L, Damian D (2017) What do support analysts know about their customers? on the study and prediction of support ticket escalations in large software organizations. In: Proceedings of the 25th IEEE international requirements engineering conference (RE), page to appear

  • Mori K, Okubo N, Ueda Y, Katahira M, Amagasa T (2020) Joint proceedings of REFSQ-2020 workshops, doctoral symposium, live studies track, and poster track. In: Sabetzadeh M, Vogelsang A, Abualhaija S, Borg M, Dalpiaz F, Daneva M, Fernández N, Franch X, Fucci D, Gervasi V, Groen E, Guizzardi R, Herrmann A, Horkoff J, Mich L, Perini A, Susi A (eds). http://ceur-ws.org/Vol-2584/NLP4RE-paper2.pdf

  • Nagappan M (2018) In-person, verbal communication

  • Nikora AP, Hayes JH, Holbrook EA (2010) Experiments in automated identification of ambiguous natural-language requirements. In: Proceedings of the international symposium on software reliability engineering, pp 229–238

  • Northrop L, Pollak B, Feiler P, Gabriel RP, Goodenough J, Linger R, Longstaff T, Kazman R, Klein M, Schmidt D, Sullivan K, Wallnau K (2006) Ultra-large-scale systems: the software challenge of the future. Software Engineering Institute, Carnegie Mellon University, Pittsburgh, PA, USA. http://www.sei.cmu.edu/library/assets/ULS_Book20062.pdf

  • Oliveto R, Gethers M, Poshyvanyk D, De Lucia A (2010) On the equivalence of information retrieval methods for automated traceability link recovery. In: IEEE 18th international conference on program comprehension (ICPC), pp 68–71

  • Paech B (2019) E-mail communication

  • Pagano D, Maalej W (2013) User feedback in the appstore: an empirical study. In: Proceedings of the 21st IEEE international requirements engineering conference (RE), pp 125–134

  • Pittke F, Leopold H, Mendling J (2015) Automatic detection and resolution of lexical ambiguity in process models. IEEE Trans Softw Eng 41(6):526–544

  • Pollock L, Vijay-Shanker K, Hill E, Sridhara G, Shepherd D (2013) Natural language-based software analyses and tools for software maintenance. In: De Lucia A, Ferrucci F (eds) Software engineering: international summer schools (ISSSE), 2009–2011, Salerno, Italy. Revised Tutorial Lectures, pp 126–158. Springer, Berlin

  • Poshyvanyk D, Marcus A (2007) Combining formal concept analysis with information retrieval for concept location in source code. In: Proceedings of the 15th IEEE international conference on program comprehension (ICPC), pp 37–48

  • Quirchmayr T, Paech B, Kohl R, Karey H (2017) Semi-automatic software feature-relevant information extraction from natural language user manuals. In: Proceedings of the international working conference on requirements engineering: foundation of software quality (REFSQ), pp 255–272

  • Rahimi M, Mirakhorli M, Cleland-Huang J (2014) Automated extraction and visualization of quality concerns from requirements specifications. In: Proceedings of the IEEE international requirements engineering conference (RE), pp 253–262

  • Rempel P, Mäder P (2017) Preventing defects: the impact of requirements traceability completeness on software quality. IEEE Trans Softw Eng 43 (8):777–797

  • Riaz M, King JT, Slankas J, Williams LA (2014) Hidden in plain sight: automatically identifying security requirements from natural language artifacts. In: Proceedings of the IEEE international requirements engineering conference (RE), pp 183–192

  • Robeer M, Lucassen G, van der Werf JME, Dalpiaz F, Brinkkemper S (2016) Automated extraction of conceptual models from user stories via NLP. In: Int. Reqs. Engg. Conf. (RE), pp 196–205

  • Roegiest A, Cormack GV, Grossman MR, Clarke CL (2016) TREC 2015 total recall track overview. http://trec.nist.gov/pubs/trec24/trec2015.html

  • Ryan K (1993) The role of natural language in requirements engineering. In: Proceedings of the IEEE international symposium on requirements engineering, pp 240–242

  • Saito S, Iimura Y, Tashiro H, Massey AK, Antón AI (2016) Visualizing the effects of requirements evolution. In: Proceedings of the international conference on software engineering (ICSE) companion, pp 152–161

  • Saracevic T (1995) Evaluation of evaluation in information retrieval. In: Proceedings of the 18th annual international ACM SIGIR conference on research and development in information retrieval (SIGIR), pp 138–146

  • Sundaram SK, Hayes JH, Dekhtyar A (2005) Baselines in requirements tracing. In: Proceedings of the 2005 workshop on predictor models in software engineering (PROMISE). https://doi.org/10.1145/1082983.1083169, pp 1–6

  • Sutcliffe A, Rayson P, Bull CN, Sawyer P (2014) Discovering affect-laden requirements to achieve system acceptance. In: Proceedings of the IEEE international requirements engineering conference (RE), pp 173–182

  • Tjong SF, Berry DM (2013) The design of SREE — a prototype potential ambiguity finder for requirements specifications and lessons learned. In: Proceedings of the international working conference on requirements engineering: foundation of software quality (REFSQ), pp 80–95

  • TREC Conferences (2015) TREC 2015 total recall track. http://plg.uwaterloo.ca/gvcormac/total-recall/guidelines.html

  • TREC Conferences (2017) Text REtrieval Conference (TREC). http://trec.nist.gov

  • Vogelsang A (2019) E-mail communication

  • Wang L, Nakagawa H, Tsuchiya T (2020) Opinion analysis and organization of mobile application user reviews. In: Sabetzadeh M, Vogelsang A, Abualhaija S, Borg M, Dalpiaz F, Daneva M, Fernández N, Franch X, Fucci D, Gervasi V, Groen E, Guizzardi R, Herrmann A, Horkoff J, Mich L, Perini A, Susi A (eds) Joint proceedings of REFSQ-2020 workshops, doctoral symposium, live studies track, and poster track. http://ceur-ws.org/Vol-2584/NLP4RE-paper4.pdf

  • Wilson WM, Rosenberg LH, Hyatt LE (1997) Automated analysis of requirement specifications. In: Proc. 19th international conference on software engineering (ICSE), pp 161–171

  • Winkler JP, Grönberg J, Vogelsang A (2019) Optimizing for recall in automatic requirements classification: an empirical study. In: 27th IEEE international requirements engineering conference (RE), pp 40–50

  • Winkler JP, Vogelsang A (2018) Using tools to assist identification of non-requirements in requirements specifications — a controlled experiment. In: Kamsties E, Horkoff J, Dalpiaz F (eds) Proceedings of requirements engineering: foundation for software quality (REFSQ), pp 57–71

  • Witten IH, Frank E, Hall MA, Pal CJ (2016) Data mining: practical machine learning tools and techniques. Morgan Kaufmann, Burlington

  • Wnuk K, Höst M, Regnell B (2012) Replication of an experiment on linguistic tool support for consolidation of requirements from multiple sources. Empir Softw Eng 17(3):305–344

  • Wohlin C, Runeson P, Höst M, Ohlsson MC, Regnell B, Wesslén A (2000) Experimentation in software engineering: an introduction. Kluwer Academic Publishers, Norwell

  • Yang H, Roeck AND, Gervasi V, Willis A, Nuseibeh B (2011) Analysing anaphoric ambiguity in natural language requirements. Requirements Engineering Journal 16(3):163–189

  • Yazdani Seqerloo A, Amiri MJ, Parsa S, Koupaee M (2019) Automatic test cases generation from business process models. Requirements Engineering Journal 24(1):119–132

  • Zeni N, Kiyavitskaya N, Mich L, Cordy JR, Mylopoulos J (2015) GaiusT: supporting the extraction of rights and obligations for regulatory compliance. Requirements Engineering Journal 20(1):1–22

Acknowledgments

I benefited from discussions, in person or electronically, with William Berry, Travis Breaux, Lionel Briand, Alex Dekhtyar, Jane Cleland-Huang, Alessio Ferrari, Eddy Groen, Jane Hayes, Walid Maalej, Anas Mahmoud, Salome Maro, Aaron Massey, Thorsten Merten, Lloyd Montgomery, John Mylopoulos, Mei Nagappan, Barbara Paech, Jan Steghöfer, Andreas Vogelsang, and the anonymous referees. My work was supported in part by a Canadian NSERC grant NSERC-RGPIN227055-15.

Author information

Corresponding author

Correspondence to Daniel M. Berry.

Ethics declarations

Ethics approval and consent to participate

To prepare this article, I conducted no experiments.

Additional information

Communicated by: Paul Grünbacher

Appendices

Appendix A: List of Identifications

This appendix lists all the identifications, to have them in one place for easy reference by the reader.

First is a list of artifacts and concepts:

  • T: hairy RE task
  • t: tool for a hairy RE task
  • G: gold standard
  • D: document or documents to which T or t is applied
  • E: set of experts about the domain of D
  • \(\mathcal{A}\): set of answers from D
  • HAR: humanly achievable recall

Next comes a list of measures. For the measures, there are two general forms:

  1. \(M_{p,D,T,t,i}\), that can be read as: Measure, M, [of process | with qualifying property], p, that involves a document, D, a task T, a tool t, and the i-th [expert | answer], in a way that makes sense for \(M_{p}\);

     or

  2. \(M_{D,T,t,i}\), that can be read as: Measure, M, that involves a document, D, a task T, a tool t, and the i-th [expert | answer], in a way that makes sense for M.

In any particular measure of either form, one or more of D, T, t, i may be missing, just because for the measure, they are not relevant.

For any process p, “p” refers to the whole process applied to its entire artifact, while “pItem” refers to the process applied to a single item in the entire artifact.

Next is a list of the Ms and their ps, if any, divided into sublists. Each p in a sublist applies to each M in the same sublist.

For M = τ (time to do p), p is one of:

  • examine: examine entire document for relevant answers
  • examineItem: examine one answer to determine its relevance
  • findItem: find one relevant answer in an entire document
  • vet: vet entire tool output for relevant answers
  • vetItem: vet one answer for relevance

For M = R (recall of p), P (precision of p), A (accuracy of p), S (summarization of p), or s (selectivity of p), p is one of:

  • expert: an expert
  • aveExpert: average of experts
  • raw: raw tool output
  • vet: result of vetting of tool output

The remaining Ms take no p:

  • λ: frequency of relevant answers among answers
  • β: weight of recall over precision
  • a: vetting speedup (think “acceleration”)
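
To make these definitions concrete, the following is a minimal Python sketch, written for this rewrite rather than taken from the article. It assumes the standard formulas for recall, precision, and the weighted F-measure Fβ, and it treats λ, β, and selectivity as the ratios used in Appendix C: relevant answers over all candidate answers, its reciprocal, and tool output size over all candidate answers, respectively. Summarization (S) is omitted because its formula is not reproduced in this excerpt; the function names are mine, not the article’s.

    # Minimal sketch of the measures listed above (my code, not the article's).

    def recall(true_positives, relevant_total):
        # R: fraction of all relevant answers that were found
        return true_positives / relevant_total

    def precision(true_positives, returned_total):
        # P: fraction of the returned answers that are relevant
        return true_positives / returned_total

    def f_beta(p, r, beta):
        # Weighted F-measure; beta > 1 weights recall over precision
        return (1 + beta**2) * p * r / (beta**2 * p + r)

    def selectivity(returned_total, candidate_total):
        # s: fraction of the candidate-answer space that the tool returns (lower is better)
        return returned_total / candidate_total

    def beta_for_document(candidate_total, relevant_total):
        # beta_{D,T} estimated as 1/lambda: candidate answers per relevant answer
        return candidate_total / relevant_total

    # Example with the HDO 19x50 tracing dataset of Appendix C.1:
    # 950 candidate links, of which 41 are true links.
    print(beta_for_document(950, 41))   # ~23.17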

Appendix B: Why is precision so emphasized?

Why is there such an emphasis on precision? Precision is important in the information retrieval (IR) area from which we, in RE, have borrowed many of the algorithms used in constructing the tools for RE hairy tasks (De Lucia et al. 2012; Berry et al. 2012). In IR, users of a tool with low precision are turned off by having to reject false positives more often than they accept true positives. In some cases, only a few or even only one true positive is needed. Perhaps the force of habit drives people to evaluate the tools for hairy tasks with the same criteria that are used for IR tools. Also, “precision” sounds so much more important than “recall”, as in “This output is precisely right!”.

Mei Nagappan, in face-to-face verbal communication, offered the observation that it’s much easier to calculate precision than recall, because it’s easier for a person to reject false positives than to find true positives (Nagappan 2018). So, precision is the measure of choice (see Footnote 31). Ironically, the reason that precision is chosen over recall is the very reason that a tool’s help is needed to find the true positives and that an evaluation of the tool must focus instead on its recall.

In fact, the most senior author of each of two of the works cited in Section 10 as having used the F-measure, which weights recall and precision equally, or as having emphasized precision more than recall, expressed surprise when I told him personally that his paper had done so. Each knew that recall was more critical than precision for his paper’s task, and each claimed that his paper reflected or accounted for that understanding. However, after I showed each the relevant text in his paper, he agreed that a reasonable reader would interpret the text as I had. Apparently, some other author of his paper had just followed the evaluation convention inherited from IR in writing parts of the paper.

When I asked one author of each of two other papers why his paper had used the F-measure, he replied that doing so was conventional. When I asked him if he and his co-authors had considered the possibility that recall should be weighted more than precision, he said simply “No.” One of the two added that he and his co-authors had enough else to worry about in conducting their tool evaluation.

Appendix C: Example reevaluations

This appendix examines several recently published papers that evaluate tools for hairy RE tasks. In each case, I believe that with more complete data, the paper’s results can be improved. In each case, the data of the paper, together in some cases with data provided by an author of the paper in personal communication, are mined to estimate as many of β, selectivity, and HAR as possible. When the paper calculates an F1 or F2, the paper’s calculation is redone using Fβ with the estimated β. Then the paper’s analysis is reexamined in light of the new estimated values.

If there are not sufficient data to estimate a β, and I believe that for the paper’s task, finding a true positive manually requires an order of magnitude more time than rejecting a false positive manually, and thus that β should be 10, then the reexamination uses a β of 5, just to be on the conservative side with the time estimate. In any case, for high enough precision, F5 is close enough to R that one does not need F10 to make the case that the results ride on the strength of the recall. If the original paper used an F-measure that was actually f1 or f2, then the strengthening with f5 is not as dramatic as with F5.
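
As a quick numerical illustration of the claim that F5 is already close enough to R, the Python sketch below compares F5 and F10 to R for a single recall and precision pair; the numbers are invented for illustration and do not come from any of the reevaluated papers.

    # Illustration only: made-up R and P for a tool with decent precision.
    def f_beta(p, r, beta):
        return (1 + beta**2) * p * r / (beta**2 * p + r)

    r, p = 0.90, 0.40
    print(round(f_beta(p, r, 5), 2))    # 0.86
    print(round(f_beta(p, r, 10), 2))   # 0.89 -- both already close to R = 0.90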

In each case in which I use estimated values, rather than values from actual data, to complete a reevaluation, the authors of the involved paper are in effect invited to carry out a replication of their study in which they gather all the data needed to do a complete, context-aware evaluation with actual data.

1.1 C.1 Hayes, Dekhtyar, and Osborne

In 2003, Hayes, Dekhtyar, and Osborne (HDO) describe the ordering of criteria used to evaluate a tool for the tracing task (Hayes et al. 2003).

Between these two metrics [of recall and precision], we chose recall as our most important objective. This ordering ensures that an analyst feels confident that all related links have, indeed, been retrieved. Precision comes next, so that the analyst would not have to sift through numerous erroneous potential links. We notice however, that without good precision, total recall is a meaningless accomplishment. For example, in forward tracing, it can be achieved by simply including every single lower-level requirement in the candidate lists for every higher-level requirement.

To show the use of these criteria, HDO describe two experiments conducted for the purpose of evaluating a new tracing tool that uses what they call their “retrieval with thesaurus algorithm” (Hayes et al. 2003). With these two experiments, they are able to show that

  • junior analysts achieve higher recall than does a tool using a vanilla IR algorithm, of which their retrieval with thesaurus algorithm is an extension, and

  • their tool achieves higher recall than “a senior analyst who traced manually as well as with an existing requirements tracing tool” (Hayes et al. 2003).

In each experiment, the analysts and the tools trace one or both of two datasets, each a matrix showing candidate links from up to 19 high-level requirements to 50 low-level requirements. Among these 19 × 50 = 950 candidate links, there are 41 true links. These 41 true links comprise a gold standard against which to evaluate the recall and precision of any analyst, tool, or both. One dataset is the full 19 × 50 matrix of candidate links, and the other is a carefully constructed 10 × 10 submatrix, which includes a sampling of all kinds of linking situations, including source with no target, target with no source, source with many targets, target with many sources, source with one target, target with one source, etc. (Hayes 2019). HDO’s paper does not state the number of true links in the 10 × 10 submatrix, and neither senior author remembers the exact number (Hayes 2019), but the number is definitely less than 41.

Table 11 reproduces their Table 1 about the first experiment, but with some additional clarification. Each junior analyst traced the 10 × 10 dataset, and the tool with the vanilla IR algorithm traced both the 10 × 10 dataset and the 19 × 50 dataset. Consistent with HDO’s recognition of the importance of recall over precision for tracing, their comments on this table emphasize that “the analysts tied or outperformed the vanilla vector algorithm in overall recall (by 0 – 20%)” (Hayes et al. 2003).

Table 11 Comparison of vanilla IR algorithm with two junior analysts

Table 12 reproduces parts of their Table 2 about the second experiment, but with some corrections and additional clarification (Hayes et al. 2003). One clarification that is needed to understand the table is that the SuperTracePlus tool is not a modern automated tool, after whose invocation the analyst quickly vets the candidate links that the tool returns. Rather, the SuperTracePlus tool requires that a complete manual trace be done to prepare the input that it refines into a more complete set of candidate links (Hayes 2019). In the second experiment, the senior analyst spent 9 hours doing a manual trace, finding 39 candidate links. He or she gave these 39 candidate links to HDO for analysis, leading to the data reported in the “Analyst Alone” column of the table. Then, during an additional 4 hours, the analyst (1) input to the SuperTracePlus tool the data it needs, all discovered during the manual trace, and (2) worked with the tool to find an additional 28 candidate links. He or she gave the full set of 67 candidate links to HDO for analysis, leading to the data reported in the “Analyst with SuperTracePlus Tool” column of the table. HDO’s comments about this table lead off with observations about overall recall (Hayes et al. 2003):

Note that the retrieval with thesaurus algorithm achieved recall of over 85% with 40.6% precision on a dataset that is only 19x50. The recall outperforms [the analyst working with] SuperTracePlus™ by 22% and the analyst [working alone, manually] by a whopping 42%.

Table 12 Comparison of (1) expert senior analyst with supertraceplus tool, (2) expert senior analyst alone, and (3) retrieval with thesaurus algorithm

HDO’s data allow computing some of the measures described in this article. For the task of tracing the full 19 × 50 dataset that contains 41 true links, \(\beta_{D,T} = \frac{950}{41} = 23.17\). \(\beta_{D,T}\) for the task of tracing the 10 × 10 dataset cannot be calculated because the number of true links is unknown. However that number is no bigger than 40 and is probably bigger than 4.32, the number expected if the 10 × 10 dataset were a random sampling of the 19 × 50 dataset. Therefore, for the task of tracing the 10 × 10 dataset, \(23.17 > \beta_{D,T} \geq (\frac{100}{40} = 2.5)\). However, this \(\beta_{D,T}\) is contrived because it is from a subdataset constructed to have all of the true links in the original dataset. So, it will be ignored in the following argument. Thus, HDO’s emphasis on recall over precision is well founded.

From the recalls of the three human analysts, it is possible to calculate a HAR for the task of tracing the 19 × 50 dataset and its subdataset. The overall HAR for all three analysts on both datasets is 36.6%, the average of 23.0%, 42.9%, and 43.9%. The HAR for the expert senior analyst on the 19 × 50 dataset is 43.9%, and the HAR for the junior analysts on the 10 × 10 dataset is 32.95%, the average of 23.0% and 42.9%. These values may be lower than what would be achieved in a real-life project in which the analysts doing the tracing would be familiar with the system under development. The comments HDO gathered from the analysts indicate that they felt that their lack of knowledge of details left out of the documents hindered them.

If these HARs are accepted at face value, then this high \(\beta_{D,T}\) and these low HARs say that a tool with 85.36% recall should be used regardless of what its precision is, particularly if the tool has any decent selectivity. In fact, the selectivity of the tool can be calculated as 9.05%, which is 86, the total number of links found, divided by 19 × 50, the total number of possible links.
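
The numbers in this subsection can be cross-checked with the short Python sketch below; the code is mine, and the figures in it are the ones quoted above from HDO’s data.

    # Cross-check of the Appendix C.1 figures for the HDO 19x50 dataset.
    candidate_links = 19 * 50        # 950 possible links
    true_links = 41                  # links in the gold standard

    beta_d_t = candidate_links / true_links        # ~23.17
    har_all = (23.0 + 42.9 + 43.9) / 3             # overall HAR, ~36.6%
    har_junior = (23.0 + 42.9) / 2                 # junior analysts' HAR, 32.95%
    selectivity = 86 / candidate_links             # 86 links returned, ~9.05%

    print(round(beta_d_t, 2), round(har_all, 1),
          round(har_junior, 2), round(100 * selectivity, 2))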

Interestingly, the two junior analysts spent 0.65 minutes and 1.5 minutes per candidate link, while the senior analyst spent 0.57 minutes per candidate link. The values are consistent with each other, even as the senior analyst is a bit faster than each junior analyst.

The second experiment evaluated the performance of their retrieval with thesaurus algorithm running by itself with no vetting by a human analyst to remove false positives, to help HDO to discover what needs to be changed to improve the algorithm’s precision. Considering these improvements, HDO say (Hayes et al. 2003):

Analysis of our retrieval algorithms showed the presence of many false positives. We also noticed that many of these were returned with very low relevance. In order to analyze the true effectiveness of our algorithms, we chose to implement various thresholds to trim the lists of candidate links. Decreasing the size of the lists this way allows us to improve precision at a potential cost to recall.

Because HDO’s tool has such good selectivity, I do not think that paying for improving precision by decreasing recall is a good deal. Therefore, I would not attempt to improve the tool’s precision at all and would rely on analysts’ vetting the tool’s output to remove false positives.

1.2 C.2 Sundaram, Hayes, and Dekhtyar

Sundaram, Hayes, and Dekhtyar (SHD) experimentally evaluated a variety of IR methods applied to the task of tracing open source datasets (Sundaram et al. 2005). SHD do not use any F-measure, instead couching their conclusions strictly on the basis of recall, precision, and selectivity. Moreover, SHD make tradeoffs consistent with HDO’s observation, mentioned earlier, of the importance of recall over precision for the tracing task.

SHD give, in their Table 1, the precision, recall, and selectivity of the variety of IR methods applied to the MODIS and CM-1 datasets. Table 13 reproduces this table (Sundaram et al. 2005).

Table 13 Results of IR methods on CM-1 and MODIS datasets, no feedback, no filtering

SHD describe how they do the evaluation (Sundaram et al. 2005):

When evaluating candidate traces (whether computer- or human-generated), we want both precision and recall to be as high as possible. However, we use different thresholds for these two measures. Based on our observation, we note that a change from 10% to 20% in precision means that instead of 1 true link in 10, the trace contains 1 true link in 5, a savings of about 50% in terms of the number of links to verify for the analyst. We strive for high recall to prevent analysts from having to search for links not shown to them. This latter activity is much more time consuming and much less desirable as a task than simply vetting the results provided from an existing candidate trace. In addition, we want selectivity to be as low [good] as possible (while keeping recall high). For recall, we consider results above 80% excellent, above 70% — good, and between 60% and 70% — acceptable. For precision, 20- 30% is acceptable, 30-50% is good, and 50% and above — excellent.

So, for the MODIS task, the best tool algorithm is tf-idf+TH, because it has 100.0% recall, and even though it has only 10.1% precision, it has very nearly the best selectivity at 43.1%. The algorithm with the best selectivity, 41.9%, has a much lower recall, 75.6%. With this assurance of 100.0% recall, which certainly is no worse than the HAR for tracing MODIS, significant work is saved when the humans spend their time boringly vetting the tool’s reported links instead of tediously doing the tracing manually. For the CM-1 task, the best tool algorithm is again tf-idf+TH, because it has 97.7% recall, and even though it has only 1.5% precision, it has the best selectivity at 42.8%. With this assurance of 97.7% recall, which almost certainly is no worse than the HAR for tracing CM-1, significant work is saved when the humans spend their time boringly vetting the tool’s reported links instead of tediously doing the tracing manually.

SHD do not mention any F-measure. However, they give data from which \(\lambda_{D,T}\), and thus \(\beta_{D,T}\), can be calculated for each dataset. The MODIS dataset consists of 19 high-level requirements and 49 low-level requirements, and has 41 true links, each from a high-level requirement to a low-level requirement. The CM-1 dataset consists of 235 high-level requirements and 220 design elements, and has 361 true links, each from a high-level requirement to a design element. Thus, \(\beta_{T_{\text{MODIS}}} = \frac{19 \times 49}{41} = \frac{931}{41} = 22.70\), and \(\beta_{T_{\text{CM-1}}} = \frac{235 \times 220}{361} = \frac{51700}{361} = 143.21\). With β = 22.70, Fβ is virtually identical to R, and with β = 143.21, Fβ is essentially identical to R. So, using Fβ with empirically determined values of β would lead to the same conclusions.
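
The closeness of Fβ to R claimed here is easy to verify; the Python sketch below is mine and uses the standard weighted F-measure together with the tf-idf+TH recall and precision values quoted above from Table 13.

    # Check that F_beta tracks R for SHD's tf-idf+TH results.
    def f_beta(p, r, beta):
        return (1 + beta**2) * p * r / (beta**2 * p + r)

    beta_modis = 931 / 41       # ~22.70
    beta_cm1 = 51700 / 361      # ~143.21

    print(round(f_beta(0.101, 1.000, beta_modis), 3))   # ~0.983, vs R = 1.000
    print(round(f_beta(0.015, 0.977, beta_cm1), 3))     # ~0.974, vs R = 0.977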

1.3 C.3 Merten et al.

Merten, Krämer, Mager, Schell, Bürsner, and Paech (MKMSBP) extensively evaluated the performance of five different IR algorithms in the task of extracting trace links for a CBS from the data in the issue tracking system (ITS) for the CBS’s development (Merten et al. 2016). They calculated the recall and the precision of each algorithm by comparing its output for any ITS with a gold standard set of links extracted manually from the ITS. In order that both recall and precision be taken into account in the evaluation, they calculated F1 and F2, the latter because it “emphasizes recall”. However, as is shown below, F2 does not emphasize recall enough.

After MKMSBP state that their goal is to “maximize precision, recall, F1 and F2 measure”, their “Results are presented as F2 and F1 measure in general.” They summarize their results:

In terms of algorithms, to our surprise, no variant of BM25 competed for the best results. The best F2 measures of all BM25 variants varied from 0.09 to 0.19 over all projects, independently of standard preprocessing. When maximizing R to 1, P does not cross a 2% barrier for any algorithm. Even for R ≥ 0.9, P is still < 0.05. All in all, the results are not good according to [standards set by Hayes, Dekhtyar, and Sundaram (Hayes et al. 2006)] independently of standard preprocessing, and they cannot compete with related work on structured RAs [requirements artifacts].

This summary is puzzling because earlier they had said:

However, maximising [sic] recall is often desirable in practice, because it is simpler to remove wrong links manually than to find correct links manually.

This implies that they recognize Paech’s observation, quoted in Section 3, that vetting a link that is output by a tool should be easier than an in-situ examination of a potential link during a manual tracing.

On the assumption that each of the algorithms is doing real work and is not just returning every possible link, at least one of MKMSBP’s algorithms is able to achieve an R of 1.0, even at the expense of a P of 0.02! This recall is better than the best that they gave in their “Related Work” section, that reported by Gotel et al. (Gotel et al. 2012):

“[some] methods retrieved almost all of the true links (in the 90% range for recall) and yet also retrieved many false positives (with precision in the low 10–20% range, with occasional exceptions).”

That is, the best R is around 0.9 with P between 0.1 and 0.2. Of course, to be certain that the algorithm with the recall of 1.0 is useful in the presence of a precision of 0.02, it will be necessary to calculate the algorithm’s selectivity and to compare the time for a completely manual search with the time to vet the output of the algorithm.

Table 14 shows, in Columns 4 and 5, the F1 and F2 values computed from the three pairs of P and R values that they compared. Certainly, each of F1 and F2 for the Related-Works R = 0.9 and P = 0.1 or P = 0.2 is about an order of magnitude bigger than that for the Merten-et-al R = 1.0 and P = 0.02.

Table 14 Computed Fβ values

MKMSBP give some data from which it is possible to compute \(\beta_{D,T}\). They say that they manually created gold standard trace matrices (GSTMs). From the paragraph titled “Gold Standard Trace Matrices” in their Section 5.1, it is revealed that for each project, three of the authors together made 4950 manual comparisons. Table 2 of their paper shows, in the “GSTM generic” row, the number of links found for the four projects. They are 102, 18, 55, and 94, for a total of 269 links. Thus, for this context, \(\lambda = \frac{269}{4950} = 0.054\), and \(\beta_{D,T} = \frac{4950}{269} = 18.40\), nearly twice as large as 10.

Let us round 18.40 down to 18. Columns 6 and 7 of Table 14 show the F18 and F21 values for the same pairs of P s and R s. With F18, the R and P of MKMSBP just underperform the R and P of both of the Related-Work pairs. With F21, the R and P of MKMSBP just outperform the R and P of either Related-Work pair. F18 uses βD, T, the task β. If vetting a tool t’s candidate link is only 14.13% faster than manually deciding a candidate link in situ in the documents, then βD, T, t is 21. When β = 21, an R of 1.0 is truly better than an R of 0.9, regardless of the P value, if close to 100% recall is essential. There are reasons to believe that \(\frac {\beta _{D,T,t}}{\beta _{D,T}}\) is larger than 1 for the tracing task. See the last half of Section 5.2.
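
The crossover between β = 18 and β = 21 can be checked directly from the standard formula \(F_{\beta } = \frac {(1+\beta ^{2})PR}{\beta ^{2}P+R}\). A minimal sketch, using the three (R, P) pairs discussed above:

```python
# Sketch: F_beta for the three (R, P) pairs compared in Table 14, at
# beta = 18 and beta = 21.
def f_beta(p: float, r: float, beta: float) -> float:
    return (1 + beta**2) * p * r / (beta**2 * p + r)

pairs = [
    ("MKMSBP       R=1.0, P=0.02", 0.02, 1.0),
    ("Related work R=0.9, P=0.1 ", 0.10, 0.9),
    ("Related work R=0.9, P=0.2 ", 0.20, 0.9),
]

for beta in (18, 21):
    for name, p, r in pairs:
        print(f"beta={beta}  {name}  F={f_beta(p, r, beta):.3f}")
# At beta = 18, the MKMSBP pair scores about 0.87, just below the two
# related-work pairs (about 0.88 and 0.89); at beta = 21, it scores about
# 0.90, just above both.
```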

The MKMSBP R and P, being 100% and 2%, respectively, raise the specter that the high recall is achieved by the extreme tradeoff of returning the tracing equivalent of an entire document, namely every possible link, as is described in the beginning of Section 5. So, selectivity becomes the deciding measure. However, even if selectivity is close to 100%, i.e., bad, using the tool may still be better than searching manually, because vetting a link in a tool’s output is usually faster than deciding the relevance of a potential link in a manual search. The MKMSBP context seems to be like that described by Hayes et al., when they observed that (Hayes et al. 2018):

In our prior work [citing (Hayes et al. 2006)], we observed that even a high-recall, low-precision candidate TM [(trace matrix)] already generates savings as compared to the analyst’s need to examine every pair of low-level/high-level elements when measured in terms of selectivity.

C.4 Maalej et al.

Maalej, Kurtanović, Nabil, and Stanik (MKNS) observe that the billions of NL app reviews written by app downloaders and users constitute a rich source of information about the reviewed apps that includes descriptions of app bugs, descriptions of current app features, and ideas for new app features (Maalej et al. 2016a).

The majority of the reviews, however, is rather non-informative just praising the app and repeating to the star ratings in words.

That is, the useful information is buried in a lot of noise. The task of extracting the useful information from the mass of NL app reviews is surely a hairy task. This conclusion is strengthened by estimates of βD, T (Maalej 2017), obtained by interpreting data from Table III of Maalej’s paper with Pagano (Pagano and Maalej 2013) according to the information given by MKNS in Paragraph 4 of Section 1 on Page 312. This method yields βD, T values for the tasks

  1. of finding bug reports in app reviews as 10.00,

  2. of finding feature requests in app reviews as 9.09 (rounded to 9),

  3. of predicting user experiences from app reviews as 2.71 (rounded to 3), and

  4. of predicting app ratings from app reviews as 1.07 (rounded to 1).

The first two tasks are clearly hairy, the last task is equally clearly not, and the third is only somewhat hairy. These values for βD, T are used in the discussions below about the MKNS findings about tools for these tasks.

MKNS evaluated a number of different combinations of probabilistic techniques to classify app reviews into four types: bug reports, feature requests, user experiences, and app ratings. That is, they evaluated each technique’s effectiveness at each of the four kinds of classification tasks. Each technique was applied to a collection of NL reviews for which manual classification of each kind had been done to create a gold standard for the classification task. To evaluate each technique on each task, they calculated the technique’s recall and precision for the task relative to the task’s gold standard. From each such recall and precision, they calculated an accuracy measure that is none other than F1. Their paper gives tables showing, for each combination of techniques, its recall, precision, and F1, to allow each reader to pick the combination that is best suited to his or her context, including any in which recall is more critical. Nevertheless, the focus of the discussion in the paper is on F1.

Based on these F1 values, they conclude which specific combinations of techniques are best for the four tasks.

We achieved the highest precision for predicting user experience and ratings (92%), the highest recall, and F-measure for user experience (respectively, 99 and 92%). For bug reports we found that the highest precision (89%) was achieved with the bag of words, rating, and one sentiment, while the highest recall (98%) with using bigrams, rating, and one score sentiment.

Their discussions, task by task, are:

Finding bug reports:

MKNS do observe that

For predicting bug reports the recall might be more important than precision. Bug reports are critical reviews, and app vendors would probably need to make sure that a review analytics tool does not miss any of them, with the compromise that a few of the reviews predicted as bug reports are actually not (false positives).

Even though they say that recall might be more important than precision for at least the finding-bug-reports task, they conclude that

For a balance between precision and recall combining bag of words, lemmatization, bigram, rating, and tense seems to work best,

with F1 = 88%, and they do not consider what might work best when recall is emphasized.

When F10 is used to evaluate the techniques, then instead, combining bigram, rating, and one sentiment works best, with F10 = 98%.

Finding feature requests:

MKNS make no statement for the feature-request finding task about the relative importance of recall and precision. They conclude from their data,

Concerning feature requests, using the bag of words, rating, and one sentiment resulted in the highest precision with 89%. The best F-measure was 85% with bag of words, lemmatization, bigram, rating, and tense as the classification features.

When F9 is used to evaluate the techniques, then instead, bigram works best, with F9 = 96%.

Predicting user experiences:

MKNS also make no statement for the user-experience predicting task about the relative importance of recall and precision. They conclude from their data,

The results for predicting user experiences were surprisingly high. We expect those to be hard to predict as the basic technique for user experiences shows. The best option that balances precision and recall was to combine bag of words wit [sic] bigrams, lemmatization, the rating, and the tense. This option achieved a balanced precision and recall with a F-measure of 92%.

When F3 is used to evaluate the techniques, then instead, combining bigram, rating, and one sentiment works best, with F3 = 96%.

Predicting app ratings:

MKNS also make no statement for the app-rating predicting task about the relative importance of recall and precision. However, it seems to Maalej (in an in-person conversation) that, for this task, precision and recall are equally important. Indeed, the calculated βD, T is 1.07. They conclude from their data,

Predicting ratings with the bigram, rating, and one sentiment score leads to the top precision of 92%. This result means that stakeholders can precisely select rating among many reviews. Even if not all ratings are selected (false negatives) due to average recall, those that are selected will be very likely ratings. A common use case would be to filter out reviews that only include ratings or to select another type of reviews with or without ratings.

For every one of the three at least somewhat hairy tasks, the maximum F1 that any combination of techniques achieves is 92%, while the minimum \(F_{\beta _{D,T}}\), with the calculated βD, T, that any combination achieves is 96%, which is greater than 92%, and the maximum such \(F_{\beta _{D,T}}\) that any combination achieves is 98%. Table 15 shows the value of each evaluation measure for each hairy task.

Table 15 Computed Fβ values

If one considers the use of F1 as one treatment and the use of Fβ as another treatment, then, among the three hairy tasks, the difference between the F1 and Fβ values in the three data rows of Table 15 is significant; a two-tailed t-test yields a t-value of − 3.90434 with a p-value of 0.017477, which is significant at p < 0.05. Thus, the use of \(F_{\beta _{D,T}}\) with an empirically calculated βD, T to evaluate the techniques produces stronger results than does the use of F1.
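
This test is easy to reproduce. Below is a minimal sketch, assuming that the three data rows of Table 15 hold the per-task values quoted above, i.e., F1 values of 88%, 85%, and 92% versus \(F_{\beta _{D,T}}\) values of 98%, 96%, and 96%:

```python
# Sketch: two-tailed t-test comparing the F_1 values with the F_beta values
# over the three hairy tasks, assuming Table 15 holds the percentages
# quoted in the discussion above.
from scipy import stats

f1_values = [88, 85, 92]       # bug reports, feature requests, user experiences
fbeta_values = [98, 96, 96]    # F_10, F_9, F_3 for the same tasks

t_stat, p_value = stats.ttest_ind(f1_values, fbeta_values)
print(f"t = {t_stat:.5f}, p = {p_value:.6f}")  # t ~= -3.90434, p ~= 0.017477
```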

Whenever, as with MKNS’s work, there is a collection of more than one tool for the same hairy task, some additional possibilities present themselves.

  • If the task is one for which summarization is meaningful, i.e., the output is in the same language as the input, and one tool, HR, has very high recall and summarizes while another, HP, has very high precision, then run HR first, and run HP on HR’s output. This composition of HR and HP should have both high recall and high precision; see the sketch following this list.

  • If no tool for the task has 100% recall, and each tool’s output differs from that of all the others in the collection, then the union of the tools’ outputs should have higher recall than any one tool’s output, particularly if each tool’s false negatives differ from those of the other tools in the collection.
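
Neither possibility depends on the internals of any particular tool. Below is a minimal sketch in which each hypothetical tool is modeled simply as a function from a set of items to the subset it returns; the names HR, HP, and union_of_tools are placeholders, not MKNS’s tools:

```python
# Sketch: composing a high-recall summarizing tool with a high-precision
# tool, and unioning the outputs of several imperfect tools. The tools are
# hypothetical placeholders, modeled as functions from a set of items to
# the subset of items they return.
from typing import Callable, List, Set

Tool = Callable[[Set[str]], Set[str]]

def compose(hr: Tool, hp: Tool, items: Set[str]) -> Set[str]:
    # Run the high-recall tool first, then the high-precision tool on its
    # (smaller) output; recall is bounded by HR's, precision boosted by HP.
    return hp(hr(items))

def union_of_tools(tools: List[Tool], items: Set[str]) -> Set[str]:
    # An item is missed only if every tool misses it, so the union's recall
    # is at least that of the best single tool.
    result: Set[str] = set()
    for tool in tools:
        result |= tool(items)
    return result
```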

C.5 Quirchmayr et al.

Quirchmayr, Paech, Kohl, and Karey (Quirchmayr et al. 2017) (QPKK) describe a semi-automatic approach for the task of identifying and extracting feature-relevant information about a CBS from NL users’ manuals about the CBS. Since the task “requires huge manual effort”, which is “cumbersome, error-prone, and costly”, the task is clearly hairy. They use recall, precision, and F1, called “accuracy”, to evaluate the approach.

QPKK describe several multi-step processes for achieving the task. Most steps in each process are automatic, but a few are manual. Thus, a process is basically a composition of automatic tools with some human intervention. The empirical study was to see which process, i.e., which composition of automatic tools and manual intervention, gives the highest precision and accuracy relative to a gold standard, a document for which the features are known. QPKK appear ready to accept a decrease in recall as the cost of achieving higher precision and accuracy.

For feature identification, QPKK consider first a process, Id-E #1, which uses only the domain terms determination step. Then, they consider a process, Id-E #2, which uses the sentence type determination step in addition to the domain terms determination step. For feature extraction, they consider a process, Ex-E #1, that is based on only Id-E #1. Then, they consider a process, Ex-E #2, that is based on only Id-E #2. Finally, they consider a process, Ex-E #3, that is based on Id-E #2, with the addition of the syntactical relevancy determination step. Their paper’s Table 1 shows the R, P, and F1 values for each process, Id-E #1, Id-E #2, Ex-E #1, Ex-E #2, and Ex-E #3.

For feature identification, in going from Id-E #1 to Id-E #2, the increase in precision and accuracy is achieved with no change to recall. For feature extraction, in going from Ex-E #1 to Ex-E #2, the increase in precision and accuracy, of 7.0% and of 3.8%, respectively, is achieved with no change to recall, but in going from Ex-E #2 to Ex-E #3, the second, additional increase in precision and accuracy, of 11.7% and of 5.4%, respectively, is achieved at the cost of a 2.1% decrease in recall. For each task, the paper reports the process with the highest accuracy as the one to use. Thus, they recommend Id-E #2 for feature identification and Ex-E #3 for feature extraction.

This effort to increase precision and accuracy was largely unnecessary. For feature identification, the recall of Id-E #1 was already at the maximum, at 98.75%, and for feature extraction, the recall of Ex-E #1 was already at the maximum, at 99.06%. Each of these recalls probably beats the HAR for its task! For feature identification, each of the precision and the accuracy of Id-E #2 is higher than that of Id-E #1, while the recall of Id-E #2 is unchanged from that of Id-E #1. For feature extraction, each of the precision and the accuracy of Ex-E #2 is higher than that of Ex-E #1, while the recall of Ex-E #2 is unchanged from that of Ex-E #1. In other words, for each task, going from the first process to the second yielded increased precision and accuracy without decreasing recall. So, these second processes should be used; there is no harm in increasing precision and accuracy if recall is not sacrificed.

However, for feature extraction, each of the precision and the accuracy of Ex-E #3 is higher than that of Ex-E #2, while the recall of Ex-E #3 is less than that of Ex-E #2. So, Ex-E #3 should not be used. Had QPKK used even F5 as their accuracy measure, they would have seen that Ex-E #2 is the process to use for feature extraction.
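
QPKK’s Table 1 is not reproduced here, so the precision values below are assumptions, chosen only to reflect the reported 11.7% increase in precision and 2.1% decrease in recall between Ex-E #2 and Ex-E #3; the recalls are those quoted above. A minimal sketch of why a recall-emphasizing F5 reverses the recommendation:

```python
# Sketch: F_1 versus F_5 for Ex-E #2 and Ex-E #3. The recalls are the
# reported ones; the precisions (0.700 and 0.817) are hypothetical values
# standing in for QPKK's Table 1, differing by the reported 11.7%.
def f_beta(p: float, r: float, beta: float) -> float:
    return (1 + beta**2) * p * r / (beta**2 * p + r)

processes = {
    "Ex-E #2": (0.700, 0.9906),   # (P, R): hypothetical P, reported R
    "Ex-E #3": (0.817, 0.9696),   # P up 11.7 points, R down 2.1 points
}

for name, (p, r) in processes.items():
    print(name, f"F1 = {f_beta(p, r, 1):.3f}", f"F5 = {f_beta(p, r, 5):.3f}")
# F1 favors Ex-E #3, but F5 (and any larger beta) favors Ex-E #2, because
# the small loss in recall outweighs the gain in precision once recall is
# weighted heavily.
```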

Table 16 shows, for each task, FI (feature identification) and FE (feature extraction), and for each process, Id-E #1, Id-E #2, Ex-E #1, Ex-E #2, and Ex-E #3, the values of P, R, F1, F5, and F10. Observe how close F5 and F10 are to R in each row of the table.

Table 16 Computed Fβ values

C.6 Maro et al.

Maro, Steghöfer, Hayes, Cleland-Huang, and Staron (MSHC-HS) empirically study the information that is supplied with the candidate links generated by automated tools for the tracing task, T, in order to determine what information will help a human to vet the links (Maro et al. 2018). They conducted a controlled experiment to determine the effect of providing, with each link in the set L of candidate links returned by an open-source tool t for T, information about the context of the link’s source and target

  1. on the speed with which L is vetted and

  2. on the recall and precision of the final set \(L^{\prime }\) of vetted links produced by human vetters.

In the experiment, t is an easily customized open-source tracing tool. The dataset to which t is applied consists of the requirements and the code, assumptions, and faults for an application for which a gold standard G of relevant links had been developed previously by colleagues of one of the authors. The test (treatment) group of subjects used an extension \(t^{\prime }\) of t that had been customized to provide the context of the source and target of each candidate link it generated. The control group used the vanilla t, which provides no context information with any candidate link. The two tools return the same set of candidate links but differ in the information provided with the links.

The results show that the test group

  • vetted slightly more links in 45 minutes and

  • achieved slightly lower precision and recall in their final sets of vetted links,

on average, than did the control group. While the results are not statistically significant, by interviewing the subjects, MSHC-HS did manage to learn useful things that can help build better tools for the tracing task.

While MSHC-HS did not quite achieve the goals that drove their study, the paper provides data that, when augmented by data supplied by Salome Maro in e-mail communication (Maro 2019) and then subjected to the measures described in this article, show that either the test or control tool is quite good as a tool for the tracing task.

There were 7644 potential links in the dataset, of which 188 were relevant links in G. Therefore,

$$ \beta_{D,T} = \frac{7644}{188} = 40.66~~~{.} $$
(53)

Since both tools returned the same candidate links, each tool returned 1322 candidate links. Therefore, the selectivity,

$$ s_{\textit{raw},D,T,t} = s_{\textit{raw},D,T,t^{\prime}} = \frac{1322}{7644} = 17.29 {\%} $$
(54)

For the test tool, \(t^{\prime }\):

$$ \begin{array}{ll} &\tau_{\textit{vet},D,T,t^{\prime}} = \frac{45}{143.07} = 0.3145 \text{~minutes/link}~~~{;} \\ &R_{\textit{raw},D,T,t^{\prime}} = 72.02{\%}~~~{;} \\ &P_{\textit{raw},D,T,t^{\prime}} = 19.05{\%}~~~{;} \\ &F_{\beta_{D,T}} = 71.90{\%}~~~{.} \end{array} $$
(55)

For the control tool, t:

$$ \begin{array}{ll} &{\tau_{\textit{vet},D,T,t}} = \frac{45}{119.64} = 0.3761 \text{~minutes/link}~~~{;} \\ &R_{\textit{raw},D,T,t} = 72.23{\%}~~~{;} \\ &P_{\textit{raw},D,T,t} = 22.04{\%}~~~{;} \\ &F_{\beta_{D,T}} = 72.13{\%}~~~{.} \end{array} $$
(56)

Note how close each \(F_{\beta _{D,T}} = F_{40}\) is to its R.

Unfortunately, there are no data that allow computing τfindItem, D, T, the time to find a relevant link in a manual search, or the HAR, the human achievable recall. The values needed to compute them should have been noted during the construction of G, but were not, because MSHC-HS simply did not anticipate that they would want them. However, the dataset used in the experiment was supplied by Cleland-Huang, one of whose other works, about the returns on investment of various tracing strategies (Cleland-Huang et al. 2004), described in Section 4, provides an estimate for τfindItem, D, T of 90 minutes, based on Cleland-Huang et al.’s extensive experience with tracing, including, presumably, with the dataset used by MSHC-HS. Therefore, for the test tool,

$$ \begin{array}{ll} &\beta_{D,T,t^{\prime}} = \frac{90}{\frac{45}{143.07}} = 286.14~~~{;} \\ &F_{\beta_{D,T,t^{\prime}}} = 72.02{\%}~~~{,} \end{array} $$
(57)

and for the control tool,

$$ \begin{array}{ll} &\beta_{D,T,t} = \frac{90}{\frac{45}{119.64}} = 239.28~~~{;} \\ &F_{\beta_{D,T,t}} = 72.23{\%}~~~{.} \end{array} $$
(58)

With a β > 200, after round off, each of \(F_{\beta _{D,T,t^{\prime }}}\) and \(F_{\beta _{D,T,t}}\) is identical to its R.

As mentioned in Appendix C.1, the only data I have seen that allow calculating a HAR are those of Hayes, Dekhtyar, and Osborne (HDO) (Hayes et al. 2003). That estimate of the HAR is 36.6%. If this HAR is accepted as applicable to MSHC-HS’s tracing task, then either tool is more than acceptable. In either case, the tool’s recall is greater than 72%, substantially greater than the HAR, and the tool’s selectivity is a very good value of about 17.3%, meaning that the vetting task is substantially smaller than the manual tracing task. Thus, the tool’s low precision of around 20% does not matter, a conclusion confirmed by the various β values that make the various Fβ values virtually identical to the tool’s recall.

Also, even if the selectivity were not so good, the per-link vetting times of the tools are much shorter than the time to manually decide the relevance of a potential link. The time to manually find a relevant link is 90 minutes, and there are βD, T = 40.66 potential links per relevant link. So, manually deciding the relevance of any potential link requires \(\tau _{\textit {examineItem},D,T} = \frac {90}{40.66} = 2.2134\) minutes. The vetting times are \(\tau _{\textit {vet},D,T,t^{\prime }} = 0.3145\) minutes/link and τvet, D, T, t = 0.3761 minutes/link. Therefore, vetting \(t^{\prime }\)’s output is 7.03 times faster than manually deciding the relevance of a potential link, and vetting t’s output is 5.89 times faster. These two values are the vetting speedups, \(\textbf {a}_{D,T,t^{\prime }}\) and aD, T, t, respectively, of \(t^{\prime }\) and t. In any case, if either tool’s recall beats the HAR for T, the data say that using the tool to do T saves the tracer a lot of time over doing T manually.
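
These computations are straightforward to reproduce from the published and e-mailed counts and times. A minimal sketch:

```python
# Sketch: the MSHC-HS computations above, from the raw counts and times.
potential_links = 7644        # low-level/high-level pairs in the dataset
relevant_links = 188          # links in the gold standard G
candidate_links = 1322        # returned by both t and t'
tau_find = 90.0               # minutes to find a relevant link manually

beta_D_T = potential_links / relevant_links        # ~= 40.66
selectivity = candidate_links / potential_links    # ~= 0.173, i.e., about 17.3%

tau_vet_test = 45 / 143.07    # ~= 0.3145 minutes/link for t'
tau_vet_ctrl = 45 / 119.64    # ~= 0.3761 minutes/link for t

beta_test = tau_find / tau_vet_test                # beta_{D,T,t'} ~= 286
beta_ctrl = tau_find / tau_vet_ctrl                # beta_{D,T,t}  ~= 239

tau_examine = tau_find / beta_D_T                  # ~= 2.21 minutes/potential link
speedup_test = tau_examine / tau_vet_test          # a_{D,T,t'} ~= 7.03
speedup_ctrl = tau_examine / tau_vet_ctrl          # a_{D,T,t}  ~= 5.89

print(beta_D_T, selectivity, beta_test, beta_ctrl, speedup_test, speedup_ctrl)
```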

C.7 Related work basing their evaluations on F2

The results of each of the related works cited in Section 10.2 that used f2 or F2 to come to a favorable conclusion about its tool can be strengthened by the use of fβ or Fβ, respectively, with β being 5, 10, or even larger.

For example, consider the results of Arora et al. that were considered mediocre (Arora et al. 2015). Their results are actually very good, with recall in the range of 0.91 to 1 and precision in the range of 0.85 to 0.94. Table 17 extends their Table 6, which uses f2, with f5, f10, F5, and F10 values.

Table 17 Computed Fβ values

Each f2 value lies about \(\frac {2}{3}\) of the way from its P value to its R value. Each f5 value is a bit closer to its R value, and each f10 value is closer still. Each F5 value is already very nearly its R value, and each F10 value is almost exactly its R value. Given the empirically calculated βD, T of 18.40 for a tracing task, these high β s may not be unrealistic.
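
This behavior follows directly from the definitions. Assuming, as the “\(\frac {2}{3}\) of the way” remark suggests, that fβ is the recall-weighted arithmetic mean and Fβ the usual recall-weighted harmonic mean of P and R,

$$ f_{\beta} = \frac{\beta R + P}{\beta + 1}~~~\text{and}~~~F_{\beta} = \frac{(1+\beta^{2}) P R}{\beta^{2} P + R}~~~{,} $$

both tend to R as β grows: f2 = (2R + P)/3 lies exactly \(\frac {2}{3}\) of the way from P to R, while Fβ, because of its β2 weighting, is already nearly R whenever β2P is large compared to R.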

In a similar fashion, the results of each of (Cleland-Huang et al. 2010), (Yang et al. 2011), and (Delater and Paech 2013) are improved when f5 or F5 is used instead of its f2 or F2, respectively.

C.8 The next two subappendices

The next two subappendices discuss work that appeared after earlier versions of this article were published as a workshop paper (Berry 2017a) and as a technical report at my Website (Berry 2017b), and after a panel titled “Context-Dependent Evaluation of Tools for NL RE Tasks: Recall vs. Precision, and Beyond” was held at RE 2017 in Lisbon (Berry et al. 2017). Each work uses ideas from and cites my workshop paper. Consequently, the subappendices use the vocabulary of this article.

C.9 Winkler, Grönberg, and Vogelsang

The work of Winkler, Grönberg, and Vogelsang (WGV) (Winkler et al. 2019) comes the closest of all published work known to me to doing the full evaluation described in this article. The task, T, is that of classifying the elements of an RS into requirements and non-requirements. Such a classification is useful for deciding (1) which elements of an RS provide information that must be considered during implementation and (2) which elements provide only additional information that might be useful. The tool, t, that they built for T is ML-based. Besides conducting much of the suggested evaluation on one version of t, they actually used a βD, T, t calculated from data about the first version of t to inform an optimization for building a second version of t.

WGV conducted an experiment with two groups, each with eight students. The experiment involved conducting T on two RS documents, D1 and D2, each of whose requirement statements had been carefully classified in advance; that is, these documents served as gold standards (but see below about who did the classifying). One group, the manual group (MG), conducted T manually on D1 and D2, and the other group, the tool-assisted group (TAG), conducted T by using t on D1 and D2 and then vetting its output.

After collecting a lot of data, WGV report that [in the following, the text inside square brackets relates WGV’s measures to those defined in this article]

  • the MG achieved an average recall [RaveExpert, D, T] on the documents of 39%,

  • t itself achieved a recall [Rraw, D, T, t] of 84% on D1 and of 66% on D2,

  • the TAG achieved an average recall after vetting [Rvet, D, T, t] on the documents of 51%,

  • β [βD, T, t] can be calculated as 6.2, using the timing data from MG’s doing T and from TAG’s doing the vetting after running t,

  • using F1 to optimize t results in a recall [Rraw, D, T, t] of 83%, a precision [Praw, D, T, t] of 81%, and a summarization [Sraw, D, T, t] of 83%, while

  • using F6.2 to optimize t results in a recall [Rraw, D, T, t] of 98%, a precision [Praw, D, T, t] of 42%, and a summarization [Sraw, D, T, t] of 61%.

The last two items represent a moderate, but decent, gain in recall, at the cost of halving the precision and a moderate loss in summarization. WGV emphasize these last two items in words when they say

Our main finding is that, in the given setting, [1] finding defects with the help of a classification tool works better than working on the original data and [2] that using Fβ instead of F1 for optimizing the tool makes sense and reduces the number of elements that need to be examined by a human by 61% (i.e., summarization).
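
WGV’s actual tool and training pipeline are not described in enough detail here to reproduce, but the kind of optimization they report can be sketched: choose the classifier’s score threshold to maximize Fβ on a validation set, with β = 1 versus β = 6.2 leading to different thresholds. In the sketch below, the scores, labels, and threshold-selection routine are illustrative assumptions, not WGV’s:

```python
# Sketch: selecting a classification threshold that maximizes F_beta on a
# validation set. Scores and labels are illustrative placeholders, not
# WGV's data; the point is only that beta = 6.2 pushes the chosen
# threshold down (more elements flagged, higher recall) than beta = 1 does.
def f_beta(p: float, r: float, beta: float) -> float:
    return (1 + beta**2) * p * r / (beta**2 * p + r) if p and r else 0.0

def best_threshold(scores, labels, beta):
    best_score, best_th = 0.0, None
    for th in sorted(set(scores)):
        predicted = [s >= th for s in scores]
        tp = sum(pr and lb for pr, lb in zip(predicted, labels))
        fp = sum(pr and not lb for pr, lb in zip(predicted, labels))
        fn = sum(not pr and lb for pr, lb in zip(predicted, labels))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        fb = f_beta(precision, recall, beta)
        if fb > best_score:
            best_score, best_th = fb, th
    return best_th, best_score

# Hypothetical validation scores and true labels (True = requirement):
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.55, 0.40, 0.30, 0.20, 0.10]
labels = [True, True, True, False, True, True, False, False, True, False]

for beta in (1.0, 6.2):
    print(beta, best_threshold(scores, labels, beta))
# With beta = 1, the best threshold here is 0.55; with beta = 6.2, it drops
# to 0.20, trading precision for recall.
```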

Note that WGV correctly did not call the MG’s average recall [RaveExpert, D, T] the HAR, because they used students and not domain experts to do the manual task. According to Vogelsang, in e-mail communication, WGV received D1 and D2 and classified D1 and D2 themselves. So, domain experts had not created the so-called gold standard used in their experiment (Vogelsang 2019). Nevertheless, even the unoptimized t beat these humans doing T manually. The optimized t, with its 98% recall, probably beats a properly determined HAR.

WGV conclude about the recall, precision, and summarization of the optimized t:

Based on our experiment, we know that these values represent a decent balance between recall and precision for the defect detection task.

Along the way, WGV add support to the idea that tools make humans lazy or overconfident. They found that members of the TAG missed 90% of the defects that the tool did not warn them about, while members of the MG missed only 62% of the same defects. Vogelsang, in e-mail communication, thought that student tool users might be less confident than professionals would be about disagreeing with the tool (Vogelsang 2019).

C.10 Jha and Mahmoud

The work of Jha and Mahmoud (JM) comes in two parts (Jha and Mahmoud 2019). In the main part of the paper, they do a thorough traditional evaluation of their tools based on F2 and come to conclusions that are weaker than they needed to be. However, near the end of the paper, in its “Impact, Tool Support, and Threats to Validity” section, they have a subsection titled “More on Fβ Score”, in which they redo the main analysis with F7, after obtaining 7 as the average of three estimates for β. Accordingly, this summary of their work is divided into (1) that about the main part and (2) that about the Fβ Score part.

C.10.1 Main part

The main part of the paper describes building tools for the task, T, of finding the non-functional requirements (NFRs) expressed in user reviews of mobile apps downloadable from app stores. To be able to evaluate the effectiveness of the tools, JM built a gold standard of 6000 reviews covering a broad range of iOS apps. They use this gold standard to evaluate variations of a tool that uses a dictionary-based multi-label classification approach to automatically capture NFRs in user reviews, which is the essence of T. Variations of the approach are evaluated and then compared, much as Maalej et al. did, in order to select the variation with the best balance of recall and precision according to F2. They explain that

In our analysis, we use β = 2 to emphasize recall over precision. The assumption is that, errors of commission (false positives) can be easier to deal with than errors of omission (false negative).

Thus, JM recognize that recall should be weighted more than precision. They even explicitly give a basis for calculating the value of the weight, i.e., the ratio of the cost of an error of omission to the cost of an error of commission, which is none other than βD, T, t. Nevertheless, their choice of β = 2 appears to be without any empirical basis, perhaps resting on tradition rather than on any empirical data.

In the main part of the paper, JM do not provide any data from which βD, T, t can be calculated for any of their proposed tools. However, they do provide the frequency, λ, for each kind of NFR. For example, the most common kind of NFR mentioned in the 6000 reviews was usability, which occurred in 18.63% of the reviews. Thus, βD, T for the task of finding usability NFRs is \(\frac {1}{\lambda } = \frac {1}{0.1863} = 5.37\). The task of finding each of the other, less frequent kinds of NFRs would have a higher βD, T. So, βD, T = 5.37 can be taken as a conservative lower bound on β for Fβ.

The first six data columns of Table 18 reproduce the “Average”, i.e., over all kinds of NFRs, columns of JM’s Table 5. (The last two data columns are for later discussions.) The following discussion focuses on the data in these columns, in particular on the precision (P), the recall (R), and the F2 (F2) columns. The rows of the table, in fact, describe a progression of classification schemes. The first row, about a classification scheme called “TM”, shows the evaluation of the first classification scheme that they implemented. Each subsequent row shows the evaluation of an attempt to improve the classification scheme of the previous row with an adjustment to the scheme’s settings, an addition to the scheme, or both. The goal of the progression is to obtain a scheme that represents the best balance between recall and precision as measured by F2.

Table 18 “Average” columns of JM’s Table 5 plus two redoings

The text about the first row says that the TM classification scheme,

while managed to achieve a higher recall (0.94 [sic: the table gives 0.93]), suffered in terms of precision (0.51). A closer look at the data reveals that the low precision values can be attributed to longer reviews.

In this row, F2 = 0.80.

The paper provides data that show that considering only reviews of length 12 words or less provides the best tradeoff between recall and precision:

In general, at smaller cut-off points, the precision is relatively high and the recall is low (too many false negatives). The recall gradually increases as the threshold increases; however, the precision decreases. At a 12 word cut-off point, the performance achieves a reasonable balance between precision and recall. The average performance in terms of the different performance measures at threshold 12 is shown at the second row of Table 5. At this point, precision = 0.72 and recall = 0.79.

In this row, however, F2 = 0.77.

To improve the accuracy of the classifications, JM tried updating the TM scheme to consider specific terms identified by relative term frequency (TF) and those identified by term frequency-inverse document frequency (TF-IDF). They note that

The best results obtained when using TF and TF-IDF super indicator terms are shown in the third and fourth rows of Table 5 respectively. Using the list of TF terms [described in the fourth row], a reasonable performance (F2 = 0.82) is achieved at the cut-off point of 12 words. At that point …, precision = 0.69 [sic: the table gives 0.68], recall = 0.86 [sic: the table gives 0.87], HS = 0.69, and HL= 0.12. Using TF-IDF [described in the third row], the performance also seems to be balanced (F2 = 0.80) at the cut-off point of 12 words. At that point …, precision = 0.66 [sic: the table gives 0.61], recall = 0.84 [sic: the table gives 0.85; however, the table P and R yield an F2 of 0.79, not 0.80], HS = 0.67 [sic: the table gives 0.66], and HL = 0.13 [sic: the table gives 0.14].

The third row, for the classification scheme called “TM + TF-IDF + (12)”, shows an average F2 of 0.79, while the fourth row, for the classification scheme called “TM + TF + (12)”, shows an average F2 of 0.82. These F2 values justify JM’s observation, in passing,

The relatively better performance of TF in comparison to TF-IDF …

Thus, JM consider TM + TF + (12) to be the best classification scheme. They do so, despite its having a lower average recall than just TM, because its average precision is enough better than that of just TM that its average F2 is higher than that of just TM.

JM took the best classification scheme, TM + TF + (12), and applied it to a new collection of reviews of Android apps to obtain the data shown in the fifth row of the same table. They conclude,

In general, the approach maintained its accuracy, achieving an average precision of 0.72 and average recall of 0.82 [sic: the table gives 0.85; moreover, 0.82 yields an F2 of 0.80, not 0.82]), thus, increasing the confidence in the generalizability of our results.

If JM had taken β to be any value at least 5.37, the results would have been quite different. The seventh data column of Table 18 shows, for each row, the F5.37 calculated from the row’s average precision and recall values. Now, the TM classification scheme proves to be the best of the classification schemes listed in the table. Each of its F5.37 and its average recall is the highest. Observe how little effect the low precision of 0.51 has on F5.37: F5.37 is 0.91, and recall is 0.93. If F5.37 had been used to evaluate TM, then instead of trying to increase precision without decreasing recall too much, JM might have tried to increase recall without decreasing precision to the point of decreasing F5.37 to less than 0.91. A totally different history of optimizations would have ensued.
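
The effect on the TM row can be checked directly from its published averages of P = 0.51 and R = 0.93. A minimal sketch:

```python
# Sketch: F_beta for JM's TM classification scheme, computed from its
# published average precision and recall, at the betas discussed here.
def f_beta(p: float, r: float, beta: float) -> float:
    return (1 + beta**2) * p * r / (beta**2 * p + r)

p_tm, r_tm = 0.51, 0.93
for beta in (1, 2, 5.37, 7):
    print(f"F_{beta} = {f_beta(p_tm, r_tm, beta):.2f}")
# F_1 ~= 0.66, F_2 ~= 0.80, F_5.37 ~= 0.91, F_7 ~= 0.91: once beta reflects
# the task or the tool, TM's low precision of 0.51 hardly matters.
```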

C.10.2 Fβ score part

The Fβ Score part cites my workshop paper (Berry 2017a) and describes some additional experiments, using the tool t, MARC, that JM developed for T based on the approach called “TM + TF (12)”. Specifically, for each of the FitBit, Pandora, and Zillow apps, JM selected one random set of 50 reviews of the app. Each of three judges was given

  1. one set of reviews on which to perform T manually, while keeping track of time, and

  2. a different set of reviews on which to perform T by running t and then vetting t’s output, while keeping track of time.

Thus, T was performed on each set of reviews once manually and once with the help of t.

The data from these experiments allow JM to determine that

  • HAHR (this article’s HAR) is 93%, from the average of the recalls of the manual performances of T.

  • βD, T = 2.5, from the fact that 40% of the reviews each discuss at least one NFR.

  • β (this article’s βD, T, t) = 7, as the average of three β s for the three judges, each calculated as the ratio of

    1. τfindItem (this article’s τfindItem, D, T), obtained from one judge’s manual performance of T, to

    2. τvet (this article’s τvetItem, D, T, t), obtained from the same judge’s performing of the vetting after running t.

JM then use this β = 7 in a new calculation of Fβ. The ninth column of Table 18 shows, for each row, the F7 calculated from the row’s average precision and recall values; for the first through fourth rows, these F7 values duplicate the values given in the last column of JM’s Table 9. Observe how close each row’s R, F5.37, and F7 are to each other. Thus, if P and R are of the same order of magnitude, Fβ does approach R very rapidly as β grows. Even F5.37 is essentially indistinguishable from R.

Based on these new values, JM conclude:

The judges [sic] recall (Rj = 93%) … can be thought of as the humanly achievable high recall (HAHR) under ideal manual settings, and this HAHR is the recall that a tool should beat if the tool will be used in a setting in which it is critical to have as close to 100% recall as is possible. The tool achieves only 78% recall, because some recall was sacrificed in exchange for high precision. Perhaps that sacrifice was not a good tradeoff. The fact that the time it took judges to discard false positives … using the tool was on average 1/7 of the time they needed to consider one candidate in the manual search, says that even with low precision, using a tool will be faster than doing the task manually. Furthermore, given our data, we can estimate β for our task (T) before any tool has been built to see if building a tool for the task is worth the effort. Specifically, let λ be the fraction of the potential true positives in the set of reviews, then \(\beta _{D,T} = \frac {1}{\lambda }\). Our quantitative analysis revealed that λ ≈ 0.4, thus our βD, T = 2.5. The ratio between βD, T and the tool’s β [this article’s βD, T, t] (2.5 to 7) provides another evidence of the time-saving the tool achieves over the manual task
