Abstract
A common way to describe requirements in Agile software development is through user stories, which are short descriptions of desired functionality. Nevertheless, there are no widely accepted quantitative metrics to evaluate user stories. We propose a novel metric to evaluate user stories called instability, which measures the number of changes made to a user story after it was assigned to a developer to be implemented in the near future. A user story with a high instability score suggests that it was not detailed and coherent enough to be implemented. The instability of a user story can be automatically extracted from industry-standard issue tracking systems such as Jira by performing retrospective analysis over user stories that were fully implemented. We propose a method for creating prediction models that can identify user stories that will have high instability even before they have been assigned to a developer. Our method works by applying a machine learning algorithm on implemented user stories, considering only features that are available before a user story is assigned to a developer. We evaluate our prediction models on several open-source projects and one commercial project and show that they outperform baseline prediction models.
Notes
That is, the data for each project were split into train and test sets. The train set was used to create a prediction model that was then evaluated on the test set for that project.
USIs labeled as “defects” or “bug reports” are not included.
The textual description of a USI was collected from the “summary”, “description”, and “acceptance criteria” fields in Jira.
Performing a k-fold cross-validation would, for some folds, require evaluating on data collected before the data used for training. This is problematic, especially since the features in the “Personalized Metrics” family rely on analyzing the instability of past USIs.
Note that we have also repeated our experiments using a k-fold cross-validation, and the results obtained were similar to those reported here.
See the definition of AUC ROC in Sect. 5.2.
The code is written in Python and includes a detailed README file with step-by-step instructions. This code can be used to reproduce our experiments, train and evaluate instability prediction models on other datasets, and explore other features and algorithms over our dataset. Our dataset is given as an exported SQL server dump file.
Acknowledgements
This research was supported by the Ministry of Science & Technology, Israel, and by the Israeli Science Foundation Grant #210/17 to Roni Stern.
Appendices
Appendix 1: List of features
The features used by our instability prediction models are listed below. For each feature, we briefly explain the rationale for using it and, when needed, how it is computed.
1.1 Simple text processing features
- Text length: The number of characters in the USI text. Short text may suggest missing details, and very long text may indicate an excessively detailed USI that causes confusion.
- Number of question marks in the text: Question marks may indicate that the USI writer is unsure about the requirement, increasing the probability of later changes.
- Number of headlines in the text: In some projects, most USIs contain headlines that give structure to the USI text. In such cases, missing headlines may indicate missing information.
- Has an acceptance criteria headline: An “Acceptance Criteria” headline exists in many projects. Since the acceptance criteria are an important part of a USI, their absence may indicate missing information.
- Number of sentences in the text: One long sentence can be hard to understand, as can many short sentences.
- Number of words in the text: Similar to the text length feature, too few words may suggest missing details, and too many words may cause confusion.
- Average number of words in a sentence: Similar to the number-of-sentences feature, sentences that are too short or too long, on average, may hurt the readability of the text.
- Does the text contain a URL: This binary feature indicates whether the USI description contains a URL. A URL can add information that is missing from the USI text and thus affect the USI instability.
- Does the text contain source code: This binary feature indicates whether the USI description contains source code. As with the previous feature, source code can add information that is missing from the text.
- Does the text contain the Connextra user story template: This binary feature indicates whether the USI description follows the well-known Connextra template (“As a (role) I want (something) so that (benefit)”). We conjecture that stories written following this template are of higher quality and thus exhibit lower instability.
- Does the text contain the words “TODO”/“TBD”/“Please”: This binary feature indicates whether the USI description contains one of the words “TODO”, “TBD”, or “Please”. These words imply that something is missing or unclear in the text, which may mean it will be edited later (exhibiting instability).
- Number of stop words in the text: A sentence without stop words can be hard to understand. On the other hand, a sentence with too many stop words may suggest low writing quality and uncertainty.
- Number of nouns/adjectives/adverbs/pronouns in the text: This feature may provide some insight into the structure of the USI description. For example, a USI missing a verb may be hard to understand.
- Is the text field empty: This binary feature indicates whether the USI description is empty. A missing text field may suggest a lack of information and thus be correlated with instability.
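Most of the simple text-processing features above reduce to plain string handling and regular expressions. The sketch below is a minimal illustration of a representative subset, not the paper's actual implementation; the function name, key names, and regular expressions are our own:

```python
import re

# Hypothetical patterns: a loose Connextra check and a simple URL check.
CONNEXTRA = re.compile(r"as an? .+ i want .+ so that .+", re.IGNORECASE | re.DOTALL)
URL = re.compile(r"https?://\S+")
UNCERTAIN_WORDS = ("todo", "tbd", "please")

def simple_text_features(text):
    """Compute a subset of the simple text-processing features for a USI description."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = text.split()
    return {
        "text_length": len(text),
        "num_question_marks": text.count("?"),
        "num_words": len(words),
        "num_sentences": len(sentences),
        "avg_words_per_sentence": len(words) / len(sentences) if sentences else 0.0,
        "contains_url": bool(URL.search(text)),
        "matches_connextra": bool(CONNEXTRA.search(text)),
        "contains_uncertain_word": any(w in text.lower() for w in UNCERTAIN_WORDS),
        "is_empty": len(text.strip()) == 0,
    }
```

The headline count and part-of-speech features would additionally need the project's headline conventions and a POS tagger, so they are omitted here.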
1.2 Advanced text-based features
- USI vector: Translates each USI's text into a fixed-size vector of numbers, which form the resulting features. See the main body of the text for details.
- Topic model: We train a topic model for each project and add a feature for each topic. See the main body of the text for details.
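The exact tooling behind these features is described in the main text, not here. As a hedged stand-in for the per-project topic model, the sketch below derives one feature per topic using scikit-learn's `CountVectorizer` and `LatentDirichletAllocation`; the paper may well have used a different library, and the USI-vector feature is not shown:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def topic_features(docs, n_topics=3, seed=0):
    """Fit a topic model on one project's USI texts and return, for each USI,
    one feature per topic: the topic's weight in that document."""
    counts = CountVectorizer(stop_words="english").fit_transform(docs)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=seed)
    # Rows of the result sum to 1: each row is a document's topic distribution.
    return lda.fit_transform(counts)

# Toy per-project corpus (invented examples, not from the paper's dataset).
docs = [
    "as a user i want to log in so that i can see my dashboard",
    "fix the database connection pool timeout configuration",
    "as an admin i want to export reports so that i can audit usage",
]
weights = topic_features(docs)
```

Each row of `weights` would then be appended to the corresponding USI's feature vector.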
1.3 Process metrics
- Number of changes in the text before entering a sprint: Many changes to a USI before the sprint may point to a problem with the text or an unclear requirement and can lead to more changes during the sprint.
- Number of comments before entering a sprint: The number of comments the USI received before entering the sprint. Many comments may suggest an unclear requirement.
- Number of changes in story points before entering a sprint: Changes in the number of story points may indicate an unclear requirement and hence lead to USI instability.
- Number of story points when entering a sprint: The development team's effort estimate may have an indirect influence on the USI instability.
- USI priority when entering a sprint: The priority may influence stability, e.g., an urgent USI may be written in a hasty and less rigorous manner.
- Time until the USI entered a sprint: If a USI stays in the backlog for a long time, there is more time for it to be written in a complete and stable manner. On the other hand, a long stay may suggest that the USI is less important, and hence it may have been written sloppily and eventually exhibit instability.
- Number of issue links: The number of issue links of several types (such as “duplicates” and “blocks”). A high number of dependencies may affect the USI instability.
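Several of these process metrics amount to filtering a Jira-style change history for records that predate the USI's sprint entry. The sketch below is illustrative only: the record shapes (`(timestamp, field)` tuples) and field names (`"description"`, `"story_points"`) are hypothetical simplifications, not Jira's actual changelog schema:

```python
from datetime import datetime

def process_metrics(changelog, comment_times, sprint_entry):
    """Compute a few process-metric features for a USI.

    changelog: list of (timestamp, field) change records for the USI.
    comment_times: list of timestamps at which comments were added.
    sprint_entry: the time at which the USI entered a sprint.
    """
    before = [(t, f) for (t, f) in changelog if t < sprint_entry]
    return {
        "text_changes_before_sprint": sum(1 for _, f in before if f == "description"),
        "story_point_changes_before_sprint": sum(1 for _, f in before if f == "story_points"),
        "comments_before_sprint": sum(1 for t in comment_times if t < sprint_entry),
    }
```

Changes after `sprint_entry` are deliberately excluded: post-entry changes are what the instability label counts, so using them as features would leak the label.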
1.4 Personalized metrics
- Number of USIs: The number of USIs that the author wrote before the current USI. The idea behind this feature is that we expect a person with limited experience in writing USIs to write USIs of limited quality, whereas we expect a skilled person to write high-quality USIs with a lower probability of change.
- Ratio of unstable USIs in the past: The ratio of unstable USIs among all the USIs that the author wrote before the current one. We expect the probability that a USI will change to increase when the author's ratio of unstable USIs is high.
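Both personalized metrics are simple aggregations over the author's earlier USIs. A minimal sketch, assuming each past USI is summarized by its instability score (number of post-assignment changes); the threshold of 5 mirrors the 5-instability task, but the function and parameter names are our own:

```python
def personalized_metrics(author_history, threshold=5):
    """Compute the two personalized-metric features for a USI's author.

    author_history: chronological list of instability scores of USIs the
    author wrote before the current USI. A past USI counts as unstable
    if its score is at least `threshold`.
    """
    n = len(author_history)
    unstable = sum(1 for score in author_history if score >= threshold)
    return {
        "num_past_usis": n,
        "unstable_ratio": unstable / n if n else 0.0,
    }
```

Because the history is restricted to USIs written before the current one, these features respect the chronological train/test split described in the notes.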
Appendix 2: Confusion matrices for instability prediction models
For completeness, Table 9 provides the TN, FP, FN, TP, and accuracy results obtained by the evaluated prediction models in each project. We show results for the 5-instability and 20-instability prediction tasks in the columns \(k=5\) and \(k=20\), respectively.
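The quantities in Table 9 follow the standard binary-classification definitions. As a small self-contained sketch (not the paper's evaluation code), where label 1 means the USI's instability is at least \(k\):

```python
def confusion_and_accuracy(y_true, y_pred):
    """Compute TN, FP, FN, TP and accuracy for binary instability predictions."""
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    accuracy = (tp + tn) / len(y_true)
    return tn, fp, fn, tp, accuracy
```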
Cite this article
Levy, Y., Stern, R., Sturm, A. et al. An impact-driven approach to predict user stories instability. Requirements Eng 27, 231–248 (2022). https://doi.org/10.1007/s00766-022-00372-w