Automatically detecting feature requests from development emails by leveraging semantic sequence mining

Shi, Lin; Chen, Celia; Wang, Qing; Boehm, Barry

doi:10.1007/s00766-020-00344-y

Original Article
Published: 30 March 2021

Volume 26, pages 255–271, (2021)

Requirements Engineering

Lin Shi (ORCID: orcid.org/0000-0003-1476-7213)^1,5, Celia Chen^2, Qing Wang^1,3,5 & Barry Boehm^4

603 Accesses
2 Citations
1 Altmetric

Abstract

Mailing lists are widely used as an important channel for communication between developers and stakeholders. They consist of emails posted for various purposes, such as reporting problems, seeking help with usage, managing projects, and discussing new features. Because of the large volume of incoming emails every day, some valuable emails that describe new features may be overlooked by developers. However, identifying these feature requests from development emails is a labor-intensive and challenging task. In this paper, we propose an automated solution to discover feature requests from development emails by leveraging semantic sequence patterns. First, we tag sentences in emails by using 81 fuzzy rules proposed in our previous study. Then we represent the semantic sequence with the contextual information of an email in a 2-gram model. After applying sequence pattern mining techniques, we generate 10 semantic sequence patterns from 317 tagged emails that are randomly sampled from the Ubuntu community. We also conduct an empirical evaluation of their capability to discover feature requests from large volumes of email in Ubuntu and four other open source communities. The results show that our approach can effectively identify feature requests from these emails. Compared to existing baselines, our approach achieves better performance in terms of precision, recall, F1-score, AUC, and positive, with the average precision and recall for discovering feature requests from emails being 76% and 86%, respectively.
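
The pipeline summarized in the abstract (tagged sentences, a 2-gram representation, then pattern mining) can be illustrated with a minimal sketch. This is not the authors' implementation: the tag names, the `<START>`/`<END>` padding, the function names, and the support threshold are all hypothetical, and plain per-email support counting stands in for the sequence pattern mining technique the paper applies.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set, Tuple

Bigram = Tuple[str, str]


def bigrams(tags: List[str]) -> List[Bigram]:
    """Return adjacent tag pairs for one email's sentence-tag sequence.

    The <START>/<END> padding is an assumption made for this sketch so that
    the position of a tag at the beginning or end of an email is captured.
    """
    padded = ["<START>", *tags, "<END>"]
    return list(zip(padded, padded[1:]))


def frequent_bigrams(emails: Iterable[List[str]], min_support: float) -> Dict[Bigram, float]:
    """Keep the 2-grams that occur in at least `min_support` of the emails."""
    emails = list(emails)
    counts: Counter = Counter()
    for tags in emails:
        # Count each 2-gram at most once per email (document-level support).
        counts.update(set(bigrams(tags)))
    total = len(emails)
    return {bg: c / total for bg, c in counts.items() if c / total >= min_support}


def matches_any(tags: List[str], patterns: Set[Bigram]) -> bool:
    """Flag an email if any of its 2-grams is one of the mined patterns."""
    return any(bg in patterns for bg in bigrams(tags))


# Hypothetical tagged emails: each list holds one tag per sentence.
tagged_emails = [
    ["INTENT", "EXPLANATION"],
    ["INTENT", "EXPLANATION", "BENEFIT"],
    ["TRIVIA"],
]
patterns = frequent_bigrams(tagged_emails, min_support=0.6)
is_candidate = matches_any(["INTENT", "EXPLANATION"], set(patterns))
```

In this toy setup, a 2-gram is kept only if it appears in at least 60% of the sample emails, and a new email is flagged when it contains any kept 2-gram. A real system would mine patterns from labeled feature-request emails and would likely use longer or gapped sequences.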

Acknowledgements

Our deepest gratitude goes to the anonymous reviewers for their careful work and thoughtful suggestions that have helped improve this manuscript substantially. We also would like to thank Michael Shoga for constructive criticism of this manuscript. This work is supported by the National Key Research and Development Program of China under Grant No. 2018YFB1403400, Youth Innovation Promotion Association CAS, and the National Science Foundation of China under Grant Nos. 61802374, 61432001, 61602450, and 62002348.

This material is also based upon work supported by the U.S. Department of Defense through the Systems Engineering Research Center (SERC), and the National Science Foundation Grant CMMI-1408909, Developing a Constructive Logic-Based Theory of Value-Based Systems Engineering.

Author information

Authors and Affiliations

Laboratory for Internet Software Technologies, Institute of Software, Chinese Academy of Sciences, Beijing, China
Lin Shi & Qing Wang
Department of Computer Science, Occidental College, Los Angeles, USA
Celia Chen
State Key Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences, Beijing, China
Qing Wang
Center for Systems and Software Engineering, University of Southern California, Los Angeles, USA
Barry Boehm
University of Chinese Academy of Sciences, Beijing, China
Lin Shi & Qing Wang

Corresponding author

Correspondence to Qing Wang.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Shi, L., Chen, C., Wang, Q. et al. Automatically detecting feature requests from development emails by leveraging semantic sequence mining. Requirements Eng 26, 255–271 (2021). https://doi.org/10.1007/s00766-020-00344-y

Received: 12 January 2020
Accepted: 16 November 2020
Published: 30 March 2021
Issue Date: June 2021
DOI: https://doi.org/10.1007/s00766-020-00344-y
