Abstract
Mature software systems comprise a vast number of heterogeneous system capabilities, which are usually requested by different groups of stakeholders and which evolve over time. Software features describe and logically bundle low-level capabilities on an abstract level and thus provide a structured and comprehensive overview of all capabilities of a software system. Software features are often not explicitly managed. Instead, feature-relevant information is often spread across several software engineering artifacts (e.g., user manuals, issue tracking systems). Identifying and extracting feature-relevant information from these artifacts in order to make feature knowledge explicit requires huge manual effort. In this paper, we present a two-step approach to extract feature-relevant information from a user manual: First, we semi-automatically extract a domain terminology from a natural language user manual based on linguistic patterns. Then, we apply natural language processing techniques based on the extracted domain terminology and structural sentence information. Our approach is able to extract atomic feature-relevant information with an F1-score of at least 92.00%. We describe the implementation of the approach as well as evaluations based on example sections of a user manual taken from industry.
Notes
The empirical estimate of β is calculated following Berry (2017). On average, a domain terminology contains 200 terms, and the two exemplary sections contain 1160 sentences in total. Hence, on average, every sixth sentence contains a domain term. We estimate ∼3 seconds to read a sentence and determine a domain term, so it takes about 18 seconds to find a true positive manually. In contrast, it takes ∼2 seconds to manually reject a false positive using our tool, which results in β = 9 (18/2). In the paper, we therefore rounded β to 10.
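The arithmetic in this note can be retraced with a short sketch. The function names are illustrative and not taken from the paper. The F-measure shown is the standard weighted F_β definition, which is assumed here to be the measure that β feeds into, and the precision and recall values in the last lines are made up for illustration, not results from the evaluation.

```python
def estimate_beta(terms, sentences, secs_per_sentence, secs_per_rejection):
    """Ratio of the manual time per true positive to the time per rejected false positive."""
    # 1160 sentences / 200 terms = 5.8, i.e. roughly every sixth sentence holds a term.
    sentences_per_term = round(sentences / terms)
    secs_per_true_positive = sentences_per_term * secs_per_sentence
    return secs_per_true_positive / secs_per_rejection


def f_beta(precision, recall, beta):
    """Standard weighted F-measure; beta > 1 weights recall more heavily than precision."""
    b2 = beta * beta
    denominator = b2 * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + b2) * precision * recall / denominator


# Values stated in the note: 200 terms, 1160 sentences, ~3 s per sentence, ~2 s per rejection.
beta = estimate_beta(terms=200, sentences=1160, secs_per_sentence=3, secs_per_rejection=2)
print(beta)  # 9.0, rounded up to 10 in the paper

# Illustrative (made-up) precision and recall: a larger beta rewards high recall.
print(f_beta(0.5, 0.9, 1), f_beta(0.5, 0.9, 10))
```

With β well above 1, a tool that finds nearly all domain terms scores highly even if a reviewer must reject many false positives, which matches the note's reasoning that rejecting a candidate is much cheaper than finding a term manually.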
References
Abney SP (2012) Parsing by chunks. In: Principle-based parsing, pp 257–278
Acher M, Cleve A, Perrouin G, Heymans P, Vanbeneden C, Collet P, Lahire P (2012) On extracting feature models from product descriptions. In: Proceedings of 6th International Workshop on Variability Modeling of Software-Intensive Systems (VaMoS’12), pp 45–54
Aggarwal C, Zhai C (2012) Mining Text Data
Alves V, Schwanninger C, Barbosa L, Rashid A, Sawyer P, Rayson P, Pohl C, Rummler A (2008) An exploratory study of information retrieval techniques in domain analysis. In: Proceedings of 12th International Software Product Line Conference (SPLC’08), pp 67–76
Apel S, Kästner C (2009) An Overview of Feature-Oriented Software Development. Obj Tec 8(5):49–84
Bakar NH, Kasirun ZM, Salleh N (2015a) Feature extraction approaches from natural language requirements for reuse in software product lines. J Syst Softw 106(C):132–149
Bakar NH, Kasirun ZM, Salleh N (2015b) Terms extractions: An approach for requirements reuse. In: 2nd Int. Conf. on Information Science and Security (ICISS), pp 1–4
Balachandran K, Ranathunga S (2016) Domain-specific term extraction for concept identification in ontology construction. In: International Conference on Web Intelligence (WI), pp 34–41
Beliga S, Meštrović A, Martinčić-Ipšić S (2015) An overview of graph-based keyword extraction methods and approaches. J Inf Organ Sci 39(1):1–20
Berry DM (2017) Evaluation of Tools for Hairy Requirements and Software Engineering Tasks. In: Proceedings of the 25th Int. Requirements Engineering Conference Workshops (REW), pp 284–291
Berry DM, Gacitua R, Sawyer P, Tjong SF (2012) The case for dumb requirements engineering tools. In: Proceedings of the 18th International Conference on Requirements Engineering (REFSQ’12), pp 211–217
Bishop CM (2006) Pattern recognition and machine learning
Blanco R, Lioma C (2012) Graph-based term weighting for information retrieval. Inf Retr 15(1):54–92
Bosch J (2000) Design and use of software architectures: adopting and evolving a product-line approach
Boutkova E, Houdek F (2011) Semi-automatic identification of features in requirement specifications. In: Proceedings of 19th International Requirements Engineering Conference (RE’11), pp 313–318
Brinton LJ, Brinton D (2010) The linguistic structure of modern English
Chandrasekar R, Doran C, Srinivas B (1996) Motivations and methods for text simplification. In: Proceedings of the 16th Conference on Computational Linguistics (COLING), pp 1041–1044
Charniak E (1997) Statistical parsing with a context-free grammar and word statistics. AAAI/IAAI
Chen J, Chau R, Yeh CH (2004) Discovering parallel text from the world wide web. In: Proceedings of the 2nd Workshop on Australasian Information Security, Data Mining and Web Intelligence, and Software Internationalisation, pp 157–161
Chen K, Zhang W, Zhao H, Mei H (2005) An approach to constructing feature models based on requirements clustering. In: Proceedings of 13th International Requirements Engineering Conference (RE’05), pp 31–40
Chen PI, Lin SJ (2010) Automatic keyword prediction using google similarity distance. Expert Syst Appl 37(3):1928–1938
Classen A, Heymans P, Schobbens PY (2008) What’s in a feature: A requirements engineering perspective. In: Proceedings of 11th International Conference on Fundamental Approaches to Software Engineering (FASE’08), pp 16–30
Conrado MS, Pardo TAS, Rezende SO (2014) The main challenge of semi-automatic term extraction methods. In: Proceedings of the 11th International Workshop on Natural Language Processing and Cognitive Science (NLPCS’14), pp 49–59
Corbett G (2006) Linguistic features. Encyclopedia of language and linguistics 2(7):193–194
Drymonas E, Zervanou K, Petrakis EGM (2010) Unsupervised ontology acquisition from plain texts: the OntoGain system. In: International Conference on Application of Natural Language to Information Systems, pp 277–287
Earls A, Embury S, Turner N (2002) A method for the manual extraction of business rules from legacy source code. BT Technology 20(4):127–145
Eisenbarth T, Koschke R, Simon D (2003) Locating features in source code. IEEE Trans Softw Eng 29(3):210–224
Ercan G, Cicekli I (2007) Using lexical chains for keyword extraction. Inf Process Manag 43(6):1705–1714
Gao X, Murugesan S, Lo B (2005) Extraction of keyterms by simple text mining for business information retrieval. In: Proceedings of the International Conference on e-Business Engineering (ICEBE’05), pp 332–339
Ghosh S, Elenius D, Li W, Lincoln P, Shankar N, Steiner W (2016) Arsenal: Automatic requirements specification extracting from natural language. In: Proceedings of 8th Int. Symp. of NASA Formal Methods (NFM’16), pp 41–46
Guzman E, Maalej W (2014) How do users like this feature? a fine grained sentiment analysis of app reviews. In: Proceedings of 22nd International Requirements Engineering Conference (RE’14), IEEE, pp 153–162
IEEE (1990) IEEE standard glossary of software engineering terminology. IEEE Std
Indurkhya N, Damerau FJ (2010) Handbook of natural language processing
John I, Dörr J (2003) Elicitation of requirements from user documentation. In: 9th International Workshop on Requirements Engineering (REFSQ’03), pp 17–26
Jonnalagadda S, Tari L, Hakenberg J, Baral C, Gonzalez G (2009) Towards effective sentence simplification for automatic processing of biomedical text. In: Proceedings of Human Language Technologies (HLT’09), pp 177–180
Kim SN, Baldwin T, Kan MY (2009) An unsupervised approach to domain-specific term extraction. In: Australasian Language Technology Association Workshop, vol 2009, pp 94–98
Klein D, Manning CD (2003) Fast exact inference with a factored model for natural language parsing. In: Becker S, Thrun S, Obermayer K (eds) Advances in Neural Information Processing Systems, vol 15, pp 3–10
Kleinberg JM (1999) Authoritative sources in a hyperlinked environment. J ACM 46(5):604–632
Kof L (2009) Requirements analysis: concept extraction and translation of textual specifications to executable models, pp 79–90
Levy R, Andrew G (2006) Tregex and Tsurgeon: tools for querying and manipulating tree data structures. In: Proceedings of 5th International Conference on Language Resources and Evaluation (LREC’06), pp 2231–2234
Li Y, Guzman E, Tsiamoura K, Schneider F, Bruegge B (2015) Automated requirements extraction for scientific software. Procedia Comput Sci 51:582–591
Liu F, Pennell D, Liu F, Liu Y (2009) Unsupervised approaches for automatic keyword extraction using meeting transcripts. In: Proceedings of human language technologies: The 2009 annual Conf. of the North American chapter of the association for computational linguistics, pp 620–628
Lossio-Ventura JA, Jonquet C, Roche M, Teisseire M (2014a) Biomedical terminology extraction: A new combination of statistical and web mining approaches. In: JADT: Journées d’Analyse statistique des Données Textuelles, pp 421–432
Lossio-Ventura JA, Jonquet C, Roche M, Teisseire M (2014b) Yet another ranking function for automatic multiword term extraction. In: International Conference on Natural Language Processing (NLP’14), pp 52–64
Loughran N, Sampaio A, Rashid A (2006) From Requirements Documents to Feature Models for Aspect Oriented Product Line Implementation. In: Int. Conf. on Model Driven Engineering Languages and Systems, pp 262–271
Manning CD, Surdeanu M, Bauer J, Finkel JR, Bethard S, McClosky D (2014) The Stanford CoreNLP natural language processing toolkit. In: Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations
Marciuska S, Gencel C, Abrahamsson P (2014) Automated feature identification in web applications. In: Proceedings of 14th International Conference on Software Quality (QSIC’14), pp 100–114
Meijer K, Frasincar F, Hogenboom F (2014) A semantic approach for extracting domain taxonomies from text. Decis Support Syst 62:78–93
Melville P, Gryc W, Lawrence RD (2009) Sentiment analysis of blogs by combining lexical knowledge with text classification. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge discovery and data mining, pp 1275–1284
Merten T, Falis M, Hübner P, Quirchmayr T, Bürsner S, Paech B (2016) Software feature request detection in issue tracking systems. In: Proceedings of 24th Int. Requirements Engineering Conference (RE’16), pp 166–175
Mu Y, Wang Y, Guo J (2009) Extracting software functional requirements from free text documents. In: Proceedings of 1st International Conference on Information and Multimedia Technology (ICIMT’09), pp 194–198
Nixon M (2008) Feature extraction & image processing
Paech B, Hübner P, Merten T (2014) What Are the Features of This Software? An Exploratory Study. In: Proceedings of 9th International Conference on Software Engineering Advances (ICSEA’14), pp 114–125
Pikkarainen M, Haikara J, Salo O, Abrahamsson P, Still J (2008) The impact of agile practices on communication in software development. J Empir Softw Eng 13(3):303–337
Quirchmayr T, Paech B, Kohl R, Karey H (2017) Semi-automatic software feature-relevant information extraction from natural language user manuals. In: Proceedings of the 23rd International Conference on Requirements Engineering (REFSQ’17), Springer, pp 255–272
Rose S, Engel D, Cramer N, Cowley W (2010) Automatic keyword extraction from individual documents. In: Berry MW, Kogan J (eds) Text Mining: Applications and Theory, pp 1–20
Sebastiani F (2002) Machine learning in automated text categorization. ACM Comput Surv 34(1):1–47
Shaker P, Atlee JM, Wang S (2012) A feature-oriented requirements modelling language. In: Proceedings of 20th International Requirements Engineering Conference (RE’12), pp 151–160
da Silva Conrado M, Pardo TAS, Rezende SO (2013) A machine learning approach to automatic term extraction using a rich feature set. In: HLT-NAACL, pp 16–23
Venu SH, Mohan V, Urkalan K, Geetha TV (2016) Unsupervised domain ontology learning from text. In: International Conference on Mining Intelligence and Knowledge Exploration, pp 132–143
Ward LJ, Woods G (2013) English grammar for dummies
Weston N, Chitchyan R, Rashid A (2009) A framework for constructing semantically composable feature models from natural language requirements. In: Proceedings of the 13th International Software Product Line Conf. (SPLC’09), pp 211–220
Wimalasuriya DC, Dou D (2010) Ontology-based information extraction: An introduction and a survey of current approaches. Inf Sci 36(3):306–323
Wong W, Liu W, Bennamoun M (2012) Ontology learning from text: A look back and into the future. ACM Comput Surv 44(4):20
Zapata JCM, Losada BM, Gonzalez-Calderon G (2012) An approach for using procedure manuals as a source for requirements elicitation. In: Proceedings of 38th Conf. Latinoamericana En Informatica (CLEI’12), pp 1–8
Zhang K, Xu H, Tang J, Li J (2006) Keyword extraction using support vector machine. In: International Conference on Web-Age Information Management, pp 85–96
Zorn-Pauli G, Paech B, Wittkopf J (2012) Strategic release planning challenges for global information systems - a position paper. In: Proceedings of 6th International Workshop on Software Product Management (IWSPM’12), pp 186–191
Acknowledgements
We would like to thank Roche Diagnostics GmbH for the financial support of this research project. Many thanks also to the GDC experts for their participation in the case study and valuable discussions of the results.
Additional information
Communicated by: Anna Perini and Paul Grünbacher
Appendices
The appendices in the following sections summarize the patterns that our approach requires to (1) correct parse trees and (2) adapt parse trees so that potentially feature-relevant information can be extracted smoothly. The patterns are defined in Tregex (Levy and Andrew 2006) and identify the parts of a parse tree to be modified. Tsurgeon (Levy and Andrew 2006), a tree-transformation utility built on top of Tregex, then manipulates the identified parts as desired. In the following sections, we provide the Tregex patterns with their corresponding Tsurgeon operations and examples. Each example shows the parse tree before modification on the left-hand side, with the part that matches the pattern colored red, and the parse tree after modification on the right-hand side, with the modified parts colored green.
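The match-then-modify idea can be illustrated with a minimal, dependency-free sketch. It does not use Tregex or Tsurgeon and does not reproduce any pattern from the appendices. The rule below (relabel a JJ that closes an NP as NN), the function names, and the example phrase are all hypothetical and only loosely inspired by the heading "JJ to NN".

```python
import re


def parse(text):
    """Parse a Penn-style bracketed string into nested lists [label, child, ...]."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        assert tokens[pos] == "("
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])  # a leaf word
                pos += 1
        pos += 1  # consume the closing bracket
        return [label] + children

    return node()


def show(tree):
    """Render a nested-list tree back into bracketed notation."""
    if isinstance(tree, str):
        return tree
    return "(" + " ".join([tree[0]] + [show(child) for child in tree[1:]]) + ")"


def relabel_final_jj(tree):
    """Hypothetical correction rule: a JJ that is the last child of an NP becomes an NN."""
    if isinstance(tree, str):
        return tree
    children = [relabel_final_jj(child) for child in tree[1:]]
    last = children[-1] if children else None
    if tree[0] == "NP" and isinstance(last, list) and last[0] == "JJ":
        children[-1] = ["NN"] + last[1:]  # the "modify" step, applied to the matched node
    return [tree[0]] + children


before = parse("(NP (DT the) (NN test) (JJ orderable))")
after = relabel_final_jj(before)
print(show(before))
print(show(after))
```

In Tregex and Tsurgeon the matching condition and the relabel operation would be written declaratively as a pattern and an operation script. The sketch hard-codes both in one function only to keep the example self-contained.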
Appendix A Parse Tree Correction Patterns
A.1 JJ to NN
A.2 ADVP to NP
A.3 Cleanse PP
A.4 VP to JJ
A.5 ADJP to PP
A.6 Complex NP#1
A.7 Complex NP#2
A.8 Complex NP#3
A.9 Complex NP#4
A.10 Complex NP#5
A.11 Cleanse PP
A.12 Cleanse NP lists#1
A.13 Cleanse NP lists#2
A.14 Cleanse S#1
A.15 Cleanse S#2
A.16 Cleanse “between” #1
A.17 Cleanse “between” #2
Appendix B Parse Tree Adaption Patterns
B.1 Remove SINV
B.2 Remove Brackets
B.3 Cleanse FRAG
B.4 Complex VP#1
B.5 Complex VP#2
B.6 Complex VP#3
B.7 Complex VP#4
B.8 ADVP in VP#1
B.9 ADVP in VP#2
B.10 ADJP in VP
B.11 PRT in VP
B.12 Complex PP
B.13 Complex NP#6
B.14 Multiple PP#1
B.15 Multiple PP#2
B.16 Remove S#1
B.17 Remove S#2
B.18 SBAR to VPH
B.19 SBAR to VPC#1
B.20 SBAR to VPC#2
B.21 SBAR to VPP
B.22 VP to VPV
B.23 PP to VPP#1
B.24 PP to VPP#2
B.25 VP to VPT#1
B.26 VP to VPT#2
B.27 VP to VPT#3
B.28 VP to VPW
B.29 PP to NPP
B.30 SBAR to NPW
B.31 VP to NPV
B.32 NP to PPN
B.33 PP to PPV
B.34 PP to PPW#1
B.35 PP to PPW#2
B.36 Surround NP
Cite this article
Quirchmayr, T., Paech, B., Kohl, R. et al. Semi-automatic rule-based domain terminology and software feature-relevant information extraction from natural language user manuals. Empir Software Eng 23, 3630–3683 (2018). https://doi.org/10.1007/s10664-018-9597-6
DOI: https://doi.org/10.1007/s10664-018-9597-6