Semi-automatic rule-based domain terminology and software feature-relevant information extraction from natural language user manuals

An approach and evaluation at Roche Diagnostics GmbH

Empirical Software Engineering

Abstract

Mature software systems comprise a vast number of heterogeneous system capabilities, which are usually requested by different groups of stakeholders and which evolve over time. Software features describe and logically bundle low-level capabilities at an abstract level and thus provide a structured and comprehensive overview of the entire capabilities of a software system. Software features are often not explicitly managed. On the contrary, feature-relevant information is often spread across several software engineering artifacts (e.g., user manuals, issue tracking systems). Identifying and extracting feature-relevant information from these artifacts in order to make feature knowledge explicit requires huge manual effort. In this paper, we present a two-step approach to extract feature-relevant information from a user manual: First, we semi-automatically extract a domain terminology from a natural language user manual based on linguistic patterns. Then, we apply natural language processing techniques based on the extracted domain terminology and structural sentence information. Our approach is able to extract atomic feature-relevant information with an F1-score of at least 92.00%. We describe the implementation of the approach as well as evaluations based on example sections of a user manual taken from industry.

Notes

  1. https://poi.apache.org/

  2. http://nlp.stanford.edu/software/lex-parser.shtml

  3. The empirical estimation of β follows Berry (2017): on average, there are 200 terms per domain terminology. In total, the two exemplary sections contain 1160 sentences. Hence, on average, every 6th sentence contains a domain term, and we estimate ∼3 seconds to read a sentence and determine a domain term. So, it takes 18 seconds to manually find a true positive. On the other hand, it takes ∼2 seconds to manually reject a false positive by means of our provided tool, which results in a β of 9 (18/2). In the paper we therefore rounded β to 10.
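
The arithmetic in note 3 can be retraced with a short sketch. It assumes that β is used as the weight of the usual F-measure (F_β), which the note itself does not spell out; the function and parameter names are illustrative and not taken from the paper.

```python
def estimate_beta(num_terms, num_sentences, secs_per_sentence, secs_reject_fp):
    """Time to find a true positive manually, divided by the time to reject a false positive."""
    # 1160 sentences / 200 terms = 5.8, i.e. roughly every 6th sentence holds a domain term.
    sentences_per_term = round(num_sentences / num_terms)
    secs_find_tp = sentences_per_term * secs_per_sentence
    return secs_find_tp / secs_reject_fp


def f_beta(precision, recall, beta):
    """Standard weighted F-measure (assumed here); beta > 1 favours recall over precision."""
    denom = beta ** 2 * precision + recall
    if denom == 0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / denom


# Values from note 3: 200 terms, 1160 sentences, ~3 s per sentence, ~2 s per rejection.
beta = estimate_beta(200, 1160, 3, 2)
print(beta)  # 9.0, which the note rounds up to 10
```

With a weight this large, a missed domain term lowers the score far more than a false positive that takes about two seconds to reject.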

References

  • Abney SP (2012) Parsing by chunks. In: Principle-based parsing, pp 257–278

  • Acher M, Cleve A, Perrouin G, Heymans P, Vanbeneden C, Collet P, Lahire P (2012) On extracting feature models from product descriptions. In: Proceedings of 6th International Workshop on Variability Modeling of Software-Intensive Systems (VaMoS’12), pp 45–54

  • Aggarwal C, Zhai C (2012) Mining Text Data

  • Alves V, Schwanninger C, Barbosa L, Rashid A, Sawyer P, Rayson P, Pohl C, Rummler A (2008) An exploratory study of information retrieval techniques in domain analysis. In: Proceedings of 12th International Software Product Line Conference (SPLC’08), pp 67–76

  • Apel S, Kästner C (2009) An Overview of Feature-Oriented Software Development. Obj Tec 8(5):49–84

  • Bakar NH, Kasirun ZM, Salleh N (2015a) Feature extraction approaches from natural language requirements for reuse in software product lines. Syst Softw 106 (C):132–149

  • Bakar NH, Kasirun ZM, Salleh N (2015b) Terms extractions: An approach for requirements reuse. In: 2nd Int. Conf. on Information Science and Security (ICISS), pp 1–4

  • Balachandran K, Ranathunga S (2016) Domain-specific term extraction for concept identification in ontology construction. In: International Conference on Web Intelligence (WI), pp 34–41

  • Beliga S, Meštrović A, Martinčić-Ipšić S (2015) An overview of graph-based keyword extraction methods and approaches. J Inf Organ Sci 39(1):1–20

  • Berry DM (2017) Evaluation of Tools for Hairy Requirements and Software Engineering Tasks. In: Proceedings of the 25th Int. Requirements Engineering Conference Workshops (REW), pp 284–291

  • Berry DM, Gacitua R, Sawyer P, Tjong SF (2012) The case for dumb requirements engineering tools. In: Proceedings of the 18th International Conference on Requirements Engineering (REFSQ’12), pp 211–217

  • Bishop CM (2006) Pattern recognition and machine learning

  • Blanco R, Lioma C (2012) Graph-based term weighting for information retrieval. Inf Retr 15(1):54–92

  • Bosch J (2000) Design and use of software architectures: adopting and evolving a product-line approach

  • Boutkova E, Houdek F (2011) Semi-automatic identification of features in requirement specifications. In: Proceedings of 19th International Requirements Engineering Conference (RE’11), pp 313–318

  • Brinton LJ, Brinton D (2010) The linguistic structure of modern English

  • Chandrasekar R, Doran C, Srinivas B (1996) Motivations and methods for text simplification. In: Proceedings of the 16th Conference on Computational Linguistics (COLING), pp 1041–1044

  • Charniak E (1997) Statistical parsing with a context-free grammar and word statistics. AAAI/IAAI

  • Chen J, Chau R, Yeh CH (2004) Discovering parallel text from the world wide web. In: Proceedings of the 2nd Workshop on Australasian Information Security, Data Mining and Web Intelligence, and Software Internationalisation, pp 157–161

  • Chen K, Zhang W, Zhao H, Mei H (2005) An approach to constructing feature models based on requirements clustering. In: Proceedings of 13th International Requirements Engineering Conference (RE’05), pp 31–40

  • Chen P I, Lin S J (2010) Automatic keyword prediction using google similarity distance. Expert Syst Appl 37(3):1928–1938

  • Classen A, Heymans P, Schobbens PY (2008) What’s in a feature: A requirements engineering perspective. In: Proceedings of 11th International Conference on Fundamental Approaches to Software Engineering (FASE’08), pp 16–30

  • Conrado MS, Pardo TAS, Rezende SO (2014) The main challenge of semi-automatic term extraction methods. In: Proceedings of the 11th International Workshop on Natural Language Processing and Cognitive Science (NLPCS’14), pp 49–59

  • Corbett G (2006) Linguistic features. Encyclopedia of language and linguistics 2(7):193–194

  • Drymonas E, Zervanou K, Petrakis EGM (2010) Unsupervised ontology acquisition from plain texts: the OntoGain system. In: International Conference on Application of Natural Language to Information Systems, pp 277–287

  • Earls A, Embury S, Turner N (2002) A method for the manual extraction of business rules from legacy source code. BT Technology 20(4):127–145

  • Eisenbarth T, Koschke R, Simon D (2003) Locating features in source code. IEEE Trans Softw Eng 29(3):210–224

  • Ercan G, Cicekli I (2007) Using lexical chains for keyword extraction. Inf Process Manag 43(6):1705–1714

  • Gao X, Murugesan S, Lo B (2005) Extraction of keyterms by simple text mining for business information retrieval. In: Proceedings of the International Conference on e-Business Engineering (ICEBE’15), pp 332–339

  • Ghosh S, Elenius D, Li W, Lincoln P, Shankar N, Steiner W (2016) Arsenal: Automatic requirements specification extracting from natural language. In: Proceedings of 8th Int. Symp. of NASA Formal Methods (NFM’16), pp 41–46

  • Guzman E, Maalej W (2014) How do users like this feature? a fine grained sentiment analysis of app reviews. In: Proceedings of 22nd International Requirements Engineering Conference (RE’14), IEEE, pp 153–162

  • IEEE (1990) IEEE standard glossary of software engineering terminology. IEEE Std

  • Indurkhya N, Damerau FJ (2010) Handbook of natural language processing

  • John I, Dörr J (2003) Elicitation of requirements from user documentation. In: 9th International Workshop on Requirements Engineering (REFSQ’03), pp 17–26

  • Jonnalagadda S, Tari L, Hakenberg J, Baral C, Gonzalez G (2009) Towards effective sentence simplification for automatic processing of biomedical text. In: Proceedings of Human Language Technologies (HLT’09), pp 177–180

  • Kim S N, Baldwin T, Kan M Y (2009) An unsupervised approach to domain-specific term extraction. In: Australasian Language Technology Association Workshop, vol 2009, pp 94–98

  • Klein D, Manning C D (2003) Fast exact inference with a factored model for natural language parsing. In: Becker S, Thrun S, Obermayer K (eds) Advances in Neural Information Processing Systems, vol 15, pp 3–10

  • Kleinberg J M (1999) Authoritative sources in a hyperlinked environment. J ACM 46(5):604–632

  • Kof L (2009) Requirements analysis: concept extraction and translation of textual specifications to executable models, pp 79–90

  • Levy R, Andrew G (2006) Tregex and Tsurgeon: tools for querying and manipulating tree data structures. In: Proceedings of 5th International Conference on Language Resources and Evaluation (LREC’06), pp 2231–2234

  • Li Y, Guzman E, Tsiamoura K, Schneider F, Bruegge B (2015) Automated requirements extraction for scientific software. Procedia Comput Sci 51:582–591

  • Liu F, Pennell D, Liu F, Liu Y (2009) Unsupervised approaches for automatic keyword extraction using meeting transcripts. In: Proceedings of human language technologies: The 2009 annual Conf. of the North American chapter of the association for computational linguistics, pp 620–628

  • Lossio-Ventura JA, Jonquet C, Roche M, Teisseire M (2014a) Biomedical terminology extraction: A new combination of statistical and web mining approaches. In: JADT: Journées d’Analyse statistique des Données Textuelles, pp 421–432

  • Lossio-Ventura JA, Jonquet C, Roche M, Teisseire M (2014b) Yet another ranking function for automatic multiword term extraction. In: International Conference on Natural Language Processing (NLP’14), pp 52–64

  • Loughran N, Sampaio A, Rashid A (2006) From Requirements Documents to Feature Models for Aspect Oriented Product Line Implementation. In: Int. Conf. on Model Driven Engineering Languages and Systems, pp 262–271

  • Manning CD, Surdeanu M, Bauer J, Finkel JR, Bethard S, McClosky D (2014) The stanford corenlp natural language processing toolkit. In: Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations

  • Marciuska S, Gencel C, Abrahamsson P (2014) Automated feature identification in web applications. In: Proceedings of 14th International Conference on Software Quality (QSIC’14), pp 100–114

  • Meijer K, Frasincar F, Hogenboom F (2014) A semantic approach for extracting domain taxonomies from text. Decis Support Syst 62:78–93

  • Melville P, Gryc W, Lawrence RD (2009) Sentiment analysis of blogs by combining lexical knowledge with text classification. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge discovery and data mining, pp 1275–1284

  • Merten T, Falis M, Hübner P, Quirchmayr T, Bürsner S, Paech B (2016) Software feature request detection in issue tracking systems. In: Proceedings of 24th Int. Requirements Engineering Conference (RE’16), pp 166–175

  • Mu Y, Wang Y, Guo J (2009) Extracting software functional requirements from free text documents. In: Proceedings of 1st International Conference on Information and Multimedia Technology (ICIMT’09), pp 194–198

  • Nixon M (2008) Feature extraction & image processing

  • Paech B, Hübner P, Merten T (2014) What Are the Features of This Software? An Exploratory Study. In: Proceedings of 9th International Conference on Software Engineering Advances (ICSEA’14), pp 114–125

  • Pikkarainen M, Haikara J, Salo O, Abrahamsson P, Still J (2008) The impact of agile practices on communication in software development. J Empir Softw Eng 13(3):303–337

  • Quirchmayr T, Paech B, Kohl R, Karey H (2017) Semi-automatic software feature-relevant information extraction from natural language user manuals. In: Proceedings of the 23rd International Conference on Requirements Engineering (REFSQ’17), Springer, pp 255–272

  • Rose S, Engel D, Cramer N, Cowley W (2010) Automatic keyword extraction from individual documents. In: Berry MW, Kogan J (eds) Text Mining: Applications and Theory, pp 1–20

  • Sebastiani F (2002) Machine learning in automated text categorization. ACM Comput Surv 34(1):1–47

  • Shaker P, Atlee JM, Wang S (2012) A feature-oriented requirements modelling language. In: Proceedings of 20th International Requirements Engineering Conference (RE’12), pp 151–160

  • da Silva Conrado M, Pardo TAS, Rezende SO (2013) A machine learning approach to automatic term extraction using a rich feature set. In: HLT-NAACL, pp 16–23

  • Venu SH, Mohan V, Urkalan K, Geetha TV (2016) Unsupervised domain ontology learning from text. In: International Conference on Mining Intelligence and Knowledge Exploration, pp 132–143

  • Ward LJ, Woods G (2013) English grammar for dummies

  • Weston N, Chitchyan R, Rashid A (2009) A framework for constructing semantically composable feature models from natural language requirements. In: Proceedings of the 13th International Software Product Line Conf. (SPLC’09), pp 211–220

  • Wimalasuriya DC, Dou D (2010) Ontology-based information extraction: An introduction and a survey of current approaches. Inf Sci 36(3):306–323

  • Wong W, Liu W, Bennamoun M (2012) Ontology learning from text: A look back and into the future. ACM Comput Surv 44(4):20

  • Zapata JCM, Losada BM, Gonzalez-Calderon G (2012) An approach for using procedure manuals as a source for requirements elicitation. In: Proceedings of 38th Conf. Latinoamericana En Informatica (CLEI’12), pp 1–8

  • Zhang K, Xu H, Tang J, Li J (2006) Keyword extraction using support vector machine. In: International Conference on Web-Age Information Management, pp 85–96

  • Zorn-Pauli G, Paech B, Wittkopf J (2012) Strategic release planning challenges for global information systems - a position paper. In: Proceedings of 6th International Workshop on Software Product Management (IWSPM’12), pp 186–191

Acknowledgements

We would like to thank Roche Diagnostics GmbH for the financial support of this research project. Many thanks also to the GDC experts for their participation in the case study and valuable discussions of the results.

Author information

Corresponding author

Correspondence to Thomas Quirchmayr.

Additional information

Communicated by: Anna Perini and Paul Grünbacher

Appendices

The appendices presented in the following sections summarize the patterns required in our approach to (1) correct parse trees and (2) adapt parse trees so that potentially feature-relevant information can be extracted smoothly. The patterns are defined by means of Tregex (Levy and Andrew 2006) and indicate the parts of a parse tree to be modified. Tsurgeon (Levy and Andrew 2006), a tree-transformation utility built on top of Tregex, allows the identified parse trees to be manipulated as desired. In the following sections, we provide Tregex patterns with corresponding Tsurgeon operations and examples. Each example shows the parse tree before modification on the left-hand side, with the part of the tree that matches the defined pattern colored red. The right-hand side shows the parse tree after modification, with the modified parts colored green.
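
The match-then-transform idea behind these patterns can be illustrated with a minimal pure-Python sketch. It does not use Tregex or Tsurgeon and does not reproduce any pattern from the appendices: the tree encoding, the helper names, and the rule itself (a JJ in final position of an NP is relabelled NN) are assumptions made for illustration only. The transformation is loosely analogous to a Tsurgeon relabel operation.

```python
# A parse tree as nested lists: [label, child, child, ...]; leaves are plain strings.

def relabel(tree, matches, new_label):
    """Return a copy of `tree` in which every node satisfying `matches(node, parent)` gets `new_label`."""
    def walk(node, parent):
        if isinstance(node, str):
            return node
        label = new_label if matches(node, parent) else node[0]
        return [label] + [walk(child, node) for child in node[1:]]
    return walk(tree, None)


def to_brackets(node):
    """Render a tree in the usual bracketed notation."""
    if isinstance(node, str):
        return node
    return "(" + node[0] + " " + " ".join(to_brackets(child) for child in node[1:]) + ")"


# Hypothetical matching rule (not taken from the paper):
# a JJ that is the last child of an NP is treated as a mis-tagged noun.
def jj_last_in_np(node, parent):
    return (parent is not None and parent[0] == "NP"
            and node[0] == "JJ" and parent[-1] is node)


tree = ["NP", ["DT", "the"], ["NN", "sample"], ["JJ", "rack"]]
fixed = relabel(tree, jj_last_in_np, "NN")
print(to_brackets(tree))   # (NP (DT the) (NN sample) (JJ rack))
print(to_brackets(fixed))  # (NP (DT the) (NN sample) (NN rack))
```

Keeping the matching predicate separate from the transformation mirrors the split between a Tregex pattern, which selects nodes, and a Tsurgeon operation, which modifies them.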

Appendix A Parse Tree Correction Patterns

A.1 JJ to NN

figure y

A.2 ADVP to NP

figure z

A.3 Cleanse PP

figure aa

A.4 VP to JJ

figure ab

A.5 ADJP to PP

figure ac

A.6 Complex NP#1

figure ad

A.7 Complex NP#2

figure ae

A.8 Complex NP#3

figure af

A.9 Complex NP#4

figure ag

A.10 Complex NP#5

figure ah

A.11 Cleanse PP

figure ai

A.12 Cleanse NP lists#1

figure aj

A.13 Cleanse NP lists#2

figure ak

A.14 Cleanse S#1

figure al

A.15 Cleanse S#2

figure am

A.16 Cleanse “between” #1

figure an

A.17 Cleanse “between” #2

figure ao

Appendix B Parse Tree Adaption Patterns

B.1 Remove SINV

figure ap

B.2 Remove Brackets

figure aq

B.3 Cleanse FRAG

figure ar

B.4 Complex VP#1

figure as

B.5 Complex VP#2

figure at

B.6 Complex VP#3

figure au

B.7 Complex VP#4

figure av

B.8 ADVP in VP#1

figure aw

B.9 ADVP in VP#2

figure ax

B.10 ADJP in VP

figure ay

B.11 PRT in VP

figure az

B.12 Complex PP

figure ba

B.13 Complex NP#6

figure bb

B.14 Multiple PP#1

figure bc

B.15 Multiple PP#2

figure bd

B.16 Remove S#1

figure be

B.17 Remove S#2

figure bf

B.18 SBAR to VPH

figure bg

B.19 SBAR to VPC#1

figure bh

B.20 SBAR to VPC#2

figure bi

B.21 SBAR to VPP

figure bj

B.22 VP to VPV

figure bk

B.23 PP to VPP#1

figure bl

B.24 PP to VPP#2

figure bm

B.25 VP to VPT#1

figure bn

B.26 VP to VPT#2

figure bo

B.27 VP to VPT#3

figure bp

B.28 VP to VPW

figure bq

B.29 PP to NPP

figure br

B.30 SBAR to NPW

figure bs

B.31 VP to NPV

figure bt

B.32 NP to PPN

figure bu

B.33 PP to PPV

figure bv

B.34 PP to PPW#1

figure bw

B.35 PP to PPW#2

figure bx

B.36 Surround NP

figure by

About this article

Cite this article

Quirchmayr, T., Paech, B., Kohl, R. et al. Semi-automatic rule-based domain terminology and software feature-relevant information extraction from natural language user manuals. Empir Software Eng 23, 3630–3683 (2018). https://doi.org/10.1007/s10664-018-9597-6

  • DOI: https://doi.org/10.1007/s10664-018-9597-6
