Concept extraction from business documents for software engineering projects

Ménard, Pierre André; Ratté, Sylvie

doi:10.1007/s10515-015-0184-4

Published: 21 August 2015

Volume 23, pages 649–686, (2016)

Automated Software Engineering

Abstract

Acquiring relevant business concepts is a crucial first step for any software project whose software experts are not domain experts. The wealth of information buried within an organization’s written documentation is a precious source of concepts, relationships and attributes that can be used to model the enterprise’s domain. The lack of targeted extraction tools can make perusing this type of resource a lengthy and costly process. We propose a domain-model-focused extraction process aimed at the rapid discovery of knowledge relevant to the software expert. To avoid undesirable noise from high-level linguistic tools, the process is mainly composed of positive and negative base filters that are less error-prone and more robust. The extracted candidates are then reordered using a weight propagation algorithm based on structural hints from the source documents. When tested on French text corpora from public organizations, our process performs 2.7 times better than a statistical baseline for relevant concept discovery. We introduce a new metric to assess the discovery speed of relevant concepts. We also present the annotation of a gold standard defining software-engineering-oriented concepts for knowledge extraction tasks.

Notes

Attributed to Timothy Lethbridge in personal communication on domain analysis, May 2003 from Pressman (2001).
Can be found at http://opencyc.org.
http://www-nlpir.nist.gov/projects/duc/data/2005_data.html.
The complete list can be found at the following address: https://sites.google.com/a/etsmtl.net/lincs-pa/ressources/stoplist-locution.
Dictionnaire électronique du LADL: http://infolingu.univ-mlv.fr/.
The detailed selection for each annotator can be consulted at https://sites.google.com/a/etsmtl.net/lincs-pa/ressources/gold-corpus.

References

Abeillé, A., Clément, L., Toussenel, F.: Building a Treebank for French. Treebanks. Kluwer, Dordrecht (2003)
Abran, A., Moore, J., Bourque, P., Dupuis, R., Tripp, L.: Guide to the Software Engineering Body of Knowledge. IEEE Computer Society Press, Los Alamitos (2004)
Anderson, T.D.: Studying human judgments of relevance: interactions in context. In: Proceedings of the 1st International Conference on Information Interaction in Context, ACM, pp. 6–14 (2006)
Batini, C., Ceri, S., Navathe, S.: Conceptual Database Design: An Entity-Relationship Approach. Benjamin/Cummings Pub. Co., San Francisco (1992)
Borgida, A.: How knowledge representation meets software engineering (and often databases). Autom. Softw. Eng. 14(4), 443–464 (2007). doi:10.1007/s10515-007-0018-0
Chen, P.: English sentence structure and entity-relationship diagrams. Inf. Sci. 29(2–3), 127–149 (1983). doi:10.1016/0020-0255(83)90014-2
Cunningham, H., Maynard, D., Bontcheva, K., Tablan, V., Aswani, N., Roberts, I., Gorrell, G., et al.: Text Processing with GATE (Version 6) (2011)
Deeptimahanti, D., Sanyal, R.: Semi-automatic generation of UML models from natural language requirements. In: Proceedings of the 4th India Software Engineering Conference, ACM, pp. 165–174 (2011)
Farrell, J.: IBM Watson: a brief overview and thoughts for healthcare education and performance improvement. http://www.medbiq.org/sites/default/files/presentations/2011/Farrell.ppt. Accessed 14 June 2015 (2009)
Fellbaum, C.: WordNet: An Electronic Lexical Database (Language, Speech, and Communication). The MIT Press, Cambridge, MA (1998)
Green, S., de Marneffe, M., Bauer, J., Manning, C.: Multiword expression identification with tree substitution grammars: a parsing tour de force with French. In: Conference on Empirical Methods in Natural Language Processing (2011)
Grenon, P., Smith, B.: SNAP and SPAN: towards dynamic spatial ontology. Spat. Cognit. Comput. 4(1), 69–103 (2004)
Ittoo, A., Maruster, L., Wortmann, H., Bouma, G.: Textractor: A Framework for Extracting Relevant Domain Concepts from Irregular Corporate Textual Datasets, pp. 71–82. Springer, Heidelberg (2010). doi:10.1007/978-3-642-12814-1_7
Kof, L.: Requirements analysis: concept extraction and translation of textual specifications to executable models. In: Natural Language Processing and Information Systems, Springer, Berlin, Heidelberg (2010). doi:10.1007/978-3-642-12550-8_7
Kotonya, G., Sommerville, I.: Requirements Engineering: Processes and Techniques. Wiley, New York (1998)
Leroy, G., Chen, H., Martinez, J.: A shallow parser based on closed-class words to capture relations in biomedical text. J. Biomed. Inf. 36(3), 145–158 (2003)
Manning, C.D., Schütze, H.: Foundations of Statistical Natural Language Processing. The MIT Press, Cambridge, MA (1999)
Maynard, D., Saggion, H., Yankova, M.: Natural language technology for information integration in business intelligence. In: Business Information Systems. Springer, Berlin, Heidelberg (2007). doi:10.1007/978-3-540-72035-5_28
Ménard, P.A., Ratté, S.: Classifier-based acronym extraction for business documents. Knowl. Inf. Syst. (2010). doi:10.1007/s10115-010-0341-9
Nenkova, A., Passonneau, R.: Evaluating content selection in summarization: the pyramid method. In: Proceedings of HLT-NAACL (2004)
Niles, I., Pease, A.: Towards a standard upper ontology. In: Proceedings of the International Conference on Formal Ontology in Information Systems, pp. 2–9 (2001)
Nivre, J., Hall, J.: MaltParser: a language-independent system for data-driven dependency parsing. In: Proceedings of the 4th Workshop on Treebanks and Linguistic Theories (2005)
Pfleeger, S.L., Atlee, J.M.: Software Engineering: Theory and Practice, 4th edn. Prentice Hall, Upper Saddle River (2009)
Popescu, D., Rugaber, S., Medvidovic, N., Berry, D.: Reducing ambiguities in requirements specifications via automatically created object-oriented models. Lecture Notes in Computer Science, vol. 1, pp. 103–124, Springer, Heidelberg (2008)
Pressman, R.: Software Engineering: A Practitioner’s Approach. Palgrave Macmillan, London (2001)
Prieto-Diaz, R.: Domain analysis concepts and research directions. In: Prieto-Diaz, R., Arango, G. (eds.) Domain Analysis and Software Systems Modeling, pp. 9–32. IEEE Computer Society Press (1991)
Rose, S., Engel, D., Cramer, N.: Automatic keyword extraction from individual documents. Text Min. pp. 1–20 (2010)
Schamber, L.: Relevance and information behavior. Annu. Rev. Inf. Sci. Technol. 29, 3–48 (1994)
Schmid, H.: Probabilistic part-of-speech tagging using decision trees. In: International Conference on New Methods in Language Processing, pp. 44–49 (1994)
Shilakes, C., Tylman, J.: Enterprise Information Portals. Merrill Lynch, New York (1998)
Vossen, P.: EuroWordNet: A Multilingual Database with Lexical Semantic Networks. Kluwer Academic Publishers, Dordrecht (1998)

Author information

Authors and Affiliations

École de technologie supérieure, Montreal, Canada
Pierre André Ménard & Sylvie Ratté

Corresponding author

Correspondence to Pierre André Ménard.

About this article

Cite this article

Ménard, P.A., Ratté, S. Concept extraction from business documents for software engineering projects. Autom Softw Eng 23, 649–686 (2016). https://doi.org/10.1007/s10515-015-0184-4

Received: 01 April 2014
Accepted: 03 August 2015
Published: 21 August 2015
Issue Date: December 2016
DOI: https://doi.org/10.1007/s10515-015-0184-4
