Concept extraction from business documents for software engineering projects

Automated Software Engineering

Abstract

Acquiring relevant business concepts is a crucial first step for any software project in which the software experts are not domain experts. The wealth of information buried within an organization’s written documentation is a precious source of concepts, relationships and attributes that can be used to model the enterprise’s domain. Without targeted extraction tools, perusing this type of resource can be a lengthy and costly process. We propose a domain-model-focused extraction process aimed at the rapid discovery of knowledge relevant to the software expert. To avoid undesirable noise from high-level linguistic tools, the process is mainly composed of positive and negative base filters that are less error-prone and more robust. The extracted candidates are then reordered using a weight propagation algorithm based on structural hints from the source documents. When tested on French text corpora from public organizations, our process performs 2.7 times better than a statistical baseline for relevant concept discovery. A new metric to assess the discovery speed of relevant concepts is introduced. The annotation of a gold standard defining software-engineering-oriented concepts for knowledge extraction tasks is also presented.

Notes

  1. Attributed to Timothy Lethbridge in personal communication on domain analysis, May 2003 from Pressman (2001).

  2. Can be found at http://opencyc.org.

  3. http://www-nlpir.nist.gov/projects/duc/data/2005_data.html.

  4. The complete list can be found at the following address: https://sites.google.com/a/etsmtl.net/lincs-pa/ressources/stoplist-locution.

  5. Dictionnaire électronique du LADL: http://infolingu.univ-mlv.fr/.

  6. The detailed selection for each annotator can be consulted at https://sites.google.com/a/etsmtl.net/lincs-pa/ressources/gold-corpus.

References

  • Abeillé, A., Clément, L., Toussenel, F.: Building a treebank for French. In: Treebanks. Kluwer, Dordrecht (2003)

  • Abran, A., Moore, J., Bourque, P., Dupuis, R., Tripp, L.: Guide to the Software Engineering Body of Knowledge. IEEE Computer Society Press, Los Alamitos (2004)

  • Anderson, T.D.: Studying human judgments of relevance: interactions in context. In: Proceedings of the 1st International Conference on Information Interaction in Context, ACM, pp. 6–14 (2006)

  • Batini, C., Ceri, S., Navathe, S.: Conceptual Database Design: An Entity-Relationship Approach. Benjamin/Cummings Pub. Co., San Francisco (1992)

  • Borgida, A.: How knowledge representation meets software engineering (and often databases). Autom. Softw. Eng. 14(4), 443–464 (2007). doi:10.1007/s10515-007-0018-0

  • Chen, P.: English sentence structure and entity-relationship diagrams. Inf. Sci. 29(2–3), 127–149 (1983). doi:10.1016/0020-0255(83)90014-2

  • Cunningham, H., Maynard, D., Bontcheva, K., Tablan, V., Aswani, N., Roberts, I., Gorrell, G., et al.: Text Processing with GATE (Version 6) (2011)

  • Deeptimahanti, D., Sanyal, R.: Semi-automatic generation of UML models from natural language requirements. In: Proceedings of the 4th India Software Engineering Conference, ACM, pp. 165–174 (2011)

  • Farrell, J.: IBM Watson: a brief overview and thoughts for healthcare education and performance improvement. http://www.medbiq.org/sites/default/files/presentations/2011/Farrell.ppt. Accessed 14 June 2015 (2009)

  • Fellbaum, C.: WordNet: An Electronic Lexical Database (Language, Speech, and Communication). The MIT Press, Cambridge, MA (1998)

  • Green, S., de Marneffe, M., Bauer, J., Manning, C.: Multiword expression identification with tree substitution grammars: a parsing tour de force with French. In: Conference on Empirical Methods in Natural Language Processing (2011)

  • Grenon, P., Smith, B.: SNAP and SPAN: towards dynamic spatial ontology. Spat. Cognit. Comput. 4(1), 69–103 (2004)

  • Ittoo, A., Maruster, L., Wortmann, H., Bouma, G.: Textractor: A Framework for Extracting Relevant Domain Concepts from Irregular Corporate Textual Datasets, pp. 71–82. Springer, Heidelberg (2010). doi:10.1007/978-3-642-12814-1_7

  • Kof, L.: Requirements analysis: concept extraction and translation of textual specifications to executable models. In: Natural Language Processing and Information Systems, Springer, Berlin, Heidelberg (2010). doi:10.1007/978-3-642-12550-8_7

  • Kotonya, G., Sommerville, I.: Requirements Engineering: Processes and Techniques. Wiley, New York (1998)

  • Leroy, G., Chen, H., Martinez, J.: A shallow parser based on closed-class words to capture relations in biomedical text. J. Biomed. Inf. 36(3), 145–158 (2003)

  • Manning, C.D., Schütze, H.: Foundations of Statistical Natural Language Processing. The MIT Press, Cambridge, MA (1999)

  • Maynard, D., Saggion, H., Yankova, M.: Natural language technology for information integration in business intelligence. In: Business Information Systems. Springer, Berlin, Heidelberg (2007). doi:10.1007/978-3-540-72035-5_28

  • Ménard, P.A., Ratté, S.: Classifier-based acronym extraction for business documents. Knowl. Inf. Syst. (2010). doi:10.1007/s10115-010-0341-9

  • Nenkova, A., Passonneau, R.: Evaluating content selection in summarization: the pyramid method. In: Proceedings of HLT-NAACL (2004)

  • Niles, I., Pease, A.: Towards a standard upper ontology. In: Proceedings of the International Conference on Formal Ontology in Information Systems, pp. 2–9 (2001)

  • Nivre, J., Hall, J.: MaltParser: a language-independent system for data-driven dependency parsing. In: Proceedings of the 4th Workshop on Treebanks and Linguistic Theories (2005)

  • Pfleeger, S.L., Atlee, J.M.: Software Engineering: Theory and Practice, 4th edn. Prentice Hall, Upper Saddle River (2009)

  • Popescu, D., Rugaber, S., Medvidovic, N., Berry, D.: Reducing ambiguities in requirements specifications via automatically created object-oriented models. Lecture Notes in Computer Science, vol. 1, pp. 103–124, Springer, Heidelberg (2008)

  • Pressman, R.: Software Engineering: A Practitioner’s Approach. Palgrave Macmillan, London (2001)

  • Prieto-Diaz, R.: Domain analysis concepts and research directions. In: Prieto-Diaz, R., Arango, G. (eds.) Domain Analysis and Software Systems Modeling, pp. 9–32. IEEE Computer Society Press (1991)

  • Rose, S., Engel, D., Cramer, N.: Automatic keyword extraction from individual documents. Text Min. pp. 1–20 (2010)

  • Schamber, L.: Relevance and information behavior. Annu. Rev. Inf. Sci. Technol. 29, 3–48 (1994)

  • Schmid, H.: Probabilistic part-of-speech tagging using decision trees. In: International Conference on New Methods in Language Processing, pp. 44–49 (1994)

  • Shilakes, C., Tylman, J.: Enterprise Information Portals. Merrill Lynch, New York (1998)

  • Vossen, P.: EuroWordNet: A Multilingual Database with Lexical Semantic Networks. Kluwer Academic Publishers, Dordrecht (1998)

Author information

Corresponding author

Correspondence to Pierre André Ménard.

About this article

Cite this article

Ménard, P.A., Ratté, S. Concept extraction from business documents for software engineering projects. Autom Softw Eng 23, 649–686 (2016). https://doi.org/10.1007/s10515-015-0184-4

  • DOI: https://doi.org/10.1007/s10515-015-0184-4
