Abstract
Acquiring relevant business concepts is a crucial first step for any software project in which the software experts are not domain experts. The wealth of information buried within an organization’s written documentation is a precious source of concepts, relationships and attributes that can be used to model the enterprise’s domain. The lack of targeted extraction tools can make perusing this type of resource a lengthy and costly process. We propose a domain-model-focused extraction process aimed at the rapid discovery of knowledge relevant to the software expert. To avoid undesirable noise from high-level linguistic tools, the process is mainly composed of positive and negative base filters that are less error-prone and more robust. The extracted candidates are then reordered using a weight propagation algorithm based on structural hints from the source documents. When tested on French text corpora from public organizations, our process performs 2.7 times better than a statistical baseline for relevant concept discovery. A new metric to assess the discovery speed of relevant concepts is introduced. The annotation of a gold standard definition of software-engineering-oriented concepts for knowledge extraction tasks is also presented.
Notes
Attributed to Timothy Lethbridge in personal communication on domain analysis, May 2003 from Pressman (2001).
Can be found at http://opencyc.org.
The complete list can be found at the following address: https://sites.google.com/a/etsmtl.net/lincs-pa/ressources/stoplist-locution.
Dictionnaire électronique du LADL: http://infolingu.univ-mlv.fr/.
The detailed selection for each annotator can be consulted at https://sites.google.com/a/etsmtl.net/lincs-pa/ressources/gold-corpus.
References
Abeillé, A., Clément, L., Toussenel, F.: Building a Treebank for French. In: Treebanks. Kluwer, Dordrecht (2003)
Abran, A., Moore, J., Bourque, P., Dupuis, R., Tripp, L.: Guide to the Software Engineering Body of Knowledge. IEEE Computer Society Press, Los Alamitos (2004)
Anderson, T.D.: Studying human judgments of relevance: interactions in context. In: Proceedings of the 1st International Conference on Information Interaction in Context, ACM, pp. 6–14 (2006)
Batini, C., Ceri, S., Navathe, S.: Conceptual Database Design: An Entity-Relationship Approach. Benjamin/Cummings Pub. Co., San Francisco (1992)
Borgida, A.: How knowledge representation meets software engineering (and often databases). Autom. Softw. Eng. 14(4), 443–464 (2007). doi:10.1007/s10515-007-0018-0
Chen, P.: English sentence structure and entity-relationship diagrams. Inf. Sci. 29(2–3), 127–149 (1983). doi:10.1016/0020-0255(83)90014-2
Cunningham, H., Maynard, D., Bontcheva, K., Tablan, V., Aswani, N., Roberts, I., Gorrell, G., et al.: Text Processing with GATE (Version 6) (2011)
Deeptimahanti, D., Sanyal, R.: Semi-automatic generation of UML models from natural language requirements. In: Proceedings of the 4th India Software Engineering Conference, ACM, pp. 165–174 (2011)
Farrell, J.: IBM Watson: a brief overview and thoughts for healthcare education and performance improvement. http://www.medbiq.org/sites/default/files/presentations/2011/Farrell.ppt. Accessed 14 June 2015 (2009)
Fellbaum, C.: WordNet: An Electronic Lexical Database (Language, Speech, and Communication). The MIT Press, Cambridge, MA (1998)
Green, S., de Marneffe, M., Bauer, J., Manning, C.: Multiword expression identification with tree substitution grammars: a parsing tour de force with French. In: Conference on Empirical Methods in Natural Language Processing (2011)
Grenon, P., Smith, B.: SNAP and SPAN: towards dynamic spatial ontology. Spat. Cognit. Comput. 4(1), 69–103 (2004)
Ittoo, A., Maruster, L., Wortmann, H., Bouma, G.: Textractor: A Framework for Extracting Relevant Domain Concepts from Irregular Corporate Textual Datasets, pp. 71–82. Springer, Heidelberg (2010). doi:10.1007/978-3-642-12814-1_7
Kof, L.: Requirements analysis: concept extraction and translation of textual specifications to executable models. In: Natural Language Processing and Information Systems, Springer, Berlin, Heidelberg (2010). doi:10.1007/978-3-642-12550-8_7
Kotonya, G., Sommerville, I.: Requirements Engineering: Processes and Techniques. Wiley, New York (1998)
Leroy, G., Chen, H., Martinez, J.: A shallow parser based on closed-class words to capture relations in biomedical text. J. Biomed. Inf. 36(3), 145–158 (2003)
Manning, C.D., Schütze, H.: Foundations of Statistical Natural Language Processing. The MIT Press, Cambridge, MA (1999)
Maynard, D., Saggion, H., Yankova, M.: Natural language technology for information integration in business intelligence. In: Business Information Systems. Springer, Berlin, Heidelberg (2007). doi:10.1007/978-3-540-72035-5_28
Ménard, P.A., Ratté, S.: Classifier-based acronym extraction for business documents. Knowl. Inf. Syst. (2010). doi:10.1007/s10115-010-0341-9
Nenkova, A., Passonneau, R.: Evaluating content selection in summarization: the pyramid method. In: Proceedings of HLT-NAACL (2004)
Niles, I., Pease, A.: Towards a standard upper ontology. In: Proceedings of the International Conference on Formal Ontology in Information Systems, pp. 2–9 (2001)
Nivre, J., Hall, J.: MaltParser: a language-independent system for data-driven dependency parsing. In: Proceedings of the 4th Workshop on Treebanks and Linguistic Theories (2005)
Pfleeger, S.L., Atlee, J.M.: Software Engineering: Theory and Practice, 4th edn. Prentice Hall, Upper Saddle River (2009)
Popescu, D., Rugaber, S., Medvidovic, N., Berry, D.: Reducing ambiguities in requirements specifications via automatically created object-oriented models. Lecture Notes in Computer Science, vol. 1, pp. 103–124, Springer, Heidelberg (2008)
Pressman, R.: Software Engineering: A Practitioner’s Approach. Palgrave Macmillan, London (2001)
Prieto-Diaz, R.: Domain analysis concepts and research directions. In: Prieto-Diaz, R., Arango, G. (eds.) Domain Analysis and Software Systems Modeling, pp. 9–32. IEEE Computer Society Press (1991)
Rose, S., Engel, D., Cramer, N.: Automatic keyword extraction from individual documents. Text Min. pp. 1–20 (2010)
Schamber, L.: Relevance and information behavior. Annu. Rev. Inf. Sci. Technol. 29, 3–48 (1994)
Schmid, H.: Probabilistic part-of-speech tagging using decision trees. In: International Conference on New Methods in Language Processing, pp. 44–49 (1994)
Shilakes, C., Tylman, J.: Enterprise Information Portals. Merrill Lynch, New York (1998)
Vossen, P.: EuroWordNet: A Multilingual Database with Lexical Semantic Networks. Kluwer Academic Publishers, Dordrecht (1998)
About this article
Cite this article
Ménard, P.A., Ratté, S. Concept extraction from business documents for software engineering projects. Autom Softw Eng 23, 649–686 (2016). https://doi.org/10.1007/s10515-015-0184-4
DOI: https://doi.org/10.1007/s10515-015-0184-4