Abstract
Acronym extraction for business documents has been neglected in favor of acronym extraction for biomedical documents. Although the two domains share some challenges, the semi-structured and non-predictive nature of business documents hinders the extraction methods developed for biomedical documents, which fail to deliver the expected performance. A classifier-based extraction subsystem is presented as part of the wider project, Binocle, for the analysis of French business corpora. Explicit and implicit acronym presentation cases are identified using textual and syntactical hints. Among the 7 features extracted from each candidate instance, we introduce “similarity” features, which compare a candidate’s characteristics with average length-related values calculated from a generic acronym repository. Commonly used rules for evaluating a candidate (matching first letters, ordered instances, etc.) are scored and aggregated into a single composite feature that permits a flexible classification. One hundred and thirty-eight French business documents from 14 public organizations were used for the training and evaluation corpora, yielding a recall of 90.9% at a precision of 89.1% for a search space of 3 sentences.
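The two kinds of features named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the two rules scored here, the equal weights, the similarity formula, and all function names are assumptions chosen only to show how rule scores might be aggregated into one composite value and how a candidate's length might be compared with a repository average.

```python
def first_letter_score(acronym, expansion):
    """Fraction of acronym letters that match the initial of some expansion word."""
    initials = [word[0].lower() for word in expansion.split() if word]
    letters = [c.lower() for c in acronym if c.isalpha()]
    if not letters:
        return 0.0
    matched = sum(1 for c in letters if c in initials)
    return matched / len(letters)


def ordered_score(acronym, expansion):
    """1.0 if the acronym letters occur in order inside the expansion, else 0.0."""
    text = expansion.lower()
    pos = 0
    for c in acronym.lower():
        if not c.isalpha():
            continue
        pos = text.find(c, pos)
        if pos == -1:
            return 0.0
        pos += 1
    return 1.0


def composite_score(acronym, expansion, weights=(0.5, 0.5)):
    """Weighted sum of the rule scores (weights are hypothetical, not from the paper)."""
    scores = (
        first_letter_score(acronym, expansion),
        ordered_score(acronym, expansion),
    )
    return sum(w * s for w, s in zip(weights, scores))


def length_similarity(value, repository_mean):
    """Similarity in [0, 1]; 1.0 when the value equals the repository average.

    The formula is an assumption: the abstract only says that candidate
    characteristics are compared with average length-related values.
    """
    if repository_mean <= 0:
        return 0.0
    return max(0.0, 1.0 - abs(value - repository_mean) / repository_mean)
```

With these definitions, `composite_score("SNCF", "Société nationale des chemins de fer")` scores both rules fully, while a candidate whose letters are absent from the expansion scores zero. A graded value of this kind, rather than a hard pass or fail on each rule, is what would let a classifier choose its own threshold.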
Cite this article
Ménard, P.A., Ratté, S. Classifier-based acronym extraction for business documents. Knowl Inf Syst 29, 305–334 (2011). https://doi.org/10.1007/s10115-010-0341-9
DOI: https://doi.org/10.1007/s10115-010-0341-9