Classifier-based acronym extraction for business documents

Ménard, Pierre André; Ratté, Sylvie

doi:10.1007/s10115-010-0341-9

Regular Paper
Published: 18 September 2010

Volume 29, pages 305–334, (2011)

Knowledge and Information Systems

Abstract

Acronym extraction for business documents has been neglected in favor of acronym extraction for biomedical documents. Although the two tasks share some challenges, the semi-structured and non-predictive nature of business documents hinders the extraction methods developed for biomedical documents, which then fail to deliver the expected performance. A classifier-based extraction subsystem is presented as part of a wider project, Binocle, for the analysis of French business corpora. Explicit and implicit acronym presentation cases are identified using textual and syntactical hints. Among the seven features extracted from each candidate instance, we introduce "similarity" features, which compare a candidate's characteristics with average length-related values calculated from a generic acronym repository. Commonly used rules for evaluating a candidate (matching first letters, ordered instances, etc.) are scored and aggregated into a single composite feature that permits a flexible classification. One hundred and thirty-eight French business documents from 14 public organizations were used for the training and evaluation corpora, yielding a recall of 90.9% at a precision of 89.1% for a search space of three sentences.
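
The abstract does not give the scoring formulas, so the following Python sketch is only an illustration of the two ideas it names: a composite feature that aggregates rule scores, and a length-related "similarity" feature. Every function name, the two rules chosen, the equal weights, and the reference value avg_words_per_letter are assumptions made for this example; none of them is taken from the paper.

```python
import unicodedata


def normalize(text: str) -> str:
    """Lowercase and strip diacritics so that 'É' compares equal to 'e'."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def first_letter_score(acronym: str, expansion: str) -> float:
    """Fraction of acronym letters matched, in order, by a word initial."""
    letters = [c for c in normalize(acronym) if c.isalnum()]
    initials = [w[0] for w in normalize(expansion).split()]
    if not letters:
        return 0.0
    matched, pos = 0, 0
    for letter in letters:
        try:
            # Search only the word initials that have not been consumed yet.
            pos = initials.index(letter, pos) + 1
            matched += 1
        except ValueError:
            continue
    return matched / len(letters)


def ordered_score(acronym: str, expansion: str) -> float:
    """1.0 if the acronym letters occur in order inside the expansion."""
    letters = [c for c in normalize(acronym) if c.isalnum()]
    remaining = iter(normalize(expansion))
    # 'c in remaining' consumes the iterator, which enforces the ordering.
    return 1.0 if letters and all(c in remaining for c in letters) else 0.0


def composite_rule_score(acronym: str, expansion: str,
                         weights: tuple = (0.5, 0.5)) -> float:
    """Weighted mean of the rule scores (the weights are hypothetical)."""
    scores = (first_letter_score(acronym, expansion),
              ordered_score(acronym, expansion))
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)


def length_similarity(acronym: str, expansion: str,
                      avg_words_per_letter: float = 1.2) -> float:
    """Closeness of the words-per-letter ratio to a reference average.

    The default average is a placeholder, not a value from any repository.
    """
    letters = [c for c in acronym if c.isalnum()]
    words = expansion.split()
    if not letters or not words:
        return 0.0
    ratio = len(words) / len(letters)
    return 1.0 / (1.0 + abs(ratio - avg_words_per_letter))


if __name__ == "__main__":
    pair = ("ETS", "École de technologie supérieure")
    print(composite_rule_score(*pair), length_similarity(*pair))
```

In a setup like this, the composite score and the similarity value would be two of the feature columns handed to a classifier, which would then decide whether the candidate pair is a real acronym definition. A graded score lets the classifier learn a threshold instead of rejecting a candidate as soon as one rule fails.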

Author information

Authors and Affiliations

École de technologie supérieure, Montréal, QC, Canada
Pierre André Ménard & Sylvie Ratté

Corresponding author

Correspondence to Pierre André Ménard.

About this article

Cite this article

Ménard, P.A., Ratté, S. Classifier-based acronym extraction for business documents. Knowl Inf Syst 29, 305–334 (2011). https://doi.org/10.1007/s10115-010-0341-9

Received: 25 November 2009
Revised: 20 July 2010
Accepted: 04 September 2010
Published: 18 September 2010
Issue Date: November 2011
DOI: https://doi.org/10.1007/s10115-010-0341-9
