Classifier-based acronym extraction for business documents

  • Regular Paper
Knowledge and Information Systems

Abstract

Acronym extraction for business documents has been neglected in favor of acronym extraction for biomedical documents. Although the two domains share some challenges, the semi-structured and unpredictable nature of business documents hinders the extraction methods developed for biomedical documents, which then fail to deliver the expected performance. A classifier-based extraction subsystem is presented as part of the wider project, Binocle, for the analysis of French business corpora. Explicit and implicit acronym presentation cases are identified using textual and syntactic hints. Among the seven features extracted from each candidate instance, we introduce “similarity” features, which compare a candidate’s characteristics with average length-related values calculated from a generic acronym repository. Commonly used rules for evaluating a candidate (matching first letters, ordered instances, etc.) are scored and aggregated into a single composite feature that permits a flexible classification. One hundred and thirty-eight French business documents from 14 public organizations were used for the training and evaluation corpora, yielding a recall of 90.9% at a precision of 89.1% for a search space of three sentences.
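The composite feature described in the abstract can be illustrated with a minimal sketch. The abstract does not give the actual rule set, scoring, or weights, so everything below is an assumption made for illustration: the two rules, the equal weights, the small French stop-word list, and the function names are all hypothetical and are not the authors' implementation.

```python
import re

# Hypothetical stop-word list (an assumption, not taken from the paper):
# short French function words that acronyms usually skip.
STOPWORDS = {"de", "des", "du", "la", "le", "les", "sur", "et", "d", "l"}


def first_letters_match(acronym: str, expansion: str) -> float:
    """Fraction of acronym letters equal to the initial of the content word
    at the same position in the candidate expansion."""
    words = [w for w in re.findall(r"\w+", expansion) if w.lower() not in STOPWORDS]
    letters = [c for c in acronym if c.isalpha()]
    if not letters or not words:
        return 0.0
    hits = sum(1 for c, w in zip(letters, words) if w[0].lower() == c.lower())
    return hits / len(letters)


def letters_in_order(acronym: str, expansion: str) -> float:
    """1.0 if the acronym letters occur in the expansion in the same order
    (as a subsequence), otherwise 0.0."""
    text = expansion.lower()
    pos = 0
    for c in acronym.lower():
        if not c.isalpha():
            continue
        pos = text.find(c, pos)
        if pos == -1:
            return 0.0
        pos += 1
    return 1.0


def composite_score(acronym: str, expansion: str, weights=(0.5, 0.5)) -> float:
    """Weighted sum of the rule scores; the equal weights are an assumption."""
    rules = (first_letters_match, letters_in_order)
    return sum(w * rule(acronym, expansion) for w, rule in zip(weights, rules))


score_good = composite_score("SNCF", "Société nationale des chemins de fer")
score_bad = composite_score("SNCF", "rapport annuel")
```

Aggregating graded rule scores into one number, rather than rejecting a candidate as soon as a single rule fails, lets a downstream classifier choose its own threshold; that is one plausible reading of the "flexible classification" mentioned in the abstract.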

Author information

Corresponding author

Correspondence to Pierre André Ménard.

About this article

Cite this article

Ménard, P.A., Ratté, S. Classifier-based acronym extraction for business documents. Knowl Inf Syst 29, 305–334 (2011). https://doi.org/10.1007/s10115-010-0341-9

  • DOI: https://doi.org/10.1007/s10115-010-0341-9
