Stable methods for recognizing acronym-expansion pairs: from rule sets to hidden Markov models

  • Regular Paper
International Journal of Document Analysis and Recognition (IJDAR)

Abstract

Replacing textual units by synonymous canonical forms is an important prerequisite for many kinds of automated text analysis. In scientific texts, one common normalization step is the consistent replacement of acronyms by their definitions. For many acronyms, the definition appears at the point in the text where the acronym is introduced and “expanded” into a synonymous sequence of full words. A recent approach to detecting acronym-expansion pairs by Park and Byrd [19] describes possible graphical correspondences between acronyms and expansions by means of fine-grained rules. Here we show how rule sets as used in [19] can be translated into hidden Markov models that abstract from details of the graphical correspondence and significantly improve recall. Precision is kept stable by exploiting simple properties of the expansion, optionally reinforced by linguistic knowledge. With this extension of the original formalism, large rule sets can be avoided and a fixed model can be applied to a large variety of texts without retraining, with good values for both recall and precision.
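To make the notion of an acronym-expansion pair and its graphical correspondence concrete, the following is a minimal Python sketch of a toy baseline. It is not the hidden Markov model of this article, nor the rule set of [19]. It assumes the acronym is introduced in parentheses directly after its expansion, that the expansion starts with the acronym's first letter, and that the acronym's letters occur in order within the expansion. The names `ACRONYM_IN_PARENS`, `is_subsequence`, `find_pairs`, and `max_extra_words` are hypothetical and chosen for illustration only.

```python
import re

# Assumed shape of a parenthesised acronym such as "(HMM)": 2 to 10 characters,
# starting with an uppercase letter. This pattern is an illustrative assumption.
ACRONYM_IN_PARENS = re.compile(r"\(([A-Z][A-Za-z0-9]{1,9})\)")


def is_subsequence(short: str, long: str) -> bool:
    """Return True if the characters of `short` occur in `long` in the same order."""
    remaining = iter(long)
    return all(ch in remaining for ch in short)


def find_pairs(text: str, max_extra_words: int = 2) -> list[tuple[str, str]]:
    """Collect (acronym, expansion) candidates of the form 'expansion (ACRONYM)'."""
    pairs = []
    for match in ACRONYM_IN_PARENS.finditer(text):
        acronym = match.group(1)
        letters = acronym.lower()
        # Words to the left of the opening parenthesis (sentence boundaries are
        # ignored in this toy version).
        left_words = re.findall(r"[A-Za-z0-9-]+", text[:match.start()])
        max_window = min(len(left_words), len(acronym) + max_extra_words)
        # Try the shortest window first so the expansion stays as tight as possible.
        for size in range(1, max_window + 1):
            window = left_words[-size:]
            starts_with_first_letter = window[0][0].lower() == letters[0]
            joined = "".join(window).lower()
            if starts_with_first_letter and is_subsequence(letters, joined):
                pairs.append((acronym, " ".join(window)))
                break
    return pairs


if __name__ == "__main__":
    sample = "We translate rule sets into hidden Markov models (HMM) for this task."
    print(find_pairs(sample))
```

A purely letter-based heuristic like this accepts many spurious matches and misses expansions whose letters do not line up with the acronym, which is the kind of trade-off between recall and precision that the abstract addresses with a probabilistic model and additional properties of the expansion.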

References

  1. Andrade, M.A., Valencia, A.: Automatic extraction of keywords from scientific text: application to the knowledge domain of protein families. Bioinformatics 14(7), 600–607 (1998)

  2. Boguraev, B., Kennedy, C.: Applications of term identification terminology: domain description and content characterization. Nat. Lang. Eng. 5(1), 17–44 (1999)

  3. Basili, R., Moschitti, A.: Intelligent NLP-driven text classification. Int. J. Artif. Intell. Tools 11(3), 389–423 (2002)

  4. Teresa, C. M.: Terminology: Theory, Methods and Applications. John Benjamins Publishing Company, Amsterdam (1998)

  5. Charniak, E.: Statistical Language Learning. MIT Press, Cambridge, MA (1993)

  6. Cohen, J.D.: Highlights: language and domain independent automatic indexing terms for abstracting. J. Am. Soc. Inf. Sci. 46(3), 162–174 (1995)

  7. Dagan, I., Church, K.W.: Termight: identifying and translating technical terminology. In: Proceedings of the 7th Conference of the European Chapter of the Association for Computational Linguistics (EACL’95), pp. 34–40 (1995)

  8. Fung, P., McKeown, K.: A technical word and term translation aid using noisy parallel corpora across language groups. Mach. Transl. J. (Special Issue on New Tools for Human Translators) pp. 53–87 (1996)

  9. Gaizauskas, R., Demetriou, G., Humphreys, K.: Term recognition in biological science journal articles. In: Proceedings of the Workshop on Computational Terminology for Medical and Biological Applications and 2nd International Conference on Natural Language Processing (NLP-2000), Patras, Greece, pp. 37–44 (2000)

  10. Hirschman, L., Park, J.C., Tsuji, J., Wong, L., Wu, C.H.: Accomplishments and challenges in literature data mining for biology. Bioinformatics 18(12), 1553–1561 (2002)

  11. Justeson, J.S., Katz, S.M.: Technical terminology: some linguistic properties and an algorithm for identification in text. Nat. Lang. Eng. 1(1), 9–27 (1995)

  12. Larkey, L.S., Ogilvie, P., Price, M.A., Tamilio, B.: Acrophile: an automated acronym extractor and server. In: Proceedings of the 5th ACM International Conference on Digital Libraries (2000)

  13. Lehnert, W., Soderland, S., Aronow, D., Feng, F.: Inductive text classification for medical applications. J. Exp. Theor. Artif. Intell. 7(1), 49–80 (1995)

  14. Acronym/alias identification corpus of Brandeis University. http://www.medstract.org/gold-standards.html/ (2003)

  15. Medline—Searchable with PubMed. http://www.ncbi.nlm.nih.gov/PubMed/. Service by the U.S. National Library of Medicine (2003)

  16. Mikheev, A.: Periods, capitalized words, etc. Comput. Linguist. 28(3), 289–318 (2002)

  17. Nenadić, G., Spasić, I., Ananiadou, S.: Automatic acronym acquisition and term variation management within domain-specific texts. In: Proceedings of the 3rd International Conference on Language Resources and Evaluation, vol. VI, pp. 2155–2162. European Language Resources Association (2002)

  18. U.S. National Library of Medicine: Fact sheet Medline. http://www.nlm.nih.gov/pubs/factsheets/medline.html (2002)

  19. Park, Y., Byrd, R.J.: Hybrid text mining for finding abbreviations and their definitions. In: Conference on Empirical Methods in Natural Language Processing (EMNLP). http://citeseer.nj.nec.com/444674.html (2001)

  20. Park, Y., Byrd, R.J., Boguraev, B.K.: Automatic glossary extraction: beyond terminology identification. In: Proceedings of COLING’02 (2002)

  21. Pustejovsky, J., Castaño, J., Cochran, B., Kotecki, M., Morrell, M., Rumshisky, A.: Linguistic knowledge extraction from medline: automatic construction of an acronym database. Updated version of a paper presented at Medinfo. http://medstract.org/publications.html (2001)

  22. Paice, C.D., Jones, P.A.: The identification of important concepts in highly structured technical papers. In: Proceedings of the 16th Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval, pp. 69–78 (1993)

  23. Rabiner, L.R.: A tutorial on Hidden Markov Models and selected applications in speech recognition. Proc. IEEE 77(2), 257–286 (1989)

  24. Swanson, D.R.: Medical literature as a potential source of new knowledge. Bull. Med. Libr. Assoc. 78(1), 29–37 (1990)

  25. Taghva, K., Gilbreth, J.: Recognizing acronyms and their definitions. Technical Report 95-03, ISRI Information Science Research Institute, University of Nevada, Las Vegas (1995)

  26. Taghva, K., Gilbreth, J.: Recognizing acronyms and their definitions. Int. J. Doc. Anal. Recog. 1(4), 191–198 (1999)

  27. Wright, S.E., Budin, G. (eds.): Handbook of Terminology Management, vol. 1, Basic Concepts of Terminology Management. John Benjamins, Amsterdam (1997)

  28. Yeates, S., Bainbridge, D., Witten, I.H.: Using compression to identify acronyms in text. In: Conference on Data Compression, p. 582 (2000)

  29. Yeates, S.: Automatic extraction of acronyms from text. In: New Zealand Computer Science Research Students’ Conference, pp. 117–124 (1999)

Author information

Corresponding author

Correspondence to Eduardo Torres Schumann.

About this article

Cite this article

Schumann, E.T., Schulz, K.U. Stable methods for recognizing acronym-expansion pairs: from rule sets to hidden Markov models. IJDAR 8, 1–14 (2006). https://doi.org/10.1007/s10032-005-0146-7

  • DOI: https://doi.org/10.1007/s10032-005-0146-7
