Abstract
In text mining, calculating precise keyword frequency distributions over a document collection requires mapping different keywords that denote the same entity to a canonical form. In the life science domain, a large dictionary of canonical forms and their variants can be constructed from external resources and used for term aggregation. However, such an automatically generated dictionary contains many invalid entries that distort the keyword frequency calculations. In this paper, we propose and test methods to detect invalid entries in the dictionary.
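To make the term aggregation step concrete, the following is a minimal sketch, not the authors' system. The dictionary contents, the function name `aggregate_counts`, the case-insensitive longest-match lookup, and the sample sentences are all assumptions chosen for illustration. It shows how variants are folded into one canonical count, and how a single invalid entry (here a hypothetical variant `was` mapped to the gene symbol `WAS`) inflates a frequency.

```python
import re
from collections import Counter

# Hypothetical dictionary: lowercased variant -> canonical form.
VARIANT_TO_CANONICAL = {
    "tumor necrosis factor": "TNF",
    "tnf-alpha": "TNF",
    "tnf": "TNF",
    "cachectin": "TNF",
}


def aggregate_counts(documents, variant_to_canonical):
    """Count canonical forms using case-insensitive dictionary matching."""
    # Put longer variants first so "tnf-alpha" is tried before "tnf".
    variants = sorted(variant_to_canonical, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(v) for v in variants) + r")\b",
        re.IGNORECASE,
    )
    counts = Counter()
    for doc in documents:
        for match in pattern.finditer(doc):
            counts[variant_to_canonical[match.group(1).lower()]] += 1
    return counts


docs = [
    "TNF-alpha and cachectin are the same protein.",
    "Tumor necrosis factor (TNF) levels rose.",
]
clean_counts = aggregate_counts(docs, VARIANT_TO_CANONICAL)

# Hypothetical invalid entry: the common English word "was" listed as a
# variant of a gene symbol. Every ordinary use of the word is now counted.
noisy_dictionary = dict(VARIANT_TO_CANONICAL, was="WAS")
noisy_counts = aggregate_counts(
    docs + ["The sample was stored overnight."], noisy_dictionary
)
```

In this sketch the four surface forms in `docs` all contribute to the single canonical key `TNF`, while the noisy dictionary reports an occurrence of `WAS` in a sentence that mentions no gene at all, which is the kind of distortion that motivates detecting invalid entries.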
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Takeuchi, H., Yoshida, I., Ikawa, Y., Iida, K., Fukui, Y. (2006). Detecting Invalid Dictionary Entries for Biomedical Text Mining. In: Bremer, E.G., Hakenberg, J., Han, E.-H., Berrar, D., Dubitzky, W. (eds) Knowledge Discovery in Life Science Literature. KDLL 2006. Lecture Notes in Computer Science, vol 3886. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11683568_10
DOI: https://doi.org/10.1007/11683568_10
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-32809-4
Online ISBN: 978-3-540-32810-0
eBook Packages: Computer Science, Computer Science (R0)