Skip to main content

On the Semantic Similarity of Disease Mentions in \(\textsc {medline}^{\circledR } \) and Twitter

  • Conference paper
  • First Online:

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 10859))

Abstract

Social media mining is becoming an important technique to track the spread of infectious diseases and to understand specific needs of people affected by a medical condition. A common approach is to select a variety of synonyms for a disease derived from scientific literature to then retrieve social media posts for subsequent analysis. With this paper, we question the underlying assumption that user-generated text always makes use of such names, or assigns them the same meaning as in scientific literature. We analyze the most frequently used concepts in \(\textsc {medline}^{\circledR } \) for semantic similarity to Twitter use and compare their normalized entropy and cosine similarities based on a simple distributional model. We find that diseases are referred to in semantically different ways in both corpora, a difference that increases in inverse proportion to the frequency of the synonym, and of the commonness of the disease or condition. These results imply that, when sampling social media for disease-related micro-blogs, query expressions must be carefully chosen, and even more so for rarily mentioned diseases or conditions.

This is a preview of subscription content, log in via an institution.

Buying options

Chapter
USD   29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD   39.99
Price excludes VAT (USA)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD   54.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Learn about institutional subscriptions

Notes

  1. 1.

    https://www.nlm.nih.gov/bsd/pmresources.html.

  2. 2.

    https://www.nlm.nih.gov/mesh/intro_trees.html.

References

  1. Baroni, M., Dinu, G., Kruszewski, G.: Don’t count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In: Proceedings of ACL 2014 (2014)

    Google Scholar 

  2. Dinu, G., Pham, N.T., Baroni, M.: DISSECT - DIStributional SEmantics composition toolkit. In: Proceedings of ACL 2013 (2013)

    Google Scholar 

  3. Doğan, R.I., Leaman, R., Lu, Z.: NCBI disease corpus: a resource for disease name recognition and concept normalization. J. Biomed. Inform. 47, 1–10 (2014)

    Article  Google Scholar 

  4. Hamosh, A., Scott, A.F., Amberger, J.S., Bocchini, C.A., McKusick, V.A.: Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucl. Acids Res. 33, 514–517 (2005)

    Article  Google Scholar 

  5. Harris, Z.: Distributional structure. Word 10(23), 146–162 (1954)

    Article  Google Scholar 

  6. He, L., Yang, Z., Lin, H., Li, Y.: Drug name recognition in biomedical texts: a machine-learning-based method. Drug Discov. Today 19(5), 610–617 (2014)

    Article  Google Scholar 

  7. Klinger, R., Kolářik, C., Fluck, J., Hofmann-Apitius, M., Friedrich, C.M.: Detection of IUPAC and IUPAC-like chemical names. Bioinformatics 24(13), 268–276 (2008)

    Article  Google Scholar 

  8. Leaman, R., Islamaj Doǧan, R., Lu, Z.: DNorm: disease name normalization with pairwise learning to rank. Bioinformatics 29(22), 2909–2917 (2013)

    Article  Google Scholar 

  9. Lipscomb, C.E.: Medical subject headings (MeSH). Bull. Med. Lib. Assoc. 88(3), 265–266 (2000)

    Google Scholar 

  10. Melamed, I.D.: Measuring semantic entropy. In: Proceedings of the SIGLEX Workshop on Tagging Text with Lexical Semantics (1997)

    Google Scholar 

  11. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. CoRR abs/1301.3781 (2013)

    Google Scholar 

  12. Nikfarjam, A., Sarker, A., O’Connor, K., Ginn, R.E., Gonzalez, G.: Pharmacovigilance from social media: mining adverse drug reaction mentions using sequence labeling with word embedding cluster features. JAMIA 22(3), 671–681 (2015)

    Google Scholar 

  13. Pennington, J., Socher, R., Manning, C.D.: Glove: global vectors for word representation. In: Proceedings of EMNLP 2014 (2014)

    Google Scholar 

  14. Sarker, A., O’Connor, K., Ginn, R., Scotch, M., Smith, K., Malone, D., Gonzalez, G.: Social media mining for toxicovigilance: automatic monitoring of prescription medication abuse from Twitter. Drug Saf. 39(3), 231–240 (2016)

    Article  Google Scholar 

  15. Seargeant, P., Tagg, C. (eds.): The Language of Social Media. Palgrave Macmillan, London (2014)

    Google Scholar 

  16. Wei, C.H., Kao, H.Y., Lu, Z.: GNormPlus: an integrative approach for tagging genes, gene families, and protein domains. BioMed Res. Int. 2015 (2015) (2015). ID 918710

    Google Scholar 

  17. Yang, C.C., Yang, H., Jiang, L., Zhang, M.: Social media mining for drug safety signal detection. In: Proceedings of the 2012 International Workshop on Smart Health and Wellbeing (SHB 2012) (2012)

    Google Scholar 

Download references

Acknowledgments

This work was supported by a grant from the Ministry of Science, Research and Arts of Baden-Württemberg to Roman Klinger.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Camilo Thorne .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2018 Springer International Publishing AG, part of Springer Nature

About this paper

Check for updates. Verify currency and authenticity via CrossMark

Cite this paper

Thorne, C., Klinger, R. (2018). On the Semantic Similarity of Disease Mentions in \(\textsc {medline}^{\circledR } \) and Twitter. In: Silberztein, M., Atigui, F., Kornyshova, E., Métais, E., Meziane, F. (eds) Natural Language Processing and Information Systems. NLDB 2018. Lecture Notes in Computer Science(), vol 10859. Springer, Cham. https://doi.org/10.1007/978-3-319-91947-8_34

Download citation

  • DOI: https://doi.org/10.1007/978-3-319-91947-8_34

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-91946-1

  • Online ISBN: 978-3-319-91947-8

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics