
Do Scaling Algorithms Preserve Word2Vec Semantics? A Case Study for Medical Entities

  • Conference paper
  • In: Data Integration in the Life Sciences (DILS 2018)

Abstract

The exponential growth of scientific publications in the biomedical field challenges access to scientific information, which is primarily encoded in semantic relationships between medical entities such as active ingredients, diseases, or genes. Neural language models such as Word2Vec offer new ways of automatically learning semantically meaningful entity relationships even from large text corpora. They offer high scalability and deliver better accuracy than comparable approaches. Still, the models first have to be tuned by testing different training parameters. Arguably the most critical parameter is the number of training dimensions for the neural network, and testing different numbers of dimensions individually is time-consuming: a single training iteration on a large corpus usually takes hours or even days. In this paper we show a more efficient way to determine the optimal number of dimensions with respect to quality measures such as precision/recall. We show that the quality of results obtained with simpler, easier-to-compute scaling approaches such as MDS or PCA correlates strongly with the quality expected when training Word2Vec with the same number of dimensions. This effect is even stronger when, after the initial Word2Vec training, only a limited number of entities and their relations are of interest.
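The workflow the abstract describes can be illustrated with a minimal sketch using gensim and scikit-learn. This is not the paper's actual pipeline: the corpus, the entity names, and all parameter values below are illustrative placeholders, and the quality evaluation (precision/recall of entity neighborhoods) is only indicated in a comment.

```python
# Sketch: train Word2Vec once at a generous dimensionality, then scale the
# resulting vectors down with PCA or MDS instead of retraining the model
# for every candidate dimensionality. Toy data; gensim 4.x parameter names.
import numpy as np
from gensim.models import Word2Vec
from sklearn.decomposition import PCA
from sklearn.manifold import MDS

# Toy tokenized corpus; the paper works on a large biomedical corpus.
sentences = [
    ["aspirin", "inhibits", "platelet", "aggregation"],
    ["ibuprofen", "reduces", "inflammation", "and", "pain"],
    ["aspirin", "reduces", "pain", "and", "fever"],
    ["metformin", "lowers", "blood", "glucose"],
] * 200  # repetition so the toy model has enough training signal

# One expensive Word2Vec training run at a high dimensionality.
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, epochs=5)

# Restrict attention to the entities of interest (e.g., active
# ingredients) and stack their high-dimensional vectors.
entities = ["aspirin", "ibuprofen", "metformin", "pain"]
X = np.array([model.wv[e] for e in entities])

# For each candidate dimensionality d, scale the existing vectors with
# PCA and MDS; quality measures computed on these scaled spaces stand in
# for the quality of a freshly trained d-dimensional Word2Vec model.
for d in (2, 3):
    X_pca = PCA(n_components=d).fit_transform(X)
    X_mds = MDS(n_components=d).fit_transform(X)
    print(d, X_pca.shape, X_mds.shape)
```

Each scaled space can then be scored with the same precision/recall evaluation one would apply to a model retrained at dimensionality d; the paper's claim is that the two scores correlate strongly, so the cheap scaling run predicts the outcome of the expensive retraining.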





Author information

Corresponding author: Janus Wawrzinek.


Copyright information

© 2019 Springer Nature Switzerland AG

About this paper


Cite this paper

Wawrzinek, J., Pinto, J.M.G., Markiewka, P., Balke, WT. (2019). Do Scaling Algorithms Preserve Word2Vec Semantics? A Case Study for Medical Entities. In: Auer, S., Vidal, ME. (eds) Data Integration in the Life Sciences. DILS 2018. Lecture Notes in Computer Science, vol. 11371. Springer, Cham. https://doi.org/10.1007/978-3-030-06016-9_1


  • DOI: https://doi.org/10.1007/978-3-030-06016-9_1

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-06015-2

  • Online ISBN: 978-3-030-06016-9

  • eBook Packages: Computer Science, Computer Science (R0)
