
Scope and Challenges of Language Modelling - An Interrogative Survey on Context and Embeddings

  • Conference paper
  • In: Data Analytics and Management in Data Intensive Domains (DAMDID/RCDL 2018)

Abstract

In this work, we explore the domain of language modelling, focusing on context-selection strategies, data-augmentation techniques, and word-embedding models. Many existing approaches are difficult to understand without specific expertise in this domain; we therefore concentrate on explanations and representations that make it possible to compare several approaches.



Author information

Correspondence to Marina Tropmann-Frick.


Copyright information

© 2019 Springer Nature Switzerland AG

About this paper


Cite this paper

Nitsche, M., Tropmann-Frick, M. (2019). Scope and Challenges of Language Modelling - An Interrogative Survey on Context and Embeddings. In: Manolopoulos, Y., Stupnikov, S. (eds) Data Analytics and Management in Data Intensive Domains. DAMDID/RCDL 2018. Communications in Computer and Information Science, vol 1003. Springer, Cham. https://doi.org/10.1007/978-3-030-23584-0_8


  • DOI: https://doi.org/10.1007/978-3-030-23584-0_8

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-23583-3

  • Online ISBN: 978-3-030-23584-0

  • eBook Packages: Computer Science (R0)
