Probabilistic Approach for Embedding Arbitrary Features of Text

  • Conference paper
  • In: Analysis of Images, Social Networks and Texts (AIST 2018)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11179)

Abstract

Topic modeling is usually used to model words in documents by probabilistic mixtures of topics. We generalize this setup and consider arbitrary features of the positions in a corpus, e.g. “contains a word”, “belongs to a sentence”, “has a word in the local context”, “is labeled with a POS-tag”, etc. We build sparse probabilistic embeddings for positions and derive embeddings for the features by averaging those. Importantly, we interpret the EM-algorithm as an iterative process of intersection and averaging steps that re-estimate position and feature embeddings, respectively. With this approach, we obtain several insights. First, we argue that a sentence should not be represented as an average of its words. While each word is a mixture of multiple senses, each word occurrence typically refers to just one specific sense. So in our approach, we obtain sentence embeddings by averaging position embeddings from the E-step. Second, we show that the Biterm Topic Model (Yan et al. [11]) and the Word Network Topic Model (Zuo et al. [12]) are equivalent, differing only in whether word and context embeddings are tied. We further extend these models by adjusting the representation of each sliding window with a few iterations of the EM-algorithm. Finally, we aim at consistent embeddings for hierarchical entities, e.g. for the word-sentence-document structure. We discuss two alternative training schemes and generalize to the case where the middle level of the hierarchy is unknown. This provides a unified formulation for the topic segmentation and word sense disambiguation tasks.
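
To make the intersection/averaging reading of the EM-algorithm concrete, the following sketch (ours, for illustration only; the paper itself builds on the BigARTM toolkit referenced in the notes) re-estimates position and feature embeddings over a toy corpus. The elementwise-product form of the E-step, the uniform averaging in the M-step, the function name em_embeddings, and the toy data are all assumptions made for this sketch, not the authors' exact formulation.

```python
import numpy as np

def em_embeddings(position_features, n_features, n_topics, n_iters=20, seed=0):
    """position_features: one list of feature ids per corpus position."""
    rng = np.random.default_rng(seed)
    # Feature embeddings phi: each row is a distribution over topics.
    phi = rng.dirichlet(np.ones(n_topics), size=n_features)
    theta = np.zeros((len(position_features), n_topics))
    for _ in range(n_iters):
        # E-step ("intersection"): a position's embedding is the renormalised
        # elementwise product of the embeddings of its features.
        for i, feats in enumerate(position_features):
            p = np.prod(phi[feats], axis=0)
            theta[i] = p / (p.sum() + 1e-12)
        # M-step ("averaging"): a feature's embedding is the average of the
        # embeddings of the positions where the feature occurs.
        sums = np.zeros((n_features, n_topics))
        occ = np.zeros(n_features)
        for i, feats in enumerate(position_features):
            for f in feats:
                sums[f] += theta[i]
                occ[f] += 1
        phi = sums / np.maximum(occ, 1)[:, None]
    return phi, theta

# Toy usage: features 0-2 stand for "contains word w", features 3-4 for
# "belongs to sentence s"; each position carries one word and one sentence feature.
positions = [[0, 3], [1, 3], [0, 4], [2, 4]]
phi, theta = em_embeddings(positions, n_features=5, n_topics=2)
# Sentence 3 = average of its positions' E-step embeddings, not of word embeddings.
sentence_emb = theta[[0, 1]].mean(axis=0)
```

Under this reading, a sentence embedding is the average of theta over the sentence's positions (the E-step output) rather than the average of the word rows of phi, which is exactly the first insight stated in the abstract.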

Notes

  1. bigartm.org

  2. github.com/facebookresearch/SentEval

References

  1. Arora, S., Li, Y., Liang, Y., Ma, T., Risteski, A.: Linear algebraic structure of word senses, with applications to polysemy. CoRR abs/1601.03764 (2016)

  2. Arora, S., Liang, Y., Ma, T.: A simple but tough-to-beat baseline for sentence embeddings. In: International Conference on Learning Representations (2017)

  3. Conneau, A., Kiela, D., Schwenk, H., Barrault, L., Bordes, A.: Supervised learning of universal sentence representations from natural language inference data. In: Proceedings of EMNLP, pp. 670–680. Association for Computational Linguistics (2017)

  4. Hofmann, T.: Probabilistic latent semantic analysis. In: Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, UAI 1999, pp. 289–296. Morgan Kaufmann Publishers Inc., San Francisco (1999)

  5. Inan, H., Khosravi, K., Socher, R.: Tying word vectors and word classifiers: a loss framework for language modeling. CoRR abs/1611.01462 (2016)

  6. Kiros, R., et al.: Skip-thought vectors. In: Proceedings of the 28th International Conference on Neural Information Processing Systems, NIPS 2015, pp. 3294–3302. MIT Press, Cambridge (2015)

  7. Kochedykov, D., Apishev, M., Golitsyn, L., Vorontsov, K.: Fast and modular regularized topic modelling. In: Proceedings of the 21st Conference of FRUCT Association, ISMW, pp. 182–193 (2017)

  8. Pagliardini, M., Gupta, P., Jaggi, M.: Unsupervised learning of sentence embeddings using compositional n-gram features. In: Proceedings of NAACL (2018)

  9. Potapenko, A., Popov, A., Vorontsov, K.: Interpretable probabilistic embeddings: bridging the gap between topic models and neural networks. In: Filchenkov, A., Pivovarova, L., Žižka, J. (eds.) AINL 2017. CCIS, vol. 789, pp. 167–180. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-71746-3_15

  10. Press, O., Wolf, L.: Using the output embedding to improve language models. In: Proceedings of ACL: Volume 2, Short Papers, pp. 157–163. ACL (2017)

  11. Yan, X., Guo, J., Lan, Y., Cheng, X.: A biterm topic model for short texts. In: Proceedings of WWW, pp. 1445–1456 (2013)

  12. Zuo, Y., Zhao, J., Xu, K.: Word network topic model: a simple but general solution for short and imbalanced texts. Knowl. Inf. Syst. 48(2), 379–398 (2016)

Acknowledgements

The research was supported by the Russian Foundation for Basic Research (grant 17-07-01536).

Author information

Corresponding author

Correspondence to Anna Potapenko.

Copyright information

© 2018 Springer Nature Switzerland AG

About this paper

Cite this paper

Potapenko, A. (2018). Probabilistic Approach for Embedding Arbitrary Features of Text. In: van der Aalst, W., et al. Analysis of Images, Social Networks and Texts. AIST 2018. Lecture Notes in Computer Science, vol. 11179. Springer, Cham. https://doi.org/10.1007/978-3-030-11027-7_14

  • DOI: https://doi.org/10.1007/978-3-030-11027-7_14

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-11026-0

  • Online ISBN: 978-3-030-11027-7

  • eBook Packages: Computer Science (R0)
