
PMIVec: a word embedding model guided by point-wise mutual information criterion

  • Research Article
  • Published in Multimedia Systems

Abstract

Word embedding aims to represent each word with a dense vector that reveals the semantic similarity between words. Existing methods such as word2vec derive such representations by factorizing the word–context matrix into two parts, i.e., word vectors and context vectors. However, only one part is used to represent a word, which may damage the semantic similarity between words. To address this problem, this paper proposes a novel word embedding method based on the point-wise mutual information criterion (PMIVec). Our method explicitly learns the context vector as the final representation of each word, while discarding the word vector. To avoid damaging the semantic similarity between words, we normalize the word vectors during training. Moreover, this paper uses point-wise mutual information to measure the semantic similarity between words, which is more consistent with human intuition about semantic similarity. Experiments on public data sets show that our PMIVec model consistently outperforms state-of-the-art models.
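
As a rough illustration of the setup described above (a minimal sketch with assumed variable names, not the authors' implementation), the snippet below keeps the two parameter matrices of a word2vec-style factorization, normalizes the word vectors during training, and returns the context vectors as the final word representations:

```python
import numpy as np

rng = np.random.default_rng(0)
V, dim = 10000, 100              # vocabulary size and embedding dimension (example values)

# Two parameter matrices, as in word2vec-style factorizations of the
# word-context matrix: I holds word (input) vectors, O holds context
# (output) vectors; the names follow the appendix notation.
I = rng.normal(scale=0.1, size=(V, dim))
O = rng.normal(scale=0.1, size=(V, dim))

def normalize_rows(M):
    """Project each row vector onto the unit sphere (the normalization
    applied to the word vectors during training)."""
    return M / np.linalg.norm(M, axis=1, keepdims=True)

# ... gradient updates on I and O would go here ...
I = normalize_rows(I)            # keep word vectors normalized while training

# Key choice described in the abstract: the context vectors, not the word
# vectors, serve as the final word representations.
embeddings = O

def cosine_similarity(i, j, E=embeddings):
    """Semantic similarity between words i and j under the learned representations."""
    return float(E[i] @ E[j] / (np.linalg.norm(E[i]) * np.linalg.norm(E[j])))
```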


Notes

  1. https://github.com/facebookresearch/fastText.

  2. https://github.com/stanfordnlp/GloVe.

  3. https://github.com/yumeng5/Spherical-Text-Embedding.

  4. https://mbta.com.

  5. https://www.english-corpora.org/glowbe/.

  6. https://dumps.wikimedia.org/enwiki/.

References

  1. Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching word vectors with subword information. Trans. Assoc. Comput. Linguist. 5, 135–146 (2017)

  2. Bruni, E., Tran, N.-K., Baroni, M.: Multimodal distributional semantics. J. Artif. Intell. Res. 49, 1–47 (2014)

  3. Church, K.W., Hanks, P.: Word association norms, mutual information, and lexicography. Comput. Linguist. 16(1), 22–29 (1990)

  4. Devlin, J., Chang, M.-W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805, (2018)

  5. Dyer, C.: Notes on noise contrastive estimation and negative sampling. arXiv:1410.8251, (2014)

  6. Finkelstein, L., Gabrilovich, E., Matias, Y., Rivlin, E., Solan, Z., Wolfman, G., Ruppin, E.: Placing search in context: the concept revisited. ACM Trans. Inf. Syst. 20(1), 116–131 (2002)

  7. Gutmann, M.U., Hyvärinen, A.: Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics. J. Mach. Learn. Res. 13, 307–361 (2012)

  8. Hashimoto, T.B., Alvarez-Melis, D., Jaakkola, T.S.: Word embeddings as metric recovery in semantic spaces. Trans. Assoc. Comput. Linguist. 4, 273–286 (2016)

  9. Hill, F., Reichart, R., Korhonen, A.: Simlex-999: evaluating semantic models with (genuine) similarity estimation. Comput. Linguist. 41(4), 665–695 (2015)

  10. Hu, R., Singh, A., Darrell, T., Rohrbach, M.: Iterative answer prediction with pointer-augmented multimodal transformers for textvqa. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9992–10002, (2020)

  11. Kolesnikova, O.: Survey of word co-occurrence measures for collocation detection. Comput. Sist. 20(3), 327–344 (2016)

  12. Levy, O., Goldberg, Y.: Neural word embedding as implicit matrix factorization. In: Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N.D., Weinberger, K.Q. (eds.) Advances in neural information processing systems. Curran Associates, Inc., 27, 2177–2185 (2014)

  13. Luong, M.-T., Socher, R., Manning, C.D.: Better word representations with recursive neural networks for morphology. In: Proceedings of the Seventeenth Conference on Computational Natural Language Learning, pp. 104–113, (2013)

  14. Ma, T.: Non-convex optimization for machine learning: design, analysis, and understanding. PhD thesis, Princeton University, (2017)

  15. Meng, Y., Huang, J., Wang, G., Zhang, C., Zhuang, H., Kaplan, L., Han, J.: Spherical text embedding. In: Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., Garnett, R. (eds.) Advances in neural information processing systems. Curran Associates, Inc., 32, 8206–8215 (2019)

  16. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Burges, C.J., Bottou, L., Welling, M., Ghahramani, Z., Weinberger, K.Q. (eds.) Advances in neural information processing systems. Curran Associates, Inc., 26, 3111–3119 (2013)

  17. Mnih, A., Kavukcuoglu, K.: Learning word embeddings efficiently with noise-contrastive estimation. In: Burges, C.J., Bottou, L., Welling, M., Ghahramani, Z., and Weinberger, K.Q. (eds.) Advances in neural information processing systems. Curran Associates, Inc., 26, 2265–2273 (2013)

  18. Neelakantan, A., Shankar, J., Passos, A., McCallum, A.: Efficient non-parametric estimation of multiple embeddings per word in vector space. arXiv:1504.06654, (2015)

  19. Pennington, J., Socher, R., Manning, C.: GloVe: global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543, (2014)

  20. Peters, M.E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., Zettlemoyer, L.: Deep contextualized word representations. arXiv:1802.05365, (2018)

  21. Socher, R., Bauer, J., Manning, C.D., et al.: Parsing with compositional vector grammars. In: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 455–465, (2013)

  22. Terra, E.L., Clarke, C.L.A.: Frequency estimates for statistical word similarity measures. In: Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, pp. 244–251, (2003)

  23. Turian, J., Ratinov, L., Bengio, Y.: Word representations: a simple and general method for semi-supervised learning. In: Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pp. 384–394. Association for Computational Linguistics, (2010)

  24. Xing, C., Wang, D., Liu, C., Lin, Y.: Normalized word embedding and orthogonal transform for bilingual word translation. In: Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1006–1011, (2015)

Acknowledgements

This work was supported in part by NSFC under contracts No. U20B2070 and No. 61976199 (Dr. Liansheng Zhuang), and in part by NSFC under contract No. 61836011 (Dr. Houqiang Li).

Author information

Corresponding author

Correspondence to Liansheng Zhuang.

Additional information

Communicated by B.-K. Bao.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

1.1 Proof of Theorem 1

The proof of Theorem 1 proceeds in two steps. The first step formulates the loss function of the SG model by applying the noise contrastive estimation (NCE) method [7] to Eqs. (3) and (4). The second step shows that the simplification made by SG's negative sampling technique leads to property (5).

1.1.1 Step 1

To find the assumed optimal solutions \(\{{\mathbf {O}}_i,{\mathbf {I}}_k\}_{i,k=1,\ldots ,V}\) while avoiding direct computation of the denominator of Eq. (4), denoted \(Z_{k}\), NCE treats \(Z_{k}\) as a constant to be estimated; one can then estimate \(\{{\mathbf {O}}_i,{\mathbf {I}}_k,Z_k\}_{i,k=1,\ldots ,V}\) by solving a supervised learning task.

One draws \(T_d\) real samples from \(p(\cdot |w_{k})\) and \(K T_d\) noise samples from a known noise distribution \(p_{n}(\cdot )\), where \(p_{n}(\cdot )\) can be chosen to be any distribution and \(K\) is a small constant (e.g., 5). Each sample carries a label \(y\): \(y=1\) marks a real sample and \(y=0\) marks a noise sample. The samples are then mixed together, and the task is to decide, for a given sample, whether it is real.
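
As a concrete illustration of this sampling scheme (a minimal sketch, not the authors' implementation; the helper names and the choice of the unigram distribution for \(p_{n}\) are assumptions), the following snippet builds such a labeled sample pool for one center word:

```python
import random

def build_nce_samples(center_word, corpus_pairs, noise_dist, K=5):
    """Build a labeled sample pool for NCE around one center word w_k.

    corpus_pairs: list of (center, context) pairs observed in the corpus,
                  used as an empirical stand-in for p(.|w_k).
    noise_dist:   dict word -> probability, the noise distribution p_n(.).
    Returns a shuffled list of (word, label), label 1 = real, 0 = noise.
    """
    # T_d real samples: observed contexts of the center word (label y = 1).
    real = [(ctx, 1) for (c, ctx) in corpus_pairs if c == center_word]

    # K * T_d noise samples drawn from p_n(.) (label y = 0).
    words, probs = zip(*noise_dist.items())
    noise = [(w, 0) for w in random.choices(words, weights=probs, k=K * len(real))]

    samples = real + noise
    random.shuffle(samples)      # mix real and noise samples together
    return samples
```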

One can train a logistic regression model with parameters \(\{{\mathbf {O}}_i,{\mathbf {I}}_k,Z_k\}_{i,k=1,\ldots ,V}\) to discriminate the real samples from the noise samples. For a sample \(w_{i}\), the model's predicted probability that it is real is

$$\begin{aligned} p(y=1|w_i;w_k)=\sigma \left[ \log \exp \left( \left<{\mathbf {O}}_{i},{\mathbf {I}}_{k}\right>\right) -\log (Z_{k})-\log (Kp_{n}(w_{i}))\right] . \end{aligned}$$

According to the conclusion of NCE, if one treats \(Z_{k}\) as an additional scalar parameter to be optimized, then minimizing the following cross-entropy objective has the same solution as (3):

$$\begin{aligned} -{\mathbf {E}}_{i\sim p_{d}}\log p(1|w_i;w_k)-K{\mathbf {E}}_{t\sim p_{n}}\log p(0|w_t;w_k) \end{aligned}$$
(15)

where \(\forall k=1,\ldots ,V,\)

$$\begin{aligned} p(1|w_i;w_k)=\sigma \left[ \left<{\mathbf {O}}_{i},{\mathbf {I}}_{k}\right>-\log Z_{k}-\log Kp_{n}(w_{i})\right] , \end{aligned}$$

and \(p_{d}\) denotes the true distribution \(p(\cdot |w_{k})\). One can sample \(w_i\) from \(p_d\) with the help of the word window.
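
To make the cross-entropy objective (15) concrete, the following NumPy sketch evaluates \(p(1|w_i;w_k)\) and a Monte-Carlo estimate of the loss for one center word (an illustrative reconstruction using the notation above; the array names and the averaging scheme are assumptions, not the paper's code):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def nce_prob_real(O_i, I_k, log_Z_k, K, p_n_wi):
    # p(1|w_i; w_k) = sigma( <O_i, I_k> - log Z_k - log(K * p_n(w_i)) )
    return sigmoid(O_i @ I_k - log_Z_k - np.log(K * p_n_wi))

def nce_loss(real_ctx, noise_ctx, O, I, k, log_Z, K, p_n):
    """Monte-Carlo estimate of objective (15) for center word w_k.

    real_ctx  : word indices sampled from p_d = p(.|w_k) (via the word window)
    noise_ctx : word indices sampled from the noise distribution p_n(.)
    O, I      : (V, dim) arrays of context and word vectors
    log_Z, p_n: (V,) arrays of log-normalizers and noise probabilities
    """
    # -E_{i ~ p_d} log p(1|w_i; w_k)
    loss_real = -np.mean([np.log(nce_prob_real(O[i], I[k], log_Z[k], K, p_n[i]))
                          for i in real_ctx])
    # -K * E_{t ~ p_n} log p(0|w_t; w_k)
    loss_noise = -K * np.mean([np.log(1.0 - nce_prob_real(O[t], I[k], log_Z[k], K, p_n[t]))
                               for t in noise_ctx])
    return loss_real + loss_noise
```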

1.1.2 Step 2

This step shows how the negative-sampling technique adopted by the skip-gram model leads to Eq. (5).

The SG model simplifies the calculation of \(p(1|w_i;w_k)\) by omitting \(\log Z_{k}\) and \(\log (Kp_{n}(w_{i}))\), which amounts to assuming \(Z_{k}=1\) and \(Kp_{n}(w_{i})=1\) [5], i.e.,

$$\begin{aligned} p\left( y=1|w_i;w_k\right)&=\sigma \left( \left<{\mathbf {O}}_{i},{\mathbf {I}}_{k}\right>\right) =\frac{\exp \left( \left<{\mathbf {O}}_{i},{\mathbf {I}}_{k}\right>\right) }{\exp \left( \left<{\mathbf {O}}_{i},{\mathbf {I}}_{k}\right>\right) +1},\\ {\hat{p}}_{d}\left( y=1|w_i;w_k\right)&=\frac{p_{d}(w_{i}|w_{k})}{p_{d}(w_{i}|w_{k}) + Kp_{n}(w_{i})}. \end{aligned}$$

where \({\hat{p}}_{d}(1|w_i;w_k)\) denotes the true probability that the sample \(w_{i}\) is real, obtained by applying Bayes' rule. It is easy to show that (15) is equivalent to

$$\begin{aligned} \sum \limits _{i}\rho *KL\left\{ {\hat{p}}_{d}\left( y|w_i;w_k\right) \Vert p\left( y|w_i;w_k\right) \right\} , \end{aligned}$$

where \(\rho =p_{d}(w_i;w_k)+Kp_{n}(w_{i})\). The above objective equals 0 if and only if

$$\begin{aligned} {\hat{p}}_{d}\left( y|w_i;w_k\right) =p\left( y|w_i;w_k\right) ,\ y=1,0. \end{aligned}$$
(16)

Noting that SG chooses the unigram distribution \(p(\cdot )\) as the noise distribution, i.e., \(p_{n}(w_{i})=p(w_{i})\), a simple rearrangement of Eq. (16) implies

$$\begin{aligned} \left<{\mathbf {O}}_{i},{\mathbf {I}}_{k}\right>=\mathrm{PMI}(w_{i},w_{k})-\log K. \end{aligned}$$
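
This shifted-PMI relation matches the implicit matrix factorization view of Levy and Goldberg [12]. As a small numerical illustration (a sketch with assumed variable names, not part of the paper), the target inner products \(\left<{\mathbf {O}}_{i},{\mathbf {I}}_{k}\right>\) can be computed directly from a word-context co-occurrence count matrix:

```python
import numpy as np

def shifted_pmi_targets(cooc, K=5, eps=1e-12):
    """Target values <O_i, I_k> = PMI(w_i, w_k) - log K.

    cooc: (V, V) array of word-context co-occurrence counts.
    """
    total = cooc.sum()
    p_joint = cooc / total                        # p(w_i, w_k)
    p_word = p_joint.sum(axis=1, keepdims=True)   # p(w_i)
    p_ctx = p_joint.sum(axis=0, keepdims=True)    # p(w_k)
    pmi = np.log((p_joint + eps) / (p_word * p_ctx + eps))
    return pmi - np.log(K)                        # shifted PMI
```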

1.2 Proof of Theorem 2

The proof of Theorem 2 proceeds in two steps. The first step shows that the following equations hold for the assumed solutions \(\{{\mathbf {O}}_i,{\mathbf {I}}_k\}_{i,k=1,\ldots ,V}\):

$$\begin{aligned} \log \left( p(w_i)\right) + \log \left( {\hat{Z}}\right)&={\bar{\alpha }}_i*\left\| {\mathbf {O}}_i\right\| _2^2, \end{aligned}$$
(17)
$$\begin{aligned} \log \left( p(w_j)\right) + \log \left( {\hat{Z}}\right)&={\bar{\alpha }}_j*\left\| {\mathbf {O}}_j\right\| _2^2, \end{aligned}$$
(18)
$$\begin{aligned} \log \left( p(w_i,w_j)\right) + \log \left( {\hat{Z}}^\prime \right)&={\bar{\alpha }}_{ij}*\Vert {\mathbf {O}}_i+{\mathbf {O}}_j\Vert _2^2, \end{aligned}$$
(19)

where \({\hat{Z}}\) is defined in Eq. (10), \({\hat{Z}}^\prime\) is the mean value of all \({\hat{Z}}_k^\prime\), and \({\hat{Z}}_k^\prime\) is defined in Eq. (12). \({\bar{\alpha }}_i\), \({\bar{\alpha }}_j\), and \({\bar{\alpha }}_{ij}\) are all constants. The second step is to show that

$$\begin{aligned} {\hat{Z}}^\prime&\approx {\hat{Z}}^2,\\ {\bar{\alpha }}_i&\approx {\bar{\alpha }}_j\approx {\bar{\alpha }}_{ij}, \end{aligned}$$

and therefore, Eqs. (17), (18), and (19) together prove the theorem.

1.2.1 Step 1

Similar to the analysis in the PMIVec section’s sketch proof, the assumed solution \(\{{\mathbf {O}}_i,{\mathbf {I}}_k\}_{i,k=1,\ldots ,V}\) satisfies

$$\begin{aligned} \sum \limits _{k=1}^{V}p\left( w_i,w_k\right) Z_k&=\sum \limits _{k=1}^{V}\exp \left( \left\| {\mathbf {O}}_i\right\| _2^2\cos \left( \theta _{ik}\right) \right) p(w_k),\\ \sum \limits _{k=1}^{V}p\left( w_i,w_j,w_k\right) Z_k^\prime&=\sum \limits _{k=1}^{V}\exp \left( \left\| {\mathbf {O}}_i+{\mathbf {O}}_j\right\| _2^2\cos \left( \theta _{(ij)k}\right) \right) p(w_k). \end{aligned}$$

Here, \({\bar{\alpha }}_{i}\) is the mean value of \(\cos (\theta _{ik})\), \({\bar{\alpha }}_{j}\) is the mean value of \(\cos (\theta _{jk})\), and \({\bar{\alpha }}_{ij}\) is the mean value of \(\cos (\theta _{(ij)k})\).

Taking the logarithm of both sides of these equations leads to Eqs. (17), (18), and (19); the step for Eq. (17) is sketched below.
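
The following is an illustrative reconstruction of that step (it assumes that \({\hat{Z}}\) of Eq. (10) acts as an average of the \(Z_k\), so that \(\sum _{k}p(w_i,w_k)Z_k\approx p(w_i){\hat{Z}}\), and that \(\cos (\theta _{ik})\) may be replaced by its mean \({\bar{\alpha }}_i\)):

$$\begin{aligned} p(w_i)\,{\hat{Z}}&\approx \sum \limits _{k=1}^{V}p(w_i,w_k)Z_k=\sum \limits _{k=1}^{V}\exp \left( \left\| {\mathbf {O}}_i\right\| _2^2\cos (\theta _{ik})\right) p(w_k)\\&\approx \exp \left( {\bar{\alpha }}_i\left\| {\mathbf {O}}_i\right\| _2^2\right) \sum \limits _{k=1}^{V}p(w_k)=\exp \left( {\bar{\alpha }}_i\left\| {\mathbf {O}}_i\right\| _2^2\right) , \end{aligned}$$

so that taking logarithms of both sides yields Eq. (17); Eqs. (18) and (19) follow in the same way.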

1.2.2 Step 2

That \({\bar{\alpha }}_i\approx {\bar{\alpha }}_j\approx {\bar{\alpha }}_{ij}\) follows directly from the assumption on \(\theta _{ik}\), \(\theta _{jk}\), and \(\theta _{(ij)k}\).

Note that each term of \(Z^{\prime }_k\) is larger than the corresponding term of \(Z_k^2\), while \(Z^{\prime }_k\) contains fewer terms than \(Z_k^2\). Therefore, \(Z^{\prime }_k\approx Z^{2}_k\), which leads to the conclusion that \({\hat{Z}}^\prime \approx {\hat{Z}}^2\).

Cite this article

Yao, M., Zhuang, L., Wang, S. et al. PMIVec: a word embedding model guided by point-wise mutual information criterion. Multimedia Systems 28, 2275–2283 (2022). https://doi.org/10.1007/s00530-022-00928-4
