Abstract
Word embedding aims to represent each word with a dense vector that reveals the semantic similarity between words. Existing methods such as word2vec derive such representations by factorizing the word–context matrix into two parts, i.e., word vectors and context vectors. However, only one part is used to represent a word, which may distort the semantic similarity between words. To address this problem, this paper proposes a novel word embedding method based on the point-wise mutual information criterion (PMIVec). Our method explicitly learns the context vector as the final representation of each word, while discarding the word vector. To avoid distorting the semantic similarity between words, we normalize the word vectors during training. Moreover, this paper uses point-wise mutual information to measure the semantic similarity between words, which is more consistent with human intuition about semantic similarity. Experiments on public data sets show that our PMIVec model consistently outperforms state-of-the-art models.
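For reference, the point-wise mutual information used as the similarity criterion is the classical quantity of Church and Hanks [3]; the count notation \(\#(\cdot )\) and the number of word–context pairs \(|D|\) below are illustrative and show only its usual corpus estimate:
\[ \mathrm {PMI}(w_i,w_j)=\log \frac{p(w_i,w_j)}{p(w_i)\,p(w_j)}\approx \log \frac{\#(w_i,w_j)\cdot |D|}{\#(w_i)\cdot \#(w_j)}. \]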

References
Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching word vectors with subword information. Trans. Assoc. Comput. Linguist. 5, 135–146 (2017)
Bruni, E., Tran, N.-K., Baroni, M.: Multimodal distributional semantics. J. Artif. Intell. Res. 49, 1–47 (2014)
Church, K.W., Hanks, P.: Word association norms, mutual information, and lexicography. Comput. Linguist. 16(1), 22–29 (1990)
Devlin, J., Chang, M.-W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805, (2018)
Dyer, C.: Notes on noise contrastive estimation and negative sampling. arXiv:1410.8251, (2014)
Finkelstein, L., Gabrilovich, E., Matias, Y., Rivlin, E., Solan, Z., Wolfman, G., Ruppin, E.: Placing search in context: the concept revisited. ACM Trans. Inf. Syst. 20(1), 116–131 (2002)
Gutmann, M.U., Hyvärinen, A.: Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics. J. Mach. Learn. Res. 13, 307–361 (2012)
Hashimoto, T.B., Alvarez-Melis, D., Jaakkola, T.S.: Word embeddings as metric recovery in semantic spaces. Trans. Assoc. Comput. Linguist. 4, 273–286 (2016)
Hill, F., Reichart, R., Korhonen, A.: Simlex-999: evaluating semantic models with (genuine) similarity estimation. Comput. Linguist. 41(4), 665–695 (2015)
Hu, R., Singh, A., Darrell, T., Rohrbach, M.: Iterative answer prediction with pointer-augmented multimodal transformers for TextVQA. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9992–10002, (2020)
Kolesnikova, O.: Survey of word co-occurrence measures for collocation detection. Comput. Sist. 20(3), 327–344 (2016)
Levy, O., Goldberg, Y.: Neural word embedding as implicit matrix factorization. In: Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N.D., Weinberger, K.Q. (eds.) Advances in neural information processing systems. Curran Associates, Inc., 27, 2177–2185 (2014)
Luong, M.-T., Socher, R., Manning, C.D.: Better word representations with recursive neural networks for morphology. In: Proceedings of the Seventeenth Conference on Computational Natural Language Learning, pp. 104–113, (2013)
Ma, T.: Non-convex optimization for machine learning: design, analysis, and understanding. PhD thesis, Princeton University, (2017)
Meng, Y., Huang, J., Wang, G., Zhang, C., Zhuang, H., Kaplan, L., Han, J.: Spherical text embedding. In: Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., Garnett, R. (eds.) Advances in neural information processing systems. Curran Associates, Inc., 32, 8206–8215 (2019)
Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Burges, C.J., Bottou, L., Welling, M., Ghahramani, Z., Weinberger, K.Q. (eds.) Advances in neural information processing systems. Curran Associates, Inc., 26, 3111–3119 (2013)
Mnih, A., Kavukcuoglu, K.: Learning word embeddings efficiently with noise-contrastive estimation. In: Burges, C.J., Bottou, L., Welling, M., Ghahramani, Z., Weinberger, K.Q. (eds.) Advances in neural information processing systems. Curran Associates, Inc., 26, 2265–2273 (2013)
Neelakantan, A., Shankar, J., Passos, A., McCallum, A.: Efficient non-parametric estimation of multiple embeddings per word in vector space. arXiv:1504.06654, (2015)
Pennington, J., Socher, R., Manning, C.: GloVe: global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543, (2014)
Peters, M.E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., Zettlemoyer, L.: Deep contextualized word representations. arXiv:1802.05365, (2018)
Socher, R., Bauer, J., Manning, C.D., et al.: Parsing with compositional vector grammars. In: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 455–465, (2013)
Terra, E.L., Clarke, C.L.A.: Frequency estimates for statistical word similarity measures. In: Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, pp. 244–251, (2003)
Turian, J., Ratinov, L., Bengio, Y.: Word representations: a simple and general method for semi-supervised learning. In: Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pp. 384–394. Association for Computational Linguistics, (2010)
Xing, C., Wang, D., Liu, C., Lin, Y.: Normalized word embedding and orthogonal transform for bilingual word translation. In: Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1006–1011, (2015)
Acknowledgements
This work was supported in part by NSFC under contracts No. U20B2070 and No. 61976199 (to Dr. Liansheng Zhuang), and in part by NSFC under contract No. 61836011 (to Dr. Houqiang Li).
Appendix
1.1 Proof of Theorem 1
There are two steps to prove Theorem 1. The first step is to formulate the loss function of the SG model by applying the noise contrastive estimation (NCE) method [7] to Eqs. (3) and (4). The second step is to reveal that the simplification made by SG's negative sampling technique leads to property (5).
1.1.1 Step 1
To find the assumed optimal solutions \(\{{\mathbf {O}}_i,{\mathbf {I}}_k\}_{i,k=1,\ldots ,V}\) while avoiding the direct computation of the denominator of Eq. (4), denoted \(Z_{k}\), NCE treats \(Z_{k}\) as a constant to be estimated, so that \(\{{\mathbf {O}}_i,{\mathbf {I}}_k,Z_k\}_{i,k=1,\ldots ,V}\) can be estimated by solving a supervised learning task.
One can draw \(T_d\) real samples from \(p(\cdot |w_{k})\) and \(K\cdot T_d\) noise samples from a known noise distribution \(p_{n}(\cdot )\), where \(p_{n}(\cdot )\) can be any distribution and \(K\) is a small constant (e.g., \(K=5\)). Each sample is assigned a label \(y\): \(y=1\) stands for a real sample and \(y=0\) for a noise sample. These samples are then mixed together, and the task is to decide, for a sample picked from the mixture, whether it is real.
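Since real and noise samples are mixed in the ratio \(1:K\), Bayes' formula gives the true probability that a picked sample \(w_{i}\) is real. The following display is a short worked step in our own notation (with \(p_{d}(w_i;w_k)\) standing for the true probability of \(w_i\) under \(p(\cdot |w_{k})\)), not one of the paper's numbered equations:
\[ {\hat{p}}_{d}(1|w_i;w_k)=\frac{p_{d}(w_i;w_k)}{p_{d}(w_i;w_k)+K\,p_{n}(w_i)},\qquad {\hat{p}}_{d}(0|w_i;w_k)=1-{\hat{p}}_{d}(1|w_i;w_k). \]
This is the quantity that the logistic regression model introduced next is trained to approximate.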
One can train a logistic regression model, whose parameters are \(\{{\mathbf {O}}_i,{\mathbf {I}}_k,Z_k\}_{i,k=1,\ldots ,V}\), to discriminate the real samples from the noise samples. For a real sample \(w_{i}\), the logistic regression model's prediction of it being real is
\[ p(1|w_i;w_k)=\frac{\exp ({\mathbf {O}}_i^{\top }{\mathbf {I}}_k)/Z_{k}}{\exp ({\mathbf {O}}_i^{\top }{\mathbf {I}}_k)/Z_{k}+Kp_{n}(w_{i})}=\sigma \big ({\mathbf {O}}_i^{\top }{\mathbf {I}}_k-\log Z_{k}-\log Kp_{n}(w_{i})\big ). \]
According to NCE's conclusion, if one treats \(Z_{k}\) as an additional scalar parameter that can be optimized, then the following optimization problems (minimizing the cross entropy) have the same solution as (3)
where \(\forall k=1,\ldots ,V,\)
and \(p_{d}\) denotes the true distribution \(p(\cdot |w_{k})\). One can sample \(w_i\) from \(p_d\) with the help of the word window.
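For concreteness, here is a minimal sketch of the per-context NCE objective under the sampling scheme described above; the symbol \(J_k\) and the expectation form are introduced only for illustration and need not coincide with the paper's numbered displays:
\[ J_k=\mathbb {E}_{w_i\sim p_{d}}\big [\log p(1|w_i;w_k)\big ]+K\,\mathbb {E}_{w_j\sim p_{n}}\big [\log \big (1-p(1|w_j;w_k)\big )\big ], \]
with the parameters \(\{{\mathbf {O}}_i,{\mathbf {I}}_k,Z_k\}_{i,k=1,\ldots ,V}\) obtained by maximizing \(\sum _{k=1}^{V}J_k\), which is equivalent to minimizing the cross entropy of the binary labels \(y\).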
1.1.2 Step 2
The following shows how the negative-sampling technique adopted by the SG model leads to property (5).
The SG model simplifies the calculation of \(p(1|w_i;w_k)\) by omitting \(\log Z_{k}\) and \(\log Kp_{n}(w_{i})\), which means that the logistic model implicitly assumes \(Kp_{n}(w_{i})=1\) [5], i.e., the prediction reduces to \(\sigma ({\mathbf {O}}_i^{\top }{\mathbf {I}}_k)\).
Recall that \({\hat{p}}_{d}(1|w_i;w_k)\) represents the true probability of sample \(w_{i}\) being real, which can be derived by applying Bayes' formula. It is easy to show that (15) is equivalent to Eq. (16),
where \(\rho =p_{d}(w_i;w_k)+Kp_{n}(w_{i})\). This objective equals 0 if and only if the model prediction \(\sigma ({\mathbf {O}}_i^{\top }{\mathbf {I}}_k)\) matches \({\hat{p}}_{d}(1|w_i;w_k)\).
Noticing that SG chooses \(p(w_{k})\) as the noise distribution, a simple rearrangement of Eq. (16) yields property (5).
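To make the chain of implications explicit, here is a compact sketch of the standard argument in the spirit of Levy and Goldberg [12] and Dyer [5]; the shorthand \(x_{ik}={\mathbf {O}}_i^{\top }{\mathbf {I}}_k\), the choice \(p_{n}(w_i)=p(w_i)\) (the unigram distribution), and the final shifted-PMI form are our illustration and may differ from the paper's exact displays:
\[ \sigma (x_{ik})={\hat{p}}_{d}(1|w_i;w_k)=\frac{p_{d}(w_i;w_k)}{p_{d}(w_i;w_k)+K\,p_{n}(w_i)}\quad \Longleftrightarrow \quad e^{x_{ik}}=\frac{p_{d}(w_i;w_k)}{K\,p_{n}(w_i)}, \]
so that, writing \(p_{d}(w_i;w_k)=p(w_i|w_k)\),
\[ {\mathbf {O}}_i^{\top }{\mathbf {I}}_k=\log \frac{p(w_i|w_k)}{p(w_i)}-\log K=\mathrm {PMI}(w_i,w_k)-\log K. \]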
1.2 Proof of Theorem 2
There are two steps to prove Theorem 2. The first step is to show that the following equations hold for the assumed solutions \(\{{\mathbf {O}}_i,{\mathbf {I}}_k\}_{i,k=1,\ldots ,V}\)
where \({\hat{Z}}\) is defined in Eq. (10), \({\hat{Z}}^\prime\) is the mean value of all \({\hat{Z}}_k^\prime\), and \({\hat{Z}}_k^\prime\) is defined in Eq. (12). \({\bar{\alpha }}_i,{\bar{\alpha }}_j\), and \({\bar{\alpha }}_{ij}\) are all constants. The second step is to show that
and therefore, together with (17), (18), and (19), this proves the theorem.
1.2.1 Step 1
Similar to the analysis in the PMIVec section’s sketch proof, the assumed solution \(\{{\mathbf {O}}_i,{\mathbf {I}}_k\}_{i,k=1,\ldots ,V}\) satisfies
Here, \(\alpha _{i}\) is the mean value of \(\cos (\theta _{ik})\), \(\alpha _{j}\) is the mean value of \(\cos (\theta _{jk})\), and \(\alpha _{ij}\) is the mean value of \(\cos (\theta _{(ij)k})\).
Taking the logarithm of both sides of these equations yields Eqs. (17), (18), and (19).
1.2.2 Step 2
It is obvious that \({\bar{\alpha }}_i\approx {\bar{\alpha }}_j\approx {\bar{\alpha }}_{ij}\) because of the assumption about \(\theta _{ik},\theta _{jk}\), and \(\theta _{(ij)k}\).
Each term of \(Z^{\prime }_k\) is larger than each term of \(Z_k^2\), while \(Z^{\prime }_k\) contains fewer terms than \(Z_k^2\); the two effects roughly compensate, so \(Z^{\prime }_k\approx Z^{2}_k\). This leads to the conclusion that \({\hat{Z}}^\prime \approx {\hat{Z}}^2\).
Cite this article
Yao, M., Zhuang, L., Wang, S. et al. PMIVec: a word embedding model guided by point-wise mutual information criterion. Multimedia Systems 28, 2275–2283 (2022). https://doi.org/10.1007/s00530-022-00928-4