Abstract
Word embedding aims to represent each word with a dense vector that reveals the semantic similarity between words. Existing methods such as word2vec derive such representations by factorizing the word–context matrix into two parts, i.e., word vectors and context vectors. However, only one part is used to represent a word, which may distort the semantic similarity between words. To address this problem, this paper proposes a novel word embedding method based on the point-wise mutual information criterion (PMIVec). Our method explicitly learns the context vector as the final representation of each word, while discarding the word vector. To avoid distorting the semantic similarity between words, we normalize the word vectors during training. Moreover, this paper uses point-wise mutual information to measure the semantic similarity between words, which is more consistent with human intuition about semantic similarity. Experiments on public data sets show that our PMIVec model consistently outperforms state-of-the-art models.
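For reference, the point-wise mutual information used as the similarity criterion is the classical quantity of Church and Hanks [3]; the count notation \(\#(\cdot )\) and the number of word–context pairs \(|D|\) below are illustrative and show only its usual corpus estimate:
\[ \mathrm {PMI}(w_i,w_j)=\log \frac{p(w_i,w_j)}{p(w_i)\,p(w_j)}\approx \log \frac{\#(w_i,w_j)\cdot |D|}{\#(w_i)\cdot \#(w_j)}. \]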

References
Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching word vectors with subword information. Trans. Assoc. Comput. Linguist. 5, 135–146 (2017)
Bruni, E., Tran, N.-K., Baroni, M.: Multimodal distributional semantics. J. Artif. Intell. Res. 49, 1–47 (2014)
Church, K.W., Hanks, P.: Word association norms, mutual information, and lexicography. Comput. Linguist. 16(1), 22–29 (1990)
Devlin, J., Chang, M.-W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805, (2018)
Dyer, C.: Notes on noise contrastive estimation and negative sampling. arXiv:1410.8251, (2014)
Finkelstein, L., Gabrilovich, E., Matias, Y., Rivlin, E., Solan, Z., Wolfman, G., Ruppin, E.: Placing search in context: the concept revisited. ACM Trans. Inf. Syst. 20(1), 116–131 (2002)
Gutmann, M.U., Hyvärinen, A.: Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics. J. Mach. Learn. Res. 13, 307–361 (2012)
Hashimoto, T.B., Alvarez-Melis, D., Jaakkola, T.S.: Word embeddings as metric recovery in semantic spaces. Trans. Assoc. Comput. Linguist. 4, 273–286 (2016)
Hill, F., Reichart, R., Korhonen, A.: Simlex-999: evaluating semantic models with (genuine) similarity estimation. Comput. Linguist. 41(4), 665–695 (2015)
Hu, R., Singh, A., Darrell, T., Rohrbach, M.: Iterative answer prediction with pointer-augmented multimodal transformers for TextVQA. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9992–10002, (2020)
Kolesnikova, O.: Survey of word co-occurrence measures for collocation detection. Comput. Sist. 20(3), 327–344 (2016)
Levy, O., Goldberg, Y.: Neural word embedding as implicit matrix factorization. In: Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N.D., Weinberger, K.Q. (eds.) Advances in neural information processing systems. Curran Associates, Inc., 27, 2177–2185 (2014)
Luong, M.-T., Socher, R., Manning, C.D.: Better word representations with recursive neural networks for morphology. In: Proceedings of the Seventeenth Conference on Computational Natural Language Learning, pp. 104–113, (2013)
Ma, T.: Non-convex optimization for machine learning: design, analysis, and understanding. PhD thesis, Princeton University, (2017)
Meng, Y., Huang, J., Wang, G., Zhang, C., Zhuang, H., Kaplan, L., Han, J.: Spherical text embedding. In: Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., Garnett, R. (eds.) Advances in neural information processing systems. Curran Associates, Inc., 32, 8206–8215 (2019)
Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Burges, C.J., Bottou, L., Welling, M., Ghahramani, Z., Weinberger, K.Q. (eds.) Advances in neural information processing systems. Curran Associates, Inc., 26, 3111–3119 (2013)
Mnih, A., Kavukcuoglu, K.: Learning word embeddings efficiently with noise-contrastive estimation. In: Burges, C.J., Bottou, L., Welling, M., Ghahramani, Z., Weinberger, K.Q. (eds.) Advances in neural information processing systems. Curran Associates, Inc., 26, 2265–2273 (2013)
Neelakantan, A., Shankar, J., Passos, A., McCallum, A.: Efficient non-parametric estimation of multiple embeddings per word in vector space. arXiv:1504.06654, (2015)
Pennington, J., Socher, R., Manning, C.: GloVe: global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543, (2014)
Peters, M.E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., Zettlemoyer, L.: Deep contextualized word representations. arXiv:1802.05365, (2018)
Socher, R., Bauer, J., Manning, C.D., et al.: Parsing with compositional vector grammars. In: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 455–465, (2013)
Terra, E.L., Clarke, C.L.A.: Frequency estimates for statistical word similarity measures. In: Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, pp. 244–251, (2003)
Turian, J., Ratinov, L., Bengio, Y.: Word representations: a simple and general method for semi-supervised learning. In: Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pp. 384–394. Association for Computational Linguistics, (2010)
Xing, C., Wang, D., Liu, C., Lin, Y.: Normalized word embedding and orthogonal transform for bilingual word translation. In: Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1006–1011, (2015)
Acknowledgements
This work was supported in part by NSFC under contracts No. U20B2070 and No. 61976199 (to Dr. Liansheng Zhuang), and in part by NSFC under contract No. 61836011 (to Dr. Houqiang Li).
Appendix
1.1 Proof of Theorem 1
There are two steps to prove Theorem 1. The first step is to formulate the loss function of the SG model by applying the noise contrastive estimation (NCE) method [7] to Eqs. (3) and (4). The second step is to reveal that the simplification made by SG's negative sampling technique leads to property (5).
1.1.1 Step 1
To find the assumed optimal solutions \(\{{\mathbf {O}}_i,{\mathbf {I}}_k\}_{i,k=1,\ldots ,V}\) while avoiding the direct computation of the denominator of Eq. (4), denoted \(Z_{k}\), NCE treats \(Z_{k}\) as a constant to be estimated, so that \(\{{\mathbf {O}}_i,{\mathbf {I}}_k,Z_k\}_{i,k=1,\ldots ,V}\) can be estimated by solving a supervised learning task.
One can draw \(T_d\) real samples from \(p(\cdot |w_{k})\) and \(K\cdot T_d\) noise samples from a known noise distribution \(p_{n}(\cdot )\), where \(p_{n}(\cdot )\) can be any distribution and \(K\) is a small constant (e.g., \(K=5\)). Each sample is assigned a label \(y\): \(y=1\) stands for a real sample and \(y=0\) for a noise sample. These samples are then mixed together, and the task is to decide, for a sample picked from the mixture, whether it is real.
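Since real and noise samples are mixed in the ratio \(1:K\), Bayes' formula gives the true probability that a picked sample \(w_{i}\) is real. The following display is a short worked step in our own notation (with \(p_{d}(w_i;w_k)\) standing for the true probability of \(w_i\) under \(p(\cdot |w_{k})\)), not one of the paper's numbered equations:
\[ {\hat{p}}_{d}(1|w_i;w_k)=\frac{p_{d}(w_i;w_k)}{p_{d}(w_i;w_k)+K\,p_{n}(w_i)},\qquad {\hat{p}}_{d}(0|w_i;w_k)=1-{\hat{p}}_{d}(1|w_i;w_k). \]
This is the quantity that the logistic regression model introduced next is trained to approximate.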
One can train a logistic regression model, whose parameters are \(\{{\mathbf {O}}_i,{\mathbf {I}}_k,Z_k\}_{i,k=1,\ldots ,V}\), to discriminate the real samples from the noise samples. For a real sample \(w_{i}\), the logistic regression model's prediction of it being real is
\[ p(1|w_i;w_k)=\frac{\exp ({\mathbf {O}}_i^{\top }{\mathbf {I}}_k)/Z_{k}}{\exp ({\mathbf {O}}_i^{\top }{\mathbf {I}}_k)/Z_{k}+Kp_{n}(w_{i})}=\sigma \big ({\mathbf {O}}_i^{\top }{\mathbf {I}}_k-\log Z_{k}-\log Kp_{n}(w_{i})\big ). \]
According to NCE's conclusion, if one treats \(Z_{k}\) as an additional scalar parameter that can be optimized, then the following optimization problems (minimizing the cross entropy) have the same solution as (3)
where \(\forall k=1,\ldots ,V,\)
and \(p_{d}\) denotes the true distribution \(p(\cdot |w_{k})\). One can sample \(w_i\) from \(p_d\) with the help of the word window.
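For concreteness, here is a minimal sketch of the per-context NCE objective under the sampling scheme described above; the symbol \(J_k\) and the expectation form are introduced only for illustration and need not coincide with the paper's numbered displays:
\[ J_k=\mathbb {E}_{w_i\sim p_{d}}\big [\log p(1|w_i;w_k)\big ]+K\,\mathbb {E}_{w_j\sim p_{n}}\big [\log \big (1-p(1|w_j;w_k)\big )\big ], \]
with the parameters \(\{{\mathbf {O}}_i,{\mathbf {I}}_k,Z_k\}_{i,k=1,\ldots ,V}\) obtained by maximizing \(\sum _{k=1}^{V}J_k\), which is equivalent to minimizing the cross entropy of the binary labels \(y\).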
1.1.2 Step 2
The following shows how the negative-sampling technique adopted by the SG model leads to property (5).
The SG model simplifies the calculation of \(p(1|w_i;w_k)\) by omitting \(\log Z_{k}\) and \(\log Kp_{n}(w_{i})\), which means that the logistic model implicitly assumes \(Kp_{n}(w_{i})=1\) [5], i.e., the prediction reduces to \(\sigma ({\mathbf {O}}_i^{\top }{\mathbf {I}}_k)\).
Recall that \({\hat{p}}_{d}(1|w_i;w_k)\) represents the true probability of sample \(w_{i}\) being real, which can be derived by applying Bayes' formula. It is easy to show that (15) is equivalent to Eq. (16),
where \(\rho =p_{d}(w_i;w_k)+Kp_{n}(w_{i})\). This objective equals 0 if and only if the model prediction \(\sigma ({\mathbf {O}}_i^{\top }{\mathbf {I}}_k)\) matches \({\hat{p}}_{d}(1|w_i;w_k)\).
Noticing that SG chooses \(p(w_{k})\) as the noise distribution, a simple rearrangement of Eq. (16) yields property (5).
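To make the chain of implications explicit, here is a compact sketch of the standard argument in the spirit of Levy and Goldberg [12] and Dyer [5]; the shorthand \(x_{ik}={\mathbf {O}}_i^{\top }{\mathbf {I}}_k\), the choice \(p_{n}(w_i)=p(w_i)\) (the unigram distribution), and the final shifted-PMI form are our illustration and may differ from the paper's exact displays:
\[ \sigma (x_{ik})={\hat{p}}_{d}(1|w_i;w_k)=\frac{p_{d}(w_i;w_k)}{p_{d}(w_i;w_k)+K\,p_{n}(w_i)}\quad \Longleftrightarrow \quad e^{x_{ik}}=\frac{p_{d}(w_i;w_k)}{K\,p_{n}(w_i)}, \]
so that, writing \(p_{d}(w_i;w_k)=p(w_i|w_k)\),
\[ {\mathbf {O}}_i^{\top }{\mathbf {I}}_k=\log \frac{p(w_i|w_k)}{p(w_i)}-\log K=\mathrm {PMI}(w_i,w_k)-\log K. \]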
1.2 Proof of Theorem 2
There are two steps to prove Theorem 2. The first step is to show that the following equations hold for the assumed solutions \(\{{\mathbf {O}}_i,{\mathbf {I}}_k\}_{i,k=1,\ldots ,V}\)
where \({\hat{Z}}\) is defined in Eq. (10), \({\hat{Z}}^\prime\) is the mean value of all \({\hat{Z}}_k^\prime\), and \({\hat{Z}}_k^\prime\) is defined in Eq. (12). \({\bar{\alpha }}_i,{\bar{\alpha }}_j\), and \({\bar{\alpha }}_{ij}\) are all constants. The second step is to show that
and therefore, together with (17), (18), and (19), this proves the theorem.
1.2.1 Step 1
Similar to the analysis in the PMIVec section’s sketch proof, the assumed solution \(\{{\mathbf {O}}_i,{\mathbf {I}}_k\}_{i,k=1,\ldots ,V}\) satisfies
Here, \(\alpha _{i}\) is the mean value of \(\cos (\theta _{ik})\), \(\alpha _{j}\) is the mean value of \(\cos (\theta _{jk})\), and \(\alpha _{ij}\) is the mean value of \(\cos (\theta _{(ij)k})\).
Taking the logarithm of both sides of these equations yields Eqs. (17), (18), and (19).
1.2.2 Step 2
It is obvious that \({\bar{\alpha }}_i\approx {\bar{\alpha }}_j\approx {\bar{\alpha }}_{ij}\) because of the assumption about \(\theta _{ik},\theta _{jk}\), and \(\theta _{(ij)k}\).
Each term of \(Z^{\prime }_k\) is larger than each term of \(Z_k^2\), while \(Z^{\prime }_k\) contains fewer terms than \(Z_k^2\); the two effects roughly compensate, so \(Z^{\prime }_k\approx Z^{2}_k\). This leads to the conclusion that \({\hat{Z}}^\prime \approx {\hat{Z}}^2\).
Cite this article
Yao, M., Zhuang, L., Wang, S. et al. PMIVec: a word embedding model guided by point-wise mutual information criterion. Multimedia Systems 28, 2275–2283 (2022). https://doi.org/10.1007/s00530-022-00928-4