
Self-supervised phrase embedding method by fusing internal and external semantic information of phrases


Abstract

The quality of phrase embeddings affects the performance of many downstream NLP tasks. Most existing phrase embedding methods struggle to achieve satisfactory performance, or they sacrifice robustness in pursuit of performance. To address these problems, this paper proposes an effective phrase embedding method called Multi-loss Optimized Self-supervised Phrase Embedding (MOSPE). The method feeds a pre-trained phrase embedding and the embeddings of its component words into an encoder composed of an LSTM, a fully connected network, and an attention mechanism to obtain an embedding vector. The entire network is then trained to reconstruct the original input from this embedding vector through multiple loss functions. The LSTM captures the sequence information of the component words, the attention mechanism captures the importance of different component words, and the fully connected network effectively integrates this information. The loss functions are weighted mean squared error losses. They use the cosine similarity between each component word embedding and the distributed phrase embedding to measure that word's importance weight, and they measure the ratio of the phrase's internal and external information through the cosine similarity between the element-wise sum of the component word embeddings and the distributed phrase embedding. The method requires no supervised data and yields well-represented phrase embeddings. We use four evaluation methods to conduct experiments on three widely used phrase embedding evaluation datasets. The experimental results show that the Spearman correlation coefficient of the method reaches 0.686 on the English phrase similarity dataset and 0.846 on the Chinese phrase similarity dataset, and the F1 score on the phrase classification dataset reaches 0.715. Overall, it outperforms strong baseline methods with good robustness.
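The abstract describes the encoder and the weighted losses only at a high level. The sketch below is an illustrative reconstruction rather than the authors' released implementation: the module layout, dimensions, the softmax normalization of the cosine weights, and the mixing coefficient `alpha` are assumptions introduced to make the idea concrete.

```python
# Minimal sketch (assumed, not the authors' code) of an LSTM + attention + fully
# connected encoder over component-word embeddings and a pre-trained phrase
# embedding, trained with a cosine-similarity-weighted MSE reconstruction loss.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PhraseEncoder(nn.Module):
    def __init__(self, dim: int = 300, hidden: int = 300):
        super().__init__()
        self.lstm = nn.LSTM(dim, hidden, batch_first=True)  # sequence info of component words
        self.attn = nn.Linear(hidden, 1)                     # scores word importance
        self.fc = nn.Linear(hidden + dim, dim)               # fuses word-level and phrase-level views

    def forward(self, word_embs: torch.Tensor, phrase_emb: torch.Tensor) -> torch.Tensor:
        # word_embs: (batch, n_words, dim) component-word embeddings
        # phrase_emb: (batch, dim) pre-trained distributed phrase embedding
        h, _ = self.lstm(word_embs)                          # (batch, n_words, hidden)
        a = torch.softmax(self.attn(h), dim=1)               # attention weights over words
        context = (a * h).sum(dim=1)                         # importance-weighted word summary
        return self.fc(torch.cat([context, phrase_emb], dim=-1))


def weighted_mse_loss(pred: torch.Tensor,
                      word_embs: torch.Tensor,
                      phrase_emb: torch.Tensor) -> torch.Tensor:
    # Importance weight of each component word: cosine similarity between the
    # word embedding and the distributed phrase embedding (normalized here; an assumption).
    w = F.cosine_similarity(word_embs, phrase_emb.unsqueeze(1), dim=-1)   # (batch, n_words)
    w = torch.softmax(w, dim=-1)
    # Reconstruction error toward each component word, weighted by its importance.
    word_loss = (w * ((pred.unsqueeze(1) - word_embs) ** 2).mean(dim=-1)).sum(dim=-1)
    # Internal-vs-external ratio: cosine similarity between the element-wise sum of
    # the component words and the distributed phrase embedding.
    alpha = F.cosine_similarity(word_embs.sum(dim=1), phrase_emb, dim=-1).clamp(0, 1)
    phrase_loss = ((pred - phrase_emb) ** 2).mean(dim=-1)
    return (alpha * word_loss + (1 - alpha) * phrase_loss).mean()
```

Under these assumptions, a compositional phrase (high cosine between the summed word embeddings and the phrase embedding) pushes the training signal toward reconstructing its component words, while a non-compositional one relies more on the distributed phrase embedding.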

Data availability

The data in this paper can be obtained by contacting the corresponding author’s email address.

Code availability

The code in this paper can be obtained by contacting the corresponding author’s email address.

Funding

This study was funded by the Fundamental Research Funds for the Central Universities (Nos. 3072022CF0601 and 3072022CFJ0602) and by the Youth Science Foundation of Heilongjiang Institute of Technology (No. 2022QJ06).

Author information

Contributions

Conceptualization: [Rongsheng Li]; Methodology: [Rongsheng Li], [Naiyu Yan], [Chi Wei]; Formal analysis and investigation: [Rongsheng Li]; Writing - original draft preparation: [Rongsheng Li]; Writing - review and editing: [Rongsheng Li], [Chi Wei]; Software: [Rongsheng Li], [Naiyu Yan]; Supervision: [Shaobin Huang].

Corresponding author

Correspondence to Naiyu Yan.

Ethics declarations

Conflict of interest

The authors have no relevant financial or non-financial interests to disclose.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Li, R., Wei, C., Huang, S. et al. Self-supervised phrase embedding method by fusing internal and external semantic information of phrases. Multimed Tools Appl 82, 20477–20495 (2023). https://doi.org/10.1007/s11042-022-14312-x
