Abstract
The quality of phrase embeddings affects the performance of many downstream NLP tasks. Most existing phrase embedding methods either struggle to achieve satisfactory performance or sacrifice robustness in pursuit of it. To address these problems, this paper proposes an effective phrase embedding method called Multi-loss Optimized Self-supervised Phrase Embedding (MOSPE). The method feeds a pre-trained phrase embedding and its component word embeddings into an encoder composed of an LSTM, a fully connected network, and an attention mechanism to obtain an embedding vector. The entire network is then trained to reconstruct the original input from this embedding vector through multiple loss functions. The LSTM captures the sequence information of the component words, the attention mechanism captures the importance of the different component words, and the fully connected network integrates this information. The loss functions are weighted mean squared error losses. They use the cosine similarity between each component word embedding and the distributed embedding of the phrase to measure that word's importance weight, and they measure the ratio of the phrase's internal to external information through the cosine similarity between the element-wise sum of the component word embeddings and the phrase embedding. The method requires no supervised data and yields well-represented phrase embeddings. We conduct experiments with four evaluation methods on three widely used phrase embedding evaluation datasets. The results show that the method reaches a Spearman correlation coefficient of 0.686 on the English phrase similarity dataset and 0.846 on the Chinese phrase similarity dataset, and an F1 score of 0.715 on the phrase classification dataset. Overall, it outperforms strong baseline methods with good robustness.
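The weighted loss described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the importance weight of each component word is its (normalized) cosine similarity to the distributed phrase embedding, and that the weighted mean squared error is taken over per-word reconstruction errors. The function names (`cosine`, `weighted_mse_loss`) are hypothetical.

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity between two vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def weighted_mse_loss(phrase_emb, word_embs, recon_embs):
    # Weight each component word by its cosine similarity to the
    # distributed phrase embedding (normalized to sum to 1), then
    # compute a weighted MSE between the reconstructed and original
    # component word embeddings.
    sims = np.array([cosine(w, phrase_emb) for w in word_embs])
    weights = sims / sims.sum()                      # importance weights
    per_word = ((recon_embs - word_embs) ** 2).mean(axis=1)
    return float(np.dot(weights, per_word))
```

In this sketch a perfect reconstruction yields a loss of zero, and words more similar to the phrase embedding contribute more to the loss, which is the intuition behind the importance weighting in the abstract.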
Data availability
The data in this paper can be obtained by contacting the corresponding author by email.
Code availability
The code in this paper can be obtained by contacting the corresponding author by email.
Funding
This study was funded by the Fundamental Research Funds for the Central Universities (Nos. 3072022CF0601 and 3072022CFJ0602) and by the Youth Science Foundation of Heilongjiang Institute of Technology (No. 2022QJ06).
Author information
Authors and Affiliations
Contributions
Conceptualization: [Rongsheng Li]; Methodology: [Rongsheng Li], [Naiyu Yan], [Chi Wei]; Formal analysis and investigation: [Rongsheng Li]; Writing - original draft preparation: [Rongsheng Li]; Writing - review and editing: [Rongsheng Li], [Chi Wei]; Software: [Rongsheng Li], [Naiyu Yan]; Supervision: [Shaobin Huang].
Corresponding author
Ethics declarations
Conflict of interest
The authors have no relevant financial or non-financial interests to disclose.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Li, R., Wei, C., Huang, S. et al. Self-supervised phrase embedding method by fusing internal and external semantic information of phrases. Multimed Tools Appl 82, 20477–20495 (2023). https://doi.org/10.1007/s11042-022-14312-x