Abstract
The quality of phrase embeddings affects the performance of many downstream NLP tasks. Most existing phrase embedding methods either struggle to achieve satisfactory performance or sacrifice robustness in pursuit of it. To address these problems, this paper proposes an effective phrase embedding method called Multi-loss Optimized Self-supervised Phrase Embedding (MOSPE). The method feeds a pre-trained phrase embedding and its component word embeddings into an encoder composed of an LSTM, a fully connected network, and an attention mechanism to obtain an embedding vector. The entire network is then trained to reconstruct the original input from this embedding vector through multiple loss functions. The LSTM captures the sequence information of the component words, the attention mechanism captures the importance of the different component words, and the fully connected network integrates this information. The loss functions are weighted mean squared error losses. They use the cosine similarity between each component word embedding and the distributed embedding of the phrase to measure that word's importance weight, and they measure the ratio of the phrase's internal to external information through the cosine similarity between the element-wise sum of the component word embeddings and the phrase embedding. The method requires no supervised data and yields well-represented phrase embeddings. We conduct experiments with four evaluation methods on three widely used phrase embedding evaluation datasets. The results show that the method reaches a Spearman correlation coefficient of 0.686 on the English phrase similarity dataset and 0.846 on the Chinese phrase similarity dataset, and an F1 score of 0.715 on the phrase classification dataset. Overall, it outperforms strong baseline methods with good robustness.
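The weighted loss described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the importance weight of each component word is its (normalized) cosine similarity to the distributed phrase embedding, and that the weighted mean squared error is taken over per-word reconstruction errors. The function names (`cosine`, `weighted_mse_loss`) are hypothetical.

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity between two vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def weighted_mse_loss(phrase_emb, word_embs, recon_embs):
    # Weight each component word by its cosine similarity to the
    # distributed phrase embedding (normalized to sum to 1), then
    # compute a weighted MSE between the reconstructed and original
    # component word embeddings.
    sims = np.array([cosine(w, phrase_emb) for w in word_embs])
    weights = sims / sims.sum()                      # importance weights
    per_word = ((recon_embs - word_embs) ** 2).mean(axis=1)
    return float(np.dot(weights, per_word))
```

In this sketch a perfect reconstruction yields a loss of zero, and words more similar to the phrase embedding contribute more to the loss, which is the intuition behind the importance weighting in the abstract.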
Data availability
The data in this paper can be obtained by contacting the corresponding author by email.
Code availability
The code in this paper can be obtained by contacting the corresponding author by email.
Funding
This study was funded by the Fundamental Research Funds for the Central Universities (Nos. 3072022CF0601 and 3072022CFJ0602) and by the Youth Science Foundation of Heilongjiang Institute of Technology (No. 2022QJ06).
Author information
Authors and Affiliations
Contributions
Conceptualization: [Rongsheng Li]; Methodology: [Rongsheng Li], [Naiyu Yan], [Chi Wei]; Formal analysis and investigation: [Rongsheng Li]; Writing - original draft preparation: [Rongsheng Li]; Writing - review and editing: [Rongsheng Li], [Chi Wei]; Software: [Rongsheng Li], [Naiyu Yan]; Supervision: [Shaobin Huang].
Corresponding author
Ethics declarations
Conflict of interest
The authors have no relevant financial or non-financial interests to disclose.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Li, R., Wei, C., Huang, S. et al. Self-supervised phrase embedding method by fusing internal and external semantic information of phrases. Multimed Tools Appl 82, 20477–20495 (2023). https://doi.org/10.1007/s11042-022-14312-x