Research article · DOI: 10.1145/3578741.3578759

Keyword Extractor for Contrastive Learning of Unsupervised Sentence Embedding

Published: 06 March 2023

Abstract

Contrastive learning has been widely applied to learning unsupervised sentence embeddings. One major method is unsupervised SimCSE, which uses only random dropout as noise to build positive pairs for contrastive learning. However, token-level matching does not always properly represent the similarity between texts; a sentence-level semantic representation grounded in keyword-level information can improve similarity matching between texts. To emphasize the contribution of keywords to sentence representation, we propose KESimCSE, which retains the contrastive learning of sentence embeddings and adds to the loss function a KL divergence term between two quantities. The first is the dot product of each token embedding with the [CLS] embedding of the BERT outputs, since [CLS] generally captures the semantics of the whole sentence. The second is the list of token weights obtained through a keyword extraction method. Our experiments show that the averaged Spearman's correlation of KESimCSE on semantic textual similarity tasks rises to 77.21%, which outperforms unsupervised SimCSE by nearly 1.30%.
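The abstract only names the two terms of the objective, so here is a minimal PyTorch sketch of how they could be combined. This is not the authors' implementation: the function name `kesimcse_loss`, the softmax normalization of the token–[CLS] dot products, the direction of the KL divergence, and the weighting factor `lam` are all assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def kesimcse_loss(cls_a, cls_b, token_embs, attn_mask, keyword_weights,
                  tau=0.05, lam=0.1):
    """Hypothetical sketch of the KESimCSE objective described above.

    cls_a, cls_b:     (B, H) [CLS] embeddings of the same batch from two
                      dropout-noised forward passes (the SimCSE positives).
    token_embs:       (B, T, H) last-layer token embeddings from BERT.
    attn_mask:        (B, T) 1 for real tokens, 0 for padding.
    keyword_weights:  (B, T) non-negative token weights from a keyword
                      extractor (e.g. YAKE/KeyBERT scores mapped to tokens),
                      zero on padding and summing to a positive value per row.
    """
    # 1) Unsupervised SimCSE term: dropout views of the same sentence are
    #    positives; other in-batch sentences serve as negatives.
    za, zb = F.normalize(cls_a, dim=-1), F.normalize(cls_b, dim=-1)
    sim = za @ zb.t() / tau                           # (B, B) cosine / temperature
    labels = torch.arange(sim.size(0), device=sim.device)
    contrastive = F.cross_entropy(sim, labels)

    # 2) KL term: turn the token-vs-[CLS] dot products and the keyword
    #    weights into distributions over tokens, then align them.
    scores = torch.einsum('bth,bh->bt', token_embs, cls_a)
    scores = scores.masked_fill(attn_mask == 0, float('-inf'))
    p = F.softmax(scores, dim=-1)                     # model's token salience
    q = keyword_weights / keyword_weights.sum(-1, keepdim=True)
    q = q.clamp_min(1e-8)                             # avoid log(0) at padding
    kl = F.kl_div(q.log(), p, reduction='batchmean')  # KL(p || q); p is 0 at padding

    return contrastive + lam * kl

# Toy usage with random tensors: batch of 4 sentences, 8 tokens, hidden size 16.
B, T, H = 4, 8, 16
loss = kesimcse_loss(torch.randn(B, H), torch.randn(B, H),
                     torch.randn(B, T, H), torch.ones(B, T), torch.rand(B, T))
```

In an actual run, the two [CLS] views would come from encoding the same batch twice with dropout enabled, and the keyword weights from an off-the-shelf extractor; only the KL term is new relative to unsupervised SimCSE.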

References

[1]
Eneko Agirre, Carmen Banea, Claire Cardie, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, Weiwei Guo, Iñigo Lopez-Gazpio, Montse Maritxalar, Rada Mihalcea, German Rigau, Larraitz Uria, and Janyce Wiebe. 2015. SemEval-2015 Task 2: Semantic Textual Similarity, English, Spanish and Pilot on Interpretability. NAACL (2015).
[2]
Eneko Agirre, Carmen Banea, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, Rada Mihalcea, German Rigau, and Janyce Wiebe. 2016. SemEval-2016 Task 1: Semantic Textual Similarity, Monolingual and Cross-Lingual Evaluation. NAACL (2016).
[3]
Sanjeev Arora, Yingyu Liang, and Tengyu Ma. 2017. A simple but tough-to-beat baseline for sentence embeddings. ICLR (2017).
[4]
David M Blei, Andrew Y Ng, and Michael I Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research 3 (Jan 2003), 993–1022.
[5]
Ricardo Campos, Vítor Mangaravite, Arian Pasquali, Alípio Mário Jorge, Célia Nunes, and Adam Jatowt. 2020. YAKE! Keyword extraction from single documents using multiple local features. Information Sciences (2020).
[6]
Daniel Cer, Mona Diab, Eneko Agirre, Iñigo Lopez-Gazpio, and Lucia Specia. 2017. SemEval-2017 Task 1: Semantic Textual Similarity - Multilingual and Cross-lingual Focused Evaluation. arXiv: Computation and Language (2017).
[7]
Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey E. Hinton. 2020. A Simple Framework for Contrastive Learning of Visual Representations. ICML (2020).
[8]
Alexis Conneau and Douwe Kiela. 2018. SentEval: An Evaluation Toolkit for Universal Sentence Representations. LREC (2018).
[9]
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL (2018).
[10]
Tianyu Gao, Xingcheng Yao, and Danqi Chen. 2021. SimCSE: Simple Contrastive Learning of Sentence Embeddings. EMNLP (2021), 6894–6910.
[11]
Maarten Grootendorst. 2020. KeyBERT: Minimal keyword extraction with BERT. https://doi.org/10.5281/zenodo.4461265
[12]
James M Joyce. 2011. Kullback-Leibler Divergence. In International encyclopedia of statistical science. Springer, 720–722.
[13]
Taeuk Kim, Kang Min Yoo, and Sang-goo Lee. 2021. Self-Guided Contrastive Learning for BERT Sentence Representations. ACL-IJCNLP (2021), 2528–2540.
[14]
Ryan Kiros, Yukun Zhu, Ruslan Salakhutdinov, Richard S. Zemel, Antonio Torralba, Raquel Urtasun, and Sanja Fidler. 2015. Skip-thought vectors. NeurIPS (2015).
[15]
Guillaume Lample and Alexis Conneau. 2019. Cross-lingual Language Model Pretraining. NeurIPS (2019).
[16]
Jey Han Lau and Timothy Baldwin. 2016. An Empirical Evaluation of doc2vec with Practical Insights into Document Embedding Generation. ACL (2016), 78.
[17]
Bohan Li, Hao Zhou, Junxian He, Mingxuan Wang, Yiming Yang, and Lei Li. 2020. On the Sentence Embeddings from Pre-trained Language Models. EMNLP (2020).
[18]
Lajanugen Logeswaran and Honglak Lee. 2018. An efficient framework for learning sentence representations. ICLR (2018).
[19]
Rada Mihalcea and Paul Tarau. 2004. TextRank: Bringing order into text. EMNLP (2004), 404–411.
[20]
Martin Müller, Marcel Salathé, and Per Egil Kummervold. 2020. COVID-Twitter-BERT: A Natural Language Processing Model to Analyse COVID-19 Content on Twitter. arXiv: Computation and Language (2020).
[21]
Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. EMNLP (2019).
[22]
Prafull Sharma and Yingbo Li. 2019. Self-Supervised Contextual Keyword and Keyphrase Retrieval with Self-Labelling. (2019). https://doi.org/10.20944/preprints201908.0073.v1
[23]
Jianlin Su, Jiarun Cao, Weijie Liu, and Yangyiwen Ou. 2021. Whitening Sentence Representations for Better Semantics and Faster Retrieval. arXiv: Computation and Language (2021).
[24]
Si Sun, Chenyan Xiong, Zhenghao Liu, Zhiyuan Liu, and Jie Bao. 2020. Joint Keyphrase Chunking and Salience Ranking with BERT. CoRR abs/2004.13639 (2020).
[25]
Yi Sun, Hangping Qiu, Yu Zheng, Zhongwei Wang, and Chaoran Zhang. 2020. SIFRank: A New Baseline for Unsupervised Keyphrase Extraction Based on Pre-Trained Language Model. IEEE Access (2020).
[26]
Zhuofeng Wu, Sinong Wang, Jiatao Gu, Madian Khabsa, Fei Sun, and Hao Ma. 2020. CLEAR: Contrastive Learning for Sentence Representation. arXiv: Computation and Language (2020).
[27]
Ziyi Yang, Yinfei Yang, Daniel Cer, Jax Law, and Eric Darve. 2021. Universal Sentence Representation Learning with Conditional Masked Language Model. EMNLP (2021), 6216–6228.
[28]
Li Zhang, Jun Li, and Chao Wang. 2017. Automatic synonym extraction using Word2Vec and spectral clustering. In 2017 36th Chinese Control Conference (CCC). IEEE, 5629–5632.
[29]
Yan Zhang, Ruidan He, Zuozhu Liu, Kwan Hui Lim, and Lidong Bing. 2020. An Unsupervised Sentence Embedding Method by Mutual Information Maximization. EMNLP (2020), 1601–1610.

Cited By

  • (2025) SAEQ: Semantic anomaly event quantifier for event detection and judgement in social media. Expert Systems with Applications. https://doi.org/10.1016/j.eswa.2025.126522. Online publication date: Jan 2025.


Published In

MLNLP '22: Proceedings of the 2022 5th International Conference on Machine Learning and Natural Language Processing
December 2022, 406 pages
ISBN: 9781450399067
DOI: 10.1145/3578741

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

1. contrastive learning
2. keyword extraction
3. pre-trained language model
4. sentence embedding

Qualifiers

• Research-article
• Research
• Refereed limited

Conference

MLNLP 2022

