Research article · DOI: 10.1145/3578741.3578759

Keyword Extractor for Contrastive Learning of Unsupervised Sentence Embedding

Published: 06 March 2023

Abstract

Contrastive learning has been widely applied to learning unsupervised sentence embeddings. One major method is unsupervised SimCSE, which uses only random dropout as noise to build positive pairs for contrastive learning. However, token-level matching does not always properly represent the similarity between texts; a sentence-level semantic representation grounded in keyword-level information can improve similarity matching between texts. To emphasize the contribution of keywords to sentence representation, we propose KESimCSE, which retains the contrastive learning of sentence embeddings and adds to the loss function a KL divergence term between two quantities. The first is the dot product of each token embedding with the [CLS] embedding of the BERT outputs, since [CLS] generally captures the semantics of the whole sentence. The second is the list of token weights obtained through a keyword extraction method. Our experiments show that the averaged Spearman's correlation of KESimCSE on semantic textual similarity tasks rises to 77.21%, which outperforms unsupervised SimCSE by nearly 1.30%.
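The abstract only names the two terms of the objective, so here is a minimal PyTorch sketch of how they could be combined. This is not the authors' implementation: the function name `kesimcse_loss`, the softmax normalization of the token–[CLS] dot products, the direction of the KL divergence, and the weighting factor `lam` are all assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def kesimcse_loss(cls_a, cls_b, token_embs, attn_mask, keyword_weights,
                  tau=0.05, lam=0.1):
    """Hypothetical sketch of the KESimCSE objective described above.

    cls_a, cls_b:     (B, H) [CLS] embeddings of the same batch from two
                      dropout-noised forward passes (the SimCSE positives).
    token_embs:       (B, T, H) last-layer token embeddings from BERT.
    attn_mask:        (B, T) 1 for real tokens, 0 for padding.
    keyword_weights:  (B, T) non-negative token weights from a keyword
                      extractor (e.g. YAKE/KeyBERT scores mapped to tokens),
                      zero on padding and summing to a positive value per row.
    """
    # 1) Unsupervised SimCSE term: dropout views of the same sentence are
    #    positives; other in-batch sentences serve as negatives.
    za, zb = F.normalize(cls_a, dim=-1), F.normalize(cls_b, dim=-1)
    sim = za @ zb.t() / tau                           # (B, B) cosine / temperature
    labels = torch.arange(sim.size(0), device=sim.device)
    contrastive = F.cross_entropy(sim, labels)

    # 2) KL term: turn the token-vs-[CLS] dot products and the keyword
    #    weights into distributions over tokens, then align them.
    scores = torch.einsum('bth,bh->bt', token_embs, cls_a)
    scores = scores.masked_fill(attn_mask == 0, float('-inf'))
    p = F.softmax(scores, dim=-1)                     # model's token salience
    q = keyword_weights / keyword_weights.sum(-1, keepdim=True)
    q = q.clamp_min(1e-8)                             # avoid log(0) at padding
    kl = F.kl_div(q.log(), p, reduction='batchmean')  # KL(p || q); p is 0 at padding

    return contrastive + lam * kl

# Toy usage with random tensors: batch of 4 sentences, 8 tokens, hidden size 16.
B, T, H = 4, 8, 16
loss = kesimcse_loss(torch.randn(B, H), torch.randn(B, H),
                     torch.randn(B, T, H), torch.ones(B, T), torch.rand(B, T))
```

In an actual run, the two [CLS] views would come from encoding the same batch twice with dropout enabled, and the keyword weights from an off-the-shelf extractor; only the KL term is new relative to unsupervised SimCSE.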

References

[1]
Eneko Agirre, Carmen Banea, Claire Cardie, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, Weiwei Guo, Iñigo Lopez-Gazpio, Montse Maritxalar, Rada Mihalcea, German Rigau, Larraitz Uria, and Janyce Wiebe. 2015. SemEval-2015 Task 2: Semantic Textual Similarity, English, Spanish and Pilot on Interpretability. NAACL (2015).
[2]
Eneko Agirre, Carmen Banea, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, Rada Mihalcea, German Rigau, and Janyce Wiebe. 2016. SemEval-2016 Task 1: Semantic Textual Similarity, Monolingual and Cross-Lingual Evaluation. NAACL (2016).
[3]
Sanjeev Arora, Yingyu Liang, and Tengyu Ma. 2017. A simple but tough-to-beat baseline for sentence embeddings. ICLR (2017).
[4]
David M Blei, Andrew Y Ng, and Michael I Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research 3 (Jan 2003), 993–1022.
[5]
Ricardo Campos, Vítor Mangaravite, Arian Pasquali, Alípio Mário Jorge, Célia Nunes, and Adam Jatowt. 2020. YAKE! Keyword extraction from single documents using multiple local features. Information Sciences (2020).
[6]
Daniel Cer, Mona Diab, Eneko Agirre, Iñigo Lopez-Gazpio, and Lucia Specia. 2017. SemEval-2017 Task 1: Semantic Textual Similarity - Multilingual and Cross-lingual Focused Evaluation. arXiv: Computation and Language (2017).
[7]
Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey E. Hinton. 2020. A Simple Framework for Contrastive Learning of Visual Representations. ICML (2020).
[8]
Alexis Conneau and Douwe Kiela. 2018. SentEval: An Evaluation Toolkit for Universal Sentence Representations. LREC (2018).
[9]
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL (2018).
[10]
Tianyu Gao, Xingcheng Yao, and Danqi Chen. 2021. SimCSE: Simple Contrastive Learning of Sentence Embeddings. EMNLP (2021), 6894–6910.
[11]
Maarten Grootendorst. 2020. KeyBERT: Minimal keyword extraction with BERT. https://doi.org/10.5281/zenodo.4461265
[12]
James M Joyce. 2011. Kullback-Leibler Divergence. In International encyclopedia of statistical science. Springer, 720–722.
[13]
Taeuk Kim, Kang Min Yoo, and Sang-goo Lee. 2021. Self-Guided Contrastive Learning for BERT Sentence Representations. ACL-IJCNLP (2021), 2528–2540.
[14]
Ryan Kiros, Yukun Zhu, Ruslan Salakhutdinov, Richard S. Zemel, Antonio Torralba, Raquel Urtasun, and Sanja Fidler. 2015. Skip-thought vectors. NeurIPS (2015).
[15]
Guillaume Lample and Alexis Conneau. 2019. Cross-lingual Language Model Pretraining. NeurIPS (2019).
[16]
Jey Han Lau and Timothy Baldwin. 2016. An Empirical Evaluation of doc2vec with Practical Insights into Document Embedding Generation. ACL (2016), 78.
[17]
Bohan Li, Hao Zhou, Junxian He, Mingxuan Wang, Yiming Yang, and Lei Li. 2020. On the Sentence Embeddings from Pre-trained Language Models. EMNLP (2020).
[18]
Lajanugen Logeswaran and Honglak Lee. 2018. An efficient framework for learning sentence representations. ICLR (2018).
[19]
Rada Mihalcea and Paul Tarau. 2004. TextRank: Bringing order into text. EMNLP (2004), 404–411.
[20]
Martin Müller, Marcel Salathé, and Per Egil Kummervold. 2020. COVID-Twitter-BERT: A Natural Language Processing Model to Analyse COVID-19 Content on Twitter. arXiv: Computation and Language (2020).
[21]
Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. EMNLP (2019).
[22]
Prafull Sharma and Yingbo Li. 2019. Self-Supervised Contextual Keyword and Keyphrase Retrieval with Self-Labelling. (2019). https://doi.org/10.20944/preprints201908.0073.v1
[23]
Jianlin Su, Jiarun Cao, Weijie Liu, and Yangyiwen Ou. 2021. Whitening Sentence Representations for Better Semantics and Faster Retrieval. arXiv: Computation and Language (2021).
[24]
Si Sun, Chenyan Xiong, Zhenghao Liu, Zhiyuan Liu, and Jie Bao. 2020. Joint Keyphrase Chunking and Salience Ranking with BERT. CoRR abs/2004.13639 (2020).
[25]
Yi Sun, Hangping Qiu, Yu Zheng, Zhongwei Wang, and Chaoran Zhang. 2020. SIFRank: A New Baseline for Unsupervised Keyphrase Extraction Based on Pre-Trained Language Model. IEEE Access (2020).
[26]
Zhuofeng Wu, Sinong Wang, Jiatao Gu, Madian Khabsa, Fei Sun, and Hao Ma. 2020. CLEAR: Contrastive Learning for Sentence Representation. arXiv: Computation and Language (2020).
[27]
Ziyi Yang, Yinfei Yang, Daniel Cer, Jax Law, and Eric Darve. 2021. Universal Sentence Representation Learning with Conditional Masked Language Model. EMNLP (2021), 6216–6228.
[28]
Li Zhang, Jun Li, and Chao Wang. 2017. Automatic synonym extraction using Word2Vec and spectral clustering. In 2017 36th Chinese Control Conference (CCC). IEEE, 5629–5632.
[29]
Yan Zhang, Ruidan He, Zuozhu Liu, Kwan Hui Lim, and Lidong Bing. 2020. An Unsupervised Sentence Embedding Method by Mutual Information Maximization. EMNLP (2020), 1601–1610.

Cited By

  • (2025) SAEQ: Semantic anomaly event quantifier for event detection and judgement in social media. Expert Systems with Applications. https://doi.org/10.1016/j.eswa.2025.126522. Online publication date: Jan 2025.


Published In

MLNLP '22: Proceedings of the 2022 5th International Conference on Machine Learning and Natural Language Processing
December 2022, 406 pages
ISBN: 9781450399067
DOI: 10.1145/3578741

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

1. contrastive learning
2. keyword extraction
3. pre-trained language model
4. sentence embedding

Qualifiers

• Research-article
• Research
• Refereed limited

Conference

MLNLP 2022

