Generative non-autoregressive unsupervised keyphrase extraction with neural topic modeling

https://doi.org/10.1016/j.engappai.2023.105934

Abstract

Unsupervised keyphrase extraction (UKE) aims to detect a set of keyphrases for a document without using any annotation signal to train the UKE model. Existing UKE models, however, either fail to solve the issue of low keyphrase relevance or suffer from unsatisfactory keyphrase coverage. Besides, the current strong-performing neural UKE methods unfortunately come at the cost of lower decoding efficiency. In this paper we propose a novel framework for UKE. We model the task as a fully generative process with an encoder–decoder framework, which unifies the prediction of both present and absent keyphrases in an end-to-end manner. We accelerate decoding by installing a non-autoregressive decoder, which yields all keyphrases in parallel. To enhance keyphrase relevance, we investigate a neural topic module that learns latent topic information in an unsupervised manner, aiding the semantic understanding of the input document. To strengthen keyphrase coverage, we present an anchor-region scheme for the extraction of present keyphrases, which first locates the anchor points of present phrases and then determines the final boundaries of the keyphrases. The initial anchor phrases are organized into an anchor-aware graph, which is modeled by our proposed topic-guided graph attention network to capture document contexts at the global level. Over six benchmark UKE datasets, our system shows better experimental results than existing strong-performing baselines, becoming the new state-of-the-art UKE model. Further in-depth analyses show that our model effectively resolves both the relevance and coverage issues of keyphrase extraction, while being faster at decoding.

Introduction

Keyphrase extraction (KPE) is a natural language processing (NLP) task that seeks to extract, from a given document, a set of key words or phrases that describe the document (Hasan and Ng, 2014). Keyphrases help readers quickly grasp the key idea of a document or article, and automatic KPE greatly eases the indexing and retrieval of scientific publications whose documents have no associated keyphrases. Training a KPE model requires a vast amount of labeled data, i.e., supervised KPE (Alzaidy et al., 2019, Florescu and Jin, 2019, Lai et al., 2020, Song et al., 2021). However, manual data annotation is expensive and labor-intensive. To combat this, the task of unsupervised keyphrase extraction is introduced, which learns a KPE model without relying on data labels (Mihalcea, 2004, Wan and Xiao, 2008, Bougouin et al., 2013, Boudin, 2018, Sun et al., 2020, Liang et al., 2021).

Earlier methods for UKE are mostly based on statistical machine learning techniques, e.g., utilizing candidate position, frequency, length, and capitalization to determine the importance of a word (Mihalcea and Tarau, 2004, Rose et al., 2010, Beliga et al., 2016, Campos et al., 2018). In recent decades, deep learning has achieved great success, and deep neural methods have thus been extensively employed for building strong UKE models (Zhang et al., 2018, Prasad and Kan, 2019, Kim et al., 2021, Wu et al., 2021a). Unfortunately, almost all current neural UKE models adopt a sequence labeling scheme for keyphrase extraction, which can detect only present keyphrases while missing absent ones. As depicted in Fig. 1(a), it is quite common for keyphrases not to occur (fully or partially) in the document. Without accurately modeling absent keyphrases, the overall UKE performance is inevitably suboptimal.

On the other hand, the key to high-performance UKE lies in both high relevance and high coverage of keyphrases; preserving both, however, is a challenge in existing work, as they are two sides of the same coin. (1) Keyphrase relevance reflects the extent to which the keyphrases capture the semantics of the document text. Especially in the unsupervised learning scenario, topic information is vital to UKE (Bougouin et al., 2013, Zhao et al., 2011, Teneva and Cheng, 2017, Wang et al., 2018). Ideal keyphrases should describe the relevant topics of the document well at the semantic level. Unfortunately, existing neural UKE models fail to model topic knowledge, i.e., they are topic-independent (Mihalcea and Tarau, 2004, Campos et al., 2018). (2) Coverage emphasizes the ability to model globally informative features, i.e., generating different keyphrases that describe the document comprehensively from different angles. Most earlier (i.e., statistics-based) UKE methods focus too heavily on modeling localized features (Bougouin et al., 2013, Boudin, 2018). As a consequence, they are inclined to extract homogeneous keyphrases with little diversity and poor coverage. Existing UKE methods thus suffer from either weak relevance or weak coverage, as illustrated in Fig. 1(b).

In this paper we present a novel neural UKE model that effectively addresses all the above challenges. First of all, unlike the current sequence labeling scheme for UKE, we model the task as a fully generative process with an encoder–decoder framework, which enables predicting both present and absent keyphrases in an end-to-end manner. However, directly using an existing sequential generative decoder (e.g., an LSTM predicting from left to right) to produce keyphrases could be problematic, as keyphrases essentially have no sequential dependence or order among them. Alternatively, we employ a non-autoregressive decoder, which yields all possible keyphrases in one shot, as shown in Fig. 2.
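As a toy illustration of the idea (not the paper's actual model), the sketch below contrasts non-autoregressive decoding with left-to-right generation: every keyphrase "slot" attends to the encoder states independently, so all slot predictions can be computed in one parallel step. The stand-in random encoder, the dimensions, and the tiny vocabulary are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(tokens, dim=8):
    """Stand-in encoder: one random vector per input token (assumption)."""
    return rng.standard_normal((len(tokens), dim))

def non_autoregressive_decode(enc_states, num_slots, vocab):
    """All slots attend to the encoder states independently, so every
    slot's prediction is produced in a single parallel step, with no
    slot conditioning on a previously generated one."""
    dim = enc_states.shape[1]
    slot_queries = rng.standard_normal((num_slots, dim))   # one query per slot
    scores = slot_queries @ enc_states.T                   # (slots, tokens)
    attn = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    context = attn @ enc_states                            # (slots, dim)
    vocab_emb = rng.standard_normal((len(vocab), dim))
    logits = context @ vocab_emb.T                         # (slots, vocab)
    return [vocab[i] for i in logits.argmax(axis=1)]

tokens = "neural topic model for keyphrase extraction".split()
vocab = ["topic", "model", "keyphrase", "extraction", "neural"]
preds = non_autoregressive_decode(encode(tokens), num_slots=3, vocab=vocab)
print(len(preds))  # three slot predictions, all produced in one pass
```

An autoregressive decoder would instead loop `num_slots` times, feeding each prediction back in; the parallel formulation removes that sequential dependency, which is the source of the decoding speedup the paper claims.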

In our framework, we adopt an anchor-region scheme for the extraction of present keyphrases, which first locates the anchor points (i.e., fuzzy boundaries) of present phrases based on n-grams, and then determines the final boundaries of the keyphrases. As the detection of absent keyphrases relies more on the text semantics, we propose a neural topic module (NTM), built on the variational autoencoder technique, to learn topic features in an unsupervised manner. We further introduce a novel anchor-aware graph (A2G) module, where the initial anchor present phrases are organized into a graph. We devise a topic-guided graph attention network (namely, TgGAT) to model the graph, which, with the guidance of the topic information learned from the NTM, can model the global contexts. During the graph propagation, the n-gram anchor phrases are ranked and filtered effectively. Finally, the non-autoregressive decoder produces all present and absent keyphrases.
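The first half of the anchor-region idea can be sketched in a few lines (a hypothetical illustration, not the paper's actual scoring): enumerate n-gram spans as anchor candidates, then keep a span only if a boundary test holds. The stopword-based boundary rule here is a toy stand-in for the model's learned boundary determination.

```python
def ngram_anchors(tokens, max_n=3):
    """Enumerate all n-gram spans (start, end) of up to max_n tokens;
    these play the role of the fuzzy anchor candidates."""
    spans = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            spans.append((i, i + n))
    return spans

def filter_boundaries(tokens, spans, stopwords):
    """Toy boundary rule (assumption): a kept phrase must not start or
    end on a stopword. The paper instead ranks and filters anchors via
    graph propagation over the anchor-aware graph."""
    keep = []
    for i, j in spans:
        if tokens[i] not in stopwords and tokens[j - 1] not in stopwords:
            keep.append(" ".join(tokens[i:j]))
    return keep

tokens = "the neural topic model improves keyphrase coverage".split()
stop = {"the", "improves"}
candidates = filter_boundaries(tokens, ngram_anchors(tokens, max_n=2), stop)
print(candidates)  # spans such as "neural topic" survive; "the neural" does not
```

In the full framework these surviving candidates would become nodes of the A2G graph, over which the TgGAT ranks them with topic guidance.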

We carry out experiments on a total of six benchmark UKE datasets, including Inspec, SemEval2017, SemEval2010, DUC2001, Krapivin and NUS, as well as real-time Twitter data. The results show that our proposed method significantly outperforms the current best-performing systems on all the benchmark datasets. Further in-depth analyses reveal the advantages of our neural UKE framework, including both higher relevance and coverage of keyphrase extraction, and faster inference via non-autoregressive decoding.

To summarize, we in this work make the following contributions.

  • We propose to solve UKE with a generative encoder–decoder neural framework, which is able to detect both absent and present keyphrases.

  • We adopt a non-autoregressive decoding scheme for the keyphrase generation, which helps achieve faster prediction than existing neural UKE models.

  • We devise a neural topic module, which automatically induces topic information to better capture the text semantics for UKE.

  • We introduce an anchor-aware graph to model the n-gram graph of present phrases at the global level, which is modeled with a novel topic-guided graph attention network.

The remainder of the article is organized as follows. Section 2 surveys the related work. Section 3 elaborates our UKE method in detail, including the task formalization and the framework. In Section 4 we describe the experimental setups, and in Section 5 we give all the results, including the ablation analysis. Section 6 further gives in-depth analyses concerning the task and presents more insights into our proposed method. Finally, Section 7 concludes this work and indicates potential future work.

Section snippets

Related work

Automatic keyphrase extraction is defined as automatically extracting a set of important and summarizing phrases from a document (Papagiannopoulou and Tsoumakas, 2020), and is one of the key subtasks of information extraction (Li et al., 2021, Li et al., 2022, Shi et al., 2022, Chen et al., 2022, Cao et al., 2022) in NLP. Throughout the history of KPE, the earliest annotated keyphrase extraction corpora are mostly derived from scientific domains, including technical reports and scientific

Task formalization

We first describe the modeling of the UKE task. Given an input document text D = {w1, w2, …, wn} (n is the text length), the task is to output a set of keyphrases C = {c1, c2, …, cm} (m is the number of keyphrases), where a keyphrase ci may be a substring of the document (i.e., ci ∈ D), in which case it is a present keyphrase; or a string not occurring in D, in which case it is an absent keyphrase.
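The present/absent distinction above reduces to a simple membership check, which the following minimal sketch (an illustration, not part of the proposed model) makes concrete: a keyphrase is present iff it occurs as a contiguous token span of D.

```python
def is_present(keyphrase, document_tokens):
    """Return True iff the keyphrase occurs as a contiguous token span
    of the document; otherwise it is an absent keyphrase."""
    kp = keyphrase.split()
    n = len(kp)
    return any(document_tokens[i:i + n] == kp
               for i in range(len(document_tokens) - n + 1))

doc = "we study unsupervised keyphrase extraction with topic models".split()
print(is_present("keyphrase extraction", doc))  # True  -> present keyphrase
print(is_present("topic modeling", doc))        # False -> absent keyphrase
```

Note that "topic modeling" is absent even though the closely related span "topic models" appears: absence is defined at the surface-string level, which is exactly why generative decoding, rather than extractive labeling, is needed to produce such keyphrases.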

Framework overview

We here show the overall architecture of our framework. As illustrated in Fig. 2, we take the encoder–decoder backbone

Experimental setups

In this section, we evaluate both the efficacy and efficiency of our proposed UKE system. We first describe the datasets as well as the other detailed experimental setups, and then present the experimental results along with in-depth analyses.

Experimental results

In this section, we present the experimental results of the baseline methods and our proposed model. Besides, we show the ablation analysis of our system in terms of different modules.

Analyses and discussions

In this section, we further conduct in-depth analyses and present more insights into our proposed method to better understand its strengths.

Conclusion and future work

In this paper we propose a novel model for unsupervised keyphrase extraction (UKE). We model the task as a fully generative process with an encoder–decoder framework so as to unify the prediction of both present and absent keyphrases in an end-to-end manner. We introduce a neural topic module (NTM) to learn latent topic features in an unsupervised manner, aiding the semantic understanding of the input document. We present an anchor-region scheme for the extraction of present keyphrases, which

CRediT authorship contribution statement

Xun Zhu: Conceptualization, Programming, Writing – original draft. Yinxia Lou: Methodology, Writing & editing. Jing Zhao: Writing & editing. Wang Gao: Experiments, Review. Hongtao Deng: Project administration, Supervision, Review.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work is supported by the Project of Wuhan Education Bureau, China (No. 2019067), the Research Project of Jianghan University, China (No. 2021yb062), the Doctor Scientific Research Fund of Jianghan University, China (No. 2021010).

References (87)

  • Alzaidy, R., Caragea, C., Giles, C.L., 2019. Bi-LSTM-CRF Sequence Labeling for Keyphrase Extraction from Scholarly...
  • Augenstein, I., Das, M., Riedel, S., Vikraman, L., McCallum, A., 2017. SemEval 2017 Task 10: ScienceIE - Extracting...
  • Aziz, W., Schulz, P., 2018. Variational Inference and Deep Generative Models. In: Proceedings of the 56th Annual...
  • Bahdanau, D., Cho, K., Bengio, Y., 2015. Neural Machine Translation by Jointly Learning to Align and Translate. In:...
  • Bahuleyan, H., et al., 2017. Variational attention for sequence-to-sequence models.
  • Beliga, S., et al., 2016. Selectivity-based keyword extraction method. Int. J. Semant. Web Inf. Syst.
  • Bennani-Smires, K., Musat, C., Hossmann, A., Baeriswyl, M., Jaggi, M., 2018. Simple Unsupervised Keyphrase Extraction...
  • Bojanowski, P., et al., 2017. Enriching word vectors with subword information. Trans. Assoc. Comput. Linguist.
  • Boudin, F., 2018. Unsupervised Keyphrase Extraction with Multipartite Graphs. In: Proceedings of the 2018 Conference of...
  • Boudin, F., Gallina, Y., Aizawa, A., 2020. Keyphrase Generation for Scientific Document Retrieval. In: Proceedings of...
  • Bougouin, A., Boudin, F., Daille, B., 2013. TopicRank: Graph-Based Topic Ranking for Keyphrase Extraction. In:...
  • Bowman, S.R., et al., 2015. Generating sentences from a continuous space.
  • Campos, R., Mangaravite, V., Pasquali, A., Jorge, A.M., Nunes, C., Jatowt, A., 2018. YAKE! Collection-Independent...
  • Cao, H., Li, J., Su, F., Li, F., Fei, H., Wu, S., Li, B., Zhao, L., Ji, D., 2022. OneEE: A One-Stage Framework for Fast...
  • Chaudhari, P., et al., 2020. Data augmentation using MG-GAN for improved cancer classification on gene expression data. Soft Comput.
  • Chen, S., Shi, X., Li, J., Wu, S., Fei, H., Li, F., Ji, D., 2022. Joint Alignment of Multi-Task Feature and Label...
  • Chopra, S., Auli, M., Rush, A.M., 2016. Abstractive Sentence Summarization with Attentive Recurrent Neural Networks....
  • Devika, R., et al., 2021. A deep learning model based on BERT and sentence transformer for semantic keyphrase extraction on big social data. IEEE Access.
  • Devlin, J., Chang, M.-W., Lee, K., Toutanova, K., 2019. BERT: Pre-training of Deep Bidirectional Transformers for...
  • Ding, H., Luo, X., 2021. AttentionRank: Unsupervised Keyphrase Extraction using Self and Cross Attentions. In:...
  • Dong, Z., Dong, Q., 2003. HowNet - a hybrid language and knowledge resource. In: International Conference on Natural...
  • Ekstedt, E., et al. TurnGPT: a transformer-based language model for predicting turn-taking in spoken dialog.
  • Fei, H., et al., 2022. On the robustness of aspect-based sentiment analysis: Rethinking model, data and training. ACM Trans. Inf. Syst. (TOIS).
  • Fei, H., Li, F., Li, B., Ji, D., 2021a. Encoder-Decoder Based Unified Semantic Role Labeling with Label-Aware Syntax....
  • Fei, H., Li, J., Wu, S., Li, C., Ji, D., Li, F., 2022b. Global Inference with Explicit Syntactic and Discourse...
  • Fei, H., Ren, Y., Ji, D., 2020a. Retrofitting Structure-aware Transformer Language Model for End Tasks. In: Proceedings...
  • Fei, H., Ren, Y., Wu, S., Li, B., Ji, D., 2021b. Latent Target-Opinion as Prior for Document-Level Sentiment...
  • Fei, H., et al., 2021. Nonautoregressive encoder-decoder neural framework for end-to-end aspect-based sentiment triplet extraction. IEEE Trans. Neural Netw. Learn. Syst.
  • Fei, H., et al., 2021. Enriching contextualized language model from knowledge graph for biomedical information extraction. Brief. Bioinform.
  • Fei, H., et al. LasUIE: Unifying information extraction with latent adaptive structure-aware generative language model.
  • Fei, H., et al. Better combine them together! Integrating syntactic constituency and dependency representations for semantic role labeling.
  • Fei, H., Wu, S., Ren, Y., Zhang, M., 2022c. Matching Structure for Dual Learning. In: Proceedings of the International...
  • Fei, H., Wu, S., Zhang, M., Ren, Y., Ji, D., 2022d. Conversational Semantic Role Labeling with Predicate-Oriented...
  • Fei, H., Zhang, M., Ji, D., 2020c. Cross-Lingual Semantic Role Labeling with High-Quality Translated Training Corpus....
  • Fei, H., Zhang, M., Li, B., Ji, D., 2021f. End-to-end Semantic Role Labeling with Neural Transition-based Model. In:...
  • Fei, H., Zhang, Y., Ren, Y., Ji, D., 2020d. Latent Emotion Memory for Multi-Label Emotion Classification. In:...
  • Florescu, C., Jin, W., 2019. A Supervised Keyphrase Extraction System Based on Graph Representation Learning. In:...
  • Goyal, A.G.A.P., Sordoni, A., Côté, M.-A., Ke, N.R., Bengio, Y., 2017. Z-forcing: Training stochastic recurrent...
  • Guo, J., Xu, L., Chen, E., 2020. Jointly Masked Sequence-to-Sequence Model for Non-Autoregressive Neural Machine...
  • Hasan, K.S., Ng, V., 2014. Automatic Keyphrase Extraction: A Survey of the State of the Art. In: Proceedings of the...
  • Hulth, A., 2003. Improved Automatic Keyword Extraction Given More Linguistic Knowledge. In: Proceedings of the 2003...
  • Kim, S.N., Medelyan, O., Kan, M.-Y., Baldwin, T., 2010. SemEval-2010 Task 5: Automatic Keyphrase Extraction from...
  • Kim, J., Song, Y., Hwang, S., 2021. Web Document Encoding for Structure-Aware Keyphrase Extraction. In: Proceedings of...