Generative non-autoregressive unsupervised keyphrase extraction with neural topic modeling
Introduction
Keyphrase extraction (KPE) is a natural language processing (NLP) task that, given a document, seeks to extract a set of key words or phrases that describe it (Hasan and Ng, 2014). Keyphrases help readers quickly grasp the key idea of a document or article, and automatic KPE greatly facilitates the indexing and retrieval of scientific publications whose documents have no associated keyphrases. Training a KPE model requires a vast amount of labeled data, i.e., supervised KPE (Alzaidy et al., 2019, Florescu and Jin, 2019, Lai et al., 2020, Song et al., 2021). However, manual data annotation is expensive and labor-intensive. To combat this, the task of unsupervised keyphrase extraction (UKE) has been introduced, which learns a KPE model without relying on data labels (Mihalcea, 2004, Wan and Xiao, 2008, Bougouin et al., 2013, Boudin, 2018, Sun et al., 2020, Liang et al., 2021).
Earlier methods for UKE are mostly based on statistical machine learning techniques, e.g., using candidate position, frequency, length, and capitalization to determine the importance of a word (Mihalcea and Tarau, 2004, Rose et al., 2010, Beliga et al., 2016, Campos et al., 2018). In recent years, deep learning has achieved great success, and deep neural methods have been extensively employed for building strong UKE models (Zhang et al., 2018, Prasad and Kan, 2019, Kim et al., 2021, Wu et al., 2021a). Unfortunately, almost all current neural UKE models adopt a sequence-labeling scheme for keyphrase extraction, which can only detect present keyphrases while missing absent keyphrases. As depicted in Fig. 1(a), it is quite common for keyphrases not to occur (fully or partially) in the document. Without accurately modeling absent keyphrases, the overall UKE performance is inevitably suboptimal.
On the other hand, the key to high-performance UKE lies in both high relevance and high coverage of the keyphrases; preserving both, however, is a challenge for existing work, as they are two sides of the same coin. (1) Keyphrase relevance reflects the extent to which the keyphrases capture the semantics of the document text. Especially in the unsupervised learning scenario, topic information is vital to UKE (Bougouin et al., 2013, Zhao et al., 2011, Teneva and Cheng, 2017, Wang et al., 2018). The ideal keyphrases should describe the relevant topics of the document well at the semantic level. Unfortunately, existing neural UKE models fail to model topic knowledge, i.e., they are topic-independent (Mihalcea and Tarau, 2004, Campos et al., 2018). (2) Coverage emphasizes the ability to model globally informative features, i.e., generating different keyphrases that describe the document comprehensively from different angles. Most earlier (i.e., statistics-based) UKE methods focus too heavily on modeling localized features (Bougouin et al., 2013, Boudin, 2018). As a consequence, they are inclined to extract homogeneous keyphrases with little diversity and poor coverage. Existing UKE methods thus suffer from either weak relevance or weak coverage, as illustrated in Fig. 1(b).
In this paper we present a novel neural UKE model that effectively addresses all the above challenges. First, unlike the current sequence-labeling scheme for UKE, we model the task as a fully generative process with an encoder–decoder framework, which enables predicting both present and absent keyphrases in an end-to-end manner. However, directly using an existing sequential generative decoder (e.g., an LSTM predicting from left to right) to produce keyphrases could be problematic, as keyphrases essentially have no sequential dependence or order among them. Instead, we employ a non-autoregressive decoder, which yields all possible keyphrases in one shot, as shown in Fig. 2.
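To make the one-shot idea concrete, the following is a minimal numpy sketch (not the authors' implementation; all dimensions are hypothetical and the parameters are randomly initialized rather than learned): each of M keyphrase slots attends over the encoded document and emits its prediction in parallel, with no left-to-right dependence between slots.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical sizes: n encoder positions, M keyphrase slots,
# d hidden dimension, V vocabulary size.
n, M, d, V = 12, 4, 16, 50
H = rng.normal(size=(n, d))      # encoder hidden states of the document
Q = rng.normal(size=(M, d))      # learned slot queries, one per keyphrase
W_out = rng.normal(size=(d, V))  # output projection to the vocabulary

# Non-autoregressive step: every slot attends to the document and
# predicts its keyphrase head token simultaneously.
att = softmax(Q @ H.T / np.sqrt(d), axis=-1)  # (M, n) attention over tokens
ctx = att @ H                                 # (M, d) slot-specific contexts
logits = ctx @ W_out                          # (M, V) logits, all in one shot
preds = logits.argmax(axis=-1)                # one head token per slot
```

In contrast to an autoregressive decoder, no slot waits for the output of another, which is the source of the inference speedup discussed later.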
In our framework, we adopt an anchor-region scheme for the extraction of present keyphrases, which first locates the anchor points (i.e., fuzzy boundaries) of present phrases based on n-grams, and then determines the final keyphrase boundaries. Since the detection of absent keyphrases relies more on text semantics, we propose a neural topic module (NTM), based on the variational autoencoder technique, to learn topic features without supervision. We further introduce a novel anchor-aware graph (A2G) module, in which the initial anchor present phrases are organized into a graph. We devise a topic-guided graph attention network (TgGAT) to model the graph, which, guided by the topic information learned by the NTM, can model global contexts. During graph propagation, the n-gram candidates of present phrases are ranked and filtered effectively. Finally, the non-autoregressive decoder produces all present and absent keyphrases.
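As a rough illustration of the candidate-anchor step (a simplified stand-in rather than the paper's exact procedure; the stopword list and filtering rule here are assumptions), one can enumerate n-gram spans of the document and discard spans that start or end with a stopword, leaving candidate anchors to be ranked later, e.g., by the topic-guided graph module:

```python
def ngram_anchors(tokens, max_n=3,
                  stopwords=frozenset({"the", "of", "a", "an", "and", "to"})):
    """Enumerate candidate n-gram spans (anchors) for present keyphrases.

    Spans that start or end with a stopword are discarded; survivors are
    (start, end, phrase) triples to be scored by a downstream ranker.
    """
    candidates = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            span = tokens[i:i + n]
            if span[0].lower() in stopwords or span[-1].lower() in stopwords:
                continue
            candidates.append((i, i + n, " ".join(span)))
    return candidates

doc = "unsupervised extraction of keyphrases from text".split()
cands = ngram_anchors(doc)
```

Note that inner stopwords are allowed ("extraction of keyphrases" survives), which keeps multi-word phrases with function words intact while pruning ill-formed boundaries.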
We carry out experiments on a total of six benchmark UKE datasets, including Inspec, SemEval2017, SemEval2010, DUC2001, Krapivin and NUS, as well as real-time Twitter data. The results show that our proposed method significantly outperforms the current best-performing systems on all the benchmark UKE datasets. Further in-depth analyses reveal the advantages of our neural UKE framework, including higher relevance and coverage of keyphrase extraction, and faster inference via non-autoregressive decoding.
To summarize, this work makes the following contributions.
- We propose to solve UKE with a generative encoder–decoder neural framework, which is able to detect both absent and present keyphrases.
- We adopt a non-autoregressive decoding scheme for keyphrase generation, which achieves faster prediction than existing neural UKE models.
- We devise a neural topic module, which automatically induces topic information to better capture text semantics for UKE.
- We introduce an anchor-aware graph to model the n-gram graph of present phrases at a global level, modeled with a novel topic-guided graph attention network.
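The neural topic module among these contributions can be illustrated with a minimal VAE-style sketch in numpy (a toy version under assumed dimensions, with random untrained weights; a real NTM would be trained with a reconstruction-plus-KL objective): encode the document's bag-of-words into Gaussian posterior parameters, sample with the reparameterization trick, and map the sample to topic proportions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical sizes: V vocabulary size, K latent topics, d hidden dim.
V, K, d = 30, 5, 16
W1 = rng.normal(size=(V, d)) * 0.1    # encoder weights (untrained)
W_mu = rng.normal(size=(d, K)) * 0.1  # projection to posterior mean
W_lv = rng.normal(size=(d, K)) * 0.1  # projection to posterior log-variance

def topic_features(bow):
    """Infer a document's topic mixture from its bag-of-words vector."""
    h = np.tanh(bow @ W1)                # encode the bag-of-words
    mu, logvar = h @ W_mu, h @ W_lv      # Gaussian posterior parameters
    eps = rng.normal(size=K)
    z = mu + np.exp(0.5 * logvar) * eps  # reparameterization trick
    theta = np.exp(z) / np.exp(z).sum()  # softmax -> topic proportions
    return theta

bow = rng.integers(0, 3, size=V).astype(float)  # toy word counts
theta = topic_features(bow)
```

The resulting topic vector theta is the kind of semantic signal that can then guide attention over the anchor graph.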
The remainder of the article is organized as follows. Section 2 surveys related work. Section 3 elaborates our UKE method in detail, including the task formalization and the framework. In Section 4 we describe the experimental setups, and in Section 5 we give all the results including the ablation analysis. Section 6 provides further in-depth analysis of the task and presents more insights into our proposed method. Finally, Section 7 concludes this work and indicates potential future directions.
Related work
Automatic keyphrase extraction is the task of automatically extracting a set of important, summarizing phrases from a document (Papagiannopoulou and Tsoumakas, 2020), and is one of the key subtasks of information extraction (Li et al., 2021, Li et al., 2022, Shi et al., 2022, Chen et al., 2022, Cao et al., 2022) in NLP. Throughout the history of KPE, the earliest annotated keyphrase extraction corpora are mostly derived from scientific domains, including technical reports and scientific
Task formalization
We first describe the modeling of the UKE task. Given an input document text X = (x_1, x_2, …, x_n) (n is the text length), the task is to output a set of keyphrases Y = {y_1, …, y_m} (m is the number of keyphrases), where a keyphrase y_i may be a substring of the document (i.e., occurring in X), making it a present keyphrase; or a string not occurring in X, making it an absent keyphrase.
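The present/absent distinction in this formalization can be sketched as a simple verbatim-match check (illustrative only; matching in practice may involve tokenization, stemming, or other normalization):

```python
def split_present_absent(document, keyphrases):
    """Partition keyphrases into present (substrings of the document)
    and absent (not occurring verbatim), per the task formalization."""
    doc = document.lower()
    present = [k for k in keyphrases if k.lower() in doc]
    absent = [k for k in keyphrases if k.lower() not in doc]
    return present, absent

doc = "deep learning methods for keyphrase extraction"
present, absent = split_present_absent(
    doc, ["keyphrase extraction", "neural topic model"])
```

Here "keyphrase extraction" is present while "neural topic model" is absent, which is exactly the case a pure sequence-labeling extractor cannot handle.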
Framework overview
We here show the overall architecture of our framework. As illustrated in Fig. 2, we take the encoder–decoder backbone
Experimental setups
In this section, we evaluate both the efficacy and efficiency of our proposed UKE system. We first describe the datasets as well as the other detailed experimental setups, and then present the experimental results along with in-depth analyses.
Experimental results
In this section, we present the experimental results of the baseline methods and our proposed model. We also show an ablation analysis of our system with respect to its different modules.
Analyses and discussions
In this section we conduct further in-depth analyses and present more insights into our proposed method for a better understanding of its strengths.
Conclusion and future work
In this paper we propose a novel model for unsupervised keyphrase extraction (UKE). We model the task as a fully generative process with an encoder–decoder framework so as to unify the prediction of both present and absent keyphrases in an end-to-end manner. We introduce a neural topic module (NTM) to learn latent topic features without supervision, aiding the semantic understanding of the input document. We present an anchor-region scheme for the extraction of present keyphrases, which
CRediT authorship contribution statement
Xun Zhu: Conceptualization, Programming, Writing – original draft. Yinxia Lou: Methodology, Writing & editing. Jing Zhao: Writing & editing. Wang Gao: Experiments, Review. Hongtao Deng: Project administration, Supervision, Review.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This work is supported by the Project of Wuhan Education Bureau, China (No. 2019067), the Research Project of Jianghan University, China (No. 2021yb062), the Doctor Scientific Research Fund of Jianghan University, China (No. 2021010).
References (87)
- Alzaidy, R., Caragea, C., Giles, C.L., 2019. Bi-LSTM-CRF Sequence Labeling for Keyphrase Extraction from Scholarly...
- Augenstein, I., Das, M., Riedel, S., Vikraman, L., McCallum, A., 2017. SemEval 2017 Task 10: ScienceIE - Extracting...
- Aziz, W., Schulz, P., 2018. Variational Inference and Deep Generative Models. In: Proceedings of the 56th Annual...
- Bahdanau, D., Cho, K., Bengio, Y., 2015. Neural Machine Translation by Jointly Learning to Align and Translate. In:...
- et al., 2017. Variational attention for sequence-to-sequence models.
- et al., 2016. Selectivity-based keyword extraction method. Int. J. Semant. Web Inf. Syst.
- Bennani-Smires, K., Musat, C., Hossmann, A., Baeriswyl, M., Jaggi, M., 2018. Simple Unsupervised Keyphrase Extraction...
- et al., 2017. Enriching word vectors with subword information. Trans. Assoc. Comput. Linguist.
- Boudin, F., 2018. Unsupervised Keyphrase Extraction with Multipartite Graphs. In: Proceedings of the 2018 Conference of...
- Boudin, F., Gallina, Y., Aizawa, A., 2020. Keyphrase Generation for Scientific Document Retrieval. In: Proceedings of...
- Generating sentences from a continuous space.
- Data augmentation using MG-GAN for improved cancer classification on gene expression data. Soft Comput.
- A deep learning model based on BERT and sentence transformer for semantic keyphrase extraction on big social data. IEEE Access.
- TurnGPT: a transformer-based language model for predicting turn-taking in spoken dialog.
- On the robustness of aspect-based sentiment analysis: Rethinking model, data and training. ACM Trans. Inf. Syst. (TOIS).
- Nonautoregressive encoder-decoder neural framework for end-to-end aspect-based sentiment triplet extraction. IEEE Trans. Neural Netw. Learn. Syst.
- Enriching contextualized language model from knowledge graph for biomedical information extraction. Brief. Bioinform.
- LasUIE: Unifying information extraction with latent adaptive structure-aware generative language model.
- Better combine them together! Integrating syntactic constituency and dependency representations for semantic role labeling.