Domain-Specific Keyword Extraction Using Joint Modeling of Local and Global Contextual Semantics

Published: 08 January 2022

Abstract

Domain-specific keyword extraction is a vital task in text mining. Many research tasks, such as spam e-mail classification, abusive language detection, sentiment analysis, and emotion mining, benefit substantially from a set of domain-specific keywords (a lexicon). Existing keyword extraction methods list all keywords in a document corpus rather than only the domain-specific ones. Moreover, most existing approaches perform well on formal document corpora but fail on noisy, informal user-generated content from online social media. In this article, we present a hybrid approach that jointly models the local and global contextual semantics of words, utilizing the strengths of distributional word representations and a contrasting-domain corpus for domain-specific keyword extraction. Starting with a seed set of a few domain-specific keywords, we model the text corpus as a weighted word-graph. In this graph, the initial weight of a node (word) represents its semantic association with the target domain, calculated as a linear combination of three semantic association metrics, and the weight of an edge connecting a pair of nodes represents the co-occurrence count of the respective words. Thereafter, a modified PageRank method is applied to the word-graph to identify the most relevant words for expanding the initial set of domain-specific keywords. We evaluate our method on both formal and informal text corpora (comprising six datasets) and show that it performs significantly better than state-of-the-art methods. Furthermore, we generalize our approach to handle the language-agnostic case and show that it outperforms existing language-agnostic approaches.
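
The pipeline outlined in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract names neither the three semantic association metrics nor the exact PageRank modification. The sketch therefore assumes toy metric scores, hypothetical coefficients (0.4, 0.3, 0.3), a sliding co-occurrence window, and a personalized-PageRank-style update in which the teleport distribution is proportional to the initial node weights. All function names are illustrative.

```python
from collections import defaultdict


def build_cooccurrence_graph(docs, window=2):
    """Undirected word-graph: the edge weight counts how often two distinct
    words appear within `window` following tokens of each other."""
    edges = defaultdict(lambda: defaultdict(float))
    for tokens in docs:
        for i, w in enumerate(tokens):
            for v in tokens[i + 1 : i + 1 + window]:
                if v != w:
                    edges[w][v] += 1.0
                    edges[v][w] += 1.0
    return edges


def node_weights(words, metrics, coeffs):
    """Initial node weight as a linear combination of metric scores.
    `metrics` holds one {word: score} dict per (assumed) association metric."""
    return {w: sum(c * m.get(w, 0.0) for c, m in zip(coeffs, metrics))
            for w in words}


def weighted_pagerank(edges, init, damping=0.85, iters=50):
    """Assumed variant: teleport in proportion to the initial node weights and
    follow edges in proportion to co-occurrence counts."""
    words = list(init)
    total = sum(init.values()) or 1.0
    teleport = {w: init[w] / total for w in words}
    out_sum = {w: sum(edges[w].values()) for w in words}
    rank = dict(teleport)
    for _ in range(iters):
        new_rank = {}
        for w in words:
            incoming = sum(rank[v] * edges[v][w] / out_sum[v]
                           for v in edges[w] if out_sum[v] > 0)
            new_rank[w] = (1 - damping) * teleport[w] + damping * incoming
        rank = new_rank
    return rank


# Toy corpus and toy metric scores (all values are made up for illustration).
docs = [
    ["spam", "offer", "free", "money"],
    ["free", "money", "now"],
    ["meeting", "agenda", "now"],
]
seeds = {"spam"}
metric_a = {"spam": 1.0, "offer": 0.6, "free": 0.6, "money": 0.6}
metric_b = {"spam": 1.0, "offer": 0.4, "free": 0.5, "money": 0.5}
metric_c = {"spam": 1.0, "offer": 0.5, "free": 0.4, "money": 0.4}

edges = build_cooccurrence_graph(docs)
init = node_weights(list(edges), [metric_a, metric_b, metric_c], (0.4, 0.3, 0.3))
rank = weighted_pagerank(edges, init)

# Candidate expansions: highest-ranked words that are not already seeds.
candidates = sorted((w for w in rank if w not in seeds),
                    key=rank.get, reverse=True)
print(candidates[:3])
```

In this sketch, words with no edges would lose rank mass at every iteration; a fuller implementation would need to redistribute that mass.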

        Published In

        ACM Transactions on Knowledge Discovery from Data  Volume 16, Issue 4
        August 2022
        529 pages
        ISSN:1556-4681
        EISSN:1556-472X
        DOI:10.1145/3505210
        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        Published: 08 January 2022
        Accepted: 01 October 2021
        Revised: 01 August 2021
        Received: 01 March 2021
        Published in TKDD Volume 16, Issue 4

        Author Tags

        1. Text mining
        2. information extraction
        3. domain-specific keyword extraction
        4. language-agnostic keyword extraction

        Qualifiers

        • Research-article
        • Refereed
