A graph based keyword extraction model using collective node weight

https://doi.org/10.1016/j.eswa.2017.12.025

Highlights

  • This paper proposes a keyword extraction method called KECNW.

  • KECNW is a novel unsupervised graph based keyword extraction method.

  • It determines node weight by collectively considering various influencing parameters.

  • This model is validated with five datasets.

  • It performs far better than the three existing methods it is compared with.

Abstract

A huge amount of text is now generated for social purposes on the Twitter social networking site. Summarizing and analyzing Twitter content is an important task, as it benefits many applications such as information retrieval, automatic indexing, automatic classification, automatic clustering and automatic filtering. One of the most important tasks in analyzing tweets is automatic keyword extraction. Some graph based approaches to keyword extraction determine keywords based only on a centrality measure. However, the importance of a keyword in Twitter depends on various parameters such as frequency, centrality, position and strength of neighbors of the keyword. Therefore, this paper proposes a novel unsupervised graph based keyword extraction method called Keyword Extraction using Collective Node Weight (KECNW), which determines the importance of a keyword by collectively considering these influencing parameters. KECNW is based on Node Edge rank centrality, with node weight depending on various parameters. The model is validated with five datasets: Uri Attack, American Election, Harry Potter, IPL and Donald Trump. The results of KECNW are compared with three existing models. The experimental results show that the proposed method performs far better than the others. Performance is reported in terms of precision, recall and F-measure.

Introduction

Keywords are defined as a series of one or more words that provide a compact representation of a document's content (Berry and Kogan, 2010; Boudin, 2013; Grineva et al., 2009; Lahiri et al., 2014). Keywords are widely used to define queries within information retrieval (IR) systems as they are easy to define, revise, remember, and share. Other applications using keywords include automatic indexing, automatic summarization, automatic classification, automatic clustering, automatic topic detection and tracking, and automatic filtering (Palshikar, 2007). The task of mining these keywords from a document is called keyword extraction. Manual assignment of keywords is time consuming and tedious, so a proficient automated keyword extraction approach is important.

Micro-blogs have recently been attracting people to express their opinions and socialize with others. Micro-blogging is a combination of blogging and instant messaging that allows users to create short messages to be posted and shared with an audience online. Social platforms like Twitter have become extremely popular forms of this new type of blogging, especially on the mobile web, where communicating with people is much more convenient than in the days when desktop web browsing and interaction was the norm. Users share thoughts, links and pictures on Twitter, journalists comment on live events, and companies promote products and engage with customers. The list of ways to use Twitter is long, and with 500 million tweets per day, there is a lot of data to analyze and explore. One of the most important tasks in analyzing Twitter data is keyword extraction. If the keywords of a text are extracted properly, the subject of the text can be studied and analyzed comprehensively and good decisions can be made about the text.

Texts are commonly represented using the well-known Vector Space Model (VSM) (Salton, Yang, & Yu, 1975). However, VSM produces sparse matrices that are costly to process, and the problem becomes worse for Twitter content than for traditional text collections. Because of the short texts (140 characters), the diversity of Twitter content, informality, grammatical errors, buzzwords, slang, and the speed with which real-time content is generated, an effective technique is required to extract useful keywords (Ediger et al., 2010). Graph based keyword extraction is appropriate in such situations and has gained popularity in recent times.

Bellaachia and Al-Dhelaan (2012) proposed a graph based method to extract keywords from Twitter data that adds node weights to TextRank, resulting in a node-edge weighting approach called NE-Rank (Node and Edge Rank). Term Frequency–Inverse Document Frequency (TF–IDF) is used as the node weight. However, keywords in Twitter data do not depend only on TF–IDF. Abilhoa and Castro (2014) proposed a graph based technique to extract keywords from Twitter data that uses closeness and eccentricity centralities to determine node weight, with degree centrality as the tie breaker. Closeness and eccentricity centralities do not work well for disconnected graphs, yet in most cases the graph built from tweets is disconnected because of the diversity of tweet content. Therefore, an effective graph based keyword extraction method is required that overcomes most of the drawbacks of graph based models, including the ones cited above. This paper proposes such a method, called Keyword Extraction using Collective Node Weight (KECNW), which depends on many parameters of a node such as frequency, centrality, position and strength of neighbors.
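
The difficulty with closeness and eccentricity on disconnected graphs can be seen in a small sketch. It assumes the classic definitions (closeness as (n - 1) divided by the sum of shortest-path distances to all other nodes, eccentricity as the largest such distance) and a hypothetical word graph built from two unrelated tweets; neither the graph nor the function names come from the paper.

```python
from collections import deque

def bfs_distances(graph, source):
    """Return hop distances from source to every node reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist

def closeness(graph, node):
    """Classic closeness: (n - 1) / sum of distances to all other nodes.
    An unreachable node is infinitely far away, so the score collapses to 0."""
    dist = bfs_distances(graph, node)
    if len(dist) < len(graph):
        return 0.0
    total = sum(dist.values())
    return (len(graph) - 1) / total if total else 0.0

def eccentricity(graph, node):
    """Largest distance from node to any other node (infinite if one is unreachable)."""
    dist = bfs_distances(graph, node)
    if len(dist) < len(graph):
        return float("inf")
    return max(dist.values())

# Hypothetical word graph: two tweets on unrelated topics share no words,
# so the graph has two components.
graph = {
    "kkr": {"rps", "ipl"},
    "rps": {"kkr", "ipl"},
    "ipl": {"kkr", "rps"},
    "harry": {"potter"},
    "potter": {"harry"},
}

for word in graph:
    print(word, closeness(graph, word), eccentricity(graph, word))
```

Under these definitions every word receives the same closeness (zero) and the same eccentricity (infinite), so the two measures cannot separate candidate keywords once the graph splits into components.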

The remainder of this article is organized as follows. Section 2 presents a literature survey of related previous work. Section 3 discusses the proposed model in detail. Section 4 presents an illustrative example to make the proposed model clear. Section 5 presents the results with discussion, and Section 6 draws conclusions about the research work.

Section snippets

Literature survey

Keyword extraction techniques can be divided into four categories: linguistic approaches, machine learning approaches, statistical approaches and other approaches (Zahang et al., 2008). The linguistic approach uses the linguistic properties of words, sentences and documents; the most commonly examined properties are lexical, syntactic, semantic and discourse analysis (Cohen-Kerner, 2003; Hulth, 2003; Nguyen and Kan, 2007). Machine learning approach considers supervised or

Proposed KECNW model

The KECNW model considers the frequency, centrality, position and strength of neighbors of a node to calculate the importance of that node. The implementation of the model is divided into four phases: preprocessing, textual graph representation, node weight assignment and keyword extraction. The details of all the phases are given below.
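
As a rough orientation, the last three phases might be sketched as follows. This is not the paper's implementation, and everything specific in it is an assumption: edges link adjacent tokens within a tweet, the node weight is plain normalized frequency (a placeholder for KECNW's collective weight, whose formulas are not reproduced in this excerpt), and the ranking step is a node-weighted TextRank-style iteration in the spirit of NE-Rank. The function names and the token lists are hypothetical.

```python
from collections import Counter, defaultdict

def build_graph(token_lists):
    """Textual graph representation: one node per token, an undirected edge
    between tokens that appear side by side in a tweet (assumed edge rule).
    Edge weight counts how often the pair is adjacent."""
    edges = defaultdict(Counter)
    freq = Counter()
    for tokens in token_lists:
        freq.update(tokens)
        for a, b in zip(tokens, tokens[1:]):
            if a != b:
                edges[a][b] += 1
                edges[b][a] += 1
    for token in freq:
        edges.setdefault(token, Counter())  # keep isolated tokens as nodes
    return edges, freq

def rank_nodes(edges, node_weight, damping=0.85, iterations=100):
    """Node-weighted TextRank-style iteration (assumed update rule):
    rank(v) = (1 - d) * W(v) + d * W(v) * sum_u (w_uv / out_u) * rank(u)."""
    rank = {v: 1.0 for v in edges}
    for _ in range(iterations):
        new_rank = {}
        for v in edges:
            incoming = 0.0
            for u, w in edges[v].items():
                incoming += w / sum(edges[u].values()) * rank[u]
            new_rank[v] = (1 - damping) * node_weight[v] \
                + damping * node_weight[v] * incoming
        rank = new_rank
    return rank

# Hypothetical, already preprocessed tweets
token_lists = [
    ["kkr", "rps", "ipl"],
    ["kkr", "win", "ipl"],
    ["rps", "kkr", "fight"],
]

edges, freq = build_graph(token_lists)

# Placeholder node weight: frequency scaled to (0, 1]
max_freq = max(freq.values())
node_weight = {token: freq[token] / max_freq for token in freq}

rank = rank_nodes(edges, node_weight)
keywords = sorted(rank, key=rank.get, reverse=True)
print(keywords[:3])
```

Replacing `node_weight` with a score that also reflects centrality, position and neighbor strength is where the collective weighting described in the text would enter; the rest of the pipeline would stay the same.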

An illustrative example

Five tweets from the IPL dataset are taken to illustrate the whole model in detail. Tweets are given as follows:

  • (i)

    `Very excited for todays IPL contest RPS vs KKR, @msdhoni vs @GautamGambhir fight! #IPL'

  • (ii)

    `#poll who score 50+ score today #smithy #dhoni #stokes #Rahane #KKRvRPS #rpsvskkr #cricketlovers #ipl #IPL2017'

  • (iii)

    `RPS should be happy team today because KKR have decided to rest NCN. He has been in prime form. #KKRvRPS #IPL @RPSupergiants @KKRiders'

  • (iv)

    `KKR seek to extend unbeaten run against Pune //t.co/NdEuZIdxL5
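
The preprocessing phase (noise removal, tokenization and stop word removal) can be illustrated on tweet (i). The sketch below is only one plausible reading: the regular expressions, the decision to keep the words behind `@` and `#` while dropping the symbols, and the tiny stop word list are all assumptions, not details taken from the paper.

```python
import re

# Tiny illustrative stop word list (assumption; a real list would be much larger)
STOP_WORDS = {
    "very", "for", "vs", "the", "a", "to", "be", "has", "have",
    "been", "in", "he", "who", "should", "because",
}

def preprocess(tweet):
    """Lowercase, strip noise, tokenize on whitespace, drop stop words."""
    text = tweet.lower()
    text = re.sub(r"(?:https?:)?//\S+", " ", text)  # drop links
    text = re.sub(r"[@#]", "", text)                # keep the word, drop the symbol
    text = re.sub(r"[^a-z0-9\s]", " ", text)        # drop remaining punctuation
    return [token for token in text.split() if token not in STOP_WORDS]

tweet = ("Very excited for todays IPL contest RPS vs KKR, "
         "@msdhoni vs @GautamGambhir fight! #IPL")
print(preprocess(tweet))
# ['excited', 'todays', 'ipl', 'contest', 'rps', 'kkr',
#  'msdhoni', 'gautamgambhir', 'fight', 'ipl']
```

The resulting token list is the kind of input from which graph vertices and edges would then be built.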

Results and discussion

The experiments are conducted with five datasets: Donald Trump, Harry Potter, IPL, Uri Attack and American Election. All datasets are collected from Twitter. The Donald Trump, Harry Potter and IPL datasets contain 2000 tweets each, while the Uri Attack and American Election datasets contain 500 tweets each.

Precision (Pr), Recall (Re), and F-measure performance measures are used as evaluation metrics for keyword extraction and are given in Eqs. (10)–(12).

$$\mathrm{Pr} = \frac{|\{\mathrm{Relevant}\} \cap \{\mathrm{Retrieved}\}|}{|\{\mathrm{Retrieved}\}|}$$

Re=|{Inter_Relevan
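
For orientation, the three metrics can be computed as set overlaps between extracted and reference keywords. The sketch uses the standard definitions: precision as the share of retrieved keywords that are relevant, recall as the share of relevant keywords that are retrieved, and F-measure as their harmonic mean. Because the recall and F-measure equations are cut off in this excerpt, treating them as the standard forms is an assumption, and the keyword lists are hypothetical.

```python
def evaluate(retrieved, relevant):
    """Return (precision, recall, f_measure) for two keyword collections."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    # Harmonic mean of precision and recall (standard F1, assumed here)
    f_measure = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f_measure

# Hypothetical extracted keywords and manually assigned reference keywords
extracted = ["kkr", "rps", "ipl", "today"]
reference = ["kkr", "rps", "ipl", "dhoni", "pune"]

pr, rec, f = evaluate(extracted, reference)
print(f"Pr={pr:.2f} Re={rec:.2f} F={f:.2f}")  # Pr=0.75 Re=0.60 F=0.67
```

Three of the four extracted keywords are in the reference set, which has five entries, giving the precision of 0.75 and recall of 0.60 shown.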

Conclusions

Keyword extraction is one of the most important tasks in analyzing Twitter data. This paper proposes a keyword extraction model called KECNW which consists of four phases: pre-processing, textual graph representation, node weight assignment and keyword extraction. Pre-processing involves removing unwanted noise, tokenization and stop word removal. Textual graph representation involves vertex assignment and establishment of edges between vertices. Node weight assignment phase determines node

Funding

This study is not funded by any research grant.

Conflict of Interest

There is no conflict of interest.

Human and animal rights

This article does not contain any studies with human or animal participants.

References (31)

  • W.D. Abilhoa et al.

    A keyword extraction method from twitter messages represented as graphs

    Applied Mathematics and Computation

    (2014)
  • P. Chen et al.

    Automatic keyword prediction using Google similarity distance

    Expert Systems with Applications

    (2010)
  • S. Beliga et al.

    An overview of graph-based keyword extraction methods and approaches

    J. Inf. Org. Soc.

    (2015)
  • A. Bellaachia et al.

    NE-Rank: a novel graph-based key phrase extraction in twitter

  • M.W. Berry et al.

    Text mining: applications and theory

    (2010)
  • F. Boudin

    A comparison of centrality measures for graph-based keyphrase extraction

  • A. Bougouin et al.

    TopicRank: Graph-based topic ranking for keyphrase extraction

  • H. Cohen-Kerner

    Automatic extraction of keyword from abstracts

    Lecture Notes in Computer Science

    (2003)
  • D. Ediger et al.

    Massive social network analysis: Mining twitter for social good

  • M. Grineva et al.

    Extracting key terms from noisy and multi-theme documents

  • A. Hotho et al.

    A brief survey of text mining

    LDV Forum GLDV Journal for Computational Linguistics and Language Technology

    (2005)
  • A. Hulth

    Improved automatic keyword extraction given more linguistic knowledge

  • T.Md. Khan et al.

    Term ranker: A graph based re-ranking approach

  • K. Kwon et al.

    A graph based representative keywords extraction model from news articles

  • Lahiri, S., Choudhury, S. R., & Caragea, C. (2014). Keyword and keyphrase extraction using centrality measures on...