A graph based keyword extraction model using collective node weight
Introduction
Keywords are defined as a series of one or more words that provide a compact representation of a document's content (Berry and Kogan, 2010; Boudin, 2013; Grineva et al., 2009; Lahiri et al., 2014). Keywords are widely used to define queries within information retrieval (IR) systems because they are easy to define, revise, remember, and share. Other applications of keywords include automatic indexing, summarization, classification, clustering, topic detection and tracking, and filtering (Palshikar, 2007). The task of mining these keywords from a document is called keyword extraction. Manual assignment of keywords is time-consuming and tedious, so a proficient automated keyword extraction approach is important.
Micro-blogs have recently attracted people who want to express their opinions and socialize with others. Micro-blogging is a combination of blogging and instant messaging that allows users to create short messages to be posted and shared with an online audience. Social platforms like Twitter have become extremely popular forms of this new type of blogging, especially on the mobile web, where communicating with people is much more convenient than it was when desktop web browsing and interaction were the norm. Users share thoughts, links, and pictures on Twitter, journalists comment on live events, and companies promote products and engage with customers. The list of ways to use Twitter is long, and with 500 million tweets per day there is a lot of data to analyze and explore. One of the most important tasks in analyzing Twitter data is keyword extraction. If the keywords of a text are extracted properly, the subject of the text can be studied and analyzed comprehensively, and good decisions can be made based on the text.
Texts are commonly represented using the well-known Vector Space Model (VSM) (Salton, Yang, & Yu, 1975). However, VSM produces sparse matrices that must be dealt with computationally, and the problem becomes worse for Twitter content than for traditional text collections. Because tweets are short (140 characters), diverse, and informal, contain grammatical errors, buzzwords, and slang, and are generated at high speed in real time, an effective technique is required to extract useful keywords (Ediger et al., 2010). A graph-based technique for keyword extraction is appropriate in this situation and has gained popularity in recent times.
Bellaachia and Al-Dhelaan (2012) proposed a graph-based method to extract keywords from Twitter data that combines node weights with TextRank, resulting in a node-edge weighting approach called NE-Rank (Node and Edge Rank). Term Frequency–Inverse Document Frequency (TF–IDF) is used as the node weight. However, keywords in Twitter data do not depend on TF–IDF alone. Abilhoa and Castro (2014) proposed a graph-based technique to extract keywords from Twitter data that uses closeness and eccentricity centralities to determine node weight, with degree centrality as the tie breaker. Closeness and eccentricity centralities do not work well for disconnected graphs, yet in most cases the graph built from tweets is disconnected because of the diversity of tweet content. Therefore, an effective graph-based keyword extraction method is required that can overcome most of the drawbacks of graph-based models, including those cited above. This paper proposes such a method, called Keyword Extraction using Collective Node Weight (KECNW), which depends on several parameters of a node such as frequency, centrality, position, and strength of neighbors.
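The difficulty with disconnected graphs can be made concrete with a small sketch. It assumes the textbook definitions of eccentricity (largest shortest-path distance to any other node) and closeness (number of other nodes divided by the sum of distances to them), which may differ from the exact variants used in the cited work, and the five-word graph is hypothetical. Under these definitions, once any node is unreachable every node receives an infinite eccentricity and a closeness of zero, so the scores no longer discriminate between words.

```python
from collections import deque

def bfs_distances(graph, source):
    """Return hop counts from source to every node reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist

def eccentricity(graph, node):
    """Largest distance to any other node; infinite if some node is unreachable."""
    dist = bfs_distances(graph, node)
    if len(dist) < len(graph):
        return float("inf")
    return max(dist.values())

def closeness(graph, node):
    """(n - 1) / sum of distances; 0.0 if some node is unreachable."""
    dist = bfs_distances(graph, node)
    if len(dist) < len(graph):
        return 0.0  # the sum of distances is infinite
    total = sum(dist.values())
    return (len(graph) - 1) / total if total else 0.0

# Hypothetical word graph built from two tweets that share no vocabulary
graph = {
    "ipl": {"contest", "kkr"},
    "contest": {"ipl"},
    "kkr": {"ipl"},
    "potter": {"harry"},
    "harry": {"potter"},
}

# The same measures on the first component alone are well defined
connected = {word: graph[word] for word in ("ipl", "contest", "kkr")}

for word in graph:
    print(word, eccentricity(graph, word), closeness(graph, word))
```

On the connected component alone the measures separate the hub word from the others, while on the full graph all five words tie.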
The remainder of this article is organized as follows. Section 2 presents a literature survey describing related previous work. Section 3 discusses the proposed model in detail. Section 4 presents an illustrative example to clarify the proposed model. Section 5 presents results and discussion, and Section 6 draws conclusions about the research work.
Literature survey
Keyword extraction techniques can be divided into four categories, namely the linguistic approach, the machine learning approach, the statistical approach, and other approaches (Zahang et al., 2008). The linguistic approach uses the linguistic properties of words, sentences, and documents; the most commonly examined properties are lexical, syntactic, semantic, and discourse analysis (Cohen-Kerner, 2003; Hulth, 2003; Nguyen and Kan, 2007). The machine learning approach considers supervised or
Proposed KECNW model
The KECNW model considers the frequency, centrality, position, and strength of neighbors of a node to calculate the importance of the node. The implementation of the model is divided into four phases: preprocessing, textual graph representation, node weight assignment, and keyword extraction. The details of all the phases are given below.
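As a rough illustration of the first two phases only, the sketch below cleans and tokenizes tweets and then builds a word graph. Several choices here are assumptions rather than the paper's specification: the tiny stop-word list, the regular expressions used for noise removal, the rule that two terms are linked when they occur in the same tweet, and the use of co-occurrence counts as edge weights. The node weight assignment and keyword extraction phases are not sketched, since their formulas are not given in this excerpt.

```python
import re
from collections import Counter
from itertools import combinations

# Tiny illustrative stop-word list (an assumption, not the paper's list)
STOP_WORDS = {
    "a", "an", "the", "for", "to", "be", "is", "in", "of", "vs", "who",
    "have", "has", "been", "he", "should", "because",
}

def preprocess(tweet):
    """Remove links and symbols, lowercase, tokenize, and drop stop words."""
    text = re.sub(r"https?://\S+", " ", tweet)    # remove links
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)   # remove @, #, punctuation
    tokens = text.lower().split()
    return [t for t in tokens if t not in STOP_WORDS]

def build_graph(tweets):
    """Return (term frequencies, edge weights) for a co-occurrence graph."""
    frequency = Counter()
    edges = Counter()
    for tweet in tweets:
        tokens = preprocess(tweet)
        frequency.update(tokens)
        # Assumed edge rule: one undirected edge per distinct pair of
        # terms that appear in the same tweet
        for u, v in combinations(sorted(set(tokens)), 2):
            edges[(u, v)] += 1
    return frequency, edges

tweets = [
    "Very excited for todays IPL contest RPS vs KKR, @msdhoni vs @GautamGambhir fight! #IPL",
    "RPS should be happy team today because KKR have decided to rest NCN. "
    "He has been in prime form. #KKRvRPS #IPL @RPSupergiants @KKRiders",
]
frequency, edges = build_graph(tweets)
print(frequency.most_common(3))
print(edges[("ipl", "kkr")])
```

Each vertex is a distinct term, and `frequency` already supplies one of the node parameters the model mentions; centrality, position, and neighbor strength would be computed on top of this graph.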
An illustrative example
Five tweets from the IPL dataset are taken to illustrate the whole model in detail. Tweets are given as follows:
- (i) `Very excited for todays IPL contest RPS vs KKR, @msdhoni vs @GautamGambhir fight! #IPL'
- (ii) `#poll who score 50+ score today #smithy #dhoni #stokes #Rahane #KKRvRPS #rpsvskkr #cricketlovers #ipl #IPL2017'
- (iii) `RPS should be happy team today because KKR have decided to rest NCN. He has been in prime form. #KKRvRPS #IPL @RPSupergiants @KKRiders'
- (iv) `KKR seek to extend unbeaten run against Pune //t.co/NdEuZIdxL5
Results and discussion
The experiments are conducted with five datasets: Donald Trump, Harry Potter, IPL, Uri Attack, and American Election. All datasets are collected from Twitter. The Donald Trump, Harry Potter, and IPL datasets contain 2000 tweets each, while the Uri Attack and American Election datasets contain 500 tweets each.
Precision (Pr), Recall (Re), and F-measure are used as evaluation metrics for keyword extraction; they are given in Eqs. (10)–(12).
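Since the equations themselves are not reproduced in this excerpt, the following sketch uses the standard set-based definitions, which is an assumption about what Eqs. (10)–(12) state: precision is the share of extracted keywords that are correct, recall is the share of reference keywords that were extracted, and F-measure is their harmonic mean. The function name `evaluate_keywords` and both keyword lists are hypothetical.

```python
def evaluate_keywords(extracted, reference):
    """Return (precision, recall, F-measure) for two keyword collections."""
    extracted, reference = set(extracted), set(reference)
    correct = len(extracted & reference)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# Hypothetical keyword sets, for illustration only
extracted = ["ipl", "kkr", "rps", "today"]
reference = ["ipl", "kkr", "rps", "dhoni", "pune"]

precision, recall, f_measure = evaluate_keywords(extracted, reference)
print(f"Pr={precision:.2f} Re={recall:.2f} F={f_measure:.2f}")
```

With three of four extracted keywords correct and three of five reference keywords found, this gives a precision of 0.75 and a recall of 0.60.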
Conclusions
Keyword extraction is one of the most important tasks in analyzing Twitter data. This paper proposes a keyword extraction model called KECNW, which consists of four phases: preprocessing, textual graph representation, node weight assignment, and keyword extraction. Preprocessing involves removing unwanted noise, tokenization, and stop word removal. Textual graph representation involves vertex assignment and the establishment of edges between vertices. Node weight assignment phase determines node
Funding
This study is not funded by any research grant.
Conflict of Interest
There is no conflict of interest.
Human and animal rights
This article does not contain any studies with human or animal participants.
References (31)
- et al. A keyword extraction method from twitter messages represented as graphs. Applied Mathematics and Computation (2014)
- et al. Automatic keyword prediction using Google similarity distance. Expert Systems with Applications (2010)
- et al. An overview of graph-based keyword extraction methods and approaches. J. Inf. Org. Soc. (2015)
- et al. NE-Rank: a novel graph-based key phrase extraction in twitter
- et al. Text mining: applications and theory (2010)
- A comparison of centrality measures for graph-based keyphrase extraction
- et al. TopicRank: Graph-based topic ranking for keyphrase extraction
- Automatic extraction of keyword from abstracts. Lecture Notes in Computer Science (2003)
- et al. Massive social network analysis: Mining twitter for social good
- et al. Extracting key terms from noisy and multi-theme documents
- A brief survey of text mining. LDV Forum GLDV Journal for Computational Linguistics and Language Technology
- Improved automatic keyword extraction given more linguistic knowledge
- Term ranker: A graph based re-ranking approach
- A graph based representative keywords extraction model from news articles