Identifying influential segments from word co-occurrence networks using AHP
Introduction
In recent years, complex networks have received much attention from academic researchers and practitioners, with applications including market basket analysis, the spread of rumours, viruses and disease, the natural sciences and the word adjacency model. The semantics of a complex network differ from domain to domain. Identifying influential nodes in a complex network (Chen, Lü, Shang, Zhang, & Zhou, 2012) is of theoretical and practical significance. Event detection research is useful in several settings: in the industrial and business domain, to understand what the topic of discussion is; in smart-city projects, to identify routes which might get blocked by predicting future events; in recommender systems, which are built on historical data from the user's profile; and in marketing and advertising, to create awareness about newly introduced programs and products.
Social media is a rich source of user-generated data produced by naïve users. The major challenges in identifying events using keyphrase extraction are ill-formed user-generated data, computationally expensive large-scale data analytics and dynamically changing domains of information. Using a network of words that occur together, different network metrics can be used to identify influential segments. In this research work, the weighted-edge network metric is considered a significant parameter for studying frequently co-occurring words and phrases.
Social media data is uncertain text which may contain ill-formed data including slang, abbreviations and other shorthand notations. Because slang is location specific, text normalization is a tedious task when handling ill-formed data. Hence, a context-independent approach is required. Moreover, such an approach is language independent and can be used for different languages. Although a number of traditional tweet segmentation approaches based on semantics have been proposed, statistical computational techniques are required to identify influential segments in a semantically independent and content-insensitive way.
On social media, naïve users discuss an ever-changing set of topics that depends on location, time and other local and global circumstances. The topics, trends and events are independent of the previous set of discussions. Hence, an unsupervised approach based on textual networks is required for keyphrase extraction from social media data.
Word co-occurrence networks are graphs generated from contextual information. They are also referred to as the word-adjacency model. Different types of word co-occurrence network can be built from given data. In each graph G, a node represents a word and an edge is a link connecting two words that co-occur in the given document. In this research work, the document is a Twitter feed. Based on co-occurrence and associativity, the structure of the graph is decided by three parameters, namely directed/undirected, weighted/unweighted and nearest-neighbour edging/all-neighbour edging (Abilhoa & de Castro, 2014). For this research work, a weighted, directed network with nearest-neighbour edging is used.
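As an illustration of the weighted, directed, nearest-neighbour construction described above, the following is a minimal sketch rather than the authors' implementation. It assumes lowercased, whitespace-separated tokens (the preprocessing used in the paper is not shown here), and the function name `build_cooccurrence_network` and the sample tweets are hypothetical.

```python
from collections import Counter


def build_cooccurrence_network(tweets):
    """Return a Counter mapping (source_word, target_word) to edge weight.

    Assumption: tokens are lowercased and split on whitespace.
    """
    edges = Counter()
    for tweet in tweets:
        words = tweet.lower().split()
        # Nearest-neighbour edging: link each word only to the word
        # that immediately follows it, keeping the direction of reading.
        for source, target in zip(words, words[1:]):
            # Weighted: each repeated adjacency adds 1 to the edge weight.
            edges[(source, target)] += 1
    return edges


# Illustrative input only, not taken from the dataset.
tweets = [
    "rip amy winehouse",
    "singer amy winehouse found dead",
    "amy winehouse dies",
]
network = build_cooccurrence_network(tweets)
print(network[("amy", "winehouse")])  # weight of the edge amy -> winehouse
```

Because the network is directed, the edge from "amy" to "winehouse" accumulates weight while the reverse edge does not exist. All-neighbour edging would instead link every pair of words within the same tweet.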
A set of plots is generated from the word co-occurrence network built on the First Story Detection (Petrović, Osborne, & Lavrenko, 2010) dataset. Fig. 1(a) shows that the word adjacency matrix of the textual network is assortative. This indicates that words with a higher number of links are connected to words which also have a higher number of links. Thus, the corresponding links appear to be important and significant. However, the number of links of a node shows only the co-occurrence of a word with other words, and hence the average number of links. Following a similar idea, experiments have been performed to find, for all edges with weight w, the average edge weight of their neighbouring edges.
The average edge weight of the k nearest neighbours (knn) over predecessor, successor and all edges (y-axis) is plotted against edge weight (x-axis) in Fig. 1(b), (c) and (d) respectively. As shown in Fig. 1(b) and (c), edges either have neighbours with a high average edge weight, more than 58 and 29 respectively, or have neighbours with a low average edge weight, less than 10 and 15 respectively. However, in Fig. 1(d), the graph indicates that the average of maximum neighbouring edge weight is either more than 20 or less than 10. These observations indicate that some phrases in the word co-occurrence network form sub-graphs with a high sum of edge weights, meaning that the phrase is repeated a large number of times in discussions. Hence, the idea is to use edge-based measures instead of node-based measures for identifying influential segments. The clustering coefficient obtained for the textual network of tweets is 0.0757002632049, which is approximately three times the clustering coefficient of 0.0259882722963 calculated from 10 random networks with the same number of nodes, links and degree distribution. This shows that real-world word co-occurrence networks contain significant patterns which can be studied to obtain useful insights. These plots are meaningful for this study because they indicate the importance and significance of metrics based on the word co-occurrence network, which can be used to map different patterns, keywords and key-phrases.
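The neighbouring-edge averages discussed above can be sketched as follows. This is a hedged illustration under a stated assumption, since the exact definition is not given in this excerpt: for a directed edge (u, v), predecessor edges are taken to be the edges entering u, and successor edges the edges leaving v. The function name `neighbour_edge_averages` and the toy weights are hypothetical.

```python
from statistics import mean


def neighbour_edge_averages(edges):
    """Map each edge to (predecessor_avg, successor_avg, all_avg).

    edges: dict of {(source, target): weight}.
    Assumption: predecessors of (u, v) are edges ending at u,
    successors are edges starting at v. Empty sets average to 0.0.
    """
    result = {}
    for (u, v) in edges:
        pred = [w for (a, b), w in edges.items() if b == u]
        succ = [w for (a, b), w in edges.items() if a == v]
        both = pred + succ
        result[(u, v)] = tuple(
            mean(weights) if weights else 0.0
            for weights in (pred, succ, both)
        )
    return result


# Illustrative weights only, not taken from the dataset.
edges = {
    ("rip", "amy"): 2,
    ("singer", "amy"): 4,
    ("amy", "winehouse"): 6,
    ("winehouse", "dies"): 3,
}
averages = neighbour_edge_averages(edges)
print(averages[("amy", "winehouse")])
```

Grouping these averages by the weight of the central edge gives the kind of weight-versus-average-neighbour-weight relationship that the plots describe. The quadratic scan over edges keeps the sketch short; an adjacency index would be preferable on a network of realistic size.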
Section snippets
Related work
Graph-based keyphrase extraction techniques can be supervised or unsupervised, and context dependent or context independent. In this research work, many context-independent unsupervised graph-based keyword extraction techniques have been explored. KeyWorld is an automatic indexing system proposed by Matsuo, Ohsawa, and Ishizuka (2001) which extracts candidate keywords by measuring their influence on small-world properties. It captures characteristic path length and extended
Proposed method
The proposed methodology extracts Twitter feeds from the standard First Story Detection dataset (Petrović et al., 2010) and identifies keyphrases from those feeds using the word adjacency model. The sub-modules of the proposed method are discussed in this section. The topics in the FSD (First Story Detection) dataset have been used as the baseline for experimentation and evaluation of results. Also, the 72 Twitter Topic Detection and Tracking (TDT) (Aiello et al., 2013) dataset and
Experiments and evaluation
Various experiments have been performed using the proposed method. The datasets used for this research work are the FSD (First Story Detection) dataset, the 72 Twitter TDT dataset and the RIO Olympics dataset, which is available in the public domain. The FSD dataset provides 51,879,318 tweet IDs. For evaluating first story detection, 3034 tweets have been tagged with a topic. A total of 27 topics have been marked as ground truth data against which the evaluated results have been compared. Tweets of about randomly
Conclusion
This research work deals with word co-occurrence network analysis. Keyphrases have been obtained from textual networks and further ranked using the AHP optimization technique. It has been observed that as the approximation value p varies, the threshold t varies and hence the longest sequence obtained also varies. In the results obtained by AHP ranking, the keyphrases are ranked in three categories, namely headlines, which contain a lower number of keywords, detailed description which
Appendix
| Rank | Phrase Num | Phrase | Number of Words | Inference |
| 1 | 2 | rip amy winehouse dies from | 5 | Important headlines obtained |
| 2 | 3 | rip amy winehouse dead her for bellatrix | 7 | |
| 3 | 10 | singer amy winehouse dies from | 5 | |
| 4 | 4 | rip amy winehouse found dead her for bellatrix | 8 | |
| 5 | 11 | singer amy winehouse dead her for bellatrix | 7 | |
| 6 | 7 | rip amy winehouse has died the cause death unknown but there are rumors mrs weasley mistook her for bellatrix | 19 | Detailed description of topic has been obtained |
| 7 | 12 | singer amy winehouse found dead her for bellatrix | 8 | |
| 8 | 6 | rip amy... | | |
References (21)
- et al. (2014). A keyword extraction method from twitter messages represented as graphs. Applied Mathematics and Computation.
- et al. (2012). Identifying influential nodes in complex networks. Physica A: Statistical Mechanics and its Applications.
- How to make a decision: The analytic hierarchy process. (1990). European Journal of Operational Research.
- et al. (2013). Sensing trending topics in Twitter. IEEE Transactions on Multimedia.
- et al. (2015). An overview of graph-based keyword extraction methods and approaches. Journal of Information and Organizational Sciences.
- et al. (2010). Semantic networks: Structure and dynamics. Entropy.
- Boudin, F. (2013, October). A comparison of centrality measures for graph-based keyphrase extraction. In International...
- et al. (2004). LexRank: Graph-based lexical centrality as salience in text summarization. Journal of Artificial Intelligence Research.
- Grineva, M., Grinev, M., & Lizorkin, D. (2009, April). Extracting key terms from noisy and multitheme documents. In...
- Lahiri, S., Choudhury, S. R., & Caragea, C. (2014). Keyword and keyphrase extraction using centrality measures on...