Identifying influential segments from word co-occurrence networks using AHP

doi:10.1016/j.cogsys.2017.07.003

Cognitive Systems Research

Volume 47, January 2018, Pages 28-41

Abstract

Identifying important segments in textual data is an important area of research for various applications, including topic modelling, trend detection, summarization and event detection. Existing research has studied different metrics for analysing word co-occurrence networks. This research work contributes towards non-semantic, unsupervised topic identification using word co-occurrence networks. Keyphrases are identified by preserving the lexical sequence using a directed and weighted word co-occurrence network. Further, an AHP (Analytic Hierarchy Process) model based upon four significant attributes of the word co-occurrence network is proposed to rank the keyphrases. The most frequently occurring segment is identified as the influential segment. Experimental results showed high effectiveness of the proposed approach. Results for the First Story Detection, 72 Twitter TDT and synthesized Rio Olympics datasets are discussed to demonstrate its potential in precisely discovering influential segments.

Introduction

In recent years, complex networks have received much attention from academic researchers and practitioners in areas including market basket analysis, rumour spreading, virus spreading, the spread of disease, the natural sciences and word adjacency models. The semantics of a complex network differ for every domain. Identifying influential nodes in a complex network (Chen, Lü, Shang, Zhang, & Zhou, 2012) is of theoretical and practical significance. Event detection research is useful in the industrial/business domain for understanding what the topic of discussion is; in smart-city projects for identifying routes that might get blocked, by predicting future events; in developing recommender systems on the basis of historical data from the user’s profile; and in marketing/advertising for creating awareness about newly introduced programs and products.

Social media is a rich source of user-generated data produced by naïve users. The major challenges in identifying events using keyphrase extraction are ill-formed user-generated data, computationally expensive large-scale data analytics and dynamically ever-changing domains of information. Using a network of words occurring together, different network metrics can be used to identify influential segments. In this research work, the weighted-edge network metric is considered a significant parameter for studying frequently co-occurring words and phrases.

Social media data is uncertain text which may contain ill-formed data including slang, abbreviations and other shorthand notations. Because slang is location specific, text normalization is a tedious task when handling ill-formed data. Hence, a context-independent approach is required. Moreover, such an approach is language independent and can be used for different languages. Although a number of traditional tweet segmentation approaches based on semantics have been proposed, statistical computational techniques are required to identify influential segments in a semantically independent and content-insensitive manner.

On social media, naïve users discuss an ever-changing set of topics that depends on location, time and other local and global circumstances. The topics, trends and events are independent of the previous set of discussions. Hence, an unsupervised, textual-network-based approach is required for keyphrase extraction from social media data.

Word co-occurrence networks are graphs generated from contextual information. They are also referred to as word-adjacency models. Different types of word co-occurrence networks can be built from given data. For each graph G, a node represents a word and an edge represents a link connecting two words that co-occur in the given document. In this research work, the document is a Twitter feed. Based on co-occurrence and associativity, the structure of the graph is decided by three parameters, namely directed/undirected, weighted/unweighted and nearest-neighbour edging/all-neighbour edging (Abilhoa & de Castro, 2014). For this research work, a weighted, directed network with nearest-neighbour edging is used.
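The construction described above can be illustrated with a minimal sketch. It assumes naive lower-casing and whitespace tokenisation, which may differ from the preprocessing used in this work; the function name build_cooccurrence_network and the toy tweets are hypothetical and serve only as illustration.

```python
from collections import Counter


def build_cooccurrence_network(documents):
    """Return a Counter mapping (predecessor, successor) -> edge weight.

    Directed: the edge runs from a word to the word that follows it.
    Weighted: the weight counts how often that ordered pair occurs.
    Nearest-neighbour edging: only adjacent words are linked.
    """
    edges = Counter()
    for text in documents:
        # Assumption: simple whitespace tokenisation, no normalisation.
        tokens = text.lower().split()
        for left, right in zip(tokens, tokens[1:]):
            edges[(left, right)] += 1
    return edges


# Hypothetical toy tweets, not taken from any dataset.
tweets = [
    "rip amy winehouse",
    "singer amy winehouse found dead",
    "amy winehouse found dead",
]
network = build_cooccurrence_network(tweets)
for (source, target), weight in sorted(network.items()):
    print(f"{source} -> {target}: {weight}")
```

Because the edges are keyed by ordered pairs, the lexical sequence of the words is preserved: the pair (amy, winehouse) accumulates weight while the reversed pair never appears.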

A set of plots is generated from the word co-occurrence network built from the First Story Detection (Petrović, Osborne, & Lavrenko, 2010) dataset. Fig. 1(a) shows that the word adjacency matrix of the textual network is assortative. This indicates that words with a higher number of links are connected to words which also have a higher number of links. Thus, the corresponding links appear to be important and significant. However, the number of links of a node $n$ only reflects the co-occurrence of that word with other words, and hence the average number of links. Following a similar idea, experiments have been performed to compute the average edge weight of the neighbouring edges of all edges with weight w.

The average edge weight of the k nearest neighbours (knn) over predecessor, successor and all edges (y-axis) is plotted against the edge weight (x-axis) in Fig. 1(b), (c) and (d) respectively. As shown in Fig. 1(b) and (c), edges either have a high average neighbouring edge weight, above 58 and 29 respectively, or a low average neighbouring edge weight, below 10 and 15 respectively. Similarly, Fig. 1(d) indicates that the average of the maximum neighbouring edge weight is either more than 20 or less than 10. These observations indicate that some phrases in the word co-occurrence network correspond to sub-graphs with a high sum of edge weights, meaning that the phrase is repeated a large number of times in discussions. Hence, the idea is to use edge-based measures instead of node-based measures for identifying influential segments. The clustering coefficient obtained for the textual network of tweets is 0.0757002632049, which is approximately three times the clustering coefficient of 0.0259882722963 calculated for 10 random networks with the same number of nodes, links and degree distribution. This shows that real-world word co-occurrence networks contain significant patterns which can be studied to obtain useful insights. These plots are meaningful for this study because they indicate the importance and significance of word co-occurrence network based metrics, which can be used to map different patterns, keywords and keyphrases.
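One way to read the neighbouring-edge averages discussed above is sketched below. The exact definition used for Fig. 1 is not given in this excerpt, so the sketch assumes that, for an edge (u, v), the predecessor edges are those ending at u and the successor edges are those starting at v. The function name neighbour_edge_averages and the toy edge weights are hypothetical.

```python
from collections import defaultdict


def neighbour_edge_averages(edges):
    """Map each edge (u, v) to (predecessor_avg, successor_avg, all_avg).

    edges: dict mapping (u, v) -> weight of the directed edge u -> v.
    Assumption: predecessor edges end at u, successor edges start at v.
    """
    incoming = defaultdict(list)  # node -> weights of edges ending at it
    outgoing = defaultdict(list)  # node -> weights of edges starting at it
    for (u, v), weight in edges.items():
        outgoing[u].append(weight)
        incoming[v].append(weight)

    def mean(values):
        # An edge with no neighbours of a given kind gets an average of 0.0.
        return sum(values) / len(values) if values else 0.0

    averages = {}
    for (u, v) in edges:
        predecessors = incoming[u]
        successors = outgoing[v]
        averages[(u, v)] = (
            mean(predecessors),
            mean(successors),
            mean(predecessors + successors),
        )
    return averages


# Hypothetical toy network, not taken from any dataset.
toy_edges = {
    ("rip", "amy"): 1,
    ("singer", "amy"): 1,
    ("amy", "winehouse"): 3,
    ("winehouse", "found"): 2,
    ("found", "dead"): 2,
}
averages = neighbour_edge_averages(toy_edges)
print(averages[("amy", "winehouse")])
```

Plotting these averages against the weight of the edge itself would give curves of the kind described for Fig. 1(b) to (d), under the stated assumption about what counts as a neighbouring edge.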

Section snippets

Related work

Graph-based keyphrase extraction techniques can be supervised or unsupervised, and context dependent or context independent. In this research work, many context-independent, unsupervised, graph-based keyword extraction techniques have been explored. KeyWorld is an automatic indexing system proposed by Matsuo, Ohsawa, and Ishizuka (2001) that extracts candidate keywords by measuring their influence on small-world properties. It captures characteristic path length and extended

Proposed method

The proposed methodology deals with extracting Twitter feeds from the standard First Story Detection dataset (Petrović et al., 2010) and identifying keyphrases from the Twitter feeds using the word adjacency model. The sub-modules of the proposed method are discussed in this section. The topics in the dataset named FSD (First Story Detection) have been used as the baseline for experimentation and evaluation of results. Also, the 72 Twitter Topic Detection and Tracking (TDT) (Aiello et al., 2013) dataset and

Experiments and evaluation

Various experiments have been performed on the basis of the proposed method. The datasets used for this research work are the FSD (First Story Detection) dataset, the 72 Twitter TDT dataset and the Rio Olympics dataset, which is available in the public domain. In the FSD dataset, 51,879,318 tweet IDs are given. For evaluating first story detection, 3034 tweets have been tagged with a topic. A total of 27 topics have been marked as ground truth data against which the evaluated results have been compared. Tweets of about randomly

Conclusion

This research work deals with word co-occurrence network analysis. Keyphrases have been obtained from textual networks and further ranked using the AHP optimization technique. It has been observed that as the approximation value p varies, the threshold t varies and hence the longest sequence obtained also varies. For the results obtained using ranking by AHP, the keyphrases are ranked in three categories, namely headline, which contains a lower number of keywords, detailed description which
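As a rough illustration of how AHP (Saaty, 1990) turns pairwise comparisons into ranking weights, the sketch below derives priority weights for four attributes using the row geometric mean approximation and checks the consistency ratio. The four attributes and their pairwise judgements are not given in this excerpt, so the comparison matrix, the attribute scores and all function names here are hypothetical; the random index of 0.90 is the commonly tabulated value for a 4 x 4 matrix.

```python
import math


def ahp_weights(matrix):
    """Approximate AHP priority weights by normalised row geometric means."""
    n = len(matrix)
    geometric_means = [math.prod(row) ** (1.0 / n) for row in matrix]
    total = sum(geometric_means)
    return [value / total for value in geometric_means]


def consistency_ratio(matrix, weights, random_index=0.90):
    """Consistency ratio CR = CI / RI, with CI = (lambda_max - n) / (n - 1)."""
    n = len(matrix)
    weighted = [sum(matrix[i][j] * weights[j] for j in range(n)) for i in range(n)]
    lambda_max = sum(weighted[i] / weights[i] for i in range(n)) / n
    consistency_index = (lambda_max - n) / (n - 1)
    return consistency_index / random_index


def rank_phrases(scores, weights):
    """Rank phrases by the weighted sum of their attribute scores."""
    totals = {
        phrase: sum(w * s for w, s in zip(weights, values))
        for phrase, values in scores.items()
    }
    return sorted(totals, key=totals.get, reverse=True)


# Hypothetical pairwise comparisons between four unnamed attributes.
comparisons = [
    [1.0, 2.0, 4.0, 4.0],
    [0.5, 1.0, 2.0, 2.0],
    [0.25, 0.5, 1.0, 1.0],
    [0.25, 0.5, 1.0, 1.0],
]
weights = ahp_weights(comparisons)
ratio = consistency_ratio(comparisons, weights)

# Hypothetical normalised attribute scores for two candidate phrases.
phrase_scores = {
    "phrase_a": [0.9, 0.2, 0.1, 0.1],
    "phrase_b": [0.3, 0.8, 0.5, 0.5],
}
ranking = rank_phrases(phrase_scores, weights)
print(weights, ratio, ranking)
```

A consistency ratio below 0.1 is the usual rule of thumb for accepting a comparison matrix; the toy matrix here is perfectly consistent, so its ratio is numerically zero.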

Appendix

Rank	Phrase Num	Phrase	Number of Words	Inference
1	2	rip amy winehouse dies from	5	Important headlines obtained
2	3	rip amy winehouse dead her for bellatrix	7
3	10	singer amy winehouse dies from	5
4	4	rip amy winehouse found dead her for bellatrix	8
5	11	singer amy winehouse dead her for bellatrix	7
6	7	rip amy winehouse has died the cause death unknown but there are rumors mrs weasley mistook her for bellatrix	19	Detailed description of topic has been obtained
7	12	singer amy winehouse found dead her for bellatrix	8
8	6	rip amy

References (21)

W.D. Abilhoa et al. A keyword extraction method from twitter messages represented as graphs. Applied Mathematics and Computation (2014)
D. Chen et al. Identifying influential nodes in complex networks. Physica A: Statistical Mechanics and its Applications (2012)
T.L. Saaty. How to make a decision: The analytic hierarchy process. European Journal of Operational Research (1990)
L.M. Aiello et al. Sensing trending topics in Twitter. IEEE Transactions on Multimedia (2013)
S. Beliga et al. An overview of graph-based keyword extraction methods and approaches. Journal of Information and Organizational Sciences (2015)
J. Borge-Holthoefer et al. Semantic networks: Structure and dynamics. Entropy (2010)
Boudin, F. (2013, October). A comparison of centrality measures for graph-based keyphrase extraction. In International...
G. Erkan et al. LexRank: Graph-based lexical centrality as salience in text summarization. Journal of Artificial Intelligence Research (2004)
Grineva, M., Grinev, M., & Lizorkin, D. (2009, April). Extracting key terms from noisy and multitheme documents. In...
Lahiri, S., Choudhury, S. R., & Caragea, C. (2014). Keyword and keyphrase extraction using centrality measures on...

There are more references available in the full text version of this article.

Cited by (28)

Text characterization based on recurrence networks
2023, Information Sciences
Several complex systems are characterized by exhibiting intricate properties that occur at multiple scales. These multi-scale characterizations are used in various applications. In particular, texts can be characterized by a hierarchical structure, which can be approached by using multi-scale concepts and methods. Here, we adopt an extension of the multi-scale, mesoscopic approach – hereafter referred to as a recurrence network – to represent text narratives, in which only the recurrent relationships among tagged parts of speech (subject, verb and direct object) are considered to establish connections among sequential pieces of text. The characterization of the texts was then achieved by considering scale-dependent complementary methods: accessibility and symmetry. To evaluate the potential of these concepts, we approached the problem of distinguishing between meaningful and meaningless texts and different literary genres (namely, fiction and non-fiction). A set of 300 books was considered and compared by using the above approaches. The recurrence network characterization was able to discriminate to some extent between real and meaningless and between the two genres assessed. Thus, our results indicate that recurrence networks are able to capture subtleties in book plots, suggesting that a similar methodology can be used in related networked applications.
KEST: A graph-based keyphrase extraction technique for tweets summarization using Markov Decision Process
2022, Expert Systems with Applications
Citation Excerpt :
The WCN is a path based network which helps in summarizing the multiple tweets using graphical methods. The node in WCN gives importance of the word and because of repetitive nature of data, the overlapping of path (edges) also plays an important role in summarizing the information using graphical methods (Garg & Kumar, 2018a). In this research work, a dynamic chain of such segments is obtained using Markov Decision Process (MDP) with maintained lexical sequence of words using directed WCN.
Multi-document summarization finds its application in many downstream information retrieval and natural language processing tasks. In the light of recent developments in social media data mining, Tweet summarization has emerged as a fundamental task of automatically detecting important keyphrases from a set of Tweets about current happenings. In the existing literature, the graph-based keyphrase extraction techniques are well-established unsupervised algorithms to capture summaries from dynamically evolving data. We argue that the traditional multi-tweet summarization technique may or may not capture user’s interest-specific keyphrases during tweet summarization. The nature of user-generated factual short-text is different from well-formed descriptive and perceptual long-text due to their repetitive nature. In this context, we introduce a simple yet effective interest-specific keyphrase extraction technique for tweet summarization as KEST: Key Extraction for Summarization of Tweets using Markov Decision Process (MDP). In this research work, we generate a path as evolving chain of highly interconnected words from sub-components in graph of words. We evaluate the effectiveness of our computationally inexpensive, graph-based, abstractive keyphrase extraction approach over two datasets which we make publicly available.
UBIS: Unigram Bigram Importance Score for Feature Selection from Short Text
2022, Expert Systems with Applications
Citation Excerpt :
Similar hybrid filter-wrapper algorithm was used for feature selection (Alyasiri, Cheah, & Abasi, 2021). Many rich and poor optimization algorithms were exploited for text classification to reduce the high dimensional feature space (Thirumoorthy & Muneeswaran, 2021). Some other feature selection algorithms are multi-criteria decision making techniques (Garg & Kumar, 2018a; Kou et al., 2020), Particle Swarm Optimization (Akhtar, Gupta, Ekbal, & Bhattacharyya, 2017), Jaya Algorithm (Thirumoorthy & Muneeswaran, 2020), Grey Wolf Optimizer (Asgarnezhad, Monadjemi, & Soltanaghaei, 2021) and Grasshopper Optimization Algorithm (Purushothaman, Rajagopalan, & Dhandapani, 2020). Recently, a feature selection algorithm was proposed over independent feature space search (Liu, Ju, Wang, & Su, 2020).
A huge amount of data has been generated over the internet over the past few decades, and it is increasing exponentially. It has become difficult to manually classify the online and offline short textual documents. In this context, two major feature extraction techniques are used in existing literature, namely, TFIDF vectorizer and Count vectorizer. The major challenge in the existing feature extraction techniques is the number of textual features extracted. The textual feature reduction techniques are associated with the use of features and its correlation with resulting value or category. However, it is interesting to note that the importance of uni-grams and bi-grams may contribute more efficiently in determining the feature space vector. In this research work, the Graph of Words (GoW) based selective feature extraction technique is proposed as Uni-gram Bi-gram Importance Score (UBIS) as obtained from node score and edge score in Graph of Words. The experimental results show the effectiveness of the UBIS over TFIDF vectorizer and Count Vectorizer which are hybridized with feature selection techniques. To test and validate the experiments, logistic regression with gradient descent is used as the linear classification model over three different binary text classification datasets.
A network-based feature extraction model for imbalanced text data
2022, Expert Systems with Applications
Citation Excerpt :
Its global properties can not be inferred from the local interactions between certain node groups but emerge from the interactions of the whole elements (Cong, & Liu, 2014). It is a hot field to study the natural language with complex networks, and a considerable number of researches have emerged in the last few years (Cong, & Liu, 2014; Arruda et al., 2016; Akimushkin et al., 2017; Garg, & Kumar, 2018). A network model of text is constructed with the language units (words or phrases) as the nodes connected by their interrelations (Cong, & Liu, 2014).
The explosive growth of text data has attracted many researchers to explore the efficient method to extract valuable hidden information. Many technologies, especially deep learning methods, have achieved great success in text analysis. However, the most powerful methods always require a considerable quantity of data for training, which may suffer from imbalanced data in some cases. In this paper, we propose a network-based Convolution Neural Network (NCNN) to mitigate the effect of imbalanced data. The proposed model first generates new synthetic samples for the imbalanced data based on the random walking of the network. Then an extra layer called Polar Layer is introduced to connect the output from the network model of the text to the classical CNN. Two electing strategies (n-NCNN and x-NCNN) are proposed to improve the performance of NCNN further. In the experimental section, the proposed model is applied to Reuters 21578 and WebKb. By comparing with six approaches, we prove the effectiveness of the proposed NCNN model on the imbalanced text data.
A network-based CNN model to identify the hidden information in text data
2022, Physica A: Statistical Mechanics and its Applications
With the development of the internet and big data, the missing or hidden information identification of text data has become an imperative task. At present, the challenge in the hidden information study is judging whether there is hidden information and where it exists. In this paper, hidden information refers to the words that do not appear in a sentence, however, they have certain correlations with the existing words or sentence and have a great influence on the comprehension of a sentence or part of the text data. This paper focuses on discovering the key and influential hidden information in the text data. A keyword-based hidden information extraction framework is proposed in this paper to search hidden entities, with the assumption that the importance of hidden objects is reflected by the keywords in the text data. A network-based Convolution Neural Network (CNN) model is developed to identify the hidden information related to keywords. The model is based on the results of CNN, and cosine similarity is used to judge whether there is hidden information in the source text data or not. We primarily form the word co-occurrence network of text, select the words with the highest degree as keywords, and generate random walk paths on the network. Besides, we use the random walk path where the last word is the keyword to train CNN. In the experimental section, the proposed model is applied to the dataset in 20Newgroups. The results show that the proposed model can effectively identify the hidden information associated with the keywords in the source text data, and the detection accuracy of keywords can reach 98%–99% achieved by CNN.
An eye-tracking attention based model for abstractive text headline
2019, Cognitive Systems Research
Online network platforms provide great convenience for users to obtain information. However, it’s challenging to select the required information from enormous texts. Automatic text headline generation methods not only guide users to select the information they are interested in, but also solve the problem of information overload. Nevertheless, the existing works mainly utilize the grammar rules to obtain the key information of the source text, while ignoring the dwell time of user’s attention on different text contents. To address this issue, this paper proposes an abstractive text headline generation model based on the eye-tracking attention mechanism. Specifically, this model first relies on the eye-tracking data to establish the mapping relationship between text words and the words’ reading time. Then, an eye-tracking attention mechanism is constructed to judge the importance of different words. Finally, this attention mechanism is integrated into the encoder-decoder framework to generate a high-quality headline. Experimental results obtained from different datasets demonstrate that the headline generated by our model is more concise. Moreover, our proposed model outperforms significantly the classical headline generation models on ROUGE-1, ROUGE-2 and ROUGE-L.
