
Text structuring methods based on complex network: a systematic review

Published in Scientometrics.

Abstract

Currently, a large amount of text is shared through the Internet. These texts are available in different forms: structured, unstructured, and semi-structured. There are different ways of analyzing texts, but domain experts usually divide this process into steps such as pre-processing, feature extraction, and a final step that could be classification, clustering, summarization, or keyword extraction, depending on the purpose of the analysis. For this processing, several approaches have been proposed in the literature based on variations of methods such as artificial neural networks and deep learning. In this paper, we conducted a systematic review of papers dealing with the use of complex network approaches for analyzing text. The main results showed that complex network topological properties, measures, and modeling can capture and identify text structures for different purposes, such as text analysis, classification, topic and keyword extraction, and summarization. We conclude that complex network topological properties provide promising strategies for processing texts, considering their different aspects and structures.
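
The pipeline described in the abstract (pre-processing, network modeling, and a purpose-specific final step such as keyword extraction) can be illustrated with a minimal sketch in Python. The tokenizer, stopword list, and degree-based ranking below are simplifying assumptions for illustration, not the method of any particular reviewed paper:

```python
from collections import defaultdict
import re

def cooccurrence_network(text, stopwords=frozenset()):
    """Build an undirected word co-occurrence network: nodes are words,
    edges link words that appear next to each other in the text."""
    tokens = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in stopwords]
    adjacency = defaultdict(set)
    for a, b in zip(tokens, tokens[1:]):
        if a != b:
            adjacency[a].add(b)
            adjacency[b].add(a)
    return adjacency

def rank_by_degree(adjacency, top=3):
    """Rank words by node degree, a simple network-based keyword criterion."""
    return sorted(adjacency, key=lambda w: len(adjacency[w]), reverse=True)[:top]

text = ("complex networks model texts and texts become networks "
        "so network measures describe texts")
net = cooccurrence_network(text, stopwords={"and", "so"})
print(rank_by_degree(net))
```

More elaborate variants in the reviewed literature replace adjacency by syntactic or semantic links and degree by other centrality measures, but the three-stage structure stays the same.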


Notes

  1. https://www.ncbi.nlm.nih.gov/pubmed.

  2. https://mjl.clarivate.com/home.

  3. https://dblp.uni-trier.de/.

  4. TF–IDF means “Term Frequency–Inverse Document Frequency”, a technique to quantify a word in documents. It is used to compute a weight representing the importance of the word in the corpus.

  5. Approach that computes vector representations of words from large datasets.

  6. Grant #2019/27797-2, São Paulo Research Foundation (FAPESP).

  7. Grant #2019/12787-1, São Paulo Research Foundation (FAPESP).

  8. Grant #2019/07960-6, São Paulo Research Foundation (FAPESP).

  9. Grant #2019/07461-0, São Paulo Research Foundation (FAPESP).

  10. Grant #2017/27251-4, São Paulo Research Foundation (FAPESP).
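
The TF–IDF weighting described in Note 4 can be sketched as follows. The logarithmic IDF variant and the toy corpus are illustrative assumptions, not a prescription from the reviewed papers:

```python
import math

def tf_idf(term, doc, corpus):
    """TF-IDF as in Note 4: a term's frequency within a document, scaled
    down by how common the term is across the whole corpus."""
    tf = doc.count(term) / len(doc)                  # term frequency
    df = sum(1 for d in corpus if term in d)         # document frequency
    idf = math.log(len(corpus) / df) if df else 0.0  # inverse document frequency
    return tf * idf

corpus = [["complex", "network", "text"],
          ["text", "mining", "text"],
          ["network", "measures"]]
doc = corpus[1]
# "mining" occurs in only one document, so it outweighs the common word "text".
print(tf_idf("mining", doc, corpus), tf_idf("text", doc, corpus))
```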


Acknowledgements

The authors thank FAPESP for financial support (Notes 6–10).

Author information

Corresponding author

Correspondence to Samuel Zanferdini Oliva.

Appendices

Appendix 1: Complex network measures

  • Accessibility \({\alpha }_{i}^{h} =exp\left(-\sum {p}_{i,j}^{\left(h\right)}\text{log}{p}_{i,j}^{\left(h\right)}\right)\), where \({p}_{i,j}^{\left(h\right)}\) is the probability of a random walker to reach a given node j departing from i, in h steps.

  • Aggregation coefficient \({S}_{i}=\frac{{K}_{i}}{\left(\genfrac{}{}{0pt}{}{{D}_{i}}{2}\right)}=\frac{2{K}_{i}}{{D}_{i}({D}_{i}-1)}\), where \({D}_{i}\) represents the degree of node \({v}_{i}\), and \({K}_{i}\) represents the degree of aggregation of node \({v}_{i}\), defined as \({K}_{i}=\sum _{{v}_{j},{v}_{k}}{\beta }_{i,j,k}\), where \({\beta }_{i,j,k}\) represents the connection between nodes \({v}_{i}\), \({v}_{j}\) and \({v}_{k}\) \((i\ne j\ne k)\).

  • Assortativity \({\Gamma }=\frac{\frac{1}{M}\sum _{j>i}{k}_{i}{k}_{j}{a}_{ij}-{\left[\frac{1}{M}\sum _{j>i}\frac{1}{2}({k}_{i}+{k}_{j}){a}_{ij}\right]}^{2}}{\frac{1}{M}\sum _{j>i}\frac{1}{2}({k}_{i}^{2}+{k}_{j}^{2}){a}_{ij}-{\left[\frac{1}{M}\sum _{j>i}\frac{1}{2}({k}_{i}+{k}_{j}){a}_{ij}\right]}^{2}}\), where \(M\) is the number of edges of the network, and \({a}_{ij}=1\) if nodes \(i\) and \(j\) are connected and \({a}_{ij}=0\) otherwise.

  • Assortative coefficient \(r=\frac{\sum _{jk}jk\left({e}_{jk}-{q}_{j}{q}_{k}\right)}{{\sigma }_{q}^{2}}\), where \({e}_{jk}\) is the joint probability distribution of the remaining degrees of the two nodes \(j\) and \(k\), \({q}_{k}\) is the distribution of the remaining degrees, and \({\sigma }_{q}^{2}\) is the maximal value of the numerator, attained for a perfectly assortative network.

  • Average clustering coefficient \(C=\frac{1}{N}\sum _{i}{c}_{i}\), where \({c}_{i}\) is the clustering coefficient of vertex \(i\).

  • Average connection density \({K}_{den}=\frac{\left|E\right|}{{\left|N\right|}^{2}-\left|N\right|}\), where \(\left|E\right|\) is the cardinality of the set of edges and \(\left|N\right|\) is the cardinality of the set of nodes.

  • Average degree \(\langle K \rangle =\frac{K}{N}\), where \(N\) is the number of vertices, \(N=\left|V\right|\), and \(K\) is the number of edges, \(K=\left|E\right|\).

  • Average path length \(L=\sum _{i,j}\frac{{d}_{ij}}{N(N-1)}\), where \({d}_{ij}\) is the shortest path between \(i\) and \(j\), and \(N\) is the number of vertices.

  • Average shortest path when the network contains node \(k\), the average shortest path of the network is \({L}_{k}=\frac{\sum _{i\ne j}{d}_{ij}}{N}\;(i,j\in V)\); when the network does not contain node \(k\), it is \(L=\frac{\sum _{i\ne k\ne j}{d}_{ij}}{N-1}\;(i,j\in V)\).

  • Average shortest path length \(L=\frac{1}{N(N-1)}\sum _{i\ne j}{d}_{ij}\), where \({d}_{ij}\) is the shortest path between \(i\) and \(j\), and \(N\) is the number of vertices.

  • Betweenness (B) \({B}_{i} ={\sum }_{s,t}\frac{{g}_{s,t}^{i}}{{g}_{s,t}}={\sum }_{j}{w}_{ji}\), where \({g}_{s,t}^{i}\) is the number of shortest paths connecting nodes s and t that include node i, and \({g}_{s,t}\) is the number of shortest paths connecting s and t, for all pairs s and t.

  • Betweenness centrality \({bc}_{i}=\sum _{t\ne k\ne i}\frac{{\sigma }_{tk\left(i\right)}}{{\sigma }_{tk}}\), where \({\sigma }_{tk}\) is defined as the total number of shortest paths from node t to k, and \({\sigma }_{tk\left(i\right)}\) is the number of shortest paths from node \(t\) to \(k\) passing through node \(i\).

  • Cliques complete (fully connected) subgraphs of the network.

  • Comprehensive Eigenvalue \(CE=\alpha \varDelta L+\beta \varDelta C+\gamma {C}_{b\left(k\right)}\), where \(\varDelta L\) is the shortest path variation of node \(k\), \(\varDelta C\) is the clustering coefficient variation and \({C}_{b\left(k\right)}\) is the betweenness of node \(k\) and \(\alpha +\beta +\gamma =1\).

  • Coreness is a network degeneracy property that decomposes the graph (\(G\)) into a set of maximal connected subgraphs \({G}_{k}\), where \(k\) denotes the core, such that nodes in \(G\) have degree at least \(k\) within the subgraph and \({G}_{k}\subseteq {G}_{k+1}.\) Coreness of a node is the highest core to which it belongs.

  • Closeness centrality \({C}_{i}={l}_{i}^{-1}=n/{\sum }_{j}{d}_{ij}\), where \({l}_{i}\) is the average distance from node \(i\) to all the other nodes, and \({d}_{ij}\) is the length of a geodesic path connecting nodes \(i\) and \(j\).

  • Clustering coefficient it can be calculated as \({c}_{i}=2{n}_{i}/({k}_{i}^{2}-{k}_{i})\), where \({k}_{i}\) is the degree of vertex \(i\) and \({n}_{i}\) is the number of edges among the neighbors of \(i\).

  • Clustering coefficient variation clustering coefficient variation of node \(k\) is \(\varDelta C=\left|{C}_{k}-C\right|\), where \({C}_{k}\) is the average clustering coefficient when the network contains node \(k\) and \(C\) is the average clustering coefficient when the network does not contain node \(k\).

  • Degree of a vertex it quantifies the number of immediate neighbors of a vertex \(i\) and can be obtained as \({k}_{i} ={\sum }_{j}{A}_{ij}\), where \({A}_{ij}\) is the adjacency matrix such that \({A}_{ij}=1\) if there is an edge from vertex \(j\) to vertex \(i\), and 0 otherwise. In a directed network, each vertex has two degrees: the out-degree is the number of edges going out of a vertex, \({k}_{i}^{out} ={\sum }_{j}{A}_{ji}\), and the in-degree is the number of edges coming into a vertex, \({k}_{i}^{in} ={\sum }_{j}{A}_{ij}\). The total degree of the vertex is the sum of its in- and out-degree, \({k}_{i}^{tot}={k}_{i}^{in}+{k}_{i}^{out}\).

  • Degree centrality \({degree}_{i}=\frac{1}{(n-1)}\sum _{j\ne i}{m}_{ij}\), where \({m}_{ij}=1\) if node \(i\) is connected to node \(j\).

  • Degree distribution the degree distribution of a network, \(p\left(k\right), k=0, 1,\) ..., measures the proportion of nodes in the network having degree \(k\). Formally: \(p\left(k\right)=\frac{{n}_{k}}{n}\).

  • Diameter is the largest of all shortest paths, \(max\left\{{d}_{ij}\right\}\), where \({d}_{ij}\) is the shortest path between nodes \(i\) and \(j\).

  • Density \(D=\frac{K}{N(N-1)}\), where \(N\) is the number of vertices, \(N=\left|V\right|\), and \(K\) is the number of edges, \(K=\left|E\right|\).

  • Eccentricity the eccentricity of a node \(i\) is a centrality index equal to the maximum length of all shortest paths from \(i\) to the other nodes in the network.

  • Efficiency \({E}_{glob}\left(G\right)=\frac{1}{N(N-1)}\sum _{i\ne j\in G}\frac{1}{{d}_{ij}}\), where \({d}_{ij}\) is the shortest path between vertices \(i\) and \(j\), \(N\) is the number of vertices, \(N=\left|V\right|\), and \(G\) is the graph composed of vertices \(V\) and a set of edges \(E\). Since efficiency is also defined for disconnected graphs, the local efficiency can also be analyzed and is defined as the average efficiency of the local subgraphs and can be formally stated as: \({E}_{loc}\left(G\right)=\frac{1}{N}\sum _{i\in G}{E}_{glob}\left({G}_{i}\right), i\notin {G}_{i}\), where \({G}_{i}\) is the subgraph of the neighbors of vertex \(i\).

  • Eigenvector centrality the eigenvector centrality assigns a value to a given node \(i\) proportional to the sum of the eigenvector centrality values of the nodes connected to \(i\).

  • Frequency of labelled motifs \({n}_{w,m}=\frac{{\tilde{n}}_{w,m}}{{n}_{m}}\), where \({\tilde{n}}_{w,m}\) is the total number of occurrences of node \(w\) in motif m and \({n}_{m}\) is the total number of occurrences of motif \(m\), irrespective of any node labels.

  • Fraction of reciprocal connections mathematically, a reciprocal connection exists if \({M}_{i,j}>0\) and \({M}_{j,i}>0\) for \(i\ne j\), where \(M\) is the weight matrix. The ratio of the number of reciprocal edges to the number of edges from the network is a measure known as the fraction of reciprocal connections (FRC).

  • Generalized Accessibility \({\alpha }_{i}^{\left(\infty \right)}=exp\left(-\sum {P}_{i,j}\text{log}{P}_{i,j}\right)\), where \(P\) is the probability transition of all the pairs of nodes i and j.

  • Generalized selectivity \({C}_{D-in/out}^{w\alpha }={k}^{in/out} \times \left(\frac{{s}_{i}^{in/out}}{{k}_{i}^{in/out}}\right)={k}_{i}^{in/out}\times {\left({e}_{i}^{in/out}\right)}^{\alpha }, \alpha >0\), where \({s}_{i}^{in/out}\) is the in-strength and out-strength of the node \(i\) defined by \({s}_{i}^{in/out}=\sum _{j}{w}_{ij/ji}\), where \({w}_{ij/ji}\) represents the weight of link, and \({e}_{i}^{in/out}\) is the selectivity (or average strength) defined by \({e}_{i}^{in/out}=\frac{{s}_{i}^{in/out}}{{k}_{i}^{in/out}}\).

  • Intermittency it is not a traditional network measurement, but it has a strong relationship with the concept of cycle length in networks, and is used to measure how periodically a word is repeated.

  • Load centrality it is similar to betweenness centrality but considering weights on edges.

  • Local efficiency \({E}_{loc}=\frac{1}{N}\sum _{i\in G}{E}_{glob}\left({G}_{i}\right)\), \(i\notin {G}_{i}\), where \({G}_{i}\) is the subgraph of the neighbors of \(i\).

  • Locality index \({l}_{i}=\frac{{N}_{i}^{int}}{{N}_{i}^{int}+{N}_{i}^{ext}}\), where \({N}_{i}^{int}\) is the number of internal connections (the edges between the \({k}_{i}\) neighbors of \(i\), plus the \({k}_{i}\) edges that connect node \(i\) to its neighbors) and \({N}_{i}^{ext}\) is the number of external connections, i.e., the edges linking the neighbors of \(i\) to the rest of the network.

  • Matching index \({\mu }_{i,j}=\frac{{\sum }_{k\ne i,j}{a}_{ik}{a}_{jk}}{\sum _{k\ne j}{a}_{ik}+\sum _{k\ne i}{a}_{jk}}\), where \({a}_{ij}\) is an element of the adjacency matrix, and \({a}_{ij}=1\) if nodes \(i\) and \(j\) are connected.

  • Modularity: \(Q=\frac{1}{2m}{\sum }_{i=1}^{n}{\sum }_{j=1}^{n}\left[{a}_{ij}-\frac{{k}_{i}{k}_{j}}{2m}\right]\delta \left({c}_{i}, {c}_{j}\right)\), where \(m\) is the number of edges, \(n\) is the number of nodes, \(\delta \left({c}_{i}, {c}_{j}\right)=1\) if nodes \(i\) and \(j\) belong to the same class (community), and \(\delta \left({c}_{i}, {c}_{j}\right)=0\) otherwise. This measurement ranges in \(-1/2\le Q<1\). For \(Q>0\), the number of edges inside the communities is greater than expected by chance.
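A minimal sketch of this formula, assuming an undirected network given as a 0/1 adjacency matrix and a community label per node (names `adjacency` and `communities` are illustrative):

```python
def modularity(adjacency, communities):
    """Newman modularity Q of a partition.

    adjacency: symmetric 0/1 matrix (list of lists);
    communities: list assigning a community label to each node.
    """
    n = len(adjacency)
    degree = [sum(row) for row in adjacency]
    two_m = sum(degree)  # equals 2m for an undirected network
    q = 0.0
    for i in range(n):
        for j in range(n):
            if communities[i] == communities[j]:
                q += adjacency[i][j] - degree[i] * degree[j] / two_m
    return q / two_m
```

As a sanity check, a network made of two disconnected triangles, partitioned into those two triangles, yields Q = 0.5.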

  • Neighborhood: quantifies the number of nodes in the h-th concentric level around node \(i\).

  • Radius: the smallest of all eccentricities, \(\underset{i}{\text{min}}\,\underset{j}{\text{max}}\left({d}_{ij}\right)\), where \({d}_{ij}\) is the shortest path length between nodes \(i\) and \(j\).

  • Selectivity \({\theta }_{i}=\frac{{s}_{i}}{{k}_{i}}\), where \({s}_{i}\) is the strength of vertex \(i\) and \({k}_{i}\) is the degree of vertex \(i\).
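A minimal sketch of selectivity, assuming the weighted network is stored as a dict-of-dicts (`weights[i][j]` is the weight of the edge between \(i\) and \(j\); the representation is an assumption for illustration):

```python
def selectivity(weights, i):
    """Average edge weight (strength / degree) of node i.

    weights: dict mapping each node to a dict {neighbor: edge_weight}.
    """
    strength = sum(weights[i].values())   # s_i
    degree = len(weights[i])              # k_i
    return strength / degree if degree else 0.0
```

For a node with two incident edges of weights 2.0 and 4.0, the strength is 6.0, the degree is 2, and the selectivity is 3.0.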

  • Shortest path: the shortest path value of a given node \(i\) is defined by \({SP}_{i}=\sum _{j}{d}_{ij}\), where \({d}_{ij}\) is the shortest path length between nodes \(i\) and \(j\).

  • Shortest path variation: the shortest path variation of node \(k\) is the difference between the shortest path length of the network when it contains \(k\) and when it does not, \(\varDelta L=\left|{L}_{k}-L\right|\).

  • Symmetry: \({S}_{i}^{\left(h\right)}=\frac{exp\left(-\sum {p}_{i,j}^{\left(h\right)}\text{log}{p}_{i,j}^{\left(h\right)}\right)}{\left|{H}_{h}\left(i\right)\right|+{\sum }_{r=0}^{h-1}{\eta }_{r}}\), where \({H}_{h}\left(i\right)\) is the set of all nodes in the \(h\)-th hierarchical level of node \(i\), \(\left|{H}_{h}\left(i\right)\right|\) is the number of nodes in \({H}_{h}\left(i\right)\), and, for a given hierarchical level \(r\), \({\eta }_{r}\) is the number of nodes without edges connecting to the next hierarchical level.

  • Strength: \({s}_{i}^{in/out}=\sum _{j}{w}_{ij/ji}\), where \(w\) is the weight adjacency matrix and \({w}_{ij/ji}\) represents the weight of the link.

  • Strength of a node: \(strength\left({v}_{i}\right) ={\sum }_{j}{w}_{ij}={\sum }_{j}{w}_{ji}\), where \({w}_{ij}\) is the corresponding entry in the weight matrix \(W\) for edge (\({v}_{i}\), \({v}_{j}\)).

  • Strength centrality: \({SC}_{i}=\sum _{j}{w}_{ij}\), where \({SC}_{i}\) is the strength centrality of a given node \(i\) and \({w}_{ij}\) is the similarity value between nodes \(i\) and \(j\).

  • Transitivity: the transitivity of a network is the fraction of all possible triangles present in the network. Possible triangles are identified by the number of triads (two links sharing a vertex), so that \(T=(3\times \#triangles)/(\#triads)\).
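A minimal sketch over a 0/1 adjacency matrix. Note that summing, for each centre node, the pairs of its neighbors that are themselves connected already counts each triangle three times (once per centre), which matches the factor 3 in the formula:

```python
def transitivity(adjacency):
    """T = (3 * #triangles) / (#triads) for an undirected 0/1 adjacency matrix."""
    n = len(adjacency)
    closed = 0  # neighbor pairs that are connected; 3 * (#unique triangles)
    triads = 0  # all neighbor pairs (open or closed), centred at some node
    for i in range(n):
        neighbors = [j for j in range(n) if adjacency[i][j]]
        k = len(neighbors)
        triads += k * (k - 1) // 2
        for a in range(k):
            for b in range(a + 1, k):
                if adjacency[neighbors[a]][neighbors[b]]:
                    closed += 1
    return closed / triads if triads else 0.0
```

A triangle (complete graph on three nodes) gives T = 1, while a three-node path gives T = 0, since its single triad is open.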

  • Weight distribution: can be computed through a probability density function, represented by a histogram of the edge weights.

Appendix 2: Evaluation metrics

  • Accuracy \(accuracy=\frac{TP+TN}{TP+FP+FN+TN}\).

  • Area under the curve (AUC)–Receiver operating characteristic (ROC) curve: a performance measurement for classification problems at various threshold settings. ROC is a probability curve and AUC represents the degree of separability. The ROC curve plots the true positive rate (TPR) on the y-axis against the false positive rate (FPR) on the x-axis.

  • F1-score: \(F1=\frac{2PR}{P+R}\), where \(P\) is precision and \(R\) is recall. There are also the F2-score, \(F2 =\frac{5PR}{4P+R}\), and the Fβ-score, \(F\beta =\left(1+{\beta }^{2}\right)\frac{PR}{{\beta }^{2}P+R}\), but these were not used in the papers addressed in this study.
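The general Fβ formula above subsumes F1 (β = 1) and F2 (β = 2); a minimal sketch:

```python
def f_beta(precision, recall, beta=1.0):
    """Generalized F-score: F1 for beta=1, F2 for beta=2."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

With precision 0.5 and recall 1.0, F1 = 2/3, while F2 = 5/6; the larger β weights recall more heavily.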

  • Link perplexity: perplexity is a measure of a model's ability to generalize to unseen data, defined as the reciprocal geometric mean of the likelihood of a test corpus given a model and the collapsed state of the Markov chain. It can be used to evaluate the topic dependency relationship structure through the link perplexity, which can be formally stated as \(linkPerp=exp\left(-\frac{{\sum }_{i=P+1}^{L}{log}\,p\left({\tilde{l}}_{i}\mid {l}_{1:P}\right)}{\left|L\right|}\right)\), where \({l}_{1:P}\) represents the topic relationship links from a topic dependency network, \(p\left({\tilde{l}}_{i}\mid {l}_{1:P}\right)\) is the predictive distribution of the remaining dependency relationships, and \(L\) is the set of links of the topic relationship network.

  • Macro average F1 value: \(MacroF1=\frac{MacroP\times MacroR\times 2}{MacroP+MacroR}\).

  • Macro average recall rate: \(MacroR=\frac{1}{n}\sum _{i=1}^{n}{R}_{i}\), where \({R}_{i}\) is the recall rate of class \(i\) text, and \(n\) is the total number of categories.

  • Macro average accuracy rate: \(MacroP=\frac{1}{n}\sum _{i=1}^{n}{P}_{i}\), where \({P}_{i}\) is the accuracy rate of class \(i\) text and \(n\) is the total number of categories.

  • Precision: \(P=\frac{\left|A\cap B\right|}{\left|A\right|}\), where \(A\) is the collection of documents (words) extracted by an algorithm and \(B\) is the collection of documents (keywords) in the text files.

  • p-value represents the likelihood of obtaining the corresponding accuracy rate in a random classification.

  • Recall: \(R=\frac{\left|A\cap B\right|}{\left|B\right|}\), where \(A\) is the collection of documents (words) extracted by an algorithm and \(B\) is the collection of documents (keywords) in the text files.
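The set-based precision and recall above can be sketched directly with Python sets (the function name is illustrative):

```python
def precision_recall(extracted, reference):
    """Set-based precision and recall.

    extracted: items (e.g., keywords) produced by an algorithm (set A);
    reference: ground-truth items (set B).
    """
    extracted, reference = set(extracted), set(reference)
    overlap = len(extracted & reference)  # |A ∩ B|
    precision = overlap / len(extracted) if extracted else 0.0
    recall = overlap / len(reference) if reference else 0.0
    return precision, recall
```

For instance, extracting {a, b, c} against a reference {b, c, d, e} gives precision 2/3 and recall 1/2.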

  • Recall-Oriented Understudy for Gisting Evaluation (ROUGE): \(ROUGE\text{-}N=\frac{\sum _{S\in \left\{referenceSummaries\right\}}{\sum }_{{gram}_{n}\in S}{count}_{match}\left({gram}_{n}\right)}{\sum _{S\in \left\{referenceSummaries\right\}}{\sum }_{{gram}_{n}\in S}count\left({gram}_{n}\right)}\), where \(n\) is the length of the n-gram \({gram}_{n}\), and \({count}_{match}\left({gram}_{n}\right)\) is the maximum number of n-grams co-occurring in the candidate summary and a reference summary.
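A minimal sketch of ROUGE-N, assuming candidate and reference summaries are already tokenized into word lists (clipped n-gram counts via `collections.Counter`):

```python
from collections import Counter

def rouge_n(candidate, references, n=1):
    """ROUGE-N: recall of candidate n-grams against reference summaries.

    candidate: list of tokens; references: list of token lists.
    """
    def ngram_counts(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand_counts = ngram_counts(candidate)
    matched = total = 0
    for ref in references:
        ref_counts = ngram_counts(ref)
        total += sum(ref_counts.values())
        # clipped matches: an n-gram counts at most as often as it appears in the reference
        matched += sum(min(count, ref_counts[gram])
                       for gram, count in cand_counts.items() if gram in ref_counts)
    return matched / total if total else 0.0
```

For example, the candidate "the cat sat" against the single reference "the cat sat on the mat" shares 3 of the reference's 6 unigrams, so ROUGE-1 = 0.5.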


Cite this article

Oliva, S.Z., Oliveira-Ciabati, L., Dezembro, D.G. et al. Text structuring methods based on complex network: a systematic review. Scientometrics 126, 1471–1493 (2021). https://doi.org/10.1007/s11192-020-03785-y
