
Text structuring methods based on complex network: a systematic review

Published in Scientometrics.

Abstract

Currently, a large amount of text is shared through the Internet. These texts are available in different forms: structured, unstructured, and semi-structured. There are different ways of analyzing texts, but domain experts usually divide this process into steps such as pre-processing, feature extraction, and a final step that could be classification, clustering, summarization, or keyword extraction, depending on the purpose of the analysis. For this processing, several approaches have been proposed in the literature based on variations of methods such as artificial neural networks and deep learning. In this paper, we conducted a systematic review of papers dealing with the use of complex network approaches for analyzing text. The main results showed that complex network topological properties, measures, and modeling can capture and identify text structures for different purposes, such as text analysis, classification, topic and keyword extraction, and summarization. We conclude that complex network topological properties provide promising strategies for processing texts, considering their different aspects and structures.
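
The pipeline described in the abstract (pre-processing, network modeling, and a purpose-specific final step such as keyword extraction) can be illustrated with a minimal sketch in Python. The tokenizer, stopword list, and degree-based ranking below are simplifying assumptions for illustration, not the method of any particular reviewed paper:

```python
from collections import defaultdict
import re

def cooccurrence_network(text, stopwords=frozenset()):
    """Build an undirected word co-occurrence network: nodes are words,
    edges link words that appear next to each other in the text."""
    tokens = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in stopwords]
    adjacency = defaultdict(set)
    for a, b in zip(tokens, tokens[1:]):
        if a != b:
            adjacency[a].add(b)
            adjacency[b].add(a)
    return adjacency

def rank_by_degree(adjacency, top=3):
    """Rank words by node degree, a simple network-based keyword criterion."""
    return sorted(adjacency, key=lambda w: len(adjacency[w]), reverse=True)[:top]

text = ("complex networks model texts and texts become networks "
        "so network measures describe texts")
net = cooccurrence_network(text, stopwords={"and", "so"})
print(rank_by_degree(net))
```

More elaborate variants in the reviewed literature replace adjacency by syntactic or semantic links and degree by other centrality measures, but the three-stage structure stays the same.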


Notes

  1. https://www.ncbi.nlm.nih.gov/pubmed.

  2. https://mjl.clarivate.com/home.

  3. https://dblp.uni-trier.de/.

  4. TF–IDF means “Term Frequency–Inverse Document Frequency”, a technique to quantify a word in documents. It is used to compute a weight representing the importance of the word in the corpus.

  5. Approach that computes vector representations of words from large datasets.

  6. Grant #2019/27797-2, São Paulo Research Foundation (FAPESP).

  7. Grant #2019/12787-1, São Paulo Research Foundation (FAPESP).

  8. Grant #2019/07960-6, São Paulo Research Foundation (FAPESP).

  9. Grant #2019/07461-0, São Paulo Research Foundation (FAPESP).

  10. Grant #2017/27251-4, São Paulo Research Foundation (FAPESP).
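
The TF–IDF weighting described in Note 4 can be sketched as follows. The logarithmic IDF variant and the toy corpus are illustrative assumptions, not a prescription from the reviewed papers:

```python
import math

def tf_idf(term, doc, corpus):
    """TF-IDF as in Note 4: a term's frequency within a document, scaled
    down by how common the term is across the whole corpus."""
    tf = doc.count(term) / len(doc)                  # term frequency
    df = sum(1 for d in corpus if term in d)         # document frequency
    idf = math.log(len(corpus) / df) if df else 0.0  # inverse document frequency
    return tf * idf

corpus = [["complex", "network", "text"],
          ["text", "mining", "text"],
          ["network", "measures"]]
doc = corpus[1]
# "mining" occurs in only one document, so it outweighs the common word "text".
print(tf_idf("mining", doc, corpus), tf_idf("text", doc, corpus))
```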


Acknowledgements

The authors thank FAPESP for financial support (Notes 6–10).

Author information

Corresponding author

Correspondence to Samuel Zanferdini Oliva.

Appendices

Appendix 1: Complex network measures

  • Accessibility \({\alpha }_{i}^{h} =exp\left(-\sum {p}_{i,j}^{\left(h\right)}\text{log}{p}_{i,j}^{\left(h\right)}\right)\), where \({p}_{i,j}^{\left(h\right)}\) is the probability of a random walker to reach a given node j departing from i, in h steps.

  • Aggregation coefficient \({S}_{i}=\frac{{K}_{i}}{\left(\genfrac{}{}{0pt}{}{{D}_{i}}{2}\right)}=\frac{2{K}_{i}}{{D}_{i}({D}_{i}-1)}\), where \({D}_{i}\) represents the degree of node \({v}_{i}\), and \({K}_{i}\) represents the degree of aggregation of node \({v}_{i}\), defined as \({K}_{i}=\sum _{{v}_{j},{v}_{k}}{\beta }_{i,j,k}\), where \({\beta }_{i,j,k}\) represents the connection between nodes \({v}_{i}\), \({v}_{j}\) and \({v}_{k}\) \((i\ne j\ne k)\).

  • Assortativity \({\Gamma }=\frac{\frac{1}{M}\sum _{j>i}{k}_{i}{k}_{j}{a}_{ij}-{\left[\frac{1}{M}\sum _{j>i}\frac{1}{2}({k}_{i}+{k}_{j}){a}_{ij}\right]}^{2}}{\frac{1}{M}\sum _{j>i}\frac{1}{2}({k}_{i}^{2}+{k}_{j}^{2}){a}_{ij}-{\left[\frac{1}{M}\sum _{j>i}\frac{1}{2}({k}_{i}+{k}_{j}){a}_{ij}\right]}^{2}}\), where \(M\) is the number of edges of the network, and \({a}_{ij}=1\) if nodes \(i\) and \(j\) are connected and \({a}_{ij}=0\) otherwise.

  • Assortative coefficient \(r=\frac{\sum _{jk}jk\left({e}_{jk}-{q}_{j}{q}_{k}\right)}{{\sigma }_{q}^{2}}\), where \({e}_{jk}\) is the joint probability distribution of the remaining degrees of the two nodes \(j\) and \(k\), \({q}_{k}\) is the distribution of the remaining degrees, and \({\sigma }_{q}^{2}\) is the maximal value of the numerator, attained for a perfectly assortative network.

  • Average clustering coefficient \(C=\frac{1}{N}\sum _{i}{c}_{i}\), where \({c}_{i}\) is the clustering coefficient of vertex \(i\).

  • Average connection density \({K}_{den}=\frac{\left|E\right|}{{\left|N\right|}^{2}-\left|N\right|}\), where \(\left|E\right|\) is the cardinality of the set of edges and \(\left|N\right|\) is the cardinality of the set of nodes.

  • Average degree \(\langle K \rangle =\frac{K}{N}\), where \(N\) is the number of vertices, \(N=\left|V\right|\), and \(K\) is the number of edges, \(K=\left|E\right|\).

  • Average path length \(L=\sum _{i,j}\frac{{d}_{ij}}{N(N-1)}\), where \({d}_{ij}\) is the shortest path between \(i\) and \(j\), and \(N\) is the number of vertices.

  • Average shortest path when the network contains node \(k\), the average shortest path of the network is \({L}_{k}=\frac{\sum _{i\ne j}{d}_{ij}}{N}\;(i,j\in V)\); when the network does not contain node \(k\), it is \(L=\frac{\sum _{i\ne k\ne j}{d}_{ij}}{N-1}\;(i,j\in V)\).

  • Average shortest path length \(L=\frac{1}{N(N-1)}\sum _{i\ne j}{d}_{ij}\), where \({d}_{ij}\) is the shortest path between \(i\) and \(j\), and \(N\) is the number of vertices.

  • Betweenness (B) \({B}_{i} ={\sum }_{s,t}\frac{{g}_{s,t}^{i}}{{g}_{s,t}}={\sum }_{j}{w}_{ji}\), where \({g}_{s,t}^{i}\) is the number of shortest paths connecting nodes s and t that include node i, and \({g}_{s,t}\) is the number of shortest paths connecting s and t, for all pairs s and t.

  • Betweenness centrality \({bc}_{i}=\sum _{t\ne k\ne i}\frac{{\sigma }_{tk\left(i\right)}}{{\sigma }_{tk}}\), where \({\sigma }_{tk}\) is defined as the total number of shortest paths from node t to k, and \({\sigma }_{tk\left(i\right)}\) is the number of shortest paths from node \(t\) to \(k\) passing through node \(i\).

  • Cliques complete (fully connected) subgraphs of the network.

  • Comprehensive Eigenvalue \(CE=\alpha \varDelta L+\beta \varDelta C+\gamma {C}_{b\left(k\right)}\), where \(\varDelta L\) is the shortest path variation of node \(k\), \(\varDelta C\) is the clustering coefficient variation and \({C}_{b\left(k\right)}\) is the betweenness of node \(k\) and \(\alpha +\beta +\gamma =1\).

  • Coreness is a network degeneracy property that decomposes the graph (\(G\)) into a set of maximal connected subgraphs \({G}_{k}\), where \(k\) denotes the core, such that nodes in \(G\) have degree at least \(k\) within the subgraph and \({G}_{k}\subseteq {G}_{k+1}.\) Coreness of a node is the highest core to which it belongs.

  • Closeness centrality \({C}_{i}={l}_{i}^{-1}=n/{\sum }_{j}{d}_{ij}\), where \({l}_{i}\) is the average distance from node \(i\) to all the other nodes, and \({d}_{ij}\) is the length of a geodesic path connecting nodes \(i\) and \(j\).

  • Clustering coefficient it can be calculated as \({c}_{i}=2{n}_{i}/({k}_{i}^{2}-{k}_{i})\), where \({k}_{i}\) is the degree of vertex \(i\) and \({n}_{i}\) is the number of edges among the neighbors of \(i\).

  • Clustering coefficient variation clustering coefficient variation of node \(k\) is \(\varDelta C=\left|{C}_{k}-C\right|\), where \({C}_{k}\) is the average clustering coefficient when the network contains node \(k\) and \(C\) is the average clustering coefficient when the network does not contain node \(k\).

  • Degree of a vertex it quantifies the number of immediate neighbors of a vertex \(i\) and can be obtained as \({k}_{i} ={\sum }_{j}{A}_{ij}\), where \({A}_{ij}\) is the adjacency matrix such that \({A}_{ij}=1\) if there is an edge from vertex \(j\) to vertex \(i\), and 0 otherwise. In a directed network, each vertex has two degrees: the out-degree is the number of edges going out of a vertex, \({k}_{i}^{out} ={\sum }_{j}{A}_{ji}\), and the in-degree is the number of edges coming into a vertex, \({k}_{i}^{in} ={\sum }_{j}{A}_{ij}\). The total degree of the vertex is the sum of its in- and out-degree, \({k}_{i}^{tot}={k}_{i}^{in}+{k}_{i}^{out}\).

  • Degree centrality \({degree}_{i}=\frac{1}{(n-1)}\sum _{j\ne i}{m}_{ij}\), where \({m}_{ij}=1\) if node \(i\) is connected to node \(j\).

  • Degree distribution the degree distribution of a network, \(p\left(k\right), k=0, 1,\) ..., measures the proportion of nodes in the network having degree \(k\). Formally: \(p\left(k\right)=\frac{{n}_{k}}{n}\).

  • Diameter is the largest of all shortest paths, \(max\left\{{d}_{ij}\right\}\), where \({d}_{ij}\) is the shortest path between nodes \(i\) and \(j\).

  • Density \(D=\frac{K}{N(N-1)}\), where \(N\) is the number of vertices, \(N=\left|V\right|\), and \(K\) is the number of edges, \(K=\left|E\right|\).

  • Eccentricity the eccentricity of a node \(i\) is a centrality index equal to the maximum length of all shortest paths from \(i\) to the other nodes in the network.

  • Efficiency \({E}_{glob}\left(G\right)=\frac{1}{N(N-1)}\sum _{i\ne j\in G}\frac{1}{{d}_{ij}}\), where \({d}_{ij}\) is the shortest path between vertices \(i\) and \(j\), \(N\) is the number of vertices, \(N=\left|V\right|\), and \(G\) is the graph composed of vertices \(V\) and a set of edges \(E\). Since efficiency is also defined for disconnected graphs, the local efficiency can also be analyzed and is defined as the average efficiency of the local subgraphs and can be formally stated as: \({E}_{loc}\left(G\right)=\frac{1}{N}\sum _{i\in G}{E}_{glob}\left({G}_{i}\right), i\notin {G}_{i}\), where \({G}_{i}\) is the subgraph of the neighbors of vertex \(i\).

  • Eigenvector centrality the eigenvector centrality assigns a value to a given node \(i\) proportional to the sum of the eigenvector centrality values of the nodes connected to \(i\).

  • Frequency of labelled motifs \({n}_{w,m}=\frac{{\tilde{n}}_{w,m}}{{n}_{m}}\), where \({\tilde{n}}_{w,m}\) is the total number of occurrences of node \(w\) in motif m and \({n}_{m}\) is the total number of occurrences of motif \(m\), irrespective of any node labels.

  • Fraction of reciprocal connections mathematically, a reciprocal connection exists if \({M}_{i,j}>0\) and \({M}_{j,i}>0\) for \(i\ne j\), where \(M\) is the weight matrix. The ratio of the number of reciprocal edges to the number of edges from the network is a measure known as the fraction of reciprocal connections (FRC).

  • Generalized Accessibility \({\alpha }_{i}^{\left(\infty \right)}=exp\left(-\sum {P}_{i,j}\text{log}{P}_{i,j}\right)\), where \(P\) is the probability transition of all the pairs of nodes i and j.

  • Generalized selectivity \({C}_{D-in/out}^{w\alpha }={k}^{in/out} \times \left(\frac{{s}_{i}^{in/out}}{{k}_{i}^{in/out}}\right)={k}_{i}^{in/out}\times {\left({e}_{i}^{in/out}\right)}^{\alpha }, \alpha >0\), where \({s}_{i}^{in/out}\) is the in-strength and out-strength of the node \(i\) defined by \({s}_{i}^{in/out}=\sum _{j}{w}_{ij/ji}\), where \({w}_{ij/ji}\) represents the weight of link, and \({e}_{i}^{in/out}\) is the selectivity (or average strength) defined by \({e}_{i}^{in/out}=\frac{{s}_{i}^{in/out}}{{k}_{i}^{in/out}}\).

  • Intermittency it is not a traditional network measurement, but it has a strong relationship with the concept of cycle length in networks, and is used to measure how periodically a word is repeated.

  • Load centrality it is similar to betweenness centrality but considering weights on edges.

  • Local efficiency \({E}_{loc}=\frac{1}{N}\sum _{i\in G}{E}_{glob}\left({G}_{i}\right)\), \(i\notin {G}_{i}\), where \({G}_{i}\) is the subgraph of the neighbors of \(i\).

  • Locality index \({l}_{i}=\frac{{N}_{i}^{int}}{{N}_{i}^{int}+{N}_{i}^{ext}}\), where \({N}_{i}^{int}\) is the number of internal connections (the edges between the \({k}_{i}\) neighbors of \(i\), plus the \({k}_{i}\) edges that connect node \(i\) to its neighbors) and \({N}_{i}^{ext}\) is the number of external connections, i.e., the edges linking the neighbors of \(i\) to the rest of the network.

  • Matching index \({\mu }_{i,j}=\frac{{\sum }_{k\ne i,j}{a}_{ik}{a}_{jk}}{\sum _{k\ne j}{a}_{ik}+\sum _{k\ne i}{a}_{jk}}\), where \({a}_{ij}\) is an element of the adjacency matrix, and \({a}_{ij}=1\) if nodes \(i\) and \(j\) are connected.

  • Modularity: \(Q=\frac{1}{2m}{\sum }_{i=1}^{n}{\sum }_{j=1}^{n}\left[{a}_{ij}-\frac{{k}_{i}{k}_{j}}{2m}\right]\delta \left({c}_{i}, {c}_{j}\right)\), where \(m\) is the number of edges, \(n\) is the number of nodes, \(\delta \left({c}_{i}, {c}_{j}\right)=1\) if nodes \(i\) and \(j\) belong to the same class (community), and \(\delta \left({c}_{i}, {c}_{j}\right)=0\) otherwise. This measurement ranges in \(-1/2\le Q<1\). For \(Q>0\), the number of edges inside the communities is greater than expected by chance.
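A minimal sketch of this formula, assuming an undirected network given as a 0/1 adjacency matrix and a community label per node (names `adjacency` and `communities` are illustrative):

```python
def modularity(adjacency, communities):
    """Newman modularity Q of a partition.

    adjacency: symmetric 0/1 matrix (list of lists);
    communities: list assigning a community label to each node.
    """
    n = len(adjacency)
    degree = [sum(row) for row in adjacency]
    two_m = sum(degree)  # equals 2m for an undirected network
    q = 0.0
    for i in range(n):
        for j in range(n):
            if communities[i] == communities[j]:
                q += adjacency[i][j] - degree[i] * degree[j] / two_m
    return q / two_m
```

As a sanity check, a network made of two disconnected triangles, partitioned into those two triangles, yields Q = 0.5.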

  • Neighborhood: quantifies the number of nodes in the h-th concentric level around node \(i\).

  • Radius: the smallest of all eccentricities, \(\underset{i}{\text{min}}\,\underset{j}{\text{max}}\left({d}_{ij}\right)\), where \({d}_{ij}\) is the shortest path length between nodes \(i\) and \(j\).

  • Selectivity \({\theta }_{i}=\frac{{s}_{i}}{{k}_{i}}\), where \({s}_{i}\) is the strength of vertex \(i\) and \({k}_{i}\) is the degree of vertex \(i\).
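A minimal sketch of selectivity, assuming the weighted network is stored as a dict-of-dicts (`weights[i][j]` is the weight of the edge between \(i\) and \(j\); the representation is an assumption for illustration):

```python
def selectivity(weights, i):
    """Average edge weight (strength / degree) of node i.

    weights: dict mapping each node to a dict {neighbor: edge_weight}.
    """
    strength = sum(weights[i].values())   # s_i
    degree = len(weights[i])              # k_i
    return strength / degree if degree else 0.0
```

For a node with two incident edges of weights 2.0 and 4.0, the strength is 6.0, the degree is 2, and the selectivity is 3.0.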

  • Shortest path: the shortest path value of a given node \(i\) is defined by \({SP}_{i}=\sum _{j}{d}_{ij}\), where \({d}_{ij}\) is the shortest path length between nodes \(i\) and \(j\).

  • Shortest path variation: the shortest path variation of node \(k\) is the difference between the shortest path length of the network when it contains \(k\) and when it does not, \(\varDelta L=\left|{L}_{k}-L\right|\).

  • Symmetry: \({S}_{i}^{\left(h\right)}=\frac{exp\left(-\sum {p}_{i,j}^{\left(h\right)}\text{log}{p}_{i,j}^{\left(h\right)}\right)}{\left|{H}_{h}\left(i\right)\right|+{\sum }_{r=0}^{h-1}{\eta }_{r}}\), where \({H}_{h}\left(i\right)\) is the set of all nodes in the \(h\)-th hierarchical level of node \(i\), \(\left|{H}_{h}\left(i\right)\right|\) is the number of nodes in \({H}_{h}\left(i\right)\), and, for a given hierarchical level \(r\), \({\eta }_{r}\) is the number of nodes without edges connecting to the next hierarchical level.

  • Strength: \({s}_{i}^{in/out}=\sum _{j}{w}_{ij/ji}\), where \(w\) is the weight adjacency matrix and \({w}_{ij/ji}\) represents the weight of the link.

  • Strength of a node: \(strength\left({v}_{i}\right) ={\sum }_{j}{w}_{ij}={\sum }_{j}{w}_{ji}\), where \({w}_{ij}\) is the corresponding entry in the weight matrix \(W\) for edge (\({v}_{i}\), \({v}_{j}\)).

  • Strength centrality: \({SC}_{i}=\sum _{j}{w}_{ij}\), where \({SC}_{i}\) is the strength centrality of a given node \(i\) and \({w}_{ij}\) is the similarity value between nodes \(i\) and \(j\).

  • Transitivity: the transitivity of a network is the fraction of all possible triangles present in the network. Possible triangles are identified by the number of triads (two links sharing a vertex), so that \(T=(3\times \#triangles)/(\#triads)\).
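A minimal sketch over a 0/1 adjacency matrix. Note that summing, for each centre node, the pairs of its neighbors that are themselves connected already counts each triangle three times (once per centre), which matches the factor 3 in the formula:

```python
def transitivity(adjacency):
    """T = (3 * #triangles) / (#triads) for an undirected 0/1 adjacency matrix."""
    n = len(adjacency)
    closed = 0  # neighbor pairs that are connected; 3 * (#unique triangles)
    triads = 0  # all neighbor pairs (open or closed), centred at some node
    for i in range(n):
        neighbors = [j for j in range(n) if adjacency[i][j]]
        k = len(neighbors)
        triads += k * (k - 1) // 2
        for a in range(k):
            for b in range(a + 1, k):
                if adjacency[neighbors[a]][neighbors[b]]:
                    closed += 1
    return closed / triads if triads else 0.0
```

A triangle (complete graph on three nodes) gives T = 1, while a three-node path gives T = 0, since its single triad is open.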

  • Weight distribution: can be computed through a probability density function, represented by a histogram of the edge weights.

Appendix 2: Evaluation metrics

  • Accuracy \(accuracy=\frac{TP+TN}{TP+FP+FN+TN}\).

  • Area under the curve (AUC)–Receiver operating characteristic (ROC) curve: a performance measurement for classification problems at various threshold settings. ROC is a probability curve and AUC represents the degree of separability. The ROC curve plots the true positive rate (TPR) on the y-axis against the false positive rate (FPR) on the x-axis.

  • F1-score: \(F1=\frac{2PR}{P+R}\), where \(P\) is precision and \(R\) is recall. There are also the F2-score, \(F2 =\frac{5PR}{4P+R}\), and the Fβ-score, \(F\beta =\left(1+{\beta }^{2}\right)\frac{PR}{{\beta }^{2}P+R}\), but these were not used in the papers addressed in this study.
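The general Fβ formula above subsumes F1 (β = 1) and F2 (β = 2); a minimal sketch:

```python
def f_beta(precision, recall, beta=1.0):
    """Generalized F-score: F1 for beta=1, F2 for beta=2."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

With precision 0.5 and recall 1.0, F1 = 2/3, while F2 = 5/6; the larger β weights recall more heavily.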

  • Link perplexity: perplexity is a measure of a model's ability to generalize to unseen data, defined as the reciprocal geometric mean of the likelihood of a test corpus given a model and the collapsed state of the Markov chain. It can be used to evaluate the topic dependency relationship structure through the link perplexity, which can be formally stated as \(linkPerp=exp\left(-\frac{{\sum }_{i=P+1}^{L}{log}\,p\left({\tilde{l}}_{i}\mid {l}_{1:P}\right)}{\left|L\right|}\right)\), where \({l}_{1:P}\) represents the topic relationship links from a topic dependency network, \(p\left({\tilde{l}}_{i}\mid {l}_{1:P}\right)\) is the predictive distribution of the remaining dependency relationships, and \(L\) is the set of links of the topic relationship network.

  • Macro average F1 value: \(MacroF1=\frac{MacroP\times MacroR\times 2}{MacroP+MacroR}\).

  • Macro average recall rate: \(MacroR=\frac{1}{n}\sum _{i=1}^{n}{R}_{i}\), where \({R}_{i}\) is the recall rate of class \(i\) text, and \(n\) is the total number of categories.

  • Macro average accuracy rate: \(MacroP=\frac{1}{n}\sum _{i=1}^{n}{P}_{i}\), where \({P}_{i}\) is the accuracy rate of class \(i\) text and \(n\) is the total number of categories.

  • Precision: \(P=\frac{\left|A\cap B\right|}{\left|A\right|}\), where \(A\) is the collection of documents (words) extracted by an algorithm and \(B\) is the collection of documents (keywords) in the text files.

  • p-value represents the likelihood of obtaining the corresponding accuracy rate in a random classification.

  • Recall: \(R=\frac{\left|A\cap B\right|}{\left|B\right|}\), where \(A\) is the collection of documents (words) extracted by an algorithm and \(B\) is the collection of documents (keywords) in the text files.
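The set-based precision and recall above can be sketched directly with Python sets (the function name is illustrative):

```python
def precision_recall(extracted, reference):
    """Set-based precision and recall.

    extracted: items (e.g., keywords) produced by an algorithm (set A);
    reference: ground-truth items (set B).
    """
    extracted, reference = set(extracted), set(reference)
    overlap = len(extracted & reference)  # |A ∩ B|
    precision = overlap / len(extracted) if extracted else 0.0
    recall = overlap / len(reference) if reference else 0.0
    return precision, recall
```

For instance, extracting {a, b, c} against a reference {b, c, d, e} gives precision 2/3 and recall 1/2.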

  • Recall-Oriented Understudy for Gisting Evaluation (ROUGE): \(ROUGE\text{-}N=\frac{\sum _{S\in \left\{referenceSummaries\right\}}{\sum }_{{gram}_{n}\in S}{count}_{match}\left({gram}_{n}\right)}{\sum _{S\in \left\{referenceSummaries\right\}}{\sum }_{{gram}_{n}\in S}count\left({gram}_{n}\right)}\), where \(n\) is the length of the n-gram \({gram}_{n}\), and \({count}_{match}\left({gram}_{n}\right)\) is the maximum number of n-grams co-occurring in the candidate summary and a reference summary.
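A minimal sketch of ROUGE-N, assuming candidate and reference summaries are already tokenized into word lists (clipped n-gram counts via `collections.Counter`):

```python
from collections import Counter

def rouge_n(candidate, references, n=1):
    """ROUGE-N: recall of candidate n-grams against reference summaries.

    candidate: list of tokens; references: list of token lists.
    """
    def ngram_counts(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand_counts = ngram_counts(candidate)
    matched = total = 0
    for ref in references:
        ref_counts = ngram_counts(ref)
        total += sum(ref_counts.values())
        # clipped matches: an n-gram counts at most as often as it appears in the reference
        matched += sum(min(count, ref_counts[gram])
                       for gram, count in cand_counts.items() if gram in ref_counts)
    return matched / total if total else 0.0
```

For example, the candidate "the cat sat" against the single reference "the cat sat on the mat" shares 3 of the reference's 6 unigrams, so ROUGE-1 = 0.5.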


Cite this article

Oliva, S.Z., Oliveira-Ciabati, L., Dezembro, D.G. et al. Text structuring methods based on complex network: a systematic review. Scientometrics 126, 1471–1493 (2021). https://doi.org/10.1007/s11192-020-03785-y
