Abstract
The research community has devoted significant effort to methods for organizing documentation with the help of Information Retrieval techniques. In this paper, we present several experiments that use stream analysis methods to explore streams of text documents. We also present possible architectures for Text Document Stream Organization that rely on incremental algorithms such as Incremental Sparse TF-IDF and Incremental Similarity. Our results show that such an architecture significantly improves the efficiency of grouping similar documents. These improvements matter because the analysis of large amounts of text is widely recognized as a high-dimensional and complex problem in data analysis.
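To make the idea of incremental TF-IDF and similarity over a document stream more concrete, the following is a minimal sketch in Python. It is not the implementation evaluated in the paper: the class name `IncrementalTfidf`, the function `cosine`, the whitespace tokenizer, and the smoothed IDF formula are all assumptions chosen for illustration. The sketch only shows the general principle that corpus statistics can be updated one document at a time, without reprocessing earlier documents, while vectors stay sparse.

```python
import math
from collections import Counter


class IncrementalTfidf:
    """Toy incremental TF-IDF index (hypothetical, for illustration only)."""

    def __init__(self):
        self.doc_count = 0         # number of documents seen so far
        self.doc_freq = Counter()  # term -> number of documents containing it
        self.term_counts = {}      # doc_id -> sparse raw term counts

    def add_document(self, doc_id, text):
        """Update the statistics with one new document from the stream."""
        counts = Counter(text.lower().split())  # naive whitespace tokenizer (assumption)
        self.term_counts[doc_id] = counts
        self.doc_count += 1
        self.doc_freq.update(counts.keys())     # each distinct term counted once per document

    def vector(self, doc_id):
        """Sparse TF-IDF vector for a stored document, using the current statistics."""
        counts = self.term_counts[doc_id]
        total = sum(counts.values())
        if total == 0:
            return {}
        return {
            term: (count / total)
            # Smoothed IDF (an assumption; other weighting schemes are possible).
            * (math.log((1 + self.doc_count) / (1 + self.doc_freq[term])) + 1.0)
            for term, count in counts.items()
        }


def cosine(vec_a, vec_b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * vec_b[term] for term, weight in vec_a.items() if term in vec_b)
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


index = IncrementalTfidf()
index.add_document("d1", "data stream mining")
index.add_document("d2", "data stream mining")
index.add_document("d3", "cooking pasta recipes")
similar = cosine(index.vector("d1"), index.vector("d2"))
different = cosine(index.vector("d1"), index.vector("d3"))
```

In this sketch, `similar` compares two documents with identical terms and `different` compares two documents with no terms in common. Only the counters touched by a new document change when it arrives, which is what makes the update incremental.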
Acknowledgements
This work was fully financed by the Faculty of Engineering of Porto University. Rui Portocarrero Sarmento also gratefully acknowledges funding from FCT (Portuguese Foundation for Science and Technology) through a Ph.D. grant (SFRH/BD/119108/2016).
About this article
Cite this article
Sarmento, R.P., O. Cardoso, D., Dearo, K. et al. Text documents streams with improved incremental similarity. Soc. Netw. Anal. Min. 11, 113 (2021). https://doi.org/10.1007/s13278-021-00826-z
DOI: https://doi.org/10.1007/s13278-021-00826-z