Abstract
In recent years, as the number of social network users has grown, social network data has taken on the characteristics of big data, and large-scale social networks have become an indispensable part of people’s lives. However, traditional data mining techniques do not scale to large-scale social networks, so more suitable mining techniques are urgently needed. In this paper, a crawler model based on semantic analysis and spatial clustering is first proposed. Then, a content extraction model based on the document object model (DOM) tree is built to extract the target text from the links fetched by the proposed crawler. The similarities between the textual information in different regions of a page are computed to select the important information. Moreover, a two-stage topic clustering model based on time information is presented, in which time information is introduced into the similarity computation between two posts or clusters. The single-pass algorithm is improved and applied in the different clustering stages to improve clustering accuracy. Finally, the proposed algorithms are evaluated on the Hadoop platform. The Hadoop platform can effectively reduce computing time and improve the quality of service for users of large-scale social networks. The experiments also demonstrate that the proposed algorithms are suitable for data processing in large-scale social networks.
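To make the idea of time-aware single-pass clustering concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes bag-of-words vectors, cosine similarity, and an exponential time decay; the names `time_aware_similarity`, `single_pass`, `half_life_hours`, and `threshold` are hypothetical and chosen only for illustration. The abstract does not specify how the paper combines time and text similarity or how the single-pass algorithm is improved.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
        math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def time_aware_similarity(vec_a, time_a, vec_b, time_b, half_life_hours=24.0):
    """Text similarity damped by the time gap.

    Assumption: exponential decay with a fixed half-life; the paper may
    combine time and text similarity differently.
    """
    gap = abs(time_a - time_b)
    decay = 0.5 ** (gap / half_life_hours)
    return cosine(vec_a, vec_b) * decay


def single_pass(posts, threshold=0.3, half_life_hours=24.0):
    """Cluster (text, timestamp_in_hours) pairs in a single pass.

    Each post joins the most similar existing cluster if the time-aware
    similarity reaches the threshold; otherwise it starts a new cluster.
    """
    clusters = []  # each cluster: centroid term counts, mean time, member ids
    for idx, (text, ts) in enumerate(posts):
        vec = Counter(text.lower().split())
        best, best_sim = None, 0.0
        for c in clusters:
            sim = time_aware_similarity(
                vec, ts, c["centroid"], c["time"], half_life_hours)
            if sim > best_sim:
                best, best_sim = c, sim
        if best is not None and best_sim >= threshold:
            n = len(best["members"])
            best["centroid"].update(vec)  # accumulate term counts
            best["time"] = (best["time"] * n + ts) / (n + 1)  # running mean
            best["members"].append(idx)
        else:
            clusters.append({"centroid": vec, "time": ts, "members": [idx]})
    return clusters
```

In this sketch, two posts with identical wording still land in different clusters when they are far apart in time, which is the behaviour a time-aware topic model is meant to capture. The threshold and half-life would need tuning on real data.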
Acknowledgements
The work was supported by the National Natural Science Foundation (NSF) under grants No. 61672397, No. 61873341, No. 61472294, and No. 61771354; the Application Foundation Frontier Project of Wuhan (No. 2018010401011290); the Open Fund of the Key Laboratory of Urban Land Resources Monitoring and Simulation, Ministry of Land and Resources (Grant No. KF-2018-03-005); the Key Lab of Guangdong for Utilization of Remote Sensing and Geographical Information System, Guangzhou Institute of Geography (Grant No. 2017B030314138); and the Jiangsu Key Laboratory of Meteorological Observation and Information Processing, Nanjing University of Information Science and Technology (Grant No. KDXS1804). Any opinions, findings, and conclusions are those of the authors and do not necessarily reflect the views of the above agencies.
Cite this article
Li, C., Bai, J. Automatic content extraction and time-aware topic clustering for large-scale social network on cloud platform. J Supercomput 75, 2890–2924 (2019). https://doi.org/10.1007/s11227-018-2704-z
DOI: https://doi.org/10.1007/s11227-018-2704-z