Automatic content extraction and time-aware topic clustering for large-scale social network on cloud platform

Li, Chunlin; Bai, Jingpan

doi:10.1007/s11227-018-2704-z

Published: 26 November 2018

Volume 75, pages 2890–2924, (2019)

The Journal of Supercomputing

Chunlin Li^1,2,3,4 & Jingpan Bai^2

Abstract

In recent years, as the number of users has grown, social networks have taken on the characteristics of big data, and large-scale social networks have become an indispensable part of people’s lives. However, traditional data mining techniques are not suited to large-scale social networks, so more suitable mining techniques are urgently needed. In this paper, a crawler model based on semantic analysis and spatial clustering is first proposed. Then, a content extraction model based on the document object model (DOM) tree is built to extract the target text from the links fetched by the proposed crawler. The similarities between textual information in different regions are computed to select the important information. Moreover, a two-stage topic clustering model based on time information is presented, in which time information is introduced into the similarity computation between two posts or clusters. The single-pass algorithm is improved and applied in the different clustering stages to improve clustering accuracy. Finally, the proposed algorithms are evaluated on the Hadoop platform. The Hadoop platform can effectively reduce computing time and improve the quality of service for users in large-scale social networks. The experiments demonstrate that the proposed algorithms are suitable for data processing in large-scale social networks.
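
The abstract does not give the similarity formula or the details of the improved single-pass algorithm, so the following is only a minimal sketch of the general idea of time-aware single-pass clustering, not the authors' method. It covers one pass rather than the two stages of the paper. The exponential time decay, the `half_life_hours` and `threshold` parameters, the bag-of-words cosine similarity, and all function and class names are assumptions made for illustration.

```python
import math
from collections import Counter
from dataclasses import dataclass, field


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


@dataclass
class Cluster:
    centroid: Counter            # summed term counts of all member posts
    last_time: float             # timestamp (hours) of the most recent member
    members: list = field(default_factory=list)


def time_aware_similarity(vec, post_time, cluster, half_life_hours=24.0):
    """Text similarity damped by the time gap (assumed exponential decay)."""
    gap = abs(post_time - cluster.last_time)
    decay = 0.5 ** (gap / half_life_hours)
    return cosine(vec, cluster.centroid) * decay


def single_pass(posts, threshold=0.3, half_life_hours=24.0):
    """Cluster posts in arrival order; posts are (post_id, text, time_in_hours)."""
    clusters = []
    for post_id, text, t in posts:
        vec = Counter(text.lower().split())
        best, best_sim = None, 0.0
        for c in clusters:
            sim = time_aware_similarity(vec, t, c, half_life_hours)
            if sim > best_sim:
                best, best_sim = c, sim
        if best is not None and best_sim >= threshold:
            # Close enough in both content and time: join the existing topic.
            best.centroid.update(vec)
            best.last_time = max(best.last_time, t)
            best.members.append(post_id)
        else:
            # Otherwise the post seeds a new topic cluster.
            clusters.append(Cluster(Counter(vec), t, [post_id]))
    return clusters
```

In this sketch, two posts with nearly identical wording fall into the same cluster only if they are also close in time; with a very large `half_life_hours` the decay factor approaches 1 and the procedure reduces to ordinary single-pass clustering on text similarity alone.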

Acknowledgements

The work was supported by the National Natural Science Foundation (NSF) under grants (No. 61672397, No. 61873341, No. 61472294, No. 61771354); the Application Foundation Frontier Project of Wuhan (No. 2018010401011290); the Open Fund of the Key Laboratory of Urban Land Resources Monitoring and Simulation, Ministry of Land and Resources (Grant No. KF-2018-03-005); the Key Lab of Guangdong for Utilization of Remote Sensing and Geographical Information System, Guangzhou Institute of Geography (Grant No. 2017B030314138); and the Jiangsu Key Laboratory of Meteorological Observation and Information Processing, Nanjing University of Information Science and Technology (Grant No. KDXS1804). Any opinions, findings, and conclusions are those of the authors and do not necessarily reflect the views of the above agencies.

Author information

Authors and Affiliations

Key Laboratory of Urban Land Resources Monitoring and Simulation, Ministry of Land and Resources, Shenzhen, People’s Republic of China
Chunlin Li
Department of Computer Science, Wuhan University of Technology, Wuhan, 430063, People’s Republic of China
Chunlin Li & Jingpan Bai
Key Lab of Guangdong for Utilization of Remote Sensing and Geographical Information System, Guangzhou Institute of Geography, Guangzhou, People’s Republic of China
Chunlin Li
Jiangsu Key Laboratory of Meteorological Observation and Information Processing, Collaborative Innovation Center of Atmospheric Environment and Equipment Technology, Nanjing University of Information Science and Technology, Nanjing, People’s Republic of China
Chunlin Li

Corresponding author

Correspondence to Chunlin Li.

About this article

Cite this article

Li, C., Bai, J. Automatic content extraction and time-aware topic clustering for large-scale social network on cloud platform. J Supercomput 75, 2890–2924 (2019). https://doi.org/10.1007/s11227-018-2704-z

Published: 26 November 2018
Issue Date: 01 May 2019
DOI: https://doi.org/10.1007/s11227-018-2704-z
