Automatic content extraction and time-aware topic clustering for large-scale social network on cloud platform

The Journal of Supercomputing

Abstract

In recent years, as the number of social network users has grown, social networks have taken on the characteristics of big data, and large-scale social networks have become an indispensable part of people’s lives. Traditional data mining techniques, however, are not suited to large-scale social networks, so more suitable mining techniques are urgently needed. In this paper, a crawler model based on semantic analysis and spatial clustering is first proposed. Then, a content extraction model based on the document object model (DOM) tree is built to extract the target text from the links fetched by the proposed crawler. The similarities between textual information in different regions are computed to select the important information. Moreover, a two-stage topic clustering model based on time information is presented: time information is introduced into the similarity computation between two posts or clusters, and the single-pass algorithm is improved and applied in the different clustering stages to improve clustering accuracy. Finally, the proposed algorithms are evaluated on the Hadoop platform. The Hadoop platform can effectively reduce computing time and improve the quality of service for users of large-scale social networks. The experiments also demonstrate that the proposed algorithms are suitable for data processing in large-scale social networks.
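To make the idea of time-aware single-pass clustering concrete, the following is a minimal sketch and not the authors' algorithm. The abstract does not give the similarity formula, so everything specific here is an assumption: whitespace tokenization, cosine similarity over term counts, an exponential time decay with a hypothetical `half_life_hours` parameter, a hypothetical `threshold`, and a single clustering stage rather than the two stages the paper describes.

```python
import math
from collections import Counter
from dataclasses import dataclass, field


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


@dataclass
class Cluster:
    terms: Counter = field(default_factory=Counter)  # summed term counts
    time_sum: float = 0.0                             # sum of member timestamps
    members: list = field(default_factory=list)

    @property
    def mean_time(self) -> float:
        return self.time_sum / len(self.members)

    def add(self, post_id, terms: Counter, timestamp: float) -> None:
        self.terms.update(terms)
        self.time_sum += timestamp
        self.members.append(post_id)


def time_aware_similarity(terms, timestamp, cluster, half_life_hours=24.0):
    """Assumed form: text similarity scaled by an exponential time decay."""
    gap = abs(timestamp - cluster.mean_time)
    decay = 0.5 ** (gap / half_life_hours)
    return cosine(terms, cluster.terms) * decay


def single_pass(posts, threshold=0.3, half_life_hours=24.0):
    """Assign each (post_id, text, timestamp_hours) to a cluster in one pass."""
    clusters = []
    for post_id, text, timestamp in posts:
        terms = Counter(text.lower().split())
        best, best_sim = None, 0.0
        for cluster in clusters:
            sim = time_aware_similarity(terms, timestamp, cluster, half_life_hours)
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim >= threshold:
            best.add(post_id, terms, timestamp)
        else:
            # No existing cluster is similar enough: start a new topic.
            new_cluster = Cluster()
            new_cluster.add(post_id, terms, timestamp)
            clusters.append(new_cluster)
    return clusters
```

Under these assumptions, two posts with near-identical wording still land in different clusters when they are far apart in time, because the decay factor pushes their combined similarity below the threshold. Each post is compared only against the current cluster summaries, which is what keeps the procedure to a single pass over the data.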

Acknowledgements

The work was supported by the National Natural Science Foundation (NSF) under grants (No. 61672397, No. 61873341, No. 61472294, No. 61771354); the Application Foundation Frontier Project of Wuhan (No. 2018010401011290); the Open Fund of the Key Laboratory of Urban Land Resources Monitoring and Simulation, Ministry of Land and Resources (Grant No. KF-2018-03-005); the Key Lab of Guangdong for Utilization of Remote Sensing and Geographical Information System, Guangzhou Institute of Geography (Grant No. 2017B030314138); and the Jiangsu Key Laboratory of Meteorological Observation and Information Processing, Nanjing University of Information Science and Technology (Grant No. KDXS1804). Any opinions, findings, and conclusions are those of the authors and do not necessarily reflect the views of the above agencies.

Author information

Corresponding author

Correspondence to Chunlin Li.

About this article

Cite this article

Li, C., Bai, J. Automatic content extraction and time-aware topic clustering for large-scale social network on cloud platform. J Supercomput 75, 2890–2924 (2019). https://doi.org/10.1007/s11227-018-2704-z
