Stream-based live public opinion monitoring approach with adaptive probabilistic topic model

  • Methodologies and Application
Soft Computing

Abstract

Public opinion monitoring, also known as first story detection, is a task within topic detection and tracking that focuses on a particular Internet news event. It is generally used to trace how news propagates. Traditional methods rely on text matching, which has limitations: it fails to discover hidden and latent topics, and it ranks matching results incorrectly by relevance on large-scale data. In this paper, we propose three solutions to live public opinion monitoring: simple keyword computing and matching, simple probabilistic topic computing and matching, and stream-based live probabilistic topic computing and matching. We point out the disadvantages of the first two solutions, such as poor semantic matching and low efficiency on timely big data. To make substantial improvements, we propose stream-based real-time topic computing together with topic matching that uses query-time document and field boosting. Finally, topic computing and matching experiments on crawled historical Netease news records show that our approaches are effective and efficient.

Acknowledgements

This work was supported by the National Natural Science Foundation of China (61772231, 61702217 and 61702216), the Shandong Provincial Natural Science Foundation (ZR2017MF025 and ZR2014FQ029), the Shandong Provincial Key R&D Program of China (2015GGX106007, 2016ZDJS01A12, 2017CXGC0701 and 2018CXGC0706), and the Science and Technology Program of University of Jinan (XKY1734 and XKY1828).

Author information

Corresponding author

Correspondence to Kun Ma.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Communicated by V. Loia.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Ma, K., Yu, Z., Ji, K. et al. Stream-based live public opinion monitoring approach with adaptive probabilistic topic model. Soft Comput 23, 7451–7470 (2019). https://doi.org/10.1007/s00500-018-3391-7

  • DOI: https://doi.org/10.1007/s00500-018-3391-7