Abstract
Public opinion monitoring, also known as first story detection, is a task within topic detection and tracking that focuses on a particular Internet news event. It is generally used to trace how news propagates. The traditional method relies on text matching, but it has limitations: it struggles to discover hidden and latent topics, and it ranks matching results incorrectly by relevance on large-scale data. In this paper, we propose three solutions to live public opinion monitoring: simple keyword computing and matching, simple probabilistic topic computing and matching, and stream-based live probabilistic topic computing and matching. We point out the disadvantages of the first two solutions, such as weak semantic matching and low efficiency on timely big data. To address them, we propose stream-based real-time topic computing and topic matching with query-time document and field boosting. Finally, our topic computing and matching experiments on crawled historical Netease news records show that our approaches are effective and efficient.
Acknowledgements
This work was supported by the National Natural Science Foundation of China (61772231 & 61702217 & 61702216), the Shandong Provincial Natural Science Foundation (ZR2017MF025 & ZR2014FQ029), the Shandong Provincial Key R&D Program of China (2015GGX106007 & 2016ZDJS01A12 & 2017CXGC0701 & 2018CXGC0706), and the Science and Technology Program of University of Jinan (XKY1734 & XKY1828).
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Additional information
Communicated by V. Loia.
About this article
Cite this article
Ma, K., Yu, Z., Ji, K. et al. Stream-based live public opinion monitoring approach with adaptive probabilistic topic model. Soft Comput 23, 7451–7470 (2019). https://doi.org/10.1007/s00500-018-3391-7