Two-Stage Method for Grouping News with Similar Topics

Programming and Computer Software

Abstract

This paper addresses event detection as part of news stream analysis. An event is regarded as a group of news texts describing the same real-world fact or situation. We analyze news from Russian news agencies. We propose a two-stage clustering method that combines a rough clustering algorithm at the first stage with a refinement classifier at the second stage. In addition, we present an open labeled dataset for news event detection, which is based on the Yandex News service. Empirical evaluation of the proposed method on this dataset demonstrates its effectiveness for event detection in news texts.
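
To make the two-stage idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes that the rough stage links texts whose word-set Jaccard similarity exceeds a threshold and takes connected components, and that the refinement stage re-checks pairs inside each rough cluster with a pairwise decision function. The names `two_stage`, `rough_threshold`, and `same_event` are hypothetical, and a stricter similarity cut-off stands in for a trained classifier.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def components(n: int, edges: list) -> list:
    """Group indices 0..n-1 into connected components with union-find."""
    parent = list(range(n))

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in edges:
        parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


def two_stage(texts, rough_threshold=0.2, same_event=None):
    """Return events as sorted lists of text indices (illustrative only)."""
    tokens = [set(t.lower().split()) for t in texts]
    if same_event is None:
        # Assumption: a stricter cut-off stands in for a trained classifier.
        same_event = lambda a, b: jaccard(a, b) >= 0.5

    # Stage 1: rough clustering with a permissive similarity threshold.
    rough_edges = [
        (i, j)
        for i, j in combinations(range(len(texts)), 2)
        if jaccard(tokens[i], tokens[j]) >= rough_threshold
    ]
    rough = components(len(texts), rough_edges)

    # Stage 2: inside each rough cluster keep only pairs the classifier accepts.
    events = []
    for cluster in rough:
        kept = [
            (a, b)
            for a, b in combinations(range(len(cluster)), 2)
            if same_event(tokens[cluster[a]], tokens[cluster[b]])
        ]
        for comp in components(len(cluster), kept):
            events.append([cluster[k] for k in comp])
    return sorted(events)


texts = [
    "central bank raises key rate",
    "central bank raises key rate again",
    "central bank governor resigns today",
    "football team wins national cup",
]
print(two_stage(texts))  # [[0, 1], [2], [3]]
```

In this toy input the rough stage merges the three bank-related headlines into one cluster, and the refinement stage separates the resignation story from the two rate stories. The permissive first stage keeps the number of pairs passed to the more expensive second stage small, which is the usual motivation for such a split.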

Notes

  1. https://developer.twitter.com/en/developer-terms/policy

  2. https://catalog.ldc.upenn.edu/LDC2005T16

  3. https://yandex.ru/news/

  4. http://talisman.ispras.ru/wp-content/uploads/2020/09/news_events.json_.gz

Funding

This work was supported by the Russian Foundation for Basic Research, project no. 18-07-01059.

Author information

Corresponding authors

Correspondence to K. A. Skorniakov, A. S. Laskina or D. Yu. Turdakov.

Additional information

Translated by Yu. Kornienko

About this article

Cite this article

Skorniakov, K.A., Laskina, A.S. & Turdakov, D.Y. Two-Stage Method for Grouping News with Similar Topics. Program Comput Soft 47, 534–540 (2021). https://doi.org/10.1134/S0361768821070070
