DOI: 10.1145/3352411.3352442
research-article

An Improved Clustering Algorithm based on Single-pass

Published: 19 July 2019

ABSTRACT

Topic Detection and Tracking (TDT) is a popular topic clustering method in the big data age. It aims to automatically recognize new topics and continuously track known topics in news streams. Traditional TDT mainly studies short text. With the rapid development of digital devices and communication techniques, news articles are becoming longer and richer, so traditional TDT now faces three problems. First, long news text usually contains multiple topics, which traditional clustering algorithms cannot identify accurately. Second, traditional clustering mostly relies on multi-dimensional computation over a bag-of-words representation, and the time cost of this computation increases exponentially with the length and number of articles. Third, long news text contains more information, so presenting its continuity and relevance well is both important and meaningful. This paper therefore presents an improved clustering algorithm based on single-pass that addresses these three problems. Experiments show that, compared with K-means clustering, agglomerative hierarchical clustering, Density-Based Spatial Clustering of Applications with Noise (DBSCAN), and hierarchical clustering on a constructed concept graph, the algorithm improves accuracy by about 20% to 30%, increases recall by 10% to 20%, and reduces running time by more than 40%. As the number of articles increases, the running-time curve of the improved single-pass clustering algorithm approximates a linear function: for each additional article, the time required is only 0.1 to 0.5 times that of the other algorithms. In addition, by adding timelines and extracting the topics within each theme during presentation, the algorithm can effectively mine the continuity and relevance of news topics and track how those topics change.
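
For readers unfamiliar with the baseline, the following is a minimal sketch of classic single-pass clustering over bag-of-words vectors. It is not the improved algorithm this paper proposes, whose details are not given in the abstract. The function names, the whitespace tokenizer, the cosine similarity measure, and the 0.3 threshold are all illustrative assumptions.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Turn a text into a term-frequency vector (naive whitespace tokenizer, an assumption)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def single_pass(docs, threshold=0.3):
    """Visit each document once, in arrival order.

    A document joins the most similar existing cluster if the similarity
    reaches the threshold; otherwise it starts a new cluster.
    The threshold value is hypothetical, not taken from the paper.
    """
    centroids = []  # one summed term-count vector per cluster
    labels = []     # cluster index assigned to each document
    for text in docs:
        vec = bag_of_words(text)
        best, best_sim = -1, 0.0
        for i, centroid in enumerate(centroids):
            sim = cosine(vec, centroid)
            if sim > best_sim:
                best, best_sim = i, sim
        if best >= 0 and best_sim >= threshold:
            centroids[best].update(vec)  # fold the document into the cluster
            labels.append(best)
        else:
            centroids.append(Counter(vec))
            labels.append(len(centroids) - 1)
    return labels


if __name__ == "__main__":
    news = [
        "stock market falls sharply",
        "market falls as stock prices drop",
        "team wins football final",
        "football team celebrates final win",
    ]
    print(single_pass(news))
```

Each document is read exactly once and compared only against the current cluster summaries, which is why single-pass methods suit streaming news. The per-document cost still grows with the number of clusters, so this sketch by itself does not demonstrate the running-time figures reported above.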


    • Published in

      DSIT 2019: Proceedings of the 2019 2nd International Conference on Data Science and Information Technology
      July 2019
      280 pages
      ISBN: 9781450371414
      DOI: 10.1145/3352411

      Copyright © 2019 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Qualifiers

      • research-article
      • Research
      • Refereed limited

      Acceptance Rates

      DSIT 2019 Paper Acceptance Rate: 43 of 95 submissions, 45%
      Overall Acceptance Rate: 114 of 277 submissions, 41%