DOI: 10.1145/1871437.1871539
research-article

Two-tier similarity model for story link detection

Published: 26 October 2010

ABSTRACT

This paper presents a novel approach to story link detection, where the goal is to determine whether a pair of news stories is linked, i.e., whether both stories discuss the same event. The work departs from prior work in that we measure similarity at two distinct levels of textual organization, the document and its collection, and combine the scores from both levels to determine how well stories are linked. Experiments on the TDT-5 corpus show that this approach, which we call a 'two-tier similarity model,' clearly outperforms conventional approaches such as Clarity-enhanced KL divergence, while performing robustly across diverse languages.
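
To make the task concrete, here is a minimal sketch of a divergence-based link detector of the kind the abstract names as a baseline. It is illustrative only and is not the paper's model: it uses plain symmetric KL divergence over additively smoothed unigram models (not the Clarity-enhanced variant), and the function names, the smoothing constant `alpha`, the decision `threshold`, and the linear interpolation in `combine_scores` are all assumptions. The abstract does not say how the document-level and collection-level scores are actually computed or combined.

```python
import math
from collections import Counter


def unigram_model(tokens, vocab, alpha=0.1):
    """Additively smoothed unigram distribution over a shared vocabulary."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def kl_divergence(p, q):
    """KL(p || q) for two distributions defined over the same keys."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)


def symmetric_kl(tokens_a, tokens_b, alpha=0.1):
    """Average of KL in both directions; lower means more similar stories."""
    vocab = set(tokens_a) | set(tokens_b)
    p = unigram_model(tokens_a, vocab, alpha)
    q = unigram_model(tokens_b, vocab, alpha)
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))


def is_linked(tokens_a, tokens_b, threshold=0.5):
    """Declare a link when the divergence falls below an assumed threshold."""
    return symmetric_kl(tokens_a, tokens_b) < threshold


def combine_scores(doc_score, collection_score, weight=0.5):
    """Hypothetical linear interpolation of two similarity scores.

    Assumption: the source only states that scores from two levels are
    combined; the linear form and the weight are placeholders.
    """
    return weight * doc_score + (1.0 - weight) * collection_score
```

In a sketch like this, `threshold` and `weight` would be tuned on held-out story pairs rather than fixed in advance.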

        • Published in

          CIKM '10: Proceedings of the 19th ACM international conference on Information and knowledge management
          October 2010
          2036 pages
          ISBN: 9781450300995
          DOI: 10.1145/1871437

          Copyright © 2010 ACM

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Acceptance Rates

          Overall Acceptance Rate: 1,861 of 8,427 submissions, 22%
