ABSTRACT
This paper presents a novel approach to story link detection, where the goal is to determine whether two news stories are linked, i.e., whether they discuss the same event. The work departs from prior work in that similarity is measured at two distinct levels of textual organization, the document and its collection, and the scores from both levels are combined to determine how strongly two stories are linked. Experiments on the TDT-5 corpus show that the approach, which we call a 'two-tier similarity model,' comfortably outperforms conventional approaches such as Clarity-enhanced KL divergence, while performing robustly across diverse languages.
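The abstract does not say how the two levels are scored or combined, so the following is only an illustrative sketch of the general idea, not the paper's model. It assumes cosine similarity over term-frequency vectors at both levels, treats each story's "collection" as a caller-supplied list of related stories summarized by a summed vector, and combines the two scores by linear interpolation. The function names and the `weight` parameter are hypothetical.

```python
import math
from collections import Counter


def tf_vector(text):
    """Bag-of-words term-frequency vector (lowercased, whitespace tokens)."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(count * v[term] for term, count in u.items() if term in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def collection_vector(stories):
    """Sum the term-frequency vectors of all stories in a collection.

    Cosine similarity is scale-invariant, so the sum behaves like a centroid.
    """
    total = Counter()
    for story in stories:
        total.update(tf_vector(story))
    return total


def two_tier_score(story_a, story_b, collection_a, collection_b, weight=0.5):
    """Interpolate document-level and collection-level similarity.

    ASSUMPTION: linear interpolation with a hypothetical `weight`; the
    abstract does not specify the actual combination rule.
    """
    doc_sim = cosine(tf_vector(story_a), tf_vector(story_b))
    coll_sim = cosine(collection_vector(collection_a),
                      collection_vector(collection_b))
    return weight * doc_sim + (1.0 - weight) * coll_sim
```

In a sketch like this, a pair of stories would be declared linked when `two_tier_score` exceeds a threshold tuned on held-out data; both the threshold and `weight` are free parameters here.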
Index Terms
- Two-tier similarity model for story link detection