research-article

Two-tier similarity model for story link detection

Author:
Tadashi Nomoto

National Institute of Japanese Literature, Tachikawa, Japan


CIKM '10: Proceedings of the 19th ACM international conference on Information and knowledge management, October 2010, Pages 789–798. https://doi.org/10.1145/1871437.1871539

Published: 26 October 2010


ABSTRACT

The paper presents a novel approach to story link detection, where the goal is to determine whether a pair of news stories is linked, i.e., whether the two stories talk about the same event. The present work departs from prior work in that we measure similarity at two distinct levels of textual organization, the document and its collection, and combine the scores at both levels to determine how well stories are linked. Experiments on the TDT-5 corpus show that the present approach, which we call a 'two-tier similarity model,' comfortably outperforms conventional approaches such as Clarity-enhanced KL divergence, while performing robustly across diverse languages.
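To make the idea of combining similarity at two levels concrete, the following is a minimal sketch, not the paper's actual model. The abstract does not say how each level is scored or how the scores are combined, so everything here is an assumption made for illustration: cosine similarity over raw term counts at both levels, a "collection" represented as a plain list of stories associated with each story, and a linear interpolation controlled by a hypothetical weight `lam`. The function names are likewise invented.

```python
from collections import Counter
from math import sqrt


def term_vector(text):
    """Bag-of-words counts over lowercased whitespace tokens."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm = sqrt(sum(c * c for c in u.values())) * sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0


def collection_vector(texts):
    """Sum the term vectors of every story in a collection."""
    total = Counter()
    for text in texts:
        total.update(term_vector(text))
    return total


def two_tier_score(story_a, story_b, collection_a, collection_b, lam=0.5):
    """Interpolate document-level and collection-level similarity.

    ASSUMPTION: linear interpolation with weight `lam` is an illustrative
    choice; the abstract does not specify the combination rule.
    """
    doc_sim = cosine(term_vector(story_a), term_vector(story_b))
    col_sim = cosine(collection_vector(collection_a), collection_vector(collection_b))
    return lam * doc_sim + (1.0 - lam) * col_sim
```

In this sketch a pair of stories would be declared linked when `two_tier_score` exceeds a threshold tuned on held-out data. Setting `lam=1.0` reduces it to document-level similarity alone, which gives a simple baseline to compare against.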


Index Terms

1. Information systems
  1. Information retrieval

Published in
CIKM '10: Proceedings of the 19th ACM international conference on Information and knowledge management
October 2010
2036 pages
ISBN: 9781450300995
DOI: 10.1145/1871437
General Chair: Jimmy Huang (York University, Canada)
Program Chairs: Nick Koudas (University of Toronto, Canada), Gareth Jones (Dublin City University, Ireland), Xindong Wu (University of Vermont, USA), Kevyn Collins-Thompson (Microsoft Research, USA), Aijun An (York University, Canada)
Copyright © 2010 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
pseudo relevance feedback
similarity measures
story link detection
tdt-5
topic tracking