Article
DOI: 10.1145/1277741.1277778

Topic segmentation with shared topic detection and alignment of multiple documents

Published: 23 July 2007

Abstract

Topic detection and tracking and topic segmentation play an important role in capturing the local and sequential information of documents. Previous work in this area usually focuses on single documents, although multiple similar documents are available in many domains. In this paper, we introduce a novel unsupervised method for shared topic detection and topic segmentation of multiple similar documents based on mutual information (MI) and weighted mutual information (WMI), a combination of MI and term weights. The basic idea is that the optimal segmentation maximizes MI (or WMI). Our approach can detect shared topics among documents, find the optimal boundaries in a document, and align segments among documents at the same time. It can also handle single-document segmentation as a special case of multi-document segmentation and alignment. By using entropy-based term weights learned from multiple documents, our methods can identify and strengthen cue terms that are useful for segmentation and partially remove stop words. Our experimental results show that our algorithm works well for the tasks of single-document segmentation, shared topic detection, and multi-document segmentation. Utilizing information from multiple documents can tremendously improve the performance of topic segmentation, and WMI performs even better than MI for multi-document segmentation.
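To make the idea that "the optimal segmentation maximizes MI" concrete, here is a minimal sketch for a single document split into exactly two segments. It is an illustration under stated assumptions, not the authors' algorithm: it uses the standard definition I(T;S) = sum over (t,s) of p(t,s) log[p(t,s) / (p(t) p(s))] between terms T and segments S, estimates probabilities from raw counts, and omits term weights, alignment across documents, and any efficient search. The function names `mutual_information` and `best_single_boundary` are hypothetical.

```python
import math
from collections import Counter


def mutual_information(segments):
    """Mutual information (in nats) between terms and segments.

    `segments` is a list of token lists. Probabilities are plain
    relative frequencies; no smoothing or term weighting is applied.
    """
    total = sum(len(seg) for seg in segments)
    term_counts = Counter(tok for seg in segments for tok in seg)
    mi = 0.0
    for seg in segments:
        p_s = len(seg) / total                  # p(segment)
        for term, n in Counter(seg).items():
            p_ts = n / total                    # p(term, segment)
            p_t = term_counts[term] / total     # p(term)
            mi += p_ts * math.log(p_ts / (p_t * p_s))
    return mi


def best_single_boundary(sentences):
    """Exhaustively try every two-way split of a list of tokenized
    sentences and return (boundary_index, mi) for the best split."""
    best_idx, best_mi = None, float("-inf")
    for i in range(1, len(sentences)):
        left = [tok for sent in sentences[:i] for tok in sent]
        right = [tok for sent in sentences[i:] for tok in sent]
        score = mutual_information([left, right])
        if score > best_mi:
            best_idx, best_mi = i, score
    return best_idx, best_mi


# Toy document (assumed data): two sentences about pets, two about finance.
toy = [["cat", "dog"], ["dog", "cat"], ["stock", "bond"], ["bond", "stock"]]
boundary, score = best_single_boundary(toy)
print(boundary, round(score, 3))
```

In the toy input the vocabulary of the two halves does not overlap, so the split after the second sentence makes each term fully predictive of its segment and scores highest. The number of segments is fixed at two here because adding segments can raise MI on its own, so comparing segmentations with different segment counts would need extra care.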

Published In

SIGIR '07: Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
July 2007
946 pages
ISBN:9781595935977
DOI:10.1145/1277741

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. multiple documents
  2. mutual information
  3. shared topic detection
  4. term weight
  5. topic alignment
  6. topic segmentation

Qualifiers

  • Article

Conference

SIGIR07: The 30th Annual International SIGIR Conference
July 23 - 27, 2007
Amsterdam, The Netherlands

Acceptance Rates

Overall Acceptance Rate 792 of 3,983 submissions, 20%

Cited By

  • (2015) Modelling the 'hurried' bug report reading process to summarize bug reports. Empirical Software Engineering 20:2 (516-548). DOI: 10.1007/s10664-014-9311-2. Online publication date: 1-Apr-2015
  • (2014) A hybrid linear text segmentation algorithm using hierarchical agglomerative clustering and discrete particle swarm optimization. Integrated Computer-Aided Engineering 21:1 (35-46). DOI: 10.3233/ICA-130446. Online publication date: 1-Jan-2014
  • (2014) Automatic quality measurement for health information on the internet. International Journal of Intelligent Information and Database Systems 8:4 (340-358). DOI: 10.1504/IJIIDS.2014.068340. Online publication date: 1-Mar-2014
  • (2014) Effective automatic image annotation via integrated discriminative and generative models. Information Sciences: an International Journal 262 (159-171). DOI: 10.1016/j.ins.2013.11.005. Online publication date: 1-Mar-2014
  • (2013) Optimizing temporal topic segmentation for intelligent text visualization. Proceedings of the 2013 international conference on Intelligent user interfaces (339-350). DOI: 10.1145/2449396.2449441. Online publication date: 19-Mar-2013
  • (2013) A passage extractor for classification of disease aspect information. Journal of the American Society for Information Science and Technology 64:11 (2265-2277). DOI: 10.1002/asi.22926. Online publication date: 27-Aug-2013
  • (2012) Topic Extraction for Documents Based on Compressibility Vector. IEICE Transactions on Information and Systems E95.D:10 (2438-2446). DOI: 10.1587/transinf.E95.D.2438. Online publication date: 2012
  • (2012) Modelling the 'Hurried' bug report reading process to summarize bug reports. Proceedings of the 2012 IEEE International Conference on Software Maintenance (ICSM) (430-439). DOI: 10.1109/ICSM.2012.6405303. Online publication date: 23-Sep-2012
  • (2012) Detection of cross-channel anomalies. Knowledge and Information Systems 35:1 (33-59). DOI: 10.1007/s10115-012-0509-6. Online publication date: 12-Jun-2012
  • (2011) Structural topic model for latent topical structure analysis. Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1 (1526-1535). DOI: 10.5555/2002472.2002657. Online publication date: 19-Jun-2011