DOI: 10.1145/1031171.1031193

A practical web-based approach to generating topic hierarchy for text segments

Published: 13 November 2004

Abstract

It is crucial in many information systems to organize short text segments, such as keywords in documents and queries from users, into a well-formed topic hierarchy. In this paper, we address the problem of generating topic hierarchies for diverse text segments with a general and practical approach that uses the Web as an additional knowledge source. Unlike long documents, short text segments typically do not contain enough information to extract reliable features. This work investigates the use of highly ranked search-result snippets to enrich the representation of text segments. A hierarchical clustering algorithm is then applied to create the hierarchical topic structure of the text segments. Unlike traditional clustering algorithms, which tend to produce cluster hierarchies with a very unnatural shape, the approach tries to produce a more natural and comprehensive hierarchy. Extensive experiments were conducted on text segments from different domains. The results show the potential of the proposed approach, which we believe can benefit many information systems.
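
The two steps named in the abstract (enriching each text segment with search-result snippets, then clustering hierarchically) can be illustrated with a minimal sketch. This is not the paper's algorithm: it assumes TF-IDF weighting, cosine similarity, and plain average-link agglomerative clustering, which yields exactly the kind of binary hierarchy the abstract calls unnatural. The snippet texts and all function names are hypothetical; a real system would retrieve the snippets from a search engine.

```python
import math
from collections import Counter

# Hypothetical stand-ins for highly ranked search-result snippets.
# A real system would fetch these from a search engine per text segment.
SNIPPETS = {
    "python": ["python programming language code", "python code tutorial programming"],
    "java": ["java programming language code", "java code tutorial programming"],
    "jaguar": ["jaguar big cat animal wild", "jaguar animal habitat"],
    "tiger": ["tiger big cat animal wild", "tiger animal habitat stripes"],
}


def tfidf_vectors(snippets):
    """Build one sparse TF-IDF vector per segment from its pooled snippets."""
    docs = {seg: Counter(" ".join(texts).split()) for seg, texts in snippets.items()}
    n = len(docs)
    df = Counter(term for counts in docs.values() for term in counts)
    return {
        seg: {t: tf * math.log(1 + n / df[t]) for t, tf in counts.items()}
        for seg, counts in docs.items()
    }


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def leaves(tree):
    """Return the segment names under a tree node (a str leaf or a 2-tuple)."""
    return [tree] if isinstance(tree, str) else leaves(tree[0]) + leaves(tree[1])


def average_link(a, b, vecs):
    """Average pairwise similarity between the leaves of two clusters."""
    la, lb = leaves(a), leaves(b)
    return sum(cosine(vecs[x], vecs[y]) for x in la for y in lb) / (len(la) * len(lb))


def build_hierarchy(snippets):
    """Repeatedly merge the two most similar clusters into a binary tree."""
    vecs = tfidf_vectors(snippets)
    clusters = sorted(snippets)  # start with one leaf per segment
    while len(clusters) > 1:
        pairs = [(i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
        i, j = max(pairs, key=lambda p: average_link(clusters[p[0]], clusters[p[1]], vecs))
        merged = (clusters[i], clusters[j])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0]


if __name__ == "__main__":
    print(build_hierarchy(SNIPPETS))
```

On this toy input the segment names alone share no words, so the grouping comes entirely from the snippet vocabulary. That is the enrichment effect the abstract relies on.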

      Published In

      CIKM '04: Proceedings of the thirteenth ACM international conference on Information and knowledge management
      November 2004
      678 pages
      ISBN:1581138741
      DOI:10.1145/1031171

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. clustering
      2. partitioning
      3. search-result snippet
      4. text segment
      5. topic hierarchy generation
      6. web data mining

      Qualifiers

      • Article

      Conference

      CIKM04: Conference on Information and Knowledge Management
      November 8 - 13, 2004
      Washington, D.C., USA

      Acceptance Rates

      Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

      Cited By

      • (2024) Automated Category Tree Construction: Hardness Bounds and Algorithms. ACM Transactions on Database Systems. DOI: 10.1145/3664283. Online publication date: 9-May-2024
      • (2023) A Novel Hierarchical Storage System for Different Types of Data. Proceedings of the 4th International Conference on Big Data Analytics for Cyber-Physical System in Smart City - Volume 2, pages 700-709. DOI: 10.1007/978-981-99-1157-8_84. Online publication date: 1-Apr-2023
      • (2022) Automated Category Tree Construction in E-Commerce. Proceedings of the 2022 International Conference on Management of Data, pages 1770-1783. DOI: 10.1145/3514221.3526124. Online publication date: 10-Jun-2022
      • (2021) Identifying Queries in Instant Search Logs. Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 1692-1696. DOI: 10.1145/3404835.3463025. Online publication date: 11-Jul-2021
      • (2021) ConCaT: Construction of Category Trees from Search Queries in E-Commerce. 2021 IEEE 37th International Conference on Data Engineering (ICDE), pages 2701-2704. DOI: 10.1109/ICDE51399.2021.00308. Online publication date: Apr-2021
      • (2021) A Study on Different Aspects of Web Mining and Research Issues. IOP Conference Series: Materials Science and Engineering, 1022 (012018). DOI: 10.1088/1757-899X/1022/1/012018. Online publication date: 19-Jan-2021
      • (2018) Automatic Cluster Labeling Based on Phylogram Analysis. 2018 International Joint Conference on Neural Networks (IJCNN), pages 1-8. DOI: 10.1109/IJCNN.2018.8489325. Online publication date: Jul-2018
      • (2017) Automatic maintenance of category hierarchy. Future Generation Computer Systems, 67, pages 1-12. DOI: 10.1016/j.future.2016.06.038. Online publication date: Feb-2017
      • (2016) A mental model approach for category hierarchy maintenance on sellers' self-input items in e-commerce websites. 2016 11th Iberian Conference on Information Systems and Technologies (CISTI), pages 1-7. DOI: 10.1109/CISTI.2016.7521525. Online publication date: Jun-2016
      • (2015) Incremental learning from news events. Knowledge-Based Systems, 89:C, pages 618-626. DOI: 10.1016/j.knosys.2015.09.007. Online publication date: 1-Nov-2015
