DOI: 10.1145/1655925.1655956
research-article

PhraseRank for document clustering: reweighting the weight of phrase

Published: 24 November 2009

Abstract

Given a document collection, a hierarchical clustering algorithm groups the documents into several clusters. Recent work has identified sets of overlap phrases as useful features for hierarchical document clustering. However, that work considered neither the relationships between overlap phrases that co-occur in a document nor the degrees of opposite relationships between overlap phrases. In this paper, we propose new algorithms that compute an effective similarity measure before the hierarchical clustering algorithm is run. The proposed methods have two important features: a ranked list of the top-k phrases for each overlap phrase, and the opposite significances of two overlap phrases with respect to each other. Experimental results show that the proposed method improves clustering results.
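
The abstract does not say how the top-k ranking list for an overlap phrase is computed. As a rough illustration only, the sketch below assumes the ranking is based on plain document-level co-occurrence counts; the function name `top_k_cooccurring`, the sample phrases, and the tie-breaking rule are all hypothetical and are not taken from the paper.

```python
from collections import Counter, defaultdict
from itertools import permutations


def top_k_cooccurring(docs, k):
    """For each phrase, list the k phrases that co-occur with it most often.

    docs: iterable of phrase collections, one per document.
    Assumption: co-occurrence means "appears in the same document".
    """
    counts = defaultdict(Counter)
    for phrases in docs:
        # Count every ordered pair of distinct phrases once per document.
        for a, b in permutations(sorted(set(phrases)), 2):
            counts[a][b] += 1
    # Rank by descending count; break ties alphabetically for determinism.
    return {
        phrase: [p for p, _ in
                 sorted(c.items(), key=lambda kv: (-kv[1], kv[0]))[:k]]
        for phrase, c in counts.items()
    }


docs = [
    ["document clustering", "suffix tree", "overlap phrase"],
    ["document clustering", "overlap phrase"],
    ["suffix tree", "similarity measure"],
]
ranking = top_k_cooccurring(docs, k=2)
print(ranking["document clustering"])
```

A ranking of this kind could then feed a reweighting step before similarity is computed, but how the paper actually combines the lists with phrase weights is not stated in the abstract.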

    Published In

    ICIS '09: Proceedings of the 2nd International Conference on Interaction Sciences: Information Technology, Culture and Human
    November 2009
    1479 pages
    ISBN:9781605587103
    DOI:10.1145/1655925

    Sponsors

    • AICIT
    • ETRI
    • KISTI

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. document model
    2. overlap phrases
    3. reweighting
