research-article

Clustering and Labeling a Web Scale Document Collection using Wikipedia clusters

Authors:
Richi Nayak, Queensland University of Technology, Brisbane, Australia
Rachel Mills, Queensland University of Technology, Brisbane, Australia
Christopher De-Vries, [email protected], Berlin, Germany
Shlomo Geva, Queensland University of Technology, Brisbane, Australia

Web-KR '14: Proceedings of the 5th International Workshop on Web-scale Knowledge Representation Retrieval & Reasoning, November 2014, Pages 23–30, https://doi.org/10.1145/2663792.2663803

Published: 03 November 2014

ABSTRACT

Clustering is an important technique for organising and categorising web-scale document collections. The main challenges in clustering the billions of documents available on the web are the processing power required and the sheer size of the datasets. More importantly, it is nigh impossible to generate labels for a general web document collection containing billions of documents and a vast taxonomy of topics, yet document clusters are most commonly evaluated by comparison to a ground-truth set of document labels. This paper presents a clustering and labeling solution in which Wikipedia is clustered and hundreds of millions of web documents in ClueWeb12 are mapped onto those clusters. The solution is based on the assumption that Wikipedia covers such a wide range of diverse topics that it represents a small-scale web. We found that it was possible to perform the web-scale document clustering and labeling process on one desktop computer in under a couple of days for the Wikipedia clustering solution containing about 1000 clusters. It takes longer to execute a solution with finer-granularity clusters, such as 10,000 or 50,000 clusters. These results were evaluated using a set of external data.
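
The two-stage idea in the abstract (cluster a small reference collection, then map a much larger collection onto those clusters) can be illustrated with a minimal sketch. This is not the authors' method: the abstract does not name the clustering algorithm or document representation, so the sketch swaps in plain spherical k-means over term-frequency vectors. The toy documents, the function names, the seed choice and the top-terms labeling rule are all assumptions made for illustration.

```python
import math
from collections import Counter


def vectorize(text):
    """Turn text into a unit-length term-frequency vector (term -> weight)."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {t: c / norm for t, c in counts.items()} if norm else {}


def cosine(a, b):
    """Dot product of two sparse unit vectors (equals cosine similarity)."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def mean_vector(vectors):
    """Sum sparse vectors and re-normalise the result to unit length."""
    total = Counter()
    for v in vectors:
        for t, w in v.items():
            total[t] += w
    norm = math.sqrt(sum(w * w for w in total.values()))
    return {t: w / norm for t, w in total.items()} if norm else {}


def nearest(vec, centroids):
    """Index of the centroid most similar to vec."""
    return max(range(len(centroids)), key=lambda i: cosine(vec, centroids[i]))


def kmeans(vectors, seeds, iterations=10):
    """Spherical k-means, seeded from the given document indices."""
    centroids = [vectors[i] for i in seeds]
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for v in vectors:
            groups[nearest(v, centroids)].append(v)
        # Keep the previous centroid if a cluster ends up empty.
        centroids = [mean_vector(g) if g else c
                     for g, c in zip(groups, centroids)]
    return centroids


def label(centroid, n=2):
    """Label a cluster with its n highest-weighted terms (ties: alphabetical)."""
    return sorted(centroid, key=lambda t: (-centroid[t], t))[:n]


# Stage 1: cluster the small reference collection (a stand-in for Wikipedia).
reference = [
    "football match goal league",
    "football league team goal",
    "python code compiler bug",
    "compiler code python function",
]
ref_vecs = [vectorize(d) for d in reference]
centroids = kmeans(ref_vecs, seeds=[0, 2])
labels = [label(c) for c in centroids]

# Stage 2: map the large collection (a stand-in for ClueWeb12) onto the
# existing clusters; each document needs only one pass over the centroids.
web_docs = ["team scored a goal in the league", "bug in my python code"]
assignments = [nearest(vectorize(d), centroids) for d in web_docs]

for doc, cluster in zip(web_docs, assignments):
    print(doc, "->", cluster, labels[cluster])
```

The second stage never re-clusters: its cost grows with the number of web documents times the number of clusters. That is consistent with the abstract's remark that finer-grained solutions take longer to execute.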

Index Terms

Clustering and Labeling a Web Scale Document Collection using Wikipedia clusters
1. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Unsupervised learning
        Cluster analysis

Published in
Web-KR '14: Proceedings of the 5th International Workshop on Web-scale Knowledge Representation Retrieval & Reasoning
November 2014
72 pages
ISBN:9781450316064
DOI:10.1145/2663792
Program Chairs:
Yi Zeng, Chinese Academy of Sciences, China
Spyros Kotoulas, IBM Research, Ireland
Zhisheng Huang, VU University Amsterdam, The Netherlands
Copyright © 2014 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
clueweb
document clustering
document signature
large scale clustering
wikipedia
Qualifiers
- research-article
Conference

Acceptance Rates
Overall Acceptance Rate: 4 of 4 submissions, 100%
