ABSTRACT
Clustering is an important technique for organising and categorising web-scale document collections. The main challenges in clustering the billions of documents available on the web are the processing power required and the sheer size of the datasets. More importantly, it is nigh impossible to generate labels for a general web document collection containing billions of documents and a vast taxonomy of topics, yet document clusters are most commonly evaluated by comparison to a ground truth set of document labels. This paper presents a clustering and labeling solution in which Wikipedia is clustered and hundreds of millions of web documents in ClueWeb12 are mapped onto those clusters. The solution is based on the assumption that Wikipedia covers such a wide range of diverse topics that it represents a small-scale web. We found that it was possible to perform the web-scale document clustering and labeling process on one desktop computer in under a couple of days for the Wikipedia clustering solution containing about 1000 clusters. Solutions with finer-granularity clusters, such as 10,000 or 50,000 clusters, take longer to execute. These results were evaluated using a set of external data.
Index Terms
- Clustering and Labeling a Web Scale Document Collection using Wikipedia clusters