DOI: 10.1145/2663792.2663803
research-article

Clustering and Labeling a Web Scale Document Collection using Wikipedia clusters

Published: 03 November 2014

ABSTRACT

Clustering is an important technique for organising and categorising web scale document collections. The main challenges in clustering the billions of documents available on the web are the processing power required and the sheer size of the datasets. More importantly, it is nigh impossible to generate labels for a general web document collection containing billions of documents and a vast taxonomy of topics, yet document clusters are most commonly evaluated by comparison to a ground truth set of document labels. This paper presents a clustering and labeling solution in which Wikipedia is clustered and hundreds of millions of web documents in ClueWeb12 are mapped onto those clusters. The solution is based on the assumption that Wikipedia covers such a wide range of diverse topics that it represents a small scale web. We found that it was possible to perform the web scale document clustering and labeling process on one desktop computer in under a couple of days for a Wikipedia clustering solution containing about 1000 clusters. Solutions with finer granularity, such as 10,000 or 50,000 clusters, take longer to execute. These results were evaluated using a set of external data.
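The mapping step described in the abstract (cluster a labeled reference collection, then assign each web document to one of those clusters) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not state the document representation or the clustering algorithm, so the bag-of-words vectors, the cosine similarity, the nearest-centroid rule, and every name below (`vectorize`, `map_to_clusters`, the toy labels and texts) are assumptions chosen only for illustration.

```python
import math
from collections import Counter


def vectorize(text):
    """Lowercased bag-of-words counts (an assumed stand-in representation)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(weight * b.get(term, 0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(vectors):
    """Sum of member vectors; cosine ignores scale, so no division is needed."""
    total = Counter()
    for vec in vectors:
        total.update(vec)
    return total


def map_to_clusters(reference_clusters, web_docs):
    """Assign each web document to the most similar reference cluster."""
    centroids = {
        label: centroid(vectorize(text) for text in texts)
        for label, texts in reference_clusters.items()
    }
    assignments = {}
    for doc_id, text in web_docs.items():
        vec = vectorize(text)
        assignments[doc_id] = max(
            centroids, key=lambda label: cosine(vec, centroids[label])
        )
    return assignments


# Hypothetical toy data: two already-labeled reference clusters
# and two unlabeled "web" documents to be mapped onto them.
reference_clusters = {
    "astronomy": ["the planet orbits the star", "a telescope observes the galaxy"],
    "cooking": ["bake the bread in the oven", "simmer the soup with salt"],
}
web_docs = {
    "doc-1": "new telescope images of a distant galaxy",
    "doc-2": "how to bake bread without an oven",
}

assignments = map_to_clusters(reference_clusters, web_docs)
print(assignments)
```

In this sketch each web document inherits the label of its nearest reference cluster, which is how a small labeled collection can supply labels for a much larger unlabeled one. The cost per document grows with the number of clusters, consistent with the abstract's remark that finer granularity takes longer to execute.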

    • Published in
      Web-KR '14: Proceedings of the 5th International Workshop on Web-scale Knowledge Representation Retrieval & Reasoning
      November 2014
      72 pages
      ISBN: 9781450316064
      DOI: 10.1145/2663792

      Copyright © 2014 ACM

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Acceptance Rates

      Overall Acceptance Rate: 4 of 4 submissions, 100%
