Multi-label Wikipedia Classification with Textual and Link Features

  • Conference paper

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 6203))

Abstract

We address the problem of categorizing a large set of linked documents in which both content and structure matter, specifically the Wikipedia collection used in the INEX 2009 XML Mining challenge. We analyze the link network of the collection's pages and derive features from it for classification. We combine the content-based and link-based features of pages to train an accurate categorizer for unlabelled pages. In the multi-label setting, we revisit a number of existing techniques and test those that scale well. We report evaluation results obtained with a variety of learning methods and techniques on the training set of the Wikipedia corpus.
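
The idea of combining content-based and link-based features in a multi-label setting can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the abstract does not say which features or learners the paper uses, so the sketch takes term frequencies as textual features, in-degree and out-degree as link features, and one perceptron per label (binary relevance) as the learner. The toy pages, links, labels, and all function names are hypothetical.

```python
from collections import Counter

def text_features(text):
    """Term-frequency features from lowercased, whitespace-tokenized text."""
    counts = Counter(text.lower().split())
    return {f"term:{t}": float(c) for t, c in counts.items()}

def link_features(page, links):
    """In-degree and out-degree of `page` in a directed link graph (assumed features)."""
    out_deg = len(links.get(page, ()))
    in_deg = sum(1 for targets in links.values() if page in targets)
    return {"link:in_degree": float(in_deg), "link:out_degree": float(out_deg)}

def combined_features(page, texts, links):
    """Merge textual and link features into one sparse feature dict."""
    feats = text_features(texts[page])
    feats.update(link_features(page, links))
    return feats

def dot(w, x):
    """Sparse dot product between a weight dict and a feature dict."""
    return sum(w.get(k, 0.0) * v for k, v in x.items())

def train_binary_relevance(examples, labels, epochs=10):
    """Train one independent perceptron per label (binary relevance)."""
    all_labels = sorted({l for ls in labels.values() for l in ls})
    weights = {l: {} for l in all_labels}
    for _ in range(epochs):
        for page, x in examples.items():
            for l in all_labels:
                target = 1 if l in labels[page] else -1
                pred = 1 if dot(weights[l], x) > 0 else -1
                if pred != target:
                    # Standard perceptron update on a mistake.
                    for k, v in x.items():
                        weights[l][k] = weights[l].get(k, 0.0) + target * v
    return weights

def predict(weights, x):
    """Return every label whose perceptron scores the page positively."""
    return {l for l, w in weights.items() if dot(w, x) > 0}

# Hypothetical toy collection: page text, outgoing links, and label sets.
texts = {"A": "graph theory graph", "B": "french painter", "C": "graph painter"}
links = {"A": {"B", "C"}, "B": {"C"}, "C": set()}
labels = {"A": {"math"}, "B": {"art"}, "C": {"math", "art"}}

examples = {p: combined_features(p, texts, links) for p in texts}
weights = train_binary_relevance(examples, labels)
```

Calling `predict(weights, combined_features(page, texts, links))` returns a set of labels, so a page can receive zero, one, or several categories. Binary relevance is chosen here only because it is simple and trains each label independently; it ignores dependencies between labels.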

Copyright information

© 2010 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Chidlovskii, B. (2010). Multi-label Wikipedia Classification with Textual and Link Features. In: Geva, S., Kamps, J., Trotman, A. (eds) Focused Retrieval and Evaluation. INEX 2009. Lecture Notes in Computer Science, vol 6203. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-14556-8_38

  • DOI: https://doi.org/10.1007/978-3-642-14556-8_38

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-14555-1

  • Online ISBN: 978-3-642-14556-8

  • eBook Packages: Computer Science, Computer Science (R0)