DOI: 10.1145/2396761.2398588
short-paper

Clustering Wikipedia infoboxes to discover their types

Published: 29 October 2012

Abstract

Wikipedia has emerged as an important source of structured information on the Web. But while the success of Wikipedia can be attributed in part to the simplicity of adding and modifying content, this has also created challenges when it comes to using, querying, and integrating the information. Even though authors are encouraged to select appropriate categories and provide infoboxes that follow pre-defined templates, many do not follow the guidelines or follow them loosely. This leads to undesirable effects, such as template duplication, heterogeneity, and schema drift. As a step towards addressing this problem, we propose a new unsupervised approach for clustering Wikipedia infoboxes. Instead of relying on manually assigned categories and template labels, we use the structured information available in infoboxes to group them and infer their entity types. Experiments using over 48,000 infoboxes indicate that our clustering approach is effective and produces high quality clusters.
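
The abstract does not describe the clustering algorithm itself, so the following is only an illustrative sketch of the general idea of grouping infoboxes by their structured content rather than by category or template labels. It assumes each infobox is reduced to a set of attribute names, that similarity is measured with the Jaccard coefficient, and that groups are the connected components of a thresholded similarity graph. The function names, the threshold value, and the toy data are all hypothetical and are not taken from the paper.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between two attribute-name sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def cluster_infoboxes(infoboxes, threshold=0.5):
    """Group infoboxes whose attribute sets overlap by at least `threshold`.

    Assumption: clusters are connected components of the similarity graph,
    computed with a small union-find. This is not the paper's method.
    """
    names = list(infoboxes)
    parent = {n: n for n in names}

    def find(n):
        # Follow parent links to the root, compressing the path as we go.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for x, y in combinations(names, 2):
        if jaccard(infoboxes[x], infoboxes[y]) >= threshold:
            parent[find(x)] = find(y)

    clusters = {}
    for n in names:
        clusters.setdefault(find(n), set()).add(n)
    return sorted(sorted(c) for c in clusters.values())


# Toy data (hypothetical): attribute names only, no template or category labels.
infoboxes = {
    "Film A": {"director", "starring", "runtime", "language"},
    "Film B": {"director", "starring", "runtime", "budget"},
    "Person A": {"birth_date", "birth_place", "occupation"},
    "Person B": {"birth_date", "occupation", "spouse"},
}
clusters = cluster_infoboxes(infoboxes)
print(clusters)
```

On this toy input the two film infoboxes share three of five attributes and the two person infoboxes share two of four, so each pair meets the 0.5 threshold and forms its own group. An entity type for each group could then be inferred from the attributes its members have in common.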

Published In

CIKM '12: Proceedings of the 21st ACM international conference on Information and knowledge management
October 2012
2840 pages
ISBN:9781450311564
DOI:10.1145/2396761
Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. clustering
  2. wikipedia infobox

Qualifiers

  • Short-paper

Conference

CIKM'12
Acceptance Rates

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

Article Metrics

  • Total Citations: 0
  • Total Downloads: 241
  • Downloads (Last 12 months): 1
  • Downloads (Last 6 weeks): 0

Reflects downloads up to 25 Feb 2025
