DOI: 10.1145/1498759.1498763
invited-talk

Harvesting, searching, and ranking knowledge on the web: invited talk

Published: 09 February 2009

Abstract

There are major trends to advance the functionality of search engines to a more expressive semantic level (e.g., [2, 4, 6, 7, 8, 9, 13, 14, 18]). This is enabled by employing large-scale information extraction [1, 11, 20] of entities and relationships from semistructured as well as natural-language Web sources. In addition, harnessing Semantic-Web-style ontologies [22] and reaching into Deep-Web sources [16] can contribute towards a grand vision of turning the Web into a comprehensive knowledge base that can be efficiently searched with high precision.
This talk presents ongoing research towards this objective, with emphasis on our work on the YAGO knowledge base [23, 24] and the NAGA search engine [14] but also covering related projects. YAGO is a large collection of entities and relational facts that are harvested from Wikipedia and WordNet with high accuracy and reconciled into a consistent RDF-style "semantic" graph. For further growing YAGO from Web sources while retaining its high quality, pattern-based extraction is combined with logic-based consistency checking in a unified framework [25]. NAGA provides graph-template-based search over this data, with powerful ranking capabilities based on a statistical language model for graphs. Advanced queries and the need for ranking approximate matches pose efficiency and scalability challenges that are addressed by algorithmic and indexing techniques [15, 17].
YAGO is publicly available and has been imported into various other knowledge-management projects including DBpedia. YAGO shares many of its goals and methodologies with parallel projects along related lines. These include Avatar [19], Cimple/DBlife [10, 21], DBpedia [3], KnowItAll/TextRunner [12, 5], Kylin/KOG [26, 27], and the Libra technology [18, 28] (and more). Together they form an exciting trend towards providing comprehensive knowledge bases with semantic search capabilities.
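
As a rough illustration of the two ideas described above, facts stored as an RDF-style graph of (subject, relation, object) triples and queries posed as graph templates with variables, the following minimal Python sketch may help. It is not the YAGO or NAGA implementation: the facts, the relation names (`type`, `bornIn`, `locatedIn`), the `?`-prefixed variable syntax, and the functions `unify` and `match` are all assumptions made for this example, and the ranking of approximate matches that NAGA performs is left out entirely.

```python
from typing import Dict, Iterator, List, Optional, Tuple

Triple = Tuple[str, str, str]   # (subject, relation, object)
Binding = Dict[str, str]        # variable name -> bound value

# Illustrative facts only; not taken from the YAGO data set.
FACTS: List[Triple] = [
    ("Albert_Einstein", "type", "physicist"),
    ("Albert_Einstein", "bornIn", "Ulm"),
    ("Max_Planck", "type", "physicist"),
    ("Max_Planck", "bornIn", "Kiel"),
    ("Ulm", "locatedIn", "Germany"),
    ("Kiel", "locatedIn", "Germany"),
]


def is_var(term: str) -> bool:
    """Terms starting with '?' are treated as variables (an assumed convention)."""
    return term.startswith("?")


def unify(pattern: Triple, fact: Triple, binding: Binding) -> Optional[Binding]:
    """Return a copy of `binding` extended so that `pattern` equals `fact`, or None."""
    extended = dict(binding)
    for p, f in zip(pattern, fact):
        if is_var(p):
            # Bind the variable, or reject if it is already bound to something else.
            if extended.setdefault(p, f) != f:
                return None
        elif p != f:
            return None
    return extended


def match(template: List[Triple], facts: List[Triple],
          binding: Optional[Binding] = None) -> Iterator[Binding]:
    """Yield every variable binding under which all template triples occur in `facts`."""
    binding = {} if binding is None else binding
    if not template:
        yield binding
        return
    first, rest = template[0], template[1:]
    for fact in facts:
        extended = unify(first, fact, binding)
        if extended is not None:
            yield from match(rest, facts, extended)


# Template: physicists together with their German birthplaces.
query = [
    ("?x", "type", "physicist"),
    ("?x", "bornIn", "?city"),
    ("?city", "locatedIn", "Germany"),
]
results = list(match(query, FACTS))
for row in results:
    print(row["?x"], row["?city"])
```

The sketch scans every fact for every template triple, which is exponential in the worst case. The efficiency and scalability challenges the abstract mentions are exactly what make such a naive scan insufficient at the scale of a real knowledge base.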

References

[1] Eugene Agichtein: Scaling Information Extraction to Large Document Collections. IEEE Data Eng. Bull. 28(4), 2005
[2] Kemafor Anyanwu, Angela Maduko, Amit P. Sheth: SemRank: Ranking Complex Relationship Search Results on the Semantic Web. WWW 2005
[3] Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, Zachary G. Ives: DBpedia: A Nucleus for a Web of Open Data. ISWC/ASWC 2007
[4] Ricardo A. Baeza-Yates, Massimiliano Ciaramita, Peter Mika, Hugo Zaragoza: Towards Semantic Search. NLDB 2008
[5] Michele Banko, Michael J. Cafarella, Stephen Soderland, Matthew Broadhead, Oren Etzioni: Open Information Extraction from the Web. IJCAI 2007
[6] Holger Bast, Alexandru Chitea, Fabian M. Suchanek, Ingmar Weber: ESTER: Efficient Search on Text, Entities, and Relations. SIGIR 2007
[7] Michael J. Cafarella: Extracting and Querying a Comprehensive Web Database. CIDR 2009
[8] Soumen Chakrabarti: Breaking Through the Syntax Barrier: Searching with Entities and Relations. ECML 2004
[9] Tao Cheng, Xifeng Yan, Kevin Chen-Chuan Chang: EntityRank: Searching Entities Directly and Holistically. VLDB 2007
[10] Pedro DeRose, Warren Shen, Fei Chen, AnHai Doan, Raghu Ramakrishnan: Building Structured Web Community Portals: A Top-Down, Compositional, and Incremental Approach. VLDB 2007
[11] AnHai Doan, Luis Gravano, Raghu Ramakrishnan, Shivakumar Vaithyanathan (Editors): Special Issue on Information Extraction. SIGMOD Record 37(4), December 2008
[12] Oren Etzioni, Michael J. Cafarella, Doug Downey, Ana-Maria Popescu, Tal Shaked, Stephen Soderland, Daniel S. Weld, Alexander Yates: Unsupervised Named-Entity Extraction from the Web: An Experimental Study. Artif. Intell. 165(1), 2005
[13] Jens Graupmann, Ralf Schenkel, Gerhard Weikum: The SphereSearch Engine for Unified Ranked Retrieval of Heterogeneous XML and Web Documents. VLDB 2005
[14] Gjergji Kasneci, Fabian M. Suchanek, Georgiana Ifrim, Maya Ramanath, Gerhard Weikum: NAGA: Searching and Ranking Knowledge. ICDE 2008
[15] Gjergji Kasneci, Maya Ramanath, Mauro Sozio, Fabian M. Suchanek, Gerhard Weikum: STAR: Steiner Tree Approximation in Relationship-Graphs. ICDE 2009
[16] Jayant Madhavan, Loredana Afanasiev, Lyublena Antova, Alon Halevy: Harnessing the Deep Web: Present and Future. CIDR 2009
[17] Thomas Neumann, Gerhard Weikum: RDF-3X: a RISC-style Engine for RDF. PVLDB 1(1), 2008
[18] Zaiqing Nie, Yunxiao Ma, Shuming Shi, Ji-Rong Wen, Wei-Ying Ma: Web Object Retrieval. WWW 2007
[19] Frederick Reiss, Sriram Raghavan, Rajasekar Krishnamurthy, Huaiyu Zhu, Shivakumar Vaithyanathan: An Algebraic Approach to Rule-Based Information Extraction. ICDE 2008
[20] Sunita Sarawagi: Information Extraction. Foundations and Trends in Databases 2(1), 2008
[21] Warren Shen, AnHai Doan, Jeffrey F. Naughton, Raghu Ramakrishnan: Declarative Information Extraction Using Datalog with Embedded Extraction Predicates. VLDB 2007
[22] Steffen Staab, Rudi Studer: Handbook on Ontologies, 2nd Edition. Springer 2008
[23] Fabian M. Suchanek, Gjergji Kasneci, Gerhard Weikum: YAGO: a Core of Semantic Knowledge. WWW 2007
[24] Fabian M. Suchanek, Gjergji Kasneci, Gerhard Weikum: YAGO: A Large Ontology from Wikipedia and WordNet. Journal of Web Semantics 6(3), 2008
[25] Fabian M. Suchanek, Mauro Sozio, Gerhard Weikum: SOFIE: a Self-Organizing Framework for Information Extraction. Technical Report MPI-I-2008-5-004, 2008
[26] Fei Wu, Daniel S. Weld: Autonomously Semantifying Wikipedia. CIKM 2007
[27] Fei Wu, Daniel S. Weld: Automatically Refining the Wikipedia Infobox Ontology. WWW 2008
[28] Jun Zhu, Zaiqing Nie, Ji-Rong Wen, Bo Zhang, Wei-Ying Ma: Simultaneous Record Detection and Attribute Labeling in Web Data Extraction. KDD 2006

Cited By

  • (2015) Finding semantic associations in hierarchically structured groups of Web data. Formal Aspects of Computing 27:5-6 (867-884). DOI: 10.1007/s00165-015-0337-z. Online publication date: 9-Jul-2015
  • (2010) From information to knowledge. Proceedings of the twenty-ninth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems (65-76). DOI: 10.1145/1807085.1807097. Online publication date: 6-Jun-2010
  • (2009) Knowledge Management with Snapshots. Leveraging Knowledge for Innovation in Collaborative Networks (293-300). DOI: 10.1007/978-3-642-04568-4_31. Online publication date: 2009

      Published In

      WSDM '09: Proceedings of the Second ACM International Conference on Web Search and Data Mining
      February 2009
      314 pages
      ISBN: 9781605583907
      DOI: 10.1145/1498759
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. Information extraction
      2. information retrieval
      3. knowledge management
      4. scalability

      Acceptance Rates

      Overall Acceptance Rate 498 of 2,863 submissions, 17%
