skip to main content
research-article

Search and analytics using semantic annotations

Published: 23 March 2021 Publication History

Abstract

Current information retrieval systems are limited to text in documents for helping users with their information needs. With the progress in the field of natural language processing, there now exists the possibility of enriching large document collections with accurate semantic annotations. Annotations in the form of part-of-speech tags, temporal expressions, numerical values, geographic locations, and other named entities can help us look at terms in text with additional semantics. This doctoral dissertation presents methods for search and analysis of large semantically annotated document collections. Concretely, we make contributions along three broad directions: indexing, querying, and mining of large semantically annotated document collections.
Indexing Annotated Document Collections. Knowledge-centric tasks such as information extraction, question answering, and relationship extraction require a user to retrieve text regions within documents that detail relationships between entities. Current search systems are ill-equipped to handle such tasks, as they can only provide phrase querying with Boolean operators. To enable knowledge acquisition at scale, we propose gyani, an indexing infrastructure for knowledge-centric tasks. gyani enables search for structured query patterns by allowing regular expression operators to be expressed between word sequences and semantic annotations. To implement grep-like search capabilities over large annotated document collections, we present a data model and index design choices involving word sequences, annotations, and their combinations. We show that by using our proposed indexing infrastructure we bring about drastic speedups in crucial knowledge-centric tasks: 95× in information extraction, 53× in question answering, and 12× in relationship extraction.
Hyper-phrase queries are multi-phrase set queries that naturally arise when attempting to spot knowledge graph facts or subgraphs in large document collections. An example hyper-phrase query for the fact 〈mahatma gandhi, nominated for, nobel peace prize〉 is: 〈{mahatma gandhi, m k gandhi, gandhi}, {nominated, nominee, nomination received}, {nobel peace prize, nobel prize for peace, nobel prize in peace}〉. Efficient execution of hyper-phrase queries is of essence when attempting to verify and validate claims concerning named entities or emerging named entities. To do so, it is required that the fact concerning the entity can be contextualized in text. To acquire text regions given a hyper-phrase query, we propose a retrieval framework using combinations of n-gram and skip-gram indexes. Concretely, we model the combinatorial space of the phrases in the hyper-phrase query to be retrieved using vertical and horizontal operators and propose a dynamic programming approach for optimized query processing. We show that using our proposed optimizations we can retrieve sentences in support of knowledge graph facts and subgraphs from large document collections within seconds.
Querying Annotated Document Collections. Users often struggle to convey their information needs in short keyword queries. This often results in a series of query reformulations, in an attempt to find relevant documents. To assist users navigate large document collections and lead them to their information needs with ease, we propose methods that leverage semantic annotations. As a first step, we focus on temporal information needs. Specifically, we leverage temporal expressions in large document collections to serve time-sensitive queries better. Time-sensitive queries, e.g., summer olympics implicitly carry a temporal dimension for document retrieval. To help users explore longitudinal document collections, we propose a method that generates time intervals of interest as query reformulations. For instance, for the query world war, time intervals of interest are: [1914; 1918] and [1939;1945]. The generated time intervals are immediately useful in search-related tasks such as temporal query classification and temporal diversification of documents.
As a second and final step, we focus on helping the user in navigating large document collections by generating semantic aspects. The aspects are generated using semantic annotations in the form of temporal expressions, geographic locations, and other named entities. Concretely, we propose the xFactor algorithm that generates semantic aspects in two steps. In the first step, xFactor computes the salience of annotations in models informed of their semantics. Thus, the temporal expressions 1930s and 1939 are considered similar as well as entities such as usain bolt and justin gatlin are considered related when computing their salience. Second, the xFactor algorithm computes the co-occurrence salience of annotations belonging to different types by using an efficient partitioning procedure. For instance, the aspect 〈{usain bolt}, {beijing, London}, [2008;2012]〉 signifies that the entity, locations, and the time interval are observed frequently in isolation as well as together in the documents retrieved for the query olympic medalists.
Mining Annotated Document Collections. Large annotated document collections are a treasure trove of historical information concerning events and entities. In this regard, we first present EventMiner, a clustering algorithm, that mines events for keyword queries by using annotations in the form of temporal expressions, geographic locations, and other disambiguated named entities present in a pseudo-relevant set of documents. EventMiner aggregates the annotation evidences by mathematically modeling their semantics. Temporal expressions are modeled in an uncertainty and proximity-aware time model. Geographic locations are modeled as minimum bounding rectangles over their geographic co-ordinates. Other disambiguated named entities are modeled as a set of links corresponding to their Wikipedia articles. For a set of history-oriented queries concerning entities and events, we show that our approach can truly identify event clusters when compared to approaches that disregard annotation semantics.
Second and finally, we present jigsaw, an end-to-end query-driven system that generates structured tables for user-defined schema from unstructured text. To define the table schema, we describe query operators that help perform structured search on annotated text and fill in table cell values. To resolve table cell values whose values can not be retrieved, we describe methods for inferring null values using local context. jigsaw further relies on semantic models for text and numbers to link together near-duplicate rows. This way, jigsaw is able to piece together paraphrased, partial, and redundant text regions retrieved in response to structured queries to generate high-quality tables within seconds.
This doctoral dissertation was supervised by Klaus Berberich at the Max Planck Institute for Informatics and htw saar in Saarbrücken, Germany. This thesis is available online at: https://people.mpi-inf.mpg.de/~dhgupta/pub/dhruv-thesis.pdf.
  1. Search and analytics using semantic annotations

    Recommendations

    Comments

    Information & Contributors

    Information

    Published In

    cover image ACM SIGIR Forum
    ACM SIGIR Forum  Volume 53, Issue 2
    December 2019
    125 pages
    ISSN:0163-5840
    DOI:10.1145/3458553
    Issue’s Table of Contents
    Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 23 March 2021
    Published in SIGIR Volume 53, Issue 2

    Check for updates

    Qualifiers

    • Research-article

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • 0
      Total Citations
    • 33
      Total Downloads
    • Downloads (Last 12 months)3
    • Downloads (Last 6 weeks)0
    Reflects downloads up to 17 Jan 2025

    Other Metrics

    Citations

    View Options

    Login options

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media