Information Systems

Volume 71, November 2017, Pages 164-181
Efficiently answer top-k queries on typed intervals

https://doi.org/10.1016/j.is.2017.08.005

Highlights

  • Top-k queries on typed intervals are defined.

  • We build the interval tree in a compact way and propose a partition method.

  • Algorithms for non-continuous queries are proposed and the complexity is analyzed.

  • Algorithms for continuous queries are developed.

  • Extensive experiments using real and synthetic datasets are performed.

Abstract

Consider a database consisting of a set of tuples, each of which contains an interval, a type and a weight. These tuples are called typed intervals and used to support applications involving diverse intervals. In this paper, we study top-k queries on typed intervals. The query reports k intervals intersecting the query time, containing a particular type and having the largest weight. The query time can be a point or an interval. Further, we define top-k continuous queries that return qualified intervals at each time point during the query interval. To efficiently answer such queries, a key challenge is to build an index structure to manage typed intervals. Employing the standard interval tree, we build the structure in a compact way to reduce the I/O cost, and provide analytically derived partitioning methods to manage the data. Query algorithms are proposed to support point, interval and continuous queries. An auxiliary main-memory structure is developed to report continuous results. Using large real and synthetic datasets, extensive experiments are performed in a prototype database system to demonstrate the effectiveness, efficiency and scalability. The results show that our method significantly outperforms alternative methods in most settings.

Introduction

Intervals representing axis-parallel line segments have been widely used in a plethora of application domains. In temporal and multi-version databases, intervals are typically defined as transaction time and valid time ranges for recording changes (update, insertion or deletion), versions and the lifetime of objects [9], [11], [30], [32], [38]. In spatial and geographic information systems, intervals occur as line segments on a space-filling curve, e.g., modeling a printed circuit board [8], [19]. Intervals also play a pivotal role in constraint databases [26].

In the literature, a number of operators for querying intervals have been studied, such as intersecting [18], stabbing [5], splitters [30], and joins [13], [17]. This work differs from them by investigating top-k queries on intervals that are associated with types and weights. Recent advances in sensing technologies have made it easy to collect data with extensive information. In real applications, intervals of diverse types may be collected from different data sources. The system should be able to represent and manage the data for further queries. To the best of our knowledge, typed intervals have not been considered before.

In this paper, we investigate a database storing a set of tuples, each of which contains an interval consisting of start and end points, a type and a weight. Typed intervals enrich the data representation and support applications requiring different kinds of intervals, e.g., various genome intervals in genomics datasets, different versions of data items, and line segments categorized into several groups. For example, the website “www.booking.com” offers tourists various choices such as five-star hotels, apartments and motels, and room rates change over time (e.g., hot and cold seasons, weekdays and weekends). The system needs to manage a large number of typed intervals representing different hotel room rates. We study top-k queries on typed intervals in the paper. Formally, given a query time and a type, the system reports the k intervals fulfilling the following conditions: (i) intersecting the query time; (ii) containing the type; and (iii) having the largest weights. To help understand the problem, we give some application examples.

Example 1. Fig. 1 shows a running example. The database stores a set of computer science projects. Each tuple records the lifetime, the project category and the budget (weight). There are four types of projects in total: {AI, DB, DM, OS}. A top-k query is “return the top-1 DB project running at time 37”, denoted by Q(37, DB, 1). Three intervals intersect the time: {o1, o7, o8}. However, o1 is not reported because it is not a DB project. The system returns o7 as the result because its weight is larger than that of o8.
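The query semantics of Example 1 can be illustrated with a naive linear scan. This is only a baseline sketch, not the index proposed in the paper. It assumes closed intervals; the names `TypedInterval` and `topk_scan` are hypothetical, and all endpoints and weights are made up, except that o7 spans [35, 45] as stated later in the text.

```python
import heapq
from typing import NamedTuple


class TypedInterval(NamedTuple):
    name: str
    start: float
    end: float
    type: str
    weight: float


def intersects(o, q_start, q_end):
    # Closed intervals overlap when each one starts before the other ends.
    return o.start <= q_end and q_start <= o.end


def topk_scan(objects, q_start, q_end, q_type, k):
    """Return the k heaviest intervals of q_type that intersect [q_start, q_end].

    A point query is expressed by setting q_start == q_end.
    """
    candidates = (
        o for o in objects
        if o.type == q_type and intersects(o, q_start, q_end)
    )
    return heapq.nlargest(k, candidates, key=lambda o: o.weight)


# Hypothetical data, chosen only to be consistent with Example 1.
objects = [
    TypedInterval("o1", 30, 60, "AI", 90),
    TypedInterval("o7", 35, 45, "DB", 80),
    TypedInterval("o8", 15, 40, "DB", 50),
]

result = topk_scan(objects, 37, 37, "DB", 1)  # Q(37, DB, 1)
print([o.name for o in result])  # ['o7']
```

The scan touches every tuple, which is exactly the cost that an index combining the three predicates is meant to avoid.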

Example 2. In traffic monitoring systems, to analyze vehicles appearing in certain areas, we need to distinguish the different vehicles such as {Taxi, Bus, Truck, Car}. The database stores the number of vehicles appearing in a district. Such a value changes over time such that one district will have a sequence of typed intervals. Each tuple corresponds to a district and records the number of vehicles having the same type during a time interval. A top-k query is “return the district with the largest number of trucks at 7:00am”.

Such applications impose new challenges regarding indexing and querying intervals, in particular, supporting combined selections on different attributes. We define the query time by a point or an interval. The type predicate is included. Continuous queries are also studied to report top-k intervals at each time point during the query interval. This complicates the evaluation because the result changes at certain time points. Consider Qc([20, 50], DB, 1) by referring to Fig. 1. The query aims to return the maximum budget DB project at each time during [20, 50]. One can see that o7 is the DB interval with the maximum budget but is only valid during [35, 45]. The system will return o8 during [20, 35].
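To make the continuous semantics concrete, the following brute-force sketch splits the query interval at the endpoints of the qualifying intervals and evaluates the top-k result once per elementary piece. It is not the auxiliary structure developed in the paper. It assumes closed intervals and a query interval of positive length, and it ignores which interval wins exactly at a boundary instant. The function name `continuous_topk` and all data values except the span [35, 45] of o7 are hypothetical.

```python
import heapq


def continuous_topk(objects, q_start, q_end, q_type, k):
    """objects: iterable of (name, start, end, type, weight) tuples.

    Returns a list of (piece_start, piece_end, names) in time order.
    """
    cands = [
        o for o in objects
        if o[3] == q_type and o[1] <= q_end and q_start <= o[2]
    ]
    # The result can only change at endpoints that fall inside the query.
    points = {q_start, q_end}
    for _, s, e, _, _ in cands:
        for p in (s, e):
            if q_start < p < q_end:
                points.add(p)
    cuts = sorted(points)

    pieces = []
    for lo, hi in zip(cuts, cuts[1:]):
        mid = (lo + hi) / 2  # any interior point represents the whole piece
        alive = [o for o in cands if o[1] <= mid <= o[2]]
        top = [o[0] for o in heapq.nlargest(k, alive, key=lambda o: o[4])]
        if pieces and pieces[-1][2] == top:
            # Merge neighbouring pieces that share the same answer.
            pieces[-1] = (pieces[-1][0], hi, top)
        else:
            pieces.append((lo, hi, top))
    return pieces


# Hypothetical data, consistent with the description of Qc([20, 50], DB, 1).
objects = [
    ("o1", 30, 60, "AI", 90),
    ("o7", 35, 45, "DB", 80),
    ("o8", 15, 40, "DB", 50),
]
for piece in continuous_topk(objects, 20, 50, "DB", 1):
    print(piece)
```

With this made-up data the answer is o8 on [20, 35] and o7 on [35, 45]; the last piece, [45, 50], is empty only because the sample contains no DB interval there.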

To efficiently answer top-k queries, the key issue is to develop an index structure that can (i) efficiently index intervals for intersecting queries; (ii) manage different types well so that we can quickly find intervals with a particular type; and (iii) order intervals on weights to minimize the number of accessed intervals as only k intervals are reported. It is not difficult to achieve each of them individually, but supporting all of them together is a complex task. One can treat each condition as a predicate, and we need an access structure that allows a combined evaluation of the three predicates. This motivates us to develop an efficient index structure for typed intervals.

We choose Edelsbrunner’s interval tree [14] as the basic structure. This structure and its variants have been commonly used in existing works [4], [9], [18], [28], and the interval tree provides the primitive functionality needed in solving our problem. In principle, an interval tree is a binary tree that serves as the primary structure. Each node maintains a value called the split point and two lists of sorted intervals that intersect the split point, called the secondary structure. Intervals lying entirely to the left and to the right of the split point are stored in the left and right subtrees, respectively. The standard interval tree, however, does not manage typed intervals well, for the following reasons.

  • By observation, we find that a large number of nodes at the bottom level contain only a few intervals (sometimes only one). That is, we require many nodes but store only a small number of intervals in each.

  • The standard structure uses two sorted lists to maintain intervals. To determine intervals intersecting a query point (interval), we need to scan the list until the position after which intervals cannot be the result. The complexity is proportional to the number of intervals intersecting the query. Given a large dataset, too many intervals may be accessed, but the query only needs k intervals.

  • The standard structure is not capable of managing types. Therefore, the intervals are iteratively evaluated on the type condition, decreasing the query efficiency.
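For reference, the standard interval tree discussed above can be sketched as follows. This is a minimal main-memory version, assuming closed intervals given as (start, end) pairs and the median endpoint as the split point; the class name `IntervalTreeNode` is hypothetical. It has no notion of types or weights, which is the third shortcoming listed above.

```python
class IntervalTreeNode:
    def __init__(self, intervals):
        # intervals: non-empty list of (start, end) pairs with start < end.
        endpoints = sorted(p for iv in intervals for p in iv)
        self.split = endpoints[len(endpoints) // 2]
        here = [iv for iv in intervals if iv[0] <= self.split <= iv[1]]
        left = [iv for iv in intervals if iv[1] < self.split]
        right = [iv for iv in intervals if iv[0] > self.split]
        # Secondary structure: the same intervals in two sort orders.
        self.by_start = sorted(here, key=lambda iv: iv[0])
        self.by_end = sorted(here, key=lambda iv: iv[1], reverse=True)
        self.left = IntervalTreeNode(left) if left else None
        self.right = IntervalTreeNode(right) if right else None

    def stab(self, x):
        """Return all intervals that contain the point x."""
        out = []
        if x <= self.split:
            # Every stored interval ends at or after x; check starts only.
            for iv in self.by_start:
                if iv[0] > x:
                    break
                out.append(iv)
            child = self.left if x < self.split else None
        else:
            # Every stored interval starts before x; check ends only.
            for iv in self.by_end:
                if iv[1] < x:
                    break
                out.append(iv)
            child = self.right
        if child is not None:
            out.extend(child.stab(x))
        return out


intervals = [(1, 5), (3, 8), (6, 10), (12, 15), (2, 3)]
tree = IntervalTreeNode(intervals)
print(sorted(tree.stab(4)))  # [(1, 5), (3, 8)]
```

Each list scan stops at the first non-matching interval, so the work per node is proportional to the number of reported intervals, which is the cost the second bullet refers to when only k results are needed.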

To overcome these shortcomings, we build the structure in a compact way by defining a bound to determine the minimum number of intervals maintained in a node. If the number of intervals is less than the bound, we stop partitioning, since further partitioning would create additional lower-level nodes for only a few intervals, and use a single node to store them. A list is defined in the node to maintain the intervals. If the number of intervals is larger than the bound, we propose a new structure that replaces the sorted lists for interval management. The idea is that the interval data space in the node is partitioned into a set of equal-length slots. Two tables are defined in which each row corresponds to a slot and stores a list of intervals. One table maintains intervals containing the slots, called the full table, and the other maintains intervals intersecting (partially overlapping) the slots, called the partial table. Given a query, we calculate its slot and then access the tables to retrieve intervals. Since the intervals in the full table contain the slot, we can skip testing the intersection condition, reducing CPU time and I/O accesses. In contrast, intervals in the partial table have to be iteratively evaluated.
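The full and partial tables can be sketched for a single node as follows. This is an illustrative main-memory version only: the class name `SlotTables` and the parameter `num_slots` are hypothetical, intervals are assumed closed, the slot count is given by hand rather than derived analytically, and floating-point corner cases at slot boundaries are not treated carefully.

```python
class SlotTables:
    def __init__(self, intervals, num_slots):
        # intervals: non-empty list of (start, end, weight), start < end.
        self.lo = min(s for s, _, _ in intervals)
        self.hi = max(e for _, e, _ in intervals)
        self.n = num_slots
        self.length = (self.hi - self.lo) / num_slots
        self.full = [[] for _ in range(num_slots)]     # interval contains slot
        self.partial = [[] for _ in range(num_slots)]  # interval overlaps slot
        for iv in intervals:
            s, e, _ = iv
            for j in range(self.slot_of(s), self.slot_of(e) + 1):
                slot_lo = self.lo + j * self.length
                slot_hi = slot_lo + self.length
                if s <= slot_lo and slot_hi <= e:
                    self.full[j].append(iv)
                else:
                    self.partial[j].append(iv)

    def slot_of(self, x):
        # Index of the slot holding x, clamped to the valid range.
        j = int((x - self.lo) / self.length)
        return min(max(j, 0), self.n - 1)

    def stab(self, x):
        """Return all intervals in this node that contain the point x."""
        if x < self.lo or x > self.hi:
            return []
        j = self.slot_of(x)
        hits = list(self.full[j])  # no intersection test needed
        hits += [iv for iv in self.partial[j] if iv[0] <= x <= iv[1]]
        return hits


# Hypothetical intervals of one node: (start, end, weight).
intervals = [(0, 100, 5), (10, 35, 9), (22, 28, 7), (40, 90, 3)]
node = SlotTables(intervals, num_slots=10)
print(sorted(node.stab(25)))  # [(0, 100, 5), (10, 35, 9), (22, 28, 7)]
```

For the query point 25, two of the three results come from the full table without any endpoint comparison; only (22, 28, 7) is tested.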

Intuitively, the more intervals in the full table, the better the performance is. This is affected by the slot length. A short slot is more likely to be contained by intervals than a long slot. Therefore, it is possible to create more slots with small lengths to increase the number of intervals in the full table. However, this raises two issues. First, the storage overhead increases because an interval will be distributed in all slots that the interval contains. Second, if the query is an interval, a set of slots will be determined. We have to access tables for each slot to retrieve intervals. To sum up, short slots have a high probability of being contained by intervals, but increase the number of table accesses, incurring more I/O cost. This complicates the partitioning issue.
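The trade-off described above can be observed by counting table entries for different slot counts. The sketch below is illustrative only: the helper names (`slot_of`, `table_sizes`, `slots_touched`), the data domain [0, 100] and the intervals are all assumptions, and the counts say nothing about the analytically derived optimum in the paper.

```python
def slot_of(x, lo, hi, num_slots):
    # Index of the equal-length slot holding x, clamped to the valid range.
    length = (hi - lo) / num_slots
    return min(max(int((x - lo) / length), 0), num_slots - 1)


def table_sizes(intervals, lo, hi, num_slots):
    """Return (entries in the full table, entries in the partial table)."""
    length = (hi - lo) / num_slots
    full = partial = 0
    for s, e in intervals:
        first = slot_of(s, lo, hi, num_slots)
        last = slot_of(e, lo, hi, num_slots)
        for j in range(first, last + 1):
            slot_lo = lo + j * length
            if s <= slot_lo and slot_lo + length <= e:
                full += 1      # the interval contains the whole slot
            else:
                partial += 1   # the interval only overlaps the slot
    return full, partial


def slots_touched(q_start, q_end, lo, hi, num_slots):
    """Number of slots whose tables an interval query has to access."""
    return slot_of(q_end, lo, hi, num_slots) - slot_of(q_start, lo, hi, num_slots) + 1


# Hypothetical intervals in one node, over the data space [0, 100].
intervals = [(0, 100), (10, 35), (22, 28), (40, 90)]
for m in (2, 10, 20):
    print(m, table_sizes(intervals, 0, 100, m), slots_touched(20, 50, 0, 100, m))
```

On this sample, going from 2 to 20 slots raises the number of full-table entries from 2 to 35, and the query interval [20, 50] touches 7 slots instead of 2, which mirrors the storage and access costs named in the paragraph.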

We provide a thorough analysis of the partition strategy and analytically determine the slot length to perform an optimal partition. This enables the structure to be adaptive and self-adjusting because each node automatically determines the slot length according to the intervals in the node. The slot length differs from node to node, rather than being a single global value for all nodes. We build type indexes to efficiently find intervals according to types and order intervals on weights to avoid accessing intervals with small weights that cannot contribute to the result. We make the following contributions in the paper:

  • We formalize three kinds of top-k queries on typed intervals.

  • We build the interval tree in a compact way and propose a new secondary structure to efficiently manage intervals by performing an optimal partition and building type indexes. The index storage cost is analyzed. We also discuss how to update the structure for newly arriving intervals.

  • We develop efficient algorithms for point and interval queries. To optimize the query procedure, interval queries are converted to point queries. The query time complexity is analyzed.

  • For continuous queries, we develop an auxiliary structure to maintain top-k intervals at each time point and propose a heuristic to prune intervals in batch. The structure is general and can also be used to process continuous queries on standard intervals.

  • We implement the proposals in a prototype database system and conduct extensive experiments on large real and synthetic datasets to demonstrate the performance advantage of our method over alternative methods.

  • We discuss the generality of the method and its implementation and integration in a conventional database system.
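The combination of type indexes, weight ordering and a bounded candidate heap mentioned above can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's algorithm: each node is represented by a plain list of (start, end, type, weight) tuples with closed intervals, and the function names `build_type_index` and `scan_node` are hypothetical.

```python
import heapq


def build_type_index(intervals):
    """Group (start, end, type, weight) tuples by type, heaviest first."""
    index = {}
    for iv in intervals:
        index.setdefault(iv[2], []).append(iv)
    for group in index.values():
        group.sort(key=lambda iv: iv[3], reverse=True)
    return index


def scan_node(index, x, q_type, k, heap):
    """Update a min-heap of at most k (weight, interval) pairs with the
    intervals of one node that have type q_type and contain the point x."""
    for iv in index.get(q_type, []):
        if len(heap) == k and iv[3] <= heap[0][0]:
            break  # the rest of the list is no heavier: prune it
        if iv[0] <= x <= iv[1]:
            if len(heap) < k:
                heapq.heappush(heap, (iv[3], iv))
            else:
                heapq.heapreplace(heap, (iv[3], iv))
    return heap


# Two hypothetical nodes visited by a point query at x = 37 for type DB, k = 2.
node_a = build_type_index(
    [(0, 50, "DB", 40), (30, 60, "DB", 70), (10, 90, "AI", 99), (36, 38, "DB", 10)]
)
node_b = build_type_index(
    [(35, 45, "DB", 80), (15, 40, "DB", 50), (0, 20, "DB", 95)]
)

heap = []
for node in (node_a, node_b):
    scan_node(node, 37, "DB", 2, heap)
print(sorted((w for w, _ in heap), reverse=True))  # [80, 70]
```

Because each per-type list is sorted by weight, the scan of a node stops as soon as the next weight cannot beat the smallest weight currently in the heap, so low-weight intervals are never tested for intersection.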

The rest of the paper is organized as follows. Section 2 defines the problem and reviews the interval tree. Section 3 details the hybrid index for typed intervals. Section 4.2 presents the algorithms for top-k point and interval queries, and analyzes the time complexity. Section 5 addresses continuous queries. Section 6 reports the results of the experimental evaluation. Section 7 provides the discussion. Section 8 reviews the related work, followed by the conclusion in Section 9.

Section snippets

Problem definition

Let the database O be a set of objects, each of which represents a typed interval. Interval start and end points are defined in a real domain. The type domain is a set of positive integers, denoted by T, and the weight domain is a set of positive real numbers, denoted by R+.

Definition 2.1

Typed intervals

I = {(s, e) | s, e ∈ R, s < e}

O = {(i, t, w) | i ∈ I, t ∈ T, w ∈ R+}

We associate a type with each interval to enrich the data representation and support applications involving diverse intervals or

Motivation

According to the standard algorithm for creating an interval tree, the procedure keeps partitioning the interval set until no interval is left. By observation, we find that the number of intervals in a node decreases as the level of the node increases, assuming the level of the root node is 1. Usually, the interval count becomes a small value (sometimes only one) for the majority of nodes at the bottom level. Given a set of intervals, we store intervals intersecting the split point in a node

Point queries

To answer such queries, the root node, the query and a min-heap are taken as input. We start from the root node and perform the traversal in a top-down manner. A min-heap of size k is used to maintain candidates, and the heap is updated throughout the query procedure. If the accessed node maintains intervals by a list (the number of intervals is less than the bound), we iteratively test each interval. Otherwise, we compare the query point Q.x with the minimum and maximum endpoints in

Framework

We make use of the well-known filter-and-refinement strategy to answer top-k continuous queries. Specifically,

Filter. This step traverses the tree to find a set of candidates each of which intersects the query and contains the type. During the procedure, a set of slots will be determined for each accessed node and we search both full and partial tables for each slot.

Refinement. This step iteratively checks each candidate from the filter step. If the candidate belongs to top-k intervals, we put

Experimental evaluation

The evaluation is conducted on a standard PC (Intel(R) Core(TM) i7-4770 CPU, 3.4 GHz, 8 GB memory, 2 TB disk) running Ubuntu 14.04 (64 bits, kernel version 4.8.2-19). We develop all index structures and query algorithms (including alternative methods) in C/C++ and integrate the implementation into the extensible database system Secondo [21].

Discussion

We discuss how to leverage our structure to support a wide range of queries on typed intervals and make a comparison with spatial indexes that are already built in most database systems.

Queries on intervals

Queries on interval data are defined based on the primitive interval relationships. Interval join [20] is studied as a specific join operation in temporal databases. A stabbing-max query [3] returns the interval that contains the query point and has the maximum weight. The sequenced semantics [11], [12] provides a relational algebra solution to support outer joins, antijoins and aggregations with predicates and functions over interval timestamped data. The binary interval search [29] counts the

Conclusion and future research

In this paper, we study top-k queries on typed intervals. Based on the standard interval tree, a new structure is developed to capture intervals, types, and weights in a hybrid presentation, and partition the interval domain in a query-efficient manner. Employing the proposed structure, query algorithms are developed for static (point and interval) and continuous queries that ask for data objects with qualified intervals and types as well as top weights. Extensive experiments using both real

Acknowledgment

This work is supported by the Fundamental Research Funds for the Central Universities, No. NZ2013306.

References (40)

  • C. Böhm et al. Xz-ordering: a space-filling curve for objects with spatial extension. Advances in Spatial Databases, 6th International Symposium (1999)

  • T. Bozkaya et al. Indexing valid time intervals. Database and Expert Systems Applications, 9th International Conference, DEXA (1998)

  • J.V. den Bercken et al. An evaluation of generic bulk loading techniques. VLDB (2001)

  • A. Dignös et al. Temporal alignment. ACM SIGMOD (2012)

  • A. Dignös et al. Query time scaling of attribute values in interval timestamped databases. IEEE ICDE (2013)

  • A. Dignös et al. Overlap interval partition join. ACM SIGMOD (2014)

  • H. Edelsbrunner. Dynamic Data Structures for Orthogonal Intersection Queries. Technical Report (1980)

  • H. Edelsbrunner. A new approach to rectangles intersections, parts I and II. Int. J. Comput. Math. (1983)

  • R. Elmasri et al. The time index: an access structure for temporal data. VLDB (1990)

  • J. Enderle et al. Joining interval data in relational databases. ACM SIGMOD (2004)