Elsevier

Information Sciences

Volume 574, October 2021, Pages 238-258
Efficiently answering top-k frequent term queries in temporal-categorical range

https://doi.org/10.1016/j.ins.2021.05.081

Abstract

When extracting hot topics and detecting emerging topics, counting term frequency is one of the most unavoidable and time-consuming steps. During text exploration, users may change the query range frequently, and each adjustment forces term frequencies to be recalculated when finding hot terms, which incurs an unacceptable time cost. In addition, updating dimensions in real time is also a challenge. To address these problems, we first propose a novel data structure based on the prefix cube to store terms and their frequencies, which significantly reduces the time needed to count term frequency. Based on this data structure, we propose an efficient range query algorithm that significantly decreases the number of input word lists involved in top-k queries. To handle updates of the underlying dimensions, we also design an efficient maintenance mechanism that copes with different kinds of dimension updates. Finally, we conduct comprehensive experiments to validate the effectiveness of the proposed structure and the efficiency of the optimized query algorithm. We also show that, with the proposed data structure, the time cost of our algorithms for hot topic extraction and emerging topic detection is reduced by roughly a factor of ten compared with previous algorithms.

Introduction

Analysis of text data plays an increasingly important role in various applications, such as hot topic detection [6] and emerging news extraction [5]. In recent years, much effort has been devoted to text data mining. In practice, text data mining goes beyond a single data mining step. It is a combined process of data selection, transformation, summarization, and so on. To obtain satisfactory results, users often have to select a range, check statistical data (e.g., TF*IDF and TF*PDF [3]), and adjust their queries. Such adjustments occur frequently. The fundamental challenge lies in how to provide an efficient solution for accelerating range queries over text databases.

Text cubes generally have multiple dimensions such as time, category [20], topic [36] and space [19], which usually have a hierarchical structure and form a data cube lattice [22], [27]. Most text data use category and time as the basic index dimensions, to support text tasks such as news retrieval and browsing, and topic tracking and detection. This paper studies how to efficiently find the top-k frequent terms within a two-dimensional range of time and category. In applications such as hot topic extraction and emerging news detection, users usually adjust the time and category ranges continuously until they find hot terms and topics that satisfy them. Other important applications include communication and sensor networks and pair trading, where users search for top-k pairs via continuous top-k queries over sliding windows [35], [25], [30], and social media, where users receive up-to-date subscriptions via top-k publish/subscribe queries [33], [7]. Therefore, an efficient two-dimensional range query method is of practical importance.

In the hierarchical structure of categories, each category node may be broken down further into subcategories. Therefore, we use the category range on leaf nodes to model such a hierarchy. Fig. 1 shows a hierarchical structure of Sports. A category is a tree node as well as the root of its subtree, which forms a range of subcategories on leaf nodes. Note that Football forms a range of La Liga (Spanish football league), Premier League (English football league) and 2018 World Cup. When an event happens, such as the opening of the World Cup, users usually ask for the hot terms in related categories to find interesting news. For example, queries that search for hot terms can be described as follows: given a category, Football = {La Liga, Premier League, 2018 World Cup}, and a time range, 2018-01-01 to 2018-02-01, find the top 10 hot terms; given a category, Tennis = {Men_us, Women_us, Men_au, Women_au}, and a time range, 2018-05-01 to 2018-05-05, find the top 5 hot terms. The time range and the subtree category range form a 2-D range that must be updated dynamically according to users' preferences. This causes repeated computation of term frequencies and makes traditional approaches such as those in [6], [5], [12], LDA [2], clustering [23], and TF*PDF [3] unable to provide on-line queries.
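A minimal sketch of how a category tree can be flattened so that every category becomes a contiguous range of leaf indices. The tree reuses the category names from the example above; the exact shape of Fig. 1, the dictionary layout, and the helper name `leaf_ranges` are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical category tree (the real Fig. 1 may contain more nodes).
# A category that does not appear as a key is treated as a leaf.
TREE: Dict[str, List[str]] = {
    "Sports": ["Football", "Tennis"],
    "Football": ["La Liga", "Premier League", "2018 World Cup"],
    "Tennis": ["Men_us", "Women_us", "Men_au", "Women_au"],
}


def leaf_ranges(tree: Dict[str, List[str]], root: str) -> Dict[str, Tuple[int, int]]:
    """Map each category to the (first, last) 1-based leaf index it covers."""
    ranges: Dict[str, Tuple[int, int]] = {}
    next_leaf = 1

    def visit(node: str) -> Tuple[int, int]:
        nonlocal next_leaf
        children = tree.get(node, [])
        if not children:
            # A leaf covers exactly one index.
            ranges[node] = (next_leaf, next_leaf)
            next_leaf += 1
        else:
            # An inner node covers everything from its first to its last child.
            spans = [visit(child) for child in children]
            ranges[node] = (spans[0][0], spans[-1][1])
        return ranges[node]

    visit(root)
    return ranges


RANGES = leaf_ranges(TREE, "Sports")
print(RANGES["Football"], RANGES["Tennis"])
```

Under this numbering, the Football query becomes the 2-D range formed by leaf categories 1 to 3 and the time slots between 2018-01-01 and 2018-02-01.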

Answering top-k queries efficiently is important and fundamental in various applications, such as subscription systems [9], [8] and data streaming [16]. Several approaches have been proposed to improve the performance of retrieving top-k frequent terms. Fagin's Algorithm [14] and a family of threshold algorithms such as [15], [21], [26] were proposed to answer top-k queries over a set of lists. However, when the query ranges change frequently, these approaches are not efficient enough to meet users' demands for dynamic on-line query processing. The reason is that the efficiency of the algorithms in the TA family depends on the number of lists involved in the range query. To address these weaknesses, we propose a Prefix Cube based data structure and an optimized Threshold Algorithm (PCTA). They significantly reduce the time cost of extracting hot terms and make dynamic on-line queries possible for users. The advantages of PCTA include: (a) Efficiency: based on the idea of the prefix cube, PCTA optimizes threshold algorithms to significantly speed up the search for top-k frequent terms, so that on-line queries can be answered quickly. (b) Storage Scalability: the data structure can be maintained incrementally by adding data for a new time slot or category.
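To make the dependence on the number of lists concrete, here is a minimal sketch of a textbook threshold algorithm with summation as the aggregate. It is not the PCTA algorithm itself; the function name `threshold_topk`, the dictionary-based lists, and the sample frequencies are assumptions made for illustration. Each round performs one sorted access per list, so the work per round grows with the number of lists.

```python
from typing import Dict, List, Tuple


def _top(totals: Dict[str, int], k: int) -> List[Tuple[str, int]]:
    # Highest total first; ties broken alphabetically for determinism.
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


def threshold_topk(lists: List[Dict[str, int]], k: int) -> List[Tuple[str, int]]:
    """Top-k terms by summed frequency over several term->frequency lists."""
    if k <= 0:
        return []
    # Sorted access needs each list ordered by descending frequency.
    sorted_lists = [
        sorted(lst.items(), key=lambda kv: (-kv[1], kv[0])) for lst in lists
    ]
    totals: Dict[str, int] = {}
    max_depth = max((len(s) for s in sorted_lists), default=0)
    for depth in range(max_depth):
        threshold = 0  # upper bound on the total of any term not yet seen
        for s in sorted_lists:
            if depth < len(s):
                term, freq = s[depth]
                threshold += freq
                if term not in totals:
                    # Random access: look the term up in every list.
                    totals[term] = sum(lst.get(term, 0) for lst in lists)
        best = _top(totals, k)
        if len(best) == k and best[-1][1] >= threshold:
            return best  # no unseen term can beat the current k-th total
    return _top(totals, k)


# Hypothetical per-cell term lists, e.g. one per (time slot, category) cell.
CELLS = [
    {"goal": 9, "cup": 7, "coach": 2},
    {"cup": 8, "transfer": 5, "goal": 3},
    {"goal": 6, "transfer": 4, "cup": 1},
]
RESULT = threshold_topk(CELLS, 2)
print(RESULT)
```

In this example the algorithm stops after two rounds, without ever reading the entry for "coach".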

The main contributions of this paper are as follows:

  • We develop an effective data structure based on the idea of prefix cube at the dimensions of time and category.

  • We propose an efficient range query algorithm to speed up the operation of calculating top-k frequent terms in two-dimensional ranges based on the proposed data structure and TA-style algorithms.

  • We present an efficient algorithm to answer complex queries, which can meanwhile efficiently handle the incremental update of the proposed data structure.

  • We conduct comprehensive experiments to verify the performance of the structure and algorithm.

The rest of this paper is organized as follows. Section 2 summarizes the related work. Section 3 introduces basic concepts and the problem definition of the Top-k Frequent Terms Range Query. Sections 4 and 5 provide detailed procedures and analysis of the proposed data structure and the PCTA algorithm for range queries. Section 6 studies the update methods of the proposed data structure. In Section 7, we evaluate the time and storage scalability of the proposed data structure and the PCTA query algorithm. Finally, Section 8 concludes this paper.

Section snippets

Related work

The literature related to this work can be grouped into (1) Spatio-temporal queries, (2) Range queries and (3) Top-k queries on multiple lists.

Spatio-Temporal Queries. Some studies on spatio-temporal data sets are similar to our work. Van et al. [32] have provided a distributed index structure and parallel processing methods for a top-k spatio-temporal terms query. Yang et al. [34] have indexed geographic text stream data for processing spatial keyword queries. Chen et al. [10] have used a

Problem definition

This section provides the problem definition. The important symbols used in this paper are listed in Table 1.

Definition 1

Time Range, T

A time range is defined as $T_{\underline{y},y} = \{t_{\underline{y}}, t_{\underline{y}+1}, \ldots, t_y\}$, and $t_i$ means the $i$-th time slot, which is the minimal unit of time in the corpus. Particularly, $T_{1,N_t}$ refers to the whole time range in the data corpus, where $N_t$ is the number of time slots in a corpus and $1 \le \underline{y} \le y \le N_t$.

Example 1

Assuming the whole time range in the corpus is from 2016-01-01 to 2016-01-31, a time range $T_{1,5}$ is from 2016-01-01 to

PCTA algorithm

In this section, we first propose a prefix cube based data structure, called PC, and a range query algorithm built on it, called PCTA, which significantly reduces the number of lists in a given range when executing TA-style algorithms. We next provide methods for maintaining PC, i.e., update operations that make the data storage scalable.
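The excerpt does not show the actual layout of PC, so the following is only a sketch of the generic two-dimensional prefix-sum idea that the name suggests. Here `prefix[t][c]` is assumed to hold the term frequencies aggregated over time slots 1..t and leaf categories 1..c, so any rectangular range can be answered from four stored lists by inclusion-exclusion, however many cells the range covers. The function names and sample data are hypothetical, and the final ranking uses a plain sort, not the TA-style processing of PCTA.

```python
from collections import Counter
from typing import List, Tuple

Cube = List[List[Counter]]


def build_prefix_cube(cells: Cube) -> Cube:
    """cells[t-1][c-1] holds the term counts of time slot t and leaf category c."""
    n_t, n_c = len(cells), len(cells[0])
    # Row 0 and column 0 stay empty so that the index t1-1 or c1-1 is always valid.
    prefix = [[Counter() for _ in range(n_c + 1)] for _ in range(n_t + 1)]
    for t in range(1, n_t + 1):
        for c in range(1, n_c + 1):
            acc = Counter(cells[t - 1][c - 1])
            acc.update(prefix[t - 1][c])
            acc.update(prefix[t][c - 1])
            acc.subtract(prefix[t - 1][c - 1])  # this part was added twice
            prefix[t][c] = acc
    return prefix


def range_counts(prefix: Cube, t1: int, t2: int, c1: int, c2: int) -> Counter:
    """Term counts inside slots t1..t2 and categories c1..c2 (1-based, inclusive)."""
    total = Counter(prefix[t2][c2])
    total.subtract(prefix[t1 - 1][c2])
    total.subtract(prefix[t2][c1 - 1])
    total.update(prefix[t1 - 1][c1 - 1])
    return Counter({term: n for term, n in total.items() if n > 0})


def top_k(counts: Counter, k: int) -> List[Tuple[str, int]]:
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


# Hypothetical corpus: 3 time slots x 2 leaf categories.
CELLS = [
    [Counter(goal=2, cup=1), Counter(serve=3)],
    [Counter(goal=1), Counter(serve=1, ace=2)],
    [Counter(cup=4), Counter(ace=1)],
]
PREFIX = build_prefix_cube(CELLS)
HOT = top_k(range_counts(PREFIX, 2, 3, 1, 2), 2)
print(HOT)
```

The query above covers four cells, and a naive approach would feed all four cell lists into the top-k step. With the prefix layout the number of lists read stays at four even when the range grows to hundreds of cells.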

PCTA in complex ranges

In the previous section, we introduced the definition of PC and the algorithm that executes top-k frequent term queries in a simple range r. In other cases, however, users may want to find the top frequent terms in more complex ranges. In this section, we design an algorithm to answer complex range queries.

Maintenance of PC

Considering the nature of news corpus, the main tasks for the maintenance of PC include:

  • 1. Addition of time slots. The addition of time slots is the most common maintenance operation, since news articles are published every day.

  • 2. Addition of categories. The addition of categories may occur when a new category is added as a subcategory of an existing category, or as an independent category.

  • 3. Partition of categories. A category may sometimes be divided into subcategories when it is too broad to
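As an illustration of the first task, the sketch below appends one time slot to a generic two-dimensional prefix-sum layout, in which `prefix[t][c]` aggregates time slots 1..t and leaf categories 1..c (row 0 and column 0 are empty). This layout and the function name `append_time_slot` are assumptions; the paper's own update procedure is not shown in this excerpt. The point is that only one new row is computed and existing rows are left untouched.

```python
from collections import Counter
from typing import List


def append_time_slot(prefix: List[List[Counter]], new_slot: List[Counter]) -> None:
    """Append the prefix row for one new time slot, in place.

    new_slot[c-1] holds the term counts of leaf category c in the new slot.
    """
    n_c = len(prefix[0]) - 1
    if len(new_slot) != n_c:
        raise ValueError("new slot must provide one Counter per leaf category")
    prev_row = prefix[-1]
    new_row = [Counter()]  # column 0 stays empty
    running = Counter()    # counts of the new slot over categories 1..c
    for c in range(1, n_c + 1):
        running.update(new_slot[c - 1])
        cell = Counter(prev_row[c])  # everything up to the previous slot
        cell.update(running)         # plus the new slot's contribution
        new_row.append(cell)
    prefix.append(new_row)


# Hypothetical prefix structure after one time slot with two leaf categories.
PREFIX = [
    [Counter(), Counter(), Counter()],
    [Counter(), Counter(goal=2), Counter(goal=2, serve=3)],
]
append_time_slot(PREFIX, [Counter(goal=1), Counter(ace=2)])
print(PREFIX[2][2])
```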

Experiments

We evaluate the effectiveness and efficiency of PCTA on different datasets, and compare PCTA with approaches that solve related top-k query problems in terms of efficiency and scalability. Furthermore, we report the effectiveness of using the PCTA algorithm to extract hot terms according to the theory proposed in [6], and analyze the trends of the extracted hot terms. In addition, we apply our data structure to the algorithm in [4] to extract emerging terms and news.

Conclusions

This paper proposes an efficient data structure called PC, based on the principle of the Prefix Cube. We implement an optimized TA-based two-dimensional range query algorithm named PCTA for both simple and complex ranges. It improves the backend computation of term frequencies and supports on-line queries for top-k frequent terms. In addition, comprehensive experiments, including searching for top-k frequent terms and extracting hot terms and emerging topics, are conducted to evaluate the efficiency,

CRediT authorship contribution statement

Zhenying He: Conceptualization, Methodology, Writing - original draft. Lu Wang: Methodology, Software. Chang Lu: Conceptualization, Methodology, Writing - original draft. Yinan Jing: Supervision. Kai Zhang: Supervision. Weili Han: Supervision. Jianxin Li: Conceptualization, Writing - review & editing. Chengfei Liu: Writing - review & editing. X. Sean Wang: Supervision.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was mainly supported by the National Natural Science Foundation of China under Grant Nos. 61732004, 62072113, 61872207 and U1836207. This research was also partially supported by the ARC Linkage Project under Grant No. LP180100750, and ARC Discovery Projects under Grant No. DP200103700.

References (37)

  • J.C. Culberson et al., Covering polygons is hard, Journal of Algorithms (1994)
  • R. Fagin, Combining Fuzzy Information from Multiple Systems, Journal of Computer & System Sciences (1999)
  • R. Fagin et al., Optimal aggregation algorithms for middleware, Journal of Computer & System Sciences (2003)
  • K. Gutiérrez-Batista et al., Building a contextual dimension for olap using textual data from social networks, Expert Systems with Applications (2018)
  • P. Ahmed, M. Hasan, A. Kashyap, V. Hristidis, V.J. Tsotras, Efficient computation of top-k frequent terms over...
  • D.M. Blei et al., Latent dirichlet allocation, Journal of Machine Learning Research (2003)
  • K.K. Bun, M. Ishizuka, Topic extraction from news archive using tf*pdf algorithm, in: Proceedings of the Third...
  • M. Cataldi et al., Personalized emerging topic detection based on a term aging model, ACM Transactions on Intelligent Systems and Technology (TIST) (2013)
  • M. Cataldi et al., Emerging topic detection on twitter based on temporal and social terms evaluation
  • K. Chen et al., Hot topic extraction based on timeline analysis and multidimensional sentence modeling, IEEE Transactions on Knowledge and Data Engineering (2007)
  • L. Chen et al., Temporal spatial-keyword top-k publish/subscribe
  • L. Chen et al., Approximate spatio-temporal top-k publish/subscribe, World Wide Web (2019)
  • L. Chen et al., Top-k term publish/subscribe for geo-textual data streams, VLDB Journal (2020)
  • L. Chen et al., Spatio-temporal top-k term search over sliding window, World Wide Web (2019)
  • X.Y. Dai et al., Online topic detection and tracking of financial news based on hierarchical clustering
  • P.M. Deshpande et al., Efficient online top-k retrieval with arbitrary similarity measures
  • S. Farazi et al., Top-k frequent term queries on streaming data
  • X. Fu et al., Continuous range-based skyline queries in road networks, World Wide Web (2017)