Efficiently answering top-k frequent term queries in temporal-categorical range
Introduction
Analysis of text data plays an increasingly important role in various applications, such as hot topic detection [6] and emerging news extraction [5]. In recent years, considerable effort has been devoted to text data mining. In practice, text data mining is more than a single mining step: it is a combined process of data selection, transformation, summarization, and so on. To obtain satisfactory results, users often have to select a range, inspect statistics over it (e.g., TF*IDF and TF*PDF [3]), and adjust their queries. Such adjustments occur frequently. The fundamental challenge is how to accelerate range queries over a text database.
Text cubes generally have multiple dimensions such as time, category [20], topic [36] and space [19], which usually have a hierarchical structure and form a data cube lattice [22], [27]. Most text data sets use category and time as the basic index dimensions to support text tasks such as news retrieval and browsing, and topic detection and tracking. This paper studies how to efficiently compute the top-k frequent terms within a two-dimensional range of time and category. In applications such as hot topic extraction and emerging news detection, users typically adjust the time and category ranges repeatedly until they find hot terms and topics that satisfy them. Other important applications include communication and sensor networks and pair trading, where users search for top-k pairs via continuous top-k queries over sliding windows [35], [25], [30], and social media, where users receive up-to-date subscriptions via top-k publish/subscribe queries [33], [7]. It is therefore of practical importance to investigate an efficient two-dimensional range query method.
In the category hierarchy, each category node may be broken down further into subcategories. We therefore model the hierarchy with category ranges over the leaf nodes. Fig. 1 shows the hierarchical structure of Sports. A category is a tree node and also the root of its subtree, which forms a range of subcategories on the leaf nodes. For instance, Football forms a range covering La Liga (Spanish football league), Premier League (English football league) and 2018 World Cup. When an event happens, such as the opening of the World Cup, users usually ask for the hot terms in related categories to find interesting news. Example queries for hot terms are: given a category, Football = {La Liga, Premier League, 2018 World Cup}, and a time range, 2018-01-01 to 2018-02-01, find the top 10 hot terms; given a category, Tennis = , a time range, 2018-05-01 to 2018-05-05, find the top 5 hot terms. The time range and the subtree category range together form a 2-D range that must be updated dynamically according to users' preferences. This causes repeated computation of term frequencies, so traditional approaches such as [6], [5], [12], LDA [2], clustering [23], and TF*PDF [3] cannot support on-line queries.
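The leaf-range modelling described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the tree contents beyond the Football example, the function names `leaf_ranges` and `naive_top_k`, and the `counts[t][c]` layout are all assumptions. The naive query re-sums every cell in the range, which is the repeated computation the paper aims to avoid.

```python
from collections import Counter

# Hypothetical category tree modelled on the Sports example.
# The Tennis leaf is invented purely for illustration.
TREE = {
    "Sports": ["Football", "Tennis"],
    "Football": ["La Liga", "Premier League", "2018 World Cup"],
    "Tennis": ["Tennis Open (hypothetical)"],
}


def leaf_ranges(tree, root):
    """Number the leaves in depth-first order and map every node
    to the (first, last) leaf index covered by its subtree."""
    ranges = {}
    next_leaf = 0

    def visit(node):
        nonlocal next_leaf
        children = tree.get(node, [])
        if not children:
            ranges[node] = (next_leaf, next_leaf)
            next_leaf += 1
            return
        first = next_leaf
        for child in children:
            visit(child)
        ranges[node] = (first, next_leaf - 1)

    visit(root)
    return ranges


def naive_top_k(counts, t_lo, t_hi, c_lo, c_hi, k):
    """counts[t][c] is a Counter of term frequencies for time slot t
    and leaf category c. Every cell in the range is summed again on
    each call, so the cost grows with the size of the range."""
    total = Counter()
    for t in range(t_lo, t_hi + 1):
        for c in range(c_lo, c_hi + 1):
            total.update(counts[t][c])
    return total.most_common(k)
```

With this numbering, a query on Football becomes a query on the contiguous leaf range returned by `leaf_ranges(TREE, "Sports")["Football"]`, combined with a range of time-slot indices.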
Answering top-k queries efficiently is a basic and important task in various applications, such as subscription systems [9], [8] and data streaming [16]. Several approaches have been proposed to improve the performance of retrieving top-k frequent terms. Fagin's Algorithm [14] and a family of threshold algorithms such as [15], [21], [26] were proposed to answer top-k queries over a set of lists. However, when the query ranges change frequently, these approaches are not efficient enough to meet users' demands for dynamic on-line query processing, because the efficiency of TA-family algorithms depends on the number of lists involved in the range query. To address this weakness, we propose a Prefix Cube based data structure and an optimized Threshold Algorithm (PCTA). Together they significantly reduce the time cost of extracting hot terms and make dynamic on-line queries feasible. The advantages of PCTA include: (a) Efficiency: based on the idea of the prefix cube, PCTA optimizes threshold algorithms to significantly speed up the search for top-k frequent terms, so that on-line queries can be answered quickly. (b) Storage scalability: the data structure can be maintained incrementally by adding data for new time slots or categories.
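To make the dependence on the number of lists concrete, here is a minimal sketch of a generic TA-style search with sum aggregation over non-negative counts. It is a textbook-style illustration rather than the optimized algorithm proposed in the paper; the function name `threshold_top_k` and the use of plain dictionaries as lists are assumptions.

```python
import heapq


def threshold_top_k(lists, k):
    """Top-k terms by summed frequency over several term -> count dicts.

    Assumes all counts are non-negative. Each round performs one sorted
    access per list and, for every newly seen term, one random access
    per list, so the work per round grows with the number of lists.
    """
    # Sorted access order: each list by descending count.
    ordered = [sorted(d.items(), key=lambda kv: -kv[1]) for d in lists]
    seen = {}
    max_depth = max((len(o) for o in ordered), default=0)
    for depth in range(max_depth):
        threshold = 0
        for o in ordered:
            if depth < len(o):
                term, count = o[depth]
                threshold += count
                if term not in seen:
                    # Random access: fetch the term's count in every list.
                    seen[term] = sum(d.get(term, 0) for d in lists)
        best = heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])
        # No unseen term can exceed the threshold, so stop once the
        # k-th best seen total reaches it.
        if len(best) == k and best[-1][1] >= threshold:
            return best
    return heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])
```

In a 2-D range query without any precomputation, every (time slot, leaf category) cell contributes one such list, which is why wide ranges are expensive for plain TA-family algorithms.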
The main contributions of this paper are as follows:
- We develop an effective data structure based on the idea of prefix cube at the dimensions of time and category.
- We propose an efficient range query algorithm to speed up the operation of calculating top-k frequent terms in two-dimensional ranges based on the proposed data structure and TA-style algorithms.
- We present an efficient algorithm to answer complex queries, which can meanwhile efficiently handle the incremental update of the proposed data structure.
- We conduct comprehensive experiments to verify the performance of the structure and algorithm.
The rest of this paper is organized as follows. Section 2 summarizes the related work. Section 3 introduces basic concepts and the problem definition of the Top-k Frequent Terms Range Query. Sections 4 and 5 present detailed procedures for, and analysis of, the proposed data structure and the PCTA algorithm for range queries. Section 6 studies methods for updating the proposed data structure. In Section 7, we evaluate the time and storage scalability of the proposed data structure and the PCTA query algorithm. Finally, Section 8 concludes this paper.
Section snippets
Related work
The literature related to this work can be grouped into (1) Spatio-temporal queries, (2) Range queries and (3) Top-k queries on multiple lists.
Spatio-Temporal Queries. Some studies on spatio-temporal data sets are similar to our work. Van et al. [32] have provided a distributed index structure and parallel processing methods for top-k spatio-temporal term queries. Yang et al. [34] have indexed geographic text stream data for processing spatial keyword queries. Chen et al. [10] have used a
Problem definition
This section provides the problem definition. The important symbols used in this paper are listed in Table 1.
Definition 1 A time range is defined as , and means the i-th time slot, which is the minimal unit of time in the corpus. Particularly refers to the whole time range in the data corpus, where is the number of time slots in a corpus and .
Example 1 Assuming the whole time range in the corpus is from 2016-01-01 to 2016-01-31, a time range is from 2016-01-01 to
PCTA algorithm
In this section, we first propose a prefix cube (PC) based data structure called and PCTA range query algorithm based on to significantly reduce the number of lists in a given range when executing TA-style algorithms. We next provide methods for the maintenance of , i.e. update operation to implement the data storage scalability.
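The general prefix-sum idea behind a prefix cube can be sketched as follows. This is only an illustration under stated assumptions, not the paper's data structure: the names `build_prefix_cube`, `range_counts` and `top_k`, the `counts[t][c]` layout, and the choice to combine the four cells exactly (instead of running a TA-style search over them) are all hypothetical. The point it shows is that any 2-D range is covered by four prefix cells, however many time slots and categories it spans.

```python
from collections import Counter


def build_prefix_cube(counts):
    """counts[t][c] maps term -> frequency for time slot t and leaf c.

    prefix[t][c] holds term counts summed over all time slots < t and
    all leaf categories < c (2-D inclusive prefix sums, shifted by one).
    """
    n_t, n_c = len(counts), len(counts[0])
    prefix = [[Counter() for _ in range(n_c + 1)] for _ in range(n_t + 1)]
    for t in range(n_t):
        for c in range(n_c):
            cell = Counter(counts[t][c])
            cell.update(prefix[t][c + 1])
            cell.update(prefix[t + 1][c])
            cell.subtract(prefix[t][c])
            prefix[t + 1][c + 1] = cell
    return prefix


def range_counts(prefix, t_lo, t_hi, c_lo, c_hi):
    """Term counts in the inclusive range, using four cells only
    (inclusion-exclusion): two added and two subtracted."""
    total = Counter(prefix[t_hi + 1][c_hi + 1])
    total.subtract(prefix[t_lo][c_hi + 1])
    total.subtract(prefix[t_hi + 1][c_lo])
    total.update(prefix[t_lo][c_lo])
    return total


def top_k(prefix, t_lo, t_hi, c_lo, c_hi, k):
    """Top-k terms in the range; zero-count terms are dropped."""
    total = range_counts(prefix, t_lo, t_hi, c_lo, c_hi)
    return [(term, n) for term, n in total.most_common(k) if n > 0]
```

The trade-off in this sketch is storage: each prefix cell holds counts for every term seen so far, in exchange for a query cost that no longer depends on the width of the range.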
PCTA in complex ranges
In the previous section, we introduced the definition of PC and the algorithm for executing top-k frequent term queries in a simple range r. In other cases, however, users may want to find the top frequent terms in more complex ranges. In this section, we design an algorithm to answer complex range queries.
Maintenance of
Considering the nature of news corpus, the main tasks for the maintenance of include:
- 1.
Addition of time slots. The addition of time slots is the most common maintenance operation, since news articles are published every day.
- 2.
Addition of categories. The addition of categories may occur when a new category is added as a subcategory of an existing category, or as an independent category.
- 3.
Partition of categories. A category may sometimes be divided into several subcategories, when it is too broad to
Experiments
We evaluate the effectiveness and efficiency of PCTA on different datasets, and compare PCTA, in terms of efficiency and scalability, with approaches that solve related top-k query problems. We also report the effectiveness of using the PCTA algorithm to extract hot terms according to the theory proposed in [6], and analyze the trends of the extracted hot terms. In addition, we apply our data structure to the algorithm in [4] to extract emerging terms and news.
Conclusions
This paper proposes an efficient data structure called based on the principle of Prefix Cube. We implement an optimized TA-based two-dimensional range query algorithm named PCTA for both simple and complex ranges. It can speed up the backend computation of term frequencies and support on-line queries for top-k frequent terms. In addition, comprehensive experiments, including searching top-k frequent terms and extracting hot terms and emerging topics, are conducted to evaluate the efficiency,
CRediT authorship contribution statement
Zhenying He: Conceptualization, Methodology, Writing - original draft. Lu Wang: Methodology, Software. Chang Lu: Conceptualization, Methodology, Writing - original draft. Yinan Jing: Supervision. Kai Zhang: Supervision. Weili Han: Supervision. Jianxin Li: Conceptualization, Writing - review & editing. Chengfei Liu: Writing - review & editing. X. Sean Wang: Supervision.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This work was mainly supported by the National Natural Science Foundation of China under Grant Nos. 61732004, 62072113, 61872207 and U1836207. This research was also partially supported by the ARC Linkage Project under Grant No. LP180100750, and ARC Discovery Projects under Grant No. DP200103700.
References (37)
- et al., Covering polygons is hard, Journal of Algorithms (1994)
- Combining Fuzzy Information from Multiple Systems, Journal of Computer & System Sciences (1999)
- et al., Optimal aggregation algorithms for middleware, Journal of Computer & System Sciences (2003)
- et al., Building a contextual dimension for OLAP using textual data from social networks, Expert Systems with Applications (2018)
- P. Ahmed, M. Hasan, A. Kashyap, V. Hristidis, V.J. Tsotras, Efficient computation of top-k frequent terms over...
- et al., Latent Dirichlet allocation, Journal of Machine Learning Research (2003)
- K.K. Bun, M. Ishizuka, Topic extraction from news archive using tf*pdf algorithm, in: Proceedings of the Third...
- et al., Personalized emerging topic detection based on a term aging model, ACM Transactions on Intelligent Systems and Technology (TIST) (2013)
- et al., Emerging topic detection on Twitter based on temporal and social terms evaluation
- et al., Hot topic extraction based on timeline analysis and multidimensional sentence modeling, IEEE Transactions on Knowledge and Data Engineering (2007)
- Temporal spatial-keyword top-k publish/subscribe
- Approximate spatio-temporal top-k publish/subscribe, World Wide Web
- Top-k term publish/subscribe for geo-textual data streams, VLDB Journal
- Spatio-temporal top-k term search over sliding window, World Wide Web
- Online topic detection and tracking of financial news based on hierarchical clustering
- Efficient online top-k retrieval with arbitrary similarity measures
- Top-k frequent term queries on streaming data
- Continuous range-based skyline queries in road networks, World Wide Web
Cited by (1)
An Efficient Distributed Spatiotemporal Index for Parallel Top-k Frequent Terms Query
2022, Proceedings - 2022 IEEE International Conference on Big Data and Smart Computing, BigComp 2022