Elsevier

Future Generation Computer Systems

Volume 110, September 2020, Pages 849-863

Mining the frequency of time-constrained serial episodes over massive data sequences and streams

https://doi.org/10.1016/j.future.2019.11.008

Abstract

With the popularity and development of the Internet, telecommunication, industrial systems, etc., massive amounts of event sequences and streams have been and are being produced. These sequences and streams are generated at a fast pace, posing grand challenges in computation and analysis. On the one hand, due to the huge number of events, analyzing the sequences is time-consuming. On the other hand, as events in a stream do not necessarily arrive at a uniform speed, an effective computational model over the stream should be able to accommodate bursts of arriving events. In this work, we focus on frequency evaluation, which is one representative task in sequence and stream analysis. To address the challenges listed above, we present a one-pass algorithm, namely ONCE, which outputs a popularly used frequency from a given sequence. Moreover, we present a pair of advanced models, SparkONCE and StreamingONCE, both built on ONCE. With a series of non-trivial strategies carefully designed for Spark, SparkONCE and StreamingONCE exhibit superior performance with respect to ONCE. In particular, compared to ONCE, SparkONCE significantly improves efficiency on massive sequences, and StreamingONCE can effectively adapt to the uneven arrival speed of events in a stream. The experimental study on real-world and synthetic datasets demonstrates that the proposed approach works well on massive sequences and streams.

Introduction

With the development of cloud computing and big data analytics, it is possible to mine plenty of useful information from massive data sources in fields like telecommunication [1], neuroscience [2] and finance (stock exchange). Among all the analytical problems over sequential data, counting the frequency of a finite set of given serial episodes arises in many real applications in different fields [3], [4], [5]. For instance, in the securities market, the detection of securities fraud is a challenging task considering the massive amount of trading data produced every day. Insider trading, one category of deceptive practices, can be generalized as a serial pattern using a group of actions including offers and sales of securities [6]. Given a set of patterns/trends that are known to be fraudulent, automatic detection of fraudulent activities can be achieved by focusing on the deceptive patterns in the streaming trading sequence. Besides, in the field of bioinformatics, analyzing the frequency and distribution of a gene set of interest across whole-genome datasets can help find out in which tissues or cells the genes are co-expressed [7]. Many traditional methods have already been applied to handle this type of problem. However, there exist several grand challenges in addressing the problem with massive sequences or streams.

Traditional research in serial pattern mining targeted static sequences with a limited number of event signals [8]. Because of the large-scale coverage and massive usage of the Internet, a tremendous amount of data is generated every day. Processing this data with traditional algorithms alone incurs a high computational cost, and limited memory further reduces the effectiveness of these algorithms. Designing an efficient algorithm that can process massive amounts of data is therefore urgent. Moreover, in many situations we have to process streaming data that arrives continuously as it is produced. In stock trading, the time available to process a signal is extremely short, and processing delay has a significant impact on the trading volume in the market. Therefore, it is necessary to process this type of signal in a timely manner. Compared with static datasets, streaming data has many exclusive characteristics, such as continuity, expiration and infinity [8], [9], [10]. To process the data in a timely manner, the counting speed must be faster than the arrival speed [11].

To address frequency evaluation for serial episodes in massive sequences and streams, we further present the SparkONCE and StreamingONCE algorithms. SparkONCE counts the frequency of time-constrained serial episodes [12] over massive data in parallel on Spark [13]. StreamingONCE accomplishes the same task on streaming data. In summary, our contributions in this work are as follows:

  • We present a one-pass algorithm, namely ONCE, which outputs the non-overlapped frequency, a popularly used frequency, from a given sequence.

  • We present a novel algorithm, namely SparkONCE, to evaluate in parallel the frequencies of targeted time-constrained serial episodes in massive sequences. We design a new scheme that processes multiple segments in parallel and greatly enhances the processing speed for massive data.

  • We also present another algorithm, namely StreamingONCE, to output the frequencies of targeted time-constrained serial episodes in streaming data based on mini-batches.

  • We provide a theoretical proof showing that the proposed methods neither underestimate nor overestimate the frequencies of the targeted episodes.

  • Empirical studies of both SparkONCE and StreamingONCE demonstrate the effectiveness of our approach compared with a baseline.

The rest of this paper is organized as follows. In Section 2, we introduce related work. In Section 3, we introduce the preliminary definitions and problem statement. In Section 4, we present the basic version of our algorithm group, namely ONCE. Afterwards, in Section 5, we present a pair of variants for long-sequence and signal-intensive streaming scenarios, respectively. In Section 6, we conduct an empirical study over real-world and synthetic datasets. Finally, we conclude this work in Section 7.

Related work

Several types of sequential patterns have been extensively studied so far, including frequent (closed) sequential pattern mining [14], [15], [16], [17] and serial episode discovery [18], [19], [20]. However, in the era of big data, there will always be massive or streaming data to process, and these efforts cannot be deployed in some real-world applications: they either cannot process massive or streaming data at all, or they do so inefficiently.

In the field of serial episode mining over

Preliminaries

In this section, we present a series of preliminaries and corresponding concepts required for understanding our approach. In Table 1, we summarize the key notations that will be used in this paper.

Definition 1 Sequence Fragment

A sequence fragment is a part of a long (potentially infinite) sequence of events. Let Σ be a finite alphabet set and S a sequential list of events denoted by $S = \langle (s_1, t_1), \ldots, (s_n, t_n), \ldots \rangle$,

A one-pass frequency evaluation algorithm

First, we present a group of one-pass algorithms, ONCE [28], which can respectively compute two non-overlapped frequencies of given episodes satisfying a predefined time constraint.

In this section, we present in detail the algorithm for counting serial episodes in a streaming sequence under an arbitrary time constraint. Given a streaming event sequence and a target serial episode with an arbitrary time constraint, the ONCE algorithm generally works as follows. As each event in the stream passes by, we
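The snippet above breaks off before the procedure is stated, so the following is only a minimal Python sketch of one-pass, non-overlapped counting in general, not the authors' ONCE algorithm. It assumes that the time constraint is a maximum span (the last event of an occurrence lies at most `max_span` time units after the first) and that events arrive ordered by timestamp. The names `count_non_overlapped`, `episode` and `max_span` are hypothetical.

```python
from typing import Iterable, List, Optional, Sequence, Tuple

Event = Tuple[str, float]  # (event type, timestamp)


def count_non_overlapped(events: Iterable[Event],
                         episode: Sequence[str],
                         max_span: float) -> int:
    """Count non-overlapped occurrences of a serial episode in one pass.

    Assumption (not taken from the paper): an occurrence is valid when its
    last event is at most `max_span` time units after its first event.
    """
    k = len(episode)
    if k == 0:
        return 0
    # start[j] holds the latest start time of a partial occurrence that
    # matches the first j + 1 episode events, or None if there is none.
    start: List[Optional[float]] = [None] * k
    count = 0
    for symbol, t in events:
        # Visit positions from last to first so that a single event extends
        # each partial occurrence by at most one step.
        for j in range(k - 1, -1, -1):
            if symbol != episode[j]:
                continue
            if j == 0:
                start[0] = t
            elif start[j - 1] is not None and t - start[j - 1] <= max_span:
                start[j] = start[j - 1]
        if start[k - 1] is not None:
            # A full occurrence just ended: count it and discard all partial
            # matches so the next occurrence cannot overlap this one.
            count += 1
            start = [None] * k
    return count
```

Keeping only the latest start time per prefix bounds the memory by the episode length, and counting each occurrence at the earliest moment it completes is the usual greedy choice for maximizing the number of non-overlapped occurrences.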

SparkONCE and StreamingONCE

In this section, we present in detail the algorithms for serial episode frequency evaluation over massive and streaming data.
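The details of both algorithms are not included in this snippet. As an illustration of the mini-batch idea only, the Python sketch below keeps the partial-match state between batches so that an occurrence split across a batch boundary is still counted once. It is not the authors' StreamingONCE and does not use Spark. The class name `MiniBatchEpisodeCounter`, the method `process_batch`, and the maximum-span time constraint `max_span` are all assumptions made for this example.

```python
from typing import Iterable, List, Optional, Sequence, Tuple

Event = Tuple[str, float]  # (event type, timestamp)


class MiniBatchEpisodeCounter:
    """Count non-overlapped episode occurrences across mini-batches.

    Assumptions (not from the paper): batches arrive in time order, and an
    occurrence is valid when it spans at most `max_span` time units.
    """

    def __init__(self, episode: Sequence[str], max_span: float) -> None:
        if not episode:
            raise ValueError("episode must contain at least one event type")
        self.episode = list(episode)
        self.max_span = max_span
        self.count = 0
        # Latest start time of each partial match, carried between batches.
        self._start: List[Optional[float]] = [None] * len(self.episode)

    def process_batch(self, batch: Iterable[Event]) -> int:
        """Consume one mini-batch and return the running count."""
        k = len(self.episode)
        for symbol, t in batch:
            # Go from the last position to the first so one event advances
            # each partial match by at most one step.
            for j in range(k - 1, -1, -1):
                if symbol != self.episode[j]:
                    continue
                if j == 0:
                    self._start[0] = t
                else:
                    prev = self._start[j - 1]
                    if prev is not None and t - prev <= self.max_span:
                        self._start[j] = prev
            if self._start[k - 1] is not None:
                # Occurrence completed: count it and reset the state so
                # later occurrences do not overlap it.
                self.count += 1
                self._start = [None] * k
        return self.count
```

Because the only state carried over is one start time per episode position, the result does not depend on where the batch boundaries fall, which is the property a mini-batch model needs when events arrive at an uneven speed.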

Experimental results

In this part, we conduct an experimental study over both synthetic and real-world data. The real-world data is the streaming sequence of telecommunication alarms within 4 cities in Guizhou Province of China in the year 2014. We also create a synthetic dataset by randomly sampling an event at each timestamp. The statistics of all datasets are shown in Table 2. Notably, to evaluate the scalability of SparkONCE and StreamingONCE, we test on three synthetic datasets (i.e., Synthetic-Spark-1,

Conclusion

In this work, we focus on frequency evaluation, which is one representative task in sequence and stream analysis. In particular, we present a group of one-pass algorithms, which output a popularly used frequency from a given sequence. Depending on the practical scenario, different versions of the algorithm can be applied. In general scenarios with a limited sequence length or stream throughput, ONCE can be applied. In extremely long sequence scenarios, where tens of millions of event signals

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References (33)

  • Golmohammadi K. et al.

    Data mining applications for fraud detection in securities market

  • Saha I. et al.

    A web based nucleotide sequencing tool using blast algorithm

    Int. J. Biotech Trends Technol.

    (2016)
  • X. Fu, L. Shi, J. Li, Balanced parallel frequent pattern mining over massive data stream, in: Third IEEE International...
  • Kuehn E.

    Online analysis of dynamic streaming data

    (2018)
  • Ao X. et al.

    Mining precise-positioning episode rules from event sequences

  • Wen H. et al.

    Pargen: A parallel method for partitioning data stream applications in mobile edge computing

    IEEE Access

    (2018)

    Hui Li received the B.Eng. degree from Harbin Institute of Technology in 2005 and the Ph.D. degree from Nanyang Technological University, Singapore, in 2012. He is currently a Professor in the School of Cyber Engineering, Xidian University, China. His research interests include data mining, knowledge management and discovery, and privacy-preserving query and analysis in big data. He was nominated for the best paper award at SIGMOD 2015.

    Zhe Li received the B.S. degree from the School of Information Security at Sichuan University, China, in 2017. He is currently an M.S. student in the School of Cyber Engineering at Xidian University, China. His research interests include data mining and sequence mining.

    Sizhe Peng received the B.S. degree from the School of Software Engineering at Jilin University, China, in 2016. He is currently an M.S. student in the School of Cyber Engineering at Xidian University, China. His research interests include data mining and sequence mining.

    Jingjing Li received the B.S. degree from the School of Computer Science and Technology at Xidian University, China, in 2017. Currently, she is a Ph.D. student in the Department of Computer Science and Engineering at the Chinese University of Hong Kong. Her research interests include knowledge discovery and spatial–temporal data mining.

    Chia E. Tungom received the B.Eng. degree from Ningbo University in 2016. He is currently an M.Eng. student at Xidian University, China. His research interests include automatic feature engineering, sports-driven data mining and decision support systems.

    The work is supported by National Natural Science Foundation of China (No. 61672408 and 61972309), Fundamental Research Funds for the Central Universities, China (No. JB181505), Natural Science Basic Research Plan in Shaanxi Province of China (No. 2018JM6073) and China 111 Project (No. B16037).

    1 Both authors contributed equally to this work.
