Mining top-k sequential patterns in transaction database graphs

Lei, Mingtao; Chu, Lingyang; Wang, Zhefeng; Pei, Jian; He, Caifeng; Zhang, Xi; Fang, Binxing

doi:10.1007/s11280-019-00686-w

A new challenging problem and a sampling-based approach

Published: 03 May 2019

World Wide Web, Volume 23, pages 103–130 (2020)

Mingtao Lei¹, Lingyang Chu², Zhefeng Wang³, Jian Pei², Caifeng He⁴, Xi Zhang¹ & Binxing Fang¹

Abstract

In many real-world networks, a vertex is associated with a transaction database that describes the behaviour of the vertex. A typical example is a social network, where the behaviour of each user is captured by a transaction database that stores the content she posts each day. Specifically, a transaction database is a collection of transactions, where each transaction corresponds to a tweet. Each transaction consists of a set of items, where each item may correspond to a keyword or a video clip contained in the tweet. To model this type of scenario, we propose the novel notion of a transaction database graph, where each vertex is associated with a transaction database. Every path of the graph is a sequence of vertices that induces multiple sequences of transactions. The sequences of transactions induced by all of the paths in the graph form an extremely large sequence database. Finding frequent sequential patterns in such a sequence database discovers interesting subsequences that appear frequently in many paths of the network. Our goal is to find the top-k frequent sequential patterns in the sequence database induced by a transaction database graph. However, this is challenging because the sequence database induced by a transaction database graph is too large to be explicitly constructed and stored, and finding the top-k frequent sequential patterns is #P-hard. To tackle this problem, we propose an efficient two-step sampling algorithm that approximates the top-k frequent sequential patterns with a provable quality guarantee. Extensive experiments on synthetic and real-world data sets demonstrate the effectiveness and efficiency of our method.
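
The data model described in the abstract can be illustrated with a small Python sketch. It is not the authors' algorithm and carries no quality guarantee: the toy graph, the hand-listed paths, the restriction to length-2 single-item patterns, and the names `DATABASES`, `PATHS`, `induced_sequences`, `patterns_in` and `sample_top_k` are all assumptions made for illustration. The sketch shows how one path induces several transaction sequences, and how pattern frequencies might be estimated from sampled sequences without building the whole induced sequence database. The two sampling steps in the code (pick a path, then pick one transaction per vertex) are a simplification and need not match the two steps of the published method.

```python
import random
from collections import Counter
from itertools import combinations, product

# Hypothetical toy data: each vertex owns a transaction database,
# i.e. a list of transactions, each transaction being a set of items.
DATABASES = {
    "u1": [frozenset({"a", "b"}), frozenset({"c"})],
    "u2": [frozenset({"a"}), frozenset({"b", "c"})],
    "u3": [frozenset({"c"})],
}
# Hypothetical paths of the graph, enumerated by hand for this example.
PATHS = [("u1", "u2"), ("u2", "u3"), ("u1", "u2", "u3")]


def induced_sequences(path):
    """Every transaction sequence induced by a path: one transaction per vertex."""
    return list(product(*(DATABASES[v] for v in path)))


def patterns_in(sequence):
    """Length-2 single-item patterns (x, y) where x occurs in an earlier
    transaction of the sequence than y."""
    found = set()
    for earlier, later in combinations(sequence, 2):
        found.update(product(earlier, later))
    return found


def sample_top_k(k, n_samples, rng):
    """Estimate pattern frequencies from sampled sequences only."""
    counts = Counter()
    for _ in range(n_samples):
        path = rng.choice(PATHS)                             # sample a path
        sequence = [rng.choice(DATABASES[v]) for v in path]  # sample one induced sequence
        counts.update(patterns_in(sequence))                 # count each pattern once per sample
    return [(p, c / n_samples) for p, c in counts.most_common(k)]
```

The path `("u1", "u2", "u3")` induces 2 × 2 × 1 = 4 transaction sequences, which hints at how quickly the induced database grows with path length and database size. A call such as `sample_top_k(3, 1000, random.Random(0))` returns three `(pattern, estimated frequency)` pairs. Because paths are drawn uniformly here, the estimate is not uniform over induced sequences; a real estimator would have to account for that.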

Acknowledgments

This work was supported in part by the National Key Research and Development Program of China (No. 2017YFB0803301), the Natural Science Foundation of China (No. U1836215), DongGuan Innovative Research Team Program (No. 201636000100038), and the 111 Project (No. B18008).

Author information

Authors and Affiliations

Key Laboratory of Trustworthy Distributed Computing and Service (BUPT), Ministry of Education, Beijing University of Posts and Telecommunications, Beijing, China
Mingtao Lei, Xi Zhang & Binxing Fang
Simon Fraser University, Burnaby, Canada
Lingyang Chu & Jian Pei
University of Science and Technology of China, Hefei, China
Zhefeng Wang
Noah’s Ark Laboratory, Huawei Technologies, Shenzhen, China
Caifeng He

Corresponding author

Correspondence to Xi Zhang.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Lei, M., Chu, L., Wang, Z. et al. Mining top-k sequential patterns in transaction database graphs. World Wide Web 23, 103–130 (2020). https://doi.org/10.1007/s11280-019-00686-w

Received: 25 September 2018
Revised: 15 February 2019
Accepted: 25 April 2019
Published: 03 May 2019
Issue Date: January 2020
DOI: https://doi.org/10.1007/s11280-019-00686-w
