Mining top-k sequential patterns in transaction database graphs

A new challenging problem and a sampling-based approach

World Wide Web

Abstract

In many real-world networks, a vertex is associated with a transaction database that comprehensively describes the behaviour of the vertex. A typical example is a social network, where the behaviour of every user is captured by a transaction database that stores her daily posted content. Specifically, a transaction database is a collection of transactions, where each transaction corresponds to a tweet. Each transaction consists of a set of items, where each item may correspond to a keyword or a video clip contained in the tweet. To model this type of scenario, we propose the novel notion of the transaction database graph, where each vertex is associated with a transaction database. Every path of the graph is a sequence of vertices that induces multiple sequences of transactions. The sequences of transactions induced by all of the paths in the graph form an extremely large sequence database. Finding frequent sequential patterns in such a sequence database discovers interesting subsequences that appear frequently in many paths of the network. Our goal is to find the top-k frequent sequential patterns in the sequence database induced from a transaction database graph. This is challenging because the sequence database induced by a transaction database graph is too large to be explicitly induced and stored, and finding the top-k frequent sequential patterns is #P-hard. To tackle this problem, we propose an efficient two-step sampling algorithm that approximates the top-k frequent sequential patterns with a provable quality guarantee. Extensive experimental results on synthetic and real-world data sets demonstrate the effectiveness and efficiency of our method.
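To make the setting concrete, the following is a minimal Python sketch of how a path in a transaction database graph might induce sequences of transactions, and how the frequency of one sequential pattern could be estimated by sampling. It is an illustration only, not the two-step sampling algorithm of the paper. The toy graph, the choice of simple paths with a fixed number of vertices, the definition of frequency as the fraction of induced sequences containing the pattern, and all function names are assumptions made for this sketch. The sampler enumerates all paths, which is only feasible on a toy graph.

```python
import itertools
import math
import random

# Toy data (hypothetical): a directed graph and one transaction database per vertex.
GRAPH = {"a": ["b"], "b": ["c"], "c": []}
DB = {
    "a": [frozenset({"x"}), frozenset({"y"})],
    "b": [frozenset({"x", "z"})],
    "c": [frozenset({"z"}), frozenset({"w"})],
}

def paths_of_length(graph, length):
    """Enumerate all simple paths with exactly `length` vertices."""
    def extend(path):
        if len(path) == length:
            yield tuple(path)
            return
        for nxt in graph.get(path[-1], []):
            if nxt not in path:
                yield from extend(path + [nxt])
    for start in graph:
        yield from extend([start])

def induced_sequences(path, db):
    """Pick one transaction per vertex along the path (Cartesian product)."""
    return itertools.product(*(db[v] for v in path))

def contains(sequence, pattern):
    """True if each itemset of `pattern` is included, in order, in `sequence`."""
    it = iter(sequence)
    return all(any(p <= t for t in it) for p in pattern)

def exact_frequency(graph, db, pattern, length):
    """Fraction of all induced sequences that contain `pattern`."""
    total = hits = 0
    for path in paths_of_length(graph, length):
        for seq in induced_sequences(path, db):
            total += 1
            hits += contains(seq, pattern)
    return hits / total if total else 0.0

def sampled_frequency(graph, db, pattern, length, n_samples, seed=0):
    """Estimate the same fraction from uniformly sampled induced sequences."""
    rng = random.Random(seed)
    paths = list(paths_of_length(graph, length))
    # Weight each path by how many sequences it induces, so that every
    # induced sequence is drawn with equal probability.
    weights = [math.prod(len(db[v]) for v in p) for p in paths]
    hits = 0
    for _ in range(n_samples):
        path = rng.choices(paths, weights=weights)[0]
        seq = tuple(rng.choice(db[v]) for v in path)
        hits += contains(seq, pattern)
    return hits / n_samples

PATTERN = (frozenset({"x"}), frozenset({"z"}))
exact = exact_frequency(GRAPH, DB, PATTERN, length=2)
estimate = sampled_frequency(GRAPH, DB, PATTERN, length=2, n_samples=2000)
```

On this toy graph the two paths with two vertices induce four sequences in total, and two of them contain the pattern, so the exact frequency is 0.5. The sampled estimate approaches that value as the number of samples grows. A top-k miner would additionally have to rank candidate patterns by such estimates, which this sketch does not attempt.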

Notes

  1. http://www.almaden.ibm.com/cs/quest/syndata.html

  2. https://openflights.org/data.html

Acknowledgments

This work was supported in part by the National Key Research and Development Program of China (No. 2017YFB0803301), the Natural Science Foundation of China (No. U1836215), the DongGuan Innovative Research Team Program (No. 201636000100038), and the 111 Project (No. B18008).

Author information

Corresponding author

Correspondence to Xi Zhang.

About this article

Cite this article

Lei, M., Chu, L., Wang, Z. et al. Mining top-k sequential patterns in transaction database graphs. World Wide Web 23, 103–130 (2020). https://doi.org/10.1007/s11280-019-00686-w

  • DOI: https://doi.org/10.1007/s11280-019-00686-w
