PrefixFPM: a parallel framework for general-purpose mining of frequent and closed patterns

  • Special Issue Paper
The VLDB Journal

Abstract

A frequent pattern is a substructure that appears in a database with frequency (a.k.a. support) no less than a user-specified threshold, while a closed pattern is one that has no super-pattern with the same support. Here, a substructure can take different structural forms, such as itemsets, subsequences, subtrees, and subgraphs, and mining such substructures is important in many real applications such as product recommendation and feature extraction. Currently, there is no general programming framework that can be easily customized to mine different types of patterns, and existing parallel and distributed solutions are IO-bound, leaving CPU cores underutilized. Since mining frequent and/or closed patterns is NP-hard, it is important to fully utilize the available CPU cores. This paper presents such a general-purpose framework called PrefixFPM. The framework is based on the idea of prefix projection, which allows a divide-and-conquer mining paradigm. PrefixFPM exposes a unified programming interface that users can readily customize to mine their desired patterns. We have adapted the state-of-the-art serial algorithms for mining patterns including subsequences, subtrees, and subgraphs on top of PrefixFPM, and extensive experiments demonstrate an excellent speedup ratio of PrefixFPM with the number of CPU cores.
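To make the definitions above concrete, the following is a minimal serial sketch of prefix projection for the simplest case, sequences of single items, in the style of PrefixSpan. It is an illustration only and not PrefixFPM's programming interface: the function names (`mine_frequent_sequences`, `closed_patterns`, `is_subsequence`) and the toy database are assumptions introduced here.

```python
from collections import defaultdict


def mine_frequent_sequences(database, min_support):
    """Return {pattern: support} for every frequent subsequence.

    Hypothetical helper for illustration; `database` is a list of
    sequences (tuples of items), and support counts sequences.
    """
    results = {}

    def grow(prefix, projected):
        # `projected` holds (sequence_id, start) pairs: the suffix of each
        # sequence that follows the earliest match of `prefix`.
        support = defaultdict(set)
        for seq_id, start in projected:
            for item in database[seq_id][start:]:
                support[item].add(seq_id)

        for item in sorted(support):
            if len(support[item]) < min_support:
                continue  # infrequent extension: prune this branch
            pattern = prefix + (item,)
            results[pattern] = len(support[item])

            # Build the projected database of the extended prefix.
            child = []
            for seq_id, start in projected:
                seq = database[seq_id]
                for pos in range(start, len(seq)):
                    if seq[pos] == item:
                        child.append((seq_id, pos + 1))
                        break
            # Divide and conquer: the child depends only on `child`.
            grow(pattern, child)

    grow((), [(i, 0) for i in range(len(database))])
    return results


def is_subsequence(small, big):
    """True if `small` can be obtained from `big` by deleting items."""
    it = iter(big)
    return all(x in it for x in small)


def closed_patterns(frequent):
    """Keep patterns with no proper super-pattern of equal support."""
    return {
        p: s
        for p, s in frequent.items()
        if not any(
            len(q) > len(p) and t == s and is_subsequence(p, q)
            for q, t in frequent.items()
        )
    }


# Toy database (an assumption for illustration).
database = [("a", "b", "c"), ("a", "c"), ("a", "b", "c"), ("b", "c")]
frequent = mine_frequent_sequences(database, min_support=3)
closed = closed_patterns(frequent)
```

Each recursive call works only on its own projected database, so sub-problems for different prefixes are independent of one another. That independence is what makes the divide-and-conquer paradigm described in the abstract a natural fit for running on multiple CPU cores; the sketch itself stays serial.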

Notes

  1. https://github.com/wenwenQu/PrefixFPM

Acknowledgements

Da Yan and Guimu Guo were supported by NSF OAC-1755464 and NSF DGE-1723250. Guimu Guo acknowledges financial support from the Alabama Graduate Research Scholars Program (GRSP) funded through the Alabama Commission for Higher Education and administered by the Alabama EPSCoR. Wenwen Qu and Xiaoling Wang were supported by NSFC grants (No. 61972155), the Science and Technology Commission of Shanghai Municipality (20DZ1100300), and the Open Project Fund from Shenzhen Institute of Artificial Intelligence and Robotics for Society.

Author information

Corresponding authors

Correspondence to Da Yan or Wenwen Qu.

Cite this article

Yan, D., Qu, W., Guo, G. et al. PrefixFPM: a parallel framework for general-purpose mining of frequent and closed patterns. The VLDB Journal 31, 253–286 (2022). https://doi.org/10.1007/s00778-021-00687-0

  • DOI: https://doi.org/10.1007/s00778-021-00687-0
