Abstract
A frequent pattern is a substructure that appears in a database with frequency (also known as support) no less than a user-specified threshold, while a closed pattern is one that has no super-pattern with the same support. Here, a substructure can take different structural forms, such as itemsets, subsequences, subtrees, and subgraphs, and mining such substructures is important in many real applications such as product recommendation and feature extraction. Currently, there is no general programming framework that can be easily customized to mine different types of patterns, and existing parallel and distributed solutions are IO-bound, leaving CPU cores underutilized. Since mining frequent and/or closed patterns is NP-hard, it is important to fully utilize the available CPU cores. This paper presents such a general-purpose framework called PrefixFPM. The framework is based on the idea of prefix projection, which allows a divide-and-conquer mining paradigm. PrefixFPM exposes a unified programming interface to users, who can readily customize it to mine their desired patterns. We have adapted the state-of-the-art serial algorithms for mining patterns including subsequences, subtrees, and subgraphs on top of PrefixFPM, and extensive experiments demonstrate an excellent speedup ratio of PrefixFPM with the number of CPU cores.
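To make the notions of support, closedness, and prefix projection concrete, the following is a minimal serial sketch for the simplest case, sequences of single items. It is written in the style of PrefixSpan and is not PrefixFPM's actual interface: the function names `prefix_span`, `is_subsequence`, and `closed_only` are hypothetical, and the closedness check is a naive post-filter rather than one of the dedicated closed-pattern algorithms.

```python
from collections import defaultdict


def prefix_span(database, min_support):
    """Return {pattern: support} for every frequent subsequence.

    `database` is a list of sequences (tuples of hashable items).
    Illustrative sketch only; names and layout are assumptions.
    """
    results = {}

    def mine(prefix, projected):
        # `projected` is the projected database of `prefix`: a list of
        # (sequence index, offset) pairs marking where each suffix starts.
        first_pos = defaultdict(dict)  # item -> {sequence index: first position}
        for seq_idx, start in projected:
            seq = database[seq_idx]
            for pos in range(start, len(seq)):
                # Record only the first occurrence, so each sequence
                # contributes at most once to an item's support.
                first_pos[seq[pos]].setdefault(seq_idx, pos)

        for item in sorted(first_pos):
            occurrences = first_pos[item]
            if len(occurrences) < min_support:
                continue  # infrequent extension: prune this branch
            pattern = prefix + (item,)
            results[pattern] = len(occurrences)
            # Project on the extended prefix and recurse (divide and conquer).
            child = [(i, p + 1) for i, p in occurrences.items()]
            mine(pattern, child)

    mine((), [(i, 0) for i in range(len(database))])
    return results


def is_subsequence(small, big):
    """True if `small` can be obtained from `big` by deleting items."""
    it = iter(big)
    return all(x in it for x in small)


def closed_only(patterns):
    """Keep patterns with no frequent super-pattern of equal support."""
    return {
        p: s
        for p, s in patterns.items()
        if not any(
            s2 == s and len(q) > len(p) and is_subsequence(p, q)
            for q, s2 in patterns.items()
        )
    }
```

For the three sequences `("a","b","c")`, `("a","c")`, and `("b","c")` with a support threshold of 2, the frequent patterns are `a`, `b`, `c`, `ac`, and `bc`; of these, `a` and `b` are not closed because `ac` and `bc` have the same support. Each recursive call works only on its own projected database, which is what makes the subproblems independent and, in principle, suitable for assignment to different CPU cores.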
Acknowledgements
Da Yan and Guimu Guo were supported by NSF OAC-1755464 and NSF DGE-1723250. Guimu Guo acknowledges financial support from the Alabama Graduate Research Scholars Program (GRSP) funded through the Alabama Commission for Higher Education and administered by the Alabama EPSCoR. Wenwen Qu and Xiaoling Wang were supported by NSFC grants (No. 61972155), the Science and Technology Commission of Shanghai Municipality (20DZ1100300), and the Open Project Fund from Shenzhen Institute of Artificial Intelligence and Robotics for Society.
Cite this article
Yan, D., Qu, W., Guo, G. et al. PrefixFPM: a parallel framework for general-purpose mining of frequent and closed patterns. The VLDB Journal 31, 253–286 (2022). https://doi.org/10.1007/s00778-021-00687-0
DOI: https://doi.org/10.1007/s00778-021-00687-0