PrefixFPM: a parallel framework for general-purpose mining of frequent and closed patterns

Yan, Da; Qu, Wenwen; Guo, Guimu; Wang, Xiaoling; Zhou, Yang

doi:10.1007/s00778-021-00687-0

Special Issue Paper
Published: 09 August 2021

Volume 31, pages 253–286, (2022)
The VLDB Journal

Da Yan¹ (ORCID: orcid.org/0000-0002-4653-0408), Wenwen Qu², Guimu Guo¹, Xiaoling Wang² & Yang Zhou³

Abstract

A frequent pattern is a substructure that appears in a database with a frequency (a.k.a. support) no less than a user-specified threshold, while a closed pattern is one that has no super-pattern with the same support. Here, a substructure can take different structural forms, such as itemsets, subsequences, subtrees, and subgraphs, and mining such substructures is important in many real applications such as product recommendation and feature extraction. Currently, there is no general programming framework that can be easily customized to mine different types of patterns, and existing parallel and distributed solutions are IO-bound, leaving CPU cores underutilized. Since mining frequent and/or closed patterns is NP-hard, it is important to fully utilize the available CPU cores. This paper presents such a general-purpose framework called PrefixFPM. The framework is based on the idea of prefix projection, which allows a divide-and-conquer mining paradigm. PrefixFPM exposes a unified programming interface that users can readily customize to mine their desired patterns. We have adapted the state-of-the-art serial algorithms for mining patterns including subsequences, subtrees, and subgraphs on top of PrefixFPM, and extensive experiments demonstrate an excellent speedup ratio of PrefixFPM with the number of CPU cores.
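To make the definitions above concrete, the following is a minimal serial sketch of prefix-projection mining for subsequence patterns, followed by a filter that keeps only closed patterns. It is not the PrefixFPM programming interface: the function names (mine_frequent, closed_only, is_subsequence) and the toy database are assumptions made for illustration, and each sequence element is assumed to be a single item.

```python
from collections import defaultdict

def mine_frequent(db, min_sup):
    """Return {pattern: support} for every frequent subsequence of db.

    db is a list of sequences (here: strings of single-character items).
    Hypothetical helper for illustration, not part of PrefixFPM.
    """
    results = {}

    def grow(prefix, projected):
        # projected holds one (seq_id, start) pair per sequence containing
        # prefix; db[seq_id][start:] is the suffix after its leftmost match.
        first_pos = defaultdict(dict)  # item -> {seq_id: first position}
        for sid, start in projected:
            seq = db[sid]
            for pos in range(start, len(seq)):
                first_pos[seq[pos]].setdefault(sid, pos)
        for item in sorted(first_pos):
            occ = first_pos[item]
            if len(occ) < min_sup:      # infrequent extension: prune
                continue
            pattern = prefix + (item,)
            results[pattern] = len(occ)  # support = number of sequences
            # Divide and conquer: the projected database of the longer
            # prefix is an independent subproblem.
            grow(pattern, [(sid, pos + 1) for sid, pos in occ.items()])

    grow((), [(sid, 0) for sid in range(len(db))])
    return results

def is_subsequence(small, big):
    """True if small can be obtained from big by deleting items."""
    it = iter(big)
    return all(x in it for x in small)

def closed_only(patterns):
    """Keep patterns that have no super-pattern with the same support."""
    return {
        p: s for p, s in patterns.items()
        if not any(len(q) > len(p) and t == s and is_subsequence(p, q)
                   for q, t in patterns.items())
    }

db = ["abc", "ac", "abc", "b"]          # toy database (assumption)
frequent = mine_frequent(db, min_sup=2)
closed = closed_only(frequent)
```

On this toy database with a support threshold of 2, the pattern ("a",) is frequent but not closed, because its super-pattern ("a", "c") has the same support of 3. Since every call to grow works only on its own projected database, the subproblems do not depend on one another, which is the property that makes a divide-and-conquer scheme of this kind a natural candidate for parallel execution.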

Notes

https://github.com/wenwenQu/PrefixFPM

Acknowledgements

Da Yan and Guimu Guo were supported by NSF OAC-1755464 and NSF DGE-1723250. Guimu Guo acknowledges financial support from the Alabama Graduate Research Scholars Program (GRSP) funded through the Alabama Commission for Higher Education and administered by the Alabama EPSCoR. Wenwen Qu and Xiaoling Wang were supported by NSFC grants (No. 61972155), the Science and Technology Commission of Shanghai Municipality (20DZ1100300), and the Open Project Fund from Shenzhen Institute of Artificial Intelligence and Robotics for Society.

Author information

Da Yan and Wenwen Qu are parallel first authors.

Authors and Affiliations

Department of Computer Science, The University of Alabama at Birmingham, Birmingham, AL, USA
Da Yan & Guimu Guo
Shanghai Key Laboratory of Trustworthy Computing, East China Normal University (ECNU), Shanghai, China
Wenwen Qu & Xiaoling Wang
Department of Computer Science and Software Engineering, Auburn University, Auburn, AL, USA
Yang Zhou

Corresponding authors

Correspondence to Da Yan or Wenwen Qu.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Yan, D., Qu, W., Guo, G. et al. PrefixFPM: a parallel framework for general-purpose mining of frequent and closed patterns. The VLDB Journal 31, 253–286 (2022). https://doi.org/10.1007/s00778-021-00687-0

Received: 30 August 2020
Revised: 27 April 2021
Accepted: 10 July 2021
Published: 09 August 2021
Issue Date: March 2022
DOI: https://doi.org/10.1007/s00778-021-00687-0
