DOI: 10.1145/3543873.3587586
research-article

Weighted Statistically Significant Pattern Mining

Published: 30 April 2023

Abstract

Pattern discovery (also known as pattern mining) is a fundamental task in data science. Statistically significant pattern mining (SSPM) is the task of finding patterns that occur statistically significantly more often in one class of a database than in another. Existing SSPM methods do not consider the weight of each item, although in the real world different items or objects differ in importance. Therefore, in this paper, we introduce the Weighted Statistically Significant Pattern Mining (WSSPM) problem and propose a novel algorithm, WSSpm, to solve it. We present a new framework that mines weighted statistically significant patterns by combining a weighted upper-bound model with multiple hypothesis testing. We also propose a new weighted support threshold that meets the requirements of WSSPM and prove its correctness and completeness. In addition, the weighted support threshold and the modified weighted upper bound effectively shrink the range of patterns that must be mined. Finally, experimental results on several real datasets show that the WSSpm algorithm performs well in terms of execution time and memory usage.
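To make the problem setting concrete, the following is a minimal brute-force sketch of weighted significant pattern mining on a two-class transaction database. It is not the WSSpm algorithm: the weighted support definition (mean item weight times relative support), the one-sided Fisher's exact test, the Bonferroni correction, and all function names and data are assumptions chosen for illustration, and the abstract does not specify which definitions the paper uses.

```python
from itertools import combinations
from math import comb


def fisher_p_value(a, n_pos, x, n):
    """One-sided Fisher's exact test (hypergeometric upper tail).

    Probability that at least `a` of the `x` transactions containing a
    pattern are positive, given `n_pos` positives among `n` transactions.
    """
    upper = min(x, n_pos)
    tail = sum(comb(n_pos, k) * comb(n - n_pos, x - k) for k in range(a, upper + 1))
    return tail / comb(n, x)


def weighted_support(pattern, transactions, weights):
    """Assumed definition: mean item weight times relative support."""
    support = sum(1 for items, _ in transactions if pattern <= items) / len(transactions)
    mean_weight = sum(weights[i] for i in pattern) / len(pattern)
    return mean_weight * support


def mine(transactions, weights, min_wsup, alpha=0.05, max_len=2):
    """Enumerate small itemsets, filter by weighted support, then test each one."""
    items = sorted({i for t, _ in transactions for i in t})
    n = len(transactions)
    n_pos = sum(label for _, label in transactions)

    # Step 1: keep only patterns whose weighted support reaches the threshold.
    candidates = [
        frozenset(c)
        for size in range(1, max_len + 1)
        for c in combinations(items, size)
        if weighted_support(frozenset(c), transactions, weights) >= min_wsup
    ]

    # Step 2: test each candidate, with a Bonferroni correction over the
    # number of candidates (a simple stand-in for the paper's procedure).
    significant = []
    for p in candidates:
        x = sum(1 for t, _ in transactions if p <= t)
        a = sum(1 for t, label in transactions if p <= t and label == 1)
        p_value = fisher_p_value(a, n_pos, x, n)
        if p_value <= alpha / len(candidates):
            significant.append((sorted(p), p_value))
    return significant


# Hypothetical toy data: (items, class label), five positive and five negative.
TRANSACTIONS = [
    ({"a", "b"}, 1), ({"a", "b"}, 1), ({"a", "b"}, 1), ({"a", "b"}, 1),
    ({"a", "b", "c"}, 1),
    ({"c"}, 0), ({"c"}, 0), ({"b", "c"}, 0), ({"c"}, 0), ({"c"}, 0),
]
WEIGHTS = {"a": 1.0, "b": 0.5, "c": 0.2}

if __name__ == "__main__":
    for pattern, p_value in mine(TRANSACTIONS, WEIGHTS, min_wsup=0.25):
        print(pattern, p_value)
```

In this toy run the weighted support filter leaves three candidates ({a}, {b}, {a, b}), so each p-value is compared against 0.05 / 3. Filtering before testing reduces the number of hypotheses and therefore loosens the corrected threshold, which is the general reason a support bound helps in significant pattern mining.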

Published In

WWW '23 Companion: Companion Proceedings of the ACM Web Conference 2023
April 2023
1567 pages
ISBN:9781450394192
DOI:10.1145/3543873

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. multiple hypothesis testing
  2. pattern mining
  3. significant pattern
  4. weighted pattern

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

WWW '23: The ACM Web Conference 2023
April 30 - May 4, 2023
Austin, TX, USA

Acceptance Rates

Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

Article Metrics

  • Total Citations: 0
  • Total Downloads: 85
  • Downloads (Last 12 months): 43
  • Downloads (Last 6 weeks): 11
Reflects downloads up to 17 Jan 2025