Weighted Statistically Significant Pattern Mining

Authors:

Guoting Chen

WWW '23 Companion: Companion Proceedings of the ACM Web Conference 2023

Pages 1276 - 1285

https://doi.org/10.1145/3543873.3587586

Published: 30 April 2023

Abstract

Pattern discovery (also known as pattern mining) is a fundamental task in data science. Statistically significant pattern mining (SSPM) is the task of finding patterns that occur statistically more often in the data of one class than in the data of another. Existing SSPM approaches do not consider the weight of each item, although in the real world different items or objects differ in importance. In this paper, we therefore introduce the Weighted Statistically Significant Pattern Mining (WSSPM) problem and propose a novel algorithm, WSSpm, to solve it. We present a new framework that mines weighted statistically significant patterns by combining a weighted upper-bound model with multiple hypothesis testing. We also propose a new weighted support threshold that meets the requirements of WSSPM, and we prove its correctness and completeness. In addition, the weighted support threshold and the modified weighted upper bound effectively shrink the search space. Finally, experimental results on several real datasets show that the WSSpm algorithm performs well in terms of execution time and memory usage.
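The general idea in the abstract, filtering patterns by a weighted support and then testing the survivors for class association under a multiple-testing correction, can be illustrated with a small Python sketch. This is not the WSSpm algorithm itself: the weighted support definition (mean item weight times relative frequency), the one-sided Fisher exact test, the plain Bonferroni correction, and all function names below are assumptions chosen for illustration, since the abstract does not specify them.

```python
from math import comb


def weighted_support(pattern, transactions, weights):
    """Assumed definition: mean item weight of the pattern times its relative frequency."""
    pattern = frozenset(pattern)
    freq = sum(1 for t in transactions if pattern <= t) / len(transactions)
    mean_weight = sum(weights[i] for i in pattern) / len(pattern)
    return mean_weight * freq


def fisher_one_sided(x_pos, x, n_pos, n):
    """P(X >= x_pos) under the hypergeometric null.

    The pattern occurs in x of n transactions; n_pos transactions are positive;
    x_pos of the occurrences fall in the positive class.
    """
    upper = min(x, n_pos)
    tail = sum(comb(n_pos, k) * comb(n - n_pos, x - k) for k in range(x_pos, upper + 1))
    return tail / comb(n, x)


def significant_patterns(candidates, data, weights, min_wsup, alpha=0.05):
    """Return (pattern, p_value) pairs that pass both filters.

    data is a list of (frozenset_of_items, label) pairs with label 1 for the
    positive class and 0 otherwise. Bonferroni is applied over the patterns
    that survive the weighted support filter (an illustrative choice).
    """
    transactions = [t for t, _ in data]
    n = len(data)
    n_pos = sum(label for _, label in data)
    kept = [frozenset(p) for p in candidates
            if weighted_support(p, transactions, weights) >= min_wsup]
    threshold = alpha / max(len(kept), 1)
    result = []
    for p in kept:
        x = sum(1 for t in transactions if p <= t)
        x_pos = sum(1 for t, label in data if label == 1 and p <= t)
        p_value = fisher_one_sided(x_pos, x, n_pos, n)
        if p_value <= threshold:
            result.append((p, p_value))
    return result


# Toy data: four positive and four negative transactions (made up for this sketch).
data = [
    (frozenset("ab"), 1), (frozenset("ab"), 1), (frozenset("ab"), 1), (frozenset("abc"), 1),
    (frozenset("c"), 0), (frozenset("c"), 0), (frozenset("bc"), 0), (frozenset("ac"), 0),
]
weights = {"a": 0.9, "b": 0.7, "c": 0.2}
found = significant_patterns([{"a", "b"}, {"c"}], data, weights, min_wsup=0.1)
```

In this toy run both candidates pass the weighted support filter, but only the pattern {a, b}, which appears exclusively in positive transactions, clears the corrected significance threshold. Because the support filter never looks at class labels, correcting only over the surviving patterns is a reasonable simplification here; the paper's own threshold and upper bound are likely more refined.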

Index Terms

Weighted Statistically Significant Pattern Mining
1. Computer systems organization
  1. Dependable and fault-tolerant systems and networks
    1. Redundancy
  2. Embedded and cyber-physical systems
    1. Embedded systems
    2. Robotics
2. Networks
  1. Network properties
    1. Network reliability

Published In

WWW '23 Companion: Companion Proceedings of the ACM Web Conference 2023

April 2023

1567 pages

ISBN:9781450394192

DOI:10.1145/3543873

Copyright © 2023 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

SIGWEB: ACM Special Interest Group on Hypertext, Hypermedia, and Web

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Funding Sources

Natural Science Foundation of Guangdong Province
Guangzhou Basic and Applied Basic Research Foundation
National Natural Science Foundation of China

Conference

WWW '23

Sponsor:

SIGWEB

WWW '23: The ACM Web Conference 2023

April 30 - May 4, 2023

Austin, TX, USA

Acceptance Rates

Overall Acceptance Rate 1,899 of 8,196 submissions, 23%
