ABSTRACT
Pattern discovery (also known as pattern mining) is a fundamental task in data science. Statistically significant pattern mining (SSPM) is the task of finding patterns in a database that occur statistically significantly more often in one class than in another. Existing SSPM formulations do not consider the weight of each item, whereas in real-world applications different items/objects differ in importance. Therefore, in this paper, we introduce the Weighted Statistically Significant Pattern Mining (WSSPM) problem and propose a novel algorithm, WSSpm, to solve it. We present a new framework that mines weighted statistically significant patterns by combining a weighted upper-bound model with multiple hypothesis testing. We also propose a new weighted support threshold that meets the requirements of WSSPM and prove its correctness and completeness. In addition, the weighted support threshold and a modified weighted upper bound effectively shrink the search space. Finally, experimental results on several real datasets show that the WSSpm algorithm performs well in terms of execution time and memory usage.
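To make the two ingredients named above concrete, the following is a minimal Python sketch and not the WSSpm algorithm itself. It assumes a common weighted-support definition from the weighted itemset mining literature (average item weight times relative frequency), uses a one-sided Fisher exact test as the significance test, and applies a plain Bonferroni correction. The function names, the toy data, and the number of tested hypotheses are hypothetical; the weighted support threshold and upper bound proposed in the paper are not reproduced here.

```python
from math import comb

def weighted_support(pattern, transactions, weights):
    """Assumed definition: average item weight of the pattern times its relative frequency."""
    pattern = set(pattern)
    count = sum(1 for items, _ in transactions if pattern <= items)
    avg_weight = sum(weights[i] for i in pattern) / len(pattern)
    return avg_weight * count / len(transactions)

def contingency(pattern, transactions):
    """2x2 counts (a, b, c, d): a/b = positive/negative transactions containing the
    pattern, c/d = positive/negative transactions without it."""
    pattern = set(pattern)
    a = b = c = d = 0
    for items, label in transactions:
        has = pattern <= items
        if has and label == 1:
            a += 1
        elif has:
            b += 1
        elif label == 1:
            c += 1
        else:
            d += 1
    return a, b, c, d

def fisher_one_sided(a, b, c, d):
    """P(X >= a) under the hypergeometric null with all margins fixed."""
    n = a + b + c + d
    with_pattern = a + b
    positives = a + c
    upper = min(with_pattern, positives)
    tail = sum(comb(positives, k) * comb(n - positives, with_pattern - k)
               for k in range(a, upper + 1))
    return tail / comb(n, with_pattern)

# Hypothetical toy data: (set of items, class label), with per-item weights.
transactions = [
    ({"a", "b"}, 1), ({"a", "b", "c"}, 1), ({"a", "b"}, 1), ({"a", "b", "c"}, 1),
    ({"a"}, 0), ({"c"}, 0), ({"b", "c"}, 0), ({"a", "c"}, 0),
]
weights = {"a": 0.5, "b": 1.0, "c": 0.2}

pattern = {"a", "b"}
wsup = weighted_support(pattern, transactions, weights)
p_value = fisher_one_sided(*contingency(pattern, transactions))

alpha = 0.05
num_tested = 3  # assumed number of candidate patterns actually tested
is_significant = p_value <= alpha / num_tested  # Bonferroni-corrected level
print(wsup, p_value, is_significant)
```

In this sketch the weighted support would serve only as a filter on which patterns are tested at all; the fewer patterns that pass, the smaller the Bonferroni correction factor, which is the general motivation for combining a support-style threshold with multiple hypothesis testing.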
Index Terms
- Weighted Statistically Significant Pattern Mining