Abstract
Subgroup discovery is a well-known technique for the extraction of patterns, with respect to a variable of interest in the data. However, the explosion in data gathering has hampered the performance of traditional algorithms to discover interesting relationships between different objects in a set with respect to a specific property which is of interest to the user. In this regard, our goal is to propose a set of efficient techniques to mine subgroups on Big Data by means of Apache Spark. On this matter, AprioriK-SD-OE and PFP-SD-OE are proposed as fast exhaustive search algorithms to discover subgroups on Big Data using Apache Spark. The experimental study includes more than 70 datasets considering search spaces bigger than \(10^{15}\) subgroups. The scalability of our proposals are analyzed by considering datasets with 200 million of instances demonstrating the usefulness of using Spark to tackle Big Data.
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs13748-017-0112-x/MediaObjects/13748_2017_112_Fig1_HTML.gif)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs13748-017-0112-x/MediaObjects/13748_2017_112_Fig2_HTML.gif)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs13748-017-0112-x/MediaObjects/13748_2017_112_Fig3_HTML.gif)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs13748-017-0112-x/MediaObjects/13748_2017_112_Fig4_HTML.gif)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs13748-017-0112-x/MediaObjects/13748_2017_112_Fig5_HTML.gif)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs13748-017-0112-x/MediaObjects/13748_2017_112_Fig6_HTML.gif)
Similar content being viewed by others
Explore related subjects
Discover the latest articles, news and stories from top researchers in related subjects.References
Wu, X., Zhu, X., Wu, G.Q., Ding, W.: Data mining with big data. IEEE Trans. Knowl. Data Eng. 26(1), 97–107 (2014)
Han, J., Kamber, M.: Data Mining: Concepts and Techniques. Morgan Kaufmann, San Francisco (2011)
Herrera, F., Carmona, C.J., González, P., Jesus, M.J.: An overview on subgroup discovery: foundations and applications. Knowl. Inf. Syst. 29(3), 495–525 (2010)
Ventura, S., Luna, J.M.: Pattern Mining with Evolutionary Algorithms. Springer, Berlin (2016)
Agrawal, R., Imieliński, T., Swami, A.: Mining association rules between sets of items in large databases. SIGMOD Rec. 22(2), 207–216 (1993)
Han, J., Pei, J., Yin, Y.: Mining frequent patterns without candidate generation. SIGMOD Rec. 29(2), 1–12 (2000)
Luna, J.M., Romero, J.R., Romero, C., Ventura, S.: On the use of genetic programming for mining comprehensible rules in subgroup discovery. IEEE Trans. Cybernet. 44(12), 2329–2341 (2014)
Scheffer, T., Wrobel, S.: Finding the most interesting patterns in a database quickly by using sequential sampling. J. Mach. Learn. Res. 3, 833–862 (2003)
Grosskreutz, H., Rüping, S., Wrobel, S.: Proceedings, Part I European Conference, ECML PKDD 2008, Antwerp, Belgium, September 15-19, 2008. Tight Optimistic Estimates for Fast Subgroup Discovery (Berlin, Heidelberg, 2008) pp. 440–456 (2008)
Zaharia, M., Chowdhury, M., Franklin, M.J., Shenker, S., Stoica, I.: Spark: cluster computing with working sets. In: Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, Ser. HotCloud’10, Berkeley (2010)
Klösgen, W.: Advances in knowledge discovery and data mining. In: Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P., Uthurusamy, R. (eds.) Explora: A Multipattern and Multistrategy Discovery Assistant, pp. 249–271. American Association for Artificial Intelligence, Menlo Park (1996)
Kavšek, B., Lavrač, N., Jovanoski, V.: 5th International Symposium on Intelligent Data Analysis, IDA: ch, pp. 230–241. APRIORI-SD, Adapting Association Rule Learning to Subgroup Discovery (2003)
Atzmueller, M., Puppe, F.: Sd-map-a fast algorithm for exhaustive subgroup discovery. In: 17th European Conference on Machine Learning and 10th European Conference on Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD 2006). Lecture Notes on Computer Science, vol. 4213, pp. 6–17. Springer (2006)
Klösgen, W.: Advances in knowledge discovery and data mining. In: Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P., Uthurusamy, R. (eds.) Explora: A Multipattern and Multistrategy Discovery Assistant. American Association for Artificial Intelligence, Menlo Park (1996)
García, S., Luengo, J., Herrera, F.: Data Preprocessing in Data Mining. Springer, Switzerland (2015)
Lemmerich, F., Atzmueller, M., Puppe, F.: Fast exhaustive subgroup discovery with numerical target concepts. Data Min. Knowl. Discov. 30(3), 711–762 (2015)
Atzmueller, M., Lemmerich, F.: Fast subgroup discovery for continuous target concepts. In: Foundations of Intelligent Systems, pp. 35–44. Springer, Berlin (2009)
Grosskreutz, H., Rüping, S., Wrobel, S.: Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2008, Antwerp, Belgium, September 15-19, 2008, Proceedings, Part I. Berlin, Heidelberg: Springer Berlin Heidelberg, 2008, ch. Tight Optimistic Estimates for Fast Subgroup Discovery, pp. 440–456
Rajaraman, A., Ullman, J.D.: Mining of Massive Datasets. Cambridge University Press, New York (2011)
Padillo, F., Luna, J.M., Cano, A., Ventura, S.: A data structure to speed-up machine learning algorithms on massive datasets. In: Proceedings of the 11th International Conference on Hybrid Artificial Intelligence Systems, ser. HAIS 2016, Seville, Spain, pp. 365–376 (2016)
Dean, J., Ghemawat, S.: MapReduce: Simplified Data Processing on Large Clusters. Communications of the ACM - 50th anniversary issue: 1958 - 2008, 51(1), 107–113 (2008)
Lam, C.: Hadoop in Action, 1st edn. Manning Publications Co., Greenwich (2010)
Luna, J.M.: Pattern mining: current status and emerging topics. Prog. Artif. Intel. 5(3), 1–6 (2016)
Li, H., Wang, Y., Zhang, D., Zhang, M., Chang, E.Y.: Pfp: Parallel fp-growth for query recommendation. In: Proceedings of the 2008 ACM Conference on Recommender Systems, ser. RecSys ’08. New York, NY, USA: ACM, pp. 107–114 (2008)
Acknowledgements
This work was supported by the Spanish Ministry of Economy and Competitiveness and FEDER funds, Project TIN-2014-55252-P.
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
Padillo, F., Luna, J.M. & Ventura, S. Exhaustive search algorithms to mine subgroups on Big Data using Apache Spark. Prog Artif Intell 6, 145–158 (2017). https://doi.org/10.1007/s13748-017-0112-x
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s13748-017-0112-x