Abstract
With rapid changes in the volume, variety, and velocity of data across real-life domains, learning from big data has become a growing challenge. Rough set theory has been successfully applied to knowledge discovery from databases (KDD) for handling imperfect data. Most traditional rough set algorithms were implemented sequentially and ran on a single machine, which makes them computationally expensive and inefficient for massive data. Recent computing frameworks, such as MapReduce and Apache Spark, have made it possible to implement parallel rough set algorithms on distributed clusters of commodity computers and to speed up big data analyses. Although a variety of scalable rough set implementations have been developed, (1) most studies compared their work only with outdated sequential implementations; (2) certain distributed computing frameworks were used far more frequently than others, and recently developed frameworks were overlooked; and (3) discussion of existing issues and guidance on adopting new computing frameworks are lacking. The main objective of this paper is to survey the current state-of-the-art scalable implementations of rough set algorithms. The paper aims to help researchers catch up with recent developments in this field and to offer insights for developing rough set algorithms in up-to-date high-performance computing environments for big data analytics.
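To make the idea of a parallel rough set algorithm concrete, the following minimal sketch computes lower and upper approximations in a map/reduce style. It is an illustration only and is not taken from any implementation covered by the survey: the function names, the toy decision table, and the use of plain Python lists to stand in for data partitions on cluster nodes are all assumptions.

```python
from collections import defaultdict

def map_phase(partition, cond_idx, dec_idx):
    """Map step (hypothetical): for each record, emit its equivalence-class
    key (the condition-attribute values) and its decision value."""
    for row in partition:
        key = tuple(row[i] for i in cond_idx)
        yield key, frozenset([row[dec_idx]])

def reduce_phase(pairs):
    """Reduce step (hypothetical): merge the decision values seen for
    each equivalence class across all partitions."""
    merged = defaultdict(frozenset)
    for key, decisions in pairs:
        merged[key] = merged[key] | decisions
    return dict(merged)

def approximations(classes, target):
    """Lower approximation: classes whose records all carry the target decision.
    Upper approximation: classes with at least one record carrying it."""
    lower = {k for k, decs in classes.items() if decs == {target}}
    upper = {k for k, decs in classes.items() if target in decs}
    return lower, upper

# Toy decision table split into two partitions, as if stored on two nodes.
# Columns: outlook, temperature, decision (all values are made up).
partitions = [
    [("sunny", "hot", "no"), ("sunny", "hot", "yes"), ("rainy", "mild", "yes")],
    [("rainy", "mild", "yes"), ("overcast", "cool", "no"), ("sunny", "hot", "no")],
]

# Each partition is mapped independently; on a real cluster this runs in parallel.
emitted = [pair for part in partitions
           for pair in map_phase(part, cond_idx=(0, 1), dec_idx=2)]
classes = reduce_phase(emitted)
lower, upper = approximations(classes, target="yes")
print("lower:", sorted(lower))
print("upper:", sorted(upper))
```

Because the map step touches each record independently and the reduce step only merges small per-class summaries, the same structure could in principle be expressed with a framework's own map and reduce-by-key operations; that mapping is left as an assumption here rather than shown against any specific API.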
Copyright information
© 2018 Springer International Publishing AG, part of Springer Nature
About this paper
Cite this paper
Zhou, B., Cho, H., Zhang, X. (2018). Scalable Implementations of Rough Set Algorithms: A Survey. In: Mouhoub, M., Sadaoui, S., Ait Mohamed, O., Ali, M. (eds) Recent Trends and Future Technology in Applied Intelligence. IEA/AIE 2018. Lecture Notes in Computer Science, vol 10868. Springer, Cham. https://doi.org/10.1007/978-3-319-92058-0_62
DOI: https://doi.org/10.1007/978-3-319-92058-0_62
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-92057-3
Online ISBN: 978-3-319-92058-0
eBook Packages: Computer Science, Computer Science (R0)