Abstract
Multi-relational Data Mining (MRDM) algorithms are the appropriate approach for inferring knowledge from databases containing multiple relationships between non-homogeneous entities, which is precisely the case of datasets obtained from social networks. However, to achieve such expressivity, the search space of candidate hypotheses in MRDM algorithms is more complex than that of traditional data mining algorithms. To keep the search space of hypotheses feasible, MRDM algorithms adopt several language biases during the mining process. Because of that, when running an MRDM-based system, the user needs to execute the same set of data mining tasks a number of times, each assuming a different combination of parameters, in order to obtain a good final hypothesis. This makes manual control of such a complex process tedious, laborious and error-prone. In addition, running the same MRDM process several times is time-consuming. Thus, automatically executing each parameter setting through parallelization techniques becomes essential. In this paper, we propose an approach named LPFlow4SN that models an MRDM process as a scientific workflow and then executes it in parallel in the cloud, thus benefiting from existing Scientific Workflow Management Systems. Experimental results reinforce the potential of running parallel scientific workflows in the cloud to automatically control the MRDM process while improving its overall execution performance.
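The parameter-sweep pattern described in the abstract can be sketched as follows. This is a minimal local illustration only, not LPFlow4SN or SciCumulus: the parameter names in `PARAM_GRID`, the placeholder `run_mrdm` function and its made-up score are all assumptions, and a thread pool stands in for cloud virtual machines.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

# Hypothetical language-bias parameters; names and values are illustrative only.
PARAM_GRID = {
    "max_clause_length": [2, 4, 6],
    "min_accuracy": [0.6, 0.8],
    "max_nodes": [1000, 5000],
}


def run_mrdm(params):
    """Placeholder for one MRDM run with a given parameter setting.

    A real run would invoke the learner and evaluate its hypothesis;
    here a made-up score is returned so the sketch is self-contained.
    """
    score = params["min_accuracy"] - 0.01 * abs(params["max_clause_length"] - 4)
    return params, score


def parameter_sweep(grid, workers=4):
    """Run every parameter combination concurrently and keep the best one."""
    names = sorted(grid)
    combos = [
        dict(zip(names, values))
        for values in product(*(grid[name] for name in names))
    ]
    # Each combination is an independent task, so the runs can proceed in parallel.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(run_mrdm, combos))
    best_params, best_score = max(results, key=lambda r: r[1])
    return best_params, best_score, len(combos)


best_params, best_score, n_runs = parameter_sweep(PARAM_GRID)
```

Because each parameter combination is independent of the others, the sweep needs no coordination between runs, which is what makes distributing it across many machines attractive.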
Notes
- 1.
- 2. Download SciCumulus at: https://scicumulusc2.wordpress.com/.
- 3.
Acknowledgments
The authors would like to thank FAPERJ (grant E-26/111.370/2013) and CNPq (grant 478878/2013-3) for partially sponsoring this research.
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Paes, A., de Oliveira, D. (2015). Running Multi-relational Data Mining Processes in the Cloud: A Practical Approach for Social Networks. In: Osthoff, C., Navaux, P., Barrios Hernandez, C., Silva Dias, P. (eds) High Performance Computing. CARLA 2015. Communications in Computer and Information Science, vol 565. Springer, Cham. https://doi.org/10.1007/978-3-319-26928-3_1
DOI: https://doi.org/10.1007/978-3-319-26928-3_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-26927-6
Online ISBN: 978-3-319-26928-3
eBook Packages: Computer Science, Computer Science (R0)