WordificationMI: multi-relational data mining through multiple-instance propositionalization

  • Regular Paper
  • Progress in Artificial Intelligence

Abstract

Multi-relational data mining (MRDM) looks for patterns in a relational database. One established approach to MRDM is propositionalization, which transforms a relational database into a simpler representation, commonly a single table. Another approach that has proven effective for learning problems involving one-to-many relationships in the data is multiple-instance learning. In this paper, we propose a new technique for transforming relational data, called WordificationMI, which takes advantage of the potential of multiple-instance learning. The proposal builds on the bag-of-words representation introduced in the Wordification methodology, but differs in that it transforms a relational database into a multiple-instance representation. Additionally, we propose a feature selection method, named MICHI (\(\chi^{2}_{\mathrm{MI}}\)), for reducing the dimensionality of the datasets obtained with WordificationMI. We also present an empirical evaluation with ten relational databases and four learning techniques that shows the effectiveness of the proposed methods.
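To make the idea of a multiple-instance representation concrete, the following is a minimal sketch and not the authors' algorithm, whose details are not given in the abstract. It assumes a toy database with a hypothetical main table `customers` and a hypothetical one-to-many table `orders`, and it assumes words of the form `table_attribute_value`. Each main-table row becomes a bag, and each related row becomes one bag-of-words instance inside that bag.

```python
from collections import Counter

# Hypothetical toy database: one main table and one related table (one-to-many).
customers = [
    {"id": 1, "label": "good"},
    {"id": 2, "label": "bad"},
]
orders = [
    {"customer_id": 1, "item": "book", "size": "small"},
    {"customer_id": 1, "item": "pen", "size": "small"},
    {"customer_id": 2, "item": "tv", "size": "large"},
]


def row_to_words(table, row, skip):
    """Turn one row into 'table_attribute_value' tokens (assumed word format)."""
    return [f"{table}_{k}_{v}" for k, v in row.items() if k not in skip]


def to_bags(main_rows, related_rows, fk):
    """Build one bag per main-table row; each related row is one instance."""
    bags = {}
    for m in main_rows:
        instances = [
            Counter(row_to_words("orders", r, skip={fk}))
            for r in related_rows
            if r[fk] == m["id"]
        ]
        bags[m["id"]] = {"label": m["label"], "instances": instances}
    return bags


bags = to_bags(customers, orders, fk="customer_id")
```

In this sketch, customer 1 yields a bag with two instances and customer 2 a bag with one. A single-table bag-of-words transformation would instead merge all words of a customer into one vector, which loses the grouping of words by related row.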

Notes

  1. All databases used here were obtained from https://relational.fit.cvut.cz, except IMDb, which was provided by the authors of Wordification.

Acknowledgements

We gratefully acknowledge Matic Perovšek for his clarifications on the Wordification method and for providing the IMDb database.

Author information

Corresponding author

Correspondence to Sebastián Ventura.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This research was supported by the Spanish Ministry of Economy and the European Regional Development Fund, Project TIN2017-83445-P. The authors also thank the AUIP and the Council of Economy and Knowledge of the Andalusia Board, as sponsors of the Academic Mobility Scholarship Program of the AUIP.

About this article

Cite this article

Quintero-Domínguez, L.A., Morell, C. & Ventura, S. WordificationMI: multi-relational data mining through multiple-instance propositionalization. Prog Artif Intell 8, 375–387 (2019). https://doi.org/10.1007/s13748-019-00186-y

  • DOI: https://doi.org/10.1007/s13748-019-00186-y
