R Ultimate Multilabel Dataset Repository

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9648)

Abstract

Multilabeled data is everywhere on the Internet. From news on digital media and blog entries to videos hosted on YouTube, objects are usually tagged with a set of labels, so they can be categorized into several non-exclusive groups. However, publicly available multilabel datasets (MLDs) are not so common. Only a handful of websites provide a few of them, and they use disparate file formats. Finding suitable MLDs, converting them into the required format, and locating the appropriate bibliographic data to cite them are some of the difficulties that researchers and practitioners usually face.

This paper introduces RUMDR (R Ultimate Multilabel Dataset Repository), a new multilabel dataset repository that aims to bring together all public MLDs, along with mldr.datasets, an R package that eases retrieving MLDs and their bibliographic information, exporting them to the desired file formats, and partitioning them.
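As a rough illustration of what partitioning an MLD involves, the sketch below splits a toy multilabel dataset into random k folds. It is written in Python for brevity and is not the mldr.datasets API: the function name `random_kfolds`, the toy instances and labels, and the seed are all assumptions made for this example.

```python
import random


def random_kfolds(n_instances, k=5, seed=0):
    """Split instance indices into k disjoint (train, test) folds.

    Hypothetical helper for illustration only; it does not mirror
    any function of the mldr.datasets R package.
    """
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices round-robin so fold sizes differ by at most one.
    test_folds = [indices[i::k] for i in range(k)]
    folds = []
    for test in test_folds:
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        folds.append((sorted(train), sorted(test)))
    return folds


# Toy multilabel dataset (made up): each instance carries a set of
# non-exclusive labels rather than a single class.
dataset = [
    ("news item A", {"politics", "economy"}),
    ("blog entry B", {"travel"}),
    ("video C", {"music", "live", "rock"}),
    ("news item D", {"economy"}),
    ("blog entry E", {"travel", "food"}),
    ("video F", {"music"}),
]

folds = random_kfolds(len(dataset), k=3, seed=42)
for train, test in folds:
    # Show which label sets end up in the held-out part of each fold.
    print("test:", [sorted(dataset[i][1]) for i in test])
```

A purely random split like this can leave rare labels out of some folds entirely, which is one reason label-aware (stratified) partitioning is often preferred for multilabel data.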

Notes

  1. http://GitHub.com
  2. https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multilabel.html
  3. http://sourceforge.net/projects/meka/files/Datasets/
  4. http://mulan.sourceforge.net/datasets-mlc.html
  5. http://sci2s.ugr.es/keel/multilabel.php
  6. This is a compressed file format of the representation of R objects in memory.
  7. https://github.com/fcharte/mldr.datasets/blob/master/additional-data/README.md

Acknowledgments

This work was partially supported by the Spanish Ministry of Science and Technology under projects TIN2014-57251-P and TIN2012-33856, and the Andalusian regional projects P10-TIC-06858 and P11-TIC-7765.

Author information

Corresponding author

Correspondence to Francisco Charte.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Charte, F., Charte, D., Rivera, A., del Jesus, M.J., Herrera, F. (2016). R Ultimate Multilabel Dataset Repository. In: Martínez-Álvarez, F., Troncoso, A., Quintián, H., Corchado, E. (eds) Hybrid Artificial Intelligent Systems. HAIS 2016. Lecture Notes in Computer Science, vol 9648. Springer, Cham. https://doi.org/10.1007/978-3-319-32034-2_41

  • DOI: https://doi.org/10.1007/978-3-319-32034-2_41

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-32033-5

  • Online ISBN: 978-3-319-32034-2

  • eBook Packages: Computer Science, Computer Science (R0)
