IDPP: Imbalanced Datasets Pipelines in Pyrus

  • Conference paper
Engineering of Computer-Based Systems (ECBS 2023)

Abstract

We demonstrate IDPP, a Pyrus-based tool that offers a collection of pipelines for the analysis of imbalanced datasets. Like Pyrus, IDPP is a web-based, low-code/no-code graphical modelling environment for ML and data analytics applications. In a case study from the medical domain, we address the challenge of re-using AI/ML models that do not handle class-imbalanced data by implementing re-balancing algorithms in Python. We then use these algorithms together with the original ML models in the IDPP pipelines. With IDPP, our low-code approach to balancing datasets for AI/ML applications becomes usable by non-coders. It simplifies the data-preprocessing stage of an AI/ML project pipeline, which can potentially improve the performance of the models. The tool demo will showcase the low-code implementation and the no-code reuse and repurposing of AI-based systems through end-to-end Pyrus pipelines.
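To make the idea of re-balancing concrete, the following is a minimal sketch of random oversampling, one common re-balancing technique. It is an illustration only and is not taken from the IDPP code base: the function name `random_oversample`, the toy data, and the choice of oversampling over other strategies are all assumptions made for this example.

```python
import random
from collections import Counter, defaultdict


def random_oversample(rows, labels, seed=0):
    """Hypothetical helper: duplicate randomly chosen minority-class rows
    until every class has as many rows as the largest class."""
    rng = random.Random(seed)

    # Group the rows by their class label.
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)

    # Every class is grown to the size of the largest class.
    target = max(len(members) for members in by_class.values())

    out_rows, out_labels = [], []
    for label, members in by_class.items():
        # Sample with replacement to fill the gap (empty for the majority class).
        extra = [rng.choice(members) for _ in range(target - len(members))]
        for row in members + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


# Toy data (made up for illustration): six negative cases, two positive cases.
rows = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.6], [0.9], [1.0]]
labels = [0, 0, 0, 0, 0, 0, 1, 1]

balanced_rows, balanced_labels = random_oversample(rows, labels)
print(Counter(balanced_labels))
```

After the call, both classes contain six rows. In a real project, re-balancing of this kind would normally be applied to the training split only, so that duplicated rows do not leak into the evaluation data.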

Notes

  1. https://www.tines.com/product.

  2. https://h2o.ai/platform/ai-cloud/make/hydrogen-torch/.

  3. All code, information on the datasets used, and the results are published on GitHub at: https://github.com/singhad/class_imbalance_pyrus.

  4. https://jupyter.org.

  5. https://www.cdc.gov/brfss/annual_data/annual_2014.html.

Acknowledgments

This research was partially funded by Science Foundation Ireland (SFI) under Grant Number 18/CRT/6223 - SFI Centre of Research Training in AI.

Author information

Corresponding author

Correspondence to Amandeep Singh.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Singh, A., Minguett, O. (2024). IDPP: Imbalanced Datasets Pipelines in Pyrus. In: Kofroň, J., Margaria, T., Seceleanu, C. (eds) Engineering of Computer-Based Systems. ECBS 2023. Lecture Notes in Computer Science, vol 14390. Springer, Cham. https://doi.org/10.1007/978-3-031-49252-5_6

  • DOI: https://doi.org/10.1007/978-3-031-49252-5_6

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-49251-8

  • Online ISBN: 978-3-031-49252-5

  • eBook Packages: Computer Science, Computer Science (R0)
