Addressing Local Class Imbalance in Balanced Datasets with Dynamic Impurity Decision Trees

Mulyar, Andriy; Krawczyk, Bartosz

doi:10.1007/978-3-030-01771-2_1

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11198)

Included in the following conference series:

International Conference on Discovery Science

Abstract

Decision trees are among the most popular machine learning algorithms, due to their simplicity, versatility, and interpretability. Their underlying principle revolves around the recursive partitioning of the feature space into disjoint subsets, each of which should ideally contain only a single class. This is achieved by selecting features and conditions that allow for the most effective split of the tree structure. Traditionally, impurity metrics are used to measure the effectiveness of a split, since ideally a given subset should contain instances of only a single class. In this paper, we discuss the underlying shortcoming of such an assumption and introduce the notion of local class imbalance. We show that traditional splitting criteria induce increasing class imbalances as the tree structure grows. Therefore, even when dealing with initially balanced datasets, class imbalance will become a problem during decision tree induction. At the same time, we show that existing skew-insensitive split criteria yield inferior performance when the data is roughly balanced. To address this, we propose a simple yet effective hybrid decision tree architecture that is capable of dynamically switching between standard and skew-insensitive splitting criteria during decision tree induction. Our experimental study shows that local class imbalance is embedded in most standard classification problems and that the proposed hybrid approach is capable of alleviating its influence.
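
The idea of switching criteria per node can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract names neither the skew-insensitive criterion nor the switching rule, so the sketch assumes Gini gain as the standard criterion, the Hellinger distance as one well-known skew-insensitive criterion, and a hypothetical imbalance-ratio threshold (`ir_threshold`) as the switch. All function names are illustrative, and the Hellinger variant shown handles binary labels only.

```python
import math
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a node; 0.0 for a pure or empty node."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_gain(parent, left, right):
    """Impurity decrease obtained by splitting parent into left and right."""
    n = len(parent)
    if n == 0:
        return 0.0
    children = (len(left) / n) * gini_impurity(left) \
        + (len(right) / n) * gini_impurity(right)
    return gini_impurity(parent) - children


def hellinger_distance(parent, left, right, positive=1):
    """Hellinger distance between the two class-conditional branch
    distributions (binary labels assumed); larger means a better split."""
    pos = sum(1 for y in parent if y == positive)
    neg = len(parent) - pos
    if pos == 0 or neg == 0:
        return 0.0  # pure node: nothing to separate
    total = 0.0
    for branch in (left, right):
        branch_pos = sum(1 for y in branch if y == positive)
        branch_neg = len(branch) - branch_pos
        total += (math.sqrt(branch_pos / pos) - math.sqrt(branch_neg / neg)) ** 2
    return math.sqrt(total)


def imbalance_ratio(labels):
    """Majority count divided by minority count at this node (local imbalance)."""
    counts = Counter(labels)
    if len(counts) < 2:
        return math.inf  # pure node
    return max(counts.values()) / min(counts.values())


def choose_criterion(labels, ir_threshold=3.0):
    """Hypothetical switching rule: use the skew-insensitive criterion
    once the local imbalance ratio exceeds ir_threshold (an assumed value)."""
    return "hellinger" if imbalance_ratio(labels) > ir_threshold else "gini"


def split_score(parent, left, right, ir_threshold=3.0):
    """Score a candidate split with the criterion selected for this node."""
    name = choose_criterion(parent, ir_threshold)
    if name == "hellinger":
        return name, hellinger_distance(parent, left, right)
    return name, gini_gain(parent, left, right)


# A balanced root can still produce locally imbalanced children.
root = [0] * 50 + [1] * 50
left = [0] * 45 + [1] * 5
right = [0] * 5 + [1] * 45
print(choose_criterion(root), choose_criterion(left), choose_criterion(right))
```

In this toy split the root is perfectly balanced, yet each child has a 9:1 class ratio, so under the assumed threshold the root is scored with Gini gain while both children would be scored with the Hellinger distance.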

Acknowledgements

This work is supported by the VCU College of Engineering Dean's Undergraduate Research Initiative (DURI) program.

Author information

Authors and Affiliations

Department of Computer Science, Virginia Commonwealth University, 401 West Main Street, Richmond, VA, 23284, USA
Andriy Mulyar & Bartosz Krawczyk

Corresponding author

Correspondence to Bartosz Krawczyk.

Editor information

Editors and Affiliations

Goldsmiths University of London, London, UK
Larisa Soldatova
Eindhoven University of Technology, Eindhoven, The Netherlands
Joaquin Vanschoren
University of Cyprus, Nicosia, Cyprus
George Papadopoulos
Università degli Studi di Bari Aldo Moro, Bari, Italy
Michelangelo Ceci

About this paper

Cite this paper

Mulyar, A., Krawczyk, B. (2018). Addressing Local Class Imbalance in Balanced Datasets with Dynamic Impurity Decision Trees. In: Soldatova, L., Vanschoren, J., Papadopoulos, G., Ceci, M. (eds) Discovery Science. DS 2018. Lecture Notes in Computer Science, vol 11198. Springer, Cham. https://doi.org/10.1007/978-3-030-01771-2_1

DOI: https://doi.org/10.1007/978-3-030-01771-2_1
Published: 07 October 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-01770-5
Online ISBN: 978-3-030-01771-2
eBook Packages: Computer Science, Computer Science (R0)