Full Border Identification for Reduction of Training Sets

  • Conference paper
Advances in Artificial Intelligence (Canadian AI 2008)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 5032)

Abstract

Border identification (BI) was previously proposed to help learning systems focus on the most relevant portion of the training set and thereby improve learning accuracy. This paper argues that the traditional BI implementation suffers from a serious limitation: it can only identify partial borders. To address this limitation, we propose a new BI method, Progressive Border Sampling (PBS), which borrows ideas from recent research on Progressive Sampling. PBS progressively learns an optimal border from the entire training set: it first identifies a full border, thereby avoiding the limitation of traditional BI, and then increments the size of that border until it converges to an optimal sample that is smaller than the original training set. Because PBS identifies the full border, it is expected to discover better samples than traditional BI does. Experimental results on 30 benchmark datasets selected from the UCI repository show that, in the context of classification, PBS is indeed more successful than traditional BI at reducing the size of the training sets while optimizing the resulting accuracy.
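The abstract describes PBS only at a high level, so the sketch below is purely illustrative rather than the procedure published in the paper. It assumes that the "full border" is the set of nearest opposite-class neighbours collected over every class, that the sample grows by re-identifying a border on the not-yet-selected instances, and that convergence is declared when a naive Bayes learner's accuracy on the full training set stops improving. All function names, the choice of learner, and the convergence test are our assumptions.

```python
# Illustrative sketch of a progressive border-sampling loop.
# ASSUMPTIONS (not taken from the paper): the border is the set of nearest
# opposite-class neighbours, the sample grows by re-bordering the remaining
# instances, and convergence is an accuracy plateau for a naive Bayes learner.
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import NearestNeighbors


def identify_full_border(X, y):
    """Return indices of points that are the nearest 'enemy' of some instance.

    For every class c, each instance of c votes for its nearest neighbour in
    the other classes; the union of those neighbours over all classes is one
    plausible reading of a "full" (all classes, both directions) border.
    """
    border = set()
    for c in np.unique(y):
        same = np.flatnonzero(y == c)
        other = np.flatnonzero(y != c)
        if len(same) == 0 or len(other) == 0:
            continue
        nn = NearestNeighbors(n_neighbors=1).fit(X[other])
        _, idx = nn.kneighbors(X[same])
        border.update(other[idx.ravel()].tolist())
    return np.array(sorted(border), dtype=int)


def progressive_border_sample(X, y, learner=None, tol=1e-3, max_rounds=20):
    """Grow a border-based sample until the learner's accuracy stops improving."""
    learner = learner if learner is not None else GaussianNB()
    sample = identify_full_border(X, y)
    if len(sample) == 0:                      # degenerate case: a single class
        return np.arange(len(X))
    prev_acc = -np.inf
    for _ in range(max_rounds):
        learner.fit(X[sample], y[sample])
        acc = accuracy_score(y, learner.predict(X))
        remaining = np.setdiff1d(np.arange(len(X)), sample)
        # Stop when accuracy has converged or there is nothing left to add.
        if acc - prev_acc < tol or len(remaining) == 0:
            break
        prev_acc = acc
        # Increment the sample with the border of the not-yet-selected points.
        new_border = remaining[identify_full_border(X[remaining], y[remaining])]
        if len(new_border) == 0:
            break
        sample = np.union1d(sample, new_border)
    return sample


if __name__ == "__main__":
    from sklearn.datasets import load_iris

    X, y = load_iris(return_X_y=True)
    kept = progressive_border_sample(X, y)
    print(f"kept {len(kept)} of {len(X)} training instances")
```

Under these assumptions, if the first full border already yields an accuracy plateau the loop returns just that border; otherwise the sample grows round by round and, in the worst case, falls back to a sample at most as large as the original training set.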



Author information

Authors: G. Li, N. Japkowicz, T.J. Stocki, R.K. Ungar

Editor information

Sabine Bergler

Copyright information

© 2008 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Li, G., Japkowicz, N., Stocki, T.J., Ungar, R.K. (2008). Full Border Identification for Reduction of Training Sets. In: Bergler, S. (ed.) Advances in Artificial Intelligence. Canadian AI 2008. Lecture Notes in Computer Science (LNAI), vol. 5032. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-68825-9_20

  • DOI: https://doi.org/10.1007/978-3-540-68825-9_20

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-68821-1

  • Online ISBN: 978-3-540-68825-9

  • eBook Packages: Computer Science, Computer Science (R0)
