Abstract
The k Nearest Neighbor classifier is popular and versatile, but it requires a relatively small training set to perform adequately, a prerequisite that cannot be satisfied with the large volumes of training data nowadays available from streaming environments. Conventional Data Reduction Techniques that select or generate training prototypes are also inappropriate in such environments. Dynamic RHC (dRHC) is a prototype generation algorithm that can update its condensing set when new training data arrives. However, after repeated updates, the size of the condensing set may become unpredictably large. This paper proposes dRHC2, a new variation of dRHC that remedies this drawback. dRHC2 keeps the size of the condensing set at a convenient level, manageable by the classifier, by ranking the prototypes and removing the least important ones. dRHC2 is tested on several datasets, and the experimental results reveal that it is more efficient and noise tolerant than dRHC and comparable to dRHC in terms of accuracy.
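The capping step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes each prototype carries a weight (e.g., the number of training instances it summarizes, as in RHC-style prototype generation) and uses that weight as the ranking criterion; the exact importance measure used by dRHC2 is defined in the paper.

```python
# Hedged sketch: keeping a condensing set at a fixed maximum size by
# ranking prototypes and dropping the least important ones.
# Assumption: each prototype is a dict with a "weight" field counting the
# training instances it represents; this ranking rule is illustrative.

def cap_condensing_set(prototypes, max_size):
    """Return at most max_size prototypes, discarding the lowest-ranked."""
    if len(prototypes) <= max_size:
        return list(prototypes)
    # Rank by descending weight and keep the top max_size prototypes.
    ranked = sorted(prototypes, key=lambda p: p["weight"], reverse=True)
    return ranked[:max_size]

# Example condensing set: three prototypes with hypothetical weights.
protos = [
    {"mean": [0.1, 0.2], "label": 0, "weight": 120},
    {"mean": [0.9, 0.8], "label": 1, "weight": 5},
    {"mean": [0.5, 0.4], "label": 0, "weight": 40},
]
capped = cap_condensing_set(protos, 2)
print([p["weight"] for p in capped])  # -> [120, 40]
```

Because the set is re-capped after each incremental update, its size stays bounded regardless of how much streaming data has been absorbed, which is the property the abstract claims for dRHC2.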
Copyright information
© 2017 Springer International Publishing AG
Cite this paper
Ougiaroglou, S., Arampatzis, G., Dervos, D.A., Evangelidis, G. (2017). Generating Fixed-Size Training Sets for Large and Streaming Datasets. In: Kirikova, M., Nørvåg, K., Papadopoulos, G. (eds) Advances in Databases and Information Systems. ADBIS 2017. Lecture Notes in Computer Science, vol. 10509. Springer, Cham. https://doi.org/10.1007/978-3-319-66917-5_7
Print ISBN: 978-3-319-66916-8
Online ISBN: 978-3-319-66917-5