Incremental k-Nearest Neighbors Using Reservoir Sampling for Data Streams

  • Conference paper
  • Discovery Science (DS 2021)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 12986)

Abstract

The online and potentially infinite nature of data streams makes it impossible to store the flow in its entirety, restricting storage to a part of the stream and/or synopsis information derived from it. Processing such evolving data requires efficient and accurate methodologies and systems, such as window models (e.g., sliding windows) and summarization techniques (e.g., sampling, sketching, dimensionality reduction). In this paper, we propose RW-kNN, a k-Nearest Neighbors (kNN) algorithm that stores information about past instances in a practical way: biased reservoir sampling is used to sample the input instances, while a sliding window maintains the most recent instances from the stream. We evaluate our proposal on a diverse set of synthetic and real datasets and compare it against state-of-the-art algorithms in a traditional test-then-train evaluation. Results show that RW-kNN achieves high predictive performance on both real and synthetic datasets while using a feasible amount of resources.
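
The abstract describes two memory structures queried jointly by kNN: a sliding window of the most recent instances and a biased reservoir sample of older ones, evaluated in a test-then-train loop. The sketch below illustrates that combination under stated assumptions; it is not the authors' implementation. The class name RWkNNSketch, the parameter defaults, the squared-Euclidean distance, and the simplified memory-less bias rule (a new instance replaces a random reservoir slot with a fixed probability once the reservoir is full) are all illustrative choices, not details taken from the paper.

```python
import random
from collections import Counter, deque


class RWkNNSketch:
    """Hedged sketch of a kNN classifier over a sliding window plus a
    biased reservoir sample of past instances (simplified bias scheme)."""

    def __init__(self, k=5, window_size=1000, reservoir_size=1000, bias=0.2):
        self.k = k
        self.window = deque(maxlen=window_size)   # most recent instances
        self.reservoir = []                        # biased sample of the past
        self.reservoir_size = reservoir_size
        self.bias = bias  # probability a new instance replaces a reservoir slot

    def _distance(self, a, b):
        # Squared Euclidean distance between two numeric feature vectors.
        return sum((p - q) ** 2 for p, q in zip(a, b))

    def predict(self, x):
        # Search neighbors in the union of the window and the reservoir.
        pool = list(self.window) + self.reservoir
        if not pool:
            return None
        neighbors = sorted(pool, key=lambda item: self._distance(item[0], x))[: self.k]
        return Counter(label for _, label in neighbors).most_common(1)[0][0]

    def learn_one(self, x, y):
        self.window.append((x, y))
        # Simplified biased reservoir update: fill empty slots first, then
        # overwrite a random slot with probability `bias`, keeping the sample
        # skewed toward more recent parts of the stream.
        if len(self.reservoir) < self.reservoir_size:
            self.reservoir.append((x, y))
        elif random.random() < self.bias:
            self.reservoir[random.randrange(self.reservoir_size)] = (x, y)


# Test-then-train (prequential) loop over a stream of (features, label) pairs:
# model = RWkNNSketch(k=5)
# correct = 0
# for x, y in stream:
#     correct += int(model.predict(x) == y)  # test first
#     model.learn_one(x, y)                  # then train
```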

Notes

  1. In the sequel, we use the terms instance and observation interchangeably.

  2. http://waikato.github.io/meka/datasets/.

  3. https://archive.ics.uci.edu/ml/datasets/Poker+Hand.


Acknowledgements

This work was carried out in the framework of a cooperation between Huawei Technologies France SASU and Télécom Paris (Grant no. YBN2018125164).

Author information

Correspondence to Maroua Bahri or Albert Bifet.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Bahri, M., Bifet, A. (2021). Incremental k-Nearest Neighbors Using Reservoir Sampling for Data Streams. In: Soares, C., Torgo, L. (eds.) Discovery Science. DS 2021. Lecture Notes in Computer Science, vol. 12986. Springer, Cham. https://doi.org/10.1007/978-3-030-88942-5_10

  • DOI: https://doi.org/10.1007/978-3-030-88942-5_10

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-88941-8

  • Online ISBN: 978-3-030-88942-5

  • eBook Packages: Computer Science, Computer Science (R0)
