Incremental k-Nearest Neighbors Using Reservoir Sampling for Data Streams

  • Conference paper
  • Discovery Science (DS 2021)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 12986)

Abstract

The online and potentially infinite nature of data streams makes it impossible to store the flow in its entirety, restricting storage to a part of the stream and/or synopsis information derived from it. Processing such evolving data requires efficient and accurate methodologies and systems, such as window models (e.g., sliding windows) and summarization techniques (e.g., sampling, sketching, dimensionality reduction). In this paper, we propose RW-kNN, a k-Nearest Neighbors (kNN) algorithm that stores information about past instances in a practical way: biased reservoir sampling is used to sample the input instances, while a sliding window maintains the most recent instances from the stream. We evaluate our proposal on a diverse set of synthetic and real datasets and compare it against state-of-the-art algorithms in a traditional test-then-train evaluation. Results show that RW-kNN achieves high predictive performance on both real and synthetic datasets while using a feasible amount of resources.
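
The abstract describes two memory structures queried jointly by kNN: a sliding window of the most recent instances and a biased reservoir sample of older ones, evaluated in a test-then-train loop. The sketch below illustrates that combination under stated assumptions; it is not the authors' implementation. The class name RWkNNSketch, the parameter defaults, the squared-Euclidean distance, and the simplified memory-less bias rule (a new instance replaces a random reservoir slot with a fixed probability once the reservoir is full) are all illustrative choices, not details taken from the paper.

```python
import random
from collections import Counter, deque


class RWkNNSketch:
    """Hedged sketch of a kNN classifier over a sliding window plus a
    biased reservoir sample of past instances (simplified bias scheme)."""

    def __init__(self, k=5, window_size=1000, reservoir_size=1000, bias=0.2):
        self.k = k
        self.window = deque(maxlen=window_size)   # most recent instances
        self.reservoir = []                        # biased sample of the past
        self.reservoir_size = reservoir_size
        self.bias = bias  # probability a new instance replaces a reservoir slot

    def _distance(self, a, b):
        # Squared Euclidean distance between two numeric feature vectors.
        return sum((p - q) ** 2 for p, q in zip(a, b))

    def predict(self, x):
        # Search neighbors in the union of the window and the reservoir.
        pool = list(self.window) + self.reservoir
        if not pool:
            return None
        neighbors = sorted(pool, key=lambda item: self._distance(item[0], x))[: self.k]
        return Counter(label for _, label in neighbors).most_common(1)[0][0]

    def learn_one(self, x, y):
        self.window.append((x, y))
        # Simplified biased reservoir update: fill empty slots first, then
        # overwrite a random slot with probability `bias`, keeping the sample
        # skewed toward more recent parts of the stream.
        if len(self.reservoir) < self.reservoir_size:
            self.reservoir.append((x, y))
        elif random.random() < self.bias:
            self.reservoir[random.randrange(self.reservoir_size)] = (x, y)


# Test-then-train (prequential) loop over a stream of (features, label) pairs:
# model = RWkNNSketch(k=5)
# correct = 0
# for x, y in stream:
#     correct += int(model.predict(x) == y)  # test first
#     model.learn_one(x, y)                  # then train
```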

Notes

  1. In the sequel, we use the terms instance and observation interchangeably.

  2. http://waikato.github.io/meka/datasets/.

  3. https://archive.ics.uci.edu/ml/datasets/Poker+Hand.


Acknowledgements

This work was carried out in the framework of a cooperation between Huawei Technologies France SASU and Télécom Paris (Grant no. YBN2018125164).

Author information

Correspondence to Maroua Bahri or Albert Bifet.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Bahri, M., Bifet, A. (2021). Incremental k-Nearest Neighbors Using Reservoir Sampling for Data Streams. In: Soares, C., Torgo, L. (eds.) Discovery Science. DS 2021. Lecture Notes in Computer Science, vol. 12986. Springer, Cham. https://doi.org/10.1007/978-3-030-88942-5_10

  • DOI: https://doi.org/10.1007/978-3-030-88942-5_10

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-88941-8

  • Online ISBN: 978-3-030-88942-5

  • eBook Packages: Computer Science, Computer Science (R0)
