DOI: 10.1145/3447548.3467356
research-article

Fast Rotation Kernel Density Estimation over Data Streams

Published: 14 August 2021

Abstract

Kernel density estimation is a powerful tool that is widely used in important real-world applications such as anomaly detection and statistical learning. Unfortunately, current kernel methods incur high computational or space costs on large-scale, high-dimensional datasets, especially when the data arrive as a stream. Although sketch methods have been designed for kernel density estimation over data streams, they remain computationally expensive. To address this problem, we propose a novel Rotation Kernel, which is based on a Rotation Hash method and is much faster to compute. To achieve memory-efficient kernel density estimation over data streams, we design RKD-Sketch, a method that compresses high-dimensional data streams into a small array of integer counters. Extensive experiments on both synthetic and real-world datasets show that RKD-Sketch reduces computational cost by up to 216 times and space cost by up to 104 times compared with state-of-the-art methods. Furthermore, we apply our Rotation Kernel to active learning, where our method achieves up to a 256-times speedup and uses up to 13 times less space than the baseline methods at the same accuracy.
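To make the idea of compressing a stream into "a small array of integer counters" concrete, here is a minimal sketch of a generic hash-and-count density sketch. It is not the paper's RKD-Sketch: the abstract does not specify the Rotation Hash, so this example substitutes signed random projections, and the class name `LshCounterSketch`, the helper `angular_kernel`, and the parameters `rows`, `bits`, and `seed` are all hypothetical. Under these assumptions, the sketch estimates the mean of an angular collision kernel between a query and the streamed points.

```python
import math
import random


class LshCounterSketch:
    """Rows of integer counters indexed by hash codes (illustrative only).

    Assumption: signed random projections stand in for the paper's
    Rotation Hash, which the abstract does not describe.
    """

    def __init__(self, dim, rows=50, bits=4, seed=0):
        rng = random.Random(seed)
        self.rows = rows
        # Each row owns `bits` random Gaussian hyperplanes.
        self.planes = [
            [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(bits)]
            for _ in range(rows)
        ]
        # Each row owns 2**bits integer counters, one per hash code.
        self.counts = [[0] * (1 << bits) for _ in range(rows)]
        self.n = 0  # number of points seen so far

    def _code(self, row, x):
        # Pack the signs of the projections into one integer code.
        code = 0
        for plane in self.planes[row]:
            dot = sum(p * v for p, v in zip(plane, x))
            code = (code << 1) | (1 if dot >= 0 else 0)
        return code

    def add(self, x):
        # One pass over the stream: increment one counter per row.
        for r in range(self.rows):
            self.counts[r][self._code(r, x)] += 1
        self.n += 1

    def query(self, q):
        # Average the counters the query hashes to, normalized by n.
        if self.n == 0:
            return 0.0
        total = sum(self.counts[r][self._code(r, q)] for r in range(self.rows))
        return total / (self.rows * self.n)


def angular_kernel(x, q, bits):
    """Collision probability of `bits` signed random projections."""
    dot = sum(a * b for a, b in zip(x, q))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in q))
    cos = max(-1.0, min(1.0, dot / norm))
    return (1.0 - math.acos(cos) / math.pi) ** bits


# Usage: stream 200 synthetic points, then compare against the exact average.
data_rng = random.Random(1)
points = [[data_rng.gauss(1.0, 1.0) for _ in range(8)] for _ in range(200)]
sketch = LshCounterSketch(dim=8, rows=50, bits=4, seed=0)
for p in points:
    sketch.add(p)
query_point = [1.0] * 8
exact = sum(angular_kernel(p, query_point, 4) for p in points) / len(points)
print("sketch estimate:", sketch.query(query_point), "exact:", exact)
```

The memory footprint here is `rows * 2**bits` counters regardless of how many points are streamed, which is the property the abstract highlights; the estimate is randomized, so it only approximates the exact average.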

Supplementary Material

MP4 File (RKDS.mp4)


Published In

KDD '21: Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining
August 2021
4259 pages
ISBN:9781450383325
DOI:10.1145/3447548
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. kernel density estimation
  2. streaming algorithms

Qualifiers

  • Research-article

Funding Sources

  • MoE-CMCC "Artificial Intelligence" Project
  • Shenzhen Basic Research Grant
  • Natural Science Basic Research Plan in Shaanxi Province of China
  • Natural Science Basic Research Plan in Zhejiang Province of China
  • National Natural Science Foundation of China

Conference

KDD '21

Acceptance Rates

Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

Article Metrics

  • Downloads (Last 12 months): 65
  • Downloads (Last 6 weeks): 7
Reflects downloads up to 05 Mar 2025

Cited By

  • (2024) Approximate kernel density estimation under metric-based local differential privacy. Proceedings of the Fortieth Conference on Uncertainty in Artificial Intelligence, pp. 4250-4270. DOI: 10.5555/3702676.3702876. Online publication date: 15-Jul-2024
  • (2024) A Unified Framework for Mining Batch and Periodic Batch in Data Streams. IEEE Transactions on Knowledge and Data Engineering, 36:11, pp. 5544-5561. DOI: 10.1109/TKDE.2024.3399024. Online publication date: Nov-2024
  • (2023) HyperCalm Sketch: One-Pass Mining Periodic Batches in Data Streams. 2023 IEEE 39th International Conference on Data Engineering (ICDE), pp. 14-26. DOI: 10.1109/ICDE55515.2023.00009. Online publication date: Apr-2023
  • (2022) Similarity-Based Adaptive Window for Improving Classification of Epileptic Seizures with Imbalance EEG Data Stream. Entropy, 24:11, article 1641. DOI: 10.3390/e24111641. Online publication date: 11-Nov-2022
  • (2022) Object detection and tracking method based on spiking neuron network. 2022 41st Chinese Control Conference (CCC), pp. 6811-6815. DOI: 10.23919/CCC55666.2022.9901740. Online publication date: 25-Jul-2022
  • (2022) Enabling secure time-series data sharing via homomorphic encryption in cloud-assisted IIoT. Future Generation Computer Systems, 133:C, pp. 351-363. DOI: 10.1016/j.future.2022.03.032. Online publication date: 1-Aug-2022
