Abstract
We address the problem of dynamically allocating the workload to multiple workers in massively parallel continuous distance-based outlier detection, where the workload is conceptually split into contiguous overlapping regions. The main challenges are threefold: modern stream processing frameworks, such as Apache Flink and Spark Streaming, do not support feedback loops; the process is stateful; and the adaptations do not redistribute keys but instead modify the region boundaries associated with each key. These challenges correspond to overlooked issues that call for novel solutions, which we provide in this work. More specifically, we first propose an architecture that allows such adaptations in Flink. Second, we propose specific techniques for adaptive region definition that are applicable to any distance metric. Finally, we conduct a thorough experimental evaluation; our results show that our proposal is both efficient and effective, even in small finite streams. In addition, our proposal is shown to be insensitive to the exact continuous outlier detection algorithm and to the outlier query parameters.
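
To make the notion of contiguous overlapping regions concrete, the following is a minimal sketch for the one-dimensional numeric case only. It is not the implementation described in this work: the function names `assign` and `rebalance`, the equi-depth quantile rule, and the parameter `radius` (standing for the distance threshold of the outlier query) are all assumptions introduced for illustration.

```python
from bisect import bisect_right


def assign(value, boundaries, radius):
    """Return (home_region, replica_regions) for a 1-D value.

    `boundaries` is a sorted list of cut points; region i covers the
    values between cut point i-1 (inclusive) and cut point i (exclusive).
    A point within `radius` of a cut point is also replicated to the
    region on the other side, which is what makes the regions overlap.
    """
    home = bisect_right(boundaries, value)
    replicas = set()
    for i, cut in enumerate(boundaries):
        if abs(value - cut) <= radius:
            # Cut point i separates region i from region i + 1.
            replicas.update((i, i + 1))
    replicas.discard(home)
    return home, sorted(replicas)


def rebalance(sample, n_regions):
    """Hypothetical adaptation step: move the cut points to the
    equi-depth quantiles of a recent sample, so that each region
    receives roughly the same number of points. Keys (region ids)
    stay the same; only the boundaries change."""
    ordered = sorted(sample)
    return [ordered[len(ordered) * i // n_regions] for i in range(1, n_regions)]


if __name__ == "__main__":
    cuts = [10, 20]
    print(assign(9.5, cuts, radius=1))   # near the first cut point
    print(assign(15, cuts, radius=1))    # far from every cut point
    print(rebalance(range(100), 4))      # new cut points from a sample
```

Note that `rebalance` changes only the cut points and leaves the set of region identifiers untouched, mirroring the observation above that adaptations modify region boundaries rather than redistribute keys.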

Availability of data and material (data transparency)
All datasets used are publicly available from third-party repositories.
Notes
An early short version of this work appeared in [29]; it introduced the technique termed naive in this work, was tailored to the Euclidean space, and was evaluated using only single-dimensional numerical datasets. We significantly extend and improve upon this early work by proposing and experimenting with more eager and sophisticated techniques, while supporting arbitrary metric distances and evaluating on both numeric and text datasets with a high number of dimensions.
The implementation of the whole framework along with the techniques in this work is publicly available from https://github.com/tatoliop/PROUD-PaRallel-OUtlier-Detection-for-streams/tree/adaptive_partitioning.
We have experimented with additional parameters (slide sizes of 1% up to 50% of the window size) and the results are similar to the ones to be presented; due to space constraints such experiments are omitted.
To further investigate any possible correlation we have run some tests using the MMPC algorithm [6] after transforming the runtime values to a binary target variable (improvement or not-improvement); this has also not yielded any concrete results.
References
Abdelhamid AS, Mahmood AR, Daghistani A, Aref WG (2020) Prompt: Dynamic data-partitioning for distributed micro-batch stream processing systems. In: Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, pp 2455–2469
Aly AM, Mahmood AR, Hassan MS, Aref WG, Ouzzani M, Elmeleegy H, Qadah T (2015) AQWA: adaptive query-workload-aware partitioning of big spatial data. PVLDB 8(13):2062–2073
Angiulli F, Fassetti F (2007) Detecting distance-based outliers in streams of data. In: CIKM, pp 811–820
Balkesen C, Tatbul N (2011) Scalable data partitioning techniques for parallel sliding window processing over data streams. In: International Workshop on Data Management for Sensor Networks (DMSN)
Bellas C, Gounaris A (2020) An empirical evaluation of exact set similarity join techniques using gpus. Inf Syst 89:101485. https://doi.org/10.1016/j.is.2019.101485
Brown LE, Tsamardinos I, Aliferis CF (2004) A novel algorithm for scalable and accurate Bayesian network learning. In: Fieschi M, Coiera EW, Li JY (eds) MEDINFO 2004 - Proceedings of the 11th World Congress on Medical Informatics, San Francisco, California, USA, September 7-11, 2004, Studies in Health Technology and Informatics, vol 107, pp 711–715
Cao L, Wang J, Rundensteiner EA (2016) Sharing-aware outlier analytics over high-volume data streams. In: ICDM, pp 527–540. ACM
Cao L, Yan Y, Kuhlman C, Wang Q, Rundensteiner EA, Eltabakh MY (2017) Multi-tactic distance-based outlier detection. In: ICDE, pp 959–970
Cao L, Yang D, Wang Q, Yu Y, Wang J, Rundensteiner EA (2014) Scalable distance-based outlier detection over high-volume data streams. In: ICDE, pp 76–87
Carbone P, Ewen S, Fóra G, Haridi S, Richter S, Tzoumas K (2017) State management in apache flink®: Consistent stateful distributed stream processing. PVLDB 10(12):1718–1729
Cordova I, Moh T (2015) DBSCAN on resilient distributed datasets. In: 2015 International Conference on High Performance Computing & Simulation, HPCS, pp 531–540
Deshpande A, Ives ZG, Raman V (2007) Adaptive query processing. Found. Trends Databases 1(1):1–140
Ding M, Chen S (2019) Efficient partitioning and query processing of spatio-temporal graphs with trillion edges. In: 2019 IEEE 35th International Conference on Data Engineering (ICDE), pp 1714–1717. IEEE
Gedik B (2014) Partitioning functions for stateful data parallelism in stream processing. VLDB J 23(4):517–539
Gill G, Dathathri R, Hoang L, Pingali K (2018) A study of partitioning policies for graph analytics on large-scale distributed platforms. Proceedings of the VLDB Endowment 12(4):321–334
Gounaris A, Yfoulis CA, Paton NW (2012) Efficient load balancing in partitioned queries under random perturbations. TAAS 7(1):5:1-5:27
Katsipoulakis NR, Labrinidis A, Chrysanthis PK (2017) A holistic view of stream partitioning costs. PVLDB 10(11):1286–1297
Knorr EM, Ng RT, Tucakov V (2000) Distance-based outliers: Algorithms and applications. VLDB J 8(3–4):237–253
Kontaki M, Gounaris A, Papadopoulos AN, Tsichlas K, Manolopoulos Y (2016) Efficient and flexible algorithms for monitoring distance-based outliers over data streams. Inf Syst 55:37–53
Monte BD, Zeuch S, Rabl T, Markl V (2020) Rhino: Efficient management of very large distributed state for stream processing engines. In: Proceedings of the 2020 International Conference on Management of Data, SIGMOD, pp 2471–2486
Rupprecht L, Culhane W, Pietzuch PR (2017) Squirreljoin: Network-aware distributed join processing with lazy partitioning. PVLDB 10(11):1250–1261
Shah MA, Hellerstein JM, Chandrasekaran S, Franklin MJ (2002) Flux: An adaptive partitioning operator for continuous query systems. In: Dayal U, Ramamritham K, Vijayaraman TM (eds) ICDE, pp 25–36
Song H, Lee J (2018) RP-DBSCAN: A superfast parallel DBSCAN algorithm based on random partitioning. In: Proceedings of the 2018 International Conference on Management of Data, SIGMOD, pp 1173–1187
Su L, Han W, Yang S, Zou P, Jia Y (2007) Continuous adaptive outlier detection on distributed data streams. In: International Conference on High Performance Computing and Communications, pp 74–85
Subramaniam S, Palpanas T, Papadopoulos D, Kalogeraki V, Gunopulos D (2006) Online outlier detection in sensor data using non-parametric models. In: VLDB, pp 187–198
Tang M, Yu Y, Malluhi QM, Ouzzani M, Aref WG (2016) Locationspark: A distributed in-memory data management system for big spatial data. PVLDB 9(13):1565–1568
To Q, Soto J, Markl V (2018) A survey of state management in big data processing systems. VLDB J 27(6):847–872
Toliopoulos T, Bellas C, Gounaris A, Papadopoulos A (2020) PROUD: parallel outlier detection for streams. In: SIGMOD (demo track, to appear)
Toliopoulos T, Gounaris A (2020) Adaptive distributed partitioning in apache flink. In: 36th IEEE International Conference on Data Engineering Workshops, ICDE Workshops 2020, Dallas, TX, USA, April 20-24, 2020, pp 127–132. IEEE
Toliopoulos T, Gounaris A, Tsichlas K, Papadopoulos A, Sampaio S (2020) Continuous outlier mining of streaming data in flink. Inf Syst 93:101569
Tran L, Fan L, Shahabi C (2016) Distance-based outlier detection in data streams. PVLDB 9(12):1089–1100
Tran L, Mun M, Shahabi C (2020) Real-time distance-based outlier detection in data streams. PVLDB 14(2):141–153
Yang D, Rundensteiner E, Ward M (2009) Neighbor-based pattern detection for windows over streaming data. In: EDBT, pp 529–540
Yang K, Gao Y, Ma R, Chen L, Wu S, Chen G (2019) Dbscan-ms: Distributed density-based clustering in metric spaces. In: 2019 IEEE 35th International Conference on Data Engineering (ICDE), pp 1346–1357. IEEE
Yianilos PN (1993) Data structures and algorithms for nearest neighbor search in general metric spaces. In: SODA, vol 93, pp 311–321
Yoon S, Lee J, Lee BS (2019) NETS: extremely fast outlier detection from a data stream via set-based processing. PVLDB 12(11):1303–1315
Zhao G, Yu Y, Song P, Zhao G, Ji Z (2018) A parameter space framework for online outlier detection over high-volume data streams. IEEE Access 6:38124–38136
Acknowledgements
This research work has been supported by the European Commission under the Horizon 2020 Programme, through funding of the LifeChamps project (Grant 875329).
Funding
European Commission under the Horizon 2020 Programme, LifeChamps project (Grant 875329).
Ethics declarations
Conflict of interest
The authors have no conflicts of interest to declare that are relevant to the content of this article.
Code availability (software application or custom code)
All code is available from https://github.com/tatoliop/PROUD-PaRallel-OUtlier-Detection-for-streams/tree/adaptive_partitioning.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Toliopoulos, T., Gounaris, A. Adaptivity in continuous massively parallel distance-based outlier detection. Computing 104, 2659–2684 (2022). https://doi.org/10.1007/s00607-022-01101-5
DOI: https://doi.org/10.1007/s00607-022-01101-5