Adaptivity in continuous massively parallel distance-based outlier detection

Toliopoulos, Theodoros; Gounaris, Anastasios

doi:10.1007/s00607-022-01101-5

Regular Paper
Published: 12 July 2022

Computing, volume 104, pages 2659–2684 (2022)

Abstract

We deal with the problem of dynamically allocating the workload to multiple workers in massively parallel continuous distance-based outlier detection, where the workload is conceptually split into contiguous overlapping regions. The main challenges are threefold: modern stream processing frameworks, such as Apache Flink and Spark Streaming, do not support feedback loops; the process is stateful; and the adaptations do not result in key redistribution but in modifying the region boundaries associated with each key. These challenges correspond to overlooked issues, which call for novel solutions that we provide in our work. First, we propose an architecture for allowing such adaptations in Flink. Second, we propose specific techniques for adaptive region definition that are applicable to any distance metric. Finally, we conduct a thorough experimental evaluation, and our results show that our proposal is both efficient and effective even in small finite streams. In addition, our proposal is shown to be insensitive to the exact continuous outlier detection algorithm and outlier query parameters.
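
The abstract does not spell out the mechanics, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes the standard distance-based outlier definition (a point is an outlier if it has fewer than k neighbours within distance R), a one-dimensional value domain, and a simple replication rule in which a point is copied to every region within R of it. The names `distance_outliers`, `route`, `R`, `k`, and `boundaries` are hypothetical and chosen for illustration.

```python
from bisect import bisect_right


def distance_outliers(points, R, k, dist=lambda a, b: abs(a - b)):
    """Return the points with fewer than k neighbours within distance R.

    `dist` may be any metric; the default (absolute difference) is an
    assumption made for this 1-D illustration.
    """
    outliers = []
    for i, p in enumerate(points):
        neighbours = sum(
            1 for j, q in enumerate(points) if i != j and dist(p, q) <= R
        )
        if neighbours < k:
            outliers.append(p)
    return outliers


def route(value, boundaries, R):
    """Map a 1-D value to (owner_region, replica_regions).

    `boundaries` is a sorted list of cut points, so region i covers
    [boundaries[i-1], boundaries[i]). The owner is the region containing
    the value. Replicas are all other regions that intersect
    [value - R, value + R], which is what makes the regions overlap.
    """
    owner = bisect_right(boundaries, value)
    lo = bisect_right(boundaries, value - R)
    hi = bisect_right(boundaries, value + R)
    replicas = [r for r in range(lo, hi + 1) if r != owner]
    return owner, replicas
```

In this sketch, an adaptation changes only the cut points: with `boundaries = [10, 20]` the value 11 is owned by region 1, while after moving the first cut point to 12 the same value is owned by region 0. The set of region identifiers (the keys) stays the same, which mirrors the abstract's point that adaptations modify region boundaries rather than redistribute keys.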

Availability of data and material (data transparency)

All datasets used are publicly available from third-party repositories.

Notes

An early short version of this work appeared in [29]; it introduced the technique termed naive in this work, was tailored to the Euclidean space, and was evaluated using only single-dimensional numerical datasets. We significantly extend and improve upon this early work by proposing and experimenting with more eager and sophisticated techniques, while supporting arbitrary metric distances and evaluating on both numeric and text datasets with a high number of dimensions.
The implementation of the whole framework along with the techniques in this work is publicly available from https://github.com/tatoliop/PROUD-PaRallel-OUtlier-Detection-for-streams/tree/adaptive_partitioning.
https://ci.apache.org/projects/flink/flink-docs-stable/dev/stream/side_output.html.
We have experimented with additional parameters (slide sizes of 1% up to 50% of the window size) and the results are similar to the ones to be presented; due to space constraints such experiments are omitted.
To further investigate any possible correlation we have run some tests using the MMPC algorithm [6] after transforming the runtime values to a binary target variable (improvement or not-improvement); this has also not yielded any concrete results.

References

Abdelhamid AS, Mahmood AR, Daghistani A, Aref WG (2020) Prompt: Dynamic data-partitioning for distributed micro-batch stream processing systems. In: Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, pp 2455–2469
Aly AM, Mahmood AR, Hassan MS, Aref WG, Ouzzani M, Elmeleegy H, Qadah T (2015) AQWA: adaptive query-workload-aware partitioning of big spatial data. PVLDB 8(13):2062–2073
Angiulli F, Fassetti F (2007) Detecting distance-based outliers in streams of data. In: CIKM, pp 811–820
Balkesen C, Tatbul N (2011) Scalable data partitioning techniques for parallel sliding window processing over data streams. In: International Workshop on Data Management for Sensor Networks (DMSN)
Bellas C, Gounaris A (2020) An empirical evaluation of exact set similarity join techniques using gpus. Inf Syst 89:101485. https://doi.org/10.1016/j.is.2019.101485
Brown LE, Tsamardinos I, Aliferis CF (2004) A novel algorithm for scalable and accurate bayesian network learning. In: Fieschi M, Coiera EW, Li JY (eds) MEDINFO 2004 - Proceedings of the 11th World Congress on Medical Informatics, San Francisco, California, USA, September 7-11, 2004, Studies in Health Technology and Informatics, vol 107, pp 711–715
Cao L, Wang J, Rundensteiner EA (2016) Sharing-aware outlier analytics over high-volume data streams. In: ICDM, pp 527–540. ACM
Cao L, Yan Y, Kuhlman C, Wang Q, Rundensteiner EA, Eltabakh MY (2017) Multi-tactic distance-based outlier detection. In: ICDE, pp 959–970
Cao L, Yang D, Wang Q, Yu Y, Wang J, Rundensteiner EA (2014) Scalable distance-based outlier detection over high-volume data streams. In: ICDE, pp 76–87
Carbone P, Ewen S, Fóra G, Haridi S, Richter S, Tzoumas K (2017) State management in apache flink®: Consistent stateful distributed stream processing. PVLDB 10(12):1718–1729
Cordova I, Moh T (2015) DBSCAN on resilient distributed datasets. In: 2015 International Conference on High Performance Computing & Simulation, HPCS, pp 531–540
Deshpande A, Ives ZG, Raman V (2007) Adaptive query processing. Found. Trends Databases 1(1):1–140
Ding M, Chen S (2019) Efficient partitioning and query processing of spatio-temporal graphs with trillion edges. In: 2019 IEEE 35th International Conference on Data Engineering (ICDE), pp 1714–1717. IEEE
Gedik B (2014) Partitioning functions for stateful data parallelism in stream processing. VLDB J 23(4):517–539
Gill G, Dathathri R, Hoang L, Pingali K (2018) A study of partitioning policies for graph analytics on large-scale distributed platforms. Proceedings of the VLDB Endowment 12(4):321–334
Gounaris A, Yfoulis CA, Paton NW (2012) Efficient load balancing in partitioned queries under random perturbations. TAAS 7(1):5:1-5:27
Katsipoulakis NR, Labrinidis A, Chrysanthis PK (2017) A holistic view of stream partitioning costs. PVLDB 10(11):1286–1297
Knorr EM, Ng RT, Tucakov V (2000) Distance-based outliers: Algorithms and applications. VLDB J 8(3–4):237–253
Kontaki M, Gounaris A, Papadopoulos AN, Tsichlas K, Manolopoulos Y (2016) Efficient and flexible algorithms for monitoring distance-based outliers over data streams. Inf Syst 55:37–53
Monte BD, Zeuch S, Rabl T, Markl V (2020) Rhino: Efficient management of very large distributed state for stream processing engines. In: Proceedings of the 2020 International Conference on Management of Data, SIGMOD, pp 2471–2486
Rupprecht L, Culhane W, Pietzuch PR (2017) Squirreljoin: Network-aware distributed join processing with lazy partitioning. PVLDB 10(11):1250–1261
Shah MA, Hellerstein JM, Chandrasekaran S, Franklin MJ (2002) Flux: An adaptive partitioning operator for continuous query systems. In: Dayal U, Ramamritham K, Vijayaraman TM (eds) ICDE, pp 25–36
Song H, Lee J (2018) RP-DBSCAN: A superfast parallel DBSCAN algorithm based on random partitioning. In: Proceedings of the 2018 International Conference on Management of Data, SIGMOD, pp 1173–1187
Su L, Han W, Yang S, Zou P, Jia Y (2007) Continuous adaptive outlier detection on distributed data streams. In: International Conference on High Performance Computing and Communications, pp 74–85
Subramaniam S, Palpanas T, Papadopoulos D, Kalogeraki V, Gunopulos D (2006) Online outlier detection in sensor data using non-parametric models. In: VLDB, pp 187–198
Tang M, Yu Y, Malluhi QM, Ouzzani M, Aref WG (2016) Locationspark: A distributed in-memory data management system for big spatial data. PVLDB 9(13):1565–1568
To Q, Soto J, Markl V (2018) A survey of state management in big data processing systems. VLDB J 27(6):847–872
Toliopoulos T, Bellas C, Gounaris A, Papadopoulos A (2020) PROUD: parallel outlier detection for streams. In: SIGMOD (demo track, to appear)
Toliopoulos T, Gounaris A (2020) Adaptive distributed partitioning in apache flink. In: 36th IEEE International Conference on Data Engineering Workshops, ICDE Workshops 2020, Dallas, TX, USA, April 20-24, 2020, pp 127–132. IEEE
Toliopoulos T, Gounaris A, Tsichlas K, Papadopoulos A, Sampaio S (2020) Continuous outlier mining of streaming data in flink. Inf Syst 93:101569
Tran L, Fan L, Shahabi C (2016) Distance-based outlier detection in data streams. PVLDB 9(12):1089–1100
Tran L, Mun M, Shahabi C (2020) Real-time distance-based outlier detection in data streams. PVLDB 14(2):141–153
Yang D, Rundensteiner E, Ward M (2009) Neighbor-based pattern detection for windows over streaming data. In: EDBT, pp 529–540
Yang K, Gao Y, Ma R, Chen L, Wu S, Chen G (2019) Dbscan-ms: Distributed density-based clustering in metric spaces. In: 2019 IEEE 35th International Conference on Data Engineering (ICDE), pp 1346–1357. IEEE
Yianilos PN (1993) Data structures and algorithms for nearest neighbor search in general metric spaces. In: SODA, vol 93, pp 311–321
Yoon S, Lee J, Lee BS (2019) NETS: extremely fast outlier detection from a data stream via set-based processing. PVLDB 12(11):1303–1315
Zhao G, Yu Y, Song P, Zhao G, Ji Z (2018) A parameter space framework for online outlier detection over high-volume data streams. IEEE Access 6:38124–38136

Acknowledgements

This research work has been supported by the European Commission under the Horizon 2020 Programme, through funding of the LifeChamps project (Grant 875329).

Funding

European Commission under the Horizon 2020 Programme, LifeChamps project (Grant 875329).

Author information

Authors and Affiliations

Aristotle University of Thessaloniki, Thessaloniki, Greece
Theodoros Toliopoulos & Anastasios Gounaris

Corresponding author

Correspondence to Theodoros Toliopoulos.

Ethics declarations

Conflict of interest

The authors have no conflicts of interest to declare that are relevant to the content of this article.

Code availability (software application or custom code)

All code is available from https://github.com/tatoliop/PROUD-PaRallel-OUtlier-Detection-for-streams/tree/adaptive_partitioning.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Toliopoulos, T., Gounaris, A. Adaptivity in continuous massively parallel distance-based outlier detection. Computing 104, 2659–2684 (2022). https://doi.org/10.1007/s00607-022-01101-5

Received: 16 December 2021
Accepted: 16 June 2022
Published: 12 July 2022
Issue Date: December 2022
DOI: https://doi.org/10.1007/s00607-022-01101-5


