Abstract
Schema profiling consists of producing key insights about the schema of data in a high-variety context. In this paper, we present a streaming approach to schema profiling, in which heterogeneous data is continuously ingested from multiple sources, as is typical in many IoT applications (e.g., multiple devices or applications dynamically logging messages). The produced profile is a clustering of the schemas extracted from the data; it is computed and evolved in real time under the overlapping sliding window paradigm. The approach is based on two-phase k-means clustering, which entails pre-aggregating the data into a coreset and incrementally updating the previous clustering result rather than recomputing it at every iteration. Unlike previous proposals, the approach works in a domain where dimensionality is variable and unknown a priori, automatically selects the optimal number of clusters, and detects cluster evolution while minimizing the need to recompute the profile. The experimental evaluation demonstrates the effectiveness and efficiency of the approach against the naïve baseline and state-of-the-art stream clustering algorithms.
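To make the pre-aggregation idea more concrete, the following is a minimal sketch of how schemas might be extracted from JSON-like messages and summarized per overlapping sliding window. It is not the authors' implementation: the abstract does not specify how the coreset is built, so the sketch simply assumes that a coreset is the set of distinct schemas in the window, each weighted by its number of occurrences. The names `extract_schema`, `jaccard_distance`, `sliding_coresets`, and `window_panes` are hypothetical, and the Jaccard distance is only one plausible dissimilarity between schemas of variable size.

```python
from collections import Counter, deque
from typing import Iterable, Iterator


def extract_schema(doc: dict, prefix: str = "") -> frozenset:
    """Return the set of dotted attribute paths found in a JSON-like document."""
    paths = set()
    for key, value in doc.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict) and value:
            # Recurse into non-empty nested objects.
            paths |= extract_schema(value, prefix=path + ".")
        else:
            paths.add(path)
    return frozenset(paths)


def jaccard_distance(a: frozenset, b: frozenset) -> float:
    """Set-based distance; works even when schemas have different sizes."""
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def sliding_coresets(panes: Iterable[list], window_panes: int) -> Iterator[Counter]:
    """Yield one weighted schema summary per overlapping window.

    Assumption: the stream arrives as panes (one pane per slide step) and a
    window covers the last `window_panes` panes. The summary is updated
    incrementally: counts of the new pane are added and counts of the
    expired pane are subtracted, so the window is never rescanned.
    """
    recent = deque()
    coreset = Counter()
    for pane in panes:
        pane_counts = Counter(extract_schema(doc) for doc in pane)
        recent.append(pane_counts)
        coreset.update(pane_counts)
        if len(recent) > window_panes:
            coreset.subtract(recent.popleft())
            coreset = +coreset  # keep only schemas with a positive count
        yield Counter(coreset)  # copy, so callers cannot mutate our state


# Hypothetical IoT-style messages, grouped into three panes.
panes = [
    [{"id": 1, "temp": 20.5}, {"id": 2, "temp": 21.0}],
    [{"id": 3, "gps": {"lat": 1.0, "lon": 2.0}}],
    [{"id": 4, "temp": 19.0}],
]
windows = list(sliding_coresets(panes, window_panes=2))
```

In a two-phase scheme of the kind the abstract describes, a weighted clustering step would then run on each yielded summary instead of on the raw messages; that second phase, the selection of the number of clusters, and the evolution detection are not sketched here.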
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Forresi, C., Francia, M., Gallinucci, E., Golfarelli, M. (2023). Streaming Approach to Schema Profiling. In: Abelló, A., et al. New Trends in Database and Information Systems. ADBIS 2023. Communications in Computer and Information Science, vol 1850. Springer, Cham. https://doi.org/10.1007/978-3-031-42941-5_19
DOI: https://doi.org/10.1007/978-3-031-42941-5_19
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-42940-8
Online ISBN: 978-3-031-42941-5
eBook Packages: Computer Science (R0)