Streaming Approach to Schema Profiling

Forresi, Chiara; Francia, Matteo; Gallinucci, Enrico; Golfarelli, Matteo

doi:10.1007/978-3-031-42941-5_19

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1850)

Included in the following conference series:

European Conference on Advances in Databases and Information Systems

Abstract

Schema profiling consists of producing key insights about the schema of data in a high-variety context. In this paper, we present a streaming approach to schema profiling, where heterogeneous data is continuously ingested from multiple sources, as is typical in many IoT applications (e.g., with multiple devices or applications dynamically logging messages). The produced profile is a clustering of the schemas extracted from the data; it is computed and evolved in real time under the overlapping sliding window paradigm. The approach is based on two-phase k-means clustering, which entails pre-aggregating the data into a coreset and incrementally updating the previous clustering results rather than recomputing them at every iteration. Unlike previous proposals, the approach works in a domain where dimensionality is variable and unknown a priori, automatically selects the optimal number of clusters, and detects cluster evolution while minimizing the need to recompute the profile. The experimental evaluation demonstrated the effectiveness and efficiency of the approach compared with a naïve baseline and with state-of-the-art stream clustering algorithms.
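
The abstract does not detail how schemas are extracted or how the coreset is built, so the following Python sketch is only an illustration under stated assumptions: a schema is taken to be the set of attribute paths of a JSON-like message, and the pre-aggregation step is approximated by keeping one weighted entry per distinct schema inside an overlapping sliding window. The names extract_schema and SlidingSchemaWindow, and the count-based window parameters, are hypothetical and not taken from the paper; the clustering phase itself is left out.

```python
from collections import Counter, deque


def extract_schema(message, prefix=""):
    """Return the set of attribute paths of a (possibly nested) dict.

    Assumption: a schema is just the set of attribute names, with nested
    attributes flattened into dotted paths (e.g. "gps.lat").
    """
    paths = set()
    for key, value in message.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            paths |= extract_schema(value, prefix=path + ".")
        else:
            paths.add(path)
    return frozenset(paths)


class SlidingSchemaWindow:
    """Overlapping, count-based sliding window over message schemas.

    Keeps one weighted entry per distinct schema, a simple stand-in for
    the pre-aggregation (coreset) step mentioned in the abstract.
    """

    def __init__(self, window_size, slide):
        self.window_size = window_size  # messages covered by one window
        self.slide = slide              # messages between two summaries
        self.buffer = deque()           # schemas in the window, oldest first
        self.weights = Counter()        # distinct schema -> occurrences
        self.since_last = 0

    def add(self, message):
        """Ingest one message; return the weighted summary on each slide."""
        schema = extract_schema(message)
        self.buffer.append(schema)
        self.weights[schema] += 1
        # Evict the oldest schema once the window is full.
        if len(self.buffer) > self.window_size:
            expired = self.buffer.popleft()
            self.weights[expired] -= 1
            if self.weights[expired] == 0:
                del self.weights[expired]
        self.since_last += 1
        if self.since_last == self.slide:
            self.since_last = 0
            return dict(self.weights)
        return None
```

With window_size=4 and slide=2, consecutive summaries share two messages, which is what makes the windows overlapping. Each summary could then be passed to a weighted clustering step, so that its cost depends on the number of distinct schemas rather than on the number of messages.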

Author information

Authors and Affiliations

University of Bologna, Cesena, Italy
Chiara Forresi, Matteo Francia, Enrico Gallinucci & Matteo Golfarelli

Corresponding author

Correspondence to Enrico Gallinucci.

Editor information

Editors and Affiliations

Universitat Politècnica de Catalunya, Barcelona, Spain
Alberto Abelló
University of Ioannina, Ioannina, Greece
Panos Vassiliadis
Universitat Politècnica de Catalunya, Barcelona, Spain
Oscar Romero
Poznań University of Technology, Poznan, Poland
Robert Wrembel
University of Paris-Saclay, Gif-sur-Yvette, France
Francesca Bugiotti
Free University of Bozen-Bolzano, Bozen-Bolzano, Italy
Johann Gamper
CNRS, Villeurbanne Cedex, France
Genoveva Vargas Solar
University of Calabria, Rende, Italy
Ester Zumpano

About this paper

Cite this paper

Forresi, C., Francia, M., Gallinucci, E., Golfarelli, M. (2023). Streaming Approach to Schema Profiling. In: Abelló, A., et al. New Trends in Database and Information Systems. ADBIS 2023. Communications in Computer and Information Science, vol 1850. Springer, Cham. https://doi.org/10.1007/978-3-031-42941-5_19

DOI: https://doi.org/10.1007/978-3-031-42941-5_19
Published: 31 August 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-42940-8
Online ISBN: 978-3-031-42941-5
eBook Packages: Computer Science, Computer Science (R0)