Abstract
The analysis of static data is one of the best-studied research areas, yet data changes over time. These changes may reveal patterns or groups of similar values, properties, and entities. We study changes in large, publicly available data repositories by modelling them as time series and clustering these series by their similarity. To explore changes in real-world data, we use the publicly available revision data of Wikipedia Infoboxes and weekly snapshots of IMDB.
We capture the changes to the data as events, which we call change records. To extract temporal behavior, we count the changes per time period and propose a general transformation framework that aggregates groups of changes into numerical time series of different resolutions. We use these time series to study different application scenarios of unsupervised clustering. Our exploratory results show that changes made to collaboratively edited data sources can help find characteristic behavior, distinguish entities or properties, and provide insight into the respective domains.
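The transformation described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `ChangeRecord` fields, the function names, the normalization by total change count, and the use of plain k-means with squared Euclidean distance are all assumptions made for illustration. The sketch counts change records per group in fixed-width time buckets, where the bucket width sets the resolution, and then clusters the resulting series.

```python
from collections import defaultdict
from datetime import date
from typing import NamedTuple


class ChangeRecord(NamedTuple):
    # Hypothetical schema: the day of a change and the entity/property it affected.
    day: date
    entity: str
    prop: str


def to_time_series(records, key, start, end, bucket_days):
    """Count changes per group, given by key(record), in buckets of bucket_days days."""
    n_buckets = (end - start).days // bucket_days + 1
    series = defaultdict(lambda: [0] * n_buckets)
    for r in records:
        if start <= r.day <= end:
            series[key(r)][(r.day - start).days // bucket_days] += 1
    return dict(series)


def normalize(ts):
    """Scale a series to sum to 1 so that shape, not volume, drives similarity (assumption)."""
    total = sum(ts)
    return [v / total for v in ts] if total else list(ts)


def sq_dist(a, b):
    """Squared Euclidean distance between two equally long series."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(vectors, k, iterations=10):
    """Plain k-means with a deterministic start: the first k vectors are the centroids."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each series joins its nearest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(v, centroids[c])) for v in vectors]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Toy change records (invented for illustration, not taken from Wikipedia or IMDB).
records = [
    ChangeRecord(date(2017, 1, 2), "A", "population"),
    ChangeRecord(date(2017, 1, 3), "A", "population"),
    ChangeRecord(date(2017, 1, 20), "B", "population"),
    ChangeRecord(date(2017, 1, 21), "B", "mayor"),
    ChangeRecord(date(2017, 1, 4), "C", "mayor"),
]

# Weekly resolution, grouped by entity; key=lambda r: r.prop would group by property.
series = to_time_series(records, key=lambda r: r.entity,
                        start=date(2017, 1, 1), end=date(2017, 1, 28), bucket_days=7)
names = sorted(series)
labels = k_means([normalize(series[n]) for n in names], k=2)
clusters = dict(zip(names, labels))
```

Changing `bucket_days` yields a coarser or finer series from the same change records, and changing `key` switches between entity-level and property-level grouping. In this toy data, entities A and C change only in the first week and therefore fall into the same cluster, while B, which changes in the third week, is separated from them.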
Notes
The parser is available at: https://github.com/HPI-Information-Systems/IMDBParser
About this article
Cite this article
Bornemann, L., Bleifuß, T., Kalashnikov, D. et al. Data Change Exploration Using Time Series Clustering. Datenbank Spektrum 18, 79–87 (2018). https://doi.org/10.1007/s13222-018-0285-x
DOI: https://doi.org/10.1007/s13222-018-0285-x