Abstract
The probabilistic real-time automaton (PRTA) is a representation of dynamic processes arising in the sciences and industry. Currently, the induction of automata is divided into two steps: the creation of the prefix tree acceptor (PTA) and the merge procedure based on clustering of the states. Both steps can be very time-intensive when a PRTA is to be induced for massive or even unbounded datasets. The merge step can be processed efficiently, as scalable online clustering algorithms exist. However, the creation of the PTA can still be very time-consuming. To overcome this problem, we propose a genuine online PRTA induction approach that incorporates new instances by first collapsing them and then using a maximum frequent pattern based clustering. The approach is tested against a predefined synthetic automaton and real-world datasets, on which it is scalable and stable. Moreover, we present a broad evaluation on a real-world disease group dataset that shows the applicability of such a model to the analysis of medical processes.
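To make the first of the two steps concrete, the following is a minimal sketch of a batch prefix tree acceptor built from timed event sequences. It illustrates only the conventional PTA construction that the abstract identifies as the bottleneck, not the online approach proposed in the paper. The input format (lists of `(symbol, delay)` pairs) and the names `PTANode`, `build_pta`, and `transition_probability` are assumptions made for illustration.

```python
from collections import defaultdict


class PTANode:
    """One state of the prefix tree (hypothetical helper class)."""

    def __init__(self):
        self.children = {}               # event symbol -> child PTANode
        self.count = 0                   # sequences that pass through this state
        self.delays = defaultdict(list)  # event symbol -> observed time delays


def build_pta(sequences):
    """Insert every timed sequence into a shared prefix tree."""
    root = PTANode()
    for seq in sequences:
        node = root
        node.count += 1
        for symbol, delay in seq:
            # Record the delay on the outgoing transition, then follow it,
            # creating the target state if this prefix is new.
            node.delays[symbol].append(delay)
            node = node.children.setdefault(symbol, PTANode())
            node.count += 1
    return root


def transition_probability(node, symbol):
    """Relative frequency of taking `symbol` from `node`."""
    child = node.children.get(symbol)
    if child is None or node.count == 0:
        return 0.0
    return child.count / node.count


# Assumed toy data: each event is (symbol, delay since previous event).
sequences = [
    [("a", 1.0), ("b", 2.0)],
    [("a", 1.5), ("c", 0.5)],
    [("b", 3.0)],
]
root = build_pta(sequences)
```

Because every sequence is stored prefix by prefix, the tree grows with the number of distinct prefixes in the data, which is one way to see why this step becomes costly on massive or unbounded datasets.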
Additional information
A preliminary version of the paper was published in the Proceedings of ICDM 2012.
Cite this article
Schmidt, J., Kramer, S. Online Induction of Probabilistic Real-Time Automata. J. Comput. Sci. Technol. 29, 345–360 (2014). https://doi.org/10.1007/s11390-014-1435-8
DOI: https://doi.org/10.1007/s11390-014-1435-8