Abstract
We propose a special type of time series, which we call an item-set time series, to facilitate the temporal analysis of software version histories, email logs, stock market data, and similar data. In an item-set time series, each observed data value is a set of discrete items. We formalize the concept of an item-set time series and present efficient algorithms for segmenting a given item-set time series. Segmentation partitions the time series into a sequence of segments, where each segment is constructed by combining consecutive time points of the time series. Each segment is associated with an item set that is computed from the item sets of the time points in that segment, using a function which we call a measure function. We then define a concept called the segment difference, which measures the difference between the item set of a segment and the item sets of the time points in that segment. The segment difference values are required to construct an optimal segmentation of the time series. We describe novel and efficient algorithms to compute segment difference values for each of the measure functions described in the paper. We outline a dynamic programming based scheme to construct an optimal segmentation of the given item-set time series. We use the item-set time series segmentation techniques to analyze the temporal content of three different data sets: Enron email, stock market data, and a synthetic data set. The experimental results show that an optimal segmentation of item-set time series data captures much more temporal content than a segmentation that is constructed based only on the number of time points in each segment, without examining the item-set data at the time points, and that optimal segmentation can be used to analyze different types of temporal data.
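The pipeline described in the abstract (measure function, segment difference, dynamic programming) can be illustrated with a minimal Python sketch. It rests on several assumptions that the abstract does not confirm: the measure function is taken to be set union, the segment difference is taken to be the summed size of the symmetric difference between the segment's item set and each time point's item set, and the objective is the minimum total segment difference over exactly `k` segments. The names `union_measure`, `segment_difference`, and `optimal_segmentation` are hypothetical, and the segment differences are computed naively here rather than with the efficient algorithms the chapter describes.

```python
from typing import Callable, FrozenSet, List, Sequence, Tuple

ItemSet = FrozenSet[str]
Measure = Callable[[Sequence[ItemSet]], ItemSet]


def union_measure(points: Sequence[ItemSet]) -> ItemSet:
    """Assumed measure function: a segment's item set is the union of its points."""
    return frozenset().union(*points)


def segment_difference(points: Sequence[ItemSet],
                       measure: Measure = union_measure) -> int:
    """Assumed difference: total symmetric-difference size between the
    segment's item set and the item set of each time point in the segment."""
    segment_set = measure(points)
    return sum(len(segment_set ^ p) for p in points)


def optimal_segmentation(series: Sequence[ItemSet], k: int,
                         measure: Measure = union_measure
                         ) -> Tuple[int, List[Tuple[int, int]]]:
    """Split `series` into exactly k segments of consecutive time points,
    minimizing the sum of segment differences. Returns the total cost and
    the segments as half-open index ranges (start, end)."""
    n = len(series)
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= len(series)")

    # cost[i][j] is the difference of the segment covering series[i:j].
    # This naive precomputation is for illustration only.
    cost = [[0] * (n + 1) for _ in range(n + 1)]
    for i in range(n):
        for j in range(i + 1, n + 1):
            cost[i][j] = segment_difference(series[i:j], measure)

    inf = float("inf")
    # best[s][j]: minimum cost of covering series[0:j] with s segments.
    best = [[inf] * (n + 1) for _ in range(k + 1)]
    back = [[0] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0
    for s in range(1, k + 1):
        for j in range(s, n + 1):
            for i in range(s - 1, j):
                candidate = best[s - 1][i] + cost[i][j]
                if candidate < best[s][j]:
                    best[s][j] = candidate
                    back[s][j] = i

    # Walk the back-pointers from the end to recover segment boundaries.
    segments: List[Tuple[int, int]] = []
    j = n
    for s in range(k, 0, -1):
        i = back[s][j]
        segments.append((i, j))
        j = i
    segments.reverse()
    return best[k][n], segments
```

For a series whose first two time points hold `{a, b}` and whose last two hold `{c}`, requesting two segments under these assumptions places the boundary between the second and third time points, because each resulting segment then matches all of its time points exactly.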
Copyright information
© 2009 Springer Science + Business Media B.V.
About this chapter
Cite this chapter
Chundi, P., Rosenkrantz, D.J. (2009). Efficient Algorithms for Segmentation of Item-Set Time Series. In: Ravi, S.S., Shukla, S.K. (eds) Fundamental Problems in Computing. Springer, Dordrecht. https://doi.org/10.1007/978-1-4020-9688-4_10
DOI: https://doi.org/10.1007/978-1-4020-9688-4_10
Publisher Name: Springer, Dordrecht
Print ISBN: 978-1-4020-9687-7
Online ISBN: 978-1-4020-9688-4
eBook Packages: Computer Science, Computer Science (R0)