Skip to main content
Log in

Information Preserving Time Decompositions of Time Stamped Documents*

  • Published:
Data Mining and Knowledge Discovery Aims and scope Submit manuscript

Abstract

Extraction of sequences of events from news and other documents based on the publication times of these documents has been shown to be extremely effective in tracking past events. This paper addresses the issue of constructing an optimal information preserving decomposition of the time period associated with a given document set, i.e., a decomposition with the smallest number of subintervals, subject to no loss of information. We introduce the notion of the compressed interval decomposition, where each subinterval consists of consecutive time points having identical information content. We define optimality, and show that any optimal information preserving decomposition of the time period is a refinement of the compressed interval decomposition. We define several special classes of measure functions (functions that measure the prevalence of keywords in the document set and assign them numeric values), based on their effect on the information computed as document sets are combined. We give algorithms, appropriate for different classes of measure functions, for computing an optimal information preserving decomposition of a given document set. We studied the effectiveness of these algorithms by computing several compressed interval and information preserving decompositions for a subset of the Reuters–21578 document set. The experiments support the obvious conclusion that the temporal information gleaned from a document set is strongly dependent on the measure function used and on other user-defined parameters.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Figure 1.
Figure 2.
Figure 3.
Figure 4.
Figure 5.
Figure 6.
Figure 7.
Figure 8.
Figure 9.
Figure 10.
Figure 11.
Figure 12.

Similar content being viewed by others

Notes

  1. Available at http://kdd.ics.uci.edu.

  2. Each \( \Pi_C^{w_p} \) is a partition of T D into subintervals. The product of a set of partitions is the coarsest partition that is a refinement of each of the individual partitions, and is unique (Rosen, 2003).

  3. Available at http://ldc.upenn.edu.

References

  • Agrawal, R., Psaila, G., Wimmers, E., and Zait, M. 1995. Querying shapes of histories. In Proc. of 21st International Conference on Very Large Databases.

  • Allan, J., Carbonell, J.G., Doddington, G., Yamron, J., and Yang, Y. 1998. Topic detection and tracking pilot study: Final report. In Proc. DARPA Broadcast News Transcription and Understanding Workshop.

  • Allan, J., Lavrenko, V., Malin, D., and Swan, R. 2000. Detections, bounds, and timelines: UMass and TDT-3. In Proc. 3rd Topic Detection and Tracking Workshop.

  • Berndt. D.J. and Clifford, J. 1996. Finding patterns in time Series: A dynamic programming approach. In U. M. Fayyad et al. Advances in Knowledge Discovery and Data Mining, AAAI Press.

  • Chundi, P. and Rosenkrantz, D.J. 2004. On lossy time decompositions of time stamped documents. In Proc. of the 2004 ACM International Conference on Information and Knowledge Management, pp. 437–445.

  • Cormen, T.H., Leiserson, C.E., Rivest, R.L. and Stein, C. 2001. Introduction to Algorithms, 2nd Edition. MIT Press.

  • Dougherty, J. Kohavi, R., and Sahami, M. 1995. Supervised and unsupervised discretization of continuous features. In Proc. of the 12th International Conference on Machine Learning, pp. 194–202.

  • Kleinberg, J. 2002. Bursty and hierarchical structure in streams. In Proc. 8th ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining, pp 91–101.

  • Lent, B., Agrawal, R., and Srikant, R. 1997. Discovering trends in text databases. In Proc. 3rd Intl. Conf. Knowledge Discovery and Data Mining (KDD), pp. 227–230.

  • Motoyoshi, M., Miura, T., and Watanabe, K. 2002. Mining temporal classes from time series data. In Proc. 11th Intl. Conf. on Information and Knowledge Management, pp. 493–498.

  • Rosen, K. 2003. Discrete mathematics and its applications. 5th edition. McGraw Hill College Division.

  • Roy, S., Gevry, D., and Pottenger, W. M. 2002. Methodologies for trend detection in textual data mining. In Proc. Textmine 02 Workshop, SIAM Intl. Conf. on Data Mining.

  • Chakravarti, S., Sarawagi, S., and Dom, B. 1998. Mining surprising patterns using temporal description length, In Proc. of the 24th Int'l Conference on Very Large Databases (VLDB).

  • Swan, R. and Allan, J. 1999. Extracting significant time varying features from text. In Proc. 8th Intl. Conf. on Information and Knowledge Management, pp. 38–45.

  • Swan, R. and Allan, J. 2000. Automatic generation of overview timelines. In Proc. 23rd Annual Intl. ACM SIGIR Conf. on Research and Development in Information Retrieval, pp.49–56.

  • Swan, R. and Jensen, D. 2000. Timemines: Constructing timelines with statistical models of word usage. In Proc. KDD 2000 Workshop on Text Mining.

Download references

Acknowledgments

We thank the anonymous referees for their detailed and helpful comments which greatly improved the readability of the paper.

This publication was made possible by NIH grant number P20 RR16469 from the INBRE program of the National Center for Research Resources. Its contents are solely the responsibility of the authors and do not necessarily represent the official views of NIH.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Parvathi Chundi.

Additional information

A preliminary version of this paper was presented at the 2004 SIAM International Conference on Data Mining.

Research supported by NSF Grant CCR–0105536.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Chundi, P., Rosenkrantz, D.J. Information Preserving Time Decompositions of Time Stamped Documents* . Data Min Knowl Disc 13, 41–65 (2006). https://doi.org/10.1007/s10618-005-0035-1

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s10618-005-0035-1

Keywords

Navigation