Abstract
In this paper, we describe a document clustering method called novelty-based document clustering, which clusters documents based on both similarity and novelty. The method assigns higher weights to recent documents than to older ones and generates clusters that focus on recent topics. The similarity function is derived probabilistically: it extends the conventional cosine measure of the vector space model by incorporating a document forgetting model, so that the resulting clusters are novelty-based. The clustering procedure is a variation of the K-means method. An additional feature of our clustering method is an incremental update facility, which is applied when new documents are added to a document repository. We examine the performance of the clustering method through experiments. The experimental results demonstrate the efficiency and effectiveness of our method.
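To make the idea of combining similarity with novelty concrete, the following is a minimal sketch rather than the similarity function derived in the paper. It assumes an exponential forgetting model with a half-life parameter and simply scales the cosine measure by the weights of both documents. The function names, the `half_life_days` parameter, and the multiplicative form are all illustrative assumptions that the abstract does not specify.

```python
import math

def forgetting_weight(age_days, half_life_days):
    """Assumed forgetting model: the weight halves every half_life_days."""
    return 0.5 ** (age_days / half_life_days)

def cosine(u, v):
    """Conventional cosine measure over sparse term-weight dictionaries."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def novelty_similarity(u, age_u, v, age_v, half_life_days=7.0):
    """Hypothetical novelty-aware similarity: cosine scaled by both weights.

    This multiplicative form is an assumption made for illustration only.
    """
    w_u = forgetting_weight(age_u, half_life_days)
    w_v = forgetting_weight(age_v, half_life_days)
    return w_u * w_v * cosine(u, v)

# Two documents with identical term vectors but different ages.
doc = {"cluster": 1.0, "novelty": 1.0}
fresh_pair = novelty_similarity(doc, 0, doc, 0)
stale_pair = novelty_similarity(doc, 7, doc, 7)
```

Under these assumptions, a pair of old documents scores lower than an otherwise identical pair of recent ones, so a K-means-style procedure that uses such a score would tend to form clusters around recent topics.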
Cite this article
Khy, S., Ishikawa, Y. & Kitagawa, H. A Novelty-based Clustering Method for On-line Documents. World Wide Web 11, 1–37 (2008). https://doi.org/10.1007/s11280-007-0018-9
DOI: https://doi.org/10.1007/s11280-007-0018-9