A Novelty-based Clustering Method for On-line Documents

Published in: World Wide Web

Abstract

In this paper, we describe a document clustering method called novelty-based document clustering, which clusters documents based on both similarity and novelty. The method assigns higher weights to recent documents than to older ones and generates clusters that focus on recent topics. The similarity function is derived probabilistically: it extends the conventional cosine measure of the vector space model by incorporating a document forgetting model, so that the resulting clusters are novelty-based. The clustering procedure is a variation of the K-means method. An additional feature of the method is an incremental update facility, which is applied when new documents are added to the document repository. We examine the performance of the clustering method through experiments, and the results show its efficiency and effectiveness.
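To make the idea of combining cosine similarity with a forgetting model concrete, the following is a minimal sketch and not the similarity function derived in the paper, which the abstract does not give. It assumes an exponential decay with a hypothetical `half_life_days` parameter and simply scales the cosine measure by the decay weights of both documents; the function names and the sparse dictionary representation are illustrative choices.

```python
import math

def forgetting_weight(age_days, half_life_days=7.0):
    """Assumed forgetting model: the weight halves every half_life_days."""
    return 0.5 ** (age_days / half_life_days)

def cosine(u, v):
    """Conventional cosine measure on sparse term-weight dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def novelty_similarity(u, age_u, v, age_v, half_life_days=7.0):
    """Illustrative combination (an assumption): cosine similarity
    scaled by the forgetting weights of both documents."""
    w_u = forgetting_weight(age_u, half_life_days)
    w_v = forgetting_weight(age_v, half_life_days)
    return w_u * w_v * cosine(u, v)

# Two pairs with identical content but different ages.
doc_a = {"cluster": 2.0, "novelty": 1.0}
doc_b = {"cluster": 1.0, "novelty": 1.0, "stream": 1.0}
recent = novelty_similarity(doc_a, 1, doc_b, 2)
old = novelty_similarity(doc_a, 30, doc_b, 45)
print(recent > old)
```

Under this assumed weighting, a pair of old documents scores lower than an equally similar pair of recent ones, so a K-means-style procedure driven by such a score would tend to form clusters around recent topics.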



Author information


Correspondence to Sophoin Khy.


About this article

Cite this article

Khy, S., Ishikawa, Y. & Kitagawa, H. A Novelty-based Clustering Method for On-line Documents. World Wide Web 11, 1–37 (2008). https://doi.org/10.1007/s11280-007-0018-9


  • DOI: https://doi.org/10.1007/s11280-007-0018-9
