A Novelty-based Clustering Method for On-line Documents

Khy, Sophoin; Ishikawa, Yoshiharu; Kitagawa, Hiroyuki

doi:10.1007/s11280-007-0018-9

Published: 17 February 2007

Volume 11, pages 1–37, (2008)

World Wide Web

Sophoin Khy¹,
Yoshiharu Ishikawa² &
Hiroyuki Kitagawa¹,³

Abstract

In this paper, we describe a document clustering method called novelty-based document clustering, which clusters documents based on both similarity and novelty. The method assigns higher weights to recent documents than to older ones and generates clusters that focus on recent topics. The similarity function is derived probabilistically: it extends the conventional cosine measure of the vector space model by incorporating a document forgetting model, which yields novelty-based clusters. The clustering procedure is a variation of the K-means method. An additional feature of the method is an incremental update facility, which is applied when new documents are incorporated into a document repository. We examine the performance of the clustering method through experiments, and the results show that the method is both efficient and effective.
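
The abstract does not state the similarity formula, so the following Python sketch only illustrates the general idea of combining a cosine measure with a forgetting model. It assumes an exponential decay with a half-life parameter and a similarity equal to the product of the two documents' weights and their cosine similarity. The names `forgetting_weight`, `novelty_similarity`, and `half_life_days`, as well as the toy term-frequency dictionaries, are hypothetical and are not taken from the paper.

```python
import math


def forgetting_weight(age_days, half_life_days):
    """Assumed forgetting model: the weight halves every half_life_days."""
    return 0.5 ** (age_days / half_life_days)


def cosine(tf_a, tf_b):
    """Conventional cosine similarity between two term-frequency dicts."""
    shared_terms = tf_a.keys() & tf_b.keys()
    dot = sum(tf_a[t] * tf_b[t] for t in shared_terms)
    norm_a = math.sqrt(sum(v * v for v in tf_a.values()))
    norm_b = math.sqrt(sum(v * v for v in tf_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def novelty_similarity(doc_a, doc_b, today, half_life_days):
    """Assumed combination: cosine similarity scaled by both documents' weights.

    Each document is a dict with a 'day' (arrival day as a number) and a
    'tf' (term-frequency dict). Older documents contribute less.
    """
    w_a = forgetting_weight(today - doc_a["day"], half_life_days)
    w_b = forgetting_weight(today - doc_b["day"], half_life_days)
    return w_a * w_b * cosine(doc_a["tf"], doc_b["tf"])


# Two pairs with identical content: one pair is new, the other is 14 days old.
terms = {"election": 2, "vote": 1}
new_pair = novelty_similarity(
    {"day": 30, "tf": terms}, {"day": 30, "tf": terms},
    today=30, half_life_days=7,
)
old_pair = novelty_similarity(
    {"day": 16, "tf": terms}, {"day": 16, "tf": terms},
    today=30, half_life_days=7,
)
```

Under these assumptions, the older pair scores lower than the new pair even though the text is identical, which is the behavior a K-means style procedure would need in order to favor recent topics when assigning documents to clusters.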

Author information

Authors and Affiliations

Graduate School of Systems and Information Engineering, University of Tsukuba, 1-1-1 Tennohdai, Tsukuba, Ibaraki, 305-8573, Japan
Sophoin Khy & Hiroyuki Kitagawa
Information Technology Center, Nagoya University, Furo-cho, Chikusa-ku, Nagoya, 464-8601, Japan
Yoshiharu Ishikawa
Center for Computational Sciences, University of Tsukuba, 1-1-1 Tennohdai, Tsukuba, Ibaraki, 305-8573, Japan
Hiroyuki Kitagawa

Corresponding author

Correspondence to Sophoin Khy.

About this article

Cite this article

Khy, S., Ishikawa, Y. & Kitagawa, H. A Novelty-based Clustering Method for On-line Documents. World Wide Web 11, 1–37 (2008). https://doi.org/10.1007/s11280-007-0018-9

Published: 17 February 2007
Issue Date: March 2008
DOI: https://doi.org/10.1007/s11280-007-0018-9
