Abstract
Instant communication technologies such as Instant Messaging (IM) are now in widespread use. For such large-scale mass communication media, clustering message text is a practical way to analyze the characteristics of message content and to detect or track hot social topics. However, a single instant message usually contains few keywords, and these may even be latent; moreover, a single message cannot capture the conversational context. This differs sharply from general documents and makes common clustering algorithms unsuitable. A novel method called WR-KMeans is proposed, which merges related instant messages into a conversation and enriches the conversation's vector with words that do not occur in the conversation but are closely related to words that do. WR-KMeans then performs k-means-style clustering on this extended vector space of conversations. Experiments on public datasets show that WR-KMeans outperforms the traditional k-means and bisecting k-means algorithms.
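The idea of an extended conversation vector can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the word-relatedness measure, the weighting scheme, or how messages are grouped into conversations, so the `RELATED` table, the `alpha` damping factor, the toy vocabulary, and the pre-grouped conversations below are all assumptions made for illustration.

```python
from collections import Counter
import math

# Hypothetical relatedness scores in [0, 1]; the paper's actual measure
# is not given in the abstract.
RELATED = {
    ("match", "football"): 0.8,
    ("goal", "football"): 0.7,
    ("exam", "school"): 0.8,
    ("homework", "school"): 0.7,
}

def relatedness(a, b):
    """Symmetric lookup in the assumed relatedness table."""
    return RELATED.get((a, b)) or RELATED.get((b, a)) or 0.0

def conversation_vector(messages, vocab, alpha=0.5):
    """Term-frequency vector of a conversation, extended with related words."""
    tf = Counter(w for m in messages for w in m.lower().split() if w in vocab)
    vec = []
    for term in vocab:
        if tf[term] > 0:
            vec.append(float(tf[term]))
        else:
            # Absent term: borrow damped weight from the most related present word.
            borrowed = max((relatedness(term, w) * c for w, c in tf.items()),
                           default=0.0)
            vec.append(alpha * borrowed)
    return vec

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def kmeans(vectors, k, iters=10):
    """Plain k-means with cosine similarity; the first k vectors seed the centroids."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iters):
        labels = [max(range(k), key=lambda j: cosine(v, centroids[j]))
                  for v in vectors]
        for j in range(k):
            members = [v for v, l in zip(vectors, labels) if l == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

VOCAB = ["match", "goal", "football", "exam", "homework", "school"]
# Each inner list is one conversation, i.e. related messages already grouped.
conversations = [
    ["great match tonight", "yes what a match"],
    ["exam tomorrow", "the exam is hard"],
    ["that goal was amazing"],
    ["so much homework"],
]
vectors = [conversation_vector(c, VOCAB) for c in conversations]
labels = kmeans(vectors, k=2)
print(labels)  # [0, 1, 0, 1]
```

In this toy data the "goal" and "match" conversations share no words, and neither do the "homework" and "exam" conversations. They end up in the same clusters only because the extension step gives each of them a nonzero weight on a related word that never appears ("football" or "school").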
This project is sponsored by the National 863 High Technology Development Foundation (No. 2006AA01Z451, No. 2006AA10Z237).
Copyright information
© 2007 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Wang, L., Jia, Y., Han, W. (2007). Instant Message Clustering Based on Extended Vector Space Model. In: Kang, L., Liu, Y., Zeng, S. (eds) Advances in Computation and Intelligence. ISICA 2007. Lecture Notes in Computer Science, vol 4683. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-74581-5_48
DOI: https://doi.org/10.1007/978-3-540-74581-5_48
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-74580-8
Online ISBN: 978-3-540-74581-5
eBook Packages: Computer Science (R0)