Abstract
In this research, the NTSO (Neural Text Self Organizer) is proposed as an approach to text clustering. Traditional approaches to text clustering require documents to be encoded into numerical vectors. This encoding causes two main problems: huge dimensionality and sparse distribution. The idea of this research is to encode documents into string vectors instead and to use the NTSO to cluster them. As empirical validation, we compare the NTSO with other text clustering approaches with respect to speed and performance.
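The contrast between the two encodings can be illustrated with a small sketch. The abstract does not say how string vectors are built, so this example assumes a string vector is simply the d most frequent words of a document, ordered by frequency. The function names, the tokenizer, and the three sample documents are hypothetical and are not taken from the paper; the NTSO itself is not implemented here.

```python
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer (an assumption, not the paper's preprocessing).
    return [w for w in text.lower().split() if w.isalpha()]


def numerical_vector(tokens, vocabulary):
    # Bag-of-words encoding: one dimension per vocabulary word, mostly zeros.
    counts = Counter(tokens)
    return [counts[w] for w in vocabulary]


def string_vector(tokens, d):
    # Assumed string vector: the d most frequent words, ties broken alphabetically.
    counts = Counter(tokens)
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return ranked[:d]


# Hypothetical sample documents, for illustration only.
docs = [
    "stocks fell as markets reacted to the rate decision",
    "the team won the final match after extra time",
    "markets rallied as the bank held the rate",
]

tokenized = [tokenize(doc) for doc in docs]
vocabulary = sorted({w for tokens in tokenized for w in tokens})

num_vectors = [numerical_vector(tokens, vocabulary) for tokens in tokenized]
str_vectors = [string_vector(tokens, 3) for tokens in tokenized]

# Fraction of zero entries across all numerical vectors (the sparsity problem).
zero_fraction = sum(v.count(0) for v in num_vectors) / (len(vocabulary) * len(docs))

print("numerical vector length:", len(vocabulary))
print("fraction of zero entries:", round(zero_fraction, 3))
print("string vectors:", str_vectors)
```

Even on this toy corpus the numerical vectors grow with the vocabulary and are mostly zeros, while each string vector keeps a fixed, small length d regardless of vocabulary size.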
Copyright information
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Jo, T. (2009). Clustering News Articles in NewsPage.com Using NTSO. In: Ślęzak, D., Kim, Th., Zhang, Y., Ma, J., Chung, Ki. (eds) Database Theory and Application. DTA 2009. Communications in Computer and Information Science, vol 64. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-10583-8_4
DOI: https://doi.org/10.1007/978-3-642-10583-8_4
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-10582-1
Online ISBN: 978-3-642-10583-8
eBook Packages: Computer Science, Computer Science (R0)