Abstract
This paper presents experimental results showing that maximal substrings can serve as the elementary building blocks of documents in place of words extracted by a state-of-the-art supervised word extraction method. A maximal substring is a substring whose number of occurrences decreases whenever a single character is appended to its head or tail. The main advantage of maximal substrings is that they can be extracted efficiently in an unsupervised manner. We extract maximal substrings from a document set and represent each document as a bag of maximal substrings. We also obtain a bag-of-words representation by applying a state-of-the-art supervised word extraction method to the same document set. We then apply the same document clustering method to both representations and compare the quality of the two clustering results. To avoid overfitting, we adopt a Bayesian document clustering method based on Dirichlet compound multinomials. Our experiment shows that the clustering quality achieved with maximal substrings is good enough for them to be used in place of words extracted by a supervised word extraction method.
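The definition of a maximal substring and the bag-of-maximal-substrings representation can be illustrated with a small brute-force sketch. This is not the efficient extraction procedure the abstract refers to; it counts every substring up to a length cap directly, which is only practical for toy inputs. The function names and the `max_len` and `min_count` parameters are assumptions introduced here for illustration and do not come from the paper.

```python
from collections import Counter


def substring_counts(docs, max_len):
    """Count (overlapping) occurrences of every substring up to max_len."""
    counts = Counter()
    for doc in docs:
        n = len(doc)
        for i in range(n):
            for j in range(i + 1, min(n, i + max_len) + 1):
                counts[doc[i:j]] += 1
    return counts


def maximal_substrings(docs, max_len=8, min_count=2):
    """Return substrings whose count drops under any one-character extension.

    max_len and min_count are illustrative assumptions, not paper settings.
    """
    # Count one character beyond max_len so that extensions of the
    # longest candidates are also available.
    counts = substring_counts(docs, max_len + 1)

    # For each substring, record the highest count among its
    # one-character extensions at the head (left) and tail (right).
    best_left = Counter()
    best_right = Counter()
    for t, c in counts.items():
        if len(t) >= 2:
            best_left[t[1:]] = max(best_left[t[1:]], c)
            best_right[t[:-1]] = max(best_right[t[:-1]], c)

    return {
        s
        for s, c in counts.items()
        if len(s) <= max_len
        and c >= min_count
        and best_left[s] < c
        and best_right[s] < c
    }


def bag_of_maximal_substrings(doc, vocab):
    """Represent one document as counts of the maximal substrings in vocab."""
    bag = Counter()
    for s in vocab:
        n = sum(1 for i in range(len(doc) - len(s) + 1) if doc.startswith(s, i))
        if n:
            bag[s] = n
    return bag


docs = ["abcabc"]
vocab = maximal_substrings(docs)
bags = [bag_of_maximal_substrings(d, vocab) for d in docs]
```

In the toy input `"abcabc"`, the substring `"ab"` is not maximal because appending `"c"` to its tail leaves the count unchanged, whereas `"abc"` is maximal because every one-character extension occurs fewer times.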
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Masada, T., Takasu, A., Shibata, Y., Oguri, K. (2012). Clustering Documents with Maximal Substrings. In: Zhang, R., Zhang, J., Zhang, Z., Filipe, J., Cordeiro, J. (eds) Enterprise Information Systems. ICEIS 2011. Lecture Notes in Business Information Processing, vol 102. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-29958-2_2
DOI: https://doi.org/10.1007/978-3-642-29958-2_2
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-29957-5
Online ISBN: 978-3-642-29958-2
eBook Packages: Computer Science, Computer Science (R0)