Clustering Documents with Maximal Substrings

Masada, Tomonari; Takasu, Atsuhiro; Shibata, Yuichiro; Oguri, Kiyoshi

doi:10.1007/978-3-642-29958-2_2

Part of the book series: Lecture Notes in Business Information Processing (LNBIP, volume 102)

Included in the following conference series:

International Conference on Enterprise Information Systems

Abstract

This paper presents experimental results showing that maximal substrings can serve as elementary building blocks of documents in place of words extracted by a state-of-the-art supervised word extraction method. A maximal substring is a substring whose number of occurrences decreases when even a single character is appended to its head or tail. The main advantage of maximal substrings is that they can be extracted efficiently in an unsupervised manner. We extract maximal substrings from a document set and represent each document as a bag of maximal substrings. We also obtain a bag-of-words representation by applying a state-of-the-art supervised word extraction method to the same document set. We then apply the same document clustering method to both representations and compare the quality of the two clustering results. To avoid overfitting, we adopt a Bayesian document clustering method based on Dirichlet compound multinomials. Our experiment shows that the clustering quality achieved with maximal substrings is acceptable enough to use them in place of words extracted by supervised word extraction.
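To make the definition concrete, the following is a minimal brute-force sketch of maximal substring extraction and of the bag-of-maximal-substrings representation. It is not the authors' implementation: the abstract says only that extraction is efficient, and the reference list points to suffix arrays, whereas this sketch enumerates every substring in quadratic time for readability. The function names and the `min_count` frequency threshold are assumptions introduced for illustration.

```python
from collections import Counter


def substring_counts(docs):
    """Count overlapping occurrences of every substring across all documents."""
    counts = Counter()
    for doc in docs:
        n = len(doc)
        for i in range(n):
            for j in range(i + 1, n + 1):
                counts[doc[i:j]] += 1
    return counts


def maximal_substrings(docs, min_count=2):
    """Return {substring: count} for substrings whose count drops under
    every one-character extension at the head or tail.

    min_count is a hypothetical threshold, not taken from the paper.
    """
    counts = substring_counts(docs)
    not_maximal = set()
    for t, c in counts.items():
        if len(t) < 2:
            continue
        # t extends t[1:] at the head and t[:-1] at the tail. An extension
        # never occurs more often than the shorter string, so an equal
        # count means the shorter string is not maximal.
        if counts[t[1:]] == c:
            not_maximal.add(t[1:])
        if counts[t[:-1]] == c:
            not_maximal.add(t[:-1])
    return {s: c for s, c in counts.items()
            if c >= min_count and s not in not_maximal}


def count_overlapping(text, pattern):
    """Count occurrences of pattern in text, allowing overlaps."""
    count, start = 0, text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def bag_of_maximal_substrings(doc, vocab):
    """Represent one document as counts of the maximal substrings it contains."""
    bag = {s: count_overlapping(doc, s) for s in vocab}
    return {s: c for s, c in bag.items() if c > 0}


docs = ["東京都", "京都"]
vocab = maximal_substrings(docs)               # {"京都": 2}
bag = bag_of_maximal_substrings(docs[0], vocab)  # {"京都": 1}
```

In this toy corpus "京" and "都" are not maximal because "京都" occurs just as often as each of them, while "京都" is maximal because its only extension, "東京都", occurs less often. No word segmenter is involved, which is the unsupervised property the abstract emphasizes.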

Author information

Authors and Affiliations

Nagasaki University, 1-14 Bunkyo-machi, Nagasaki-shi, Nagasaki, Japan
Tomonari Masada, Yuichiro Shibata & Kiyoshi Oguri
National Institute of Informatics, 2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo, Japan
Atsuhiro Takasu

Editor information

Editors and Affiliations

School of Economics and Management, Beijing Jiaotong University, Beijing, China
Runtong Zhang, Juliang Zhang & Zhenji Zhang
Instituto Politécnico de Setúbal (IPS), Institute for Systems and Technologies of Information, Control and Communication (INSTICC), Setúbal, Portugal
Joaquim Filipe & José Cordeiro

About this paper

Cite this paper

Masada, T., Takasu, A., Shibata, Y., Oguri, K. (2012). Clustering Documents with Maximal Substrings. In: Zhang, R., Zhang, J., Zhang, Z., Filipe, J., Cordeiro, J. (eds) Enterprise Information Systems. ICEIS 2011. Lecture Notes in Business Information Processing, vol 102. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-29958-2_2

DOI: https://doi.org/10.1007/978-3-642-29958-2_2
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-29957-5
Online ISBN: 978-3-642-29958-2
eBook Packages: Computer Science, Computer Science (R0)
