Abstract
We attempt to extract characteristic expressions from literary works. That is, given literary works by a particular writer as positive examples and works by another writer as negative examples, the problem is to find expressions that appear frequently in the positive examples but not in the negative examples. This can be regarded as a special case of optimal pattern discovery from textual data in which only substring patterns are considered. One reasonable approach is to create a list of substrings arranged in descending order of goodness and to have a human expert examine the first part of the list. Since there are no word boundaries in Japanese texts, a substring is often a fragment of a word or a phrase. How to assist the human expert is therefore a key to successful discovery. In this paper, we propose (1) restricting the list to prime substrings in order to remove redundancy, and (2) a way of browsing the neighbors of a focused string as well as its context. Using this method, we report successful results on two pairs of anthologies of classical Japanese poems. We expect that the extracted expressions may lead to the discovery of overlooked aspects of individual poets.
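The ranking step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state the goodness measure, so the score used here (the fraction of positive texts containing a substring minus the fraction of negative texts containing it) is an assumption chosen for simplicity, and the function names and the `max_len` cutoff are hypothetical. The sketch also does not implement the restriction to prime substrings, which the abstract does not define.

```python
from collections import Counter


def substrings(text, max_len):
    """Return the distinct substrings of `text` with at most `max_len` characters."""
    return {
        text[i:j]
        for i in range(len(text))
        for j in range(i + 1, min(len(text), i + max_len) + 1)
    }


def document_frequency(docs, max_len):
    """Count, for each substring, how many documents contain it at least once."""
    counts = Counter()
    for doc in docs:
        counts.update(substrings(doc, max_len))
    return counts


def rank_substrings(positive, negative, max_len=4):
    """Rank substrings of the positive texts by an ASSUMED goodness score.

    Score = (share of positive texts containing s) - (share of negative texts
    containing s). This measure is an illustrative stand-in, not the paper's.
    """
    pos = document_frequency(positive, max_len)
    neg = document_frequency(negative, max_len)
    scored = [
        (p / len(positive) - neg.get(s, 0) / len(negative), s)
        for s, p in pos.items()
    ]
    # Best score first; ties broken by longer substring, then alphabetically.
    scored.sort(key=lambda item: (-item[0], -len(item[1]), item[1]))
    return scored
```

Because the functions operate on Python strings character by character, they need no word segmentation, which matches the setting of Japanese text without word boundaries. Even on small inputs the ranked list tends to contain a string together with its own fragments at the same score, which is the kind of redundancy the abstract sets out to remove.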
© 2000 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Takeda, M., Matsumoto, T., Fukuda, T., Nanri, I. (2000). Discovering Characteristic Expressions from Literary Works: a New Text Analysis Method beyond N-Gram Statistics and KWIC. In: Arikawa, S., Morishita, S. (eds) Discovery Science. DS 2000. Lecture Notes in Computer Science, vol 1967. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-44418-1_10
DOI: https://doi.org/10.1007/3-540-44418-1_10
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-41352-3
Online ISBN: 978-3-540-44418-3
eBook Packages: Springer Book Archive