Abstract
Data mining is typically applied to large databases of highly structured information in order to discover new knowledge. In businesses and institutions, the amount of information held in repositories of text documents usually rivals or surpasses the amount found in relational databases. Though document collections can contain a great deal of potentially valuable knowledge, they are often difficult to analyze. It is therefore important to develop methods that efficiently discover the knowledge embedded in these document repositories. In this paper we describe an approach for mining knowledge from text collections by applying data mining techniques to metadata records generated via automated text categorization. By controlling the set of metadata fields as well as the set of assigned categories, we can customize the knowledge discovery task to address specific questions. As an example, we apply the approach to a large collection of product reviews and evaluate the performance of the knowledge discovery.
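The pipeline described in the abstract can be illustrated with a minimal sketch. It assumes that the data mining step is association-rule mining over pairs of field/category assignments; the abstract does not name the technique, so this is an assumption. The records, field names (product, sentiment, topic), thresholds, and the function mine_pair_rules are all hypothetical stand-ins for the output of an automated text categorizer, not the paper's actual data or method.

```python
from collections import Counter
from itertools import combinations

# Hypothetical metadata records, one per document, of the kind an
# automated text categorizer might assign (illustrative values only).
records = [
    {"product": "camera", "sentiment": "positive", "topic": "battery"},
    {"product": "camera", "sentiment": "positive", "topic": "lens"},
    {"product": "camera", "sentiment": "negative", "topic": "battery"},
    {"product": "laptop", "sentiment": "negative", "topic": "battery"},
    {"product": "laptop", "sentiment": "negative", "topic": "screen"},
]


def mine_pair_rules(records, min_support=0.4, min_confidence=0.6):
    """Return rules (lhs, rhs, support, confidence) between single
    (field, category) items that co-occur in the metadata records."""
    n = len(records)
    item_counts = Counter()
    pair_counts = Counter()
    for rec in records:
        # Sort so each co-occurring pair is always counted under one key.
        items = sorted(rec.items())
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n  # fraction of records containing both items
        if support < min_support:
            continue
        # Try the rule in both directions: a => b and b => a.
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_counts[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return rules


if __name__ == "__main__":
    for lhs, rhs, support, confidence in mine_pair_rules(records):
        print(f"{lhs} => {rhs}  support={support:.2f} confidence={confidence:.2f}")
```

Restricting the dictionary keys to the fields of interest, or changing the category vocabulary for a field, changes which rules can be found. This mirrors the abstract's point that the choice of metadata fields and categories tailors the discovery task to a specific question.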
Copyright information
© 2002 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Pierre, J.M. (2002). Mining Knowledge from Text Collections Using Automatically Generated Metadata. In: Karagiannis, D., Reimer, U. (eds) Practical Aspects of Knowledge Management. PAKM 2002. Lecture Notes in Computer Science, vol 2569. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-36277-0_47
DOI: https://doi.org/10.1007/3-540-36277-0_47
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-00314-4
Online ISBN: 978-3-540-36277-7
eBook Packages: Springer Book Archive