Abstract
This paper presents results from a new text clustering methodology. A prototype is a document of interest, or an extracted passage of interesting text. The given prototype is matched against an existing document database or a monitored document flow. We claim that the new methodology is capable of automatic content-based clustering using the information in the documents. To test this hypothesis, an experiment was designed using the Bible. Four translations were selected as test material: one Greek, one Latin, and two Finnish (from 1933/38 and 1992). Validation experiments were performed with a prototype version of the software application.
Copyright information
© 2001 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Toivonen, J., Visa, A., Vesanen, T., Back, B., Vanharanta, H. (2001). Validation of Text Clustering Based on Document Contents. In: Perner, P. (ed.) Machine Learning and Data Mining in Pattern Recognition. MLDM 2001. Lecture Notes in Computer Science, vol 2123. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-44596-X_15
DOI: https://doi.org/10.1007/3-540-44596-X_15
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-42359-1
Online ISBN: 978-3-540-44596-8
eBook Packages: Springer Book Archive