Validation of Text Clustering Based on Document Contents

Toivonen, Jarmo; Visa, Ari; Vesanen, Tomi; Back, Barbro; Vanharanta, Hannu

doi:10.1007/3-540-44596-X_15

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 2123)

Included in the following conference series:

International Workshop on Machine Learning and Data Mining in Pattern Recognition

Abstract

This paper presents some results obtained with a new text clustering methodology. A prototype is a document of interest, or an extracted passage of interesting text. The given prototype is matched against an existing document database or a monitored document flow. Our claim is that the new methodology is capable of automatic content-based clustering using the information in the document. To verify this hypothesis, an experiment was designed using the Bible. Four different translations were selected as test material: one Greek, one Latin, and two Finnish translations from the years 1933/38 and 1992. Validation experiments were performed with a prototype version of the software application.
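
The abstract does not say how a prototype is compared with the stored documents. As a rough illustration only, the sketch below ranks documents by cosine similarity of word counts, a generic vector space approach swapped in for the authors' unspecified method. The function names and the toy texts are hypothetical and are not taken from the paper or its experiment.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Lowercase the text and count its word tokens."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_prototype(prototype, documents):
    """Return (name, similarity) pairs, most similar to the prototype first."""
    proto_vec = term_vector(prototype)
    scores = [(name, cosine(proto_vec, term_vector(text)))
              for name, text in documents.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy data, not taken from the paper's experiment.
documents = {
    "doc_a": "the shepherd led the flock to the green pasture",
    "doc_b": "the merchant sold grain in the city market",
    "doc_c": "a shepherd watched his flock by night",
}
ranking = rank_by_prototype("shepherd and flock", documents)
```

Here `ranking` lists the documents that share vocabulary with the prototype ahead of the one that shares none. Raw word counts would not match texts across languages, so this sketch says nothing about how the paper compares the Greek, Latin, and Finnish versions.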

Author information

Authors and Affiliations

Tampere University of Technology, P.O. Box 553, FIN-33101, Tampere, Finland
Jarmo Toivonen, Ari Visa & Tomi Vesanen
Åbo Akademi University, Lemminkäisenkatu 14 A, FIN-20520, Turku, Finland
Barbro Back
Pori School of Technology and Economics, P.O. Box 300, FIN-28101, Pori, Finland
Hannu Vanharanta

Editor information

Editors and Affiliations

Institute of Computer Vision and Applied Computer Sciences, Arno-Nitzsche-Str. 45, 04277, Leipzig, Germany
Petra Perner

About this paper

Cite this paper

Toivonen, J., Visa, A., Vesanen, T., Back, B., Vanharanta, H. (2001). Validation of Text Clustering Based on Document Contents. In: Perner, P. (eds) Machine Learning and Data Mining in Pattern Recognition. MLDM 2001. Lecture Notes in Computer Science, vol 2123. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-44596-X_15

DOI: https://doi.org/10.1007/3-540-44596-X_15
Published: 26 July 2001
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-42359-1
Online ISBN: 978-3-540-44596-8
eBook Packages: Springer Book Archive
