Technology of Text Mining

Visa, Ari

doi:10.1007/3-540-44596-X_1

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 2123)

Included in the following conference series:

International Workshop on Machine Learning and Data Mining in Pattern Recognition

Abstract

A large amount of information is stored in databases, on intranets, and on the Internet. This information is organised as documents or as text documents; the distinction depends on whether pictures, tables, figures, and formulas are included. The common problem is to find a desired piece of information, a trend, or an undiscovered pattern in these sources. The problem is not new. Traditionally it has been studied under the title of information seeking, that is, the science of how to find a book in a library. It has been solved either by classifying and accessing documents with the Dewey Decimal Classification system or by assigning a number of characteristic keywords. The difficulty today is that company databases, intranets, and the Internet contain large numbers of unclassified documents.

First, some terms are defined. Text filtering is an information seeking process in which documents are selected from a dynamic text stream. Text mining is the process of analysing text to extract information from it for particular purposes. Text categorisation is the process of clustering similar documents from a large document set. These terms overlap to some degree.
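To make the definition of text filtering concrete, the following is a minimal sketch, not the chapter's method. It assumes a simple keyword profile and a term-overlap threshold; the names `tokenize`, `filter_stream`, `profile`, and `min_hits`, and the sample documents, are hypothetical and introduced only for illustration.

```python
from typing import Iterable, Iterator, Set


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and split it on non-alphanumeric characters."""
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in text)
    return set(cleaned.split())


def filter_stream(
    stream: Iterable[str], profile: Set[str], min_hits: int = 1
) -> Iterator[str]:
    """Yield documents that share at least `min_hits` terms with the profile.

    Assumption: the user's interest is modelled as a plain set of keywords.
    Real filtering systems typically use weighted or learned profiles.
    """
    for doc in stream:
        if len(tokenize(doc) & profile) >= min_hits:
            yield doc


# Hypothetical incoming documents standing in for a dynamic text stream.
incoming = [
    "Quarterly report on paper mill revenue",
    "Weather forecast for Tampere",
    "Revenue trends in the forest industry",
]
selected = list(filter_stream(incoming, {"revenue", "industry"}))
```

Because `filter_stream` is a generator, it processes documents one at a time as they arrive, which matches the idea of selecting from a stream rather than from a fixed collection.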

Text mining, also known as document information mining, text data mining, or knowledge discovery in textual databases, is an emerging technology for analysing large collections of unstructured documents in order to extract interesting and non-trivial patterns or knowledge. Typical subproblems that have been solved are language identification, feature selection/extraction, clustering, natural language processing, summarisation, categorisation, search, indexing, and visualisation. These subproblems are discussed in detail and the most common approaches are given.
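As an illustration of the feature extraction subproblem, the sketch below turns tokenised documents into TF-IDF weighted term vectors and compares them with cosine similarity. This is one common approach, chosen here as an assumption; the abstract does not say which weighting the chapter uses. The function names and sample documents are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Return one sparse TF-IDF vector (term -> weight) per tokenised document."""
    n = len(docs)
    # Document frequency: the number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append(
            {t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()}
        )
    return vectors


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity of two sparse vectors; 0.0 if either has zero length."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical toy collection, already split into lowercase tokens.
docs = [
    "text mining finds patterns in text".split(),
    "mining patterns from document collections".split(),
    "the library catalogue lists books".split(),
]
vecs = tf_idf(docs)
sim_related = cosine(vecs[0], vecs[1])
sim_unrelated = cosine(vecs[0], vecs[2])
```

Vectors of this kind can serve as input to the clustering, categorisation, and search steps listed above; documents that share rare terms score higher than documents with no terms in common.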

Finally, some examples of current uses of text mining are given and some potential application areas are mentioned.

Author information

Authors and Affiliations

Tampere University of Technology, P.O. Box 553, FIN-33101, Tampere, Finland
Ari Visa

Editor information

Editors and Affiliations

Institute of Computer Vision and Applied Computer Sciences, Arno-Nitzsche-Str. 45, 04277, Leipzig, Germany
Petra Perner

About this paper

Cite this paper

Visa, A. (2001). Technology of Text Mining. In: Perner, P. (eds) Machine Learning and Data Mining in Pattern Recognition. MLDM 2001. Lecture Notes in Computer Science, vol 2123. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-44596-X_1

DOI: https://doi.org/10.1007/3-540-44596-X_1
Published: 26 July 2001
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-42359-1
Online ISBN: 978-3-540-44596-8
eBook Packages: Springer Book Archive
