Technology of Text Mining

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 2123)

Abstract

A large amount of information is stored in databases, on intranets, and on the Internet. This information is organised as documents or as text documents; the distinction is whether pictures, tables, figures, and formulas are included. The common problem is to find a desired piece of information, a trend, or an undiscovered pattern in these sources. The problem is not new. Traditionally it has been studied under the title of information seeking, that is, the science of how to find a book in a library, and it has been solved either by classifying and accessing documents with the Dewey Decimal Classification system or by assigning a number of characteristic keywords. Today, however, company databases, intranets, and the Internet contain large numbers of unclassified documents.

First, some terms are defined. Text filtering is an information seeking process in which documents are selected from a dynamic text stream. Text mining is the process of analysing text to extract information from it for particular purposes. Text categorisation is the process of clustering similar documents from a large document set. These terms overlap to some degree.
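The definition of text filtering can be illustrated with a minimal sketch. It assumes that a user's interest is represented by a set of profile terms and that a document is selected when it shares enough terms with that profile; the names `tokenize`, `filter_stream`, `profile`, and `min_hits` are hypothetical and are not taken from the paper.

```python
import re
from typing import Iterable, Iterator, Set


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and return its set of alphabetic word tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def filter_stream(stream: Iterable[str], profile: Set[str], min_hits: int = 2) -> Iterator[str]:
    """Yield each document that shares at least min_hits terms with the profile.

    Documents are examined one at a time, so the stream may be unbounded.
    """
    for doc in stream:
        if len(tokenize(doc) & profile) >= min_hits:
            yield doc


# Illustrative data only: a tiny stand-in for a dynamic text stream.
incoming = [
    "Quarterly earnings rose as bank lending expanded.",
    "The football final ended in a draw.",
    "Central bank raises interest rates; lending slows.",
]
profile = {"bank", "lending", "rates", "earnings"}  # assumed interest profile
selected = list(filter_stream(incoming, profile))
```

Because `filter_stream` is a generator, it processes documents as they arrive rather than requiring the whole collection in advance, which matches the "dynamic text stream" in the definition.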

Text mining, also known as document information mining, text data mining, or knowledge discovery in textual databases, is an emerging technology for analysing large collections of unstructured documents in order to extract interesting and non-trivial patterns or knowledge. Typical subproblems that have been addressed are language identification, feature selection/extraction, clustering, natural language processing, summarisation, categorisation, search, indexing, and visualisation. These subproblems are discussed in detail and the most common approaches are given.
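Two of the subproblems listed above, feature extraction and the document comparison that clustering relies on, can be sketched as follows. This is a minimal illustration under the assumption that each document is represented as a vector of raw term counts and that similarity is measured by the cosine of the angle between vectors; the abstract does not say which representation the paper uses, and the function names are hypothetical.

```python
import math
import re
from collections import Counter


def term_frequencies(text: str) -> Counter:
    """Feature extraction: map a document to its term-count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Illustrative documents only.
docs = [
    "text mining extracts patterns from text",
    "mining patterns in large text collections",
    "the library classifies books by subject",
]
vectors = [term_frequencies(d) for d in docs]
sim_related = cosine_similarity(vectors[0], vectors[1])
sim_unrelated = cosine_similarity(vectors[0], vectors[2])
```

Here the first two documents share the terms "text", "mining", and "patterns" and so score higher than the first and third, which share no terms. A clustering step would group documents whose pairwise similarity is high; practical systems usually also apply term weighting and stop-word removal, which this sketch omits.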

Finally, some examples of current uses of text mining are given and some potential application areas are mentioned.

Copyright information

© 2001 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Visa, A. (2001). Technology of Text Mining. In: Perner, P. (eds) Machine Learning and Data Mining in Pattern Recognition. MLDM 2001. Lecture Notes in Computer Science, vol 2123. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-44596-X_1

  • DOI: https://doi.org/10.1007/3-540-44596-X_1

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-42359-1

  • Online ISBN: 978-3-540-44596-8

  • eBook Packages: Springer Book Archive
