Distributed Classification of Textual Documents on the Grid

Janciak, Ivan; Sarnovsky, Martin; Tjoa, A Min; Brezany, Peter

doi:10.1007/11847366_73


Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 4208)

Included in the following conference series:

International Conference on High Performance Computing and Communications

Abstract

Efficient access to information, integration of information from various sources, and the transformation of this information into knowledge are currently major challenges in life science research. However, a large fraction of this information is available only in scientific articles stored in huge document databases in free-text format, or on the Web, where it is available in semi-structured format.

Text mining provides methods (e.g., classification and clustering) that can automatically extract relevant knowledge patterns contained in free-text data. Including Grid text-mining services in a Grid-based knowledge discovery system can significantly support problem-solving processes based on such a system.

The motivation for the research presented in this paper is to use the Grid's computational, storage, and data access capabilities for text mining tasks, and for text classification in particular. Text classification methods are time-consuming, and utilizing the Grid infrastructure can bring significant benefits. Implementing text mining techniques in a distributed environment allows us to access different geographically distributed data collections and to perform text mining tasks in a parallel/distributed fashion.



Author information

Authors and Affiliations

Institute of Scientific Computing, University of Vienna, Nordbergstrasse 15/C/3, A-1090, Vienna, Austria
Ivan Janciak & Peter Brezany
Department of Cybernetics and Artificial Intelligence, Technical University of Kosice, Letna 9, Kosice, Slovakia
Martin Sarnovsky
Institute of Software Technology and Interactive Systems, Vienna University of Technology, Favoritenstrasse 9-11/E188, A-1040, Vienna, Austria
A Min Tjoa


Editor information

Editors and Affiliations

Michael Gerndt
GUP, Institute of Graphics and Parallel Processing, Johannes Kepler University, Altenbergerstraße 69, A-4040, Linz, Austria
Dieter Kranzlmüller

About this paper

Cite this paper

Janciak, I., Sarnovsky, M., Tjoa, A.M., Brezany, P. (2006). Distributed Classification of Textual Documents on the Grid. In: Gerndt, M., Kranzlmüller, D. (eds) High Performance Computing and Communications. HPCC 2006. Lecture Notes in Computer Science, vol 4208. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11847366_73

DOI: https://doi.org/10.1007/11847366_73
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-39368-9
Online ISBN: 978-3-540-39372-6
eBook Packages: Computer Science, Computer Science (R0)
