Abstract
Efficient access to information, integration of information from various sources, and the conversion of this information into knowledge are currently major challenges in life science research. However, a large fraction of this information is available only from scientific articles stored in huge document databases in free-text format, or from the Web, where it is available in semi-structured format.
Text mining provides methods (e.g., classification and clustering) that can automatically extract relevant knowledge patterns from free-text data. Including Grid text-mining services in a Grid-based knowledge discovery system can significantly support the problem-solving processes built on such a system.
The motivation for the research presented in this paper is to use the Grid's computational, storage, and data access capabilities for text mining tasks, and for text classification in particular. Text classification methods are time-consuming, so utilizing the Grid infrastructure can bring significant benefits. Implementing text mining techniques in a distributed environment allows us to access different geographically distributed data collections and to perform text mining tasks in a parallel/distributed fashion.
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Janciak, I., Sarnovsky, M., Tjoa, A.M., Brezany, P. (2006). Distributed Classification of Textual Documents on the Grid. In: Gerndt, M., Kranzlmüller, D. (eds) High Performance Computing and Communications. HPCC 2006. Lecture Notes in Computer Science, vol 4208. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11847366_73
DOI: https://doi.org/10.1007/11847366_73
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-39368-9
Online ISBN: 978-3-540-39372-6
eBook Packages: Computer Science, Computer Science (R0)