Abstract
This paper presents a novel technique for semi-automatically identifying document metadata when a knowledge management system is installed. Document management systems often deal with large collections of documents, and this vast amount of information must be searchable by knowledge workers. Supporting techniques are needed to aid knowledge workers in their search for information, and many of these techniques rely on the presence of metadata for each document. The technique presented in this paper is based on a novel approach called multilayer clustering. Using this clustering technique, documents can be assigned to one or more document types. In addition to this type assignment, properties and values are assigned to each document based on term networks extracted from that document. The preliminary tests presented in this paper were performed on one public and several private datasets. The results indicate that the approach is promising.
Copyright information
© 2010 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
D’hondt, J., Vandevenne, D., Verhaegen, PA., Vertommen, J., Cattrysse, D., Duflou, J.R. (2010). Identifying Document Metadata Based on Multilayer Clustering. In: Huang, G.Q., Mak, K.L., Maropoulos, P.G. (eds) Proceedings of the 6th CIRP-Sponsored International Conference on Digital Enterprise Technology. Advances in Intelligent and Soft Computing, vol 66. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-10430-5_115
DOI: https://doi.org/10.1007/978-3-642-10430-5_115
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-10429-9
Online ISBN: 978-3-642-10430-5
eBook Packages: Engineering, Engineering (R0)