Abstract
We propose a method based on restricted random walk clustering as a (semi-)automated complement for the tedious, error-prone and expensive task of manual indexing in a scientific library. The first stage of our method is to cluster a set of (partially) indexed documents using restricted random walks on usage histories in order to find groups of similar documents. In the second stage, we derive possible keywords for documents without indexing information from the frequencies of keywords assigned to other documents in their respective cluster.
Thanks to the clustering algorithm used, the proposed method remains efficient with millions of documents and can be deployed on standard PC hardware.
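The second stage described above can be illustrated with a minimal sketch. It assumes that the clusters from the first stage are already available as sets of document identifiers. The function name `suggest_keywords`, the `top_n` parameter, the relative-frequency score and the sample data are all hypothetical choices made for this illustration and are not taken from the paper.

```python
from collections import Counter


def suggest_keywords(doc_id, clusters, keywords, top_n=3):
    """Rank candidate keywords for a document by their frequency in its cluster.

    clusters: iterable of sets of document ids (assumed output of stage one).
    keywords: dict mapping document id -> list of manually assigned keywords.
    Returns (keyword, share) pairs, where share is the fraction of indexed
    cluster members that carry the keyword. This scoring is an assumption.
    """
    cluster = next((c for c in clusters if doc_id in c), None)
    if cluster is None:
        return []

    # Only other cluster members that already have keywords can contribute.
    indexed = [d for d in cluster if d != doc_id and keywords.get(d)]
    if not indexed:
        return []

    counts = Counter()
    for other in indexed:
        # Count each keyword once per document.
        counts.update(set(keywords[other]))

    # Sort by descending frequency, then alphabetically for a stable order.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(kw, n / len(indexed)) for kw, n in ranked[:top_n]]


# Hypothetical sample data: d4 and d6 have no indexing information.
clusters = [{"d1", "d2", "d3", "d4"}, {"d5", "d6"}]
keywords = {
    "d1": ["clustering", "random walks"],
    "d2": ["clustering", "recommender systems"],
    "d3": ["clustering", "random walks"],
    "d5": ["repeat buying"],
}

suggestions = suggest_keywords("d4", clusters, keywords)
for keyword, share in suggestions:
    print(f"{keyword}: {share:.2f}")
```

In a semi-automated setting, a ranked list of this kind could be shown to a librarian for confirmation instead of being assigned directly.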
Copyright information
© 2004 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Franke, M., Geyer-Schulz, A. (2004). Automated Indexing with Restricted Random Walks on Large Document Sets. In: Heery, R., Lyon, L. (eds) Research and Advanced Technology for Digital Libraries. ECDL 2004. Lecture Notes in Computer Science, vol 3232. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30230-8_22
DOI: https://doi.org/10.1007/978-3-540-30230-8_22
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-23013-7
Online ISBN: 978-3-540-30230-8
eBook Packages: Springer Book Archive