Abstract
Indexing documents with descriptors from a multilingual thesaurus is one approach to multilingual Information Retrieval. However, manual indexing is expensive. Automated indexing methods generally use terms found in the document. Thesaurus descriptors, by contrast, are complex terms that often do not occur in documents or that have specific meanings within the thesaurus; most weighting schemes of automated indexing methods are therefore not suited to selecting thesaurus descriptors.
In this paper a linear associative system is described that uses similarity values extracted from a large corpus of manually indexed documents to construct a rank ordering of the descriptors for a given document title. The system is adaptive and has to be tuned with a training sample of records for the specific task.
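The general idea of ranking descriptors from co-occurrence data can be illustrated with a minimal sketch. The abstract does not state the similarity measure or the tunable parameters, so everything here is an assumption made for illustration: the similarity of a title word and a descriptor is taken to be the fraction of training records containing the word that were manually assigned the descriptor, and a title's score for a descriptor is the plain sum over its words. The function names `train_associations` and `rank_descriptors` and the toy records are hypothetical.

```python
from collections import Counter, defaultdict


def train_associations(records):
    """Estimate word-to-descriptor similarities from manually indexed records.

    records: iterable of (title_words, descriptors) pairs.
    Assumed measure (not taken from the paper): the share of records
    containing word w that also carry descriptor d.
    """
    word_count = Counter()
    cooc = defaultdict(Counter)
    for words, descriptors in records:
        for w in set(words):
            word_count[w] += 1
            for d in set(descriptors):
                cooc[w][d] += 1
    return {
        w: {d: c / word_count[w] for d, c in ds.items()}
        for w, ds in cooc.items()
    }


def rank_descriptors(title_words, sim, all_descriptors):
    """Rank every descriptor by the summed similarity to the title words."""
    scores = dict.fromkeys(all_descriptors, 0.0)
    for w in set(title_words):
        for d, s in sim.get(w, {}).items():
            scores[d] = scores.get(d, 0.0) + s
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Toy training data (hypothetical).
records = [
    (["neural", "network", "learning"], ["machine learning", "neural networks"]),
    (["network", "protocol"], ["computer networks"]),
    (["learning", "theory"], ["machine learning"]),
]
descriptors = ["computer networks", "machine learning", "neural networks"]

sim = train_associations(records)
ranking = rank_descriptors(["learning", "network"], sim, descriptors)
print(ranking)
```

Because the score is a linear combination of per-word similarities, the whole list of descriptors can be ordered for any title, including descriptors whose wording never appears in the title itself.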
The system was tested on a corpus of some 80,000 bibliographic records. The results vary strongly with the parameter values, which indicates that the model must be adapted empirically to the specific situation in which it is used. The overall median rank of the manually assigned descriptors in the automatically generated ranked list of all 3,631 descriptors is 14 for the set used to adapt the system and 11 for a test set not used in the optimization process. This result shows that the optimization is not merely a fit to a specific training set but a genuine adaptation of the model to the setting.
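The evaluation figure reported above, the median position of the manually assigned descriptors in the generated ranking, could be computed along the following lines. This is a sketch under assumptions: ranks are counted from 1, the ranks of all assigned descriptors are pooled across records before taking the median, and every assigned descriptor occurs in the ranking. The function name `overall_median_rank` and the toy data are hypothetical.

```python
from statistics import median


def overall_median_rank(rankings, assigned):
    """Median 1-based rank of the manually assigned descriptors.

    rankings: one ranked list of all descriptors per record.
    assigned: the manually assigned descriptors per record, same order.
    Ranks are pooled over all records (an assumption about "overall").
    """
    ranks = []
    for ranking, gold in zip(rankings, assigned):
        position = {d: i for i, d in enumerate(ranking, start=1)}
        ranks.extend(position[d] for d in gold)
    return median(ranks)


# Toy example (hypothetical): two records, four descriptors.
rankings = [["a", "b", "c", "d"], ["b", "a", "d", "c"]]
assigned = [["a", "c"], ["d"]]
result = overall_median_rank(rankings, assigned)
print(result)
```

Computing the same statistic on the adaptation set and on a held-out set, as the abstract describes, is what allows overfitting to the training sample to be detected.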
Copyright information
© 1997 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Ferber, R. (1997). Automated indexing with thesaurus descriptors: A co-occurrence based approach to multilingual retrieval. In: Peters, C., Thanos, C. (eds) Research and Advanced Technology for Digital Libraries. ECDL 1997. Lecture Notes in Computer Science, vol 1324. Springer, Berlin, Heidelberg. https://doi.org/10.1007/BFb0026731
DOI: https://doi.org/10.1007/BFb0026731
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-63554-3
Online ISBN: 978-3-540-69597-4
eBook Packages: Springer Book Archive