Automated indexing with thesaurus descriptors: A co-occurrence based approach to multilingual retrieval

Ferber, Reginald

doi:10.1007/BFb0026731

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 1324)

Included in the following conference series:

International Conference on Theory and Practice of Digital Libraries

Abstract

Indexing documents with descriptors from a multilingual thesaurus is one approach to multilingual information retrieval. However, manual indexing is expensive. Automated indexing methods generally use terms found in the document itself. Thesaurus descriptors, by contrast, are complex terms that often do not occur in documents or have specific meanings within the thesaurus; most weighting schemes of automated indexing methods are therefore not suited to selecting thesaurus descriptors.

This paper describes a linear associative system that uses similarity values extracted from a large corpus of manually indexed documents to construct a rank ordering of the descriptors for a given document title. The system is adaptive and has to be tuned on a training sample of records for the specific task.
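As a rough illustration of the idea, the following Python sketch ranks descriptors for a title by summing word-to-descriptor similarity values learned from manually indexed records. The abstract does not state which similarity measure or tuning parameters the paper uses, so the measure here (the fraction of training titles containing a word that were indexed with a descriptor) and the names `train` and `rank_descriptors` are assumptions made only for this example.

```python
from collections import Counter, defaultdict


def train(records):
    """Count word and word/descriptor co-occurrences.

    records: iterable of (title_words, descriptors) pairs taken from
    manually indexed documents (hypothetical input format).
    """
    word_count = Counter()
    pair_count = defaultdict(Counter)
    for words, descriptors in records:
        for w in set(words):
            word_count[w] += 1
            for d in set(descriptors):
                pair_count[w][d] += 1
    return word_count, pair_count


def rank_descriptors(title_words, word_count, pair_count, all_descriptors):
    """Return all descriptors ordered by summed similarity to the title.

    Assumed similarity: count(word, descriptor) / count(word).
    The score is linear: contributions of the title words are added up.
    """
    scores = dict.fromkeys(all_descriptors, 0.0)
    for w in set(title_words):
        n = word_count.get(w, 0)
        if n == 0:
            continue  # word never seen in training titles
        for d, c in pair_count[w].items():
            if d in scores:
                scores[d] += c / n
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(all_descriptors, key=lambda d: (-scores[d], d))


# Toy training data, invented for illustration only.
records = [
    (["neural", "network", "learning"], ["machine learning", "neural networks"]),
    (["database", "query", "optimization"], ["databases"]),
    (["learning", "query", "languages"], ["machine learning", "databases"]),
]
descriptors = ["databases", "machine learning", "neural networks"]
word_count, pair_count = train(records)
ranking = rank_descriptors(["neural", "learning"], word_count, pair_count, descriptors)
print(ranking)
```

In a real setting the similarity measure and any weighting parameters would be chosen empirically on a training sample, as the abstract stresses.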

The system was tested on a corpus of some 80,000 bibliographic records. The results vary strongly with the parameter values, which indicates that it is very important to adapt the model empirically to the specific situation in which it is used. The overall median rank of the manually assigned descriptors in the automatically generated ranked list of all 3,631 descriptors is 14 for the set used to adapt the system and 11 for a test set not used in the optimization process. This result shows that the optimization is not merely a fit to a specific training set but a real adaptation of the model to the setting.
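The evaluation measure mentioned here, the median position of the manually assigned descriptors in the generated ranking, can be sketched as follows. How the paper pools ranks across records is not stated in the abstract; this example assumes that the ranks of all assigned descriptors from all records are collected into one list before the median is taken, and the function names are hypothetical.

```python
from statistics import median


def ranks_of_assigned(ranked, assigned):
    """1-based positions of the manually assigned descriptors in a ranking."""
    position = {d: i + 1 for i, d in enumerate(ranked)}
    return [position[d] for d in assigned if d in position]


def overall_median_rank(results):
    """Median rank over all assigned descriptors of all records.

    results: iterable of (ranked_descriptors, assigned_descriptors) pairs.
    Pooling over records is an assumption of this sketch.
    """
    all_ranks = []
    for ranked, assigned in results:
        all_ranks.extend(ranks_of_assigned(ranked, assigned))
    return median(all_ranks)


# Toy data, invented for illustration only.
results = [
    (["a", "b", "c", "d"], ["a", "c"]),  # assigned descriptors at ranks 1 and 3
    (["b", "a", "d", "c"], ["d"]),       # assigned descriptor at rank 3
]
print(overall_median_rank(results))
```

A lower median means the manually assigned descriptors appear closer to the top of the ranked list; comparing the value on the tuning set with that on a held-out set is what allows the check against overfitting described above.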


Author information

Authors and Affiliations

GMD - IPSI, Dolivostr. 15, 64293, Darmstadt, Germany
Reginald Ferber

Editor information

Carol Peters, Costantino Thanos

About this paper

Cite this paper

Ferber, R. (1997). Automated indexing with thesaurus descriptors: A co-occurrence based approach to multilingual retrieval. In: Peters, C., Thanos, C. (eds) Research and Advanced Technology for Digital Libraries. ECDL 1997. Lecture Notes in Computer Science, vol 1324. Springer, Berlin, Heidelberg. https://doi.org/10.1007/BFb0026731

DOI: https://doi.org/10.1007/BFb0026731
Published: 17 June 2005
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-63554-3
Online ISBN: 978-3-540-69597-4
eBook Packages: Springer Book Archive
