Automated indexing with thesaurus descriptors: A co-occurrence based approach to multilingual retrieval

  • Multilingual Information Retrieval
  • Conference paper
Research and Advanced Technology for Digital Libraries (ECDL 1997)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 1324)

Abstract

Indexing documents with descriptors from a multilingual thesaurus is one approach to multilingual Information Retrieval. However, manual indexing is expensive. Automated indexing methods generally use terms found in the document itself. Thesaurus descriptors, by contrast, are complex terms that often do not occur in documents or that have specific meanings within the thesaurus; most weighting schemes of automated indexing methods are therefore not suited to selecting thesaurus descriptors.

In this paper a linear associative system is described that uses similarity values extracted from a large corpus of manually indexed documents to construct a rank ordering of the descriptors for a given document title. The system is adaptive and has to be tuned with a training sample of records for the specific task.
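The idea of ranking descriptors by co-occurrence with title words can be illustrated with a minimal sketch. It is not the system described in the paper: the abstract does not state the similarity measure or the tuned parameters, so the sketch assumes a simple estimate (the fraction of training titles containing a word that were indexed with a descriptor) and sums these values linearly over the words of a new title. The function names `train` and `rank_descriptors` and the toy records are hypothetical.

```python
from collections import Counter, defaultdict


def train(records):
    """Count word/descriptor co-occurrences in manually indexed records.

    records: iterable of (title_words, descriptors) pairs.
    """
    word_count = Counter()          # number of titles containing each word
    cooc = defaultdict(Counter)     # cooc[word][descriptor] = joint count
    for words, descriptors in records:
        for w in set(words):
            word_count[w] += 1
            for d in set(descriptors):
                cooc[w][d] += 1
    return word_count, cooc


def rank_descriptors(title_words, word_count, cooc):
    """Rank descriptors for a title by summed word-descriptor similarity.

    Assumption: similarity(w, d) = cooc[w][d] / word_count[w]; the paper's
    actual measure and parameters are not given in the abstract.
    """
    scores = Counter()
    for w in set(title_words):
        if w not in word_count:
            continue                # unseen words contribute nothing
        for d, n in cooc[w].items():
            scores[d] += n / word_count[w]
    # Highest score first; ties broken alphabetically for determinism.
    return [d for d, _ in sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))]


# Toy training sample (hypothetical data).
records = [
    (["solar", "energy", "storage"], ["ENERGY", "SOLAR POWER"]),
    (["wind", "energy"], ["ENERGY", "WIND POWER"]),
    (["solar", "cells"], ["SOLAR POWER"]),
]
word_count, cooc = train(records)
ranking = rank_descriptors(["solar", "cells"], word_count, cooc)
print(ranking)
```

This sketch ranks only descriptors that co-occurred with at least one title word; a system that orders the full descriptor set would also have to place the remaining descriptors, for example at the end of the list.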

The system was tested on a corpus of some 80,000 bibliographic records. The results vary strongly with the parameter values, which indicates that the model must be adapted empirically to the specific situation in which it is used. The overall median rank of the manually assigned descriptors in the automatically generated ranked list of all 3,631 descriptors is 14 for the set used to adapt the system and 11 for a test set not used in the optimization process. This result shows that the optimization does not merely fit a specific training set but genuinely adapts the model to the setting.
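The evaluation figure reported here, a median position of the manually assigned descriptors within the ranked list, can be computed along the following lines. This is a sketch under assumptions: it pools the positions over all records before taking the median, and it gives a descriptor that is missing from a list the position just past the end. The abstract does not say how either case was handled, and the name `median_rank` and the sample data are hypothetical.

```python
from statistics import median


def median_rank(results):
    """Median 1-based position of manually assigned descriptors.

    results: iterable of (ranked_descriptors, manual_descriptors) pairs.
    """
    ranks = []
    for ranked, manual in results:
        position = {d: i for i, d in enumerate(ranked, start=1)}
        for d in manual:
            # Assumption: a descriptor absent from the list is placed
            # one position past the end of the list.
            ranks.append(position.get(d, len(ranked) + 1))
    return median(ranks)


# Hypothetical rankings paired with manually assigned descriptors.
results = [
    (["A", "B", "C", "D"], ["A", "C"]),  # positions 1 and 3
    (["B", "D", "A", "C"], ["D"]),       # position 2
]
overall = median_rank(results)
print(overall)
```

A lower median means the manually chosen descriptors appear nearer the top of the generated list; comparing the value on the adaptation set with the value on a held-out set is what separates adaptation from overfitting.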

Editor information

Carol Peters, Costantino Thanos

Copyright information

© 1997 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Ferber, R. (1997). Automated indexing with thesaurus descriptors: A co-occurrence based approach to multilingual retrieval. In: Peters, C., Thanos, C. (eds) Research and Advanced Technology for Digital Libraries. ECDL 1997. Lecture Notes in Computer Science, vol 1324. Springer, Berlin, Heidelberg. https://doi.org/10.1007/BFb0026731

  • DOI: https://doi.org/10.1007/BFb0026731

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-63554-3

  • Online ISBN: 978-3-540-69597-4

  • eBook Packages: Springer Book Archive
