Abstract
We propose an approach to Distributed Information Retrieval based on the periodic and incremental centralisation of full-text indices of widely dispersed and autonomously managed content sources.
Inspired by the success of the Open Archives Initiative’s protocol for metadata harvesting, the approach occupies the middle ground between (i) crawling content and (ii) distributing retrieval. As in crawling, some data moves towards the retrieval process, but it is statistics about the content rather than the content itself. As in distributed retrieval, some processing is distributed along with the data, but it is indexing rather than retrieval itself. We show that the approach retains the good properties of centralised retrieval without giving up cost-effective resource pooling. We discuss the requirements associated with the approach and identify two strategies to deploy it on top of the OAI infrastructure.
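The division of labour described above can be illustrated with a minimal sketch. It is not the authors' implementation: the class names `ContentSource` and `CentralIndex`, the use of ISO date strings as datestamps, and the simple TF-IDF scoring are all assumptions made for illustration. Each source indexes its own documents and exposes only term statistics; a central service harvests the statistics changed since its last visit and runs retrieval over the merged index.

```python
from collections import Counter, defaultdict
import math


class ContentSource:
    """Hypothetical autonomous source: indexes locally, exports statistics only."""

    def __init__(self, name):
        self.name = name
        self._docs = {}  # doc_id -> (datestamp, term-frequency Counter)

    def put(self, doc_id, text, datestamp):
        # Indexing happens at the source; only term counts are kept for export.
        self._docs[doc_id] = (datestamp, Counter(text.lower().split()))

    def harvest(self, since=None):
        """Return term statistics for documents changed after `since`."""
        return {
            doc_id: dict(tf)
            for doc_id, (stamp, tf) in self._docs.items()
            if since is None or stamp > since
        }


class CentralIndex:
    """Hypothetical central service: merges harvested statistics, runs retrieval."""

    def __init__(self):
        self.postings = defaultdict(dict)  # term -> {(source, doc_id): tf}
        self.doc_terms = {}                # (source, doc_id) -> set of terms
        self.last_harvest = {}             # source name -> last harvest datestamp

    def harvest_from(self, source, now):
        """Incrementally pull statistics from one source; return the batch size."""
        batch = source.harvest(since=self.last_harvest.get(source.name))
        for doc_id, tf in batch.items():
            key = (source.name, doc_id)
            # Drop stale postings of a re-indexed document before merging.
            for term in self.doc_terms.get(key, ()):
                self.postings[term].pop(key, None)
            for term, freq in tf.items():
                self.postings[term][key] = freq
            self.doc_terms[key] = set(tf)
        self.last_harvest[source.name] = now
        return len(batch)

    def search(self, query):
        """Rank documents by a simple TF-IDF sum (an assumed scoring model)."""
        n_docs = len(self.doc_terms)
        scores = Counter()
        for term in query.lower().split():
            docs = self.postings.get(term, {})
            if not docs:
                continue
            idf = math.log(1 + n_docs / len(docs))
            for key, freq in docs.items():
                scores[key] += freq * idf
        return [key for key, _ in scores.most_common()]
```

In this sketch a second call to `harvest_from` transfers only the documents whose datestamp is later than the previous harvest, which is the incremental aspect; the document text itself never leaves the source. Deleted documents are not handled here.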
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Simeoni, F., Yakici, M., Neely, S., Crestani, F. (2005). Harvesting for Full-Text Retrieval. In: Fox, E.A., Neuhold, E.J., Premsmit, P., Wuwongse, V. (eds) Digital Libraries: Implementing Strategies and Sharing Experiences. ICADL 2005. Lecture Notes in Computer Science, vol 3815. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11599517_24
DOI: https://doi.org/10.1007/11599517_24
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-30850-8
Online ISBN: 978-3-540-32291-7
eBook Packages: Computer Science, Computer Science (R0)