Abstract
This chapter argues that, in addition to federated search and gathering (as by Web crawlers), harvesting is an important approach to meeting the needs of distributed IR. We highlight the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH), illustrating its use in three projects: OAD, NDLTD, and CITIDEL. We explain how traditional services can be extended in a user-centered fashion, providing details of our new ESSEX search engine, multischeming browsing, and quality-oriented filtering (using rules and SVMs). We conclude with an overview of work in progress on logging and component architectures, as well as a summary of our findings.
Copyright information
© 2004 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Fox, E.A. et al. (2004). Harvesting: Broadening the Field of Distributed Information Retrieval. In: Callan, J., Crestani, F., Sanderson, M. (eds) Distributed Multimedia Information Retrieval. DIR 2003. Lecture Notes in Computer Science, vol 2924. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24610-7_1
DOI: https://doi.org/10.1007/978-3-540-24610-7_1
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-20875-4
Online ISBN: 978-3-540-24610-7
eBook Packages: Springer Book Archive