
Harvesting: Broadening the Field of Distributed Information Retrieval

  • Conference paper
Distributed Multimedia Information Retrieval (DIR 2003)

Abstract

This chapter argues that, in addition to federated search and gathering (as by Web crawlers), harvesting is an important approach to meeting the needs of distributed IR. We highlight the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH), illustrating its use in three projects: OAD, NDLTD, and CITIDEL. We explain how traditional services can be extended in a user-centered fashion, providing details of our new ESSEX search engine, multischeming browsing, and quality-oriented filtering (using rules and SVMs). We conclude with an overview of work in progress on logging and component architectures, as well as a summary of our findings.
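
As a rough illustration of what harvesting with the Open Archives Initiative Protocol for Metadata Harvesting involves, the sketch below builds a `ListRecords` request and reads record identifiers and the resumption token from a response. It is not the chapter's implementation: the base URL `http://example.org/oai`, the function names, and the embedded sample response are assumptions made for this example, and no network call is made.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespace of OAI-PMH 2.0 response documents.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def build_list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Return a ListRecords request URL (hypothetical helper).

    When continuing a partial list, the resumption token is sent as the
    only argument besides the verb.
    """
    if resumption_token:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def parse_list_records(xml_text):
    """Return (record identifiers, resumption token or None) from a response."""
    root = ET.fromstring(xml_text)
    ids = [e.text for e in root.findall(".//oai:header/oai:identifier", OAI_NS)]
    token_el = root.find(".//oai:resumptionToken", OAI_NS)
    token = token_el.text if token_el is not None and token_el.text else None
    return ids, token


# Made-up response, trimmed to the elements the parser reads.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:example.org:1</identifier></header></record>
    <record><header><identifier>oai:example.org:2</identifier></header></record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

if __name__ == "__main__":
    first_url = build_list_records_url("http://example.org/oai")
    ids, token = parse_list_records(SAMPLE_RESPONSE)
    next_url = build_list_records_url("http://example.org/oai", resumption_token=token)
    print(first_url)
    print(ids, token)
    print(next_url)
```

A harvester would repeat the request with each returned token until the response carries none, which is how a service provider can collect metadata from a data provider in batches.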




Copyright information

© 2004 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Fox, E.A. et al. (2004). Harvesting: Broadening the Field of Distributed Information Retrieval. In: Callan, J., Crestani, F., Sanderson, M. (eds) Distributed Multimedia Information Retrieval. DIR 2003. Lecture Notes in Computer Science, vol 2924. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24610-7_1


  • DOI: https://doi.org/10.1007/978-3-540-24610-7_1

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-20875-4

  • Online ISBN: 978-3-540-24610-7

  • eBook Packages: Springer Book Archive
