DOI:10.1145/3197768.3201537
research-article

Error Analysis on Harvesting Data over the Internet

Published: 26 June 2018

Abstract

Harvesting tasks gather information into a central repository. We studied 880560 harvesting tasks from 3446 harvesting services in 354 harvesting rounds over a period of 15 months; 382705 of these tasks failed, and the remaining tasks occasionally returned fewer records. A significant part of the Open Archive Initiative harvesting services never worked or have ceased working, while many other services fail occasionally. A harvesting task includes many stages of information exchange, and each of them may fail, with different consequences each time. We studied the reported warning messages, the number of records returned, and the required response time to discover relations among them. We found that about half of the harvesting tasks in each harvesting round fail, and that the number of failing tasks is slowly increasing. We developed a method of analysis that can be used to reverse engineer such complex network systems and to categorize the reasons for failure into useful classes. Our results do not indicate a new approach to harvesting or lead to breakthrough advice, but they make clear the complexity of the operation in an ever-changing networking environment and alert the reader that some facts that may be considered trivial actually are not. They help us to better understand the risks involved, and to design more reliable procedures and improved ways to monitor them closely.
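
To make the idea of stage-wise failure concrete, the following is a minimal sketch, not the method used in the paper, of how one response to an OAI-PMH request might be assigned to a coarse outcome class. The class names, the function names `classify_response` and `failure_rate`, and the sample bodies are all assumptions made for illustration; only the OAI-PMH namespace, the `error` element with its `code` attribute, and the task counts come from the protocol and the abstract respectively.

```python
import xml.etree.ElementTree as ET

# Namespace of OAI-PMH 2.0 responses, in ElementTree's {uri}tag notation.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def classify_response(http_status, body):
    """Map one harvesting response to a hypothetical outcome class.

    http_status is None when no HTTP reply was received at all
    (for example a DNS failure, refused connection, or timeout).
    """
    if http_status is None:
        return "no_response"
    if http_status != 200:
        return "http_error"
    try:
        root = ET.fromstring(body)
    except ET.ParseError:
        return "malformed_xml"      # reply arrived but is not well-formed XML
    error = root.find(OAI_NS + "error")
    if error is not None:
        # The protocol reports problems such as badVerb in the code attribute.
        return "oai_error:" + error.get("code", "unknown")
    records = root.findall(".//" + OAI_NS + "record")
    return "ok" if records else "empty"


def failure_rate(failed, total):
    """Fraction of tasks that failed."""
    return failed / total


# Hypothetical sample bodies, heavily shortened for the example.
OK_BODY = (
    '<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">'
    "<ListRecords><record/></ListRecords></OAI-PMH>"
)
ERROR_BODY = (
    '<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">'
    '<error code="badVerb">Illegal verb</error></OAI-PMH>'
)

if __name__ == "__main__":
    print(classify_response(200, OK_BODY))
    print(classify_response(200, ERROR_BODY))
    print(classify_response(None, ""))
    # Overall figures from the abstract: 382705 failed out of 880560 tasks.
    print(round(failure_rate(382705, 880560), 3))
```

Each branch corresponds to a different stage at which the exchange can break down, which is why the consequences differ: a missing reply says nothing about the repository's content, whereas a well-formed reply carrying an error code does. With the counts given in the abstract, the overall failure fraction comes to roughly 43%.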

Cited By

  • (2022) Fourth Industrial Revolution between Knowledge Management and Digital Humanities. Information 13:6 (292). DOI:10.3390/info13060292. Online publication date: 8-Jun-2022
  • (2018) Metadata Synthesis and Updates on Collections Harvested Using the Open Archive Initiative Protocol for Metadata Harvesting. Digital Libraries for Open Knowledge (16-31). DOI:10.1007/978-3-030-00066-0_2. Online publication date: 5-Sep-2018

Published In

PETRA '18: Proceedings of the 11th PErvasive Technologies Related to Assistive Environments Conference
June 2018
591 pages
ISBN:9781450363907
DOI:10.1145/3197768
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

In-Cooperation

  • NSF: National Science Foundation

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Harvesting
  2. Metadata
  3. Open Archive Initiative
  4. Reliability
  5. Tool

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

PETRA '18
