research-article

Delve: A Dataset-Driven Scholarly Search and Analysis System

Published: 21 November 2017

Abstract

Research and experimentation in many scientific fields rely on the observation, analysis, and benchmarking of datasets. The advancement of research and development has thus strengthened the importance of dataset access. However, without enough knowledge of relevant datasets, researchers usually have to go through a process which we term "manual dataset retrieval". With the accelerating rate of scholarly publication, manually finding the relevant dataset for a given research area based on its usage or popularity is becoming increasingly difficult and tedious. In this paper, we present Delve, a web-based dataset retrieval and document analysis system. Unlike traditional academic search engines and dataset repositories, Delve is dataset-driven and provides a medium for dataset retrieval based on suitability or usage in a given field. It also visualizes dataset and document citation relationships, and enables users to analyze a scientific document by uploading its full PDF. We first discuss why the scientific community needs a system like Delve. We then introduce its internal design and explain how Delve works and how it benefits researchers at all levels.

Published in

ACM SIGKDD Explorations Newsletter, Volume 19, Issue 2
December 2017
46 pages
ISSN: 1931-0145
EISSN: 1931-0153
DOI: 10.1145/3166054

Copyright © 2017 Authors

Publisher: Association for Computing Machinery, New York, NY, United States
