Abstract
Research and experimentation in various scientific fields are based on observation, analysis, and benchmarking on datasets. The advancement of research and development has thus strengthened the importance of dataset access. However, without enough knowledge of relevant datasets, researchers usually have to go through a process which we term "manual dataset retrieval". With the accelerating rate of scholarly publication, manually finding the relevant dataset for a given research area based on its usage or popularity is becoming increasingly difficult and tedious. In this paper, we present Delve, a web-based dataset retrieval and document analysis system. Unlike traditional academic search engines and dataset repositories, Delve is dataset driven and provides a medium for dataset retrieval based on suitability or usage in a given field. It also visualizes dataset and document citation relationships, and enables users to analyze a scientific document by uploading its full PDF. We first discuss the reasons why the scientific community needs a system like Delve. We then introduce its internal design and explain how Delve works and how it benefits researchers of all levels.
Index Terms
- Delve: A Dataset-Driven Scholarly Search and Analysis System