Delve: A Dataset-Driven Scholarly Search and Analysis System

Authors:
Uchenna Akujuobi
King Abdullah University of Science and Technology (KAUST), Saudi Arabia

Xiangliang Zhang
King Abdullah University of Science and Technology (KAUST), Saudi Arabia

ACM SIGKDD Explorations Newsletter, Volume 19, Issue 2, December 2017, pp 36–46. https://doi.org/10.1145/3166054.3166059

Published: 21 November 2017

Abstract

Research and experimentation in many scientific fields rely on observing, analyzing, and benchmarking on datasets. The advancement of research and development has thus strengthened the importance of dataset access. However, without enough knowledge of relevant datasets, researchers usually have to go through a process which we term "manual dataset retrieval". With the accelerating rate of scholarly publication, manually finding the relevant dataset for a given research area based on its usage or popularity is becoming increasingly difficult and tedious. In this paper, we present Delve, a web-based dataset retrieval and document analysis system. Unlike traditional academic search engines and dataset repositories, Delve is dataset driven and provides a medium for dataset retrieval based on suitability or usage in a given field. It also visualizes dataset and document citation relationships, and enables users to analyze a scientific document by uploading its full PDF. We first discuss why the scientific community needs a system like Delve. We then introduce its internal design and explain how Delve works and how it benefits researchers of all levels.

Index Terms

Delve: A Dataset-Driven Scholarly Search and Analysis System
1. Information systems
  1. Information retrieval

Index terms have been assigned to the content through auto-classification.

Published in
ACM SIGKDD Explorations Newsletter Volume 19, Issue 2
December 2017
46 pages
ISSN:1931-0145
EISSN:1931-0153
DOI:10.1145/3166054
Editors:
Charu Aggarwal, IBM T.J. Watson
Haixun Wang, Google
Ankur Teredesai, University of Washington Tacoma
Hanghang Tong, Arizona State University
Copyright © 2017 Authors
Publisher
Association for Computing Machinery
New York, NY, United States
Qualifiers
- research-article
Article Metrics
- Total Citations: 19
- Total Downloads: 190
- Downloads (Last 12 months): 34
- Downloads (Last 6 weeks): 7