ABSTRACT
For decades, scientists bemoaned the scarcity of observational data to analyze and against which to test their models. Exponential growth in data volumes from ever-cheaper environmental sensors has provided scientists with the answer to their prayers: "big data". Now, scientists face a new challenge: with terabytes, petabytes or exabytes of data at hand, stored in thousands of heterogeneous datasets, how can they find the datasets most relevant to their research interests? If they cannot find the data, then they may as well never have collected it; that data is lost to them. Our research addresses this challenge, using an existing scientific archive as our test-bed. We approach this problem in a new way: by adapting Information Retrieval techniques, developed for searching text documents, to the world of (primarily numeric) scientific data. We propose an approach that uses a blend of automated and "semi-curated" methods to extract metadata from large archives of scientific data. We then perform searches over the extracted metadata, returning results ranked by similarity to the query terms. We briefly describe an implementation at an ocean observatory, built to validate the proposed approach. We propose performance and scalability research to explore how continued archive growth will affect our goal of interactive response, no matter the scale.
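The two steps described in the abstract, extracting metadata from each dataset and then ranking datasets by similarity to a query, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the authors' method: the metadata is reduced to a (min, max) range per numeric variable, the similarity is a simple scaled interval distance, and the names `DatasetSummary`, `extract_summary`, `interval_distance`, `score`, and `search` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple

Range = Tuple[float, float]


@dataclass
class DatasetSummary:
    """Hypothetical per-dataset metadata: observed (min, max) of each variable."""
    name: str
    ranges: Dict[str, Range]


def extract_summary(name: str, rows: Iterable[Dict[str, float]]) -> DatasetSummary:
    """Scan a dataset once, keeping only the min and max of every numeric column."""
    ranges: Dict[str, Range] = {}
    for row in rows:
        for var, value in row.items():
            lo, hi = ranges.get(var, (value, value))
            ranges[var] = (min(lo, value), max(hi, value))
    return DatasetSummary(name, ranges)


def interval_distance(query: Range, data: Range) -> float:
    """Gap between two closed intervals, scaled by the query width; 0 if they overlap."""
    q_lo, q_hi = query
    d_lo, d_hi = data
    gap = max(0.0, d_lo - q_hi, q_lo - d_hi)
    width = max(q_hi - q_lo, 1e-9)  # guard against zero-width (point) queries
    return gap / width


def score(query: Dict[str, Range], summary: DatasetSummary) -> float:
    """Similarity in [0, 1]; 1.0 means every queried range overlaps the dataset."""
    total = 0.0
    for var, q_range in query.items():
        if var not in summary.ranges:
            return 0.0  # assumption: a dataset lacking a queried variable scores zero
        total += interval_distance(q_range, summary.ranges[var])
    return 1.0 / (1.0 + total)


def search(query: Dict[str, Range], catalog: List[DatasetSummary], k: int = 10):
    """Return the top-k (score, dataset name) pairs, best match first."""
    scored = [(score(query, s), s.name) for s in catalog]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Illustrative data only; dataset names and values are made up.
catalog = [
    extract_summary("cruise_a", [{"salinity": 30.0, "temperature": 10.0},
                                 {"salinity": 34.0, "temperature": 12.0}]),
    extract_summary("station_b", [{"salinity": 5.0, "temperature": 15.0},
                                  {"salinity": 10.0, "temperature": 18.0}]),
    extract_summary("glider_c", [{"temperature": 11.0}]),
]
results = search({"salinity": (32.0, 35.0), "temperature": (9.0, 11.0)}, catalog)
for value, name in results:
    print(f"{name}: {value:.3f}")
```

Because the search touches only the compact summaries and never the raw observations, query cost in this sketch grows with the number of datasets rather than with the volume of data they hold, which is the property the goal of interactive response depends on.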
Index Terms
- When big data leads to lost data