When big data leads to lost data

Authors:
V. M. Megler, Portland State University, Portland, OR, USA
David Maier, Portland State University, Portland, OR, USA

PIKM '12: Proceedings of the 5th Ph.D. workshop on Information and knowledge, November 2012, Pages 1–8. https://doi.org/10.1145/2389686.2389688

Published: 02 November 2012

ABSTRACT

For decades, scientists bemoaned the scarcity of observational data to analyze and against which to test their models. Exponential growth in data volumes from ever-cheaper environmental sensors has provided scientists with the answer to their prayers: "big data". Now, scientists face a new challenge: with terabytes, petabytes or exabytes of data at hand, stored in thousands of heterogeneous datasets, how can scientists find the datasets most relevant to their research interests? If they cannot find the data, then they may as well never have collected it; that data is lost to them. Our research addresses this challenge, using an existing scientific archive as our test-bed. We approach this problem in a new way: by adapting Information Retrieval techniques, developed for searching text documents, into the world of (primarily numeric) scientific data. We propose an approach that uses a blend of automated and "semi-curated" methods to extract metadata from large archives of scientific data. We then perform searches over the extracted metadata, returning results ranked by similarity to the query terms. We briefly describe an implementation performed at an ocean observatory to validate the proposed approach. We propose performance and scalability research to explore how continued archive growth will affect our goal of interactive response, no matter the scale.
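
As a rough illustration of the approach the abstract outlines, the following Python sketch summarizes each numeric dataset by its per-variable minimum and maximum, then ranks those summaries against a query made of value ranges. The summary features, the scoring function, and every name used here (`extract_summary`, `range_score`, `search`, the sample dataset names) are assumptions made for illustration; the abstract does not specify which metadata is extracted or which similarity measure is used.

```python
from dataclasses import dataclass


@dataclass
class DatasetSummary:
    """Lightweight metadata for one dataset: variable -> (min, max)."""
    name: str
    ranges: dict


def extract_summary(name, rows):
    """Scan a dataset once and keep only the min/max of each variable.

    `rows` is a list of dicts mapping variable names to numeric values.
    """
    ranges = {}
    for row in rows:
        for var, value in row.items():
            lo, hi = ranges.get(var, (value, value))
            ranges[var] = (min(lo, value), max(hi, value))
    return DatasetSummary(name, ranges)


def range_score(query, data):
    """Hypothetical similarity: 1.0 when the ranges overlap, decaying
    toward 0 as the gap between them grows relative to the query width."""
    q_lo, q_hi = query
    d_lo, d_hi = data
    gap = max(d_lo - q_hi, q_lo - d_hi, 0.0)
    width = max(q_hi - q_lo, 1e-9)  # avoid division by zero for point queries
    return 1.0 / (1.0 + gap / width)


def score(summary, query):
    """Average the per-variable scores; a missing variable scores 0."""
    if not query:
        return 0.0
    total = 0.0
    for var, q_range in query.items():
        if var in summary.ranges:
            total += range_score(q_range, summary.ranges[var])
    return total / len(query)


def search(summaries, query, k=3):
    """Return the k best (name, score) pairs, most similar first."""
    scored = [(s.name, round(score(s, query), 3)) for s in summaries]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    cruise = extract_summary(
        "cruise_a",
        [{"temp": 10.0, "salinity": 30.0}, {"temp": 14.0, "salinity": 33.0}],
    )
    station = extract_summary("station_b", [{"temp": 20.0}, {"temp": 25.0}])
    print(search([station, cruise], {"temp": (12.0, 16.0)}))
```

Because the search touches only the compact summaries and never the underlying data, the cost of a query in this sketch depends on the number of datasets rather than on their size. Results are ranked rather than filtered, so a dataset that misses the query range is still returned, only lower in the list.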

Published in
PIKM '12: Proceedings of the 5th Ph.D. workshop on Information and knowledge
November 2012
108 pages
ISBN: 9781450317191
DOI: 10.1145/2389686
Program Chairs: Aparna S. Varde (Montclair State University), Fabian M. Suchanek (Max Planck Institute for Informatics)
Copyright © 2012 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
ranked data search
scientific data
Qualifiers
- research-article
Acceptance Rates
Overall Acceptance Rate: 25 of 62 submissions, 40%