DOI: 10.1145/2389686.2389688
research-article

When big data leads to lost data

Published: 02 November 2012

ABSTRACT

For decades, scientists bemoaned the scarcity of observational data to analyze and against which to test their models. Exponential growth in data volumes from ever-cheaper environmental sensors has provided scientists with the answer to their prayers: "big data". Now, scientists face a new challenge: with terabytes, petabytes or exabytes of data at hand, stored in thousands of heterogeneous datasets, how can scientists find the datasets most relevant to their research interests? If they cannot find the data, then they may as well never have collected it; that data is lost to them. Our research addresses this challenge, using an existing scientific archive as our test-bed. We approach this problem in a new way: by adapting Information Retrieval techniques, developed for searching text documents, into the world of (primarily numeric) scientific data. We propose an approach that uses a blend of automated and "semi-curated" methods to extract metadata from large archives of scientific data. We then perform searches over the extracted metadata, returning results ranked by similarity to the query terms. We briefly describe an implementation performed at an ocean observatory to validate the proposed approach. We propose performance and scalability research to explore how continued archive growth will affect our goal of interactive response, no matter the scale.

Published in

PIKM '12: Proceedings of the 5th Ph.D. workshop on Information and knowledge
November 2012
108 pages
ISBN: 9781450317191
DOI: 10.1145/2389686

Copyright © 2012 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org

Publisher

Association for Computing Machinery
New York, NY, United States

Acceptance Rates

Overall Acceptance Rate: 25 of 62 submissions, 40%