ABSTRACT
Real-world data, especially when generated by distributed measurement infrastructures such as sensor networks, tends to be incomplete, imprecise, and erroneous, making it impossible to present it to users or feed it directly into applications. The traditional approach to dealing with this problem is to first process the data using statistical or probabilistic models that can provide more robust interpretations of the data. Current database systems, however, do not provide adequate support for applying models to such data, especially when those models need to be frequently updated as new data arrives in the system. Hence, most scientists and engineers who depend on models for managing their data do not use database systems for archival or querying at all; at best, databases serve as a persistent raw data store.
In this paper we define a new abstraction called model-based views and present the architecture of MauveDB, the system we are building to support such views. Just as traditional database views provide logical data independence, model-based views provide independence from the details of the underlying data generating mechanism and hide the irregularities of the data by using models to present a consistent view to the users. MauveDB supports a declarative language for defining model-based views, allows declarative querying over such views using SQL, and supports several different materialization strategies and techniques to efficiently maintain them in the face of frequent updates. We have implemented a prototype system that currently supports views based on regression and interpolation, using the Apache Derby open source DBMS, and we present results that show the utility and performance benefits that can be obtained by supporting several different types of model-based views in a database system.
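To make the idea of a model-based view concrete, the following is a minimal Python sketch, not MauveDB's view-definition language or its implementation. It assumes raw readings arrive as irregularly spaced `(time, value)` pairs and shows two ways a model could present them on a regular grid: linear interpolation between neighbouring readings, and a least-squares line fitted to all readings. The function names `interpolation_view` and `regression_view`, the data layout, and the sample readings are hypothetical and chosen only for illustration.

```python
from bisect import bisect_left


def interpolation_view(readings, grid):
    """Rows (t, value) on a regular grid, linearly interpolated between
    the two nearest raw readings. Grid points outside the observed time
    range are omitted rather than extrapolated."""
    pts = sorted(readings)
    times = [t for t, _ in pts]
    rows = []
    for g in grid:
        if g < times[0] or g > times[-1]:
            continue
        i = bisect_left(times, g)
        if times[i] == g:
            # A raw reading exists exactly at this grid point.
            rows.append((g, pts[i][1]))
            continue
        (t0, v0), (t1, v1) = pts[i - 1], pts[i]
        rows.append((g, v0 + (v1 - v0) * (g - t0) / (t1 - t0)))
    return rows


def regression_view(readings, grid):
    """Rows (t, value) on a regular grid, predicted by a straight line
    fitted to all raw readings by ordinary least squares. Requires at
    least two readings with distinct times."""
    n = len(readings)
    mean_t = sum(t for t, _ in readings) / n
    mean_v = sum(v for _, v in readings) / n
    sxx = sum((t - mean_t) ** 2 for t, _ in readings)
    sxy = sum((t - mean_t) * (v - mean_v) for t, v in readings)
    slope = sxy / sxx
    intercept = mean_v - slope * mean_t
    return [(g, intercept + slope * g) for g in grid]


if __name__ == "__main__":
    # Hypothetical irregular sensor readings: (time, temperature).
    raw = [(0.0, 10.0), (1.5, 13.0), (4.0, 18.0)]
    grid = [0.0, 1.0, 2.0, 3.0, 4.0]
    for row in interpolation_view(raw, grid):
        print("interpolated", row)
    for row in regression_view(raw, grid):
        print("regressed", row)
```

In both cases the consumer sees regular rows and never touches the irregular raw readings, which is the kind of independence the abstract attributes to model-based views. A real system would also have to keep such rows up to date as new readings arrive, which this sketch does not attempt.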
Index Terms
- MauveDB: supporting model-based user views in database systems