DOI: 10.1145/3340964.3340974
research-article

Representative Query Answers on Uncertain Data

Published: 19 August 2019

Abstract

Our goal is to incorporate uncertainty information into the querying process so that results carry probabilistic guarantees. Existing probabilistic querying solutions cannot be scaled to realistic data sets due to the #P-complete nature of querying uncertain data. We present a new approach to querying uncertain sets of spatial data by sampling possible database worlds, each of which yields a possible query result. The main challenge is to find a consensus among the retrieved results. We tackle this by finding query results that are representative. A representative query result comes with a probabilistic guarantee: with a guaranteed probability, the true (but unknown) query result is sufficiently similar to it. Our experiments show that our sampling approach provides probabilistic guarantees while scaling to large data sets, making it possible to answer range queries, kNN queries, RkNN queries, and ranking queries in settings where state-of-the-art solutions do not scale.
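
The idea in the abstract can be illustrated with a minimal sketch, which is not the authors' algorithm. Everything specific in it is an assumption made for illustration: uncertain objects are modeled as discrete sets of weighted alternative locations that are independent of each other, the query is a circular range query, similarity between results is measured by Jaccard distance with a threshold `tau`, and the guarantee is a one-sided Hoeffding lower bound at confidence `1 - delta`. All function and parameter names are hypothetical.

```python
import math
import random
from collections import Counter


def sample_world(objects, rng):
    """Draw one possible world: one alternative location per uncertain object."""
    world = {}
    for oid, alternatives in objects.items():
        points = [p for p, _ in alternatives]
        weights = [w for _, w in alternatives]
        world[oid] = rng.choices(points, weights=weights, k=1)[0]
    return world


def range_query(world, center, radius):
    """Return the ids of all objects within `radius` of `center` in this world."""
    return frozenset(oid for oid, p in world.items()
                     if math.dist(p, center) <= radius)


def jaccard_distance(a, b):
    """1 - |a & b| / |a | b|; two empty results count as identical."""
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def representative_result(objects, center, radius,
                          n_samples=1000, tau=0.3, delta=0.05, seed=0):
    """Pick the sampled result that has the most sampled results within tau."""
    rng = random.Random(seed)
    results = [range_query(sample_world(objects, rng), center, radius)
               for _ in range(n_samples)]
    counts = Counter(results)

    best, best_frac = None, -1.0
    for cand in counts:
        close = sum(c for r, c in counts.items()
                    if jaccard_distance(cand, r) <= tau)
        frac = close / n_samples
        if frac > best_frac:
            best, best_frac = cand, frac

    # One-sided Hoeffding bound on the true fraction. Assumption: strictly,
    # the bound holds for a candidate chosen independently of these samples;
    # a careful version would validate the candidate on a fresh sample.
    eps = math.sqrt(math.log(1.0 / delta) / (2.0 * n_samples))
    return best, best_frac, max(0.0, best_frac - eps)


if __name__ == "__main__":
    objects = {
        "a": [((0.0, 0.0), 1.0)],                    # certain, inside the range
        "b": [((10.0, 10.0), 1.0)],                  # certain, outside the range
        "c": [((1.0, 0.0), 0.5), ((5.0, 0.0), 0.5)]  # inside in about half the worlds
    }
    rep, frac, lower = representative_result(objects, (0.0, 0.0), 2.0, tau=0.5)
    print(sorted(rep), frac, lower)
```

The returned lower bound reads as: with confidence at least `1 - delta`, the result of the true world lies within Jaccard distance `tau` of the representative with at least that probability. Other query types (kNN, ranking) would need a different query function and a similarity measure suited to their result type.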

Cited By

  • (2020) COVID-19 ensemble models using representative clustering. SIGSPATIAL Special 12:2 (33-41). DOI: 10.1145/3431843.3431848. Online publication date: 26-Oct-2020
  • (2020) Managing Uncertainty in Evolving Geo-Spatial Data. 2020 21st IEEE International Conference on Mobile Data Management (MDM) (5-8). DOI: 10.1109/MDM48529.2020.00021. Online publication date: Jun-2020
  • (2020) Uncertain Spatial Data Management: An Overview. Handbook of Big Geospatial Data (355-397). DOI: 10.1007/978-3-030-55462-0_14. Online publication date: 17-Dec-2020

Published In

SSTD '19: Proceedings of the 16th International Symposium on Spatial and Temporal Databases
August 2019
245 pages
In-Cooperation

  • TU Wien

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Monte Carlo Sampling
  2. Representative Queries
  3. Uncertain Data

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

SSTD '19
