Abstract
Modern data processing techniques such as entity resolution, data cleaning, information extraction, and automated tagging often produce objects whose attribute values are uncertain. This uncertainty is frequently captured as a set of mutually exclusive candidate values for each uncertain attribute, each with an associated probability. However, lay end users, as well as some end applications, may be unable to interpret results output in such a form. The question, then, is how to present such results in practice, for example, to support the attribute-value selection and object selection queries a user might issue. Specifically, in this article we study the problem of maximizing the quality of the answers to these selection queries on top of such a probabilistic representation. Quality is measured using standard set-based quality metrics. We formalize the problem and then develop efficient approaches that provide high-quality answers for these queries. A comprehensive empirical evaluation over three different domains demonstrates the advantage of our approach over existing techniques.
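To make the setting concrete, the following minimal sketch shows one way to represent objects with a probabilistic attribute and to score an object selection answer with set-based metrics. It is an illustration under stated assumptions, not the article's algorithm: the object ids, names, probabilities, ground truth, the simple probability-threshold selection rule, and the choice of precision, recall, and F1 as the set-based metrics are all hypothetical.

```python
from typing import Dict, Set, Tuple

# Hypothetical representation: one uncertain attribute per object, stored as
# mutually exclusive candidate values mapped to their probabilities.
ProbAttr = Dict[str, float]


def select_objects(objects: Dict[str, ProbAttr], value: str, threshold: float) -> Set[str]:
    """Return ids of objects whose probability of taking `value` is at least `threshold`.

    Thresholding is an assumed baseline strategy, used here only for illustration.
    """
    return {oid for oid, dist in objects.items() if dist.get(value, 0.0) >= threshold}


def precision_recall_f1(answer: Set[str], truth: Set[str]) -> Tuple[float, float, float]:
    """Set-based precision, recall, and F1 of an answer set against a ground-truth set."""
    true_positives = len(answer & truth)
    precision = true_positives / len(answer) if answer else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f1


# Hypothetical data: three objects with an uncertain "name" attribute.
objects = {
    "o1": {"Smith": 0.9, "Smyth": 0.1},
    "o2": {"Smith": 0.4, "Schmidt": 0.6},
    "o3": {"Jones": 1.0},
}
truth = {"o1", "o2"}  # assumed ground truth for the query name == "Smith"

answer = select_objects(objects, "Smith", threshold=0.5)
precision, recall, f1 = precision_recall_f1(answer, truth)
print(answer, precision, recall, round(f1, 3))
```

In this toy data, lowering the threshold to 0.3 would also return `o2`, trading a stricter answer for higher recall. Choosing answers that balance such trade-offs is the kind of quality question the abstract describes.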
Index Terms
- Attribute and object selection queries on objects with probabilistic attributes