ABSTRACT
An increasing number of web sites offer structured information about recognizable concepts relevant to many application domains, such as finance, sports, and commercial products. However, web data is inherently imprecise and uncertain, and different web sources can provide conflicting values. Characterizing the uncertainty of web data is an important issue, and several models have recently been proposed in the literature. This paper illustrates state-of-the-art Bayesian models for evaluating the quality of data extracted from the Web and reports the results of an extensive application of the models to real-life web data. Our experimental results show that for some applications even simple approaches can provide effective results, while more sophisticated solutions are needed to obtain a more precise characterization of the uncertainty.
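To make the contrast between a simple approach and a Bayesian one concrete, the following is a minimal illustrative sketch, not the models evaluated in the paper. It assumes that each source has a known accuracy, that sources are independent, that a wrong source picks uniformly among `n_false` incorrect values, and that all candidate values are equally likely a priori. The function names, the `n_false` parameter, and the sample data are all hypothetical.

```python
import math
from collections import defaultdict


def majority_vote(claims):
    """Simple baseline: return the value reported by the most sources."""
    counts = defaultdict(int)
    for value in claims.values():
        counts[value] += 1
    return max(counts, key=counts.get)


def bayesian_posterior(claims, accuracy, n_false=10):
    """Posterior probability of each observed value being the true one.

    Assumptions (illustrative only): independent sources, known
    accuracies strictly between 0 and 1, a uniform prior over values,
    and errors spread uniformly over n_false wrong values.
    Normalization runs over the observed values only, which is a
    further simplification.
    """
    scores = defaultdict(float)
    for source, value in claims.items():
        a = accuracy[source]
        # Log-likelihood ratio contributed by a source voting for `value`.
        scores[value] += math.log(n_false * a / (1.0 - a))
    top = max(scores.values())  # subtract the max for numerical stability
    weights = {v: math.exp(s - top) for v, s in scores.items()}
    total = sum(weights.values())
    return {v: w / total for v, w in weights.items()}


# Hypothetical example: three sources disagree on a stock price.
claims = {"s1": "10.5", "s2": "10.7", "s3": "10.7"}
accuracy = {"s1": 0.95, "s2": 0.5, "s3": 0.5}

print(majority_vote(claims))
print(bayesian_posterior(claims, accuracy))
```

In this made-up example the majority vote selects the value backed by two unreliable sources, while the accuracy-weighted posterior favors the value reported by the single highly accurate source. In practice source accuracies are not known in advance and have to be estimated, which this sketch leaves out.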