research-article

Characterizing the uncertainty of web data: models and experiences

Authors:
Lorenzo Blanco, Valter Crescenzi, Paolo Merialdo, Paolo Papotti

Università degli Studi Roma Tre, Rome, Italy

WebQuality '11: Proceedings of the 2011 Joint WICOW/AIRWeb Workshop on Web Quality, March 2011, Pages 1–8
https://doi.org/10.1145/1964114.1964116

Published: 28 March 2011

ABSTRACT

An increasing number of web sites offer structured information about recognizable concepts that are relevant to many application domains, such as finance, sports, and commercial products. However, web data is inherently imprecise and uncertain, and different web sources can provide conflicting values. Characterizing the uncertainty of web data is an important issue, and several models have recently been proposed in the literature. The paper illustrates state-of-the-art Bayesian models for evaluating the quality of data extracted from the Web and reports the results of an extensive application of the models to real-life web data. Our experimental results show that for some applications even simple approaches can provide effective results, while sophisticated solutions are needed to obtain a more precise characterization of the uncertainty.
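
To make the contrast between a simple approach and a Bayesian one concrete, the following is a minimal sketch and not the models evaluated in the paper. It assumes that sources are independent, that each source has a known accuracy strictly between 0 and 1, and that a wrong source picks uniformly among a fixed number of false values. The function names, the n_false parameter, and the sample data are all hypothetical, and the probabilities are normalized over the observed values only, which is a simplification.

```python
import math
from collections import Counter


def naive_vote(claims):
    """Simple approach: each source counts equally.

    claims maps a source name to the value it reports.
    Returns a dict mapping each value to the fraction of sources reporting it.
    """
    counts = Counter(claims.values())
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


def accuracy_weighted_vote(claims, accuracy, n_false=10):
    """Bayesian-style approach: a source's vote is weighted by its accuracy.

    accuracy maps a source name to an assumed accuracy in (0, 1).
    n_false is the assumed number of distinct wrong values (hypothetical).
    Returns a dict mapping each observed value to a posterior-like
    probability, normalized over the observed values only.
    """
    scores = {}
    for source, value in claims.items():
        a = accuracy[source]
        # Log-odds contribution of one source under the stated assumptions.
        scores[value] = scores.get(value, 0.0) + math.log(n_false * a / (1.0 - a))
    # Subtract the maximum score before exponentiating for numerical stability.
    top = max(scores.values())
    weights = {value: math.exp(score - top) for value, score in scores.items()}
    norm = sum(weights.values())
    return {value: weight / norm for value, weight in weights.items()}


if __name__ == "__main__":
    # Hypothetical example: three sources disagree on a stock quote.
    claims = {"s1": "10.5", "s2": "10.7", "s3": "10.7"}
    accuracy = {"s1": 0.95, "s2": 0.5, "s3": 0.5}
    print(naive_vote(claims))
    print(accuracy_weighted_vote(claims, accuracy))
```

With this sample data the naive vote favors the value reported by two sources, while the accuracy-weighted vote favors the value reported by the single highly accurate source. This shows why a more detailed model can yield a different characterization of the uncertainty than counting alone.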

Index Terms

1. Information systems
  1. World Wide Web
    1. Web interfaces
      1. Browsers

Published in

WebQuality '11: Proceedings of the 2011 Joint WICOW/AIRWeb Workshop on Web Quality
March 2011
55 pages
ISBN: 9781450307062
DOI: 10.1145/1964114

Copyright © 2011 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
data reconciliation
probabilistic data
web data extraction