DOI: 10.1145/1964114.1964116
research-article

Characterizing the uncertainty of web data: models and experiences

Published: 28 March 2011

ABSTRACT

An increasing number of web sites offer structured information about recognizable concepts that are relevant to many application domains, such as finance, sports, and commercial products. However, web data is inherently imprecise and uncertain, and different web sources can provide conflicting values. Characterizing the uncertainty of web data is an important issue, and several models have recently been proposed in the literature. The paper illustrates state-of-the-art Bayesian models to evaluate the quality of data extracted from the Web and reports the results of an extensive application of the models to real-life web data. Our experimental results show that for some applications even simple approaches can provide effective results, while sophisticated solutions are needed to obtain a more precise characterization of the uncertainty.


    • Published in

      WebQuality '11: Proceedings of the 2011 Joint WICOW/AIRWeb Workshop on Web Quality
      March 2011
      55 pages
      ISBN: 9781450307062
      DOI: 10.1145/1964114

      Copyright © 2011 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from ACM.

      Publisher

      Association for Computing Machinery

      New York, NY, United States
