Abstract
A large and increasing number of web sites publish structured data about recognizable concepts (such as stock quotes, movies, and restaurants). The opportunity to build applications that rely on the huge amount of data available from these sites has been discussed for more than a decade, but in practice only a small fraction of this information is currently used. The main reason is that extracting and integrating good-quality web data is an expensive task that often requires human intervention. In this chapter, we present the main results of the Flint project, which aims to develop automatic, domain-independent tools that perform all the steps required to benefit from web data: discovering data-intensive web sites containing information about entities of interest, extracting and integrating the published data, and performing a probabilistic analysis to characterize the imprecision of the data and the accuracy of the sources. The result of the processing is semantically annotated data that can be used to populate a probabilistic database and to develop novel applications.
Notes
1. The distance between an attribute and a mapping is measured from the centroid of the mapping.
2. The names of the models presented in this chapter are inspired by those introduced by Dong et al. in [31].
References
Agichtein, E., Gravano, L.: Snowball: extracting relations from large plain-text collections. DL ’00, pp. 85–94 (2000)
Amento, B., Terveen, L.G., Hill, W.C.: Does “authority” mean quality? predicting expert quality ratings of web documents. SIGIR, pp. 296–303 (2000)
Arasu, A., Garcia-Molina, H.: Extracting structured data from web pages. ACM SIGMOD international conference on management of data (SIGMOD’2003), San Diego, California, pp. 337–348 (2003)
Banko, M., Cafarella, M., Soderland, S., Broadhead, M., Etzioni, O.: Open information extraction from the web. IJCAI (2007)
Batini, C., Scannapieco, M.: Data Quality: Concepts, Methodologies, and Techniques. Springer, Berlin, Heidelberg, New York (2008)
Bilke, A., Naumann, F.: Schema matching using duplicates. ICDE, pp. 69–80 (2005)
Blanco, L., Bronzi, M., Crescenzi, V., Merialdo, P., Papotti, P.: Exploiting information redundancy to wring out structured data from the web. In: Rappa, M., Jones, P., Freire, J., Chakrabarti, S. (eds.) WWW, pp. 1063–1064. ACM, New York (2010)
Blanco, L., Bronzi, M., Crescenzi, V., Merialdo, P., Papotti, P.: Redundancy-driven web data extraction and integration. WebDB (2010)
Blanco, L., Bronzi, M., Crescenzi, V., Merialdo, P., Papotti, P.: Automatically building probabilistic databases from the web. WWW (Companion Volume), pp. 185–188 (2011)
Blanco, L., Crescenzi, V., Merialdo, P.: Efficiently locating collections of web pages to wrap. WEBIST (2005)
Blanco, L., Crescenzi, V., Merialdo, P., Papotti, P.: Supporting the automatic construction of entity aware search engines. WIDM, pp. 149–156 (2008)
Blanco, L., Crescenzi, V., Merialdo, P., Papotti, P.: Probabilistic models to reconcile complex data from inaccurate data sources. CAiSE, pp. 83–97 (2010)
Blanco, L., Crescenzi, V., Merialdo, P., Papotti, P.: Contextual data extraction and instance-based integration. International workshop on searching and integrating new web data sources (VLDS) (2011)
Blanco, L., Crescenzi, V., Merialdo, P., Papotti, P.: Wrapper generation for overlapping web sources. Web Intelligence (WI) (2011)
Blanco, L., Dalvi, N.N., Machanavajjhala, A.: Highly efficient algorithms for structural clustering of large websites. WWW, pp. 437–446 (2011)
Brin, S.: Extracting patterns and relations from the World Wide Web. Proceedings of the First Workshop on the Web and Databases (WebDB’98) (in conjunction with EDBT’98), pp. 102–108 (1998)
Cafarella, M.J., Etzioni, O., Suciu, D.: Structured queries over web text. IEEE Data Eng. Bull. 29(4), 45–51 (2006)
Cafarella, M.J., Halevy, A.Y., Khoussainova, N.: Data integration for the relational web. PVLDB 2(1), 1090–1101 (2009)
Cafarella, M.J., Halevy, A.Y., Wang, D.Z., Wu, E., Zhang, Y.: Webtables: exploring the power of tables on the web. PVLDB 1(1), 538–549 (2008)
Chakrabarti, S., van den Berg, M., Dom, B.: Focused crawling: a new approach to topic-specific Web resource discovery. Comput. Networks (Amsterdam, Netherlands) 31(11–16), 1623–1640 (1999)
Chang, K.C.C., Bin, H., Zhen, Z.: Toward large scale integration: building a metaquerier over databases on the web. CIDR 2005, pp. 44–66 (2005)
Chuang, S.L., Chang, K.C.C., Zhai, C.X.: Context-aware wrapping: synchronized data extraction. VLDB, pp. 699–710 (2007)
Clemen, R.T., Winkler, R.L.: Combining probability distributions from experts in risk analysis. Risk Anal. 19(2), 187–203 (1999)
Crescenzi, V., Mecca, G., Merialdo, P.: RoadRunner: towards automatic data extraction from large Web sites. International conference on very large data bases (VLDB 2001), Roma, Italy, 11–14 September 2001, pp. 109–118 (2001)
Dalvi, N.N., Suciu, D.: Management of probabilistic data: foundations and challenges. PODS, pp. 1–12 (2007)
Dill, S., Eiron, N., Gibson, D., Gruhl, D., Guha, R., Jhingran, A., Kanungo, T., Rajagopalan, S., Tomkins, A., Tomlin, J.A., Zien, J.Y.: Semtag and seeker: bootstrapping the semantic web via automated semantic annotation. WWW ’03: proceedings of the 12th International Conference on World Wide Web, pp. 178–186. ACM, New York, NY, USA (2003). http://doi.acm.org/10.1145/775152.775178
Do, H.H., Rahm, E.: Matching large schemas: approaches and evaluation. Inf. Syst. 32(6), 857–885 (2007)
Doan, A., Madhavan, J., Domingos, P., Halevy, A.: Learning to map between ontologies on the semantic web. WWW ’02, pp. 662–673 (2002)
Doan, A., Ramakrishnan, R., Chen, F., DeRose, P., Lee, Y., McCann, R., Sayyadian, M., Shen, W.: Community information management. IEEE Data Eng. Bull. 29(1), 64–72 (2006)
Dong, X., Berti-Equille, L., Hu, Y., Srivastava, D.: Global detection of complex copying relationships between sources. PVLDB 3(1), 1358–1369 (2010)
Dong, X.L., Berti-Equille, L., Srivastava, D.: Integrating conflicting data: the role of source dependence. PVLDB 2(1), 550–561 (2009)
Dong, X.L., Berti-Equille, L., Srivastava, D.: Truth discovery and copying detection in a dynamic world. PVLDB 2(1), 562–573 (2009)
Downey, D., Etzioni, O., Soderland, S.: A probabilistic model of redundancy in information extraction. IJCAI, pp. 1034–1041 (2005)
Florescu, D., Koller, D., Levy, A.Y.: Using probabilistic information in data integration. VLDB, pp. 216–225 (1997)
Galland, A., Abiteboul, S., Marian, A., Senellart, P.: Corroborating information from disagreeing views. Proceedings of WSDM, New York, USA (2010)
Guha, R., McCool, R.: Tap: a semantic web platform. Comput. Networks 42(5), 557–577 (2003)
Madhavan, J., Bernstein, P.A., Doan, A., Halevy, A.Y.: Corpus-based schema matching. ICDE, pp. 57–68 (2005)
Manning, C.D., Raghavan, P., Schütze, H.: Introduction to Information Retrieval. Cambridge University Press, Cambridge (2008). http://www.informationretrieval.org
Pant, G., Srinivasan, P.: Learning to crawl: comparing classification schemes. ACM Trans. Inf. Syst. 23(4), 430–462 (2005)
Sarma, A.D., Dong, X., Halevy, A.Y.: Bootstrapping pay-as-you-go data integration systems. SIGMOD conference, pp. 861–874 (2008)
Shen, W., DeRose, P., McCann, R., Doan, A., Ramakrishnan, R.: Toward best-effort information extraction. SIGMOD conference, pp. 1031–1042 (2008)
Shen, W., DeRose, P., Vu, L., Doan, A., Ramakrishnan, R.: Source-aware entity matching: a compositional approach. ICDE, pp. 196–205. IEEE Computer Society, Silver Spring, MD (2007)
Sizov, S., Biwer, M., Graupmann, J., Siersdorfer, S., Theobald, M., Weikum, G., Zimmer, P.: The bingo! system for information portal generation and expert web search. CIDR 2003, First Biennial conference on innovative data systems research, Asilomar, CA, USA (2003)
Vidal, M.L.A., da Silva, A.S., de Moura, E.S., Cavalcanti, J.M.B.: Structure-driven crawler generation by example. In: Efthimiadis, E.N., Dumais, S.T., Hawking, D., Järvelin, K. (eds.) SIGIR, pp. 292–299. ACM, New York (2006)
Wu, M., Marian, A.: Corroborating answers from multiple web sources. WebDB (2007)
Yin, X., Han, J., Yu, P.S.: Truth discovery with multiple conflicting information providers on the web. IEEE Trans. Knowl. Data Eng. 20(6), 796–808 (2008)
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Blanco, L., Bronzi, M., Crescenzi, V., Merialdo, P., Papotti, P. (2012). Flint: From Web Pages to Probabilistic Semantic Data. In: De Virgilio, R., Guerra, F., Velegrakis, Y. (eds) Semantic Search over the Web. Data-Centric Systems and Applications. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-25008-8_13
DOI: https://doi.org/10.1007/978-3-642-25008-8_13
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-25007-1
Online ISBN: 978-3-642-25008-8
eBook Packages: Computer Science, Computer Science (R0)