Abstract
Web pages contain a large amount of structured data, which is useful for many advanced applications. Existing work has mainly focused on extracting structured data from web pages with individual wrappers, but has ignored the quality of the underlying web pages, which in fact seriously affects the extraction results. We therefore define the quality of a web page as the data quality that a wrapper can achieve when extracting from it. This paper proposes a novel approach to assessing the quality of web pages in the deep web. In our approach, we first define the schema of web data with a hierarchical model. Web pages are then treated as XML documents and parsed into DOM trees. The data units and attribute values in a web page are annotated with the schema semantics and with the XPath of their position in the DOM tree. Based on this annotation, we build an assessment model for the quality of web pages along two dimensions: the structure complexity and the text complexity of nodes in the DOM tree. Our model partitions quality into three levels, and web pages within the same level are compared using the proposed formulas. Moreover, because most existing wrappers cannot handle hierarchically structured data, we design an XQuery-based wrapper to extract data from web pages and to validate our quality model. The wrapper uses the annotation information to generate XQuery statements that extract the web data. Experimental results demonstrate that our approach accurately assesses the data quality of web pages, which is helpful for data quality control in deep web applications.
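The annotation step described in the abstract can be pictured with a minimal sketch. It assumes a well-formed XHTML fragment, a hypothetical tag-to-label mapping standing in for the schema, and two simple proxies (path depth and token count) in place of the paper's structure and text complexity measures, whose actual formulas are not given in the abstract.

```python
import xml.etree.ElementTree as ET

# Hypothetical page fragment; real deep web pages are rarely well-formed XML
# and would need cleaning before they can be parsed this way.
PAGE = """<html><body>
<div id="results">
  <ul>
    <li><span>Database Systems</span> <b>42.00</b></li>
    <li><span>Web Data Mining</span> <b>35.50</b></li>
  </ul>
</div>
</body></html>"""


def annotate(root, labels):
    """Record a schema label, a tag path, and two complexity proxies
    for every element whose tag appears in `labels`."""
    records = []

    def walk(node, path):
        # Simplified XPath-like path: positional predicates are omitted.
        here = f"{path}/{node.tag}"
        if node.tag in labels:
            text = (node.text or "").strip()
            records.append({
                "label": labels[node.tag],      # assumed schema attribute name
                "path": here,                   # position in the DOM tree
                "depth": here.count("/"),       # proxy for structure complexity
                "tokens": len(text.split()),    # proxy for text complexity
            })
        for child in node:
            walk(child, here)

    walk(root, "")
    return records


root = ET.fromstring(PAGE)
# The mapping from tags to schema attributes is an assumption for illustration.
records = annotate(root, {"span": "title", "b": "price"})
for record in records:
    print(record)
```

Each record pairs a schema label with the node's position in the tree, which is the kind of information a wrapper could use to generate an extraction query.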
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Nie, T., Yu, G., Shen, D., Kou, Y., Yue, D. (2011). An Approach to Assess the Quality of Web Pages in the Deep Web. In: Xu, J., Yu, G., Zhou, S., Unland, R. (eds) Database Systems for Advanced Applications. DASFAA 2011. Lecture Notes in Computer Science, vol 6637. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-20244-5_49
DOI: https://doi.org/10.1007/978-3-642-20244-5_49
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-20243-8
Online ISBN: 978-3-642-20244-5
eBook Packages: Computer Science, Computer Science (R0)