An Approach to Assess the Quality of Web Pages in the Deep Web

Nie, Tiezheng; Yu, Ge; Shen, Derong; Kou, Yue; Yue, Dejun

doi:10.1007/978-3-642-20244-5_49

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 6637)

Included in the following conference series:

International Conference on Database Systems for Advanced Applications

Abstract

Web pages contain a large amount of structured data that is useful for many advanced applications. Existing work has focused mainly on extracting structured data from web pages with individual wrappers, but has ignored the quality of the underlying web pages, which in fact seriously affects the extraction results. We therefore define the quality of a web page as the data quality a wrapper can achieve when extracting from it. This paper proposes a novel approach to assessing the quality of web pages in the deep web. In our approach, we first define the schema of web data with a hierarchical model. Web pages are then treated as XML documents and parsed into DOM trees. The data units and attribute values in a web page are annotated with their schema semantics and with the XPath of their position in the DOM tree. Based on these annotations, we build an assessment model for web page quality with two dimensions: the structure complexity and the text complexity of nodes in the DOM tree. The model partitions quality into three levels, and web pages within the same level are compared using the proposed formulas. Moreover, because most existing wrappers cannot handle hierarchically structured data, we design an XQuery-based wrapper to extract data from web pages and validate our quality model. The wrapper uses the annotation information to generate XQuery statements that extract the web data. The experimental results demonstrate that our approach accurately assesses the data quality of web pages, which is helpful for data quality control in deep-web applications.
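
To make the annotation step concrete, the following is a minimal Python sketch, not the authors' implementation. It parses a small, well-formed page into a tree, records a positional XPath for each node, and attaches a schema label to each data value. The sample page, the `SCHEMA` mapping, and the `depth` and `tokens` fields are all assumptions made for illustration. In particular, `depth` and `tokens` are only crude stand-ins for the structure and text dimensions; the paper's actual formulas and quality levels are not given in the abstract.

```python
import xml.etree.ElementTree as ET

# Hypothetical, already well-formed result page. Real deep-web pages would
# usually need cleaning before they can be parsed as XML.
PAGE = """<html><body>
<div class="record"><span class="title">XML Basics</span><span class="price">USD 10</span></div>
<div class="record"><span class="title">Web Data</span><span class="price">USD 12</span></div>
</body></html>"""

# Hypothetical schema: maps a class attribute to a path in a hierarchical schema.
SCHEMA = {"title": "book/title", "price": "book/price"}


def positional_xpaths(root):
    """Yield (element, xpath) pairs in document order.

    Each step carries a 1-based position among same-tag siblings.
    """
    def walk(elem, path):
        yield elem, path
        counts = {}
        for child in elem:
            counts[child.tag] = counts.get(child.tag, 0) + 1
            yield from walk(child, f"{path}/{child.tag}[{counts[child.tag]}]")

    yield from walk(root, f"/{root.tag}")


def annotate(page):
    """Return one annotation per node whose class appears in SCHEMA."""
    root = ET.fromstring(page)
    annotations = []
    for elem, xpath in positional_xpaths(root):
        label = SCHEMA.get(elem.get("class", ""))
        if label is None:
            continue  # not a data value according to the assumed schema
        text = (elem.text or "").strip()
        annotations.append({
            "schema": label,
            "xpath": xpath,
            "value": text,
            "depth": xpath.count("/") - 1,  # illustrative structure proxy (root = 0)
            "tokens": len(text.split()),    # illustrative text proxy
        })
    return annotations


if __name__ == "__main__":
    for ann in annotate(PAGE):
        print(ann)
```

Annotations of this shape carry both the schema path and the node position, which is the kind of information a wrapper would need to generate an extraction query for each attribute.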

Author information

Authors and Affiliations

College of Information Science and Engineering, Northeastern University, 110819, Shenyang, China
Tiezheng Nie, Ge Yu, Derong Shen, Yue Kou & Dejun Yue

Editor information

Editors and Affiliations

Department of Computer Science, Hong Kong Baptist University, Kowloon Tong, KLN, Hong Kong, China
Jianliang Xu
School of Information Science and Engineering, Northeastern University, Shenyang, 110004, Liaoning, China
Ge Yu
School of Computer Science, Fudan University, 220 Handan Road, 200433, Shanghai, China
Shuigeng Zhou
Institute for Computer Science and Business Information Systems (ICB), University of Duisburg-Essen, Schützenbahn 70, 45117, Essen, Germany
Rainer Unland

Cite this paper

Nie, T., Yu, G., Shen, D., Kou, Y., Yue, D. (2011). An Approach to Assess the Quality of Web Pages in the Deep Web. In: Xu, J., Yu, G., Zhou, S., Unland, R. (eds) Database Systems for Advanced Applications. DASFAA 2011. Lecture Notes in Computer Science, vol 6637. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-20244-5_49

DOI: https://doi.org/10.1007/978-3-642-20244-5_49
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-20243-8
Online ISBN: 978-3-642-20244-5
eBook Packages: Computer Science, Computer Science (R0)
