Web Data Extraction Based on Structure Feature

Anxiang, Ma; Kening, Gao; Xiaohong, Zhang; Bin, Zhang

doi:10.1007/978-3-642-23235-0_75

Part of the book series: Communications in Computer and Information Science (CCIS, volume 226)

Included in the following conference series:

International Conference on Applied Informatics and Communication

Abstract

Most existing methods of Web data extraction are based on DOM tree analysis or wrapper building. However, the applicability and efficiency of these methods need further improvement. According to the amount of information they contain, Web pages are divided into two structure types, 1:1 and 1:N. Because Web pages of the same type have similar structure features, the paper proposes a two-phase approach to Web data extraction based on structure features. In the sample-learning phase, the structure features and depository rules of Web pages are obtained from the text features of sample pages. In the information-extraction phase, data is extracted by matching the page to be extracted against the depository rules in the knowledge base. Experimental results show that the proposed approach has good applicability and high efficiency.
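
The two phases described in the abstract can be illustrated with a small sketch. It is not the authors' algorithm, which is not available in this preview. It assumes that a "structure feature" can be approximated by the tag path leading to a text node, that a "depository rule" maps a field name to such a path, and that a 1:N page is one on which a rule path matches more than once. All names (`learn_rules`, `extract`, `page_type`, the sample pages) are hypothetical.

```python
from html.parser import HTMLParser


class PathTextParser(HTMLParser):
    """Collect (tag_path, text) pairs for every non-empty text node."""

    VOID = {"br", "img", "hr", "input", "meta", "link"}

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags
        self.nodes = []   # (path, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag not in self.VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerate sloppy markup.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.nodes.append(("/".join(self.stack), text))


def text_nodes(html):
    parser = PathTextParser()
    parser.feed(html)
    parser.close()
    return parser.nodes


def learn_rules(sample_html, labeled_values):
    """Sample-learning phase (assumed form): field name -> tag path of its known value."""
    nodes = text_nodes(sample_html)
    rules = {}
    for field, value in labeled_values.items():
        for path, text in nodes:
            if text == value:
                rules[field] = path
                break
    return rules


def page_type(html, rules):
    """Assumed reading of the two types: '1:N' if any rule path repeats, else '1:1'."""
    paths = [path for path, _ in text_nodes(html)]
    repeated = any(paths.count(p) > 1 for p in rules.values())
    return "1:N" if repeated else "1:1"


def extract(html, rules):
    """Extraction phase: for each field, collect every text whose path matches its rule."""
    nodes = text_nodes(html)
    return {field: [t for p, t in nodes if p == path] for field, path in rules.items()}


# Hypothetical sample page with known field values, and two pages to extract from.
sample = "<html><body><h1>Blue Kettle</h1><div><span>19.90</span></div></body></html>"
detail = "<html><body><h1>Red Teapot</h1><div><span>24.50</span></div></body></html>"
listing = ("<html><body><h1>A</h1><div><span>1</span></div>"
           "<h1>B</h1><div><span>2</span></div></body></html>")

rules = learn_rules(sample, {"title": "Blue Kettle", "price": "19.90"})
print(rules)
print(page_type(detail, rules), extract(detail, rules))
print(page_type(listing, rules), extract(listing, rules))
```

Exact tag-path matching keeps the sketch short, but it is brittle when page templates vary. A practical system would need looser matching, and the paper's own rules may be defined quite differently.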

This work is supported by “The Fundamental Research Funds for the Central Universities” (N100304003), “National Natural Science Foundation of China” (61073062), and “Natural Science Foundation of LiaoNing Province” (20102060).

Author information

Authors and Affiliations

College of Information Science and Engineering, Northeastern University, Shenyang, China
Ma Anxiang, Gao Kening, Zhang Xiaohong & Zhang Bin

Editor information

Editors and Affiliations

Suzhou University, No. 50 Donghuan Road, 215021, China
Jianwei Zhang

About this paper

Cite this paper

Anxiang, M., Kening, G., Xiaohong, Z., Bin, Z. (2011). Web Data Extraction Based on Structure Feature. In: Zhang, J. (eds) Applied Informatics and Communication. ICAIC 2011. Communications in Computer and Information Science, vol 226. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-23235-0_75

DOI: https://doi.org/10.1007/978-3-642-23235-0_75
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-23234-3
Online ISBN: 978-3-642-23235-0
eBook Packages: Computer Science, Computer Science (R0)