Automatic Extraction Rules Generation Based on XPath Pattern Learning

Zhang, Jingwei; Zhang, Can; Qian, Weining; Zhou, Aoying

doi:10.1007/978-3-642-24396-7_6

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 6724)

Abstract

Web forums have become important information sources on the Web because of the rich content that millions of Internet users contribute every day. Extracting data from Web pages is a key step in data analysis, but it is cumbersome because it requires significant human intervention. Web forums have fairly regular structures, which allow extraction rules to be generated automatically from their paths. In this paper, we introduce formal expressions for XPath patterns and pattern mapping rules, and propose machine learning methods that generate extraction rules for automatic data extraction from Web forums. Experimental results on real-life Web forums show that the approach is feasible and accurate for forum data.
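The idea of deriving an extraction rule from the paths of a few example nodes can be illustrated with a minimal sketch. It is not the authors' algorithm, whose formal XPath patterns and mapping rules are not shown on this page; it only assumes that example nodes are given as absolute XPaths of equal depth and generalizes them by dropping positional indexes that differ between examples. The function names `parse_xpath`, `generalize`, and `extract`, and the sample `PAGE`, are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# One XPath step: a tag name with an optional positional index, e.g. div[2].
STEP = re.compile(r"^(?P<tag>[\w-]+)(?:\[(?P<idx>\d+)\])?$")


def parse_xpath(path):
    """Split an absolute XPath such as /html/body/div[2] into (tag, index) steps."""
    steps = []
    for raw in path.strip("/").split("/"):
        match = STEP.match(raw)
        if match is None:
            raise ValueError(f"unsupported step: {raw!r}")
        idx = match.group("idx")
        steps.append((match.group("tag"), int(idx) if idx else None))
    return steps


def generalize(paths):
    """Keep an index where all examples agree on it; drop it where they differ."""
    parsed = [parse_xpath(p) for p in paths]
    if any(len(p) != len(parsed[0]) for p in parsed):
        raise ValueError("example paths must have the same depth")
    out = []
    for column in zip(*parsed):
        if len({tag for tag, _ in column}) != 1:
            raise ValueError("example paths must share tag names")
        tag, idx = column[0]
        same_index = len({i for _, i in column}) == 1
        out.append(f"{tag}[{idx}]" if same_index and idx is not None else tag)
    return "/" + "/".join(out)


def extract(root, pattern):
    """Apply a generalized pattern to a parsed page and return the node texts."""
    # ElementTree evaluates paths relative to the root element,
    # so the leading root step (/html) is dropped here.
    relative = "./" + "/".join(pattern.strip("/").split("/")[1:])
    return [el.text for el in root.findall(relative)]


# Hypothetical forum page: one table per post, author in td[1], body in td[2].
PAGE = """<html><body>
<div>navigation</div>
<div>
<table><tr><td>alice</td><td>first post</td></tr></table>
<table><tr><td>bob</td><td>second post</td></tr></table>
<table><tr><td>carol</td><td>third post</td></tr></table>
</div></body></html>"""

if __name__ == "__main__":
    # Two labeled example nodes are enough to generalize over the table index.
    rule = generalize([
        "/html/body/div[2]/table[1]/tr[1]/td[2]",
        "/html/body/div[2]/table[2]/tr[1]/td[2]",
    ])
    print(rule)
    print(extract(ET.fromstring(PAGE), rule))
```

Two labeled posts yield a rule that also matches the third, unlabeled post. Real forum pages are rarely well-formed XML, so a practical version would need an HTML parser and a more tolerant notion of path similarity than exact step-by-step agreement.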

Author information

Authors and Affiliations

Institute of Massive Computing, East China Normal University, Shanghai, 200062, China
Jingwei Zhang, Can Zhang, Weining Qian & Aoying Zhou

Editor information

Editors and Affiliations

Dickson Computer Systems, 7A Victory Avenue 4/F Homantin, Kowloon, Hong Kong, China
Dickson K. W. Chiu
Ecole Nationale Supérieure de Mécanique et d’Aérotechnique, Laboratoire d’Informatique Scientifique et Industrielle, Téléport 2 - avenue Clément Ader, 86961, Futuroscope Chasseneuil Cedex, France
Ladjel Bellatreche
Dept. of Computer Science and Engineering, Ritsumeikan University, Wakakusa 6-4-10, 525-0045, Kusatu, Shiga, Japan
Hideyasu Sasaki
Department of Computer Science and Engineering, The Chinese University of Hong Kong, Sha Tin, Hong Kong, China
Ho-fung Leung
Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong, China
Shing-Chi Cheung
School of Computer Science, Hangzhou Dianzi University, Xiasha Higher Education Zone, 310018, Hangzhou City, Zhejiang, China
Haiyang Hu
Department of Computer Science and Software Engineering, The University of Melbourne, 3010, Parkville, Victoria, Australia
Jie Shao

About this paper

Cite this paper

Zhang, J., Zhang, C., Qian, W., Zhou, A. (2011). Automatic Extraction Rules Generation Based on XPath Pattern Learning. In: Chiu, D.K.W., et al. Web Information Systems Engineering – WISE 2010 Workshops. WISE 2010. Lecture Notes in Computer Science, vol 6724. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-24396-7_6

DOI: https://doi.org/10.1007/978-3-642-24396-7_6
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-24395-0
Online ISBN: 978-3-642-24396-7
eBook Packages: Computer Science, Computer Science (R0)
