
Towards combining web classification and web information extraction: a case study

Authors:
Ping Luo, HP Labs China, Beijing, China
Fen Lin, Institute of Computing Technology, CAS, Beijing, China
Yuhong Xiong, HP Labs China, Beijing, China
Yong Zhao, HP Labs China, Beijing, China
Zhongzhi Shi, Institute of Computing Technology, CAS, Beijing, China

KDD '09: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, June 2009, Pages 1235–1244. https://doi.org/10.1145/1557019.1557152

Published: 28 June 2009


ABSTRACT

Web content analysis often proceeds in two sequential, separate steps: Web classification, which identifies the target Web pages, and Web information extraction, which extracts the metadata contained in those pages. This decoupled strategy is highly ineffective, because errors in Web classification propagate to Web information extraction and accumulate there. In this paper we study the mutual dependencies between the two steps and propose to combine them in a single model based on Conditional Random Fields (CRFs). The model simultaneously recognizes the target Web pages and extracts the corresponding metadata. Systematic experiments in our project OfCourse, for online course search, show that the model significantly improves the F1 value for both steps. We believe that the model can be easily generalized to many Web applications.
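To make the contrast between the decoupled strategy and joint recognition concrete, here is a minimal toy sketch in Python. It is not the authors' CRF model: the label sets, the scores, and the `compat` coupling term are all hypothetical values invented for illustration, and inference is done by brute-force enumeration rather than by any algorithm from the paper. It only shows how a page-level decision made in isolation can suppress extraction, while scoring the page label and the field labels together lets strong field evidence correct the page label.

```python
from itertools import product

# Hypothetical label sets (assumptions, not taken from the paper).
PAGE_LABELS = ("course", "other")
FIELD_LABELS = ("title", "instructor", "none")

# Invented local scores (log-potentials). Page-level evidence alone
# slightly favors "other"; block-level evidence favors real metadata.
page_score = {"course": -0.2, "other": 0.0}
block_scores = [
    {"title": 1.5, "instructor": 0.1, "none": 0.4},
    {"title": 0.0, "instructor": 1.2, "none": 0.5},
]

def compat(page, field):
    """Assumed coupling between the page label and one block label."""
    if page == "other":
        # A non-target page should carry no course metadata.
        return 0.0 if field == "none" else -2.0
    return 0.3 if field != "none" else 0.0

def joint_score(page, fields):
    """Sum of page, block, and page-block compatibility scores."""
    total = page_score[page]
    for scores, field in zip(block_scores, fields):
        total += scores[field] + compat(page, field)
    return total

def joint_decode():
    """Pick the page label and field labels together (exhaustive search)."""
    candidates = (
        (page, fields)
        for page in PAGE_LABELS
        for fields in product(FIELD_LABELS, repeat=len(block_scores))
    )
    return max(candidates, key=lambda c: joint_score(*c))

def pipeline_decode():
    """Classify the page first, then extract only if it is a target page."""
    page = max(PAGE_LABELS, key=page_score.get)
    if page == "other":
        return page, tuple("none" for _ in block_scores)
    fields = tuple(
        max(FIELD_LABELS, key=lambda f: scores[f] + compat(page, f))
        for scores in block_scores
    )
    return page, fields

if __name__ == "__main__":
    print("pipeline:", pipeline_decode())
    print("joint:   ", joint_decode())
```

With these invented scores, the pipeline labels the page "other" and extracts nothing, whereas the joint decoder labels it "course" and assigns "title" and "instructor" to the two blocks. Exhaustive enumeration is only practical for a toy of this size.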



Published in
KDD '09: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining
June 2009
1426 pages
ISBN: 9781605584959
DOI: 10.1145/1557019
General Chairs: John Elder (Elder Research, Inc., USA), Françoise Soulié Fogelman (KXEN, France)
Program Chairs: Peter Flach (University of Bristol, UK), Mohammed Zaki (RPI, USA)
Copyright © 2009 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from ACM.
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
classification
graphical model
information extraction
Qualifiers
- research-article
