ABSTRACT
Web content analysis often consists of two separate, sequential steps: Web classification, which identifies the target Web pages, and Web information extraction, which extracts the metadata contained in those pages. This decoupled strategy is highly ineffective because errors made in Web classification propagate to Web information extraction and accumulate there. In this paper we study the mutual dependencies between the two steps and propose to combine them in a single model based on Conditional Random Fields (CRFs). The model simultaneously recognizes the target Web pages and extracts the corresponding metadata. Systematic experiments in our project OfCourse, an online course search system, show that the model significantly improves the F1 value of both steps. We believe that the model can be easily generalized to many Web applications.
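The contrast between decoupled and joint decoding can be illustrated with a minimal sketch. This is a toy additive score decoded by brute force, not a trained CRF and not the model proposed in the paper; the label sets, the scores, and the names `pipeline_decode` and `joint_decode` are all hypothetical and chosen only for illustration.

```python
from itertools import product

# Hypothetical label sets for a toy course-page task.
PAGE_LABELS = ("course", "other")
TOKEN_LABELS = ("title", "none")

# Toy scores (assumptions, not data from the paper).
# The page-level evidence slightly favors "other" ...
PAGE_SCORE = {"course": 0.4, "other": 0.6}
# ... while the first token looks strongly like a course title.
TOKEN_SCORES = [
    {"title": 2.0, "none": 0.0},
    {"title": 0.0, "none": 1.0},
]
# Compatibility between page label and token label: a non-course page
# should not contain course titles, so that pairing is penalized.
COMPAT = {
    ("course", "title"): 0.5,
    ("course", "none"): 0.0,
    ("other", "title"): -3.0,
    ("other", "none"): 0.0,
}


def total_score(page, tokens, page_score, token_scores, compat):
    """Additive score of one full assignment (page label + token labels)."""
    score = page_score[page]
    for i, label in enumerate(tokens):
        score += token_scores[i][label] + compat[(page, label)]
    return score


def pipeline_decode(page_score, token_scores, compat):
    """Classify the page first, then label tokens with the page label fixed."""
    page = max(PAGE_LABELS, key=lambda c: page_score[c])
    # Tokens are independent given the page in this toy, so a per-token
    # argmax is exact for the second step.
    tokens = tuple(
        max(TOKEN_LABELS, key=lambda y: ts[y] + compat[(page, y)])
        for ts in token_scores
    )
    return page, tokens


def joint_decode(page_score, token_scores, compat):
    """Search all (page, token labels) assignments for the best total score."""
    n = len(token_scores)
    candidates = (
        (c, ys) for c in PAGE_LABELS for ys in product(TOKEN_LABELS, repeat=n)
    )
    return max(
        candidates,
        key=lambda cand: total_score(
            cand[0], cand[1], page_score, token_scores, compat
        ),
    )


print(pipeline_decode(PAGE_SCORE, TOKEN_SCORES, COMPAT))
print(joint_decode(PAGE_SCORE, TOKEN_SCORES, COMPAT))
```

With these toy numbers the pipeline commits to "other" on weak page-level evidence and then cannot label the title, whereas the joint search lets the strong token-level evidence change the page decision. Brute-force enumeration is used only to keep the sketch short; it grows exponentially with the number of tokens and would not be practical at realistic sizes.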