DOI: 10.1145/1557019.1557152
research-article

Towards combining web classification and web information extraction: a case study

Published: 28 June 2009

ABSTRACT

Web content analysis often consists of two separate, sequential steps: Web classification, which identifies the target Web pages, and Web information extraction, which extracts the metadata contained in those pages. This decoupled strategy is highly ineffective because errors in Web classification propagate to Web information extraction and eventually accumulate to a high level. In this paper, we study the mutual dependencies between the two steps and propose to combine them using a Conditional Random Field (CRF) model. The model simultaneously recognizes the target Web pages and extracts the corresponding metadata. Systematic experiments in our OfCourse project for online course search show that the model significantly improves the F1 value for both steps. We believe that our model can be easily generalized to many Web applications.
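The contrast between the decoupled strategy and joint decoding can be illustrated with a small sketch. This is not the authors' CRF model: it is a toy log-linear scorer decoded by brute-force enumeration, and every name and number in it (PAGE_SCORES, TOKEN_SCORES, COMPAT, the label sets) is a hypothetical value chosen for illustration. It shows how a page that the classifier narrowly rejects is lost entirely by a pipeline, while a joint score lets strong extraction evidence correct the page label.

```python
from itertools import product

PAGE_LABELS = ("course", "other")
TOKEN_LABELS = ("TITLE", "O")

# Hypothetical scores (higher is better); not taken from the paper.
# The page classifier alone slightly prefers "other".
PAGE_SCORES = {"course": -0.2, "other": 0.1}
# Per-token extraction scores: two tokens look strongly like a course title.
TOKEN_SCORES = [
    {"TITLE": 2.0, "O": 0.0},
    {"TITLE": 1.5, "O": 0.0},
    {"TITLE": -1.0, "O": 0.0},
]
# Compatibility between page label and token label:
# a non-course page should not contain course-title tokens.
COMPAT = {
    ("course", "TITLE"): 0.0,
    ("course", "O"): 0.0,
    ("other", "TITLE"): -5.0,
    ("other", "O"): 0.0,
}


def pipeline_decode(page_scores, token_scores):
    """Decoupled strategy: classify first, extract only from accepted pages."""
    page = max(PAGE_LABELS, key=lambda y: page_scores[y])
    if page != "course":
        # A classification error here cannot be repaired by the extractor.
        return page, ["O"] * len(token_scores)
    tokens = [max(TOKEN_LABELS, key=lambda t: s[t]) for s in token_scores]
    return page, tokens


def joint_score(page, tokens, page_scores, token_scores, compat):
    """Sum of page, token, and page-token compatibility scores."""
    total = page_scores[page]
    for label, scores in zip(tokens, token_scores):
        total += scores[label] + compat[(page, label)]
    return total


def joint_decode(page_scores, token_scores, compat):
    """Joint strategy: pick the best (page label, token labels) assignment.

    Brute-force enumeration is exponential in the number of tokens and is
    used only because the example is tiny.
    """
    best = None
    for page in PAGE_LABELS:
        for tokens in product(TOKEN_LABELS, repeat=len(token_scores)):
            s = joint_score(page, tokens, page_scores, token_scores, compat)
            if best is None or s > best[0]:
                best = (s, page, list(tokens))
    return best[1], best[2]


print(pipeline_decode(PAGE_SCORES, TOKEN_SCORES))
print(joint_decode(PAGE_SCORES, TOKEN_SCORES, COMPAT))
```

With these made-up scores, the pipeline labels the page "other" and extracts nothing, whereas joint decoding labels it "course" and marks the first two tokens as TITLE. A real CRF would learn such scores from data and use dynamic programming or approximate inference instead of enumeration.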

  • Published in

    KDD '09: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining
    June 2009
    1426 pages
    ISBN: 9781605584959
    DOI: 10.1145/1557019

    Copyright © 2009 ACM

    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Acceptance Rates

    Overall Acceptance Rate: 1,133 of 8,635 submissions, 13%
