Learning to extract information from large domain-specific websites using sequential models

Published: 01 December 2004
Abstract

In this article we describe a novel information extraction task on the web and show how it can be solved effectively using the emerging conditional exponential models. The task involves learning to find specific goal pages on large domain-specific websites. An example of such a task is finding computer science publications starting from university root pages. We encode this as a sequential labeling problem and solve it using Conditional Random Fields (CRFs). These models let us exploit, in a unified probabilistic framework, a wide variety of features: keywords and patterns extracted from and around hyperlinks and HTML pages, dependencies among the labels of adjacent pages, and existing databases of named entities. This is an important advantage over previous rule-based or generative models in tackling the diversity of web data.
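
As an illustration of the sequential labeling view, the following minimal sketch decodes the best label sequence for one path of pages with the Viterbi algorithm, as a linear-chain CRF would at prediction time. It is not the article's implementation: the label set, the keyword features, and the hand-set weights are all hypothetical, and in an actual CRF the weights would be learned from labeled paths rather than written by hand.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical label set for pages on a path from a university root to a goal page.
LABELS = ["root", "department", "faculty", "goal"]

# Hypothetical state weights: (label, keyword seen on or around the page's link).
STATE_W: Dict[Tuple[str, str], float] = {
    ("root", "university"): 2.0,
    ("department", "department"): 2.0,
    ("faculty", "faculty"): 2.0,
    ("goal", "publications"): 2.0,
}

# Hypothetical transition weights: dependency between labels of adjacent pages.
TRANS_W: Dict[Tuple[str, str], float] = {
    ("root", "department"): 1.0,
    ("department", "faculty"): 1.0,
    ("faculty", "goal"): 1.0,
}


def viterbi(path: List[Set[str]], labels: List[str],
            state_w: Dict[Tuple[str, str], float],
            trans_w: Dict[Tuple[str, str], float]) -> List[str]:
    """Return the highest-scoring label sequence for a path of pages."""
    def emit(label: str, tokens: Set[str]) -> float:
        # Sum of state-feature weights that fire for this label on this page.
        return sum(state_w.get((label, tok), 0.0) for tok in tokens)

    # best[i][y]: best score of any labeling of path[:i+1] that ends in label y.
    best = [{y: emit(y, path[0]) for y in labels}]
    back: List[Dict[str, str]] = []
    for tokens in path[1:]:
        scores, pointers = {}, {}
        for y in labels:
            prev = max(labels,
                       key=lambda p: best[-1][p] + trans_w.get((p, y), 0.0))
            scores[y] = best[-1][prev] + trans_w.get((prev, y), 0.0) + emit(y, tokens)
            pointers[y] = prev
        best.append(scores)
        back.append(pointers)

    # Follow back-pointers from the best final label.
    seq = [max(labels, key=lambda y: best[-1][y])]
    for pointers in reversed(back):
        seq.append(pointers[seq[-1]])
    return list(reversed(seq))


# A toy path: each page is reduced to a set of keywords.
EXAMPLE_PATH = [
    {"university"},
    {"computer", "science", "department"},
    {"faculty", "people"},
    {"publications", "papers"},
]
print(viterbi(EXAMPLE_PATH, LABELS, STATE_W, TRANS_W))
```

With these toy weights the last page on the path is labeled `goal`. The transition weights are what carry the "dependency among labels of adjacent pages" mentioned in the abstract: a page with no informative keywords can still receive a sensible label from its neighbours on the path.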

References

  1. S. Chakrabarti, K. Punera, and M. Subramanyam. Accelerated focused crawling through online relevance feedback. In WWW, Hawaii. ACM, May 2002.
  2. Mark Craven, Dan DiPasquo, Dayne Freitag, Andrew K. McCallum, Tom M. Mitchell, Kamal Nigam, and Seán Slattery. Learning to construct Knowledge Bases from the World Wide Web. Artificial Intelligence, 118(1/2):69--113, 2000.
  3. Michelangelo Diligenti, Frans Coetzee, Steve Lawrence, C. Lee Giles, and Marco Gori. Focused crawling using Context graphs. In 26th International Conference on Very Large Databases, VLDB 2000, pages 527--534, Cairo, Egypt, 10--14 September 2000.
  4. J. R. Quinlan. Learning logical definitions from relations. Machine Learning, 5:239--266, 1990.
  5. John Lafferty, Andrew McCallum, and Fernando Pereira. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In Proceedings of the 18th International Conference on Machine Learning (ICML-2001), pages 282--289. Morgan Kaufmann, San Francisco, CA, 2001.
  6. D. C. Liu and J. Nocedal. On the limited memory BFGS method for large-scale optimization. Mathematical Programming, 45:503--528, 1989.
  7. Robert Malouf. A comparison of algorithms for maximum entropy parameter estimation. In Proceedings of The Sixth Conference on Natural Language Learning (CoNLL-2002), pages 49--55, 2002.
  8. Andrew McCallum and Wei Li. Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons. In Proceedings of The Seventh Conference on Natural Language Learning (CoNLL-2003), Edmonton, Canada, 2003.
  9. Lawrence R. Rabiner. A tutorial on Hidden Markov Models and selected applications in speech recognition. In Proceedings of the IEEE, volume 77(2), pages 257--286, February 1989.
  10. Jason Rennie and Andrew Kachites McCallum. Using reinforcement learning to spider the Web efficiently. In Ivan Bratko and Saso Dzeroski, editors, Proceedings of ICML-99, 16th International Conference on Machine Learning, pages 335--343, Bled, SL, 1999. Morgan Kaufmann Publishers, San Francisco, US.
  11. Fei Sha and Fernando Pereira. Shallow parsing with conditional random fields. In Proceedings of HLT-NAACL 2003, pages 213--220. Association for Computational Linguistics, 2003.
  12. V. G. Vinod Vydiswaran and Sunita Sarawagi. Learning to extract information from large websites using sequential models. In COMAD, 2005.

Published in

ACM SIGKDD Explorations Newsletter, Volume 6, Issue 2
December 2004
161 pages
ISSN: 1931-0145
EISSN: 1931-0153
DOI: 10.1145/1046456

Copyright © 2004 Authors

Publisher

Association for Computing Machinery
New York, NY, United States

Qualifiers: article
