Abstract
In this article we describe a novel information extraction task on the web and show how it can be solved effectively using the emerging class of conditional exponential models. The task is to learn to find specific goal pages on large domain-specific websites; an example is finding computer science publication pages starting from university root pages. We encode this as a sequential labeling problem and solve it using Conditional Random Fields (CRFs). These models let us exploit a wide variety of features in a unified probabilistic framework, including keywords and patterns extracted from and around hyperlinks and HTML pages, dependencies among the labels of adjacent pages, and existing databases of named entities. This is an important advantage over earlier rule-based and generative models in tackling the diversity of web data.
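The core idea of encoding the task as sequential labeling can be illustrated with a decoding sketch. The snippet below is a minimal, hand-set illustration (not the paper's implementation): each page on a candidate root-to-goal path is represented by features drawn from and around its hyperlinks, and Viterbi decoding combines per-page feature weights with label-transition weights, exactly the two ingredients a linear-chain CRF scores jointly. All labels, features, and weights here are hypothetical; in a trained CRF the weights would be learned from data.

```python
# Illustrative labels for pages on a root-to-goal path (hypothetical).
LABELS = ["homepage", "faculty", "publications"]

# Toy weights standing in for learned CRF parameters:
# (feature, label) -> weight, and (prev_label, label) -> weight.
FEATURE_WEIGHTS = {
    ("anchor:home", "homepage"): 1.0,
    ("anchor:people", "faculty"): 2.0,
    ("url:~", "faculty"): 1.5,
    ("anchor:papers", "publications"): 2.5,
}
TRANSITION_WEIGHTS = {
    ("homepage", "faculty"): 1.0,
    ("faculty", "publications"): 1.5,
    ("homepage", "publications"): 0.2,
}

def viterbi(page_features):
    """Return the highest-scoring label sequence for a sequence of pages,
    each given as a list of features extracted from and around its links."""
    n = len(page_features)
    # score[i][y] = best score of any labeling of pages 0..i ending in label y
    score = [{y: float("-inf") for y in LABELS} for _ in range(n)]
    back = [{y: None for y in LABELS} for _ in range(n)]
    for y in LABELS:
        score[0][y] = sum(FEATURE_WEIGHTS.get((f, y), 0.0) for f in page_features[0])
    for i in range(1, n):
        for y in LABELS:
            emit = sum(FEATURE_WEIGHTS.get((f, y), 0.0) for f in page_features[i])
            for yp in LABELS:  # dependency among labels of adjacent pages
                s = score[i - 1][yp] + TRANSITION_WEIGHTS.get((yp, y), 0.0) + emit
                if s > score[i][y]:
                    score[i][y], back[i][y] = s, yp
    # Backtrack from the best final label.
    y = max(LABELS, key=lambda l: score[n - 1][l])
    path = [y]
    for i in range(n - 1, 0, -1):
        y = back[i][y]
        path.append(y)
    return list(reversed(path))

pages = [["anchor:home"], ["anchor:people", "url:~"], ["anchor:papers"]]
print(viterbi(pages))  # ['homepage', 'faculty', 'publications']
```

The transition weights are what distinguish this from classifying each page independently: a page with weak local evidence can still be labeled "publications" if it follows a likely "faculty" page.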