Abstract
In this article we describe a novel information extraction task on the web and show how it can be solved effectively using the emerging class of conditional exponential models. The task is to learn to find specific goal pages on large domain-specific websites; an example is finding computer science publication pages starting from university root pages. We encode this as a sequential labeling problem and solve it using Conditional Random Fields (CRFs). These models let us exploit a wide variety of features in a unified probabilistic framework, including keywords and patterns extracted from and around hyperlinks and HTML pages, dependencies among the labels of adjacent pages, and existing databases of named entities. This is an important advantage over earlier rule-based and generative models in tackling the diversity of web data.
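The core idea of encoding the task as sequential labeling can be illustrated with a decoding sketch. The snippet below is a minimal, hand-set illustration (not the paper's implementation): each page on a candidate root-to-goal path is represented by features drawn from and around its hyperlinks, and Viterbi decoding combines per-page feature weights with label-transition weights, exactly the two ingredients a linear-chain CRF scores jointly. All labels, features, and weights here are hypothetical; in a trained CRF the weights would be learned from data.

```python
# Illustrative labels for pages on a root-to-goal path (hypothetical).
LABELS = ["homepage", "faculty", "publications"]

# Toy weights standing in for learned CRF parameters:
# (feature, label) -> weight, and (prev_label, label) -> weight.
FEATURE_WEIGHTS = {
    ("anchor:home", "homepage"): 1.0,
    ("anchor:people", "faculty"): 2.0,
    ("url:~", "faculty"): 1.5,
    ("anchor:papers", "publications"): 2.5,
}
TRANSITION_WEIGHTS = {
    ("homepage", "faculty"): 1.0,
    ("faculty", "publications"): 1.5,
    ("homepage", "publications"): 0.2,
}

def viterbi(page_features):
    """Return the highest-scoring label sequence for a sequence of pages,
    each given as a list of features extracted from and around its links."""
    n = len(page_features)
    # score[i][y] = best score of any labeling of pages 0..i ending in label y
    score = [{y: float("-inf") for y in LABELS} for _ in range(n)]
    back = [{y: None for y in LABELS} for _ in range(n)]
    for y in LABELS:
        score[0][y] = sum(FEATURE_WEIGHTS.get((f, y), 0.0) for f in page_features[0])
    for i in range(1, n):
        for y in LABELS:
            emit = sum(FEATURE_WEIGHTS.get((f, y), 0.0) for f in page_features[i])
            for yp in LABELS:  # dependency among labels of adjacent pages
                s = score[i - 1][yp] + TRANSITION_WEIGHTS.get((yp, y), 0.0) + emit
                if s > score[i][y]:
                    score[i][y], back[i][y] = s, yp
    # Backtrack from the best final label.
    y = max(LABELS, key=lambda l: score[n - 1][l])
    path = [y]
    for i in range(n - 1, 0, -1):
        y = back[i][y]
        path.append(y)
    return list(reversed(path))

pages = [["anchor:home"], ["anchor:people", "url:~"], ["anchor:papers"]]
print(viterbi(pages))  # ['homepage', 'faculty', 'publications']
```

The transition weights are what distinguish this from classifying each page independently: a page with weak local evidence can still be labeled "publications" if it follows a likely "faculty" page.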