Automatic Web Information Extraction Based on Rules

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 6997)

Abstract

Web information extraction is the first step of effective web mining. This article summarizes a few heuristic rules that describe the characteristics of the main content of web pages. The rules are built from pre-defined terms and metrics, which can be considered reusable and extensible across different kinds of HTML pages. A probabilistic model that uses these rules and metrics is then proposed, and the corresponding algorithm is implemented. The algorithm is tested on 1000 randomly selected web pages. The experiment shows that it is more precise than other algorithms and more applicable to the diverse structures of different web sites.
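
The abstract does not list the rules, the metrics, or the form of the probabilistic model, so the following is only an illustrative sketch of the general idea: score each HTML block with a few heuristic rules and combine the rule outputs probabilistically. Everything in it is an assumption made for illustration and not the authors' method: the three rules (text length, link density, punctuation count), the probability values attached to them, the naive-Bayes-style log-odds combination, and the names `BlockCollector`, `rule_probabilities`, `content_score`, and `extract_main_content`.

```python
import math
from html.parser import HTMLParser

# Assumption: these tags delimit candidate content blocks.
BLOCK_TAGS = {"div", "p", "td", "li", "article", "section"}


class BlockCollector(HTMLParser):
    """Collects simple metrics for each block-level element (assumes well-formed HTML)."""

    def __init__(self):
        super().__init__()
        self.blocks = []      # finished blocks that contain text
        self._open = []       # stack of blocks currently open
        self._link_depth = 0  # > 0 while inside an <a> element

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._open.append({"text": [], "chars": 0, "link_chars": 0, "punct": 0})
        elif tag == "a":
            self._link_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self._link_depth > 0:
            self._link_depth -= 1
        elif tag in BLOCK_TAGS and self._open:
            block = self._open.pop()
            if block["chars"] > 0:
                self.blocks.append(block)

    def handle_data(self, data):
        text = data.strip()
        if not text or not self._open:
            return
        block = self._open[-1]  # attribute text to the innermost open block
        block["text"].append(text)
        block["chars"] += len(text)
        block["punct"] += sum(text.count(c) for c in ",.;:!?")
        if self._link_depth:
            block["link_chars"] += len(text)


def rule_probabilities(block):
    """Hypothetical rules: each returns P(main content | rule outcome)."""
    link_ratio = block["link_chars"] / block["chars"]
    return [
        0.9 if block["chars"] >= 80 else 0.3,  # main content tends to be long
        0.2 if link_ratio > 0.5 else 0.8,      # navigation is mostly link text
        0.8 if block["punct"] >= 2 else 0.4,   # prose contains punctuation
    ]


def content_score(block):
    """Combine the rules by summing log-odds, treating them as independent."""
    log_odds = sum(math.log(p / (1 - p)) for p in rule_probabilities(block))
    return 1 / (1 + math.exp(-log_odds))


def extract_main_content(html):
    """Return the text of the highest-scoring block, or '' if none is found."""
    collector = BlockCollector()
    collector.feed(html)
    collector.close()
    if not collector.blocks:
        return ""
    best = max(collector.blocks, key=content_score)
    return " ".join(best["text"])


page = """<html><body>
<div><a href="/">Home</a> <a href="/news">News</a> <a href="/about">About</a></div>
<p>Web information extraction is the first step of web mining. It separates the main text of a page from menus, advertisements, and other noise.</p>
<div><a href="/privacy">Privacy policy</a></div>
</body></html>"""

print(extract_main_content(page))
```

Summing log-odds keeps each rule independent of the others, so a rule can be added or re-weighted without touching the rest. That is one plausible reading of "reusable and extensible"; the paper itself may combine its rules differently.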

This research is supported by the National Science and Technology Pillar Program of China under grant number 2009BAH46B03 and the National Natural Science Foundation of China under grant number 61003126.

Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

Cite this paper

Hu, F., Ruan, T., Shao, Z., Ding, J. (2011). Automatic Web Information Extraction Based on Rules. In: Bouguettaya, A., Hauswirth, M., Liu, L. (eds) Web Information System Engineering – WISE 2011. Lecture Notes in Computer Science, vol 6997. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-24434-6_21

  • DOI: https://doi.org/10.1007/978-3-642-24434-6_21

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-24433-9

  • Online ISBN: 978-3-642-24434-6

  • eBook Packages: Computer Science (R0)