Automatic Web Pages Author Extraction

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 5822)

Abstract

This paper addresses the problem of automatically extracting the author from heterogeneous HTML resources, a subproblem of automatic metadata extraction from (Web) documents.

We take a supervised machine learning approach to the problem, using the C4.5 decision tree algorithm. The distinguishing feature of our approach is that it exploits both structural and contextual information.
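As a rough illustration of how a C4.5-style learner could weigh structural against contextual cues, the sketch below computes the gain ratio, the split criterion C4.5 uses, on a toy set of candidate text nodes. The feature names (`in_meta_tag`, `follows_by_keyword`, `looks_like_name`) and the six labelled rows are hypothetical and are not taken from the paper; they only stand in for one structural, one contextual, and one lexical attribute.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def gain_ratio(rows, labels, feature):
    """Gain ratio of one feature: information gain / split information."""
    total = len(labels)
    # Group the class labels by the value the feature takes in each row.
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[feature], []).append(label)
    remainder = sum(len(p) / total * entropy(p) for p in partitions.values())
    info_gain = entropy(labels) - remainder
    split_info = -sum(len(p) / total * math.log2(len(p) / total)
                      for p in partitions.values())
    return info_gain / split_info if split_info > 0 else 0.0


# Hypothetical candidate text nodes from HTML pages (invented for this sketch).
rows = [
    {"in_meta_tag": True,  "follows_by_keyword": True,  "looks_like_name": True},
    {"in_meta_tag": False, "follows_by_keyword": True,  "looks_like_name": True},
    {"in_meta_tag": False, "follows_by_keyword": False, "looks_like_name": True},
    {"in_meta_tag": False, "follows_by_keyword": False, "looks_like_name": False},
    {"in_meta_tag": False, "follows_by_keyword": True,  "looks_like_name": True},
    {"in_meta_tag": False, "follows_by_keyword": False, "looks_like_name": False},
]
labels = ["author", "author", "other", "other", "author", "other"]

# Score every feature; the tree would split first on the highest gain ratio.
scores = {feature: gain_ratio(rows, labels, feature) for feature in rows[0]}
best_feature = max(scores, key=scores.get)
print(best_feature, scores)
```

On this toy data the contextual cue separates the two classes perfectly, so it would be chosen for the root split. A full C4.5 implementation would additionally recurse on each partition, handle continuous attributes, and prune the tree; none of that is shown here.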

We expanded the corpus semi-automatically so that the dataset could be annotated with less human effort.

This paper shows that our method achieves good results (more than 80% in terms of F1-measure) despite the heterogeneity of our corpus.
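For readers unfamiliar with the metric, the F1-measure is the harmonic mean of precision and recall. The following minimal sketch shows the computation; the counts in the usage line are invented for illustration and are not results from the paper.

```python
def f1_measure(true_positives, false_positives, false_negatives):
    """Harmonic mean of precision and recall, computed from raw counts."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 80 authors extracted correctly, 15 spurious, 20 missed.
example_f1 = f1_measure(80, 15, 20)
print(round(example_f1, 4))
```

With these made-up counts, precision is 80/95 and recall is 80/100, which gives an F1 of roughly 0.82.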

Copyright information

© 2009 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Changuel, S., Labroche, N., Bouchon-Meunier, B. (2009). Automatic Web Pages Author Extraction. In: Andreasen, T., Yager, R.R., Bulskov, H., Christiansen, H., Larsen, H.L. (eds) Flexible Query Answering Systems. FQAS 2009. Lecture Notes in Computer Science, vol 5822. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-04957-6_26

  • DOI: https://doi.org/10.1007/978-3-642-04957-6_26

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-04956-9

  • Online ISBN: 978-3-642-04957-6

  • eBook Packages: Computer Science, Computer Science (R0)
