Automatic Web Pages Author Extraction

Changuel, Sahar; Labroche, Nicolas; Bouchon-Meunier, Bernadette

doi:10.1007/978-3-642-04957-6_26

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 5822)

Abstract

This paper addresses the problem of automatically extracting the author from heterogeneous HTML resources, a subproblem of automatic metadata extraction from (Web) documents.

We take a supervised machine learning approach to the problem, using the C4.5 decision tree algorithm. The particularity of our approach is that it exploits both structural and contextual information.
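As an illustration of the learning step, C4.5 chooses the attribute to split on by its gain ratio (information gain divided by split information). The sketch below computes that criterion on a toy set of author candidates. The feature names `in_meta_tag` (structural) and `follows_by_keyword` (contextual), as well as the data, are hypothetical and are not taken from the paper, whose actual feature set is not given in the abstract.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(values)
    return -sum((n / total) * log2(n / total) for n in Counter(values).values())


def gain_ratio(rows, labels, feature):
    """C4.5 split criterion: information gain divided by split information."""
    total = len(rows)
    # Group the class labels by the value the feature takes in each row.
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[feature], []).append(label)
    remainder = sum(len(p) / total * entropy(p) for p in partitions.values())
    info_gain = entropy(labels) - remainder
    split_info = entropy([row[feature] for row in rows])
    return info_gain / split_info if split_info > 0 else 0.0


# Hypothetical candidate strings from HTML pages, described by one
# structural and one contextual feature (assumed names, toy values).
candidates = [
    {"in_meta_tag": True, "follows_by_keyword": False},
    {"in_meta_tag": False, "follows_by_keyword": True},
    {"in_meta_tag": False, "follows_by_keyword": True},
    {"in_meta_tag": False, "follows_by_keyword": False},
    {"in_meta_tag": False, "follows_by_keyword": False},
    {"in_meta_tag": True, "follows_by_keyword": False},
]
is_author = [True, True, True, False, False, False]

# The root of the tree would split on the feature with the highest gain ratio.
best_feature = max(candidates[0], key=lambda f: gain_ratio(candidates, is_author, f))
print(best_feature)
```

A full C4.5 implementation would apply this criterion recursively, handle continuous attributes, and prune the resulting tree; only the split selection is shown here.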

A semi-automatic approach to corpus expansion was used so that the dataset could be annotated with less human effort.

This paper shows that our method achieves good results (more than 80% in terms of F1-measure) despite the heterogeneity of our corpus.
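For reference, the F1-measure is the harmonic mean of precision and recall. The following minimal sketch shows the computation; the counts in the usage lines are made up for illustration and are not results from the paper.

```python
def f1_score(true_positives, false_positives, false_negatives):
    """Harmonic mean of precision and recall; returns 0.0 when undefined."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    if predicted == 0 or actual == 0:
        return 0.0
    precision = true_positives / predicted
    recall = true_positives / actual
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative counts only: 8 authors correctly extracted,
# 2 spurious extractions, 2 authors missed.
example_f1 = f1_score(8, 2, 2)
print(round(example_f1, 2))
```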

Author information

Authors and Affiliations

Laboratoire d’Informatique de Paris 6 (LIP6), DAPA, LIP6, 104, Avenue du Président Kennedy, 75016, Paris, France
Sahar Changuel, Nicolas Labroche & Bernadette Bouchon-Meunier

Editor information

Editors and Affiliations

Department of Computer Science, Roskilde University, Universitetsvej 1, 4000, Roskilde, Denmark
Troels Andreasen & Henrik Bulskov
Iona College, Machine Intelligence Institute, 10801, New Rochelle, NY, USA
Ronald R. Yager
Computer Science Dept., Research group PLIS: Programming, Roskilde University, Universitetsvej 1, 4000, Roskilde, Denmark
Henning Christiansen
Department of Computer Science and Engineering, Aalborg University Esbjerg, Niels Bohrs Vej 8, 6700, Esbjerg, Denmark
Henrik Legind Larsen

About this paper

Cite this paper

Changuel, S., Labroche, N., Bouchon-Meunier, B. (2009). Automatic Web Pages Author Extraction. In: Andreasen, T., Yager, R.R., Bulskov, H., Christiansen, H., Larsen, H.L. (eds) Flexible Query Answering Systems. FQAS 2009. Lecture Notes in Computer Science, vol 5822. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-04957-6_26

DOI: https://doi.org/10.1007/978-3-642-04957-6_26
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-04956-9
Online ISBN: 978-3-642-04957-6
eBook Packages: Computer Science, Computer Science (R0)
