Abstract
This paper addresses the problem of automatically extracting the author from heterogeneous HTML resources, a subproblem of automatic metadata extraction from (Web) documents.
We take a supervised machine learning approach based on the C4.5 decision tree algorithm. The particularity of our approach is that it exploits both structural and contextual information.
A semi-automatic procedure was used to expand the corpus, reducing the human effort needed to annotate the dataset.
We show that our method achieves good results (an F1-measure above 80%) despite the heterogeneity of our corpus.
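As an illustration of the kind of decision a C4.5 tree makes, the following minimal sketch computes the gain ratio, the split criterion C4.5 uses, for one structural and one contextual feature. It is not the authors' implementation: the feature names (`tag`, `after_by`), the toy candidates, and the labels are hypothetical and serve only to show how the criterion ranks features.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(values)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(values).values())


def gain_ratio(rows, labels, feature):
    """Gain ratio obtained by splitting `rows` on a categorical `feature`."""
    total = len(rows)
    # Group the class labels by the value the feature takes in each row.
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[feature], []).append(label)
    # Information gain: entropy before the split minus weighted entropy after.
    remainder = sum(len(p) / total * entropy(p) for p in partitions.values())
    info_gain = entropy(labels) - remainder
    # Split information penalizes features with many distinct values.
    split_info = entropy([row[feature] for row in rows])
    return info_gain / split_info if split_info > 0 else 0.0


# Hypothetical candidate text nodes (assumed data, not from the paper):
#   tag      -> structural feature: the enclosing HTML element
#   after_by -> contextual feature: the text follows the word "by"
rows = [
    {"tag": "span", "after_by": True},
    {"tag": "span", "after_by": False},
    {"tag": "p", "after_by": True},
    {"tag": "p", "after_by": False},
    {"tag": "div", "after_by": True},
    {"tag": "div", "after_by": False},
]
labels = ["author", "other", "author", "other", "author", "other"]

# The feature with the highest gain ratio would be chosen for the split.
scores = {f: gain_ratio(rows, labels, f) for f in ("tag", "after_by")}
best = max(scores, key=scores.get)
```

In this toy set the contextual feature separates the two classes perfectly while the structural one carries no information, so `best` is `after_by`. A full C4.5 learner would apply the same criterion recursively and then prune the resulting tree.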
Copyright information
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Changuel, S., Labroche, N., Bouchon-Meunier, B. (2009). Automatic Web Pages Author Extraction. In: Andreasen, T., Yager, R.R., Bulskov, H., Christiansen, H., Larsen, H.L. (eds) Flexible Query Answering Systems. FQAS 2009. Lecture Notes in Computer Science, vol 5822. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-04957-6_26
DOI: https://doi.org/10.1007/978-3-642-04957-6_26
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-04956-9
Online ISBN: 978-3-642-04957-6
eBook Packages: Computer Science (R0)