Abstract
We propose a generative statistical model for classifying semi-structured multimedia documents. Its main originality is that it takes into account both the structural and the content information present in a semi-structured document, and that it copes with different types of content (text, image, etc.). We then present the results of two sets of experiments:
• The first concerns the filtering of pornographic Web pages.
• The second concerns the thematic classification of Wikipedia documents.
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Denoyer, L., Gallinari, P. (2008). Machine Learning for Semi-structured Multimedia Documents: Application to Pornographic Filtering and Thematic Categorization. In: Cord, M., Cunningham, P. (eds) Machine Learning Techniques for Multimedia. Cognitive Technologies. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-75171-7_10
DOI: https://doi.org/10.1007/978-3-540-75171-7_10
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-75170-0
Online ISBN: 978-3-540-75171-7
eBook Packages: Computer Science, Computer Science (R0)