Machine Learning for Semi-structured Multimedia Documents: Application to Pornographic Filtering and Thematic Categorization

Denoyer, Ludovic; Gallinari, Patrick

doi:10.1007/978-3-540-75171-7_10

Part of the book series: Cognitive Technologies (COGTECH)

Abstract

We propose a generative statistical model for classifying semi-structured multimedia documents. Its main originality is that it takes into account both the structural and the content information present in a semi-structured document, and that it copes with different types of content (text, image, etc.). We then present the results of two sets of experiments:

• The first concerns the filtering of pornographic Web pages.

• The second concerns the thematic classification of Wikipedia documents.

Author information

Authors and Affiliations

LIP6, UPMC, Paris, France
Ludovic Denoyer & Patrick Gallinari

Editor information

Editors and Affiliations

UPMC University, CNRS (UMR 7606) Lab. LIP6, 104 Avenue du Président Kennedy, 75016 Paris, France
Matthieu Cord
University College Dublin, School of Computer Science & Informatics, Belfield, Dublin 2, Ireland
Pádraig Cunningham

About this chapter

Cite this chapter

Denoyer, L., Gallinari, P. (2008). Machine Learning for Semi-structured Multimedia Documents: Application to Pornographic Filtering and Thematic Categorization. In: Cord, M., Cunningham, P. (eds) Machine Learning Techniques for Multimedia. Cognitive Technologies. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-75171-7_10

DOI: https://doi.org/10.1007/978-3-540-75171-7_10
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-75170-0
Online ISBN: 978-3-540-75171-7
eBook Packages: Computer Science, Computer Science (R0)
