Machine Learning for Semi-structured Multimedia Documents: Application to Pornographic Filtering and Thematic Categorization

  • Chapter
Machine Learning Techniques for Multimedia

Part of the book series: Cognitive Technologies (COGTECH)

Abstract

We propose a generative statistical model for classifying semi-structured multimedia documents. Its main originality is that it simultaneously takes into account the structural and the content information present in a semi-structured document, and that it copes with different types of content (text, images, etc.). We then present the results of two sets of experiments:

• The first concerns the filtering of pornographic Web pages.

• The second concerns the thematic classification of Wikipedia documents.

Copyright information

© 2008 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Denoyer, L., Gallinari, P. (2008). Machine Learning for Semi-structured Multimedia Documents: Application to Pornographic Filtering and Thematic Categorization. In: Cord, M., Cunningham, P. (eds) Machine Learning Techniques for Multimedia. Cognitive Technologies. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-75171-7_10

  • DOI: https://doi.org/10.1007/978-3-540-75171-7_10

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-75170-0

  • Online ISBN: 978-3-540-75171-7

  • eBook Packages: Computer Science, Computer Science (R0)
