Content Independent Metadata Production as a Machine Learning Problem

Changuel, Sahar; Labroche, Nicolas

doi:10.1007/978-3-642-31537-4_24

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 7376)

Abstract

Metadata provide a high-level description of digital library resources and are key to enabling the discovery and selection of suitable resources. However, the growth in size and diversity of digital collections makes manual metadata extraction an expensive task. This paper proposes a new content-independent method to automatically generate metadata that characterize the resources in a given learning object repository. The key idea is to rely on the few metadata that already exist to learn predictive models of metadata values. Because the proposed method does not depend on content, it handles resources in different formats: text, image, video, Java applet, etc.
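
As an illustration of this key idea, the following is a minimal sketch and not the authors' implementation: a small categorical naive Bayes classifier in plain Python that predicts one metadata field from the fields that are already filled. The field names (format, audience, type, difficulty), the sample records, and the helper names train and predict are all hypothetical, and the choice of naive Bayes is an assumption made only for illustration.

```python
from collections import Counter, defaultdict
import math


def train(records, target):
    """Count class frequencies and per-class (field, value) frequencies."""
    class_counts = Counter()
    value_counts = defaultdict(Counter)  # class -> Counter of (field, value)
    field_values = defaultdict(set)      # field -> set of values seen in training
    for rec in records:
        label = rec[target]
        class_counts[label] += 1
        for field, value in rec.items():
            if field != target:
                value_counts[label][(field, value)] += 1
                field_values[field].add(value)
    return class_counts, value_counts, field_values


def predict(model, known_fields):
    """Return the most probable target value given the already filled fields."""
    class_counts, value_counts, field_values = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in class_counts.items():
        score = math.log(n / total)  # log prior of the class
        for field, value in known_fields.items():
            if field not in field_values:
                continue  # field never seen in training: ignore it
            k = len(field_values[field])
            # Laplace-smoothed log likelihood of this field value given the class
            score += math.log((value_counts[label][(field, value)] + 1) / (n + k))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical repository records; "difficulty" is the field to predict.
records = [
    {"format": "video", "audience": "school", "type": "lecture", "difficulty": "easy"},
    {"format": "text", "audience": "school", "type": "exercise", "difficulty": "easy"},
    {"format": "text", "audience": "university", "type": "exercise", "difficulty": "difficult"},
    {"format": "applet", "audience": "university", "type": "simulation", "difficulty": "difficult"},
    {"format": "text", "audience": "university", "type": "lecture", "difficulty": "difficult"},
]
model = train(records, target="difficulty")
print(predict(model, {"format": "text", "audience": "school"}))
```

The classifier never inspects the resource itself, only its other metadata fields, which is what lets the same model apply to a video, a text, or an applet.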

Two classical machine learning approaches are studied in this paper. In the first approach, a supervised machine learning technique classifies each value of the metadata field to be predicted according to the other metadata fields that are filled a priori. The second approach uses the FP-Growth algorithm to discover relationships between the different metadata fields in the form of association rules. Experiments on two well-known educational data repositories show that both approaches can enhance metadata extraction and can even fill subjective metadata fields that are hard to extract from the content of a resource, such as its difficulty.
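
The second approach can be sketched as follows, under stated assumptions. This example does not implement FP-Growth: it enumerates frequent itemsets by brute force, which yields the same frequent itemsets on a toy dataset but does not scale, and then derives association rules from them. The records, the thresholds, and the function names frequent_itemsets and rules are hypothetical.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Brute-force enumeration of itemsets with support >= min_support."""
    n = len(transactions)
    items = sorted({item for t in transactions for item in t})
    supports = {}
    for size in range(1, len(items) + 1):
        found = False
        for candidate in combinations(items, size):
            cset = frozenset(candidate)
            support = sum(cset <= t for t in transactions) / n
            if support >= min_support:
                supports[cset] = support
                found = True
        if not found:
            break  # no frequent itemset of this size, so none of a larger size
    return supports


def rules(supports, min_confidence):
    """Derive rules antecedent -> consequent with confidence >= min_confidence."""
    out = []
    for itemset, support in supports.items():
        for r in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), r):
                a = frozenset(antecedent)
                # Every subset of a frequent itemset is frequent, so a is in supports.
                confidence = support / supports[a]
                if confidence >= min_confidence:
                    out.append((a, itemset - a, support, confidence))
    return out


# Hypothetical records; each one becomes a transaction of (field, value) items.
records = [
    {"audience": "school", "difficulty": "easy"},
    {"audience": "school", "difficulty": "easy"},
    {"audience": "university", "difficulty": "difficult"},
    {"audience": "university", "difficulty": "difficult"},
    {"audience": "university", "difficulty": "easy"},
]
transactions = [frozenset(r.items()) for r in records]
found_rules = rules(frequent_itemsets(transactions, min_support=0.3), min_confidence=0.9)
for antecedent, consequent, support, confidence in found_rules:
    print(sorted(antecedent), "->", sorted(consequent), support, confidence)
```

A rule whose consequent is a value of an empty field can then be used to propose that value for any resource whose filled fields match the antecedent.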

Author information

Authors and Affiliations

CNRS, UMR7606, LIP6, Université Pierre et Marie Curie, Paris 6, France
Sahar Changuel & Nicolas Labroche

Editor information

Editors and Affiliations

Institute of Computer Vision and Applied Computer Sciences, IBaI, Kohlenstraße 2, 04107, Leipzig, Germany
Petra Perner

About this paper

Cite this paper

Changuel, S., Labroche, N. (2012). Content Independent Metadata Production as a Machine Learning Problem. In: Perner, P. (ed.) Machine Learning and Data Mining in Pattern Recognition. MLDM 2012. Lecture Notes in Computer Science, vol 7376. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-31537-4_24

DOI: https://doi.org/10.1007/978-3-642-31537-4_24
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-31536-7
Online ISBN: 978-3-642-31537-4
eBook Packages: Computer Science, Computer Science (R0)
