Abstract
Metadata provide a high-level description of digital library resources and are key to the discovery and selection of suitable resources. However, the growth in size and diversity of digital collections makes manual metadata extraction an expensive task. This paper proposes a new content-independent method to automatically generate metadata that characterize the resources in a given learning object repository. The key idea is to rely on the few metadata that already exist to learn predictive models of metadata values. Because the method does not depend on content, it handles resources in different formats: text, image, video, Java applet, etc.
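The key idea of predicting a metadata value from metadata that already exist can be illustrated with a minimal sketch. It is not the authors' implementation: a small categorical naive Bayes classifier with Laplace smoothing is swapped in for illustration, and the field names (`format`, `audience`, `difficulty`) and the records are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train(records, target):
    """Count class frequencies and, per class, the frequency of each field value."""
    class_counts = Counter()
    value_counts = defaultdict(Counter)  # (field, class) -> Counter of values
    field_values = defaultdict(set)      # field -> set of values seen in training
    for record in records:
        label = record[target]
        class_counts[label] += 1
        for field, value in record.items():
            if field == target:
                continue
            value_counts[(field, label)][value] += 1
            field_values[field].add(value)
    return class_counts, value_counts, field_values

def predict(model, partial):
    """Return the most probable target value given the fields already filled in."""
    class_counts, value_counts, field_values = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)
        for field, value in partial.items():
            if field not in field_values:
                continue  # field never seen during training, so it carries no evidence
            seen = value_counts[(field, label)][value]
            # Laplace smoothing keeps unseen values from zeroing the probability.
            score += math.log((seen + 1) / (count + len(field_values[field])))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical repository records; only metadata are used, never the content.
records = [
    {"format": "video", "audience": "school", "difficulty": "easy"},
    {"format": "video", "audience": "school", "difficulty": "easy"},
    {"format": "text", "audience": "university", "difficulty": "hard"},
    {"format": "applet", "audience": "university", "difficulty": "hard"},
    {"format": "text", "audience": "school", "difficulty": "easy"},
]
model = train(records, target="difficulty")
predicted = predict(model, {"format": "applet", "audience": "university"})
```

Since the classifier never reads the resource itself, the same code applies whether the record describes a text, a video, or an applet.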
Two classical machine learning approaches are studied in this paper. In the first, a supervised machine learning technique predicts the value of a metadata field from the other metadata fields that are already filled in. The second uses the FP-Growth algorithm to discover relationships between the different metadata fields, expressed as association rules. Experiments on two well-known educational data repositories show that both approaches can enhance metadata extraction and can even fill subjective metadata fields that are difficult to extract from the content of a resource, such as its difficulty.
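The association-rule approach can be sketched as follows, under stated assumptions. Each record becomes a transaction of `field=value` items, frequent itemsets are mined, and rules whose consequent is the field to be filled are kept. For brevity the sketch enumerates candidate itemsets by brute force rather than building an FP-tree; FP-Growth would return the same frequent itemsets more efficiently on real data. The records, field names, and thresholds are hypothetical.

```python
from itertools import combinations

def to_items(record):
    """Turn a metadata record into a transaction of 'field=value' items."""
    return frozenset(f"{field}={value}" for field, value in record.items())

def frequent_itemsets(transactions, min_support):
    """Return {itemset: count} for itemsets found in at least min_support transactions."""
    items = sorted(set().union(*transactions))
    frequent = {}
    for size in range(1, len(items) + 1):
        found = False
        for candidate in combinations(items, size):
            itemset = frozenset(candidate)
            count = sum(1 for t in transactions if itemset <= t)
            if count >= min_support:
                frequent[itemset] = count
                found = True
        if not found:
            break  # no frequent itemset of this size, hence none of a larger size
    return frequent

def rules_for(frequent, target_field, min_confidence):
    """Keep rules 'antecedent -> target_field=value' with enough confidence."""
    rules = []
    for itemset, count in frequent.items():
        if len(itemset) < 2:
            continue
        for item in itemset:
            if not item.startswith(target_field + "="):
                continue
            antecedent = itemset - {item}
            # Every subset of a frequent itemset is frequent, so the lookup succeeds.
            confidence = count / frequent[antecedent]
            if confidence >= min_confidence:
                rules.append((antecedent, item, confidence))
    return rules

# Hypothetical repository records.
records = [
    {"format": "video", "audience": "school", "difficulty": "easy"},
    {"format": "video", "audience": "school", "difficulty": "easy"},
    {"format": "text", "audience": "university", "difficulty": "hard"},
    {"format": "applet", "audience": "university", "difficulty": "hard"},
    {"format": "text", "audience": "school", "difficulty": "easy"},
]
transactions = [to_items(r) for r in records]
frequent = frequent_itemsets(transactions, min_support=2)
rules = rules_for(frequent, target_field="difficulty", min_confidence=0.9)
```

A rule such as `audience=university -> difficulty=hard` can then be applied to any record whose `difficulty` field is empty but whose `audience` field is filled in.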
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Changuel, S., Labroche, N. (2012). Content Independent Metadata Production as a Machine Learning Problem. In: Perner, P. (ed.) Machine Learning and Data Mining in Pattern Recognition. MLDM 2012. Lecture Notes in Computer Science, vol 7376. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-31537-4_24
DOI: https://doi.org/10.1007/978-3-642-31537-4_24
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-31536-7
Online ISBN: 978-3-642-31537-4
eBook Packages: Computer Science, Computer Science (R0)