Abstract
The increasing volume and variety of science data have led to the creation of metadata extraction systems that automatically derive and synthesize relevant information from files. A critical component of metadata extraction systems is a mechanism for mapping extractors (lightweight tools that mine information from particular file types) to each file in a repository. However, existing methods do little to address the heterogeneity and scale of science data, thereby leaving valuable data unextracted or wasting significant compute resources applying incorrect extractors to data. We construct an extractor scheduler that leverages file type identification (FTI) methods. We show that by training lightweight multi-label, multi-class statistical models on byte samples from files, we can correctly map 35% more extractors to files than by using libmagic. Further, we introduce a metadata quality toolkit to automatically assess the utility of extracted metadata.
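To make the idea of mapping extractors from byte samples concrete, the following is a minimal sketch, not the paper's implementation. It assumes a sample made of the file head plus randomly chosen bytes, a 256-bin byte histogram as the feature vector, and a simple per-extractor centroid model with a cosine-similarity cutoff standing in for the statistical models the abstract mentions. The function names, the extractor labels ("tabular", "keyword", "image"), the sample sizes, and the 0.8 threshold are all hypothetical.

```python
import random
from collections import Counter


def sample_bytes(data: bytes, head: int = 512, rand: int = 512, seed: int = 0) -> bytes:
    """Return the first `head` bytes plus up to `rand` bytes at random offsets in the rest."""
    rng = random.Random(seed)
    rest = data[head:]
    k = min(rand, len(rest))
    offsets = sorted(rng.sample(range(len(rest)), k))
    return data[:head] + bytes(rest[i] for i in offsets)


def byte_histogram(sample: bytes) -> list:
    """Normalized frequency of each of the 256 possible byte values."""
    counts = Counter(sample)
    total = len(sample) or 1
    return [counts.get(b, 0) / total for b in range(256)]


def cosine(u, v) -> float:
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def fit_centroids(examples) -> dict:
    """Average the feature vectors of every file labeled with each extractor.

    `examples` is a list of (feature_vector, set_of_extractor_labels); a file
    may carry several labels, which is what makes the task multi-label.
    """
    sums, counts = {}, Counter()
    for vec, labels in examples:
        for label in labels:
            acc = sums.setdefault(label, [0.0] * len(vec))
            for i, x in enumerate(vec):
                acc[i] += x
            counts[label] += 1
    return {label: [x / counts[label] for x in acc] for label, acc in sums.items()}


def predict_extractors(vec, centroids: dict, threshold: float = 0.8) -> set:
    """Return every extractor whose centroid is similar enough to the file's features."""
    return {label for label, c in centroids.items() if cosine(vec, c) >= threshold}


# Hypothetical training data: one text-like file and one binary-like file.
text_file = b"temperature,pressure\n1.0,2.0\n" * 50
binary_file = bytes(range(128, 256)) * 20
centroids = fit_centroids([
    (byte_histogram(sample_bytes(text_file)), {"tabular", "keyword"}),
    (byte_histogram(sample_bytes(binary_file)), {"image"}),
])

# A shorter file with the same kind of content should map to the text extractors.
new_file = b"temperature,pressure\n1.0,2.0\n" * 30
predicted = predict_extractors(byte_histogram(sample_bytes(new_file)), centroids)
```

Because each extractor is scored independently, one file can be routed to several extractors at once, while an extractor whose centroid is dissimilar is skipped rather than run and wasted. A real system would replace the centroid model with a trained classifier and tune the sample sizes and threshold on labeled data.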
Acknowledgements
We gratefully acknowledge Takuya Kurihana (University of Chicago) for sharing his machine learning expertise. This work is supported in part by the National Science Foundation under Grants No. 2004894 and 1757970, and used resources of the Argonne Leadership Computing Facility.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Skluzacek, T.J., Chen, M., Hsu, E., Chard, K., Foster, I. (2022). Models and Metrics for Mining Meaningful Metadata. In: Groen, D., de Mulatier, C., Paszynski, M., Krzhizhanovskaya, V.V., Dongarra, J.J., Sloot, P.M.A. (eds) Computational Science – ICCS 2022. ICCS 2022. Lecture Notes in Computer Science, vol 13350. Springer, Cham. https://doi.org/10.1007/978-3-031-08751-6_30
DOI: https://doi.org/10.1007/978-3-031-08751-6_30
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-08750-9
Online ISBN: 978-3-031-08751-6
eBook Packages: Computer Science, Computer Science (R0)