Models and Metrics for Mining Meaningful Metadata

Skluzacek, Tyler J.; Chen, Matthew; Hsu, Erica; Chard, Kyle; Foster, Ian

doi:10.1007/978-3-031-08751-6_30


Conference paper
First Online: 15 June 2022


Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13350)

Abstract

The increasing volume and variety of science data have led to the creation of metadata extraction systems that automatically derive and synthesize relevant information from files. A critical component of such systems is a mechanism for mapping extractors (lightweight tools that mine information from a particular file type) to each file in a repository. However, existing methods do little to address the heterogeneity and scale of science data, and so either leave valuable data unextracted or waste significant compute resources applying incorrect extractors. We construct an extractor scheduler that leverages file type identification (FTI) methods. We show that by training lightweight multi-label, multi-class statistical models on byte samples from files, we can correctly map 35% more extractors to files than by using libmagic. Further, we introduce a metadata quality toolkit to automatically assess the utility of extracted metadata.
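To make the idea of mapping extractors from byte samples concrete, here is a minimal sketch. It is not the authors' model: the abstract does not specify the classifier or the features, so this example assumes a byte-frequency histogram over a head-plus-random sample and swaps in a simple nearest-centroid rule. The names `sample_bytes`, `histogram`, and `CentroidExtractorModel`, the sample sizes, and the distance threshold are all hypothetical.

```python
import random
from collections import defaultdict


def sample_bytes(data, head=64, rand=64, seed=0):
    """Return the first `head` bytes plus `rand` bytes drawn at random offsets.

    Sampling sizes and the fixed seed are illustrative assumptions.
    """
    rng = random.Random(seed)
    head_part = data[:head]
    rest = data[head:]
    if not rest:
        return head_part
    rand_part = bytes(rest[rng.randrange(len(rest))] for _ in range(rand))
    return head_part + rand_part


def histogram(sample):
    """Normalized 256-bin byte-frequency vector for a byte sample."""
    counts = [0] * 256
    for b in sample:
        counts[b] += 1
    total = len(sample) or 1
    return [c / total for c in counts]


class CentroidExtractorModel:
    """Hypothetical multi-label model: one centroid per extractor label.

    A file is assigned every extractor whose centroid lies within
    `threshold` (L1 distance, range 0..2) of the file's histogram.
    """

    def __init__(self, threshold=0.5):
        self.threshold = threshold
        self.centroids = {}

    def fit(self, files, labels):
        sums = defaultdict(lambda: [0.0] * 256)
        counts = defaultdict(int)
        for data, file_labels in zip(files, labels):
            vec = histogram(sample_bytes(data))
            for label in file_labels:
                counts[label] += 1
                acc = sums[label]
                for i, v in enumerate(vec):
                    acc[i] += v
        self.centroids = {
            label: [v / counts[label] for v in acc] for label, acc in sums.items()
        }
        return self

    def predict(self, data):
        vec = histogram(sample_bytes(data))
        matched = set()
        for label, centroid in self.centroids.items():
            dist = sum(abs(a - b) for a, b in zip(vec, centroid))
            if dist <= self.threshold:
                matched.add(label)
        return matched


# Toy training data: an ASCII table, ASCII prose, and a high-byte binary blob.
csv_file = b"a,b,c\n1,2,3\n" * 20
text_file = b"the quick brown fox jumps over the lazy dog " * 10
binary_file = bytes(range(128, 256)) * 4

model = CentroidExtractorModel().fit(
    [csv_file, text_file, binary_file],
    [{"tabular", "keyword"}, {"keyword"}, {"image"}],
)
```

Calling `model.predict(some_bytes)` returns a set of extractor names, so a scheduler could run only those extractors instead of trying every one. A real system would presumably use a stronger classifier and far more training files; the point here is only the shape of the pipeline: sample bytes, featurize, predict a set of labels.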



Acknowledgements

We gratefully acknowledge Takuya Kurihana (University of Chicago) for sharing his machine learning expertise. This work is supported in part by the National Science Foundation under Grants No. 2004894 and 1757970, and used resources of the Argonne Leadership Computing Facility.

Author information

Authors and Affiliations

University of Chicago, Chicago, IL, USA
Tyler J. Skluzacek, Kyle Chard & Ian Foster
University of Illinois at Urbana-Champaign, Champaign, IL, USA
Matthew Chen
Carnegie Mellon University, Pittsburgh, PA, USA
Erica Hsu
Argonne National Lab, Lemont, IL, USA
Kyle Chard & Ian Foster


Corresponding author

Correspondence to Tyler J. Skluzacek.

Editor information

Editors and Affiliations

Brunel University London, London, UK
Derek Groen
University of Amsterdam, Amsterdam, The Netherlands
Clélia de Mulatier
AGH University of Science and Technology, Krakow, Poland
Maciej Paszynski
University of Amsterdam, Amsterdam, The Netherlands
Valeria V. Krzhizhanovskaya
University of Tennessee at Knoxville, Knoxville, TN, USA
Jack J. Dongarra
University of Amsterdam, Amsterdam, The Netherlands
Peter M. A. Sloot


About this paper

Cite this paper

Skluzacek, T.J., Chen, M., Hsu, E., Chard, K., Foster, I. (2022). Models and Metrics for Mining Meaningful Metadata. In: Groen, D., de Mulatier, C., Paszynski, M., Krzhizhanovskaya, V.V., Dongarra, J.J., Sloot, P.M.A. (eds) Computational Science – ICCS 2022. ICCS 2022. Lecture Notes in Computer Science, vol 13350. Springer, Cham. https://doi.org/10.1007/978-3-031-08751-6_30


DOI: https://doi.org/10.1007/978-3-031-08751-6_30
Published: 15 June 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-08750-9
Online ISBN: 978-3-031-08751-6
eBook Packages: Computer Science, Computer Science (R0)
