Models and Metrics for Mining Meaningful Metadata

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13350)

Abstract

The increasing volume and variety of science data have led to the creation of metadata extraction systems that automatically derive and synthesize relevant information from files. A critical component of metadata extraction systems is a mechanism for mapping extractors (lightweight tools that mine information from particular file types) to each file in a repository. However, existing methods do little to address the heterogeneity and scale of science data, thereby leaving valuable data unextracted or wasting significant compute resources by applying incorrect extractors to data. We construct an extractor scheduler that leverages file type identification (FTI) methods. We show that by training lightweight multi-label, multi-class statistical models on byte samples from files, we can correctly map 35% more extractors to files than by using libmagic. Further, we introduce a metadata quality toolkit to automatically assess the utility of extracted metadata.
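To make the idea of mapping extractors from byte samples concrete, the following is a minimal sketch and not the paper's implementation. It assumes a feature vector built from the first bytes of a file plus bytes at random offsets, and it swaps in a simple per-extractor centroid comparison for the statistical models mentioned above. The function names, sample sizes, similarity measure, and threshold are all hypothetical.

```python
import random
from collections import Counter


def sample_bytes(data: bytes, head: int = 512, rand: int = 512, seed: int = 0) -> bytes:
    """Return the first `head` bytes plus `rand` bytes drawn at random offsets from the rest.

    The sizes are illustrative assumptions, not values taken from the paper.
    """
    head_part = data[:head]
    rest = data[head:]
    if not rest:
        return head_part
    rng = random.Random(seed)
    rand_part = bytes(rest[rng.randrange(len(rest))] for _ in range(rand))
    return head_part + rand_part


def byte_histogram(sample: bytes) -> list[float]:
    """Normalized frequency of each of the 256 possible byte values."""
    if not sample:
        return [0.0] * 256
    counts = Counter(sample)
    total = len(sample)
    return [counts.get(value, 0) / total for value in range(256)]


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity; returns 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def predict_extractors(
    features: list[float],
    centroids: dict[str, list[float]],
    threshold: float = 0.5,
) -> set[str]:
    """Multi-label prediction: keep every extractor whose centroid is similar enough.

    A stand-in for a trained model; zero, one, or several extractors may be returned.
    """
    return {name for name, center in centroids.items() if cosine(features, center) >= threshold}


if __name__ == "__main__":
    # Hypothetical centroids; in practice these would be learned from labeled files.
    centroids = {
        "text": byte_histogram(b"hello world "),
        "binary": byte_histogram(bytes(range(128, 256))),
    }
    file_bytes = b"hello world " * 100
    features = byte_histogram(sample_bytes(file_bytes))
    print(predict_extractors(features, centroids))
```

Returning a set rather than a single label reflects the multi-label framing: one file may warrant several extractors, or none at all.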



Acknowledgements

We gratefully acknowledge Takuya Kurihana (University of Chicago) for sharing his machine learning expertise. This work is supported in part by the National Science Foundation under Grants No. 2004894 and 1757970, and used resources of the Argonne Leadership Computing Facility.

Author information

Corresponding author

Correspondence to Tyler J. Skluzacek.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Skluzacek, T.J., Chen, M., Hsu, E., Chard, K., Foster, I. (2022). Models and Metrics for Mining Meaningful Metadata. In: Groen, D., de Mulatier, C., Paszynski, M., Krzhizhanovskaya, V.V., Dongarra, J.J., Sloot, P.M.A. (eds) Computational Science – ICCS 2022. ICCS 2022. Lecture Notes in Computer Science, vol 13350. Springer, Cham. https://doi.org/10.1007/978-3-031-08751-6_30

  • DOI: https://doi.org/10.1007/978-3-031-08751-6_30

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-08750-9

  • Online ISBN: 978-3-031-08751-6

  • eBook Packages: Computer Science, Computer Science (R0)
