Abstract
A new technology for storing and categorizing heterogeneous data on the properties of matter is proposed. The availability of a multitude of heterogeneous data from a variety of sources justifies the use of one of the popular toolkits for Big Data processing, Apache Spark. Its role in the proposed technology is to manage an extensive data warehouse of text files in JSON format. The first stage of the technology converts primary resources (relational databases, digital archives, Web portals, etc.) into a standardized JSON document form. The advantages of the JSON format are the ability to store data and metadata within a single text document, readability for both humans and computers, and support for the hierarchical structures needed to represent complex and irregular data. Such data structures arise from the possible expansion of the subject area: new types of materials, an expanded nomenclature of properties, and so on. For the semantic integration of resources converted to the JSON format, a repository of subject-oriented ontologies is used. Data search in the JSON document store is implemented through a combination of SPARQL and SQL queries. The former, addressed to the ontology repository, let the user view and search for matching and related concepts. The latter, addressed to the JSON document sets, retrieve the required data from the document bodies using the capabilities of Apache Spark SQL. The efficiency of the developed technology is tested on problems of thermophysical data integration, which are characterized by a complex logical structure.
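The document layout described in the abstract can be illustrated with a minimal sketch. It is not the authors' schema or code: the field names (`metadata`, `data`, `concept`, `points`), the substance labels, and all numeric values are hypothetical placeholders, and plain Python stands in for Apache Spark SQL so that the example stays self-contained. The concept name passed to the query plays the role of a term that, in the described workflow, would first be found in the ontology repository.

```python
import json

# Hypothetical JSON documents. Field names and values are placeholders,
# not the schema or data used by the authors.
DOCS = [
    {
        "metadata": {"source": "relational-db", "substance": "substance-A",
                     "concept": "ThermalConductivity"},
        "data": {"unit": "W/(m*K)",
                 "points": [{"T": 300, "value": 1.0},
                            {"T": 400, "value": 2.0}]},
    },
    {
        # Irregular record: carries a metadata field ("phase") the first lacks.
        "metadata": {"source": "digital-archive", "substance": "substance-B",
                     "concept": "HeatCapacity", "phase": "liquid"},
        "data": {"unit": "J/(kg*K)",
                 "points": [{"T": 298, "value": 3.0}]},
    },
]


def select_points(docs, concept):
    """Return (substance, T, value, unit) rows for documents tagged with `concept`."""
    rows = []
    for doc in docs:
        meta = doc["metadata"]
        if meta.get("concept") != concept:
            continue
        data = doc["data"]
        for point in data.get("points", []):
            rows.append((meta["substance"], point["T"], point["value"], data["unit"]))
    return rows


# Round-trip through text to show that each document is plain JSON.
docs = [json.loads(json.dumps(d)) for d in DOCS]
rows = select_points(docs, "ThermalConductivity")
for row in rows:
    print(row)
```

Metadata and data sit side by side in one document, and the second record adds a field without breaking the query, which is the kind of irregularity the hierarchical format is meant to tolerate.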
Notes
1. Section “Databases at the Joint Institute for High Temperatures, Russian Academy of Sciences” on page 47 of the review [13].
References
1. WhatIs.com (a reference and self-education tool about information technology). http://whatis.techtarget.com/definition/3Vs
2. Erkimbaev, A.O., Zitserman, V.Y., Kobzev, G.A., Kosinov, A.V.: Standardization of Storage and Retrieval of Semi-structured Thermophysical Data in JSON-documents Associated with the Ontology. In: CEUR-WS, vol. 2022, urn:nbn:de:0074-2022-6 (2017). http://ceur-ws.org/Vol-2022/paper36.pdf
3. Frenkel, M., Chirico, R.D., Diky, V., et al.: XML-based IUPAC standard for experimental, predicted, and critically evaluated thermodynamic property data storage and capture (ThermoML). Pure Appl. Chem. 78, 541–612 (2006). https://doi.org/10.1351/pac200678030541
4. Sturrock, C.P., Begley, E.F., Kaufman, J.G.: NISTIR 6785. MatML – Materials Markup Language Workshop Report. U.S. Department of Commerce, National Institute of Standards and Technology (2001)
5. Introducing JSON. http://json.org/index.html
6. Michel, K., Meredig, B.: Beyond bulk single crystals: A data format for all materials structure–property–processing relationships. MRS Bull. 41, 617–623 (2016). https://doi.org/10.1557/mrs.2016.166
7. Ontobee: A linked data server designed for ontologies. http://www.ontobee.org
8. Erkimbaev, A.O., Zhizhchenko, A.B., Zitserman, V.Yu., Kobzev, G.A., Son, E.E., Sotnikov, A.N.: Integration of databases on substance properties: approaches and technologies. Autom. Documentation Math. Linguist. 46, 170–176 (2012). https://doi.org/10.3103/S000510551204005X
9. Ataeva, O.M., Erkimbaev, A.O., Zitserman, V.Yu., et al.: Ontological Modeling as a Means of Integration Data on Substances Thermophysical Properties. In: 15th All-Russian Science Conference “Electronic Libraries: Advanced Approaches and Technologies, Electronic Collections”, s1_3. Yaroslavl (2013). http://rcdl.ru/doc/2013/paper/s1_3.pdf
10. ChemSpider. http://www.chemspider.com
11. Hall, S.R., McMahon, B.: The implementation and evolution of STAR/CIF ontologies: interoperability and preservation of structured data. Data Sci. J. 15(3), 1–15 (2016). https://doi.org/10.5334/dsj-2016-003
12. Apache Spark. http://spark.apache.org
13. Kiselyova, N.N., Dudarev, V.A., Zemskov, V.S.: Computer information resources of inorganic chemistry and materials science. Rus. Chem. Rev. 79, 145–166 (2010). https://doi.org/10.1070/RC2010v079n02ABEH004104
14. Frenkel, M.: Global communications and expert systems in thermodynamics: Connecting property measurement and chemical process design. Pure Appl. Chem. 77, 1349–1367 (2005). https://doi.org/10.1351/pac200577081349
15. Belov, G.V., Iorish, V.S., Yungman, V.S.: IVTANTHERMO for Windows – database on thermodynamic properties and related software. Calphad 23, 173–180 (1999). https://doi.org/10.1016/s0364-5916(99)00023-1
Acknowledgments
This work is supported by the Russian Science Foundation, grant 14-50-00124.
Copyright information
© 2018 Springer International Publishing AG, part of Springer Nature
About this paper
Cite this paper
Erkimbaev, A., Zitserman, V., Kobzev, G., Kosinov, A. (2018). Integration of Data on Substance Properties Using Big Data Technologies and Domain-Specific Ontologies. In: Kalinichenko, L., Manolopoulos, Y., Malkov, O., Skvortsov, N., Stupnikov, S., Sukhomlin, V. (eds) Data Analytics and Management in Data Intensive Domains. DAMDID/RCDL 2017. Communications in Computer and Information Science, vol 822. Springer, Cham. https://doi.org/10.1007/978-3-319-96553-6_14
DOI: https://doi.org/10.1007/978-3-319-96553-6_14
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-96552-9
Online ISBN: 978-3-319-96553-6
eBook Packages: Computer Science, Computer Science (R0)