Abstract
The future of the Internet of Things (IoT) demands the integration of synergetic applications to cater to societal needs. Examples of IoT-based confederated applications include Ambient Assisted Living with Active Healthy Ageing, CasAware with Smart Energy, and Smart Gas Distribution Networks with GIS systems. However, data heterogeneity hinders integration: these systems follow different standards, data formats, semantic models, and representations, which leads to data interoperability issues in IoT. For both academia and industry, the major concern in integrating heterogeneous applications smoothly is interpreting the different data formats and representing them in a common schema for further analysis. Existing solutions, such as message payload translation, middleware/cloud format, and Inter-IoT, are complex, time-consuming, and ineffective. Hence, this paper proposes heterogeneous data format integration and conversion (HDFIC), a machine learning-based system that identifies data formats using a Random Forest classifier and integrates them using the Data Format Description Language (DFDL). The content-based data format identification in the proposed HDFIC is trained with the standard features defined in RFC 7111, 8259, and 8996. Subsequently, the data is integrated into a single XML Schema Definition and converted into the required data format using the IBM App Connect Enterprise tool and DFDL. Finally, the performance of HDFIC is evaluated on a synergetic dataset of patient body vitals and room ambiance in terms of accuracy, data integration time, and conversion efficiency.
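As a rough illustration of content-based format identification, the sketch below turns a payload into a vector of structural-character frequencies. It is a minimal sketch under stated assumptions: the character set, the function name `extract_features`, and the sample payloads are hypothetical and are not taken from the paper. Vectors of this kind could then be used to train a classifier such as a Random Forest (for example scikit-learn's `RandomForestClassifier`), which is left out here to keep the example dependency-free.

```python
from collections import Counter
from typing import List

# Characters that carry structure in CSV, JSON, and XML payloads.
# This feature set is an illustrative assumption, not the paper's feature list.
STRUCTURAL_CHARS = ["{", "}", "[", "]", ":", ",", '"', "<", ">", "/", "="]


def extract_features(payload: str) -> List[float]:
    """Return the relative frequency of each structural character in payload."""
    counts = Counter(payload)
    total = max(len(payload), 1)  # avoid division by zero on empty payloads
    return [counts[ch] / total for ch in STRUCTURAL_CHARS]


# Hypothetical payloads in three formats.
samples = {
    "json": '{"patient_id": "P01", "heart_rate": 72}',
    "csv": "patient_id,room_temp,humidity\nP01,24.5,40\n",
    "xml": "<vitals><patient_id>P01</patient_id><spo2>98</spo2></vitals>",
}

for label, payload in samples.items():
    features = extract_features(payload)
    # Print only the non-zero features so the differences are easy to see.
    non_zero = {ch: round(f, 3) for ch, f in zip(STRUCTURAL_CHARS, features) if f}
    print(label, non_zero)
```

Each format leaves a distinct footprint (braces and colons for JSON, angle brackets for XML, commas alone for CSV), which is what makes a content-based classifier feasible without relying on file extensions or headers.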
Data availability
All data generated or analyzed during this study are included in this published article. Methods to generate the dataset are mentioned in the article. A sample dataset is presented in the article.
References
Ahmed A, Kleiner M, Roucoules L (2019) Model-based interoperability IoT hub for the supervision of smart gas distribution networks. IEEE Syst J 13(2):1526–1533. https://doi.org/10.1109/JSYST.2018.2851663
Bannister M (2021) How humidity damages home. https://www.airthings.com/resources/home-humidity-damage. Accessed 21 Dec 2023
Bray T (2017) The JavaScript object notation (JSON) data interchange format. RFC 8259. https://doi.org/10.17487/RFC8259. https://www.rfc-editor.org/info/rfc8259
Buczak AL, Guven E (2016) A survey of data mining and machine learning methods for cyber security intrusion detection. IEEE Commun Surv Tutor 18(2):1153–1176. https://doi.org/10.1109/COMST.2015.2494502
Calhoun W, Coles D (2008) Predicting the types of file fragments. Digit Investig. https://doi.org/10.1016/j.diin.2008.05.005
Cedillo P, Riofrio X, Prado D et al (2020) A middleware for managing the heterogeneity of data provining from IoT devices in ambient assisted living environments, pp 1–6. https://doi.org/10.1109/ANDESCON50619.2020.9272163
Coulby G, Clear A, Jones DO et al (2020) Towards remote healthcare monitoring using accessible IoT technology: state-of-the-art, insights and experimental design. BioMed Eng OnLine. https://doi.org/10.1186/s12938-020-00825-9
Crockford D (2006) The application/json media type for JavaScript Object Notation (JSON). RFC 4627. https://doi.org/10.17487/RFC4627. https://www.rfc-editor.org/info/rfc4627
DiGiacinto J, Seladi-Schulman J (2022) Normal vs. dangerous heart rate: How to tell the difference. Healthline. https://www.healthline.com/health/dangerous-heart-rate. Accessed 22 Dec 2023
Doan T, Kayes ASM, Rahayu W et al (2020) IoT streaming data integration from multiple sources. Computing. https://doi.org/10.1007/s00607-020-00830-9
Evensen JD, Lindahl S, Goodwin M (2014) File-type detection using naïve Bayes and n-gram analysis. BIBSYS: Open J Syst. https://core.ac.uk/reader/228628450
Gonzalez-Usach R, Julian M, Esteve M et al (2021) Federation of AAL & AHA systems through semantically interoperable framework. In: 2021 IEEE international conference on communications workshops (ICC workshops), pp 1–6. https://doi.org/10.1109/ICCWorkshops50388.2021.9473503
Google (2023) BigQuery public datasets. https://cloud.google.com/bigquery/public-data. Accessed 21 Dec 2023
Hassine K, Erbad A, Hamila R (2019) Important complexity reduction of random forest in multi-classification problem. In: 2019 15th International wireless communications and mobile computing conference (IWCMC), pp 226–231. https://doi.org/10.1109/IWCMC.2019.8766544
Hausenblas M, Wilde E, Tennison J (2014) URI fragment identifiers for the text/csv media type. RFC 7111. https://doi.org/10.17487/RFC7111. https://www.rfc-editor.org/info/rfc7111
Hojlo J (2022) Future of industry ecosystems: shared data and insights. https://blogs.idc.com/2021/01/06/future-of-industry-ecosystems-shared-data-and-insights/. Accessed 21 Dec 2023
Hu L, Sun R, Wang F et al (2016) A stream processing system for multisource heterogeneous sensor data. J Sens 2016:1–8. https://doi.org/10.1155/2016/4287834
IBM (2023a) Data format description language (DFDL). https://www.ibm.com/docs/en/app-connect/11.0.0?topic=model-data-format-description-language-dfdl. Accessed 21 Dec 2023
IBM (2023b) IBM App Connect Enterprise. https://www.ibm.com/docs/en/app-connect/11.0.0. Accessed 21 Dec 2023
ISA (2022) ISA95, enterprise-control system integration. https://www.isa.org/standards-and-publications/isa-standards/isa-standards-committees/isa95. Accessed 21 Dec 2023
ISO (2022) Industrial automation systems and integration-integration of life-cycle data for process plants including oil and gas production facilities-part 2: data model. https://www.iso.org/obp/ui/#iso:std:29557:en. Accessed 21 Dec 2023
Jaleel A, Mahmood T, Hassan MA et al (2020) Towards medical data interoperability through collaboration of healthcare devices. IEEE Access 8:132302–132319. https://doi.org/10.1109/ACCESS.2020.3009783
Konaray S, Toprak A, Pek G et al (2019) Detecting file types using machine learning algorithms, pp 1–4. https://doi.org/10.1109/ASYU48272.2019.8946393
Li T (2019) Design and implementation of interworking between OneM2M and external systems. https://doi.org/10.2991/icmeit-19.2019.32
Li WJ, Wang K, Stolfo S et al (2005) Fileprints: identifying file types by n-gram analysis, pp 64–71. https://doi.org/10.1109/IAW.2005.1495935
Liu J, Jiang L, Chen Y et al (2023) Study on prediction model of liquid hold up based on random forest algorithm. Chem Eng Sci 268:118383. https://doi.org/10.1016/j.ces.2022.118383
M S, Chandavarkar BR (2021a) Data processing in IoT, sensor to cloud: survey. In: 12th International conference on computing communication and networking technologies (ICCCNT), IIT Kharagpur. https://doi.org/10.1109/ICCCNT51525.2021.9579976
M S, Chandavarkar BR (2021b) IoTs communication technologies, data formats, and protocols—a survey. In: 2nd International conference on secure cyber computing and communications (ICSCCC), NIT Jalandhar, pp 483–488. https://doi.org/10.1109/ICSCCC51823.2021.9478093
Manyika J, Chui M, Bisson P, Woetzel J, Dobbs R, Bughin J, Aharon D (2015) The Internet of Things: mapping the value beyond the hype, vol 24. McKinsey Global Institute, New York, NY, USA
Mezei G, Somogyi F, Farkas K (2018) The dynamic sensor data description and data format conversion language, pp 372–380. https://doi.org/10.5220/0006912203720380
Milankovic M (2018) IoT data interoperability POC: a pragmatic feasibility proof. https://htecgroup.com/insights/tech-blog/iot-data-interoperability-poc-a-pragmatic-feasibility-proof/. Accessed 30 Aug 2022
Modoni G, Caldarola EG, Mincuzzi N et al (2020) Integrating IoT platforms using the inter-IoT approach: a case study of the CasAware project. J Ambient Intell Smart Environ. https://doi.org/10.3233/AIS-200578
Moon J, Kum SW, Lee S (2019) A heterogeneous IoT data analysis framework with collaboration of edge-cloud computing: focusing on indoor PM10 and PM2.5 status prediction. Sensors 19:3038. https://doi.org/10.3390/s19143038
Moriarty K, Farrell S (2021) Deprecating TLS 1.0 and TLS 1.1. https://doi.org/10.17487/RFC8996. https://www.rfc-editor.org/info/rfc8996
Nilsson J, Delsing J, Sandin F (2020) Autoencoder alignment approach to run-time interoperability for system of systems engineering. In: 2020 IEEE 24th international conference on intelligent engineering systems (INES), pp 139–144. https://doi.org/10.1109/INES49302.2020.9147168
Palm E, Paniagua C, Bodin U et al (2019) Syntactic translation of message payloads between at least partially equivalent encodings. In: 2019 IEEE international conference on industrial technology (ICIT), pp 812–817. https://doi.org/10.1109/ICIT.2019.8755159
Pramukantoro E, Gofuku A (2020) Prototype of multi-layer personal cardiac monitoring system for data interoperability problem, pp 84–89. https://doi.org/10.1145/3427423.3427442
Rose DMT, Hollenbeck S, Masinter LM (2003) Guidelines for the use of extensible markup language (XML) within IETF protocols. RFC 3470. https://doi.org/10.17487/RFC3470. https://www.rfc-editor.org/info/rfc3470
Shafranovich Y (2005) Common format and MIME type for comma-separated values (CSV) files. RFC 4180. https://doi.org/10.17487/RFC4180. https://www.rfc-editor.org/info/rfc4180
Shaikh DJ (2022) What are blood oxygen levels? Chart. https://www.medicinenet.com/what_are_blood_oxygen_levels/article.htm. Accessed 30 Aug 2022
Sharma S, Hashmi MF, Bhattacharya PT (2022) Hypotension. [Updated 2022 Feb 16]. https://www.ncbi.nlm.nih.gov/books/NBK499961/. Accessed 30 Aug 2022
Singh M, Wu W, Rizou S et al (2022) Data information interoperability model for IoT-enabled smart water networks. In: 2022 IEEE 16th international conference on semantic computing (ICSC), pp 179–186. https://doi.org/10.1109/ICSC52841.2022.00038
Umishio W, Ikaga T, Kario K et al (2019) Cross-sectional analysis of the relationship between home blood pressure and indoor temperature in winter: a nationwide smart wellness housing survey in Japan. Hypertension. https://doi.org/10.1161/HYPERTENSIONAHA.119.12914
Vaillant (2022) Ideal room temperature. https://www.vaillant.co.uk/homeowners/advice-and-knowledge/vaillant-blog-pieces/what-is-the-ideal-room-temperature-1769698.html. Accessed 30 Aug 2022
Venkata SK, Young P, Green A (2020) Using machine learning for text file format identification. EasyChair Preprint no. 4698
Walker HK, Hall WD, Hurst JW (1990) Clinical methods: the history, physical, and laboratory examinations, Chapter-218. Butterworths. https://www.ncbi.nlm.nih.gov/books/NBK331/. Accessed 30 Aug 2022
Wilgenbus EF (2013) The file fragment classification problem: a combined neural network and linear programming discriminant model approach. http://hdl.handle.net/10394/10215. Accessed 30 Aug 2022
Wu J, Zhang J, Qiao J (2022) Adaptive integration algorithm of sports event network marketing data based on big data. Secur Commun Netw 2022:1–9. https://doi.org/10.1155/2022/7660071
Xu B, Xu LD, Cai H et al (2014) Ubiquitous data accessing method in IoT-based information system for emergency medical services. IEEE Trans Ind Inform 10(2):1578–1586. https://doi.org/10.1109/TII.2014.2306382
Xu D, Zhang Y, Wang B et al (2019) Acute effects of temperature exposure on blood pressure: an hourly level panel study. Environ Int 124:493–500
Funding
The authors did not receive support from any organization for the submitted work. No funding was received to assist with the preparation of this manuscript.
Author information
Contributions
The major contributions of the authors are as follows:
Identification of the data formats of IoT data originating from multiple sources using a content-based approach.
Integration of the identified heterogeneous-format IoT data using a common field identifier.
Representation of the integrated heterogeneous-format IoT data in an appropriate format using the Data Format Description Language (DFDL).
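The integration and representation steps listed above can be sketched with Python's standard library. This is a minimal sketch under stated assumptions and not the authors' implementation, which relies on IBM App Connect Enterprise and DFDL: the common field name `patient_id`, the helper names, and the sample payloads are all hypothetical, and plain parsers stand in for the DFDL-based tooling.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

COMMON_FIELD = "patient_id"  # hypothetical common field identifier


def parse_json(text):
    """Parse a JSON object or array of objects into a list of records."""
    data = json.loads(text)
    return data if isinstance(data, list) else [data]


def parse_csv(text):
    """Parse CSV text with a header row into a list of records."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


def parse_xml(text):
    """Parse XML whose root holds one child element per record."""
    root = ET.fromstring(text)
    return [{field.tag: field.text or "" for field in record} for record in root]


def integrate(*record_lists):
    """Merge records from all sources that share the same common field value."""
    merged = {}
    for records in record_lists:
        for record in records:
            key = str(record[COMMON_FIELD])
            merged.setdefault(key, {}).update(
                {name: str(value) for name, value in record.items()}
            )
    return merged


def to_xml(merged):
    """Represent the integrated records as a single XML document."""
    root = ET.Element("records")
    for key in sorted(merged):
        record_el = ET.SubElement(root, "record")
        for name, value in merged[key].items():
            ET.SubElement(record_el, name).text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical payloads from three sources describing the same patient.
vitals_json = '[{"patient_id": "P01", "heart_rate": 72}]'
ambiance_csv = "patient_id,room_temp\nP01,24.5\n"
oximeter_xml = (
    "<readings><reading><patient_id>P01</patient_id>"
    "<spo2>98</spo2></reading></readings>"
)

merged = integrate(
    parse_json(vitals_json), parse_csv(ambiance_csv), parse_xml(oximeter_xml)
)
print(to_xml(merged))
```

Keying the merge on one shared field keeps the integration step independent of the source format: once a payload is parsed into records, its origin no longer matters.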
Ethics declarations
Competing interests:
The authors have no relevant financial or nonfinancial interests to disclose.
Consent
The authors have followed ethical standards and have no conflicts of interest to disclose.
Materials and/or Code availability
Data will be provided on request.
About this article
Cite this article
M, S., Chandavarkar, B.R. & Khatri, S. Heterogeneous data format integration and conversion (HDFIC) using machine learning and IBM-DFDL for IoT. Evolving Systems 15, 375–396 (2024). https://doi.org/10.1007/s12530-024-09568-7
DOI: https://doi.org/10.1007/s12530-024-09568-7