Towards Increasing Open Data Adoption Through Stream Data Integration and Imputation

Kunicki, Robert; Grzenda, Maciej

doi:10.1007/978-3-030-79457-6_2

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 12798)

Included in the following conference series:

International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems

Abstract

Open data portals are used to make a growing number of government data resources public. However, difficulties in preprocessing and integrating multiple possibly incomplete open data resources hinder the potential of open data re-use for software development. While incomplete data sets can be imputed in an offline process, this is not the case for data streams expected to be published in an online manner.

In this work, we propose a novel data stream preprocessing method aimed at simplifying open data stream re-use through the unification of time resolution and imputation of incomplete instances. The method relies on stream mining methods to predict categorical values to be imputed. A separate online learning model is built for every incomplete feature. The method we propose allows the model to benefit from both inter-feature similarities and temporal dependencies present in data streams. We validate the proposed method with public transport data streams.
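
The idea of one online model per incomplete feature can be illustrated with a minimal sketch. It is not the authors' implementation: the class names, the integer timestamps in seconds, the fixed-size time buckets, and the small counting-based classifier (standing in for a real stream mining learner) are all assumptions made for illustration. Each model sees the other features of the current instance (inter-feature similarities) and the values of the previous completed instance (temporal dependencies).

```python
from collections import Counter, defaultdict

MISSING = None  # assumption: a missing categorical value is represented as None


class OnlineCategoricalModel:
    """Hypothetical stand-in for a stream classifier: incremental naive Bayes counts."""

    def __init__(self):
        self.class_counts = Counter()
        self.cond_counts = defaultdict(Counter)  # (input name, value) -> label counts

    def learn(self, inputs, label):
        self.class_counts[label] += 1
        for key, value in inputs.items():
            self.cond_counts[(key, value)][label] += 1

    def predict(self, inputs):
        if not self.class_counts:
            return None  # nothing learned yet
        total = sum(self.class_counts.values())
        n_classes = len(self.class_counts)
        best_label, best_score = None, -1.0
        for label, count in self.class_counts.items():
            score = count / total
            for key, value in inputs.items():
                seen = self.cond_counts.get((key, value), {}).get(label, 0)
                # add-one style smoothing so unseen pairs do not zero the score
                score *= (seen + 1) / (count + n_classes)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


class StreamImputer:
    """Keeps a separate online model for every feature that may be incomplete."""

    def __init__(self, features, resolution_s):
        self.features = list(features)
        self.resolution_s = resolution_s  # target time resolution in seconds
        self.models = {f: OnlineCategoricalModel() for f in self.features}
        self.previous = {}  # last completed instance, used as temporal context

    def _inputs(self, instance, target):
        # Other observed features of the current instance ...
        inputs = {f: v for f, v in instance.items()
                  if f != target and v is not MISSING}
        # ... plus the values of the previous completed instance.
        for f, v in self.previous.items():
            inputs["prev_" + f] = v
        return inputs

    def process(self, timestamp, instance):
        completed = dict(instance)
        for f in self.features:
            inputs = self._inputs(instance, f)
            if instance.get(f) is MISSING:
                guess = self.models[f].predict(inputs)
                # fall back to the previous value when the model is still empty
                completed[f] = guess if guess is not None else self.previous.get(f)
            else:
                self.models[f].learn(inputs, instance[f])
        self.previous = {f: v for f, v in completed.items() if v is not None}
        # Unify time resolution by snapping the timestamp to its bucket start.
        bucket = timestamp - timestamp % self.resolution_s
        return bucket, completed
```

In this sketch a model is updated only when its feature is observed and queried only when the feature is missing, so imputed values are never fed back as training labels. A real deployment would likely replace the counting model with an established stream learner.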

Notes

1. Highways Agency network journey time and traffic flow data was an example of such a data set.

Acknowledgements

The project was funded by the POB Research Centre for Artificial Intelligence and Robotics of Warsaw University of Technology within the Excellence Initiative Program - Research University (ID-UB).

Author information

Authors and Affiliations

Faculty of Mathematics and Information Science, Warsaw University of Technology, ul. Koszykowa 75, 00-662, Warszawa, Poland
Robert Kunicki & Maciej Grzenda
Digitalisation Department, The City of Warsaw, pl. Bankowy 2, 00-095, Warszawa, Poland
Robert Kunicki

Corresponding author

Correspondence to Maciej Grzenda.

Editor information

Editors and Affiliations

i-SOMET Incorporate Association, Morioka, Japan
Hamido Fujita
Universiti Teknologi Malaysia, Kuala Lumpur, Malaysia
Ali Selamat
Western Norway University of Applied Sciences, Bergen, Norway
Jerry Chun-Wei Lin
Texas State University San Marcos, San Marcos, TX, USA
Moonis Ali

Cite this paper

Kunicki, R., Grzenda, M. (2021). Towards Increasing Open Data Adoption Through Stream Data Integration and Imputation. In: Fujita, H., Selamat, A., Lin, J.C.-W., Ali, M. (eds) Advances and Trends in Artificial Intelligence. Artificial Intelligence Practices. IEA/AIE 2021. Lecture Notes in Computer Science, vol 12798. Springer, Cham. https://doi.org/10.1007/978-3-030-79457-6_2

DOI: https://doi.org/10.1007/978-3-030-79457-6_2
Published: 19 July 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-79456-9
Online ISBN: 978-3-030-79457-6
eBook Packages: Computer Science, Computer Science (R0)