Abstract
Open data portals are used to make a growing number of government data resources public. However, difficulties in preprocessing and integrating multiple possibly incomplete open data resources hinder the potential of open data re-use for software development. While incomplete data sets can be imputed in an offline process, this is not the case for data streams expected to be published in an online manner.
In this work, we propose a novel data stream preprocessing method aimed at simplifying open data stream re-use through the unification of time resolution and imputation of incomplete instances. The method relies on stream mining methods to predict categorical values to be imputed. A separate online learning model is built for every incomplete feature. The method we propose allows the model to benefit from both inter-feature similarities and temporal dependencies present in data streams. We validate the proposed method with public transport data streams.
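The idea of one online model per incomplete feature can be illustrated with a minimal sketch. It is not the authors' implementation: a simple frequency-count voter stands in for the stream mining classifiers the abstract refers to, and the class name `OnlineCategoricalImputer`, the feature names `line` and `delay`, and the sample stream are all hypothetical. Each per-feature model votes using the values of the other features (inter-feature similarity) and the previous value of the same feature (temporal dependency).

```python
from collections import Counter, defaultdict

MISSING = None  # assumption: a missing categorical value is encoded as None


class OnlineCategoricalImputer:
    """Hypothetical sketch: one incremental count-based model per feature."""

    def __init__(self, features):
        self.features = list(features)
        # counts[target][(context_name, context_value)] -> Counter of target values
        self.counts = {f: defaultdict(Counter) for f in self.features}
        self.priors = {f: Counter() for f in self.features}
        self.previous = {}  # last completed instance (temporal context)

    def _context(self, instance, target):
        # Inter-feature context: observed values of the other features.
        ctx = [(f, instance[f]) for f in self.features
               if f != target and instance.get(f) is not MISSING]
        # Temporal context: value of the target in the previous instance.
        if target in self.previous:
            ctx.append(("prev_" + target, self.previous[target]))
        return ctx

    def _predict(self, instance, target):
        votes = Counter()
        for key in self._context(instance, target):
            votes.update(self.counts[target][key])
        if not votes:  # no matching context seen yet: fall back to the prior
            votes = self.priors[target]
        return votes.most_common(1)[0][0] if votes else MISSING

    def process(self, instance):
        """Impute missing values first, then learn from the observed ones."""
        completed = dict(instance)
        for f in self.features:
            if completed.get(f) is MISSING:
                completed[f] = self._predict(instance, f)
        for f in self.features:
            value = instance.get(f)
            if value is not MISSING:  # never train on imputed values
                for key in self._context(instance, f):
                    self.counts[f][key][value] += 1
                self.priors[f][value] += 1
        self.previous = {f: v for f, v in completed.items() if v is not MISSING}
        return completed


# Hypothetical stream of public transport records; the last one is incomplete.
stream = [
    {"line": "A", "delay": "low"},
    {"line": "B", "delay": "high"},
    {"line": "A", "delay": "low"},
    {"line": "B", "delay": "high"},
    {"line": "A", "delay": MISSING},
]
imputer = OnlineCategoricalImputer(["line", "delay"])
results = [imputer.process(record) for record in stream]
print(results[-1])
```

Imputing before learning keeps the sketch in a test-then-train order, so each instance is completed using only what the models have seen so far. Unifying the time resolution of several streams would have to happen before this step and is not shown here.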
Notes
1. Highways Agency network journey time and traffic flow data was an example of such a data set.
Acknowledgements
The project was funded by the POB Research Centre for Artificial Intelligence and Robotics of Warsaw University of Technology within the Excellence Initiative Program - Research University (ID-UB).
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Kunicki, R., Grzenda, M. (2021). Towards Increasing Open Data Adoption Through Stream Data Integration and Imputation. In: Fujita, H., Selamat, A., Lin, J.CW., Ali, M. (eds) Advances and Trends in Artificial Intelligence. Artificial Intelligence Practices. IEA/AIE 2021. Lecture Notes in Computer Science, vol 12798. Springer, Cham. https://doi.org/10.1007/978-3-030-79457-6_2
DOI: https://doi.org/10.1007/978-3-030-79457-6_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-79456-9
Online ISBN: 978-3-030-79457-6
eBook Packages: Computer Science, Computer Science (R0)