Abstract
Temporal aspects often play an important role in information extraction. Given the peculiarities of temporal data, their management typically requires dedicated algorithms, which make the overall data mining process complex, especially when a dataset is characterised by both temporal and atemporal information. In such situations, typical solutions include combining different algorithms that handle the temporal and atemporal parts independently, or relying on an encoding of temporal data that makes it possible to apply classical machine learning algorithms (for example, through lagged variables). This work investigates the management of temporal information in an environmental problem, namely, assessing the relationships between concentrations of the pollutants \(NO_2\), \(NO_X\), and \(PM_{2.5}\) and a set of independent variables that include meteorological conditions and traffic flow in the city of Wrocław (Poland). We show that taking temporal information into account by means of lagged variables leads to better results than atemporal models. More importantly, even higher performance may be achieved with a recently proposed decision tree model, called J48SS, that is capable of handling heterogeneous datasets consisting of static (i.e., categorical and numerical) attributes as well as sequential and time series data. This outcome highlights the importance of proper temporal data modelling.
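To make the lagged-variable encoding mentioned above concrete, the following is a minimal sketch, not the authors' pipeline. The function name `add_lagged_features`, the column names, the chosen lags, and the sample readings are all hypothetical and chosen only for illustration. Each row is extended with the values that selected columns took one or more time steps earlier, so that a classical, atemporal learner sees a short history as ordinary attributes.

```python
def add_lagged_features(rows, columns, lags):
    """Return new rows in which each selected column also appears at past steps.

    rows    -- list of dicts, one per time step, in chronological order
    columns -- names of the columns to lag (hypothetical names in this sketch)
    lags    -- positive integers: how many steps back to look
    """
    max_lag = max(lags)
    lagged_rows = []
    # The first max_lag rows lack a complete history, so they are skipped.
    for t in range(max_lag, len(rows)):
        new_row = dict(rows[t])  # copy the current (atemporal) attributes
        for col in columns:
            for lag in lags:
                # Value of `col` observed `lag` steps before time t.
                new_row[f"{col}_lag{lag}"] = rows[t - lag][col]
        lagged_rows.append(new_row)
    return lagged_rows


# Made-up hourly readings, purely illustrative (not the Wrocław data).
hourly = [
    {"traffic": 100, "wind_speed": 2.0, "no2": 30.0},
    {"traffic": 120, "wind_speed": 2.5, "no2": 35.0},
    {"traffic": 150, "wind_speed": 1.0, "no2": 50.0},
    {"traffic": 130, "wind_speed": 0.5, "no2": 55.0},
    {"traffic": 90, "wind_speed": 3.0, "no2": 28.0},
]

lagged = add_lagged_features(hourly, ["traffic", "wind_speed"], lags=[1, 2])
```

The result is a flat table with one row per time step and one extra attribute per (column, lag) pair, which can be passed to any standard classifier. The trade-off of this encoding is that the history length is fixed in advance by the choice of lags.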
Notes
1. We use the term timestamp to refer to the kind of variables that identify a specific time instant, to distinguish them from those we will consider to be proper temporal features, i.e., the ones that encode historical values.
2. Note that the European directives identify the two relevant values of 0 and 200 for \(NO_2\) concentrations. However, we chose here to rely on different interval boundaries, since in the considered data there are just 4 instances with values over 200. Although this is a rather arbitrary choice, it does not compromise the goal of the work, namely, assessing the role played by temporal information in the overall classification task.
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Brunello, A., Kamińska, J., Marzano, E., Montanari, A., Sciavicco, G., Turek, T. (2019). Assessing the Role of Temporal Information in Modelling Short-Term Air Pollution Effects Based on Traffic and Meteorological Conditions: A Case Study in Wrocław. In: Welzer, T., et al. New Trends in Databases and Information Systems. ADBIS 2019. Communications in Computer and Information Science, vol 1064. Springer, Cham. https://doi.org/10.1007/978-3-030-30278-8_45
DOI: https://doi.org/10.1007/978-3-030-30278-8_45
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-30277-1
Online ISBN: 978-3-030-30278-8
eBook Packages: Computer Science, Computer Science (R0)