Abstract
Data cubes are multidimensional databases, often built from several separate databases, that serve as a flexible basis for data analysis. Surprisingly, outlier detection on data cubes has not yet been treated extensively. In this work, we provide the first framework to evaluate robust outlier detection methods in data cubes (RODD). We introduce a novel random forest-based outlier detection approach (RODD-RF) and compare it with more traditional methods based on robust location estimators. We propose a general type of test data and examine all methods in a simulation study. Moreover, we apply RODD-RF to real-world data. The results show that RODD-RF leads to improved outlier detection.
L. Kuhlmann and D. Wilmes—These authors contributed equally to this work.
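The abstract mentions baseline methods based on robust location estimators without giving details. As a rough illustration of that general idea only, the following minimal sketch flags cells of a toy two-dimensional cube whose value lies far from the median of their slice, scaled by the median absolute deviation (MAD). The cube contents, the choice of slicing by region, the function name `flag_outliers`, and the threshold of 3.5 are all assumptions made for this example; it reproduces neither RODD-RF nor the exact baselines evaluated in the paper.

```python
from collections import defaultdict
from statistics import median

# Hypothetical toy cube: (region, period) -> measure. All values are made up.
cube = {
    ("north", "P1"): 100, ("north", "P2"): 103, ("north", "P3"): 98,
    ("north", "P4"): 101, ("north", "P5"): 99,
    ("south", "P1"): 50, ("south", "P2"): 52, ("south", "P3"): 49,
    ("south", "P4"): 95, ("south", "P5"): 51,   # P4 is the injected outlier
    ("west", "P1"): 200, ("west", "P2"): 205, ("west", "P3"): 198,
    ("west", "P4"): 202, ("west", "P5"): 207,
}


def flag_outliers(cube, threshold=3.5):
    """Flag cells that lie far from the median of their region slice."""
    # Collect the values of each slice along the (assumed) region dimension.
    slices = defaultdict(list)
    for (region, _period), value in cube.items():
        slices[region].append(value)

    flagged = []
    for (region, period), value in cube.items():
        center = median(slices[region])                      # robust location
        mad = median(abs(v - center) for v in slices[region])
        scale = 1.4826 * mad                                 # MAD scaled for normal data
        # Skip slices with zero spread to avoid dividing by zero.
        if scale > 0 and abs(value - center) / scale > threshold:
            flagged.append((region, period))
    return flagged


flagged = flag_outliers(cube)
print(flagged)
```

Using the median and MAD rather than the mean and standard deviation keeps the single extreme cell from distorting the estimate of what is typical for its slice, which is the usual motivation for robust estimators in this setting.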
Acknowledgements
This work was supported by the Research Center Trustworthy Data Science and Security, an institution of the University Alliance Ruhr.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Kuhlmann, L., Wilmes, D., Müller, E., Pauly, M., Horn, D. (2023). RODD: Robust Outlier Detection in Data Cubes. In: Wrembel, R., Gamper, J., Kotsis, G., Tjoa, A.M., Khalil, I. (eds) Big Data Analytics and Knowledge Discovery. DaWaK 2023. Lecture Notes in Computer Science, vol 14148. Springer, Cham. https://doi.org/10.1007/978-3-031-39831-5_30
DOI: https://doi.org/10.1007/978-3-031-39831-5_30
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-39830-8
Online ISBN: 978-3-031-39831-5
eBook Packages: Computer Science (R0)