Abstract
Outlier detection methods have used approximate neighborhoods in filter-refinement approaches. Outlier detection ensembles have used artificially obfuscated neighborhoods to achieve diverse ensemble members. Here we argue that outlier detection models could be based on approximate neighborhoods in the first place, thus gaining in both efficiency and effectiveness. Whether this pays off depends, however, on the type of approximation: only some approximations seem beneficial for the task of outlier detection, while others show no (large) benefit. In particular, we argue that space-filling curves are beneficial approximations, as they have a stronger tendency to underestimate the density in sparse regions than in dense regions. In comparison, LSH and NN-Descent do not have such a tendency and do not seem to be beneficial for the construction of outlier detection ensembles.
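To make the idea of a curve-based neighborhood approximation concrete, the following is a minimal sketch and not the authors' implementation. It assumes 2-dimensional points in the unit square, a single Z-order (Morton) curve, and a fixed candidate window along the curve; the function names `morton_code`, `approx_knn_dist`, and `exact_knn_dist` and the parameters `window` and `bits` are hypothetical and chosen only for illustration. Because the candidates are a subset of all points, the approximate k-nearest-neighbor distance can only be larger than or equal to the exact one, so local density is never overestimated.

```python
import math
import random


def morton_code(ix, iy, bits=16):
    """Interleave the bits of two non-negative integers (Z-order position)."""
    code = 0
    for b in range(bits):
        code |= ((ix >> b) & 1) << (2 * b)
        code |= ((iy >> b) & 1) << (2 * b + 1)
    return code


def approx_knn_dist(points, k, window, bits=16):
    """Approximate k-NN distance using only neighbors along the Z-order curve.

    Assumes points lie in [0, 1)^2, window >= k, and len(points) > window.
    """
    scale = (1 << bits) - 1
    # Sort point indices by their position on the curve.
    order = sorted(
        range(len(points)),
        key=lambda i: morton_code(
            int(points[i][0] * scale), int(points[i][1] * scale), bits
        ),
    )
    result = [0.0] * len(points)
    for pos, i in enumerate(order):
        lo, hi = max(0, pos - window), min(len(order), pos + window + 1)
        # Candidates are the points within `window` positions on the curve.
        dists = sorted(
            math.dist(points[i], points[order[j]])
            for j in range(lo, hi)
            if j != pos
        )
        result[i] = dists[k - 1]
    return result


def exact_knn_dist(points, k):
    """Exact k-NN distance by brute force, for comparison."""
    result = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[k - 1])
    return result


random.seed(0)
pts = [(random.random(), random.random()) for _ in range(200)]
approx = approx_knn_dist(pts, k=5, window=10)
exact = exact_knn_dist(pts, k=5)
```

Comparing `approx` and `exact` per point shows where the curve misses true neighbors. An ensemble in the spirit of the abstract would repeat this with several differently constructed curves and combine the resulting scores, which this sketch leaves out.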
Notes
1. Results from computational geometry indicate that the worst case of nearest neighbor search in more than 3 dimensions cannot be better than \(\mathcal{O}(n^{4/3})\) [16]. Empirical results with such indexes are usually much better, and tree-based indexes are often credited with an \(n \log n\) cost for searching.
2. Using these methods may nevertheless improve performance.
References
Achlioptas, D.: Database-friendly random projections: Johnson-Lindenstrauss with binary coins. JCSS 66, 671–687 (2003)
Achtert, E., Kriegel, H.P., Schubert, E., Zimek, A.: Interactive data mining with 3D-parallel-coordinate-trees. In: Proceedings SIGMOD, pp. 1009–1012 (2013)
Angiulli, F., Fassetti, F.: DOLPHIN: an efficient algorithm for mining distance-based outliers in very large datasets. ACM TKDD 3(1), 4:1–57 (2009)
Angiulli, F., Pizzuti, C.: Outlier mining in large high-dimensional data sets. IEEE TKDE 17(2), 203–215 (2005)
Arya, S., Mount, D.M.: Approximate nearest neighbor queries in fixed dimensions. In: Proceedings SODA, pp. 271–280 (1993)
Barnett, V., Lewis, T.: Outliers in Statistical Data, 3rd edn. Wiley, New York (1994)
Bay, S.D., Schwabacher, M.: Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In: Proceedings KDD, pp. 29–38 (2003)
Beckmann, N., Kriegel, H.P., Schneider, R., Seeger, B.: The R*-Tree: an efficient and robust access method for points and rectangles. In: Proceedings SIGMOD, pp. 322–331 (1990)
Bentley, J.L.: Multidimensional binary search trees used for associative searching. Commun. ACM 18(9), 509–517 (1975)
Breunig, M.M., Kriegel, H.P., Ng, R., Sander, J.: LOF: Identifying density-based local outliers. In: Proceedings SIGMOD, pp. 93–104 (2000)
Campos, G.O., Zimek, A., Sander, J., Campello, R.J.G.B., Micenková, B., Schubert, E., Assent, I., Houle, M.E.: On the evaluation of unsupervised outlier detection: Measures, datasets, and an empirical study. Data Min. Knowl. Disc. 30, 891–927 (2016)
Chandola, V., Banerjee, A., Kumar, V.: Anomaly detection: a survey. ACM CSUR 41(3), 1–58 (2009). Article 15
Datar, M., Immorlica, N., Indyk, P., Mirrokni, V.S.: Locality-sensitive hashing scheme based on p-stable distributions. In: Proceedings ACM SoCG, pp. 253–262 (2004)
de Vries, T., Chawla, S., Houle, M.E.: Density-preserving projections for large-scale local anomaly detection. KAIS 32(1), 25–52 (2012)
Dong, W., Charikar, M., Li, K.: Efficient k-nearest neighbor graph construction for generic similarity measures. In: Proceedings WWW, pp. 577–586 (2011)
Erickson, J.: On the relative complexities of some geometric problems. In: Proceedings of the 7th Canadian Conference on Computational Geometry, Quebec City, Quebec, Canada, August 1995, pp. 85–90 (1995)
Geusebroek, J.M., Burghouts, G.J., Smeulders, A.W.M.: The Amsterdam library of object images. Int. J. Comput. Vis. 61(1), 103–112 (2005)
Gionis, A., Indyk, P., Motwani, R.: Similarity search in high dimensions via hashing. In: Proceedings VLDB, pp. 518–529 (1999)
Guttman, A.: R-trees: A dynamic index structure for spatial searching. In: Proceedings SIGMOD, pp. 47–57 (1984)
Hilbert, D.: Ueber die stetige Abbildung einer Linie auf ein Flächenstück. Math. Ann. 38(3), 459–460 (1891)
Imamura, Y., Shinohara, T., Hirata, K., Kuboyama, T.: Fast Hilbert Sort Algorithm Without Using Hilbert Indices. In: Amsaleg, L., Houle, M.E., Schubert, E. (eds.) SISAP 2016. LNCS, vol. 9939, pp. 259–267. Springer, Cham (2016). doi:10.1007/978-3-319-46759-7_20
Indyk, P., Motwani, R.: Approximate nearest neighbors: towards removing the curse of dimensionality. In: Proceedings STOC, pp. 604–613 (1998)
Jin, W., Tung, A.K., Han, J.: Mining top-n local outliers in large databases. In: Proceedings KDD, pp. 293–298 (2001)
Johnson, W.B., Lindenstrauss, J.: Extensions of Lipschitz mappings into a Hilbert space. In: Conference in Modern Analysis and Probability, Contemporary Mathematics, vol. 26, pp. 189–206. American Mathematical Society (1984)
Kabán, A.: On the distance concentration awareness of certain data reduction techniques. Pattern Recogn. 44(2), 265–277 (2011)
Knorr, E.M., Ng, R.T.: Algorithms for mining distance-based outliers in large datasets. In: Proceedings VLDB, pp. 392–403 (1998)
Kollios, G., Gunopulos, D., Koudas, N., Berchtold, S.: Efficient biased sampling for approximate clustering and outlier detection in large datasets. IEEE TKDE 15(5), 1170–1187 (2003)
Kriegel, H.P., Kröger, P., Schubert, E., Zimek, A.: Interpreting and unifying outlier scores. In: Proceedings SDM, pp. 13–24 (2011)
Kriegel, H.P., Schubert, E., Zimek, A.: The (black) art of runtime evaluation: are we comparing algorithms or implementations? KAIS 52(2), 341–378 (2017). doi:10.1007/s10115-016-1004-2
Lazarevic, A., Kumar, V.: Feature bagging for outlier detection. In: Proceedings KDD, pp. 157–166 (2005)
Lehmann, J., Isele, R., Jakob, M., Jentzsch, A., Kontokostas, D., Mendes, P.N., Hellmann, S., Morsey, M., van Kleef, P., Auer, S., Bizer, C.: DBpedia - a large-scale, multilingual knowledge base extracted from Wikipedia. Semant. Web J. 6(2), 167–195 (2015)
Liu, F.T., Ting, K.M., Zhou, Z.H.: Isolation-based anomaly detection. ACM TKDD 6(1), 3:1–39 (2012)
Matoušek, J.: On variants of the Johnson-Lindenstrauss lemma. Random Struct. Algorithms 33(2), 142–156 (2008)
Morton, G.M.: A computer oriented geodetic data base and a new technique in file sequencing. Technical report, International Business Machines Co (1966)
Muja, M., Lowe, D.G.: Scalable nearest neighbor algorithms for high dimensional data. IEEE TPAMI 36(11), 2227–2240 (2014)
Nguyen, H.V., Gopalkrishnan, V.: Efficient pruning schemes for distance-based outlier detection. In: Proceedings ECML PKDD, pp. 160–175 (2009)
Orair, G.H., Teixeira, C., Wang, Y., Meira, W., Parthasarathy, S.: Distance-based outlier detection: consolidation and renewed bearing. PVLDB 3(2), 1469–1480 (2010)
Papadimitriou, S., Kitagawa, H., Gibbons, P.B., Faloutsos, C.: LOCI: Fast outlier detection using the local correlation integral. In: Proceedings ICDE, pp. 315–326 (2003)
Ramaswamy, S., Rastogi, R., Shim, K.: Efficient algorithms for mining outliers from large data sets. In: Proceedings SIGMOD, pp. 427–438 (2000)
Rousseeuw, P.J., Hubert, M.: Robust statistics for outlier detection. WIREs DMKD 1(1), 73–79 (2011)
Schubert, E., Koos, A., Emrich, T., Züfle, A., Schmid, K.A., Zimek, A.: A framework for clustering uncertain data. PVLDB 8(12), 1976–1979 (2015)
Schubert, E., Zimek, A., Kriegel, H.P.: Local outlier detection reconsidered: a generalized view on locality with applications to spatial, video, and network outlier detection. Data Min. Knowl. Disc. 28(1), 190–237 (2014)
Schubert, E., Zimek, A., Kriegel, H.-P.: Fast and Scalable Outlier Detection with Approximate Nearest Neighbor Ensembles. In: Renz, M., Shahabi, C., Zhou, X., Cheema, M.A. (eds.) DASFAA 2015. LNCS, vol. 9050, pp. 19–36. Springer, Cham (2015). doi:10.1007/978-3-319-18123-3_2
Silpa-Anan, C., Hartley, R.I.: Optimised kd-trees for fast image descriptor matching. In: Proceedings CVPR (2008)
Venkatasubramanian, S., Wang, Q.: The Johnson-Lindenstrauss transform: an empirical study. In: Proceedings ALENEX Workshop (SIAM), pp. 164–173 (2011)
Wang, Y., Parthasarathy, S., Tatikonda, S.: Locality sensitive outlier detection: a ranking driven approach. In: Proceedings ICDE, pp. 410–421 (2011)
Zhang, X., Dou, W., He, Q., Zhou, R., Leckie, C., Kotagiri, R., Salcic, Z.: LSHiForest: A generic framework for fast tree isolation based ensemble anomaly analysis. In: Proceedings ICDE (2017)
Zimek, A., Campello, R.J.G.B., Sander, J.: Ensembles for unsupervised outlier detection: challenges and research questions. SIGKDD Explor. 15(1), 11–22 (2013)
Zimek, A., Campello, R., Sander, J.: Data perturbation for outlier detection ensembles. In: Proceedings SSDBM, pp. 13:1–12 (2014)
Zimek, A., Gaudet, M., Campello, R., Sander, J.: Subsampling for efficient and effective unsupervised outlier detection ensembles. In: Proceedings KDD, pp. 428–436 (2013)
Zimek, A., Schubert, E., Kriegel, H.P.: A survey on unsupervised outlier detection in high-dimensional numerical data. Stat. Anal. Data Min. 5(5), 363–387 (2012)
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Kirner, E., Schubert, E., Zimek, A. (2017). Good and Bad Neighborhood Approximations for Outlier Detection Ensembles. In: Beecks, C., Borutta, F., Kröger, P., Seidl, T. (eds) Similarity Search and Applications. SISAP 2017. Lecture Notes in Computer Science, vol 10609. Springer, Cham. https://doi.org/10.1007/978-3-319-68474-1_12
DOI: https://doi.org/10.1007/978-3-319-68474-1_12
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-68473-4
Online ISBN: 978-3-319-68474-1
eBook Packages: Computer Science, Computer Science (R0)