Abstract
Popular outlier detection methods require the pairwise comparison of objects to compute the nearest neighbors. This inherently quadratic problem is not scalable to large data sets, making multidimensional outlier detection for big data still an open challenge. Existing approximate neighbor search methods are designed to preserve distances as well as possible. In this article, we present a highly scalable approach to compute the nearest neighbors of objects that instead focuses on preserving neighborhoods well using an ensemble of space-filling curves. We show that the method has near-linear complexity, can be distributed to clusters for computation, and preserves neighborhoods—but not distances—better than established methods such as locality sensitive hashing and projection indexed nearest neighbors. Furthermore, we demonstrate that, by preserving neighborhoods, the quality of outlier detection based on local density estimates is not only well retained but sometimes even improved, an effect that can be explained by relating our method to outlier detection ensembles. At the same time, the outlier detection process is accelerated by two orders of magnitude.
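The idea of finding neighbor candidates with an ensemble of space-filling curves can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes points scaled to the unit cube, uses only the Z-order (Morton) curve with a random shift per ensemble member, and takes candidates from a fixed window in each sorted order. The names `approx_knn`, `morton_key`, `quantize`, and `BITS`, and the parameters `curves` and `window`, are hypothetical and chosen for illustration.

```python
import random
from math import dist

BITS = 16  # assumed grid resolution per dimension (illustrative choice)

def quantize(point, shift):
    """Map a point in [0,1)^d, shifted by `shift`, to integer grid cells."""
    top = (1 << BITS) - 1
    return [min(top, int(((x + s) / 2.0) * (1 << BITS)))
            for x, s in zip(point, shift)]

def morton_key(cell):
    """Interleave the bits of the cell coordinates into one Z-order key."""
    dims = len(cell)
    key = 0
    for bit in range(BITS):
        for d, c in enumerate(cell):
            key |= ((c >> bit) & 1) << (bit * dims + d)
    return key

def approx_knn(points, k, curves=4, window=None, seed=0):
    """Approximate k nearest neighbors from an ensemble of shifted Z-curves.

    Each curve sorts the points once (O(n log n)); candidates are the
    `window` predecessors and successors of a point in that order. The
    union of candidates over all curves is refined with exact distances.
    """
    rng = random.Random(seed)
    n, dims = len(points), len(points[0])
    window = 2 * k if window is None else window
    candidates = [set() for _ in range(n)]
    for _ in range(curves):
        shift = [rng.random() for _ in range(dims)]
        keys = [morton_key(quantize(p, shift)) for p in points]
        order = sorted(range(n), key=keys.__getitem__)
        for pos, i in enumerate(order):
            lo, hi = max(0, pos - window), min(n, pos + window + 1)
            candidates[i].update(order[lo:hi])
    result = []
    for i, cand in enumerate(candidates):
        cand.discard(i)  # a point is not its own neighbor
        ranked = sorted(cand, key=lambda j: dist(points[i], points[j]))
        result.append(ranked[:k])
    return result
```

Calling `approx_knn(points, k=10)` returns, for each point, the indices of its approximate neighbors, which could then feed a density-based outlier score such as LOF. Each curve costs one sort, so the candidate generation stays near-linear in the number of points, and the curves are independent of one another, which is what makes this kind of scheme straightforward to distribute.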
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Schubert, E., Zimek, A., Kriegel, H.P. (2015). Fast and Scalable Outlier Detection with Approximate Nearest Neighbor Ensembles. In: Renz, M., Shahabi, C., Zhou, X., Cheema, M. (eds) Database Systems for Advanced Applications. DASFAA 2015. Lecture Notes in Computer Science, vol 9050. Springer, Cham. https://doi.org/10.1007/978-3-319-18123-3_2
DOI: https://doi.org/10.1007/978-3-319-18123-3_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-18122-6
Online ISBN: 978-3-319-18123-3
eBook Packages: Computer Science, Computer Science (R0)