
Good and Bad Neighborhood Approximations for Outlier Detection Ensembles

  • Conference paper
Similarity Search and Applications (SISAP 2017)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 10609)


Abstract

Outlier detection methods have used approximate neighborhoods in filter-refinement approaches. Outlier detection ensembles have used artificially obfuscated neighborhoods to achieve diverse ensemble members. Here we argue that outlier detection models could be based on approximate neighborhoods in the first place, thus gaining in both efficiency and effectiveness. It depends, however, on the type of approximation, as only some seem beneficial for the task of outlier detection, while no (large) benefit can be seen for others. In particular, we argue that space-filling curves are beneficial approximations, as they have a stronger tendency to underestimate the density in sparse regions than in dense regions. In comparison, LSH and NN-Descent do not have such a tendency and do not seem to be beneficial for the construction of outlier detection ensembles.
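
To make the space-filling-curve argument concrete, the following is a minimal, illustrative sketch in Python (not the authors' exact construction): points are sorted along a Z-order (Morton) curve, and only a small window of curve neighbors is searched, so the resulting k-distances are upper bounds that tend to be looser in sparse regions. All names (e.g., zorder_key, knn_outlier_scores_sfc) and parameters (bits, window) are hypothetical choices for illustration only.

# Illustrative sketch (not the paper's exact method): approximate k-nearest-neighbor
# distances via a Z-order (Morton) space-filling curve, usable as one ensemble member
# for a kNN-distance outlier score. Names and parameters here are hypothetical.
import math
import random

def interleave_bits(coords, bits=16):
    """Morton key: interleave the bits of quantized integer coordinates."""
    key = 0
    for b in range(bits):
        for d, c in enumerate(coords):
            key |= ((c >> b) & 1) << (b * len(coords) + d)
    return key

def zorder_key(point, lo, hi, bits=16):
    """Quantize a point to a [0, 2^bits) grid per dimension, then interleave."""
    scale = (1 << bits) - 1
    q = [min(scale, max(0, int((p - l) / (h - l + 1e-12) * scale)))
         for p, l, h in zip(point, lo, hi)]
    return interleave_bits(q, bits)

def knn_outlier_scores_sfc(points, k=5, window=20):
    """Approximate kNN-distance scores: sort by Morton key, then search only a
    window of curve neighbors. The approximate k-distance is an upper bound on
    the true one, i.e., density tends to be underestimated, and (per the paper's
    argument) more so in sparse regions than in dense regions."""
    dims = len(points[0])
    lo = [min(p[d] for p in points) for d in range(dims)]
    hi = [max(p[d] for p in points) for d in range(dims)]
    order = sorted(range(len(points)), key=lambda i: zorder_key(points[i], lo, hi))
    scores = [0.0] * len(points)
    for pos, i in enumerate(order):
        cand = [order[j] for j in range(max(0, pos - window),
                                        min(len(order), pos + window + 1)) if j != pos]
        dists = sorted(math.dist(points[i], points[j]) for j in cand)
        scores[i] = dists[k - 1] if len(dists) >= k else dists[-1]
    return scores

if __name__ == "__main__":
    random.seed(0)
    data = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(500)]
    data.append((6.0, 6.0))  # an obvious outlier far from the Gaussian cluster
    scores = knn_outlier_scores_sfc(data, k=5, window=20)
    print("outlier ranked highest:", max(range(len(data)), key=scores.__getitem__) == len(data) - 1)

In an ensemble, each member could apply a different random shift or projection of the data before sorting, and the resulting member scores could then be normalized and combined.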


Notes

  1. Results from computational geometry indicate that the worst case of nearest neighbor search in more than 3 dimensions cannot be better than \(\mathcal{O}(n^{4/3})\) [16]. Empirical results with such indexes are usually much better, and tree-based indexes are often attributed an \(n \log n\) cost for searching.

  2. There may nevertheless be a performance improvement from using these methods.

References

  1. Achlioptas, D.: Database-friendly random projections: Johnson-Lindenstrauss with binary coins. JCSS 66, 671–687 (2003)

  2. Achtert, E., Kriegel, H.P., Schubert, E., Zimek, A.: Interactive data mining with 3D-parallel-coordinate-trees. In: Proceedings SIGMOD, pp. 1009–1012 (2013)

  3. Angiulli, F., Fassetti, F.: DOLPHIN: an efficient algorithm for mining distance-based outliers in very large datasets. ACM TKDD 3(1), 4:1–57 (2009)

  4. Angiulli, F., Pizzuti, C.: Outlier mining in large high-dimensional data sets. IEEE TKDE 17(2), 203–215 (2005)

  5. Arya, S., Mount, D.M.: Approximate nearest neighbor queries in fixed dimensions. In: Proceedings SODA, pp. 271–280 (1993)

  6. Barnett, V., Lewis, T.: Outliers in Statistical Data, 3rd edn. Wiley, New York (1994)

  7. Bay, S.D., Schwabacher, M.: Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In: Proceedings KDD, pp. 29–38 (2003)

  8. Beckmann, N., Kriegel, H.P., Schneider, R., Seeger, B.: The R*-Tree: an efficient and robust access method for points and rectangles. In: Proceedings SIGMOD, pp. 322–331 (1990)

  9. Bentley, J.L.: Multidimensional binary search trees used for associative searching. Commun. ACM 18(9), 509–517 (1975)

  10. Breunig, M.M., Kriegel, H.P., Ng, R., Sander, J.: LOF: Identifying density-based local outliers. In: Proceedings SIGMOD, pp. 93–104 (2000)

  11. Campos, G.O., Zimek, A., Sander, J., Campello, R.J.G.B., Micenková, B., Schubert, E., Assent, I., Houle, M.E.: On the evaluation of unsupervised outlier detection: Measures, datasets, and an empirical study. Data Min. Knowl. Disc. 30, 891–927 (2016)

  12. Chandola, V., Banerjee, A., Kumar, V.: Anomaly detection: a survey. ACM CSUR 41(3), 1–58 (2009). Article 15

  13. Datar, M., Immorlica, N., Indyk, P., Mirrokni, V.S.: Locality-sensitive hashing scheme based on p-stable distributions. In: Proceedings ACM SoCG, pp. 253–262 (2004)

  14. de Vries, T., Chawla, S., Houle, M.E.: Density-preserving projections for large-scale local anomaly detection. KAIS 32(1), 25–52 (2012)

  15. Dong, W., Charikar, M., Li, K.: Efficient k-nearest neighbor graph construction for generic similarity measures. In: Proceedings WWW, pp. 577–586 (2011)

  16. Erickson, J.: On the relative complexities of some geometric problems. In: Proceedings of the 7th Canadian Conference on Computational Geometry, Quebec City, Quebec, Canada, August 1995, pp. 85–90 (1995)

  17. Geusebroek, J.M., Burghouts, G.J., Smeulders, A.W.M.: The Amsterdam library of object images. Int. J. Comput. Vis. 61(1), 103–112 (2005)

  18. Gionis, A., Indyk, P., Motwani, R.: Similarity search in high dimensions via hashing. In: Proceedings VLDB, pp. 518–529 (1999)

  19. Guttman, A.: R-trees: A dynamic index structure for spatial searching. In: Proceedings SIGMOD, pp. 47–57 (1984)

  20. Hilbert, D.: Ueber die stetige Abbildung einer Linie auf ein Flächenstück. Math. Ann. 38(3), 459–460 (1891)

  21. Imamura, Y., Shinohara, T., Hirata, K., Kuboyama, T.: Fast Hilbert Sort Algorithm Without Using Hilbert Indices. In: Amsaleg, L., Houle, M.E., Schubert, E. (eds.) SISAP 2016. LNCS, vol. 9939, pp. 259–267. Springer, Cham (2016). doi:10.1007/978-3-319-46759-7_20

  22. Indyk, P., Motwani, R.: Approximate nearest neighbors: towards removing the curse of dimensionality. In: Proceedings STOC, pp. 604–613 (1998)

  23. Jin, W., Tung, A.K., Han, J.: Mining top-n local outliers in large databases. In: Proceedings KDD, pp. 293–298 (2001)

  24. Johnson, W.B., Lindenstrauss, J.: Extensions of Lipschitz mappings into a Hilbert space. In: Conference in Modern Analysis and Probability, Contemporary Mathematics, vol. 26, pp. 189–206. American Mathematical Society (1984)

  25. Kabán, A.: On the distance concentration awareness of certain data reduction techniques. Pattern Recogn. 44(2), 265–277 (2011)

  26. Knorr, E.M., Ng, R.T.: Algorithms for mining distance-based outliers in large datasets. In: Proceedings VLDB, pp. 392–403 (1998)

  27. Kollios, G., Gunopulos, D., Koudas, N., Berchthold, S.: Efficient biased sampling for approximate clustering and outlier detection in large datasets. IEEE TKDE 15(5), 1170–1187 (2003)

  28. Kriegel, H.P., Kröger, P., Schubert, E., Zimek, A.: Interpreting and unifying outlier scores. In: Proceedings SDM, pp. 13–24 (2011)

  29. Kriegel, H.P., Schubert, E., Zimek, A.: The (black) art of runtime evaluation: are we comparing algorithms or implementations? KAIS 52(2), 341–378 (2017). doi:10.1007/s10115-016-1004-2

  30. Lazarevic, A., Kumar, V.: Feature bagging for outlier detection. In: Proceedings KDD, pp. 157–166 (2005)

  31. Lehmann, J., Isele, R., Jakob, M., Jentzsch, A., Kontokostas, D., Mendes, P.N., Hellmann, S., Morsey, M., van Kleef, P., Auer, S., Bizer, C.: DBpedia - a large-scale, multilingual knowledge base extracted from Wikipedia. Semant. Web J. 6(2), 167–195 (2015)

  32. Liu, F.T., Ting, K.M., Zhou, Z.H.: Isolation-based anomaly detection. ACM TKDD 6(1), 3:1–39 (2012)

  33. Matoušek, J.: On variants of the Johnson-Lindenstrauss lemma. Random Struct. Algorithms 33(2), 142–156 (2008)

  34. Morton, G.M.: A computer oriented geodetic data base and a new technique in file sequencing. Technical report, International Business Machines Co (1966)

  35. Muja, M., Lowe, D.G.: Scalable nearest neighbor algorithms for high dimensional data. IEEE TPAMI 36(11), 2227–2240 (2014)

  36. Nguyen, H.V., Gopalkrishnan, V.: Efficient pruning schemes for distance-based outlier detection. In: Proceedings ECML PKDD, pp. 160–175 (2009)

  37. Orair, G.H., Teixeira, C., Wang, Y., Meira, W., Parthasarathy, S.: Distance-based outlier detection: consolidation and renewed bearing. PVLDB 3(2), 1469–1480 (2010)

  38. Papadimitriou, S., Kitagawa, H., Gibbons, P.B., Faloutsos, C.: LOCI: Fast outlier detection using the local correlation integral. In: Proceedings ICDE, pp. 315–326 (2003)

  39. Ramaswamy, S., Rastogi, R., Shim, K.: Efficient algorithms for mining outliers from large data sets. In: Proceedings SIGMOD, pp. 427–438 (2000)

  40. Rousseeuw, P.J., Hubert, M.: Robust statistics for outlier detection. WIREs DMKD 1(1), 73–79 (2011)

  41. Schubert, E., Koos, A., Emrich, T., Züfle, A., Schmid, K.A., Zimek, A.: A framework for clustering uncertain data. PVLDB 8(12), 1976–1979 (2015)

  42. Schubert, E., Zimek, A., Kriegel, H.P.: Local outlier detection reconsidered: a generalized view on locality with applications to spatial, video, and network outlier detection. Data Min. Knowl. Disc. 28(1), 190–237 (2014)

  43. Schubert, E., Zimek, A., Kriegel, H.-P.: Fast and Scalable Outlier Detection with Approximate Nearest Neighbor Ensembles. In: Renz, M., Shahabi, C., Zhou, X., Cheema, M.A. (eds.) DASFAA 2015. LNCS, vol. 9050, pp. 19–36. Springer, Cham (2015). doi:10.1007/978-3-319-18123-3_2

  44. Silpa-Anan, C., Hartley, R.I.: Optimised kd-trees for fast image descriptor matching. In: Proceedings CVPR (2008)

  45. Venkatasubramanian, S., Wang, Q.: The Johnson-Lindenstrauss transform: an empirical study. In: Proceedings ALENEX Workshop (SIAM), pp. 164–173 (2011)

  46. Wang, Y., Parthasarathy, S., Tatikonda, S.: Locality sensitive outlier detection: a ranking driven approach. In: Proceedings ICDE, pp. 410–421 (2011)

  47. Zhang, X., Dou, W., He, Q., Zhou, R., Leckie, C., Kotagiri, R., Salcic, Z.: LSHiForest: A generic framework for fast tree isolation based ensemble anomaly analysis. In: Proceedings ICDE (2017)

  48. Zimek, A., Campello, R.J.G.B., Sander, J.: Ensembles for unsupervised outlier detection: challenges and research questions. SIGKDD Explor. 15(1), 11–22 (2013)

  49. Zimek, A., Campello, R., Sander, J.: Data perturbation for outlier detection ensembles. In: Proceedings SSDBM, pp. 13:1–12 (2014)

  50. Zimek, A., Gaudet, M., Campello, R., Sander, J.: Subsampling for efficient and effective unsupervised outlier detection ensembles. In: Proceedings KDD, pp. 428–436 (2013)

  51. Zimek, A., Schubert, E., Kriegel, H.P.: A survey on unsupervised outlier detection in high-dimensional numerical data. Stat. Anal. Data Min. 5(5), 363–387 (2012)


Author information


Correspondence to Erich Schubert.



Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Kirner, E., Schubert, E., Zimek, A. (2017). Good and Bad Neighborhood Approximations for Outlier Detection Ensembles. In: Beecks, C., Borutta, F., Kröger, P., Seidl, T. (eds) Similarity Search and Applications. SISAP 2017. Lecture Notes in Computer Science, vol 10609. Springer, Cham. https://doi.org/10.1007/978-3-319-68474-1_12

  • DOI: https://doi.org/10.1007/978-3-319-68474-1_12

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-68473-4

  • Online ISBN: 978-3-319-68474-1

  • eBook Packages: Computer Science, Computer Science (R0)
