
Efficient distance-based outlier detection on uncertain datasets of Gaussian distribution

World Wide Web

Abstract

Uncertain data management, querying and mining have become important because much real-world data is now accompanied by uncertainty. Uncertainty in data is often caused by deficiencies in the underlying data-collection equipment, and is sometimes introduced deliberately to preserve data privacy. This work discusses the problem of distance-based outlier detection on uncertain datasets of Gaussian distribution. The naive approach to distance-based outlier detection on uncertain data is usually infeasible because the distance function is expensive to evaluate. A cell-based approach is therefore proposed in this work to identify the outliers quickly. The unbounded support of the Gaussian distribution makes it difficult to devise effective pruning techniques, so an approximate approach using a bounded Gaussian distribution is also proposed. Approximating the Gaussian distribution by a bounded Gaussian distribution enables an approximate but more efficient cell-based outlier detection approach. An extensive empirical study on synthetic and real datasets shows that the proposed approaches are effective, efficient and scalable.
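To make the naive approach and the motivation for bounding more concrete, the following is a minimal sketch, not the authors' algorithm. It assumes one-dimensional objects, each an independent Gaussian given as a (mean, standard deviation) pair, and it assumes an outlier definition the abstract does not state: an object is flagged when its expected number of neighbours within distance D is at most a threshold. The function names and the parameters `threshold` and `k_sigma` are hypothetical. The only fact relied on is that the difference of two independent Gaussians is itself Gaussian, with the variances added.

```python
import math


def phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def prob_within(mu_i, sd_i, mu_j, sd_j, D):
    """Pr(|X_i - X_j| <= D) for independent 1-D Gaussians X_i and X_j.

    X_i - X_j is Gaussian with mean mu_i - mu_j and variance
    sd_i^2 + sd_j^2, so the probability is a difference of two CDF values.
    """
    delta = mu_i - mu_j
    s = math.sqrt(sd_i ** 2 + sd_j ** 2)
    return phi((D - delta) / s) - phi((-D - delta) / s)


def expected_neighbors(objects, i, D, k_sigma=None):
    """Expected number of objects within distance D of object i.

    If k_sigma is given, each object is treated as bounded to
    mean +/- k_sigma * sd (an illustrative assumption). Pairs whose
    bounded intervals are more than D apart are skipped, which is the
    kind of pruning an unbounded Gaussian does not permit.
    """
    mu_i, sd_i = objects[i]
    total = 0.0
    for j, (mu_j, sd_j) in enumerate(objects):
        if j == i:
            continue
        if k_sigma is not None:
            gap = abs(mu_i - mu_j) - k_sigma * (sd_i + sd_j)
            if gap > D:
                continue  # bounded supports cannot be within D of each other
        total += prob_within(mu_i, sd_i, mu_j, sd_j, D)
    return total


def naive_outliers(objects, D, threshold, k_sigma=None):
    """Indices of objects whose expected neighbour count is <= threshold.

    This compares every pair of objects, so the cost is quadratic in the
    number of objects.
    """
    return [
        i
        for i in range(len(objects))
        if expected_neighbors(objects, i, D, k_sigma) <= threshold
    ]


# Three tightly grouped uncertain objects and one far away.
objects = [(0.0, 0.1), (0.2, 0.1), (0.1, 0.1), (5.0, 0.1)]
print(naive_outliers(objects, D=0.5, threshold=1.0))               # [3]
print(naive_outliers(objects, D=0.5, threshold=1.0, k_sigma=3.0))  # [3]
```

With `k_sigma=3.0` the far-away object is never compared in full against the other three, yet the result is the same on this toy input. In general a bounded approximation trades a small amount of probability mass in the tails for the ability to skip distant pairs.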



Author information

Corresponding author

Correspondence to Salman A. Shaikh.


About this article

Cite this article

Shaikh, S.A., Kitagawa, H. Efficient distance-based outlier detection on uncertain datasets of Gaussian distribution. World Wide Web 17, 511–538 (2014). https://doi.org/10.1007/s11280-013-0211-y


  • DOI: https://doi.org/10.1007/s11280-013-0211-y
