Research article
DOI: 10.1145/1557019.1557042

CoCo: coding cost for parameter-free outlier detection

Published: 28 June 2009

Abstract

How can we automatically spot all outstanding observations in a data set? This question arises in a wide variety of applications, e.g., in economics, biology, and medicine. Existing approaches to outlier detection suffer from one or more of the following drawbacks. The results of many methods depend strongly on suitable parameter settings, such as the minimum cluster size or the number of desired outliers, which are very difficult to estimate without background knowledge of the data. Many methods implicitly assume Gaussian or uniformly distributed data, and/or their results are difficult to interpret. To cope with these problems, we propose CoCo, a technique for parameter-free outlier detection. The basic idea of our technique relates outlier detection to data compression: outliers are objects that cannot be effectively compressed given the data set. To avoid assuming a particular data distribution, CoCo relies on a very general data model combining the Exponential Power Distribution with Independent Components. We define an intuitive outlier factor based on the Minimum Description Length principle, together with a novel algorithm for outlier detection. An extensive experimental evaluation on synthetic and real-world data demonstrates the benefits of our technique. Availability: The source code of CoCo and the data sets used in the experiments are available at http://www.dbs.ifi.lmu.de/Forschung/KDD/Boehm/CoCo.
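
The link between compression and outlier detection can be illustrated with a small sketch. It is not the CoCo algorithm or its outlier factor. It only shows, for a single dimension, how a value far from the bulk of the data needs more bits under an exponential power distribution. The parameterization, the fixed `resolution`, the function names, and the sample values are all assumptions made for illustration.

```python
import math


def epd_log_pdf(x, mu=0.0, alpha=1.0, p=2.0):
    """Natural-log density of an exponential power distribution.

    Assumed parameterization (not taken from the paper):
    f(x) = p / (2 * alpha * Gamma(1/p)) * exp(-(|x - mu| / alpha) ** p)
    With p = 2 this is a Gaussian shape, with p = 1 a Laplacian shape.
    """
    log_norm = math.log(p) - math.log(2.0 * alpha) - math.lgamma(1.0 / p)
    return log_norm - (abs(x - mu) / alpha) ** p


def coding_cost_bits(x, mu=0.0, alpha=1.0, p=2.0, resolution=1e-3):
    """Approximate bits needed to encode x at a fixed resolution.

    Uses the standard approximation -log2(f(x) * resolution),
    computed in log space to avoid underflow for extreme values.
    """
    log_prob = epd_log_pdf(x, mu, alpha, p) + math.log(resolution)
    return -log_prob / math.log(2.0)


# Hypothetical one-dimensional sample: five typical values and one extreme one.
data = [0.1, -0.3, 0.2, 0.05, -0.15, 6.0]
costs = [coding_cost_bits(x) for x in data]
for x, c in zip(data, costs):
    print(f"x = {x:6.2f}  cost = {c:7.2f} bits")
```

In this toy setting the extreme value receives the highest coding cost, which is the intuition behind treating poorly compressible objects as outliers. Here the distribution parameters are fixed by hand, whereas a parameter-free method would have to derive them from the data itself.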

Supplementary Material

JPG File (p149-haegler.jpg)
MP4 File (p149-haegler.mp4)

    Published In

    KDD '09: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining
    June 2009
    1426 pages
    ISBN:9781605584959
    DOI:10.1145/1557019

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. coding costs
    2. data compression
    3. minimum description length
    4. outlier detection

    Qualifiers

    • Research-article

    Conference

    KDD09

    Acceptance Rates

    Overall acceptance rate: 1,133 of 8,635 submissions (13%)

    Bibliometrics

    Article Metrics

    • Downloads (last 12 months): 12
    • Downloads (last 6 weeks): 1
    Reflects downloads up to 28 Feb 2025

    Cited By

    • (2024) Anomaly Detection Based on Compressed Data: An Information Theoretic Characterization. IEEE Transactions on Systems, Man, and Cybernetics: Systems 54:1 (23-38). DOI: 10.1109/TSMC.2023.3299169. Online publication date: Jan-2024
    • (2024) A survey of anomaly detection techniques. Journal of Optics 53:2 (756-774). DOI: 10.1007/s12596-023-01147-4. Online publication date: 16-Feb-2024
    • (2023) Generating Relevant and Informative Questions for Open-Domain Conversations. ACM Transactions on Information Systems 41:1 (1-30). DOI: 10.1145/3510612. Online publication date: 9-Jan-2023
    • (2023) Performance Analysis of Document Similarity-Based DBSCAN and K-Means Clustering on Text Datasets. Proceedings of International Conference on Recent Innovations in Computing (57-69). DOI: 10.1007/978-981-19-9876-8_5. Online publication date: 3-May-2023
    • (2022) Editorial: Special Issue on Deep Learning for Data Quality. Journal of Data and Information Quality 14:3 (1-3). DOI: 10.1145/3513135. Online publication date: 1-Aug-2022
    • (2022) Machine Learning and Data Cleaning: Which Serves the Other? Journal of Data and Information Quality 14:3 (1-11). DOI: 10.1145/3506712. Online publication date: 21-Jul-2022
    • (2022) Making Community Beliefs and Capacities Visible Through Care-mongering During COVID-19. Proceedings of the ACM on Human-Computer Interaction 6:GROUP (1-19). DOI: 10.1145/3492847. Online publication date: 14-Jan-2022
    • (2022) Black Lives, Green Books, and Blue Checks. Proceedings of the ACM on Human-Computer Interaction 6:GROUP (1-22). DOI: 10.1145/3492846. Online publication date: 14-Jan-2022
    • (2022) Trend analysis and forecasting of publication activities by Indian computer science researchers during the period of 2010–23. Expert Systems 39:10. DOI: 10.1111/exsy.13070. Online publication date: 19-Jun-2022
    • (2022) Anomaly Detection in Audio With Concept Drift Using Dynamic Huffman Coding. IEEE Sensors Journal 22:17 (17126-17138). DOI: 10.1109/JSEN.2022.3193969. Online publication date: 1-Sep-2022
