DOI: 10.1145/3269206.3271721
research-article

Exploring a High-quality Outlying Feature Value Set for Noise-Resilient Outlier Detection in Categorical Data

Published: 17 October 2018

ABSTRACT

Unavoidable noise in real-world categorical data poses a significant challenge to existing outlier detection methods, which normally fail to separate noisy values from outlying values. Feature subspace-based methods inevitably retain noisy values when they keep an entire feature, because a single feature may contain both outlying and noisy values. Pattern-based methods are normally based on frequency and are easily misled by noisy values, which results in many faulty patterns. This paper introduces a novel unsupervised framework, termed OUVAS, and its parameter-free instantiation, RHAC, to explore a high-quality outlying value set for detecting outliers in noisy categorical data. Based on the observation that the relations between values reflect their essence, OUVAS investigates value similarities to cluster values into groups and combines cluster-level analysis with value-level refinement to identify an outlying value set. RHAC instantiates OUVAS with three successive modules: the Ochiai coefficient combined with the LOUVAIN algorithm to cluster values, hierarchical value coupling learning to perform cluster-level analysis, and a threshold that separates fake from real outlying values in value-level refinement. We show that (i) the RHAC-based outlier detector significantly outperforms five state-of-the-art outlier detection methods; and (ii) the extended RHAC-based feature selection method improves the performance of existing outlier detectors and performs better than two recent outlying feature selection methods.
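
The value-clustering step can be illustrated with a minimal sketch of the Ochiai coefficient between categorical feature values. The abstract does not say how the value graph is built, so everything below is an assumption made for illustration: the function name `ochiai_similarities`, the use of row co-occurrence as the basis for similarity, and the toy data are hypothetical and are not taken from the paper. For two values a and b with supporting row sets A and B, the Ochiai coefficient is |A ∩ B| / sqrt(|A| · |B|).

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def ochiai_similarities(rows):
    """Return {(value_a, value_b): ochiai} for co-occurring feature values.

    Each value is a (feature_index, category) pair, so identical category
    labels in different features are treated as distinct values.
    Hypothetical helper: the paper's exact graph construction may differ.
    """
    support = defaultdict(set)  # value -> ids of the rows that contain it
    for row_id, row in enumerate(rows):
        for feature, category in enumerate(row):
            support[(feature, category)].add(row_id)

    sims = {}
    for a, b in combinations(sorted(support), 2):
        if a[0] == b[0]:
            continue  # two values of the same feature never share a row
        both = len(support[a] & support[b])
        if both:  # keep only pairs that actually co-occur
            sims[(a, b)] = both / sqrt(len(support[a]) * len(support[b]))
    return sims


# Toy data set (illustrative only): two categorical features per row.
rows = [
    ("red", "small"),
    ("red", "small"),
    ("red", "large"),
    ("blue", "large"),
]
sims = ochiai_similarities(rows)
for (a, b), score in sorted(sims.items()):
    print(a, b, round(score, 3))
```

The resulting pairs can be read as weighted edges of a value graph. A community detection routine such as a Louvain implementation could then group the values, which is roughly the role the abstract assigns to the first RHAC module.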

          Published in
          CIKM '18: Proceedings of the 27th ACM International Conference on Information and Knowledge Management
          October 2018
          2362 pages
          ISBN:9781450360142
          DOI:10.1145/3269206

          Copyright © 2018 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Acceptance Rates

          CIKM '18 Paper Acceptance Rate: 147 of 826 submissions, 18%
          Overall Acceptance Rate: 1,861 of 8,427 submissions, 22%
