ABSTRACT
Unavoidable noise in real-world categorical data poses a significant challenge to existing outlier detection methods, which normally fail to separate noisy values from outlying values. Feature subspace-based methods inevitably retain noisy values when they keep an entire feature, because a single feature may contain both outlying and noisy values. Pattern-based methods are normally based on frequency and are easily misled by noisy values, which results in many faulty patterns. This paper introduces a novel unsupervised framework, termed OUVAS, and its parameter-free instantiation, RHAC, to explore a high-quality outlying value set for detecting outliers in noisy categorical data. Based on the observation that the relations between values reflect their essence, OUVAS investigates value similarities to cluster values into groups and combines cluster-level analysis with value-level refinement to identify an outlying value set. RHAC instantiates OUVAS with three successive modules: (1) the Ochiai coefficient combined with the Louvain algorithm to cluster values, (2) hierarchical value coupling learning to perform cluster-level analysis, and (3) a threshold that separates fake from real outlying values in value-level refinement. We show that (i) the RHAC-based outlier detector significantly outperforms five state-of-the-art outlier detection methods, and (ii) the extended RHAC-based feature selection method improves the performance of existing outlier detectors and performs better than the two latest outlying feature selection methods.
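As a rough illustration of the first module only, the sketch below computes Ochiai similarities between feature values. It is not the authors' implementation: it assumes that two values are compared through the sets of data objects in which they occur, and the function names (`value_occurrences`, `ochiai`, `value_similarity_graph`) and the toy data are hypothetical. The resulting weighted edges could then be handed to a community detection algorithm such as Louvain; that step is omitted here.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def value_occurrences(rows):
    """Map each (feature_index, value) pair to the set of row indices containing it."""
    occ = defaultdict(set)
    for i, row in enumerate(rows):
        for j, v in enumerate(row):
            occ[(j, v)].add(i)
    return dict(occ)


def ochiai(a, b):
    """Ochiai coefficient of two sets: |A & B| / sqrt(|A| * |B|)."""
    if not a or not b:
        return 0.0
    return len(a & b) / sqrt(len(a) * len(b))


def value_similarity_graph(rows):
    """Return {(value_u, value_v): similarity} for every pair that co-occurs."""
    occ = value_occurrences(rows)
    edges = {}
    # Assumption: values are compared by the objects they share; pairs that
    # never co-occur get no edge.
    for u, v in combinations(sorted(occ), 2):
        s = ochiai(occ[u], occ[v])
        if s > 0:
            edges[(u, v)] = s
    return edges


# Toy categorical data set (hypothetical): two features, four objects.
rows = [
    ("red", "small"),
    ("red", "small"),
    ("red", "large"),
    ("blue", "large"),
]
graph = value_similarity_graph(rows)
for (u, v), s in graph.items():
    print(u, v, round(s, 3))
```

In this toy data, two values of the same feature never share an object, so only cross-feature pairs receive an edge. The edge weights are the value similarities that a clustering step would consume.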
Index Terms
- Exploring a High-quality Outlying Feature Value Set for Noise-Resilient Outlier Detection in Categorical Data