DOI: 10.1145/3448016.3457271

Research article

On Saving Outliers for Better Clustering over Noisy Data

Published: 18 June 2021

Abstract

Clustering is often distracted by data errors, which are frequently observed in almost all areas, ranging from online questionnaires to sensor readings in IoT. The dirty values not only make their corresponding tuples outlying, but also mislead the clustering of the remaining tuples, e.g., mistakenly splitting a cluster into two or distorting the cluster center. The reason is that traditional clustering methods either simply ignore the outliers (e.g., DBSCAN) or assign them to the closest clusters anyway (e.g., K-Means). In this paper, we propose to save the outliers for better clustering. The idea is to adjust the erroneous values of an outlier (often minimally) so that it appears normal. That is, after value adjustment the tuples are no longer outlying, and thus will be clustered without distracting others. Outlier saving by value adjustment is designed to work with any clustering method (e.g., DBSCAN or K-Means). Our technical contributions include: (1) showing the NP-hardness of the outlier saving problem for clustering, (2) deriving lower and upper bounds of the optimal solutions, and (3) devising an approximation algorithm with performance guarantees relative to the aforesaid bounds. Experiments on datasets with real-world outliers demonstrate the higher accuracy of our proposal compared to the state-of-the-art approaches. Remarkably, we show that the data adjusted with outlier saving indeed significantly improve clustering, as well as other applications such as classification and record matching.
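To make the idea concrete, below is a minimal Python sketch (using scikit-learn) of how outlier saving could sit on top of DBSCAN. It is an illustrative assumption, not the paper's approximation algorithm: the function name save_outliers and its adjustment rule, shifting each outlier just inside the eps-neighborhood of its nearest inlier, are hypothetical.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

def save_outliers(X, eps=0.5, min_pts=5):
    """Hypothetical sketch of outlier saving: minimally adjust each
    outlying tuple so that it is no longer outlying, rather than
    discarding it (as DBSCAN does) or forcing it into the closest
    cluster (as K-Means does). NOT the paper's algorithm."""
    X = np.asarray(X, dtype=float)
    labels = DBSCAN(eps=eps, min_samples=min_pts).fit_predict(X)
    inliers = X[labels != -1]
    X_adj = X.copy()
    if len(inliers) == 0:  # everything is noise; nothing to save towards
        return X_adj, labels
    nn = NearestNeighbors(n_neighbors=1).fit(inliers)
    for i in np.where(labels == -1)[0]:
        dist, idx = nn.kneighbors(X_adj[i].reshape(1, -1))
        d, j = dist[0, 0], idx[0, 0]
        if d > eps:
            # Minimal value adjustment: move the outlier along the line
            # to its nearest inlier until it lies just inside the
            # eps-ball, so it becomes density-reachable again.
            X_adj[i] += (d - 0.9 * eps) / d * (inliers[j] - X_adj[i])
    # Re-cluster the adjusted data: saved tuples now join clusters
    # without splitting them or distorting cluster centers.
    return X_adj, DBSCAN(eps=eps, min_samples=min_pts).fit_predict(X_adj)
```

For instance, X_adj, labels = save_outliers(X, eps=0.4, min_pts=5) returns the repaired data together with cluster labels in which the formerly outlying tuples participate. A K-Means variant would analogously shift each outlier toward its nearest cluster center.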

Supplementary Material

MP4 File (3448016.3457271.mp4)


Cited By

  • Win-Win: On Simultaneous Clustering and Imputing over Incomplete Data. Proceedings of the VLDB Endowment, 17(11):3045-3057, Aug 2024. DOI: 10.14778/3681954.3681982
  • BIC-Based Mixture Model Defense Against Data Poisoning Attacks on Classifiers: A Comprehensive Study. IEEE Transactions on Knowledge and Data Engineering, 36(8):3697-3711, Aug 2024. DOI: 10.1109/TKDE.2024.3365548
  • Relational Data Cleaning Meets Artificial Intelligence: A Survey. Data Science and Engineering, Dec 2024. DOI: 10.1007/s41019-024-00266-7
  • ShadowAQP: Efficient Approximate Group-by and Join Query via Attribute-Oriented Sample Size Allocation and Data Generation. Proceedings of the VLDB Endowment, 16(13):4216-4229, Sep 2023. DOI: 10.14778/3625054.3625059
  • Scalable and Accurate Density-Peaks Clustering on Fully Dynamic Data. 2022 IEEE International Conference on Big Data (Big Data), pages 445-454, Dec 2022. DOI: 10.1109/BigData55660.2022.10020690


    Published In

    SIGMOD '21: Proceedings of the 2021 International Conference on Management of Data
    June 2021
    2969 pages
    ISBN:9781450383431
    DOI:10.1145/3448016
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].


    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 18 June 2021


    Author Tags

    1. clustering
    2. outlier saving

    Qualifiers

    • Research-article

    Funding Sources

    • National Natural Science Foundation of China
    • National Key Research and Development Plan
    • MIIT High Quality Development Program 2020

    Conference

    SIGMOD/PODS '21

    Acceptance Rates

    Overall Acceptance Rate 785 of 4,003 submissions, 20%

Article Metrics

    • Downloads (last 12 months): 24
    • Downloads (last 6 weeks): 2

    Reflects downloads up to 15 Feb 2025.

