DOI: 10.1145/3448016.3457271

Research article

On Saving Outliers for Better Clustering over Noisy Data

Published: 18 June 2021

Abstract

Clustering is often distracted by data errors, which are frequently observed in almost all areas, ranging from online questionnaires to sensor readings in IoT. The dirty values not only make their corresponding tuples outlying, but also mislead the clustering of the remaining tuples, e.g., mistakenly splitting a cluster into two or distorting the cluster center. The reason is that traditional clustering methods either simply ignore the outliers (e.g., DBSCAN) or assign them to the closest clusters anyway (e.g., K-Means). In this paper, we propose to save the outliers for better clustering. The idea is to adjust the erroneous values of an outlier (often minimally) so that it appears normal. That is, after value adjustment the tuples are no longer outlying, and thus will be clustered without distracting others. Outlier saving by value adjustment is designed to work with any clustering method (e.g., DBSCAN or K-Means). Our technical contributions include: (1) showing the NP-hardness of the outlier saving problem for clustering, (2) deriving lower and upper bounds of the optimal solutions, and (3) devising an approximation algorithm with performance guarantees relative to the aforesaid bounds. Experiments on datasets with real-world outliers demonstrate the higher accuracy of our proposal compared to the state-of-the-art approaches. Remarkably, we show that the data adjusted with outlier saving indeed significantly improve clustering, as well as other applications such as classification and record matching.
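To make the idea concrete, below is a minimal Python sketch (using scikit-learn) of how outlier saving could sit on top of DBSCAN. It is an illustrative assumption, not the paper's approximation algorithm: the function name save_outliers and its adjustment rule, shifting each outlier just inside the eps-neighborhood of its nearest inlier, are hypothetical.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

def save_outliers(X, eps=0.5, min_pts=5):
    """Hypothetical sketch of outlier saving: minimally adjust each
    outlying tuple so that it is no longer outlying, rather than
    discarding it (as DBSCAN does) or forcing it into the closest
    cluster (as K-Means does). NOT the paper's algorithm."""
    X = np.asarray(X, dtype=float)
    labels = DBSCAN(eps=eps, min_samples=min_pts).fit_predict(X)
    inliers = X[labels != -1]
    X_adj = X.copy()
    if len(inliers) == 0:  # everything is noise; nothing to save towards
        return X_adj, labels
    nn = NearestNeighbors(n_neighbors=1).fit(inliers)
    for i in np.where(labels == -1)[0]:
        dist, idx = nn.kneighbors(X_adj[i].reshape(1, -1))
        d, j = dist[0, 0], idx[0, 0]
        if d > eps:
            # Minimal value adjustment: move the outlier along the line
            # to its nearest inlier until it lies just inside the
            # eps-ball, so it becomes density-reachable again.
            X_adj[i] += (d - 0.9 * eps) / d * (inliers[j] - X_adj[i])
    # Re-cluster the adjusted data: saved tuples now join clusters
    # without splitting them or distorting cluster centers.
    return X_adj, DBSCAN(eps=eps, min_samples=min_pts).fit_predict(X_adj)
```

For instance, X_adj, labels = save_outliers(X, eps=0.4, min_pts=5) returns the repaired data together with cluster labels in which the formerly outlying tuples participate. A K-Means variant would analogously shift each outlier toward its nearest cluster center.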

Supplementary Material

MP4 File (3448016.3457271.mp4)


Cited By

  • Win-Win: On Simultaneous Clustering and Imputing over Incomplete Data. Proceedings of the VLDB Endowment, 17(11):3045-3057, Aug 2024. DOI: 10.14778/3681954.3681982
  • BIC-Based Mixture Model Defense Against Data Poisoning Attacks on Classifiers: A Comprehensive Study. IEEE Transactions on Knowledge and Data Engineering, 36(8):3697-3711, Aug 2024. DOI: 10.1109/TKDE.2024.3365548
  • Relational Data Cleaning Meets Artificial Intelligence: A Survey. Data Science and Engineering, Dec 2024. DOI: 10.1007/s41019-024-00266-7
  • ShadowAQP: Efficient Approximate Group-by and Join Query via Attribute-Oriented Sample Size Allocation and Data Generation. Proceedings of the VLDB Endowment, 16(13):4216-4229, Sep 2023. DOI: 10.14778/3625054.3625059
  • Scalable and Accurate Density-Peaks Clustering on Fully Dynamic Data. 2022 IEEE International Conference on Big Data (Big Data), pages 445-454, Dec 2022. DOI: 10.1109/BigData55660.2022.10020690


    Published In

    SIGMOD '21: Proceedings of the 2021 International Conference on Management of Data
    June 2021
    2969 pages
    ISBN:9781450383431
    DOI:10.1145/3448016
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].


    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 18 June 2021


    Author Tags

    1. clustering
    2. outlier saving

    Qualifiers

    • Research-article

    Funding Sources

    • National Natural Science Foundation of China
    • National Key Research and Development Plan
    • MIIT High Quality Development Program 2020

    Conference

    SIGMOD/PODS '21

    Acceptance Rates

    Overall Acceptance Rate 785 of 4,003 submissions, 20%

Article Metrics

    • Downloads (last 12 months): 24
    • Downloads (last 6 weeks): 2

    Reflects downloads up to 15 Feb 2025.

