Abstract
The presence of data noise and corruption has recently drawn increasing attention to robust least-squares regression (RLSR), which addresses the fundamental problem of learning reliable regression coefficients when response variables can be arbitrarily corrupted. Until now, the following important challenges could not be handled concurrently: (1) rigorous recovery guarantees for the regression coefficients, (2) difficulty in estimating the corruption ratio parameter, and (3) scaling to massive datasets. This article proposes a novel Robust regression algorithm via Heuristic Corruption Thresholding (RHCT) that addresses all of these challenges concurrently. Specifically, the algorithm alternates between optimizing the regression coefficients and estimating the optimal uncorrupted set via heuristic thresholding, without a pre-defined corruption ratio parameter, until convergence. Moreover, to improve the efficiency of corruption estimation on large-scale data, a Robust regression algorithm via Adaptive Corruption Thresholding (RACT) is proposed, which determines the size of the uncorrupted set through a novel adaptive search method without exhaustively iterating over the data samples. In addition, we prove that our algorithms enjoy strong guarantees analogous to those of state-of-the-art methods in terms of convergence rates and recovery guarantees. Extensive experiments demonstrate that our methods outperform existing methods in recovering both the regression coefficients and the uncorrupted sets, with highly competitive efficiency.
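The alternating scheme described above can be sketched in a minimal form. This is not the authors' exact RHCT procedure (their threshold is chosen heuristically rather than from a fixed fraction); it is a generic hard-thresholding sketch in which `keep_frac`, an assumed parameter standing in for the estimated uncorrupted-set size, controls how many low-residual samples are retained at each step:

```python
import numpy as np

def robust_regression_hard_thresholding(X, y, keep_frac=0.8, max_iter=100, tol=1e-6):
    """Alternate between (1) least-squares fit on the current estimate of
    the uncorrupted set and (2) re-estimating that set by keeping the
    samples with the smallest absolute residuals (hard thresholding)."""
    n = X.shape[0]
    k = int(keep_frac * n)          # assumed size of the uncorrupted set
    S = np.arange(n)                # start by trusting all samples
    w = np.zeros(X.shape[1])
    for _ in range(max_iter):
        # Step 1: ordinary least squares restricted to the trusted set
        w_new, *_ = np.linalg.lstsq(X[S], y[S], rcond=None)
        # Step 2: re-estimate the uncorrupted set from the residuals
        residuals = np.abs(y - X @ w_new)
        S = np.argsort(residuals)[:k]
        if np.linalg.norm(w_new - w) < tol:
            w = w_new
            break
        w = w_new
    return w, S
```

On data where a minority of responses carry gross corruption, the corrupted samples produce large residuals under any reasonable fit, so the thresholding step tends to expel them from the trusted set and the coefficients converge toward the clean least-squares solution.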
Index Terms
- Robust Regression via Heuristic Corruption Thresholding and Its Adaptive Estimation Variation