Abstract
The presence of data noise and corruption has recently drawn increasing attention to robust least-squares regression (RLSR), which addresses the fundamental problem of learning reliable regression coefficients when response variables can be arbitrarily corrupted. Until now, the following important challenges could not be handled concurrently: (1) rigorous recovery guarantees for the regression coefficients, (2) difficulty in estimating the corruption ratio parameter, and (3) scaling to massive datasets. This article proposes a novel Robust regression algorithm via Heuristic Corruption Thresholding (RHCT) that addresses all of these challenges concurrently. Specifically, the algorithm alternates between optimizing the regression coefficients and estimating the optimal uncorrupted set via heuristic thresholding, without a pre-defined corruption ratio parameter, until convergence. Moreover, to improve the efficiency of corruption estimation on large-scale data, a Robust regression algorithm via Adaptive Corruption Thresholding (RACT) is proposed, which determines the size of the uncorrupted set through a novel adaptive search method without exhaustively iterating over the data samples. In addition, we prove that our algorithms enjoy strong guarantees analogous to those of state-of-the-art methods in terms of convergence rates and recovery guarantees. Extensive experiments demonstrate that our methods outperform existing methods in recovering both the regression coefficients and the uncorrupted sets, with highly competitive efficiency.
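The alternating scheme described above can be sketched in a minimal form. This is not the authors' exact RHCT procedure (their threshold is chosen heuristically rather than from a fixed fraction); it is a generic hard-thresholding sketch in which `keep_frac`, an assumed parameter standing in for the estimated uncorrupted-set size, controls how many low-residual samples are retained at each step:

```python
import numpy as np

def robust_regression_hard_thresholding(X, y, keep_frac=0.8, max_iter=100, tol=1e-6):
    """Alternate between (1) least-squares fit on the current estimate of
    the uncorrupted set and (2) re-estimating that set by keeping the
    samples with the smallest absolute residuals (hard thresholding)."""
    n = X.shape[0]
    k = int(keep_frac * n)          # assumed size of the uncorrupted set
    S = np.arange(n)                # start by trusting all samples
    w = np.zeros(X.shape[1])
    for _ in range(max_iter):
        # Step 1: ordinary least squares restricted to the trusted set
        w_new, *_ = np.linalg.lstsq(X[S], y[S], rcond=None)
        # Step 2: re-estimate the uncorrupted set from the residuals
        residuals = np.abs(y - X @ w_new)
        S = np.argsort(residuals)[:k]
        if np.linalg.norm(w_new - w) < tol:
            w = w_new
            break
        w = w_new
    return w, S
```

On data where a minority of responses carry gross corruption, the corrupted samples produce large residuals under any reasonable fit, so the thresholding step tends to expel them from the trusted set and the coefficients converge toward the clean least-squares solution.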
Index Terms
- Robust Regression via Heuristic Corruption Thresholding and Its Adaptive Estimation Variation