research-article

Analyzing data with systematic bias

Published: 28 November 2022

Abstract

In many data analysis problems, the only data available are biased because of a systematic bias in the data collection procedure. In this letter, we present a general formulation of systematic bias in data, together with our recent results on handling two fundamental types of systematic bias that arise frequently in econometric studies: truncation bias and self-selection bias.
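
As a rough illustration of what truncation bias means, the sketch below draws Gaussian samples, keeps only the positive ones, and compares the naive sample mean with a corrected estimate. Everything in it is an assumption made for illustration: the distribution N(0.5, 1), the known unit variance, the known truncation set {x > 0}, and the helper names. The correction is the textbook moment-matching estimator for a truncated normal with known variance, not the algorithms developed in the letter.

```python
import math
import random


def normal_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def truncated_mean(mu):
    """E[X | X > 0] for X ~ N(mu, 1)."""
    return mu + normal_pdf(mu) / normal_cdf(mu)


def estimate_mu(observed_mean, lo=-5.0, hi=5.0, iters=60):
    """Solve truncated_mean(mu) = observed_mean by bisection.

    truncated_mean is increasing in mu, so bisection converges
    whenever the solution lies in [lo, hi].
    """
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if truncated_mean(mid) < observed_mean:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


rng = random.Random(0)   # fixed seed so the run is reproducible
true_mu = 0.5            # hypothetical ground truth

# Systematic bias in data collection: only positive draws are observed.
samples = [x for x in (rng.gauss(true_mu, 1.0) for _ in range(200_000)) if x > 0]

naive = sum(samples) / len(samples)   # ignores the truncation
corrected = estimate_mu(naive)        # accounts for the truncation

print(f"true mean      : {true_mu:.3f}")
print(f"naive estimate : {naive:.3f}")
print(f"corrected      : {corrected:.3f}")
```

The naive estimate overshoots because every draw below zero is silently discarded, while the corrected estimate recovers a value close to the true mean. The sketch relies on knowing both the truncation set and the noise variance, which is a much easier setting than the general one the abstract describes.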


Cited By

  • (2024) Efficient Statistics With Unknown Truncation, Polynomial Time Algorithms, Beyond Gaussians. 2024 IEEE 65th Annual Symposium on Foundations of Computer Science (FOCS), 988-1006. DOI: 10.1109/FOCS61266.2024.00066. Online publication date: 27-Oct-2024

Published In

ACM SIGecom Exchanges, Volume 20, Issue 1
July 2022
71 pages
EISSN: 1551-9031
DOI: 10.1145/3572885
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. bias
  2. censoring
  3. self-selection
  4. truncation
