ABSTRACT
Researchers in the social and behavioral sciences routinely rely on quasi-experimental designs to discover knowledge from large databases. Quasi-experimental designs (QEDs) exploit fortuitous circumstances in non-experimental data to identify situations (sometimes called "natural experiments") that provide the equivalent of experimental control and randomization. QEDs allow researchers in domains as diverse as sociology, medicine, and marketing to draw reliable inferences about causal dependencies from non-experimental data. Unfortunately, identifying and exploiting QEDs has remained a painstaking manual activity, requiring researchers to scour available databases and apply substantial knowledge of statistics. However, recent advances in the expressiveness of databases, and increases in their size and complexity, provide the necessary conditions to automatically identify QEDs. In this paper, we describe the first system to discover knowledge by applying quasi-experimental designs that were identified automatically. We demonstrate that QEDs can be identified in a traditional database schema and that such identification requires only a small number of extensions to that schema, knowledge about quasi-experimental design encoded in first-order logic, and a theorem-proving engine. We describe several key innovations necessary to enable this system, including methods for automatically constructing appropriate experimental units and for creating aggregate variables on those units. We show that applying the resulting designs can identify important causal dependencies in real domains, and we provide examples from academic publishing, movie making and marketing, and peer-production systems. Finally, we discuss the integration of QEDs with other approaches to causal discovery, including joint modeling and directed experimentation.