Waseda Workshop on Causal Inference #5

Title:

Design and Analysis with Machine Learning-Generated Variables: A Unified Framework for Prediction Bias and the Illusory Sample Size

Abstract:

Machine learning (ML) has revolutionized the social sciences by enabling researchers to extract variables from massive unstructured datasets, such as text and images. However, using these ML-predicted variables in downstream statistical analyses leads to substantial bias and invalid inference. To overcome this, we propose a unified statistical framework for valid inference with ML-predicted variables that utilizes a small set of hand-labeled observations. Unlike existing model-specific corrections, our generic framework accommodates predicted outcomes, treatments, or covariates across a wide range of estimators, including linear regression, fixed effects, survival analysis, and instrumental variable estimation. Crucially, we uncover an ‘illusory sample size’ problem: contrary to common intuition, massive unlabeled datasets do not reduce estimation variance without sufficient hand-labeled data. Accordingly, our framework uses a small pilot dataset to optimize data collection, balancing labeling costs against estimation precision. We demonstrate the framework’s utility by revisiting a study on election fraud using ballot image data.