Waseda Workshop on Causal Inference #5

Title:

Design and Analysis with Machine Learning-Generated Variables: A Unified Framework for Prediction Bias and the Illusory Sample Size

Abstract:

Machine learning (ML) has revolutionized the social sciences by enabling researchers to extract variables from massive unstructured datasets, such as text and images. However, using these ML-predicted variables in downstream statistical analyses leads to substantial bias and invalid inference. To overcome this, we propose a unified statistical framework for valid inference with ML-predicted variables that utilizes a small set of hand-labeled observations. Unlike existing model-specific corrections, our generic framework accommodates predicted outcomes, treatments, or covariates across a wide range of estimators, including linear regression, fixed effects, survival analysis, and instrumental variable estimation. Crucially, we uncover an ‘illusory sample size’ problem: contrary to common intuition, massive unlabeled datasets do not reduce estimation variance without sufficient hand-labeled data. Accordingly, our framework uses a small pilot dataset to optimize data collection, balancing labeling costs against estimation precision. We demonstrate the framework’s utility by revisiting a study on election fraud using ballot image data.
