Random Forests and Selected Samples

16 Pages. Posted: 13 Nov 2017. Last revised: 3 May 2019


Jonathan A Cook

U.S. Securities and Exchange Commission; affiliation not provided to SSRN

Saad Siddiqui

Villanova University

Date Written: April 1, 2019


This paper presents a procedure for recovering causal coefficients from selected samples using random forests, a popular machine-learning algorithm. The proposed method makes few assumptions about the selection equation and the distribution of the error terms. Our Monte Carlo results indicate that the method performs well even when the selection and outcome equations contain the same variables, as long as the selection equation is nonlinear. The method can also be used when the selection equation contains many variables. We also compare the results of our procedure with those of other parametric and semiparametric methods using real data.
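
For readers unfamiliar with the setting, the textbook sample-selection model can be sketched as follows. This is the standard formulation from the literature, not notation taken from the paper itself; the symbols $h$, $g$, $p$, $z_i$, and $u_i$ are assumptions introduced here for illustration, and the paper's exact specification may differ.

```latex
% Textbook sample-selection model (illustrative notation, not the paper's own).
\begin{align}
  y_i^* &= x_i'\beta + \varepsilon_i
        && \text{(outcome equation)} \\
  s_i   &= \mathbf{1}\{\, h(z_i) + u_i > 0 \,\}
        && \text{(selection equation, } h \text{ left unspecified)} \\
  y_i   &= y_i^* \quad \text{observed only if } s_i = 1 .
\end{align}
% If (\varepsilon_i, u_i) is independent of (x_i, z_i) and u_i has a strictly
% increasing distribution function, the selection bias depends on z_i only
% through the selection probability p(z_i) = \Pr(s_i = 1 \mid z_i):
\begin{equation}
  \mathbb{E}[\, y_i \mid x_i, z_i, s_i = 1 \,] = x_i'\beta + g\bigl(p(z_i)\bigr).
\end{equation}
% Heckman's parametric special case: joint normality of (\varepsilon_i, u_i)
% with \operatorname{Var}(u_i) = 1 gives the inverse Mills ratio form
\begin{equation}
  g(p) = \rho\,\sigma_\varepsilon \,
         \frac{\phi\bigl(\Phi^{-1}(p)\bigr)}{p}.
\end{equation}
```

Under this reading, a flexible learner such as a random forest would plausibly be used to estimate $p(z_i)$ without committing to a functional form for $h$, and $g$ would be left unrestricted rather than fixed by a normality assumption. The display also suggests why a nonlinear selection equation matters when $z_i = x_i$: the correction term $g(p(x_i))$ can be separated from the linear term $x_i'\beta$ only if $p(x_i)$ varies in ways a linear function of $x_i$ does not capture.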

Keywords: Sample-selection model, random forest, Heckman model, semiparametric estimation

Suggested Citation

Cook, Jonathan A and Siddiqui, Saad, Random Forests and Selected Samples (April 1, 2019). Available at SSRN: https://ssrn.com/abstract=3068128 or http://dx.doi.org/10.2139/ssrn.3068128

Jonathan A Cook (Contact Author)

U.S. Securities and Exchange Commission

Saad Siddiqui

Villanova University

Villanova, PA 19085
United States
