Simulation-Based Methods
Zijing Hu
October 15, 2022
Contents

1 Bootstrap
2 Sampling Methods
  2.1 Ways to Draw from a Density
3 Simulation-Based Estimation
  3.1 Maximum Simulated Likelihood Estimation (MSL)
  3.2 Simulated Method of Moments Estimation (SMM)
*This note is based on
ECMT 677: Applied Microeconometrics by Dr. Yonghong An and Dr. Jackson Bunting, TAMU
Discrete Choice Methods with Simulation by Kenneth Train
1 Bootstrap
Asymptotics vs. Bootstrap
1. Heteroskedasticity: robust std performs badly for small sample sizes.
2. Cluster: Athey et al. (2017, 2022)
Bootstrapping (Efron 1979) assigns measures of accuracy (bias, variance, confidence intervals,
prediction error, etc.) to sample estimates.
$$\operatorname{std}(\hat{\boldsymbol{\beta}}) = \sqrt{\frac{1}{B-1}\sum_{j=1}^{B}\left(\tilde{\boldsymbol{\beta}}_j - \bar{\tilde{\boldsymbol{\beta}}}\right)^2}$$
How do we choose a sufficient number of replications $B$? $B$ determines the accuracy of $\operatorname{std}(\hat{\boldsymbol{\beta}})$ and $N$ determines $\hat{\boldsymbol{\beta}}$. $B = 500$ is usually enough and $B = 1000$ is sufficient.
Bootstrap Methods in Linear Regression
Pair bootstrap. For data $(X_i, y_i) \sim F(X, y)$, draw bootstrap samples $(X_j^*, y_j^*)$ for $j = 1, 2, \ldots, B$ and compute each $\hat{\boldsymbol{\beta}}_j$. Drawback: no restriction such as $E[\varepsilon_j^* \mid X_j^*] = E[y_j^* - X_j^{*\prime}\boldsymbol{\beta} \mid X_j^*] = 0$ is imposed. A minimal sketch follows.
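A minimal pair-bootstrap sketch in Python (numpy only; the data-generating process is an illustrative assumption, not part of the notes):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative DGP (an assumption): y = X @ beta + eps
n, k, B = 200, 2, 500
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
beta_true = np.array([1.0, 2.0])
y = X @ beta_true + rng.normal(size=n)

def ols(X, y):
    # OLS coefficients via least squares
    return np.linalg.lstsq(X, y, rcond=None)[0]

beta_hat = ols(X, y)

# Pair bootstrap: resample (X_i, y_i) rows with replacement, re-estimate B times
beta_draws = np.empty((B, k))
for b in range(B):
    idx = rng.integers(0, n, size=n)
    beta_draws[b] = ols(X[idx], y[idx])

# std across replications; ddof=1 gives the 1/(B-1) factor from the formula above
print(beta_hat, beta_draws.std(axis=0, ddof=1))
```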
Residual bootstrap. Compute $\hat{u}_i = y_i - X_i'\hat{\boldsymbol{\beta}}$ and sample $(X_i, y_i^*)$ where $y_i^* = X_i'\hat{\boldsymbol{\beta}} + \hat{u}_i^*$ and $\hat{u}_i^* \in \{\hat{u}_1, \hat{u}_2, \ldots, \hat{u}_n\}$. Drawback: homoskedasticity is imposed, i.e., $\operatorname{Var}(\hat{u}_i \mid X_i) = \sigma^2$.
Wild bootstrap. First, compute $\hat{\boldsymbol{\beta}}$ and $\hat{u}_i$ for $i = 1, 2, \ldots, n$. Then,
$$y_i^* = X_i'\hat{\boldsymbol{\beta}} + f(\hat{u}_i)\,v_i$$
where $f(\cdot)$ is a function of $\hat{u}_i$ and $v_i$ has zero mean and $\hat{u}_i \perp v_i$.
Common choices of $f(\cdot)$: $f(\hat{u}_i) = \sqrt{\frac{n}{n-k}}\,\hat{u}_i$, $f(\hat{u}_i) = \hat{u}_i/\sqrt{1-h_i}$, or $f(\hat{u}_i) = \hat{u}_i/(1-h_i)$, where $n$ is the sample size, $k$ is the number of regressors, and $h_i$ is the $i$-th diagonal element of the matrix $P \equiv X(X'X)^{-1}X'$. Normally, researchers use $f(\cdot)$ to capture heteroskedasticity and $v_i$ to capture intra-cluster correlations. (The last two choices are better as they incorporate heteroskedasticity.)
Two requirements of $v_i$: $E[v_i] = 0$, and the moments of $f(\hat{u}_i)v_i$ should match the moments of $f(\hat{u}_i)$.
Two choices of $v_i$ (the second is better than the first, while both perform well as $n \to +\infty$):

Mammen (1993)
$$v_i = \begin{cases} -\dfrac{\sqrt{5}-1}{2} & \text{with probability } \dfrac{\sqrt{5}+1}{2\sqrt{5}} \\[2ex] \dfrac{\sqrt{5}+1}{2} & \text{with probability } \dfrac{\sqrt{5}-1}{2\sqrt{5}} \end{cases}$$
where $E[v_i] = 0$, $E[v_i^2] = 1$, $E[v_i^3] = 1$, $E[v_i^4] = 2$.
Davidson and Flachaire (2008)
$$v_i = \begin{cases} 1 & \text{with probability } \dfrac{1}{2} \\[1ex] -1 & \text{with probability } \dfrac{1}{2} \end{cases}$$
where $E[v_i] = 0$, $E[v_i^2] = 1$, $E[v_i^3] = 0$, $E[v_i^4] = 1$.
Example 1.1. Wild Bootstrap. Suppose that $y_{ji} = X_{ji}'\boldsymbol{\beta} + \varepsilon_{ji}$, where $j = 1, 2, \ldots, m$ indexes clusters and $i = 1, 2, \ldots, n_j$ indexes individuals. Now we show an example of bootstrapping the cluster-robust std:

1. Run OLS to get $\hat{\boldsymbol{\beta}}$ and $\hat{u}_{ji}$.
2. Wild bootstrap $(y_{ji}^*, X_{ji})$:
$$y_{ji}^* = X_{ji}'\hat{\boldsymbol{\beta}} + f(\hat{u}_{ji})\,v_j$$
where $v_j$ is shared by all the $n_j$ individuals in cluster $j$.
3. For each bootstrapped sample, estimate $\tilde{\boldsymbol{\beta}}$ and compute $\operatorname{std}(\hat{\boldsymbol{\beta}})$ across the bootstrapped samples. A sketch is given below.
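A sketch of Example 1.1 in Python, assuming Rademacher weights $v_j$ (Davidson and Flachaire) and the identity choice $f(\hat{u}) = \hat{u}$; the DGP with additive cluster effects is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n_j, B = 20, 30, 500                     # clusters, cluster size, replications
n = m * n_j
cluster = np.repeat(np.arange(m), n_j)
X = np.column_stack([np.ones(n), rng.normal(size=n)])
alpha = rng.normal(size=m)                  # cluster effects induce within-cluster correlation
y = X @ np.array([1.0, 2.0]) + alpha[cluster] + rng.normal(size=n)

ols = lambda X, y: np.linalg.lstsq(X, y, rcond=None)[0]

# Step 1: OLS estimates and residuals
beta_hat = ols(X, y)
u_hat = y - X @ beta_hat

# Steps 2-3: rebuild y* with one Rademacher draw per cluster, re-estimate
beta_draws = np.empty((B, X.shape[1]))
for b in range(B):
    v = rng.choice([-1.0, 1.0], size=m)     # v_j, shared within each cluster
    y_star = X @ beta_hat + u_hat * v[cluster]
    beta_draws[b] = ols(X, y_star)

print(beta_draws.std(axis=0, ddof=1))       # cluster-robust bootstrap std
```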
Block Bootstrap
1. Panel data
2. Time series: weak dependence is necessary for the bootstrap.
Non-Overlapping Block Bootstrapping (NBB)
Moving Block Bootstrapping (MBB)
2 Sampling Methods
Suppose that we can observe a person's action. The utility function of taking the action is written as follows:
$$U = h(x, \varepsilon) = \beta'x + \varepsilon$$
where $x$ is observable to both the individual and researchers, while $\varepsilon$ is observed by the individual but not by researchers. The probability of taking the action is
$$P(y \mid x) = \int I\left[h(x, \varepsilon) = y\right] f(\varepsilon)\, d\varepsilon$$
1. Complete closed form

Suppose that the person takes the action if she can gain positive utility. Then
$$P(y \mid x) = \int I\left[\varepsilon > -\beta'x\right] f(\varepsilon)\, d\varepsilon = 1 - F(-\beta'x) = \frac{e^{\beta'x}}{1 + e^{\beta'x}} \quad \text{if } \varepsilon \sim \text{logistic}$$
A caveat: researchers may end up thinking about their models in purely mathematical terms instead of in economic terms that attempt to represent reality.
2. Complete simulation

Key: any integral over a density is a kind of averaging, so we can approximate the true average with a simulated average. Finding a proper simulator is important, as some simulators can be problematic. A minimal sketch follows.
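A minimal frequency simulator in Python, assuming $\varepsilon \sim$ logistic so the simulated probability can be checked against the closed form above:

```python
import numpy as np

rng = np.random.default_rng(0)
beta_x = 0.5                         # value of beta'x (illustrative)
S = 100_000

eps = rng.logistic(size=S)           # draws from f(eps)
p_sim = np.mean(beta_x + eps > 0)    # simulated average of the indicator
p_true = np.exp(beta_x) / (1 + np.exp(beta_x))
print(p_sim, p_true)                 # the two should be close for large S
```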
3. Partial closed form / partial simulation

Try to do as much as we can analytically. Suppose that the error term is $\varepsilon = (\varepsilon_1, \varepsilon_2)$. We can decompose the error terms into two parts: one part we can integrate over analytically and the other part we cannot. We have $f(\varepsilon_1, \varepsilon_2) = f(\varepsilon_2 \mid \varepsilon_1)f(\varepsilon_1)$. Then the probability of taking the action is:
$$P(y \mid x) = \int_{\varepsilon_1} \left\{ \int_{\varepsilon_2} I\left[h(x, \varepsilon_1, \varepsilon_2) = y\right] f(\varepsilon_2 \mid \varepsilon_1)\, d\varepsilon_2 \right\} f(\varepsilon_1)\, d\varepsilon_1$$
Suppose that
$$g(\varepsilon_1) = \int_{\varepsilon_2} I\left[h(x, \varepsilon_1, \varepsilon_2) = y\right] f(\varepsilon_2 \mid \varepsilon_1)\, d\varepsilon_2$$
has a closed form (but is dependent on $\varepsilon_1$). Then we can simply take the average of $g(\varepsilon_1)$ over the distribution of the errors that we cannot integrate analytically:
$$P(y \mid x) = \int_{\varepsilon_1} g(\varepsilon_1) f(\varepsilon_1)\, d\varepsilon_1$$
2.1 Ways to Draw from a Density
1. Standard normal / uniform. Use established functions.
2. Transformation of normal
$$\varepsilon \sim N(b, s^2) \implies \varepsilon = b + s\mu, \quad \mu \sim N(0, 1)$$
$$\varepsilon \sim LN(b, s^2) \implies \varepsilon = \exp(b + s\mu), \quad \mu \sim N(0, 1)$$
3. Inverse cumulative. (This only works for univariate distributions!) For any probability density function $f(\varepsilon)$, we can find its cumulative distribution function (CDF) $F(\varepsilon)$, which is invertible when $F$ is strictly increasing. So we can uniformly sample $\mu \sim U[0, 1]$ and get $\varepsilon = F^{-1}(\mu)$. This works well for CDFs whose inverse functions are easy to find, e.g., the extreme value distribution:
$$f(\varepsilon) = e^{-\varepsilon} e^{-e^{-\varepsilon}}, \qquad F(\varepsilon) = e^{-e^{-\varepsilon}}$$
Draw $\mu$ and we have $e^{-e^{-\varepsilon}} = \mu \implies \varepsilon = -\log(-\log(\mu))$. A sketch is given below.
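A sketch of inverse-CDF sampling from this extreme value distribution (numpy; the sanity check against the Euler–Mascheroni constant is an added aside):

```python
import numpy as np

rng = np.random.default_rng(0)
mu = rng.uniform(size=100_000)          # draw mu ~ U[0, 1]
eps = -np.log(-np.log(mu))              # eps = F^{-1}(mu) = -log(-log(mu))
# Sanity check: the mean of this distribution is the Euler-Mascheroni constant (~0.5772)
print(eps.mean())
```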
4. Truncated univariate. Suppose we only want to draw from a density over $[a, b]$. Then we (see the sketch below):
1. Draw $\mu \sim U[0, 1]$
2. Compute $\bar{\mu} = (1 - \mu)F(a) + \mu F(b)$
3. Compute $\varepsilon = F^{-1}(\bar{\mu})$
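A sketch of the three steps for a standard normal truncated to $[a, b]$; using scipy.stats.norm for $F$ and $F^{-1}$ is an illustrative choice (any CDF/inverse-CDF pair works):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
a, b = 0.5, 2.0
mu = rng.uniform(size=10_000)                       # step 1: draw mu
mu_bar = (1 - mu) * norm.cdf(a) + mu * norm.cdf(b)  # step 2: map into [F(a), F(b)]
eps = norm.ppf(mu_bar)                              # step 3: invert; all draws land in [a, b]
print(eps.min(), eps.max())
```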
5. Cholesky for multivariate normal. The Cholesky factor is a lower triangular $k \times k$ matrix $L$ (a generalized standard deviation) such that $LL' = \Omega$. Suppose that we want $\boldsymbol{\varepsilon} \sim N(b, \Omega)$, a $k \times 1$ vector. We can (see the sketch below):
1. Draw $k$ values from the standard normal distribution: $\boldsymbol{\eta} = (\eta_1, \eta_2, \ldots, \eta_k)'$
2. Calculate $\boldsymbol{\varepsilon} = b + L\boldsymbol{\eta}$. The variance is $\operatorname{Var}(\boldsymbol{\varepsilon}) = \operatorname{Var}(L\boldsymbol{\eta}) = L\,E(\boldsymbol{\eta}\boldsymbol{\eta}')\,L' = LL' = \Omega$
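A sketch of the two Cholesky steps in Python (the mean vector and covariance matrix are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
b = np.array([1.0, -1.0])
Omega = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
L = np.linalg.cholesky(Omega)           # lower triangular with L @ L.T == Omega

eta = rng.normal(size=(100_000, 2))     # step 1: k standard normal draws per sample
eps = b + eta @ L.T                     # step 2: eps = b + L @ eta, vectorized over rows
print(np.cov(eps, rowvar=False))        # sample covariance should be close to Omega
```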
6. Accept-reject for multivariate truncated. Suppose that we want to draw a $k$-dimensional vector from $f(\boldsymbol{\varepsilon})$ such that $a \leq \boldsymbol{\varepsilon} \leq b$. We can draw from the untruncated distribution and only keep those draws that fall into $[a, b]$. Issues: small ranges / high dimensions lead to low acceptance rates. A sketch is given below.
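A minimal accept-reject sketch for a truncated bivariate normal (illustrative $\Omega$, $a$, $b$); the printed acceptance rate shows why small ranges or high dimensions are problematic:

```python
import numpy as np

rng = np.random.default_rng(0)
Omega = np.array([[1.0, 0.3], [0.3, 1.0]])
L = np.linalg.cholesky(Omega)
a, b = np.array([-1.0, -1.0]), np.array([1.0, 1.0])

draws = rng.normal(size=(50_000, 2)) @ L.T          # untruncated N(0, Omega) draws
keep = draws[np.all((draws >= a) & (draws <= b), axis=1)]
print(len(keep) / len(draws))                       # acceptance rate
```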
7. Importance sampling. Suppose we want to draw from $f(\varepsilon)$ but don't know how, while we do know how to draw from another probability density function $g(\cdot)$. We can (see the sketch below):
1. draw from $g$
2. weight each draw by $\frac{f(\varepsilon)}{g(\varepsilon)}$
The set of weighted draws is equivalent to draws from $f(\varepsilon)$:
$$\bar{t} = \int t(\varepsilon) f(\varepsilon)\, d\varepsilon = \int t(\varepsilon) \frac{f(\varepsilon)}{g(\varepsilon)} g(\varepsilon)\, d\varepsilon$$
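An importance-sampling sketch, assuming target $f = N(0,1)$, proposal $g = N(0, 2^2)$, and $t(\varepsilon) = \varepsilon^2$ so that $\bar{t} = 1$ can be verified:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
S = 100_000
eps = rng.normal(scale=2.0, size=S)              # draws from the proposal g
w = norm.pdf(eps) / norm.pdf(eps, scale=2.0)     # weights f(eps) / g(eps)
t_bar = np.mean(eps**2 * w)                      # estimates E_f[eps^2] = 1
print(t_bar)
```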
8. Gibbs sampling. Use the conditional densities to sample iteratively (see the sketch below):
1. Start with any $\varepsilon_1^0$
2. Draw $\varepsilon_2^0 \sim f(\varepsilon_2 \mid \varepsilon_1^0)$
3. Draw $\varepsilon_1^1 \sim f(\varepsilon_1 \mid \varepsilon_2^0)$
4. ...
There is no good rule for identifying convergence. Normally, researchers start sampling after thousands of iterations (the burn-in period).
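A Gibbs sketch for a bivariate normal with correlation $\rho$, where both conditionals are known in closed form (an illustrative target, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)
rho, n_iter, burn_in = 0.8, 20_000, 2_000
e1, e2 = 0.0, 0.0                                    # arbitrary starting point
draws = []
for t in range(n_iter):
    e2 = rng.normal(rho * e1, np.sqrt(1 - rho**2))   # draw eps_2 | eps_1
    e1 = rng.normal(rho * e2, np.sqrt(1 - rho**2))   # draw eps_1 | eps_2
    if t >= burn_in:                                 # discard the burn-in iterations
        draws.append((e1, e2))
print(np.corrcoef(np.array(draws).T)[0, 1])          # should be close to rho
```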
9. Metropolis–Hastings.
(a) Initialize $X_1 = x_1$.
(b) For $t = 1, 2, \ldots$
   i. Sample $y$ from $q(y \mid X_t)$. Think of $y$ as a "proposed" value for $X_{t+1}$.
   ii. Compute
$$A = \min\left\{1, \frac{\pi(y)\,q(X_t \mid y)}{\pi(X_t)\,q(y \mid X_t)}\right\}.$$
$A$ is often called the "acceptance probability".
   iii. With probability $A$, "accept" the proposed value and set $X_{t+1} = y$. Otherwise, set $X_{t+1} = X_t$.
An example: random walk Metropolis–Hastings (see the sketch below):
1. Start at any $\varepsilon_0$
2. Create a "trial" value for $\varepsilon_1$: $\tilde{\varepsilon}_1 = \varepsilon_0 + \eta$ where $\eta \sim \text{Unif}(-\delta, \delta)$
3. Compare $f(\tilde{\varepsilon}_1)$ and $f(\varepsilon_0)$:
(1) If $f(\tilde{\varepsilon}_1) > f(\varepsilon_0)$, accept: $\varepsilon_1 = \tilde{\varepsilon}_1$
(2) Otherwise, accept $\tilde{\varepsilon}_1$ with probability $\frac{f(\tilde{\varepsilon}_1)}{f(\varepsilon_0)}$
(3) If rejected, set $\varepsilon_1 = \varepsilon_0$
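A random-walk Metropolis–Hastings sketch targeting a standard normal density (illustrative target; with a symmetric uniform proposal, the $q$ terms cancel in the acceptance probability):

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda e: np.exp(-0.5 * e**2)        # target density up to a constant
delta, n_iter, burn_in = 1.0, 20_000, 2_000
eps = 0.0
chain = []
for t in range(n_iter):
    trial = eps + rng.uniform(-delta, delta)
    # accept with probability min(1, f(trial)/f(eps))
    if rng.uniform() < f(trial) / f(eps):
        eps = trial
    if t >= burn_in:
        chain.append(eps)
print(np.mean(chain), np.std(chain))     # ~0 and ~1 for the standard normal
```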
3 Simulation-Based Estimation
3.1 Maximum Simulated Likelihood Estimation (MSL)
If the likelihood function is intractable, MSL can help estimate the parameters. Denote $y_i$ as income, $X_i$ as characteristics that are observable to researchers, and $u_i$ as unobserved heterogeneity (whose distribution is known). First we need to integrate $u_i$ out of the original density function $h(y_i \mid X_i, u_i; \boldsymbol{\Theta})$, as $u_i$ is unobservable. Thus, we have
$$f(y_i \mid X_i; \boldsymbol{\Theta}) = \int h(y_i \mid X_i, u_i; \boldsymbol{\Theta})\, g(u_i)\, du_i, \qquad \hat{\boldsymbol{\Theta}}_{MSL} \in \operatorname*{argmax}_{\boldsymbol{\Theta}} \prod_{i=1}^{n} \hat{f}(y_i \mid X_i; \boldsymbol{\Theta})$$
where $f(\cdot)$ is a density function and $g(\cdot)$ is the pdf of $u_i$. Usually, it is very difficult to obtain a closed form for the above formula, so we simulate $f(\cdot)$:
$$\hat{f}(y_i \mid X_i; \boldsymbol{\Theta}) = \frac{1}{S} \sum_{j=1}^{S} h(y_i \mid X_i; \boldsymbol{\Theta}, u_{ij}), \qquad u_{ij} \sim g(u_i)$$
If $g(u_i)$ is complicated, we can use importance sampling with a simpler distribution $p(u_i)$:
$$\hat{f}(y_i \mid X_i; \boldsymbol{\Theta}) = \frac{1}{S} \sum_{j=1}^{S} \frac{h(y_i \mid X_i; \boldsymbol{\Theta}, u_{ij})\, g(u_{ij})}{p(u_{ij})}, \qquad u_{ij} \sim p(u_i)$$
Properties of MSL
1. $E(\hat{f}) = f$ and $\hat{f} \xrightarrow{p} f$
2. If (1) the MLE is consistent and asymptotically normal and (2) $E(\hat{f}) = f$, then MSL is asymptotically equivalent to MLE if $S \to \infty$, $N \to \infty$, and $\sqrt{N}/S \to 0$.
Example 3.1. Logit model with random coefficients. Suppose that $y_i$ is a binary variable and $X_i$ are observable covariates. Then we have
$$P(y_i = 1 \mid X_i; \beta_i) = \frac{e^{X_i'\beta_i}}{1 + e^{X_i'\beta_i}}$$
Suppose that $\beta_i = \beta + \omega_i$ where $\omega_i \sim N(0, \sigma_\omega^2)$. We can simulate the "observed" density $f$ as follows:
1. Draw $\omega_{i1}, \omega_{i2}, \ldots, \omega_{iS} \sim N(0, \sigma_\omega^2)$ for $i = 1, 2, \ldots, n$
2. Simulate $f$ for $i = 1, 2, \ldots, n$:
$$\hat{f}(y_i \mid X_i; \beta, \sigma_\omega^2) = \frac{1}{S} \sum_{j=1}^{S} h(y_i \mid X_i; \beta, \omega_{ij}) = \frac{1}{S} \sum_{j=1}^{S} \left(\frac{e^{X_i'(\beta + \omega_{ij})}}{1 + e^{X_i'(\beta + \omega_{ij})}}\right)^{y_i} \left(\frac{1}{1 + e^{X_i'(\beta + \omega_{ij})}}\right)^{1 - y_i}$$
3. The optimal parameters are (a sketch of all three steps follows):
$$(\hat{\beta}, \hat{\sigma}_\omega^2) = \operatorname*{argmax}_{\beta,\, \sigma_\omega^2} \prod_{i=1}^{n} \hat{f}(y_i \mid X_i; \beta, \sigma_\omega^2)$$
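A sketch of the three MSL steps for Example 3.1, assuming scalar $X_i$ and scipy's Nelder–Mead optimizer; the draws $\omega_{ij}$ are held fixed across likelihood evaluations (common random numbers), and $\sigma_\omega$ is parameterized on the log scale to stay positive:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, S = 1_000, 200

# Illustrative DGP: scalar covariate, beta = 1.0, sigma_omega = 0.5
X = rng.normal(size=n)
beta_i = 1.0 + 0.5 * rng.normal(size=n)
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-X * beta_i))).astype(float)

# Step 1: draw omega_i1, ..., omega_iS once and reuse them across evaluations
omega = rng.normal(size=(n, S))

def neg_sim_loglik(theta):
    beta, log_sigma = theta
    # Step 2: simulate f_hat(y_i | X_i; beta, sigma^2) for every i
    p = 1 / (1 + np.exp(-X[:, None] * (beta + np.exp(log_sigma) * omega)))  # n x S
    f_hat = np.mean(p ** y[:, None] * (1 - p) ** (1 - y[:, None]), axis=1)
    return -np.sum(np.log(f_hat))

# Step 3: maximize the simulated likelihood (minimize its negative)
res = minimize(neg_sim_loglik, x0=[0.0, np.log(0.3)], method="Nelder-Mead")
print(res.x[0], np.exp(res.x[1]))  # estimates of beta and sigma_omega
```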
3.2 Simulated Method of Moments Estimation (SMM)
We have the GMM estimator:
$$\hat{\boldsymbol{\theta}}_{GMM} = \operatorname*{argmin}_{\boldsymbol{\theta} \in \boldsymbol{\Theta}} \left(\frac{1}{n}\sum_{i=1}^{n} m(y_i, X_i; \boldsymbol{\theta})\right)' W \left(\frac{1}{n}\sum_{i=1}^{n} m(y_i, X_i; \boldsymbol{\theta})\right)$$
We need to integrate out the unobservable variables:
$$m(y_i, X_i; \boldsymbol{\theta}) = \int h(y_i, X_i \mid u_i; \boldsymbol{\theta})\, f_u(u)\, du$$
We can use simulation to approximate the above formula when the integral is intractable:
1. Draw $\{u_1, u_2, \ldots, u_S\}$ from $f_u$
2. Replace $m(\cdot)$ by $\hat{m}(\cdot) = \frac{1}{S} \sum_{s=1}^{S} h(y_i, X_i \mid u_s; \boldsymbol{\theta})$
3. Compute the optimal parameters (a sketch is given below):
$$\hat{\boldsymbol{\theta}}_{SMM} = \operatorname*{argmin}_{\boldsymbol{\theta} \in \boldsymbol{\Theta}} \left(\frac{1}{n}\sum_{i=1}^{n} \hat{m}(y_i, X_i; \boldsymbol{\theta})\right)' W \left(\frac{1}{n}\sum_{i=1}^{n} \hat{m}(y_i, X_i; \boldsymbol{\theta})\right)$$
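A toy SMM sketch matching the first two moments of $y$ with $W = I$; the location-scale model $y_s(\theta) = \theta_1 + \theta_2 u_s$ and the DGP are illustrative assumptions:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
y = rng.normal(2.0, 1.5, size=5_000)       # observed data (illustrative DGP)
u = rng.normal(size=100_000)               # step 1: fixed draws from f_u

def smm_objective(theta):
    y_sim = theta[0] + theta[1] * u        # step 2: simulate data given theta
    m_hat = np.array([y.mean() - y_sim.mean(),
                      (y**2).mean() - (y_sim**2).mean()])
    return m_hat @ m_hat                   # step 3: quadratic form with W = I

res = minimize(smm_objective, x0=[0.0, 1.0], method="Nelder-Mead")
print(res.x)                               # should be close to (2.0, 1.5)
```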
Properties of SMM
$$\sqrt{n}\,(\hat{\boldsymbol{\theta}}_{GMM} - \boldsymbol{\theta}_0) \xrightarrow{d} N(0, A), \qquad \sqrt{n}\,(\hat{\boldsymbol{\theta}}_{SMM} - \boldsymbol{\theta}_0) \xrightarrow{d} N(0, B)$$
where
$$A = \left(G' I_0^{-1} G\right)^{-1}, \qquad B = \left(G' \tilde{I}_0^{-1} G\right)^{-1}$$
$$G = E\left[\frac{\partial m(y_i, X_i; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}'}\bigg|_{\boldsymbol{\theta}=\boldsymbol{\theta}_0}\right], \qquad I_0 = \operatorname{Var}\left[m(y_i, X_i; \boldsymbol{\theta})\right], \qquad \tilde{I}_0 = \operatorname{Var}\left[\hat{m}(y_i, X_i; \boldsymbol{\theta})\right]$$
If $S \to \infty$, we have $\tilde{I}_0 \to I_0$.