Simulation-Based Methods
Zijing Hu
October 15, 2022
Contents

1 Bootstrap
2 Sampling Methods
  2.1 Ways to Draw from a Density
3 Simulation-Based Estimation
  3.1 Maximum Simulated Likelihood Estimation (MSL)
  3.2 Simulated Method of Moments Estimation (SMM)
*This note is based on
ECMT 677: Applied Microeconometrics by Dr. Yonghong An and Dr. Jackson Bunting, TAMU
Discrete Choice Methods with Simulation by Kenneth Train
1 Bootstrap
Asymptotics vs. Bootstrap
1. Heteroskedasticity: robust std performs badly for small sample sizes.
2. Cluster: Athey et al. (2017, 2022)
Bootstrapping (Efron 1979) assigns measures of accuracy (bias, variance, confidence intervals,
prediction error, etc.) to sample estimates.
$$\operatorname{std}(\hat{\boldsymbol{\beta}}) = \sqrt{\frac{1}{B-1}\sum_{j=1}^{B}\left(\tilde{\boldsymbol{\beta}}_j - \bar{\tilde{\boldsymbol{\beta}}}\right)^2}$$
How do we choose a sufficient number of replications $B$? $B$ determines the accuracy of $\operatorname{std}(\hat{\boldsymbol{\beta}})$ and $N$ determines $\hat{\boldsymbol{\beta}}$. $B = 500$ is usually enough and $B = 1000$ is sufficient.
Bootstrap Methods in Linear Regression
Pair bootstrap. For data $(X_i, y_i) \sim F(X, y)$, draw bootstrap samples $(X_j^*, y_j^*)$ for $j = 1, 2, \ldots, B$ and compute each $\hat{\boldsymbol{\beta}}_j$. Drawback: no restriction such as $E[\varepsilon_j^* \mid X_j^*] = E[y_j^* - X_j^{*\prime}\boldsymbol{\beta} \mid X_j^*] = 0$ is imposed. A minimal sketch follows.
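A minimal pair-bootstrap sketch in Python (numpy only; the data-generating process is an illustrative assumption, not part of the notes):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative DGP (an assumption): y = X @ beta + eps
n, k, B = 200, 2, 500
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
beta_true = np.array([1.0, 2.0])
y = X @ beta_true + rng.normal(size=n)

def ols(X, y):
    # OLS coefficients via least squares
    return np.linalg.lstsq(X, y, rcond=None)[0]

beta_hat = ols(X, y)

# Pair bootstrap: resample (X_i, y_i) rows with replacement, re-estimate B times
beta_draws = np.empty((B, k))
for b in range(B):
    idx = rng.integers(0, n, size=n)
    beta_draws[b] = ols(X[idx], y[idx])

# std across replications; ddof=1 gives the 1/(B-1) factor from the formula above
print(beta_hat, beta_draws.std(axis=0, ddof=1))
```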
Residual bootstrap. Compute $\hat{u}_i = y_i - X_i'\hat{\boldsymbol{\beta}}$ and sample $(X_i, y_i^*)$ where $y_i^* = X_i'\hat{\boldsymbol{\beta}} + \hat{u}_i^*$ and $\hat{u}_i^* \in \{\hat{u}_1, \hat{u}_2, \ldots, \hat{u}_n\}$. Drawback: homoskedasticity is imposed, i.e., $\operatorname{Var}(\hat{u}_i \mid X_i) = \sigma^2$.
Wild bootstrap. First, compute $\hat{\boldsymbol{\beta}}$ and $\hat{u}_i$ for $i = 1, 2, \ldots, n$. Then,
$$y_i^* = X_i'\hat{\boldsymbol{\beta}} + f(\hat{u}_i)\,v_i$$
where $f(\cdot)$ is a function of $\hat{u}_i$ and $v_i$ has zero mean and $\hat{u}_i \perp v_i$.
Common choices of $f(\cdot)$: $f(\hat{u}_i) = \sqrt{\frac{n}{n-k}}\,\hat{u}_i$, $f(\hat{u}_i) = \hat{u}_i/\sqrt{1-h_i}$, or $f(\hat{u}_i) = \hat{u}_i/(1-h_i)$, where $n$ is the sample size, $k$ is the number of regressors, and $h_i$ is the $i$-th diagonal element of the matrix $P \equiv X(X'X)^{-1}X'$. Normally, researchers use $f(\cdot)$ to capture heteroskedasticity and $v_i$ to capture intra-cluster correlations. (The last two choices are better as they incorporate heteroskedasticity.)
Two requirements of $v_i$: $E[v_i] = 0$, and the moments of $f(\hat{u}_i)v_i$ should match the moments of $f(\hat{u}_i)$.
Two choices of $v_i$ (the second is better than the first, while both perform well as $n \to +\infty$):

Mammen (1993)
$$v_i = \begin{cases} -\dfrac{\sqrt{5}-1}{2} & \text{with probability } \dfrac{\sqrt{5}+1}{2\sqrt{5}} \\[2ex] \dfrac{\sqrt{5}+1}{2} & \text{with probability } \dfrac{\sqrt{5}-1}{2\sqrt{5}} \end{cases}$$
where $E[v_i] = 0$, $E[v_i^2] = 1$, $E[v_i^3] = 1$, $E[v_i^4] = 2$.
Davidson and Flachaire (2008)
$$v_i = \begin{cases} 1 & \text{with probability } \dfrac{1}{2} \\[1ex] -1 & \text{with probability } \dfrac{1}{2} \end{cases}$$
where $E[v_i] = 0$, $E[v_i^2] = 1$, $E[v_i^3] = 0$, $E[v_i^4] = 1$.
Example 1.1. Wild Bootstrap. Suppose that $y_{ji} = X_{ji}'\boldsymbol{\beta} + \varepsilon_{ji}$, where $j = 1, 2, \ldots, m$ indexes clusters and $i = 1, 2, \ldots, n_j$ indexes individuals. Now we show an example of bootstrapping the cluster-robust std:

1. Run OLS to get $\hat{\boldsymbol{\beta}}$ and $\hat{u}_{ji}$.
2. Wild bootstrap $(y_{ji}^*, X_{ji})$:
$$y_{ji}^* = X_{ji}'\hat{\boldsymbol{\beta}} + f(\hat{u}_{ji})\,v_j$$
where $v_j$ is shared by all the $n_j$ individuals in cluster $j$.
3. For each bootstrapped sample, estimate $\tilde{\boldsymbol{\beta}}$ and compute $\operatorname{std}(\hat{\boldsymbol{\beta}})$ across the bootstrapped samples. A sketch is given below.
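A sketch of Example 1.1 in Python, assuming Rademacher weights $v_j$ (Davidson and Flachaire) and the identity choice $f(\hat{u}) = \hat{u}$; the DGP with additive cluster effects is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n_j, B = 20, 30, 500                     # clusters, cluster size, replications
n = m * n_j
cluster = np.repeat(np.arange(m), n_j)
X = np.column_stack([np.ones(n), rng.normal(size=n)])
alpha = rng.normal(size=m)                  # cluster effects induce within-cluster correlation
y = X @ np.array([1.0, 2.0]) + alpha[cluster] + rng.normal(size=n)

ols = lambda X, y: np.linalg.lstsq(X, y, rcond=None)[0]

# Step 1: OLS estimates and residuals
beta_hat = ols(X, y)
u_hat = y - X @ beta_hat

# Steps 2-3: rebuild y* with one Rademacher draw per cluster, re-estimate
beta_draws = np.empty((B, X.shape[1]))
for b in range(B):
    v = rng.choice([-1.0, 1.0], size=m)     # v_j, shared within each cluster
    y_star = X @ beta_hat + u_hat * v[cluster]
    beta_draws[b] = ols(X, y_star)

print(beta_draws.std(axis=0, ddof=1))       # cluster-robust bootstrap std
```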
Block Bootstrap
1. Panel data
2. Time series: weak dependence is necessary for the bootstrap.
Non-Overlapping Block Bootstrapping (NBB)
Moving Block Bootstrapping (MBB)
2 Sampling Methods
Suppose that we can observe a person's action. The utility function of taking the action is written as follows:
$$U = h(x, \varepsilon) = \beta'x + \varepsilon$$
where $x$ is observable to both the individual and researchers, while $\varepsilon$ is observed by the individual but not by researchers. The probability of taking the action is
$$P(y \mid x) = \int I\left[h(x, \varepsilon) = y\right] f(\varepsilon)\, d\varepsilon$$
1. Complete closed form

Suppose that the person takes the action if she can gain positive utility. Then
$$P(y \mid x) = \int I\left[\varepsilon > -\beta'x\right] f(\varepsilon)\, d\varepsilon = 1 - F(-\beta'x) = \frac{e^{\beta'x}}{1 + e^{\beta'x}} \quad \text{if } \varepsilon \sim \text{logistic}$$
A caveat: researchers may end up thinking about their models in purely mathematical terms instead of in economic terms that attempt to represent reality.
2. Complete simulation

Key: any integral over a density is a kind of averaging, so we can approximate the true average with a simulated average. Finding a proper simulator is important, as some simulators can be problematic. A minimal sketch follows.
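A minimal frequency simulator in Python, assuming $\varepsilon \sim$ logistic so the simulated probability can be checked against the closed form above:

```python
import numpy as np

rng = np.random.default_rng(0)
beta_x = 0.5                         # value of beta'x (illustrative)
S = 100_000

eps = rng.logistic(size=S)           # draws from f(eps)
p_sim = np.mean(beta_x + eps > 0)    # simulated average of the indicator
p_true = np.exp(beta_x) / (1 + np.exp(beta_x))
print(p_sim, p_true)                 # the two should be close for large S
```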
3. Partial closed form / partial simulation

Try to do as much as we can analytically. Suppose that the error term is $\varepsilon = (\varepsilon_1, \varepsilon_2)$. We can decompose the error terms into two parts: one part we can integrate over analytically and the other part we cannot. We have $f(\varepsilon_1, \varepsilon_2) = f(\varepsilon_2 \mid \varepsilon_1)f(\varepsilon_1)$. Then the probability of taking the action is:
$$P(y \mid x) = \int_{\varepsilon_1} \left\{ \int_{\varepsilon_2} I\left[h(x, \varepsilon_1, \varepsilon_2) = y\right] f(\varepsilon_2 \mid \varepsilon_1)\, d\varepsilon_2 \right\} f(\varepsilon_1)\, d\varepsilon_1$$
Suppose that
$$g(\varepsilon_1) = \int_{\varepsilon_2} I\left[h(x, \varepsilon_1, \varepsilon_2) = y\right] f(\varepsilon_2 \mid \varepsilon_1)\, d\varepsilon_2$$
has a closed form (but is dependent on $\varepsilon_1$). Then we can simply take the average of $g(\varepsilon_1)$ over the distribution of the errors that we cannot integrate analytically:
$$P(y \mid x) = \int_{\varepsilon_1} g(\varepsilon_1) f(\varepsilon_1)\, d\varepsilon_1$$
2.1 Ways to Draw from a Density
1. Standard normal / uniform. Use established functions.
2. Transformation of normal
$$\varepsilon \sim N(b, s^2) \implies \varepsilon = b + s\mu, \quad \mu \sim N(0, 1)$$
$$\varepsilon \sim LN(b, s^2) \implies \varepsilon = \exp(b + s\mu), \quad \mu \sim N(0, 1)$$
3. Inverse cumulative. (This only works for univariate distributions!) For any probability density function $f(\varepsilon)$, we can find its cumulative distribution function (CDF) $F(\varepsilon)$, which is invertible when $F$ is strictly increasing. So we can uniformly sample $\mu \sim U[0, 1]$ and get $\varepsilon = F^{-1}(\mu)$. This works well for CDFs whose inverse functions are easy to find, e.g., the extreme value distribution:
$$f(\varepsilon) = e^{-\varepsilon} e^{-e^{-\varepsilon}}, \qquad F(\varepsilon) = e^{-e^{-\varepsilon}}$$
Draw $\mu$ and we have $e^{-e^{-\varepsilon}} = \mu \implies \varepsilon = -\log(-\log(\mu))$. A sketch is given below.
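A sketch of inverse-CDF sampling from this extreme value distribution (numpy; the sanity check against the Euler–Mascheroni constant is an added aside):

```python
import numpy as np

rng = np.random.default_rng(0)
mu = rng.uniform(size=100_000)          # draw mu ~ U[0, 1]
eps = -np.log(-np.log(mu))              # eps = F^{-1}(mu) = -log(-log(mu))
# Sanity check: the mean of this distribution is the Euler-Mascheroni constant (~0.5772)
print(eps.mean())
```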
4. Truncated univariate. Suppose we only want to draw from a density over $[a, b]$. Then we (see the sketch below):
1. Draw $\mu \sim U[0, 1]$
2. Compute $\bar{\mu} = (1 - \mu)F(a) + \mu F(b)$
3. Compute $\varepsilon = F^{-1}(\bar{\mu})$
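A sketch of the three steps for a standard normal truncated to $[a, b]$; using scipy.stats.norm for $F$ and $F^{-1}$ is an illustrative choice (any CDF/inverse-CDF pair works):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
a, b = 0.5, 2.0
mu = rng.uniform(size=10_000)                       # step 1: draw mu
mu_bar = (1 - mu) * norm.cdf(a) + mu * norm.cdf(b)  # step 2: map into [F(a), F(b)]
eps = norm.ppf(mu_bar)                              # step 3: invert; all draws land in [a, b]
print(eps.min(), eps.max())
```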
5. Cholesky for multivariate normal. The Cholesky factor is a lower triangular $k \times k$ matrix $L$ (a generalized standard deviation) such that $LL' = \Omega$. Suppose that we want $\boldsymbol{\varepsilon} \sim N(b, \Omega)$, a $k \times 1$ vector. We can (see the sketch below):
1. Draw $k$ values from the standard normal distribution: $\boldsymbol{\eta} = (\eta_1, \eta_2, \ldots, \eta_k)'$
2. Calculate $\boldsymbol{\varepsilon} = b + L\boldsymbol{\eta}$. The variance is $\operatorname{Var}(\boldsymbol{\varepsilon}) = \operatorname{Var}(L\boldsymbol{\eta}) = L\,E(\boldsymbol{\eta}\boldsymbol{\eta}')\,L' = LL' = \Omega$
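A sketch of the two Cholesky steps in Python (the mean vector and covariance matrix are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
b = np.array([1.0, -1.0])
Omega = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
L = np.linalg.cholesky(Omega)           # lower triangular with L @ L.T == Omega

eta = rng.normal(size=(100_000, 2))     # step 1: k standard normal draws per sample
eps = b + eta @ L.T                     # step 2: eps = b + L @ eta, vectorized over rows
print(np.cov(eps, rowvar=False))        # sample covariance should be close to Omega
```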
6. Accept-reject for multivariate truncated. Suppose that we want to draw a $k$-dimensional vector from $f(\boldsymbol{\varepsilon})$ such that $a \leq \boldsymbol{\varepsilon} \leq b$. We can draw from the untruncated distribution and only keep those draws that fall into $[a, b]$. Issues: small ranges / high dimensions lead to low acceptance rates. A sketch is given below.
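A minimal accept-reject sketch for a truncated bivariate normal (illustrative $\Omega$, $a$, $b$); the printed acceptance rate shows why small ranges or high dimensions are problematic:

```python
import numpy as np

rng = np.random.default_rng(0)
Omega = np.array([[1.0, 0.3], [0.3, 1.0]])
L = np.linalg.cholesky(Omega)
a, b = np.array([-1.0, -1.0]), np.array([1.0, 1.0])

draws = rng.normal(size=(50_000, 2)) @ L.T          # untruncated N(0, Omega) draws
keep = draws[np.all((draws >= a) & (draws <= b), axis=1)]
print(len(keep) / len(draws))                       # acceptance rate
```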
7. Importance sampling. Suppose we want to draw from $f(\varepsilon)$ but don't know how, while we do know how to draw from another probability density function $g(\cdot)$. We can (see the sketch below):
1. draw from $g$
2. weight each draw by $\frac{f(\varepsilon)}{g(\varepsilon)}$
The set of weighted draws is equivalent to draws from $f(\varepsilon)$:
$$\bar{t} = \int t(\varepsilon) f(\varepsilon)\, d\varepsilon = \int t(\varepsilon) \frac{f(\varepsilon)}{g(\varepsilon)} g(\varepsilon)\, d\varepsilon$$
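An importance-sampling sketch, assuming target $f = N(0,1)$, proposal $g = N(0, 2^2)$, and $t(\varepsilon) = \varepsilon^2$ so that $\bar{t} = 1$ can be verified:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
S = 100_000
eps = rng.normal(scale=2.0, size=S)              # draws from the proposal g
w = norm.pdf(eps) / norm.pdf(eps, scale=2.0)     # weights f(eps) / g(eps)
t_bar = np.mean(eps**2 * w)                      # estimates E_f[eps^2] = 1
print(t_bar)
```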
8. Gibbs sampling. Use the conditional densities to sample iteratively (see the sketch below):
1. Start with any $\varepsilon_1^0$
2. Draw $\varepsilon_2^0 \sim f(\varepsilon_2 \mid \varepsilon_1^0)$
3. Draw $\varepsilon_1^1 \sim f(\varepsilon_1 \mid \varepsilon_2^0)$
4. ...
There is no good rule for identifying convergence. Normally, researchers start sampling after thousands of iterations (the burn-in period).
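A Gibbs sketch for a bivariate normal with correlation $\rho$, where both conditionals are known in closed form (an illustrative target, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)
rho, n_iter, burn_in = 0.8, 20_000, 2_000
e1, e2 = 0.0, 0.0                                    # arbitrary starting point
draws = []
for t in range(n_iter):
    e2 = rng.normal(rho * e1, np.sqrt(1 - rho**2))   # draw eps_2 | eps_1
    e1 = rng.normal(rho * e2, np.sqrt(1 - rho**2))   # draw eps_1 | eps_2
    if t >= burn_in:                                 # discard the burn-in iterations
        draws.append((e1, e2))
print(np.corrcoef(np.array(draws).T)[0, 1])          # should be close to rho
```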
9. Metropolis–Hastings.
(a) Initialize $X_1 = x_1$.
(b) For $t = 1, 2, \ldots$
   i. Sample $y$ from $q(y \mid X_t)$. Think of $y$ as a "proposed" value for $X_{t+1}$.
   ii. Compute
$$A = \min\left\{1, \frac{\pi(y)\,q(X_t \mid y)}{\pi(X_t)\,q(y \mid X_t)}\right\}.$$
$A$ is often called the "acceptance probability".
   iii. With probability $A$, "accept" the proposed value and set $X_{t+1} = y$. Otherwise, set $X_{t+1} = X_t$.
An example: random walk Metropolis–Hastings (see the sketch below):
1. Start at any $\varepsilon_0$
2. Create a "trial" value for $\varepsilon_1$: $\tilde{\varepsilon}_1 = \varepsilon_0 + \eta$ where $\eta \sim \text{Unif}(-\delta, \delta)$
3. Compare $f(\tilde{\varepsilon}_1)$ and $f(\varepsilon_0)$:
(1) If $f(\tilde{\varepsilon}_1) > f(\varepsilon_0)$, accept: $\varepsilon_1 = \tilde{\varepsilon}_1$
(2) Otherwise, accept $\tilde{\varepsilon}_1$ with probability $\frac{f(\tilde{\varepsilon}_1)}{f(\varepsilon_0)}$
(3) If rejected, set $\varepsilon_1 = \varepsilon_0$
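A random-walk Metropolis–Hastings sketch targeting a standard normal density (illustrative target; with a symmetric uniform proposal, the $q$ terms cancel in the acceptance probability):

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda e: np.exp(-0.5 * e**2)        # target density up to a constant
delta, n_iter, burn_in = 1.0, 20_000, 2_000
eps = 0.0
chain = []
for t in range(n_iter):
    trial = eps + rng.uniform(-delta, delta)
    # accept with probability min(1, f(trial)/f(eps))
    if rng.uniform() < f(trial) / f(eps):
        eps = trial
    if t >= burn_in:
        chain.append(eps)
print(np.mean(chain), np.std(chain))     # ~0 and ~1 for the standard normal
```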
3 Simulation-Based Estimation
3.1 Maximum Simulated Likelihood Estimation (MSL)
If the likelihood function is intractable, MSL can help estimate the parameters. Denote $y_i$ as income, $X_i$ as characteristics that are observable to researchers, and $u_i$ as unobserved heterogeneity (whose distribution is known). First we need to integrate $u_i$ out of the original density function $h(y_i \mid X_i, u_i; \boldsymbol{\Theta})$, as $u_i$ is unobservable. Thus, we have
$$f(y_i \mid X_i; \boldsymbol{\Theta}) = \int h(y_i \mid X_i, u_i; \boldsymbol{\Theta})\, g(u_i)\, du_i, \qquad \hat{\boldsymbol{\Theta}}_{MSL} \in \operatorname*{argmax}_{\boldsymbol{\Theta}} \prod_{i=1}^{n} \hat{f}(y_i \mid X_i; \boldsymbol{\Theta})$$
where $f(\cdot)$ is a density function and $g(\cdot)$ is the pdf of $u_i$. Usually, it is very difficult to obtain a closed form for the above formula, so we simulate $f(\cdot)$:
$$\hat{f}(y_i \mid X_i; \boldsymbol{\Theta}) = \frac{1}{S} \sum_{j=1}^{S} h(y_i \mid X_i; \boldsymbol{\Theta}, u_{ij}), \qquad u_{ij} \sim g(u_i)$$
If $g(u_i)$ is complicated, we can use importance sampling with a simpler distribution $p(u_i)$:
$$\hat{f}(y_i \mid X_i; \boldsymbol{\Theta}) = \frac{1}{S} \sum_{j=1}^{S} \frac{h(y_i \mid X_i; \boldsymbol{\Theta}, u_{ij})\, g(u_{ij})}{p(u_{ij})}, \qquad u_{ij} \sim p(u_i)$$
Properties of MSL
1. $E(\hat{f}) = f$ and $\hat{f} \xrightarrow{p} f$
2. If (1) the MLE is consistent and asymptotically normal and (2) $E(\hat{f}) = f$, then MSL is asymptotically equivalent to MLE if $S \to \infty$, $N \to \infty$, and $\sqrt{N}/S \to 0$.
Example 3.1. Logit model with random coefficients. Suppose that $y_i$ is a binary variable and $X_i$ are observable covariates. Then we have
$$P(y_i = 1 \mid X_i; \beta_i) = \frac{e^{X_i'\beta_i}}{1 + e^{X_i'\beta_i}}$$
Suppose that $\beta_i = \beta + \omega_i$ where $\omega_i \sim N(0, \sigma_\omega^2)$. We can simulate the "observed" density $f$ as follows:
1. Draw $\omega_{i1}, \omega_{i2}, \ldots, \omega_{iS} \sim N(0, \sigma_\omega^2)$ for $i = 1, 2, \ldots, n$
2. Simulate $f$ for $i = 1, 2, \ldots, n$:
$$\hat{f}(y_i \mid X_i; \beta, \sigma_\omega^2) = \frac{1}{S} \sum_{j=1}^{S} h(y_i \mid X_i; \beta, \omega_{ij}) = \frac{1}{S} \sum_{j=1}^{S} \left(\frac{e^{X_i'(\beta + \omega_{ij})}}{1 + e^{X_i'(\beta + \omega_{ij})}}\right)^{y_i} \left(\frac{1}{1 + e^{X_i'(\beta + \omega_{ij})}}\right)^{1 - y_i}$$
3. The optimal parameters are (a sketch of all three steps follows):
$$(\hat{\beta}, \hat{\sigma}_\omega^2) = \operatorname*{argmax}_{\beta,\, \sigma_\omega^2} \prod_{i=1}^{n} \hat{f}(y_i \mid X_i; \beta, \sigma_\omega^2)$$
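A sketch of the three MSL steps for Example 3.1, assuming scalar $X_i$ and scipy's Nelder–Mead optimizer; the draws $\omega_{ij}$ are held fixed across likelihood evaluations (common random numbers), and $\sigma_\omega$ is parameterized on the log scale to stay positive:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, S = 1_000, 200

# Illustrative DGP: scalar covariate, beta = 1.0, sigma_omega = 0.5
X = rng.normal(size=n)
beta_i = 1.0 + 0.5 * rng.normal(size=n)
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-X * beta_i))).astype(float)

# Step 1: draw omega_i1, ..., omega_iS once and reuse them across evaluations
omega = rng.normal(size=(n, S))

def neg_sim_loglik(theta):
    beta, log_sigma = theta
    # Step 2: simulate f_hat(y_i | X_i; beta, sigma^2) for every i
    p = 1 / (1 + np.exp(-X[:, None] * (beta + np.exp(log_sigma) * omega)))  # n x S
    f_hat = np.mean(p ** y[:, None] * (1 - p) ** (1 - y[:, None]), axis=1)
    return -np.sum(np.log(f_hat))

# Step 3: maximize the simulated likelihood (minimize its negative)
res = minimize(neg_sim_loglik, x0=[0.0, np.log(0.3)], method="Nelder-Mead")
print(res.x[0], np.exp(res.x[1]))  # estimates of beta and sigma_omega
```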
3.2 Simulated Method of Moments Estimation (SMM)
We have the GMM estimator:
$$\hat{\boldsymbol{\theta}}_{GMM} = \operatorname*{argmin}_{\boldsymbol{\theta} \in \boldsymbol{\Theta}} \left(\frac{1}{n}\sum_{i=1}^{n} m(y_i, X_i; \boldsymbol{\theta})\right)' W \left(\frac{1}{n}\sum_{i=1}^{n} m(y_i, X_i; \boldsymbol{\theta})\right)$$
We need to integrate out the unobservable variables:
$$m(y_i, X_i; \boldsymbol{\theta}) = \int h(y_i, X_i \mid u_i; \boldsymbol{\theta})\, f_u(u)\, du$$
We can use simulation to approximate the above formula when the integral is intractable:
1. Draw $\{u_1, u_2, \ldots, u_S\}$ from $f_u$
2. Replace $m(\cdot)$ by $\hat{m}(\cdot) = \frac{1}{S} \sum_{s=1}^{S} h(y_i, X_i \mid u_s; \boldsymbol{\theta})$
3. Compute the optimal parameters (a sketch is given below):
$$\hat{\boldsymbol{\theta}}_{SMM} = \operatorname*{argmin}_{\boldsymbol{\theta} \in \boldsymbol{\Theta}} \left(\frac{1}{n}\sum_{i=1}^{n} \hat{m}(y_i, X_i; \boldsymbol{\theta})\right)' W \left(\frac{1}{n}\sum_{i=1}^{n} \hat{m}(y_i, X_i; \boldsymbol{\theta})\right)$$
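A toy SMM sketch matching the first two moments of $y$ with $W = I$; the location-scale model $y_s(\theta) = \theta_1 + \theta_2 u_s$ and the DGP are illustrative assumptions:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
y = rng.normal(2.0, 1.5, size=5_000)       # observed data (illustrative DGP)
u = rng.normal(size=100_000)               # step 1: fixed draws from f_u

def smm_objective(theta):
    y_sim = theta[0] + theta[1] * u        # step 2: simulate data given theta
    m_hat = np.array([y.mean() - y_sim.mean(),
                      (y**2).mean() - (y_sim**2).mean()])
    return m_hat @ m_hat                   # step 3: quadratic form with W = I

res = minimize(smm_objective, x0=[0.0, 1.0], method="Nelder-Mead")
print(res.x)                               # should be close to (2.0, 1.5)
```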
Properties of SMM
$$\sqrt{n}\,(\hat{\boldsymbol{\theta}}_{GMM} - \boldsymbol{\theta}_0) \xrightarrow{d} N(0, A), \qquad \sqrt{n}\,(\hat{\boldsymbol{\theta}}_{SMM} - \boldsymbol{\theta}_0) \xrightarrow{d} N(0, B)$$
where
$$A = \left(G' I_0^{-1} G\right)^{-1}, \qquad B = \left(G' \tilde{I}_0^{-1} G\right)^{-1}$$
$$G = E\left[\frac{\partial m(y_i, X_i; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}'}\bigg|_{\boldsymbol{\theta}=\boldsymbol{\theta}_0}\right], \qquad I_0 = \operatorname{Var}\left[m(y_i, X_i; \boldsymbol{\theta})\right], \qquad \tilde{I}_0 = \operatorname{Var}\left[\hat{m}(y_i, X_i; \boldsymbol{\theta})\right]$$
If $S \to \infty$, we have $\tilde{I}_0 \to I_0$.