Identification
Zijing Hu
November 29, 2023
Contents

1 Basic Concepts of Identification
2 Identification in Causal Model
2.1 Basic Assumptions: Unconfoundedness and Overlap
2.2 Assessing the Overlap Assumption
2.3 Assessing the Unconfoundedness Assumption
2.3.1 Assessing unconfoundedness without further assumptions
2.3.2 Assessing unconfoundedness using external approaches
3 Identification in Classical Linear Model
4 Identification in Choice Model
*This note is based on ECMT 677: Applied Microeconometrics by Dr. Jackson Bunting, TAMU
1 Basic Concepts of Identification
The workflow of empirical analysis

Population: a collection of indices $i \in I$ (could be either finite or not). Each unit is associated with some information $\{y_i : i \in I\}$, and $y_i$ likely varies by $i$. We formalize this as a map $y : I \to \mathbb{R}$ and call its CDF $F_y$ the population distribution. Another example: $(Y_i, X_i, U_i)_{i \in I}$ and $F_{YXU}$.
Population parameters of interest: $\Theta(F_{YXU})$ (e.g., $E[Y]$, $E[Y_1 - Y_0]$). If we knew $F_{YXU}$, we could compute $\Theta(F_{YXU})$.
Problem 1: $Y$ and $X$ are observed but $U$ is not.

Population data: the distribution of observed random variables. We conceptualize it as
$$F_{YX} = \text{MakeData}(F_{YXU}),$$
where MakeData (MD) is an objective data generating process that maps the population to the observable population data. Usually we do not impose any assumptions on MD.
Problem 2: we only observe finite sample data $(Y_i, X_i)_{i=1}^n$.

The objective becomes learning about $\Theta(F_{YXU})$ from $(Y_i, X_i)_{i=1}^n$.

Identification: learn about $\Theta(F_{YXU})$ from $F_{YX} = \text{MD}(F_{YXU})$.

Estimation: take into account the difference between the finite sample and the population data.
Definition 1.1. Observational Equivalence (o.e.).
We say $F_{YXU}$ and $\tilde{F}_{YXU}$ are o.e. if $\text{MD}(F_{YXU}) = \text{MD}(\tilde{F}_{YXU})$.

We impose assumptions through a set $\mathcal{F}$ in which the "true" population distribution lies: $F_{YXU} \in \mathcal{F}$.
Falsifiability: if you could see the population data, could you tell whether your assumption(s) hold? (E.g., it is easy to test whether $E[Y \mid X = x] = 0$, but it might be hard to test unconfoundedness.)

Structural models usually impose assumptions on $\mathcal{F}$ (e.g., the random component of utility is i.i.d. and follows a Gumbel distribution).
Definition 1.2. Identification Set.
$$\Theta_I = \left\{ \theta \in \Theta : \theta = \Theta(F_{YXU}) \text{ for some } F_{YXU} \in \mathcal{F} \text{ and } F^{\mathrm{obs}}_{YX} = \text{MD}(F_{YXU}) \right\} = \bigcup_{\{F_{YXU} \in \mathcal{F} \,:\, \text{MD}(F_{YXU}) = F_{YX}\}} \{\Theta(F_{YXU})\}$$
Types of identification

If $\Theta_I = \emptyset$, we conclude that $F_{YXU} \notin \mathcal{F}$, suggesting the assumptions on $\mathcal{F}$ might be problematic.

Point identification: $|\Theta_I| = 1$. This only indicates that every $\tilde{F}_{YXU} \in \mathcal{F}$ that is o.e. to $F_{YXU}$ delivers the same parameter value.

Partial identification: $|\Theta_I| > 1$ and $\Theta_I \subsetneq \Theta$.

Complete non-identification: $|\Theta_I| > 1$ and $\Theta_I = \Theta$.

Sharp identification: $\Theta_I$ is constructed by checking all $F \in \mathcal{F}$.
The characterizations of point identification

$|\Theta_I| = 1$ if and only if for all $F_{YXU}, \tilde{F}_{YXU} \in \mathcal{F}$ with $\text{MD}(F_{YXU}) = \text{MD}(\tilde{F}_{YXU}) = F_{YX}$, we have $\Theta(F_{YXU}) = \Theta(\tilde{F}_{YXU})$.

Proof. Suppose that $|\Theta_I| > 1$. Then there exist $F, \tilde{F} \in \mathcal{F}$ s.t. $\text{MD}(F) = \text{MD}(\tilde{F}) = F_{YX}$ and $\Theta(F) \neq \Theta(\tilde{F})$, a contradiction. Conversely, $|\Theta_I| = 1$ leads to no contradiction.

$|\Theta_I| = 1$ if $\Theta(F_{YXU}) = h(\text{MD}(F_{YXU}))$ for a known $h$.

Proof. Suppose that $|\Theta_I| > 1$. Then there exist $F, \tilde{F} \in \mathcal{F}$ s.t. $\text{MD}(F) = \text{MD}(\tilde{F}) = F_{YX}$. We have $h(\text{MD}(F)) = h(\text{MD}(\tilde{F}))$, so $\Theta(F) = \Theta(\tilde{F})$, a contradiction.
Missing data and partial identification

Example 1. Suppose that $Y$ is observed only when $Z = 1$, so $\text{MD}(F_{YZ}) = \{P(Y = 1 \mid Z = 1), P(Z = 1)\}$, and we want to identify $\Theta(F_{YZ}) = P(Y = 1)$. With no assumptions, the unobserved $P(Y = 1 \mid Z = 0)$ can be anywhere in $[0, 1]$, so we have
$$\Theta_I = [P(Y = 1 \mid Z = 1)P(Z = 1),\; P(Y = 1 \mid Z = 1)P(Z = 1) + P(Z = 0)]$$
However, we can impose other assumptions.

Missing at random (strong): $P(Y = 1 \mid Z = 0) = P(Y = 1 \mid Z = 1)$. Then we have
$$\Theta_I = \{P(Y = 1 \mid Z = 1)\}$$

Monotonicity. E.g., $1 \geq P(Y = 1 \mid Z = 0) \geq P(Y = 1 \mid Z = 1)$. Then we have
$$\Theta_I = [P(Y = 1 \mid Z = 1),\; P(Y = 1 \mid Z = 1)P(Z = 1) + P(Z = 0)]$$
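The three identified sets above can be computed directly; a minimal sketch in Python, with hypothetical values for the two observed quantities:

```python
# Bounds on P(Y=1) when Y is observed only for Z=1; numbers are illustrative.
p_y1_given_z1 = 0.6   # P(Y=1 | Z=1), observed
p_z1 = 0.7            # P(Z=1), observed

# No-assumption (worst-case) bounds: the missing P(Y=1 | Z=0) can be
# anything in [0, 1].
lb = p_y1_given_z1 * p_z1                 # missing outcomes all 0
ub = p_y1_given_z1 * p_z1 + (1 - p_z1)    # missing outcomes all 1

# Missing at random collapses the identified set to a point.
point = p_y1_given_z1

# Monotonicity P(Y=1|Z=0) >= P(Y=1|Z=1) raises only the lower bound.
lb_mono = p_y1_given_z1
```

Tighter assumptions shrink the set: the monotonicity lower bound weakly exceeds the worst-case one.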
Example 2 (Blundell et al. 2007 ECMA). Suppose that $y$ represents wages and $z = 1$ if working. Inequality is measured by the interquartile range (IQR): $Q_{y|x}(0.75) - Q_{y|x}(0.25)$. However, we only observe $(y_i, x_i, z_i = 1)$ and $(x_i, z_i = 0)$. We can start from the extreme cases where everyone missing is assumed to have the highest ($+\infty$) or lowest ($0$) wage. More assumptions can be imposed later (e.g., monotonicity: $F_{y|x,z=1}$ stochastically dominates $F_{y|x,z=0}$, "people tend to work if more productive") and lead to tighter bounds.
Falsifiability

Suppose that we include an instrument $w$ for $x$ such that $F_{y|wx} = F_{y|x}$. Then we can compare the identification sets for different values of $w$ and check whether they overlap. If they do not, then the instrument assumption might be wrong.
2 Identification in Causal Model

Unit level causal model

Potential outcomes: $Y_i(x) = g(x, u_i)$, $x \in \{x_1, x_0\}$

Counterfactual outcomes: $Y_i(x)$ for $x \neq x_i$ ($Y_i(x_i)$ is the realised outcome)

Unit level causal effect: $Y_i(x_1) - Y_i(x_0)$
Population level causal model

Distribution of effects: $F_{Y(x_1) - Y(x_0)}(t) = P(Y(x_1) - Y(x_0) \leq t)$

$\text{ATE}(x_0 \to x_1) \equiv E[Y(x_1) - Y(x_0)]$

$P(Y(x_1) > Y(x_0))$: share of the population who benefit

Average marginal effect: $E[\partial_x g(x, U)]$

Average structural function: $\text{ASF}(x) = E[Y(x)]$

Quantile treatment effect: $Q_{Y(x_1)}(t) - Q_{Y(x_0)}(t)$. This is not the same as $Q_{Y(x_1) - Y(x_0)}(t)$ or $E[Y(x_1) - Y(x_0) \mid Y(x_0) = Q_{Y(x_0)}(t)]$.
The fundamental problem is that unit level causal effects are fully non-identified. But we might be able to identify population level causal effects. Suppose that $F_{YXU}$ satisfies $Y = g(X, U)$. We have
$$P(Y \leq y, X \leq x, U \leq u) = P(g(X, U) \leq y, X \leq x, U \leq u)$$
Identification analysis focuses on $g(\cdot)$ and $F_{XU}$ instead of $F_{YXU}$. Usually we refer to $X$ as observable heterogeneity and $U$ as unobservable heterogeneity. For example, suppose that
$$x \in \{0, 1\}, \quad U_i = (Y_i(0),\; Y_i(1) - Y_i(0))$$
Then we have
$$Y_i(x) = U_{i,1} + x U_{i,2}$$
2.1 Basic Assumptions: Unconfoundedness and Overlap

Theorem 2.1 (Randomly assigned treatment). Suppose that $\{Y(x) : x \in \mathcal{X}\} \perp X$ (Unconfoundedness, U). Then $F_{Y(x)}$ is point identified for all $x \in S(X)$, where $S(W) = \{w \in \mathbb{R}^{\dim(w)} : dF_W(w) > 0\}$.

Proof. $F_{Y(x)}(t) = P(Y(x) \leq t) = P(Y(x) \leq t \mid X = x) = P(Y \leq t \mid X = x)$. Therefore, $F_{Y(x)}$ is point identified for all $x \in S(X)$.
Implication: parameters that depend on $\{F_{Y(x)} : x \in S(X)\}$ are also point identified:
$$\text{ASF}(x) = \int_{\mathbb{R}} y\, dF_{Y(x)}(y)$$
$$\text{ATE}(x \to x') = \int_{\mathbb{R}} y\,\left(dF_{Y(x')}(y) - dF_{Y(x)}(y)\right)$$
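A quick simulation sketch of Theorem 2.1's implication: under random assignment, the conditional means $E[Y \mid X = x]$ recover $\text{ASF}(x)$, and their difference recovers the ATE. All distributional choices below are illustrative.

```python
# Simulate potential outcomes with a known constant effect of 2, assign
# treatment independently of (Y(0), Y(1)), and compare conditional means.
import random

random.seed(0)
n = 100_000
y0 = [random.gauss(1.0, 1.0) for _ in range(n)]   # Y(0)
y1 = [yi + 2.0 for yi in y0]                      # Y(1) = Y(0) + 2, so ATE = 2
x = [random.random() < 0.5 for _ in range(n)]     # X independent of potential outcomes
y = [y1[i] if x[i] else y0[i] for i in range(n)]  # observed outcome

asf1_hat = sum(y[i] for i in range(n) if x[i]) / sum(x)
asf0_hat = sum(y[i] for i in range(n) if not x[i]) / (n - sum(x))
ate_hat = asf1_hat - asf0_hat                     # close to the true ATE of 2
```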
Theorem 2.2 (Stratified random experiments). Suppose that $\{Y(x) : x \in \mathcal{X}\} \perp X \mid W$. Then $F_{Y(x)|W}$ is point identified for all $(x, w) \in S(X, W)$.

Proof. $F_{Y(x)|W}(t, w) = P(Y(x) \leq t \mid W = w) = P(Y(x) \leq t \mid X = x, W = w) = P(Y \leq t \mid X = x, W = w)$. Therefore, $F_{Y(x)|W}$ is point identified for all $(x, w) \in S(X, W)$.

Implication: parameters that depend on $\{F_{Y(x)|W} : (x, w) \in S(X, W)\}$ are also point identified:
$$\text{CASF}(x, w) = \int_{\mathbb{R}} y\, dF_{Y(x)|W}(y, w)$$
$$\text{CATE}(x \to x', w) = \int_{\mathbb{R}} y\,\left(dF_{Y(x')|W}(y, w) - dF_{Y(x)|W}(y, w)\right)$$
Given that
$$\text{ATE}(x \to x') = \int \text{CATE}(x \to x', w)\, dF_W(w),$$
to point identify the ATE under the unconfoundedness assumption, we need (Overlap condition, O)
$$P(X = 1 \mid W = w) \in (0, 1)\ \forall w \in S(W) \iff S(X \mid W = w) = \{0, 1\}\ \forall w \in S(W)$$
or the rectangular support assumption (too strong if $X$ is continuous): $S(X) = \{0, 1\}$ and $S(X, W) = S(X) \times S(W)$.
Theorem 2.3. Suppose U and O. Then (1) $F_{Y(x)|W}$ is point identified for all $x \in \mathcal{X}$, $w \in S(W)$, and (2) $F_{Y(x)}$ is point identified.

Proof. $F_{Y(x)|W}(t, w) = P(Y(x) \leq t \mid W = w) = P(Y(x) \leq t \mid X = x, W = w) = P(Y \leq t \mid X = x, W = w)$, which is observed for $(x, w) \in S(X, W)$. Then $F_{Y(x)}(t) = \int F_{Y(x)|W}(t, w)\, dF_W(w)$.
Proposition 2.4. Suppose U and O. Then ATE and CATE(w) are point identified for all $w \in S(W)$.

Proof.
$$\text{CATE}(w) = E[Y(1) - Y(0) \mid W = w] = E[Y(1) \mid X = 1, W = w] - E[Y(0) \mid X = 0, W = w] = E[Y \mid X = 1, W = w] - E[Y \mid X = 0, W = w]$$
$$\text{ATE} = \int \text{CATE}(w)\, dF_W(w)$$
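Proposition 2.4 suggests a plug-in procedure: within-stratum mean differences averaged over the distribution of $W$. A minimal sketch with a small hypothetical dataset:

```python
# Stratified estimator of CATE(w) and ATE; the (y, x, w) triples are illustrative.
data = [
    (3.0, 1, 0), (1.0, 0, 0), (2.8, 1, 0), (1.2, 0, 0),
    (5.0, 1, 1), (2.0, 0, 1), (5.2, 1, 1), (1.8, 0, 1),
]

def cate(w):
    # Mean difference between treated and control within stratum w
    treated = [y for (y, x, wi) in data if wi == w and x == 1]
    control = [y for (y, x, wi) in data if wi == w and x == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)

# Average CATE(w) over the empirical distribution of W
weights = {w: sum(1 for (_, _, wi) in data if wi == w) / len(data)
           for w in {wi for (_, _, wi) in data}}
ate = sum(cate(w) * p for w, p in weights.items())
```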
Why do we prefer U to conditional mean independence (CMI, a weaker assumption)?

How would one justify that CMI holds but U does not?

We can learn more if we assume U.

CMI of what? $\log Y(x)$ vs. $Y(x)$ (mean independence is not invariant to transformations of the outcome).
Theorem 2.5 (Propensity Score). Suppose U. Then $\{Y(x) : x \in \mathcal{X}\} \perp X \mid p(W)$, where $p(w) = P(X = 1 \mid W = w)$.

Proof.
$$P(X = 1 \mid Y(0), Y(1), p(W) = p) = E[X \mid Y(0), Y(1), p(W) = p] = E\left[E[X \mid Y(0), Y(1), W] \mid Y(0), Y(1), p(W) = p\right] = E\left[E[X \mid W] \mid Y(0), Y(1), p(W) = p\right] = E[p(W) \mid Y(0), Y(1), p(W) = p] = p = E[X \mid p(W) = p]$$
where the third equality uses U.
Implication: under U and O, we can point identify ASF and ATE conditioning only on the scalar $p(W)$:
$$\text{ASF}(x) = E[Y(x)] = E\left[E[Y(x) \mid p(W)]\right]$$
Given that
$$E[Y(x) \mid p(W) = p] = E[Y \mid X = x, p(W) = p],$$
we need $S(X, p(W)) = S(X) \times S(p(W))$, i.e., $p(w) \in (0, 1)$ for all $w \in S(W)$.
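One common way to exploit the propensity score (inverse propensity weighting, one of several options) rests on $E[YX / p(W)] = E[Y(1)]$ under U and O. A simulation sketch with a known, illustrative $p(w)$:

```python
# IPW sketch: recover ASF(1) = E[Y(1)] when p(w) is known. All numbers illustrative.
import random

random.seed(1)
n = 200_000
w = [random.random() < 0.5 for _ in range(n)]            # binary covariate
p = [0.8 if wi else 0.2 for wi in w]                     # p(w) = P(X=1 | W=w)
x = [random.random() < pi for pi in p]                   # confounded treatment
y1 = [2.0 + (1.0 if wi else 0.0) + random.gauss(0, 1) for wi in w]  # Y(1)
y = [y1[i] if x[i] else 0.0 for i in range(n)]           # Y(0) plays no role here

# E[Y X / p(W)] estimates E[Y(1)]; true ASF(1) = 2 + 0.5 = 2.5
asf1_ipw = sum(y[i] * x[i] / p[i] for i in range(n)) / n
```

A naive comparison of raw group means would be biased here, since treatment probability depends on $W$, which also shifts $Y(1)$.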
2.2 Assessing the Overlap Assumption

Falsification of O: compute $P(X = 1 \mid W = w)$ from $F_{YXW}$:

If $S(W)$ is finite, check whether there is at least one treated and one untreated unit in each $w \in S(W)$.

If $S(W)$ is infinite, $f_{X|W}(x; w)$ is still point identified, but it would be hard to distinguish zero probability from very small probability.
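The finite-$S(W)$ check is mechanical; a sketch that flags strata containing only treated or only control units (the $(x, w)$ pairs are hypothetical):

```python
# Flag strata where overlap fails: some value of x in {0, 1} is never observed.
data = [(1, "a"), (0, "a"), (1, "b"), (1, "b"), (0, "c")]  # (x, w) pairs

def overlap_violations(pairs):
    strata = {}
    for x, w in pairs:
        strata.setdefault(w, set()).add(x)
    # A stratum satisfies overlap only if both treatment values appear
    return sorted(w for w, xs in strata.items() if xs != {0, 1})

bad = overlap_violations(data)  # "b" has no control unit, "c" has no treated unit
```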
What to do if O fails?

Option 1: stick with point identification. Define $W_{0,1} = \{w \in S(W) : P(X = 1 \mid W = w) \in (0, 1)\}$ and point identify only the ATE for the overlapping part: $\text{ATE}_{W_{0,1}} = E[Y(1) - Y(0) \mid W \in W_{0,1}]$.

Option 2: use partial identification. $\text{ATE} = \text{ATE}_{W_{0,1}} P(W_{0,1}) + \text{ATE}_{W \setminus W_{0,1}}(1 - P(W_{0,1}))$. We need to construct bounds on $\text{ATE}_{W \setminus W_{0,1}}$ based on external knowledge. Not common.
Option 3 (Imbens and Wooldridge 2009): assume a relationship between $\text{ATE}_{W_{0,1}}$ and $\text{ATE}_{W \setminus W_{0,1}}$.

Example: assume that $E[Y(x) \mid W = w] = q(w)'\gamma(x)$, where $q$ is known and $\gamma(x) \in \mathbb{R}^{\dim(q)}$. Then we can identify $\gamma(x)$ by regressing $Y$ on $q(W)$ among units with $X = x$ (if the first expectation is nonsingular):
$$\gamma(x) = E[q(W)q(W)' \mid X = x]^{-1} E[q(W)Y \mid X = x]$$
Then we also have
$$\text{CASF}(x, w) = q(w)'\gamma(x)$$
$$\text{CATE}(w) = q(w)'[\gamma(1) - \gamma(0)]$$
$$\text{ATE} = E[q(W)']\,[\gamma(1) - \gamma(0)]$$

Theorem 2.6. Suppose U and $q(w) = (1, w)'$, i.e., $E[Y(x) \mid w] = (1, w)\gamma(x)$. Then ATE is point identified if $\mathrm{Var}(W \mid X = x) > 0$ for $x \in \{0, 1\}$.
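A simulation sketch of Theorem 2.6: overlap fails by construction ($X$ is a deterministic function of $W$), yet the linear-in-$w$ model extrapolates across the support and recovers the ATE. All coefficients are illustrative.

```python
# Fit E[Y(x) | W = w] = gamma0(x) + gamma1(x) * w by OLS within each arm,
# then combine as ATE = E[(1, W)'] [gamma(1) - gamma(0)]. True ATE = 3.
import random

random.seed(2)
n = 50_000
w = [random.uniform(0, 1) for _ in range(n)]
x = [wi > 0.5 for wi in w]                       # overlap fails: X determined by W
y = [(1.0 + 2.0 * wi) + (3.0 if xi else 0.0) + random.gauss(0, 0.1)
     for wi, xi in zip(w, x)]                    # E[Y(x) | w] = 1 + 2w + 3x

def ols(ws, ys):
    # Simple-regression OLS returning (intercept, slope)
    mw, my = sum(ws) / len(ws), sum(ys) / len(ys)
    b = (sum((a - mw) * (c - my) for a, c in zip(ws, ys))
         / sum((a - mw) ** 2 for a in ws))
    return my - b * mw, b

g0 = ols([wi for wi, xi in zip(w, x) if not xi], [yi for yi, xi in zip(y, x) if not xi])
g1 = ols([wi for wi, xi in zip(w, x) if xi], [yi for yi, xi in zip(y, x) if xi])
ate = (g1[0] - g0[0]) + (g1[1] - g0[1]) * (sum(w) / n)
```

The point identification comes entirely from the functional form: each arm's line is fitted on half the support of $W$ and extrapolated to the other half.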
2.3 Assessing the Unconfoundedness Assumption

2.3.1 Assessing unconfoundedness without further assumptions
Option 1: the idea of Cornfield et al. (1959)

Theorem 2.7. Let $Y$, $X$, and $U$ be binary random variables. Let $Y(x, u)$ denote the potential outcomes, $x, u \in \{0, 1\}$. Define
$$r_u = P(Y = 1 \mid U = u), \quad p_x = P(U = 1 \mid X = x), \quad r = \frac{P(Y = 1 \mid X = 1)}{P(Y = 1 \mid X = 0)}$$
We have $\frac{p_1}{p_0} > r$ if three assumptions are satisfied:

Latent unconfoundedness: $\{Y(x, u) : x, u = 0, 1\} \perp X \mid U$

No causal effect of $X$: for all $i \in I$ and $u \in \{0, 1\}$, $Y_i(1, u) = Y_i(0, u)$

$U$ positively (or negatively) related with both $Y$ and $X$: $r_1 > r_0$ and $p_1 > p_0$
Proof.
$$r = \frac{P(Y = 1 \mid X = 1)}{P(Y = 1 \mid X = 0)} = \frac{r_1 p_1 + r_0 (1 - p_1)}{r_1 p_0 + r_0 (1 - p_0)}$$
Rearranging,
$$\frac{p_1}{p_0} = r + \frac{r_0}{r_1 p_0}\left((1 - p_0)r - (1 - p_1)\right)$$
Note that $r > 1$, since the numerator of $r$ minus its denominator equals $(r_1 - r_0)(p_1 - p_0) > 0$. Given that $r > 1$ and $p_1 > p_0$, we have $(1 - p_0)r - (1 - p_1) > (1 - p_0) - (1 - p_1) = p_1 - p_0 > 0$, so
$$\frac{r_0}{r_1 p_0}\left((1 - p_0)r - (1 - p_1)\right) > 0$$
and hence $p_1 / p_0 > r$.
Application: (1) if $\frac{p_1}{p_0} > r$ is not plausible, then the no-causal-effect assumption might be false; and (2) if it is plausible, we might want to doubt the U assumption ($\{Y(x) : x \in \mathcal{X}\} \perp X \mid W$).

Similar methods: Oster (2016 JBES)
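The bound in Theorem 2.7 can be checked numerically; the values of $r_u$ and $p_x$ below are illustrative:

```python
# Cornfield-style check: with r1 > r0 and p1 > p0 and no causal effect,
# the confounding ratio p1/p0 must exceed the observed risk ratio r.
r1, r0 = 0.30, 0.10     # P(Y=1 | U=1), P(Y=1 | U=0): U positively related to Y
p1, p0 = 0.60, 0.20     # P(U=1 | X=1), P(U=1 | X=0): U positively related to X

# Risk ratio implied by latent unconfoundedness and no causal effect
r = (r1 * p1 + r0 * (1 - p1)) / (r1 * p0 + r0 * (1 - p0))
bound_holds = (p1 / p0) > r
```

If the observed risk ratio were larger than any plausible $p_1/p_0$, the no-causal-effect explanation would be untenable.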
Option 2: instead of focusing on observable confounders, we can focus on the relationship between $Y(x)$ and $X$ (Robins, Rotnitzky, and Scharfstein 2000).

Theorem 2.8. Suppose $X$ and $Y(x)$ are discretely distributed and the joint distribution of $(X, Y)$ is observed. Suppose
$$p_x(y) = P(X = x \mid Y(x) = y)$$
is known and nonzero for all $x \in S(X)$ and $y \in S(Y)$. Then $P[Y(x) = y]$ is point identified for all $y \in S(Y)$ and all $x \in S(X)$.

Proof.
$$P[Y(x) = y] = \frac{P(Y(x) = y, X = x)}{P(X = x \mid Y(x) = y)} = P(Y = y \mid X = x)\,\frac{P(X = x)}{p_x(y)}$$

To operationalize, we can parameterize $p_x(y) = p_x(y; \gamma)$ and compute $F_{Y(x)}$ (or the ATE) for different values of $\gamma$.
Option 3: we can also nonparametrically relax unconfoundedness (Masten and Poirier 2018).

Definition 2.9. Let $x \in \{0, 1\}$ and let $c$ be a scalar in $[0, 1]$. Say $X$ is c-dependent with $Y(x)$ if
$$\sup_{y \in S[Y(x)]} \left|P(X = 1 \mid Y(x) = y) - P(X = 1)\right| \leq c$$
If $c = 0$, we have random assignment. If $c > 0$, we can still partially identify the parameters of interest.
2.3.2 Assessing unconfoundedness using external approaches

Option 1: Placebo tests

1. Placebo outcome

Definition 2.10. Placebo exclusion: $Y^p(0) = Y^p(1)$

Theorem 2.11. Suppose placebo exclusion. Then $\{Y^p(0), Y^p(1)\} \perp X$ if and only if $Y^p(x) \perp X$.

(Causal diagram: $U \to X$, $U \to Y$, $U \to Y^p$.) The chain of logic requires the placebo outcome to be affected by the confounder we are worried about. Then we have
$$Y^p(x) \perp X \iff Y^p(x, U) \perp X(U)$$
and, provided $X(u) \neq X(\tilde{u})$ for $u \neq \tilde{u}$,
$$Y(x, U) \perp X(U), \text{ i.e., } Y(x) \perp X$$
Workflow

Find a $Y^p$ and justify placebo exclusion.

Check whether $Y^p(x) \perp X$ and conclude $\{Y^p(0), Y^p(1)\} \perp X$ or not.

Justify why the possible confounders of $Y$ are also causes of $Y^p$.

Conclude $Y(x) \perp X$ if and only if $Y^p(x) \perp X$.
Example: parallel trends

Suppose that we observe $\{X_t, Y_t\}_{t = -1, 0, 1}$ with $X_{-1} = X_0 = 0$, and let $X = X_1$. We want to assume unconfoundedness of the change, i.e., $(Y_1 - Y_0)(x) \perp X$. We can use the pre-trend, i.e., $Y_0 - Y_{-1}$, as the placebo. If we have placebo exclusion, i.e., $(Y_0 - Y_{-1})(0) = (Y_0 - Y_{-1})(1)$, then placebo unconfoundedness, i.e., $(Y_0 - Y_{-1})(x) \perp X$, can be used to validate our assumption.
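The pre-trend comparison is just a difference of group mean changes; a sketch with a hypothetical balanced panel:

```python
# Pre-trend placebo check and difference-in-differences on illustrative data.
panel = [  # (y_{-1}, y_0, y_1, x) per unit
    (1.0, 1.5, 2.0, 0), (1.2, 1.7, 2.2, 0),
    (1.1, 1.6, 3.1, 1), (0.9, 1.4, 2.9, 1),
]

def mean(v):
    return sum(v) / len(v)

# Placebo: treated vs. control difference in the pre-period change Y_0 - Y_{-1}
pre_trend_gap = (mean([y0 - ym1 for ym1, y0, _, x in panel if x == 1])
                 - mean([y0 - ym1 for ym1, y0, _, x in panel if x == 0]))

# Post-period DiD: treated vs. control difference in Y_1 - Y_0
did = (mean([y1 - y0 for _, y0, y1, x in panel if x == 1])
       - mean([y1 - y0 for _, y0, y1, x in panel if x == 0]))
```

A pre-trend gap near zero supports (but does not prove) the unconfoundedness-of-changes assumption behind the DiD estimate.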
2. Placebo treatment

Theorem 2.12. Suppose that $Y(x, x^p) = Y(x, \tilde{x}^p)$ for all $x^p, \tilde{x}^p$ and $Y(x) \perp X^p \mid X$. Then $Y \perp X^p \mid X$.

Proof.
$$P(Y \leq y \mid X^p = x^p, X = x) = P(Y(x, x^p) \leq y \mid X^p = x^p, X = x) = P(Y(x) \leq y \mid X^p = x^p, X = x) = P(Y(x) \leq y \mid X = x) = P(Y \leq y \mid X = x)$$
(Causal diagram: $U \to X$, $U \to Y$, $U \to X^p$.) We require that the confounder we are worried about affects both $X$ and $X^p$. Then we have
$$Y(x) \perp X^p \mid X \iff Y(x, U) \perp X^p(U) \mid X(U)$$
and, unless $Y(x, u) = Y(x, \tilde{u})$ for all $u, \tilde{u}$,
$$Y(x, U) \perp X(U), \text{ i.e., } Y(x) \perp X$$

Workflow

Checking $Y \perp X^p \mid X$ tells us about $Y(x) \perp X^p \mid X$, which in turn supports a statement about $Y(x) \perp X$.
Option 2: Using other assumptions

Example 1: monotonicity restriction $y_{\min} \leq Y(0) \leq Y(1)$

Theorem 2.13. Suppose monotone treatment response (MTR). Then $E[Y(x)] \in [\text{LB}(x), \text{UB}(x)]$, where
$$\text{LB}(0) = y_{\min} P(X = 1) + E[Y \mid X = 0]P(X = 0)$$
$$\text{UB}(0) = E[Y \mid X = 1]P(X = 1) + E[Y \mid X = 0]P(X = 0) = E[Y]$$
The idea is to derive informative bounds on the ASF or ATE under MTR and check whether the identification set under U lies within those bounds.
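The bounds in Theorem 2.13 are simple moment combinations; a sketch with illustrative observed moments:

```python
# MTR bounds on E[Y(0)]: for treated units, Y(0) lies between y_min and
# the observed Y(1) = Y. Observed moments below are hypothetical.
y_min = 0.0
e_y_given_x0, e_y_given_x1 = 1.0, 3.0   # E[Y | X=0], E[Y | X=1]
p_x1 = 0.4                              # P(X=1)

lb0 = y_min * p_x1 + e_y_given_x0 * (1 - p_x1)
ub0 = e_y_given_x1 * p_x1 + e_y_given_x0 * (1 - p_x1)   # equals E[Y]
```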
Example 2: Modeling treatment assignment

Given the production function $Y_i(x) = g(x, U_i)$ and expected profit $E_i[Y_i(x) - xv \mid V_i = v] = m_i(x, v)$:

Theorem 2.14. We have $X \perp U$ if the following conditions hold:

Rational expectations: $E_i = E$

Profit maximization: $X_i = 1(m(1, v_i) \geq m(0, v_i))$

The unobservable part of the production function is independent of the cost: $V \perp U$

Proof. $X_i = h(V_i)$ is a function of $V_i$ alone, so $X_i \perp U_i$. Note that $\Pi(x) = E[m(x, V)]$ is not identified, since $X$ is a function of $V$ and hence not independent of it.
Option 3: Alternative identification strategy

Assuming U and an IV, we can identify the ATE in two ways and check whether the two identification results agree.
3 Identification in Classical Linear Model

Theorem 3.1. $(\beta, F_{XU})$ is point identified given the following assumptions of the classical linear model:

A1 (Linearity): $Y = X'\beta + U$

A2 (Finite moments): $E[XY]$, $E[XX']$, $E[XU]$, and $E[U]$ are finite

A3 (Sufficient variation): $E[XX']$ is nonsingular

A4 (Exogeneity): $E[UX] = E[U]E[X]$

A5 (Normalization): $X$ includes a constant 1 and $E[U] = 0$

Assessing linearity by strengthening A4

A4$'$: $E[U \mid X] = E[U] = 0$, together with A2 and A5, implies that $E[Y \mid X = x] = x'\beta$ (falsifiable)

A4$''$: $U \perp X$ implies that $\mathrm{Var}[Y \mid X = x] = \mathrm{Var}[U]$ (falsifiable)

Can we falsify A1$'$: $Y_i = m(X_i) + U_i$?

Theorem 3.2. (1) If A1$'$, A2, and A4$'$, then $m(x) + E[U]$ is point identified; (2) if A1$'$, A2, and A4$''$, then $m(x) + E[U]$ is point identified.
Assessing exogeneity

Suppose that $Y_i = \beta_0 + X_i\beta_1 + W_i\beta_2 + U_i$. Then
$$\frac{\mathrm{Cov}(Y, X)}{\mathrm{Var}[X]} = \beta_1 + \beta_2\,\frac{\mathrm{Cov}(W, X)}{\mathrm{Var}[X]}$$
If we can restrict the sign of the bias term, then $\frac{\mathrm{Cov}(Y, X)}{\mathrm{Var}[X]}$ bounds the causal effect $\beta_1$ from one side.

Alternatively, relax A4 and suppose that $|\mathrm{Cov}(X, U)| \leq \varepsilon$. Then
$$\beta_1 \in \left[\frac{\mathrm{Cov}(Y, X)}{\mathrm{Var}[X]} - \frac{\varepsilon}{\mathrm{Var}[X]},\; \frac{\mathrm{Cov}(Y, X)}{\mathrm{Var}[X]} + \frac{\varepsilon}{\mathrm{Var}[X]}\right]$$
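The relaxed-exogeneity interval is immediate to compute; a sketch with illustrative moments:

```python
# Partial identification of beta_1 when exogeneity is relaxed to
# |Cov(X, U)| <= eps. All moments below are hypothetical.
cov_yx = 2.0
var_x = 4.0
eps = 0.4

ols_slope = cov_yx / var_x
interval = (ols_slope - eps / var_x, ols_slope + eps / var_x)
```

As $\varepsilon \to 0$ the interval collapses to the OLS slope, recovering the point-identified case under A4.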
Heterogeneous treatment effects

Can we assess A1$''$: $Y_i = X_i'\beta_i + U_i$? We cannot identify $\beta_i$ using only A1–A5.

1. Keep $\hat{\beta}$ and interpret $\mathrm{plim}\,\hat{\beta}$ under A1$''$.

(Very strong assumption) Suppose A1$''$, A4$'$, A5, and $E[\beta_i \mid X = x_i] = E[\beta_i]$. Then $\mathrm{plim}\,\hat{\beta} = E(\beta_i)$.

Suppose $X \in \{0, 1\}$. Then regressing $Y$ on $X - E[X \mid W]$ yields
$$\frac{\mathrm{Cov}(Y, X - E[X \mid W])}{\mathrm{Var}(X - E[X \mid W])} = \frac{E[\text{CATE}(W)\,\mathrm{Var}(X \mid W)]}{E[\mathrm{Var}(X \mid W)]}$$
This $\hat{\beta}$ is (1) the plim of regressing $Y$ on $(1, X, g(W))$, and (2) a convex combination of the CATEs.

2. Define $\Theta$ and then compute $\Theta_I$ under A1$''$–A5.
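The variance-weighted representation in part 1 can be verified by simulation; the two-stratum design with known propensities is illustrative.

```python
# Regressing Y on X - E[X|W] recovers E[CATE(W) Var(X|W)] / E[Var(X|W)].
import random

random.seed(3)
n = 200_000
w = [random.random() < 0.5 for _ in range(n)]       # two strata
p = [0.9 if wi else 0.5 for wi in w]                # P(X=1 | W=w), known here
x = [1.0 if random.random() < pi else 0.0 for pi in p]
cate = [4.0 if wi else 1.0 for wi in w]             # CATE(w)
y = [cate[i] * x[i] + random.gauss(0, 0.1) for i in range(n)]  # Y(0) = noise

xt = [x[i] - p[i] for i in range(n)]                # residualized treatment
slope = sum(yt * xi for yt, xi in zip(y, xt)) / sum(xi * xi for xi in xt)

# Var(X|W=1) = 0.9*0.1 = 0.09, Var(X|W=0) = 0.5*0.5 = 0.25
expected = (4.0 * 0.09 + 1.0 * 0.25) / (0.09 + 0.25)
```

Note the weights favor strata where treatment varies more, so the plim is generally not the ATE.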
4 Identification in Choice Model

Example: suppose $Y_i(x)$ is labor force participation under policy $x$.

"Threshold crossing": $Y_i(x) = 1(m(x, U_i) \geq 0)$

Employment rate under $x$: $\Pr(m(x, W, U_i) \geq 0 \mid W = w)$

Common methods

Additively separable: $m(x, U) = g(x) + U$

Linear coefficient: $g(x) = x'\beta$, so $Y(x) = 1(x'\beta + U \geq 0)$

Theorem 4.1. Suppose threshold crossing and $X \perp U$. Then $\text{ASF}(x) = \Pr(m(x, U) \geq 0)$ is point identified for all $x \in \mathrm{Supp}(X)$.

Proof. $\text{ASF}(x) = \Pr(m(x, U) \geq 0) = \Pr(m(x, U) \geq 0 \mid X = x) = E[Y \mid X = x]$, which is observed.

How can we extrapolate to $x$ outside the support in a "realistic" way? Identify the structural function $m(\cdot)$.

Theorem 4.2. (1) Suppose $Y(x) = 1(g(x) + U \geq 0)$ and $U \mid X \sim N(\mu, \sigma^2)$. Then $(g, \mu, \sigma)$ is not point identified. (2) Suppose $Y(x) = 1(g(x) + U \geq 0)$ and $U \mid X \sim N(0, 1)$. Then $g$ is point identified.
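A sketch of Theorem 4.2(2) using the standard normal quantile function; the choice probabilities below are hypothetical:

```python
# With U | X ~ N(0,1), P(Y=1 | X=x) = Phi(g(x)), so g(x) = Phi^{-1}(p(x)).
from statistics import NormalDist

phi = NormalDist()                  # standard normal
p_obs = {0: 0.30, 1: 0.75}          # P(Y=1 | X=x), illustrative values

g = {x: phi.inv_cdf(p) for x, p in p_obs.items()}

# Part (1)'s non-identification: (g, sigma) and (c*g, c*sigma) imply
# identical choice probabilities, which is why the normalization is needed.
```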
Random utility model

Suppose that $Y_i(x) = 1(u_{1i}(x) \geq u_{0i}(x))$ and $u_j(x) = v_j(x) + \varepsilon_j = x'\beta_j + \varepsilon_j$.

If $\varepsilon_1 - \varepsilon_0 \perp X$, then $p(x) = P(Y = 1 \mid X = x) = F_{\varepsilon_0 - \varepsilon_1}(x'\beta_1 - x'\beta_0)$

If $F_{\varepsilon_0 - \varepsilon_1}$ is known and strictly increasing, then $x'\beta_1 - x'\beta_0$ is point identified as $F^{-1}_{\varepsilon_0 - \varepsilon_1}(p(x))$

If $E[XX']$ is nonsingular, then $\beta_1 - \beta_0$ is point identified.
Common choices

$(\varepsilon_0, \varepsilon_1)$ are joint normal with known mean and variance

$(\varepsilon_0, \varepsilon_1)$ are i.i.d. EVT1 (type-I extreme value). This implies that $\varepsilon_1 - \varepsilon_0$ is logistically distributed.
IIA: if the $\varepsilon_j$ are i.i.d. EVT1 and $C_1 \subseteq C_2 \subseteq \dots \subseteq C_J = \{1, \dots, J\}$, then for all $j, k \in C_1$ we have
$$\frac{P(y(C_1) = j)}{P(y(C_1) = k)} = \frac{P(y(C_2) = j)}{P(y(C_2) = k)} = \dots = \frac{P(y(C_J) = j)}{P(y(C_J) = k)}$$
But this does not apply to nested logit or dynamic discrete choice (DDC).
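IIA can be verified numerically for the logit case; the systematic utilities below are illustrative:

```python
# With i.i.d. EVT1 shocks, choice probabilities are multinomial logit and
# the ratio P(j)/P(k) is invariant to the choice set.
import math

v = {1: 0.5, 2: 1.0, 3: -0.3}   # systematic utilities v_j (hypothetical)

def logit_prob(j, choice_set):
    den = sum(math.exp(v[k]) for k in choice_set)
    return math.exp(v[j]) / den

ratio_small = logit_prob(1, {1, 2}) / logit_prob(2, {1, 2})
ratio_full = logit_prob(1, {1, 2, 3}) / logit_prob(2, {1, 2, 3})
# Both ratios equal exp(v_1 - v_2), regardless of whether option 3 is available.
```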
Dynamic discrete choice model

The agent considers not only the current choice problem but also a series of future actions to maximize long-term utility. Specifically, the agent chooses $(d_1, d_2, \dots)$ to maximize
$$u(x_1, d_1) + \varepsilon_1(d_1) + E\left[\sum_{t=2}^{\infty} \beta^{t-1}\left[u(x_t, d_t) + \varepsilon_t(d_t)\right]\right]$$
We can derive the value function
$$V_t(x_t, \varepsilon_t) \equiv \max_{d_t, d_{t+1}, \dots}\; u(x_t, d_t) + \varepsilon_t(d_t) + E\left[\sum_{t' > t} \beta^{t' - t}\left[u(x_{t'}, d_{t'}) + \varepsilon_{t'}(d_{t'})\right]\right]$$
It is more convenient to focus on the integrated value function (the value of being in state $x$ at time $t$, prior to the realization of $\varepsilon_t$):
$$\bar{V}_t(x_t) \equiv \int V_t(x_t, \varepsilon_t)\, g(\varepsilon_t)\, d\varepsilon_t$$
The conditional value function (conditional on choosing $d_t$) is given by
$$\dot{V}_t(x_t, d_t) \equiv u(x_t, d_t) + \beta \int \bar{V}_{t+1}(x_{t+1})\, f(x_{t+1} \mid x_t, d_t)\, dx_{t+1}$$
Then the choice probability is
$$P(D_t = d \mid X_t = x) = \int 1\left(\arg\max_{\tilde{d}} \left\{\dot{V}_t(x_t, \tilde{d}) + \varepsilon_t(\tilde{d})\right\} = d\right) g(\varepsilon_t)\, d\varepsilon_t$$
The identification problem is to learn
$$\left(u_t(x, d),\; F_\varepsilon,\; f_t(x' \mid x, d),\; \beta\right)$$
from $p_t(d_t \mid x)$. $F_\varepsilon$ and $\beta$ are usually assumed to be known. $f_t(x' \mid x, d)$ is observed from the data.
Theorem 4.3. If $\varepsilon_t(d)$ is i.i.d. and follows a known distribution $G$, then $\dot{V}_t(x_t, d_t) - \dot{V}_t(x_t, 0)$ is point identified.

For finite-horizon cases, use backward induction. For infinite-horizon cases, use value iteration.
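For the stationary infinite-horizon case with EVT1 shocks, value iteration on the integrated value function can be sketched as follows. The primitives $u$, $f$, and $\beta$ below are illustrative, and the log-sum-exp form of $\bar{V}$ (with the Euler-Mascheroni constant) is the standard closed form under EVT1 shocks.

```python
# Value iteration for a stationary DDC with EVT1 shocks:
#   Vbar(x) = gamma + log( sum_d exp( u(x,d) + beta * E[Vbar(x') | x, d] ) )
import math

GAMMA = 0.5772156649015329            # Euler-Mascheroni constant
beta = 0.9
states, actions = [0, 1], [0, 1]
u = {(x, d): float(x) - 0.5 * d for x in states for d in actions}
f = {(x, d): ([0.8, 0.2] if d == 0 else [0.3, 0.7])  # f(x' | x, d) over states
     for x in states for d in actions}

vbar = {x: 0.0 for x in states}
for _ in range(500):                  # iterate the contraction to a fixed point
    vbar = {x: GAMMA + math.log(sum(
                math.exp(u[x, d] + beta * sum(f[x, d][xp] * vbar[xp] for xp in states))
                for d in actions))
            for x in states}

def ccp(x, d):
    # Conditional choice probabilities implied by the fixed point (logit form)
    vdot = {a: u[x, a] + beta * sum(f[x, a][xp] * vbar[xp] for xp in states)
            for a in actions}
    den = sum(math.exp(vdot[a]) for a in actions)
    return math.exp(vdot[d]) / den
```

The CCPs produced this way are what the identification problem inverts: from observed $p_t(d \mid x)$ back to differences in conditional value functions, as in Theorem 4.3.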