Causal Machine Learning
Zijing Hu
November 29, 2023
Contents
1 The Framework of Causal Inference
 1.1 Application: profit maximisation
2 Influence Functions
3 Two-Step Estimation
 3.1 Standard Two-Step Estimation & Inference
 3.2 Nonparametric/ML Two-Step Estimation & Inference
4 Deep Learning for Individual Heterogeneity
*This note is based on Causal Machine Learning (41917) by Dr. Max Farrell and Dr. Sanjog Misra
1 The Framework of Causal Inference
Estimands
ITE $= Y_i(1) - Y_i(0)$. Not observed and not estimable without very strong assumptions.
ATE $= E[Y_i(1) - Y_i(0)] := \tau$. Overall effects.
CATE $= E[Y_i(1) - Y_i(0) \mid X = x] := \tau(x)$. Effects conditional on individual characteristics.
ATT $= E[Y_i(1) - Y_i(0) \mid T = 1]$. Effects conditional on treatment.
ITT. Intent-to-treat. Very common.
Identification of Causal Effect
Step 1: difference in means
Overlap/positivity assumption
Regularity condition: $E\left[|Y|^{c+\varepsilon} \mid T = 1\right] < \infty$
$$\bar{Y}_1 - \bar{Y}_0 = \frac{1}{n_1}\sum_i y_{1i}t_i - \frac{1}{n_0}\sum_i y_{0i}(1-t_i) = \frac{n}{n_1}\cdot\frac{1}{n}\sum_i y_{1i}t_i - \frac{n}{n_0}\cdot\frac{1}{n}\sum_i y_{0i}(1-t_i)$$
$$\xrightarrow{\text{overlap \& regularity}} P(T=1)^{-1}E[YT] - P(T=0)^{-1}E[Y(1-T)] = E[Y \mid T=1] - E[Y \mid T=0]$$
Step 2: causal effect
Stable unit treatment value assumption (SUTVA): $E[Y \mid T=1] = E[Y(1) \mid T=1]$, $E[Y \mid T=0] = E[Y(0) \mid T=0]$ (consistency and non-interference)
Randomization assumption
Relevant paper: Blake, Nosko, and Tadelis (2015)
$$E[Y \mid T=1] - E[Y \mid T=0] \overset{\text{SUTVA \& consistency}}{=} E[Y(1) \mid T=1] - E[Y(0) \mid T=0] = \text{ATT} + \text{selection bias} \overset{\text{randomization}}{=} \text{ATT}$$
Regression for Causal Effect
Assuming that $Y(t) = \mu(t) + \varepsilon_t$ (SUTVA) and $Y = TY(1) + (1-T)Y(0)$ (consistency), we have
$$Y = Y(0) + T(Y(1) - Y(0)) = \mu_0 + T(\mu_1 - \mu_0) + \varepsilon_0 + T(\varepsilon_1 - \varepsilon_0) = \alpha + \beta T + \varepsilon$$
To ensure identification and unbiased estimation, we additionally need the rank assumption (overlap assumption) and the independence assumption (randomization assumption).
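As a quick check of this regression formulation, here is a minimal simulation sketch (the data-generating process is my own illustration, not from the notes): under randomization, the difference in means and the OLS coefficient on $T$ both recover the ATE.

```python
# Minimal sketch (assumed DGP): under randomization, the difference in
# means and the OLS slope of Y on T both recover the ATE.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
t = rng.binomial(1, 0.5, n)            # randomized treatment
alpha, beta = 1.0, 2.0                 # true intercept and ATE (made up)
y = alpha + beta * t + rng.normal(0, 1, n)

diff_means = y[t == 1].mean() - y[t == 0].mean()

# OLS of y on (1, t): the slope coefficient equals the difference in means
X = np.column_stack([np.ones(n), t])
b = np.linalg.solve(X.T @ X, X.T @ y)

print(diff_means, b[1])                # both are close to beta = 2.0
```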
Covariates
Bad controls: treatment might affect covariates. Include only pre-treatment covariates. Assuming that $Y = TY(1) + (1-T)Y(0)$ and $X = TX(1) + (1-T)X(0)$, we have
$$\begin{aligned} &E[Y \mid X=1, T=1] - E[Y \mid X=1, T=0] \\ &= E[Y(1) \mid X(1)=1, T=1] - E[Y(0) \mid X(0)=1, T=0] \\ &= E[Y(1) \mid X(1)=1] - E[Y(0) \mid X(1)=1] + E[Y(0) \mid X(1)=1] - E[Y(0) \mid X(0)=1] \\ &= E[Y(1) \mid X(1)=1] - E[Y(0) \mid X(1)=1] + \text{selection bias} \end{aligned}$$
Heterogeneity: assuming that $Y(t) = \mu(t, X) + \varepsilon_t$, then
$$Y = \mu(0, X) + T(\mu(1, X) - \mu(0, X)) + \varepsilon = \alpha(X) + \beta(X)\cdot T + \varepsilon = \alpha(X) + \text{CATE}\cdot T + \varepsilon$$
We need to specify structurally the form of heterogeneity. For example:
$$\mu(t, X) = \alpha + \beta X + t\left(\tau + \gamma(X - \bar X)\right)$$
Then, we have
$$Y = \alpha + \beta X + T\tau + T\gamma(X - \bar X) = b_0 + b_1 T + b_2 X + b_3\, T(X - \bar X)$$
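A hedged sketch of the parametric heterogeneity specification above; all parameter values are illustrative assumptions. In the interacted OLS, the coefficient on $T$ recovers $\tau$ and the coefficient on $T(X - \bar X)$ recovers $\gamma$.

```python
# Sketch (made-up parameters): mu(t, X) = alpha + beta*X + t*(tau + gamma*(X - Xbar)).
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
x = rng.normal(0, 1, n)
t = rng.binomial(1, 0.5, n)
alpha, beta, tau, gamma = 0.5, 1.0, 2.0, 0.7
y = alpha + beta * x + t * (tau + gamma * (x - x.mean())) + rng.normal(0, 1, n)

# design: [1, T, X, T*(X - Xbar)]
D = np.column_stack([np.ones(n), t, x, t * (x - x.mean())])
b = np.linalg.solve(D.T @ D, D.T @ y)
print(b)   # approximately [alpha, tau, beta, gamma]
```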
Causal ML
Use ML (especially deep learning) for $\beta(X)$.
We relax the functional form assumption on $\beta$:
$$Y = \alpha(X) + \beta(X)\cdot T + \varepsilon \approx \alpha_i + \text{ITE}_i\cdot T_i$$
so that $\beta(x_i)$ stands in for the individual-level effect.
1.1 Application: profit maximisation
Profit contribution depending on targeting status
$$\pi_i(T) = \pi(T, x_i) = \begin{cases} mY(0, x_i) & \text{if } T = 0 \\ mY(1, x_i) - c & \text{if } T = 1 \end{cases}$$
$m$ is the margin percentage
$c$ is the targeting cost
Easily generalizable to heterogeneous margins and costs
Targeting policy $d: \mathcal{X} \to \{0, 1\}$
Goal: evaluate the expected profit from any targeting policy $d$
$$E[\Pi(d(X))] = \sum_{i=1}^N E\left[\mathbb{1}\{d(X)=0\}\cdot\pi(0,X) + \mathbb{1}\{d(X)=1\}\cdot\pi(1,X) \mid X = x_i\right] = \sum_{i=1}^N E\left[(1-d(X))\cdot\pi(0,X) + d(X)\cdot\pi(1,X) \mid X = x_i\right]$$
Optimal policy
$d^*$ is an optimal policy if it maximizes $E[\Pi(d(X))]$
Assume: $T_i$ does not affect the behavior of any other customer $i' \neq i$ (SUTVA)
Then $d^*$ is optimal if and only if it maximizes the expected profit from each individual customer with features $x_i$:
$$E[(1-d(X))\cdot\pi(0,X) + d(X)\cdot\pi(1,X) \mid X = x_i]$$
We use the inverse probability-weighted targeting profit estimator:
$$\begin{aligned} E\left[\hat\Pi(d(X))\right] &= \sum_{i=1}^N E\left[\frac{1-T}{1-e(X)}(1-d(X))\cdot\pi(0,X) + \frac{T}{e(X)}d(X)\cdot\pi(1,X) \,\Big|\, X = x_i\right] \\ &= \sum_{i=1}^N \left(\frac{1-e(x_i)}{1-e(x_i)}(1-d(x_i))\cdot E[\pi(0,X) \mid X=x_i] + \frac{e(x_i)}{e(x_i)}d(x_i)\cdot E[\pi(1,X) \mid X=x_i]\right) \\ &= \sum_{i=1}^N \left((1-d(x_i))\cdot E[\pi(0,X) \mid X=x_i] + d(x_i)\cdot E[\pi(1,X) \mid X=x_i]\right) \\ &= \sum_{i=1}^N E[(1-d(X))\cdot\pi(0,X) + d(X)\cdot\pi(1,X) \mid X = x_i] = E[\Pi(d(X))] \end{aligned}$$
Optimal policy $d^*$: target a customer if and only if
$$E[\pi(1,X) - \pi(0,X) \mid X = x_i] > 0 \iff E[(mY(1,X) - c) - mY(0,X) \mid X = x_i] > 0 \iff mE[Y(1,X) - Y(0,X) \mid X = x_i] - c > 0$$
Lessons: an optimal targeting policy is based on the incremental effect of targeting, and the optimal policy is based on an estimate of the CATE.
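The following sketch implements the IPW targeting-profit estimator and the resulting $m\cdot\text{CATE}(x) > c$ targeting rule; `e_hat`, `m`, `c`, and `cate_hat` are placeholders for whatever first-stage estimates one has.

```python
# Sketch of the IPW targeting-profit estimator; inputs are assumed estimates.
import numpy as np

def ipw_profit(y, t, e_hat, d, m, c):
    """IPW estimate of average per-customer profit under policy d(x) in {0,1}."""
    pi0 = m * y                        # profit contribution if not targeted
    pi1 = m * y - c                    # profit contribution if targeted
    w0 = (1 - t) / (1 - e_hat)         # weight for untargeted observations
    w1 = t / e_hat                     # weight for targeted observations
    return np.mean((1 - d) * w0 * pi0 + d * w1 * pi1)

def optimal_policy(cate_hat, m, c):
    """Target customer i iff m * CATE(x_i) > c."""
    return (m * cate_hat > c).astype(int)
```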
2 Influence Functions
Motivation
We want to know how the statistic changes when the data changes.
The CLT ensures good properties for many statistics. The CLT applies to averages, and the influence function is exactly what you are averaging:
$$\sqrt n(\hat\alpha - \alpha) = \frac{1}{\sqrt n}\sum_{i=1}^n (y_i - E[Y]) \xrightarrow{d} N(0, \sigma^2)$$
The asymptotic variance is just the variance of the influence function.
Notation
ˆ
β is a function of the data:
ˆ
β :=
ˆ
β(F
n
), with F
n
the distribution of the data.
ˆ
β β, which is also a function of the population “data”: β(F )
If β(F
n
) are draws from F , then β(F ) is defined as what β(F
n
) estimates
How to think about the data changing?
Example 1: Influence of one data point on the statistic α(F )
Suppose that $\hat\alpha = \alpha(F_n) = \frac1n\sum_{i=1}^n y_i$ and $\alpha(F_{n-1}) = \frac{1}{n-1}\sum_{i=1}^{n-1} y_i$ if we delete one data point. Then, the difference is
$$\frac{\alpha(F_n) - \alpha(F_{n-1})}{1/n} = y_n - \alpha(F_{n-1})$$
$1/n$ is the size of the change. This is similar to leave-one-out (LOO) jackknife resampling.
Example 2: Perturbation of the data
Suppose that we corrupt the distribution $F$ and get $F_\varepsilon = (1-\varepsilon)F + \varepsilon G$, where $G$ is a corruption or contamination distribution (usually a point mass). Assume that $\alpha(F) = \int y\,dF$. Then, the influence function is
$$\frac{\alpha(F_\varepsilon) - \alpha(F)}{\varepsilon} = \frac{\int y\,dF_\varepsilon - \int y\,dF}{\varepsilon} = \int y\,dG - \int y\,dF = \int y\,dG - \alpha(F)$$
Example 3: Explicit derivative
$$\frac{\partial}{\partial\varepsilon}\alpha(F_\varepsilon) = \frac{\partial}{\partial\varepsilon}\int y\left[(1-\varepsilon)f + \varepsilon g\right] dy = \int y\,dG - \int y\,dF = \int y\,dG - \alpha(F)$$
However, for more complex functionals, $\varepsilon$ might not cancel in the derivative.
Definition of influence function
The influence function of $\hat\theta$ at $F$, $\psi_{\hat\theta, F}: \mathcal{X} \to \Gamma$, is defined as:
$$\psi_{\hat\theta, F} = \lim_{\epsilon\to 0}\frac{\hat\theta(F_\epsilon) - \hat\theta(F)}{\epsilon}$$
where $F_\epsilon = (1-\epsilon)F + \epsilon G$ and $G$ is an arbitrary distribution.
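A small numeric check of this definition (my own illustration, not from the notes): for $\alpha(F) = E[Y]$, contaminating the sample with a point mass at $y_0$ and differencing reproduces the influence function $y_0 - E[Y]$.

```python
# Numeric check (assumed example): IF of the mean at y0 is y0 - E[Y].
import numpy as np

rng = np.random.default_rng(2)
y = rng.normal(3, 1, 10_000)
y0, eps = 7.0, 1e-4

alpha_F = y.mean()
# alpha(F_eps) for the contaminated distribution (1 - eps) F + eps * delta_{y0}
alpha_Feps = (1 - eps) * y.mean() + eps * y0

psi_numeric = (alpha_Feps - alpha_F) / eps
print(psi_numeric, y0 - alpha_F)     # identical here: the mean is linear in F
```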
Application to OLS
Another route to derive the asymptotics of the OLS estimator:
$$\begin{aligned} \sqrt n\left(\hat\beta - \beta\right) &= \sqrt n\left((X'X)^{-1}X'Y - \beta\right) = \sqrt n (X'X)^{-1}X'(Y - X\beta) \\ &= \sqrt n\left(\frac1n\sum_{i=1}^n x_ix_i'\right)^{-1}\left(\frac1n\sum_{i=1}^n x_i\varepsilon_i\right) \\ &= E[XX']^{-1}\left(\frac{1}{\sqrt n}\sum_{i=1}^n x_i\varepsilon_i\right) + \left[\left(\frac1n\sum_{i=1}^n x_ix_i'\right)^{-1} - E[XX']^{-1}\right]\left(\frac{1}{\sqrt n}\sum_{i=1}^n x_i\varepsilon_i\right) \\ &= \frac{1}{\sqrt n}\sum_{i=1}^n E[XX']^{-1}x_i\varepsilon_i + o_p(1)\cdot O_p(1) \end{aligned}$$
$$AV\left[\hat\beta\right] = V\left[E[XX']^{-1}\frac{1}{\sqrt n}\sum_{i=1}^n x_i\varepsilon_i\right] = E[XX']^{-1}\,V\left[\frac{1}{\sqrt n}\sum_{i=1}^n x_i\varepsilon_i\right]\,E[XX']^{-1} \approx (X'X)^{-1}X'\hat\Sigma X(X'X)^{-1}$$
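This calculation can be verified numerically. The sketch below (illustrative heteroskedastic DGP of my own) confirms that the variance of the plug-in influence function $\psi_i = E[XX']^{-1}x_i\varepsilon_i$ equals the usual sandwich variance.

```python
# Sketch: variance of the OLS influence function equals the sandwich variance.
import numpy as np

rng = np.random.default_rng(3)
n, d = 50_000, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, d - 1))])
beta = np.array([1.0, 2.0, -0.5])
y = X @ beta + rng.normal(size=n) * (1 + 0.5 * np.abs(X[:, 1]))  # heteroskedastic

b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b

M_inv = np.linalg.inv(X.T @ X / n)        # estimate of E[XX']^{-1}
psi = (X * e[:, None]) @ M_inv.T          # row i is (M^{-1} x_i eps_i)'
V_psi = psi.T @ psi / n                   # variance of the influence function

# sandwich for sqrt(n)(b - beta): M^{-1} (mean of eps^2 x x') M^{-1}
Sigma = (X * (e ** 2)[:, None]).T @ X / n
V_sand = M_inv @ Sigma @ M_inv
print(np.allclose(V_psi, V_sand))         # True: the two formulas coincide
```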
Intuition
For any complicated statistic, which is a function of the data, $\hat\Theta = \Theta(F_n)$, if we know that
$$\sqrt n\left(\hat\Theta - \Theta_0\right) = \frac{1}{\sqrt n}\sum_{i=1}^n \psi(z_i) + (s.o.)$$
where $\psi$ is the influence function, and we can find some way to determine the influence function, then we can quickly derive the asymptotic properties of that statistic.
Example: Standard MLE
Data $z_i$, parameter $\theta$, log-likelihood $l(z, \theta)$, and $\theta_0 = \arg\max_\theta E[l(z, \theta)]$; then we have
$$0 \overset{F.O.C.}{=} \sum_{i=1}^n \frac{\partial l(z_i, \hat\theta)}{\partial\theta} = \sum_{i=1}^n \frac{\partial l(z_i, \theta_0)}{\partial\theta} + \sum_{i=1}^n \frac{\partial^2 l(z_i, \theta_0)}{\partial\theta\,\partial\theta'}\left(\hat\theta - \theta_0\right) + (s.o.)$$
$$\sqrt n\left(\hat\theta - \theta_0\right) = -\sqrt n\left[\sum_{i=1}^n \frac{\partial^2 l(z_i, \theta_0)}{\partial\theta\,\partial\theta'}\right]^{-1}\left[\sum_{i=1}^n \frac{\partial l(z_i, \theta_0)}{\partial\theta}\right] + (s.o.) = -\frac{1}{\sqrt n}\sum_{i=1}^n H(\theta_0)^{-1}\frac{\partial l(z_i, \theta_0)}{\partial\theta} + (s.o.)$$
where $H(\theta_0) = E\left[\partial^2 l(z, \theta_0)/\partial\theta\,\partial\theta'\right]$.
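A sketch of this MLE influence function for a logit model (an assumed example, not from the notes): $\psi_i = -H(\theta_0)^{-1}\,\partial l(z_i, \theta_0)/\partial\theta$, with the Hessian and scores estimated by plug-in.

```python
# Sketch: logit MLE by Newton's method, then the plug-in influence function.
import numpy as np

rng = np.random.default_rng(4)
n = 20_000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
theta0 = np.array([0.3, -1.0])
y = rng.binomial(1, 1 / (1 + np.exp(-X @ theta0)))

theta = np.zeros(2)
for _ in range(25):
    p = 1 / (1 + np.exp(-X @ theta))
    score = X.T @ (y - p)                         # gradient of log-likelihood
    H = -(X * (p * (1 - p))[:, None]).T @ X       # Hessian of log-likelihood
    theta -= np.linalg.solve(H, score)            # Newton step

p = 1 / (1 + np.exp(-X @ theta))
scores_i = X * (y - p)[:, None]                   # per-observation scores
H_bar = -(X * (p * (1 - p))[:, None]).T @ X / n   # estimate of E[Hessian]
psi = -scores_i @ np.linalg.inv(H_bar).T          # influence function values
print(psi.T @ psi / n / n)                        # approx Var(theta_hat)
```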
3 Two-Step Estimation
3.1 Standard Two-Step Estimation & Inference
Key idea with observational data: $X$ captures why people select.
$$E[Y(1) - Y(0)] = E[E[Y(1) - Y(0) \mid X = x]] = E[\text{CATE}(x)]$$
Imputation
$$E[Y \mid T=1, X=x] = E[Y(1) \mid T=1, X=x] = E[Y(1) \mid X=x]$$
Inverse propensity weighting (IPW)
$$E\left[\frac{TY}{p(X)} \,\Big|\, X=x\right] = E\left[\frac{TY(1)}{p(X)} \,\Big|\, X=x\right] = \frac{E[Y(1) \mid X=x]\,E[T \mid X=x]}{p(x)} = E[Y(1) \mid X=x]$$
where $p(x) = P[T=1 \mid X=x]$ and $0 < c \le p(x) \le d < 1$, for fixed $c, d$.
Two-step estimation
$$Y = \alpha(X) + \beta(X)T + \varepsilon, \qquad \tau = E[\text{CATE}(x)] = E[\beta(X)]$$
1. Estimate $\alpha(X)$ and $\beta(X)$
2. Use these to estimate $\tau = E[\beta(X)]$
Example: Linear Models
$$\mu_t(x) = E[Y(t) \mid X=x] = x'\beta_t, \qquad \text{CATE}(x) = \beta(x) = \tau(x) = x'\beta_1 - x'\beta_0$$
Imputation
1. Estimate $\beta_t$: run a regression of $Y$ on $X$ in the $T = t$ subsample and get $\hat\beta_t$
2. Plug in $\hat\beta_t$: compute $\frac1n\sum_i x_i'\hat\beta_t$, and $\hat\tau = \frac1n\sum_i x_i'(\hat\beta_1 - \hat\beta_0)$
IPW is also a two-step estimator
1. Run a logit regression of $T$ on $X$ and get $\hat p(x_i)$
2. Plug in: $\frac1n\sum_i \frac{t_i y_i}{\hat p(x_i)}$
Sources of uncertainty (for inference)
Motivation: why not bootstrap? Too complex: we might need at least an outer loop to bootstrap $\tau$ and an inner loop to bootstrap $\beta_t$, which is almost impossible for complex models.
Suppose that we have $\widehat{E[Y(1)]} = E_{X\sim F_n}\left[X'\hat\beta(F_n)\right]$. As the data changes, $X'\hat\beta_t$ changes twice:
1. Data uncertainty: suppose that $\beta_t$ is fixed; when the data change, $x_i'\hat\beta_t$ and $\widehat{E[Y(1)]}$ change
2. Model uncertainty: when the data change, the function $\hat\beta_t(F_n)$ changes
$$\begin{aligned} \sqrt n\left(\widehat{E[Y(1)]} - E[Y(1)]\right) &= \sqrt n\left(\frac1n\sum_{i=1}^n x_i'\hat\beta_1 - E[Y(1)]\right) = \sqrt n\left(\frac1n\sum_{i=1}^n x_i'\hat\beta_1 - E[X'\beta_1]\right) \\ &= \frac{1}{\sqrt n}\sum_{i=1}^n\left(x_i'\beta_1 - E[X'\beta_1]\right) + \left(\frac1n\sum_{i=1}^n x_i'\right)\sqrt n\left(\hat\beta_1 - \beta_1\right) \\ &= \frac{1}{\sqrt n}\sum_{i=1}^n\left\{\left(x_i'\beta_1 - E[X'\beta_1]\right) + E[X]'M_1^{-1}t_ix_i\varepsilon_i\right\} + (s.o.) \end{aligned}$$
The first term is centered and scaled by $\sqrt n$, similar to a sample average (we can use the CLT). The second term captures $\hat\beta_1 \neq \beta_1$, showing that the first-step noise cannot be ignored when you go to the second step. This matters for inference. We plugged in the influence function here:
$$\sqrt n\left(\hat\beta_1 - \beta_1\right) = \frac{1}{\sqrt n}\sum_{i=1}^n E[TXX']^{-1}t_ix_i\varepsilon_i = \frac{1}{\sqrt n}\sum_{i=1}^n M_1^{-1}t_ix_i\varepsilon_i$$
The inference process becomes:
1. Run a regression of $Y$ on $X$ in $T = 1$
2. Compute $\frac1n\sum_i x_i'\hat\beta_1$
3. Inference: figure out the influence function
However, the second term is too complex, almost infeasible for models like DNNs. So instead, we use the following process for inference:
1. Run a regression of $Y$ on $X$ in $T = 1$ and estimate $\hat\beta$, $\hat\varepsilon$, and $\hat M$
2. Plug these estimates into the average
The cost is that we need to do more in the first two steps and have to know the form of the influence function. In practice, we will use automatic differentiation in the second step.
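A sketch of this plug-in recipe for $\widehat{E[Y(1)]}$ with a linear first step (simplified, illustrative DGP where the $Y(1)$ model holds for everyone): estimate $\hat\beta$, $\hat\varepsilon$, $\hat M$, assemble the influence function, and read off a standard error.

```python
# Sketch: plug-in influence-function inference for E[Y(1)], linear first step.
import numpy as np

rng = np.random.default_rng(6)
n = 50_000
x = rng.normal(0, 1, n)
t = rng.binomial(1, 1 / (1 + np.exp(-0.5 * x)))
X = np.column_stack([np.ones(n), x])
y = X @ np.array([1.0, 2.0]) + rng.normal(0, 1, n)   # simplified Y(1) model

# Step 1: regression of Y on X in T = 1 gives beta_hat, eps_hat, M_hat
b1 = np.linalg.lstsq(X[t == 1], y[t == 1], rcond=None)[0]
eps = y - X @ b1                                     # residuals (used where t=1)
M_hat = (X * t[:, None]).T @ X / n                   # estimate of E[T XX']

# Step 2: plug everything into the influence function
ey1_hat = (X @ b1).mean()
corr = (X * (t * eps)[:, None]) @ np.linalg.inv(M_hat) @ X.mean(axis=0)
psi = (X @ b1 - ey1_hat) + corr                      # psi_i, mean approx 0
se = psi.std(ddof=1) / np.sqrt(n)                    # standard error of ey1_hat
print(f"{ey1_hat:.3f} +/- {1.96 * se:.3f}")
```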
Connection to influence function
$$\widehat{E[Y(1)]} = E_{X\sim F_n}\left[X'\hat\beta(F_n)\right] = \mu_1\left(F_n, \hat\beta(F_n)\right) = \mu_1(F_n, z)\Big|_{z=\hat\beta(F_n)}$$
$$\frac{\partial}{\partial\varepsilon}\mu_1(F_\varepsilon)\Big|_{\varepsilon=0} = \underbrace{\frac{\partial}{\partial\varepsilon}\mu_1\left(F_\varepsilon, \hat\beta(F_n)\right)}_{\text{data uncertainty}} + \underbrace{\frac{\partial\mu_1}{\partial\hat\beta(F_n)}\,\frac{\partial\hat\beta(F_\varepsilon)}{\partial\varepsilon}}_{\text{model uncertainty}} = \left(x_i'\beta_1 - E[X'\beta_1]\right) + E[X]'M_1^{-1}t_ix_i\varepsilon_i$$
recovering the same two terms as in the expansion above.
Doubly Robust Estimation
$$\widehat{E[Y(1)]}_{DR} = \frac1n\sum_{i=1}^n\left(\hat\mu_1(x_i) + \frac{t_i\left(y_i - \hat\mu_1(x_i)\right)}{\hat p(x_i)}\right)$$
We require only that $\hat\mu_1$ or $\hat p$ is correct (or that both are very close to the truth).
Similar intuition: the second step is less sensitive to the first step.
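A minimal sketch of the estimator above; `mu1_hat` and `p_hat` stand in for any fitted first-stage models.

```python
# Sketch of the doubly robust estimator of E[Y(1)].
import numpy as np

def doubly_robust_ey1(y, t, mu1_hat, p_hat):
    """Imputation plus an IPW correction term.

    Consistent if mu1_hat OR p_hat is correct (hence 'doubly robust')."""
    return np.mean(mu1_hat + t * (y - mu1_hat) / p_hat)
```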
3.2 Nonparametric/ML Two-Step Estimation & Inference
Variance-bias decomposition
Suppose that $f(x)$ is the true function, $f_n(x)$ is the approximating sequence, and $\hat f(x)$ is the estimate. We have
$$|f(x) - \hat f(x)| \le \underbrace{|f(x) - f_n(x)|}_{\text{approximation bias}} + \underbrace{|f_n(x) - \hat f(x)|}_{\text{variance}}$$
Example: degree-1 polynomial in $J$ (quantile) bins
Let $f_n(x) = \beta_0 + \beta_1 x$ within a bin $x \in [\underline x, \bar x]$, with $\beta_1 = f'(\underline x)$ and $\beta_0 = f(\underline x) - f'(\underline x)\underline x$ (a first-order Taylor expansion at $\underline x$), which is accurate if the number of bins is large enough. We can compute the bias and variance as
Bias: $|f(x) - f_n(x)| = |f(x) - \beta_0 - \beta_1 x| = \left|\frac{f''(\tilde x)}{2}(x - \underline x)^2\right| = O(J^{-2})$
Variance: using the influence function we have $|f_n(x) - \hat f(x)| = O_p\left(\sqrt{J^d/n}\right)$ given $x \in \mathbb{R}^d$
Approximating a smooth function
Degree-$K$ polynomial in $J$ bins: $\sqrt{\frac{J^d}{n}} + J^{-(K+1)}$
Kernels of order $P$: $\sqrt{\frac{1}{nh^d}} + h^P$
Series: $\sqrt{\frac{K}{n}} + K^{-\alpha}$
In general:
$$\text{Var} = \frac{\#\text{ of params}}{n}, \qquad \text{Bias} = (\#\text{ of params})^{-\text{smoothness}}$$
Farrell, Liang, and Misra (2021, Econometrica) show that
$$\left|\hat f_{DNN}(x) - f(x)\right| = O_p\left(\sqrt{\frac{W\times L\,\log(W)\log(n)}{n}} + \epsilon_n\right)$$
where $W$ is the number of parameters, $L$ is the number of layers, and $\epsilon_n$ is the approximation bias (depends on the architecture). Theoretically this is not as good as traditional nonparametric rates, but in practice it is very powerful: given the adaptive ability of DNNs, the true error tends to be smaller than the theoretical bound. Further, we have
$$W = (d+1)H_1 + \sum_{l=2}^L (H_{l-1}+1)H_l + (H_L+1) \overset{H_l \equiv H}{\asymp} H^2 L$$
$$\epsilon_n \asymp \left(WL\log(W)\right)^{-\text{smoothness}/(2\times\dim)} \asymp \left(H^2L^2\log(H^2L)\right)^{-\text{smoothness}/(2\times\dim)}$$
This indicates that even if the number of parameters is fixed, a deeper neural network gives smaller bias.
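A small sketch of the parameter-count formula above for a fully connected network with scalar output; the layer widths are illustrative choices.

```python
# Sketch of W = (d+1)H_1 + sum_{l=2}^L (H_{l-1}+1)H_l + (H_L+1).
def num_params(d, hidden):
    """Parameter count for a fully connected net with input dim d."""
    W = (d + 1) * hidden[0]                       # input layer (+1 for bias)
    for l in range(1, len(hidden)):
        W += (hidden[l - 1] + 1) * hidden[l]      # hidden-to-hidden layers
    return W + hidden[-1] + 1                     # scalar output layer

print(num_params(d=10, hidden=[32, 32, 32]))      # grows like H^2 * L
```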
Inference challenges for nonparametric/ML methods
Two-step estimation fails for nonparametric/ML methods. Take a binned nonparametric model as an example:
$$\mu_n(x_i) = \sum_{j=1}^J \mathbb{1}\{x_i \in b_j\}\,x_i'\beta_j$$
To conduct inference, we compute
$$\widehat{E[Y(1)]} - E[Y(1)] = \frac1n\sum_i \hat\mu(x_i) - E[Y(1)] = \left[\frac1n\sum_i \hat\mu(x_i) - \frac1n\sum_i \mu_n(x_i)\right] + \left[\frac1n\sum_i \mu_n(x_i) - \frac1n\sum_i \mu(x_i)\right] + \left[\frac1n\sum_i \mu(x_i) - E[Y(1)]\right]$$
Then,
$$\sqrt n\left[\frac1n\sum_i \hat\mu(x_i) - \frac1n\sum_i \mu_n(x_i)\right] = \left(\frac1n\sum_i x_i'\right)\sum_{j=1}^J \sqrt n\left(\hat\beta_j - \beta_{n,j}\right)\mathbb{1}\{x_i \in b_j\} \approx \sqrt n\cdot\sqrt{\frac{J}{n}} = \sqrt J$$
$$\sqrt n\left[\frac1n\sum_i \mu_n(x_i) - \frac1n\sum_i \mu(x_i)\right] = \left(nJ^{-4}\right)^{1/2} = o_p(1) \text{ if } J \text{ grows very fast}$$
Plugging in to the original formula,
$$\sqrt n\left[\widehat{E[Y(1)]} - E[Y(1)]\right] = \underbrace{\left(nJ^{-4}\right)^{1/2}}_{\text{ignorable}} + \underbrace{\sqrt J}_{\text{goes to } +\infty} + \underbrace{\sqrt n\left[\frac1n\sum_i \mu(x_i) - E[Y(1)]\right]}_{\text{CLT}}$$
This shows that the variance would be very large if we use a flexible model in the first stage and let $J$ grow fast enough to reduce bias. Thus, inference would be problematic.
We use the influence function to deal with this issue.
Given that $\theta_\varepsilon(x) = E_\varepsilon[Y \mid T=1, X=x] = \int y\,dF_\varepsilon(y; x) = \int y\,d\left[(1-\varepsilon)F_{Y|X}(y; x) + \varepsilon G(y; x)\right]$ and $\mu(F) = \int \theta(x)\,dF_X$, we can compute the influence function:
$$\begin{aligned} \psi(x_i) &= \lim_{\varepsilon\to 0}\frac{\partial}{\partial\varepsilon}\mu(F_\varepsilon) = \lim_{\varepsilon\to 0}\frac{\partial}{\partial\varepsilon}\int \theta_\varepsilon(x)\,dF_\varepsilon(x) \\ &= \lim_{\varepsilon\to 0}\frac{\partial}{\partial\varepsilon}\int \theta_\varepsilon(x)\,d\left[(1-\varepsilon)F_X(x) + \varepsilon G(x)\right] \\ &= \int \theta_\varepsilon(x)\,d\left[G(x) - F_X(x)\right] + \int \frac{\partial}{\partial\varepsilon}\theta_\varepsilon(x)\,dF_X(x) \\ &= \theta(x_i) - E[Y(1)] + E\left[\frac{\partial}{\partial\varepsilon}\theta_\varepsilon(x)\right] \end{aligned}$$
How can we get the last term? Based on the moment condition
$$E_\varepsilon\left[T\left(Y - \theta_\varepsilon(X)\right) \mid X = x_i\right] = 0,$$
we have
$$\begin{aligned} 0 &= \frac{\partial}{\partial\varepsilon}\int t\left(y - \theta_\varepsilon(x)\right)dF_\varepsilon(y, t; x_i) \\ &= \int t\left(y - \theta_\varepsilon(x)\right)\frac{\partial}{\partial\varepsilon}dF_\varepsilon(y, t; x_i) - \int t\,\frac{\partial}{\partial\varepsilon}\theta_\varepsilon(x)\,dF_\varepsilon(y, t; x_i) \\ &= \int t\left(y - \theta_\varepsilon(x)\right)d\left[G(y, t; x_i) - F_{YT|X}(y, t; x_i)\right] - \frac{\partial}{\partial\varepsilon}\theta_\varepsilon(x)\int t\,dF_\varepsilon(y, t; x_i) \\ &= t_i\left(y_i - \theta(x_i)\right) - \underbrace{E[t(y - \theta(x))]}_{=0 \text{ by F.O.C.}} - \frac{\partial}{\partial\varepsilon}\theta_\varepsilon(x)\,E[T \mid X = x_i] \\ &= t_i\left(y_i - \theta(x_i)\right) - \frac{\partial}{\partial\varepsilon}\theta_\varepsilon(x)\,p(x_i) \end{aligned}$$
Therefore,
$$\frac{\partial}{\partial\varepsilon}\theta_\varepsilon(x) = \frac{t_i\left(y_i - \theta(x_i)\right)}{p(x_i)}$$
Now we can plug this result into the influence function:
$$\psi(x_i) = \theta(x_i) - E[Y(1)] + \frac{t_i\left(y_i - \theta(x_i)\right)}{p(x_i)}$$
and instead of using $\widehat{E[Y(1)]} = \frac1n\sum_{i=1}^n \hat\theta(x_i)$, whose variance blows up as shown previously, we use
$$\widehat{E[Y(1)]} = \frac1n\sum_{i=1}^n\left(\hat\theta(x_i) + \frac{t_i\left(y_i - \hat\theta(x_i)\right)}{\hat p(x_i)}\right)$$
In the first step we estimate $\hat\theta(x_i)$ and $\hat p(x_i)$ nonparametrically, and in the second step we just take the average. This procedure coincides with the doubly robust estimator.
$$\begin{aligned} &\sqrt n\left(\frac1n\sum_{i=1}^n\left(\hat\theta(x_i) + \frac{t_i(y_i - \hat\theta(x_i))}{\hat p(x_i)}\right) - E[Y(1)]\right) \\ &= \sqrt n\left(\frac1n\sum_{i=1}^n\left(\hat\theta(x_i) + \frac{t_i(y_i - \theta(x_i))}{\hat p(x_i)} + \frac{t_i(\theta(x_i) - \hat\theta(x_i))}{\hat p(x_i)}\right) - E[Y(1)]\right) \\ &= \sqrt n\left(\frac1n\sum_{i=1}^n\left(\theta(x_i) + \frac{t_i(y_i - \theta(x_i))}{p(x_i)}\right) - E[Y(1)]\right) + R_1 + R_2 + R_3 \end{aligned}$$
where
$$R_1 = \sqrt n\cdot\frac1n\sum_{i=1}^n t_i\left(y_i - \theta(x_i)\right)\left(\frac{1}{\hat p(x_i)} - \frac{1}{p(x_i)}\right)$$
$$R_2 = \sqrt n\cdot\frac1n\sum_{i=1}^n\left(\hat\theta(x_i) - \theta(x_i)\right)\left(\hat p(x_i) - p(x_i)\right)\frac{t_i}{\hat p(x_i)p(x_i)}$$
$$R_3 = \sqrt n\cdot\frac1n\sum_{i=1}^n\left(\hat\theta(x_i) - \theta(x_i)\right)\cdot\frac{p(x_i) - t_i}{p(x_i)}$$
The variance of $R_1$ is given by
$$Var(R_1) = \frac1n\sum_{i=1}^n\left(\frac{1}{\hat p(x_i)} - \frac{1}{p(x_i)}\right)^2 V[y_i \mid x_i] \to 0$$
For $R_3$, getting the asymptotic behavior of the variance is tricky. We have $E[t_i \mid x_i] = p(x_i)$, but because $\hat\theta$ is estimated from the same data, $\hat\theta(x_i) - \theta(x_i)$ is correlated with $t_i$, so $E[(\hat\theta(x_i) - \theta(x_i))(p(x_i) - t_i) \mid x_i] = 0$ might not hold. To break the connection with $t_i$, we need to use the sample-splitting trick. Specifically, we first estimate $\hat\theta(x_i)$ on part of the data, and compute $\frac1n\sum_{i\in C}\hat\theta(x_i)$ on the rest of the data. Then, we have
$$E\left[\left(\hat\theta(x_i) - \theta(x_i)\right)\cdot\left(p(x_i) - t_i\right) \,\big|\, x_i\right] = \left(\hat\theta(x_i) - \theta(x_i)\right)\cdot E\left[\left(p(x_i) - t_i\right) \mid x_i\right] = 0$$
For $R_2$, we have
$$\sqrt n\cdot\frac1n\sum_{i=1}^n\left(\hat\theta(x_i) - \theta(x_i)\right)\left(\hat p(x_i) - p(x_i)\right) \le \sqrt n\left(\frac1n\sum_{i=1}^n\left(\hat\theta(x_i) - \theta(x_i)\right)^2\right)^{1/2}\left(\frac1n\sum_{i=1}^n\left(\hat p(x_i) - p(x_i)\right)^2\right)^{1/2} \asymp \sqrt n\left(\sqrt{\frac{J^d}{n}} + J^{-2}\right)\left(\sqrt{\frac{J^d}{n}} + J^{-2}\right)$$
So we just need $J^d/\sqrt n \to 0$.
The influence function (with sample splitting) is less sensitive to the first-step estimation (this is the doubly robust structure at work), so the variance won't blow up.
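A sketch of the sample-splitting (cross-fitting) procedure, here with two folds; `fit_theta` and `fit_p` are placeholders for arbitrary first-stage learners (assumptions, not the course's exact implementation).

```python
# Sketch of cross-fitting for the doubly robust estimate of E[Y(1)].
import numpy as np

def crossfit_ey1(y, t, x, fit_theta, fit_p, n_folds=2, seed=0):
    """Fit nuisances on one fold, average the influence function on the
    other, then swap; returns the estimate and its standard error."""
    rng = np.random.default_rng(seed)
    folds = rng.integers(0, n_folds, size=len(y))
    psi = np.empty(len(y))
    for k in range(n_folds):
        train, hold = folds != k, folds == k
        theta_hat = fit_theta(x[train], y[train], t[train])  # x -> E[Y|T=1,x]
        p_hat = fit_p(x[train], t[train])                    # x -> P(T=1|x)
        th, p = theta_hat(x[hold]), p_hat(x[hold])
        psi[hold] = th + t[hold] * (y[hold] - th) / p
    return psi.mean(), psi.std(ddof=1) / np.sqrt(len(y))
```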
More general influence function (if we want to do inference for something other than the ATE):
$$\mu(F) = \mu = E[H(X, \theta(X))], \qquad \theta_0 = \arg\max_\theta \sum_{i=1}^n l\left(y_i, t_i, \theta(x_i)\right)$$
has influence function
$$H(x_i, \theta(x_i)) - \mu - \left(\partial_\theta H\right)E\left[l_{\theta\theta} \mid x_i\right]^{-1}l_\theta\left(y_i, t_i, \theta(x_i)\right)$$
We can use an automatic differentiation engine to compute this.
4 Deep Learning for Individual Heterogeneity
Consider a utility function $U = \alpha + \beta T + \varepsilon$ where $T$ is some treatment (for our example, say a targeted price). Then a "structural" choice model with heterogeneity and the usual extreme value Type I error gives
$$P(y_i = 1 \mid x_i, t_i) = \frac{\exp(\alpha_i + \beta_i t_i)}{1 + \exp(\alpha_i + \beta_i t_i)}$$
Change this to
$$P(y_i = 1 \mid x_i, t_i) = \frac{\exp\left(\alpha_{DNN}(x_i) + \beta_{DNN}(x_i)\,t_i\right)}{1 + \exp\left(\alpha_{DNN}(x_i) + \beta_{DNN}(x_i)\,t_i\right)}$$
and estimate
$$\hat\theta = \arg\max_{\theta\in\mathcal{F}_{DNN}}\frac1n\sum_i l\left(y_i, t_i, \theta(x_i)\right)$$
This retains the structural interpretation completely: $\beta(x)$ is still the price effect, and we can still use the usual tricks for WTP, elasticity, surplus, etc. In the first stage we use a DNN to estimate $\beta$; in the second stage we can do inference for these economic outcomes by computing the standard asymptotic variance of the influence function.
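A minimal PyTorch sketch of the first stage under the logit specification above; the architecture and training loop are illustrative choices, not the course's exact implementation. `x`, `t`, `y` are assumed to be float tensors, with `y` containing 0/1 outcomes.

```python
# Sketch: structural logit with DNN heads alpha(x) and beta(x).
import torch
import torch.nn as nn

class StructuralLogit(nn.Module):
    def __init__(self, d, hidden=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(d, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2),        # two heads: alpha(x), beta(x)
        )

    def forward(self, x, t):
        ab = self.body(x)
        alpha, beta = ab[:, 0], ab[:, 1]
        return alpha + beta * t          # logit index alpha(x) + beta(x) * t

def train(model, x, t, y, epochs=200, lr=1e-3):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.BCEWithLogitsLoss()     # negative Bernoulli log-likelihood
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(x, t), y)
        loss.backward()
        opt.step()
    return model
```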
Sample-splitting trick: we need to split the data into three parts. Usually parts A and B should be larger than C because fitting a DNN is harder than taking an average.
Part A for estimating $\theta(x)$ using a DNN
Part B for estimating $\Lambda(x) = E[l_{\theta\theta} \mid x]$ (possibly also using a DNN)
Part C for averaging the influence function
Advantages of this method:
If there is heterogeneity, we cannot get an accurate estimate of $\beta$ by simply regressing the outcome on the treatment (unless under full randomization).
While we cannot derive a confidence interval for $\beta(x)$ itself, we can use $\beta(x)$ to do targeting and compute a confidence interval for $E[\Pi(\text{targeting policy})]$.