Causal Machine Learning
Zijing Hu
November 29, 2023
Contents
1 The Framework of Causal Inference
 1.1 Application: profit maximisation
2 Influence Functions
3 Two-Step Estimation
 3.1 Standard Two-Step Estimation & Inference
 3.2 Nonparametric/ML Two-Step Estimation & Inference
4 Deep Learning for Individual Heterogeneity
*This note is based on Causal Machine Learning (41917) by Dr. Max Farrell and Dr. Sanjog Misra
1 The Framework of Causal Inference
Estimands
ITE $= Y_i(1) - Y_i(0)$. Not observed and not estimable without very strong assumptions.
ATE $= E[Y_i(1) - Y_i(0)] := \tau$. Overall effects.
CATE $= E[Y_i(1) - Y_i(0) \mid X = x] := \tau(x)$. Effects conditional on individual characteristics.
ATT $= E[Y_i(1) - Y_i(0) \mid T = 1]$. Effects conditional on treatment.
ITT. Intent-to-treat. Very common.
Identification of Causal Effect
Step 1: difference in means
Overlap/positivity assumption
Regularity condition: $E\left[|Y|^{c+\varepsilon} \mid T = 1\right] < \infty$
$$\bar{Y}_1 - \bar{Y}_0 = \frac{1}{n_1}\sum_i y_{1i}t_i - \frac{1}{n_0}\sum_i y_{0i}(1-t_i) = \frac{n}{n_1}\cdot\frac{1}{n}\sum_i y_{1i}t_i - \frac{n}{n_0}\cdot\frac{1}{n}\sum_i y_{0i}(1-t_i)$$
$$\xrightarrow{\text{overlap \& regularity}} P(T=1)^{-1}E[YT] - P(T=0)^{-1}E[Y(1-T)] = E[Y \mid T=1] - E[Y \mid T=0]$$
Step 2: causal effect
Stable unit treatment value assumption (SUTVA): $E[Y \mid T=1] = E[Y(1) \mid T=1]$, $E[Y \mid T=0] = E[Y(0) \mid T=0]$ (consistency and non-interference)
Randomization assumption
Relevant paper: Blake, Nosko, and Tadelis (2015)
$$E[Y \mid T=1] - E[Y \mid T=0] \overset{\text{SUTVA \& consistency}}{=} E[Y(1) \mid T=1] - E[Y(0) \mid T=0] = \text{ATT} + \text{selection bias} \overset{\text{randomization}}{=} \text{ATT}$$
Regression for Causal Effect
Assuming that $Y(t) = \mu(t) + \varepsilon_t$ (SUTVA) and $Y = TY(1) + (1-T)Y(0)$ (consistency), we have
$$Y = Y(0) + T(Y(1) - Y(0)) = \mu_0 + T(\mu_1 - \mu_0) + \varepsilon_0 + T(\varepsilon_1 - \varepsilon_0) = \alpha + \beta T + \varepsilon$$
To ensure identification and unbiased estimation, we additionally need the rank assumption (overlap assumption) and the independence assumption (randomization assumption).
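As a quick check of this regression formulation, here is a minimal simulation sketch (the data-generating process is my own illustration, not from the notes): under randomization, the difference in means and the OLS coefficient on $T$ both recover the ATE.

```python
# Minimal sketch (assumed DGP): under randomization, the difference in
# means and the OLS slope of Y on T both recover the ATE.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
t = rng.binomial(1, 0.5, n)            # randomized treatment
alpha, beta = 1.0, 2.0                 # true intercept and ATE (made up)
y = alpha + beta * t + rng.normal(0, 1, n)

diff_means = y[t == 1].mean() - y[t == 0].mean()

# OLS of y on (1, t): the slope coefficient equals the difference in means
X = np.column_stack([np.ones(n), t])
b = np.linalg.solve(X.T @ X, X.T @ y)

print(diff_means, b[1])                # both are close to beta = 2.0
```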
Covariates
Bad controls: treatment might affect covariates. Include only pre-treatment covariates. Assuming that $Y = TY(1) + (1-T)Y(0)$ and $X = TX(1) + (1-T)X(0)$, we have
$$\begin{aligned} &E[Y \mid X=1, T=1] - E[Y \mid X=1, T=0] \\ &= E[Y(1) \mid X(1)=1, T=1] - E[Y(0) \mid X(0)=1, T=0] \\ &= E[Y(1) \mid X(1)=1] - E[Y(0) \mid X(1)=1] + E[Y(0) \mid X(1)=1] - E[Y(0) \mid X(0)=1] \\ &= E[Y(1) \mid X(1)=1] - E[Y(0) \mid X(1)=1] + \text{selection bias} \end{aligned}$$
Heterogeneity: assuming that $Y(t) = \mu(t, X) + \varepsilon_t$, then
$$Y = \mu(0, X) + T(\mu(1, X) - \mu(0, X)) + \varepsilon = \alpha(X) + \beta(X)\cdot T + \varepsilon = \alpha(X) + \text{CATE}\cdot T + \varepsilon$$
We need to specify structurally the form of heterogeneity. For example:
$$\mu(t, X) = \alpha + \beta X + t\left(\tau + \gamma(X - \bar X)\right)$$
Then, we have
$$Y = \alpha + \beta X + T\tau + T\gamma(X - \bar X) = b_0 + b_1 T + b_2 X + b_3\, T(X - \bar X)$$
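A hedged sketch of the parametric heterogeneity specification above; all parameter values are illustrative assumptions. In the interacted OLS, the coefficient on $T$ recovers $\tau$ and the coefficient on $T(X - \bar X)$ recovers $\gamma$.

```python
# Sketch (made-up parameters): mu(t, X) = alpha + beta*X + t*(tau + gamma*(X - Xbar)).
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
x = rng.normal(0, 1, n)
t = rng.binomial(1, 0.5, n)
alpha, beta, tau, gamma = 0.5, 1.0, 2.0, 0.7
y = alpha + beta * x + t * (tau + gamma * (x - x.mean())) + rng.normal(0, 1, n)

# design: [1, T, X, T*(X - Xbar)]
D = np.column_stack([np.ones(n), t, x, t * (x - x.mean())])
b = np.linalg.solve(D.T @ D, D.T @ y)
print(b)   # approximately [alpha, tau, beta, gamma]
```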
Causal ML
Use ML (especially deep learning) for $\beta(X)$.
We relax the functional form assumption on $\beta$:
$$Y = \alpha(X) + \beta(X)\cdot T + \varepsilon \approx \alpha_i + \text{ITE}_i\cdot T_i$$
so that $\beta(x_i)$ stands in for the individual-level effect.
1.1 Application: profit maximisation
Profit contribution depending on targeting status
$$\pi_i(T) = \pi(T, x_i) = \begin{cases} mY(0, x_i) & \text{if } T = 0 \\ mY(1, x_i) - c & \text{if } T = 1 \end{cases}$$
$m$ is the margin percentage
$c$ is the targeting cost
Easily generalizable to heterogeneous margins and costs
Targeting policy $d: \mathcal{X} \to \{0, 1\}$
Goal: evaluate the expected profit from any targeting policy $d$
$$E[\Pi(d(X))] = \sum_{i=1}^N E\left[\mathbb{1}\{d(X)=0\}\cdot\pi(0,X) + \mathbb{1}\{d(X)=1\}\cdot\pi(1,X) \mid X = x_i\right] = \sum_{i=1}^N E\left[(1-d(X))\cdot\pi(0,X) + d(X)\cdot\pi(1,X) \mid X = x_i\right]$$
Optimal policy
$d^*$ is an optimal policy if it maximizes $E[\Pi(d(X))]$
Assume: $T_i$ does not affect the behavior of any other customer $i' \neq i$ (SUTVA)
Then $d^*$ is optimal if and only if it maximizes the expected profit from each individual customer with features $x_i$:
$$E[(1-d(X))\cdot\pi(0,X) + d(X)\cdot\pi(1,X) \mid X = x_i]$$
We use the inverse probability-weighted targeting profit estimator:
$$\begin{aligned} E\left[\hat\Pi(d(X))\right] &= \sum_{i=1}^N E\left[\frac{1-T}{1-e(X)}(1-d(X))\cdot\pi(0,X) + \frac{T}{e(X)}d(X)\cdot\pi(1,X) \,\Big|\, X = x_i\right] \\ &= \sum_{i=1}^N \left(\frac{1-e(x_i)}{1-e(x_i)}(1-d(x_i))\cdot E[\pi(0,X) \mid X=x_i] + \frac{e(x_i)}{e(x_i)}d(x_i)\cdot E[\pi(1,X) \mid X=x_i]\right) \\ &= \sum_{i=1}^N \left((1-d(x_i))\cdot E[\pi(0,X) \mid X=x_i] + d(x_i)\cdot E[\pi(1,X) \mid X=x_i]\right) \\ &= \sum_{i=1}^N E[(1-d(X))\cdot\pi(0,X) + d(X)\cdot\pi(1,X) \mid X = x_i] = E[\Pi(d(X))] \end{aligned}$$
Optimal policy $d^*$: target a customer if and only if
$$E[\pi(1,X) - \pi(0,X) \mid X = x_i] > 0 \iff E[(mY(1,X) - c) - mY(0,X) \mid X = x_i] > 0 \iff mE[Y(1,X) - Y(0,X) \mid X = x_i] - c > 0$$
Lessons: an optimal targeting policy is based on the incremental effect of targeting, and the optimal policy is based on an estimate of the CATE.
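The following sketch implements the IPW targeting-profit estimator and the resulting $m\cdot\text{CATE}(x) > c$ targeting rule; `e_hat`, `m`, `c`, and `cate_hat` are placeholders for whatever first-stage estimates one has.

```python
# Sketch of the IPW targeting-profit estimator; inputs are assumed estimates.
import numpy as np

def ipw_profit(y, t, e_hat, d, m, c):
    """IPW estimate of average per-customer profit under policy d(x) in {0,1}."""
    pi0 = m * y                        # profit contribution if not targeted
    pi1 = m * y - c                    # profit contribution if targeted
    w0 = (1 - t) / (1 - e_hat)         # weight for untargeted observations
    w1 = t / e_hat                     # weight for targeted observations
    return np.mean((1 - d) * w0 * pi0 + d * w1 * pi1)

def optimal_policy(cate_hat, m, c):
    """Target customer i iff m * CATE(x_i) > c."""
    return (m * cate_hat > c).astype(int)
```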
2 Influence Functions
Motivation
We want to know how the statistic changes when the data changes.
The CLT ensures good properties for many statistics. The CLT applies to averages, and the influence function is exactly what you are averaging:
$$\sqrt n(\hat\alpha - \alpha) = \frac{1}{\sqrt n}\sum_{i=1}^n (y_i - E[Y]) \xrightarrow{d} N(0, \sigma^2)$$
The asymptotic variance is just the variance of the influence function.
Notation
ˆ
β is a function of the data:
ˆ
β :=
ˆ
β(F
n
), with F
n
the distribution of the data.
ˆ
β β, which is also a function of the population “data”: β(F )
If β(F
n
) are draws from F , then β(F ) is defined as what β(F
n
) estimates
How to think about the data changing?
Example 1: Influence of one data point on the statistic α(F )
Suppose that $\hat\alpha = \alpha(F_n) = \frac1n\sum_{i=1}^n y_i$ and $\alpha(F_{n-1}) = \frac{1}{n-1}\sum_{i=1}^{n-1} y_i$ if we delete one data point. Then, the difference is
$$\frac{\alpha(F_n) - \alpha(F_{n-1})}{1/n} = y_n - \alpha(F_{n-1})$$
$1/n$ is the size of the change. This is similar to leave-one-out (LOO) jackknife resampling.
Example 2: Perturbation of the data
Suppose that we corrupt the distribution $F$ and get $F_\varepsilon = (1-\varepsilon)F + \varepsilon G$, where $G$ is a corruption or contamination distribution (usually a point mass). Assume that $\alpha(F) = \int y\,dF$. Then, the influence function is
$$\frac{\alpha(F_\varepsilon) - \alpha(F)}{\varepsilon} = \frac{\int y\,dF_\varepsilon - \int y\,dF}{\varepsilon} = \int y\,dG - \int y\,dF = \int y\,dG - \alpha(F)$$
Example 3: Explicit derivative
$$\frac{\partial}{\partial\varepsilon}\alpha(F_\varepsilon) = \frac{\partial}{\partial\varepsilon}\int y\left[(1-\varepsilon)f + \varepsilon g\right] dy = \int y\,dG - \int y\,dF = \int y\,dG - \alpha(F)$$
However, for more complex functionals, $\varepsilon$ might not cancel in the derivative.
Definition of influence function
The influence function of $\hat\theta$ at $F$, $\psi_{\hat\theta, F}: \mathcal{X} \to \Gamma$, is defined as:
$$\psi_{\hat\theta, F} = \lim_{\epsilon\to 0}\frac{\hat\theta(F_\epsilon) - \hat\theta(F)}{\epsilon}$$
where $F_\epsilon = (1-\epsilon)F + \epsilon G$ and $G$ is an arbitrary distribution.
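A small numeric check of this definition (my own illustration, not from the notes): for $\alpha(F) = E[Y]$, contaminating the sample with a point mass at $y_0$ and differencing reproduces the influence function $y_0 - E[Y]$.

```python
# Numeric check (assumed example): IF of the mean at y0 is y0 - E[Y].
import numpy as np

rng = np.random.default_rng(2)
y = rng.normal(3, 1, 10_000)
y0, eps = 7.0, 1e-4

alpha_F = y.mean()
# alpha(F_eps) for the contaminated distribution (1 - eps) F + eps * delta_{y0}
alpha_Feps = (1 - eps) * y.mean() + eps * y0

psi_numeric = (alpha_Feps - alpha_F) / eps
print(psi_numeric, y0 - alpha_F)     # identical here: the mean is linear in F
```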
Application to OLS
Another route to derive the asymptotics of the OLS estimator:
$$\begin{aligned} \sqrt n\left(\hat\beta - \beta\right) &= \sqrt n\left((X'X)^{-1}X'Y - \beta\right) = \sqrt n (X'X)^{-1}X'(Y - X\beta) \\ &= \sqrt n\left(\frac1n\sum_{i=1}^n x_ix_i'\right)^{-1}\left(\frac1n\sum_{i=1}^n x_i\varepsilon_i\right) \\ &= E[XX']^{-1}\left(\frac{1}{\sqrt n}\sum_{i=1}^n x_i\varepsilon_i\right) + \left[\left(\frac1n\sum_{i=1}^n x_ix_i'\right)^{-1} - E[XX']^{-1}\right]\left(\frac{1}{\sqrt n}\sum_{i=1}^n x_i\varepsilon_i\right) \\ &= \frac{1}{\sqrt n}\sum_{i=1}^n E[XX']^{-1}x_i\varepsilon_i + o_p(1)\cdot O_p(1) \end{aligned}$$
$$AV\left[\hat\beta\right] = V\left[E[XX']^{-1}\frac{1}{\sqrt n}\sum_{i=1}^n x_i\varepsilon_i\right] = E[XX']^{-1}\,V\left[\frac{1}{\sqrt n}\sum_{i=1}^n x_i\varepsilon_i\right]\,E[XX']^{-1} \approx (X'X)^{-1}X'\hat\Sigma X(X'X)^{-1}$$
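This calculation can be verified numerically. The sketch below (illustrative heteroskedastic DGP of my own) confirms that the variance of the plug-in influence function $\psi_i = E[XX']^{-1}x_i\varepsilon_i$ equals the usual sandwich variance.

```python
# Sketch: variance of the OLS influence function equals the sandwich variance.
import numpy as np

rng = np.random.default_rng(3)
n, d = 50_000, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, d - 1))])
beta = np.array([1.0, 2.0, -0.5])
y = X @ beta + rng.normal(size=n) * (1 + 0.5 * np.abs(X[:, 1]))  # heteroskedastic

b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b

M_inv = np.linalg.inv(X.T @ X / n)        # estimate of E[XX']^{-1}
psi = (X * e[:, None]) @ M_inv.T          # row i is (M^{-1} x_i eps_i)'
V_psi = psi.T @ psi / n                   # variance of the influence function

# sandwich for sqrt(n)(b - beta): M^{-1} (mean of eps^2 x x') M^{-1}
Sigma = (X * (e ** 2)[:, None]).T @ X / n
V_sand = M_inv @ Sigma @ M_inv
print(np.allclose(V_psi, V_sand))         # True: the two formulas coincide
```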
Intuition
For any complicated statistic, which is a function of the data, $\hat\Theta = \Theta(F_n)$, if we know that
$$\sqrt n\left(\hat\Theta - \Theta_0\right) = \frac{1}{\sqrt n}\sum_{i=1}^n \psi(z_i) + (s.o.)$$
where $\psi$ is the influence function, and we can find some way to determine the influence function, then we can quickly derive the asymptotic properties of that statistic.
Example: Standard MLE
Data $z_i$, parameter $\theta$, log-likelihood $l(z, \theta)$, and $\theta_0 = \arg\max_\theta E[l(z, \theta)]$; then we have
$$0 \overset{F.O.C.}{=} \sum_{i=1}^n \frac{\partial l(z_i, \hat\theta)}{\partial\theta} = \sum_{i=1}^n \frac{\partial l(z_i, \theta_0)}{\partial\theta} + \sum_{i=1}^n \frac{\partial^2 l(z_i, \theta_0)}{\partial\theta\,\partial\theta'}\left(\hat\theta - \theta_0\right) + (s.o.)$$
$$\sqrt n\left(\hat\theta - \theta_0\right) = -\sqrt n\left[\sum_{i=1}^n \frac{\partial^2 l(z_i, \theta_0)}{\partial\theta\,\partial\theta'}\right]^{-1}\left[\sum_{i=1}^n \frac{\partial l(z_i, \theta_0)}{\partial\theta}\right] + (s.o.) = -\frac{1}{\sqrt n}\sum_{i=1}^n H(\theta_0)^{-1}\frac{\partial l(z_i, \theta_0)}{\partial\theta} + (s.o.)$$
where $H(\theta_0) = E\left[\partial^2 l(z, \theta_0)/\partial\theta\,\partial\theta'\right]$.
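A sketch of this MLE influence function for a logit model (an assumed example, not from the notes): $\psi_i = -H(\theta_0)^{-1}\,\partial l(z_i, \theta_0)/\partial\theta$, with the Hessian and scores estimated by plug-in.

```python
# Sketch: logit MLE by Newton's method, then the plug-in influence function.
import numpy as np

rng = np.random.default_rng(4)
n = 20_000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
theta0 = np.array([0.3, -1.0])
y = rng.binomial(1, 1 / (1 + np.exp(-X @ theta0)))

theta = np.zeros(2)
for _ in range(25):
    p = 1 / (1 + np.exp(-X @ theta))
    score = X.T @ (y - p)                         # gradient of log-likelihood
    H = -(X * (p * (1 - p))[:, None]).T @ X       # Hessian of log-likelihood
    theta -= np.linalg.solve(H, score)            # Newton step

p = 1 / (1 + np.exp(-X @ theta))
scores_i = X * (y - p)[:, None]                   # per-observation scores
H_bar = -(X * (p * (1 - p))[:, None]).T @ X / n   # estimate of E[Hessian]
psi = -scores_i @ np.linalg.inv(H_bar).T          # influence function values
print(psi.T @ psi / n / n)                        # approx Var(theta_hat)
```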
3 Two-Step Estimation
3.1 Standard Two-Step Estimation & Inference
Key idea with observational data: $X$ captures why people select.
$$E[Y(1) - Y(0)] = E[E[Y(1) - Y(0) \mid X = x]] = E[\text{CATE}(x)]$$
Imputation
$$E[Y \mid T=1, X=x] = E[Y(1) \mid T=1, X=x] = E[Y(1) \mid X=x]$$
Inverse propensity weighting (IPW)
$$E\left[\frac{TY}{p(X)} \,\Big|\, X=x\right] = E\left[\frac{TY(1)}{p(X)} \,\Big|\, X=x\right] = \frac{E[Y(1) \mid X=x]\,E[T \mid X=x]}{p(x)} = E[Y(1) \mid X=x]$$
where $p(x) = P[T=1 \mid X=x]$ and $0 < c \le p(x) \le d < 1$, for fixed $c, d$.
Two-step estimation
$$Y = \alpha(X) + \beta(X)T + \varepsilon, \qquad \tau = E[\text{CATE}(x)] = E[\beta(X)]$$
1. Estimate $\alpha(X)$ and $\beta(X)$
2. Use these to estimate $\tau = E[\beta(X)]$
Example: Linear Models
$$\mu_t(x) = E[Y(t) \mid X=x] = x'\beta_t, \qquad \text{CATE}(x) = \beta(x) = \tau(x) = x'\beta_1 - x'\beta_0$$
Imputation
1. Estimate $\beta_t$: run a regression of $Y$ on $X$ in the $T = t$ subsample and get $\hat\beta_t$
2. Plug in $\hat\beta_t$: compute $\frac1n\sum_i x_i'\hat\beta_t$, and $\hat\tau = \frac1n\sum_i x_i'(\hat\beta_1 - \hat\beta_0)$
IPW is also a two-step estimator
1. Run a logit regression of $T$ on $X$ and get $\hat p(x_i)$
2. Plug in: $\frac1n\sum_i \frac{t_i y_i}{\hat p(x_i)}$
Sources of uncertainty (for inference)
Motivation: why not bootstrap? Too complex: we might need at least an outer loop to bootstrap $\tau$ and an inner loop to bootstrap $\beta_t$, which is almost impossible for complex models.
Suppose that we have $\widehat{E[Y(1)]} = E_{X\sim F_n}\left[X'\hat\beta(F_n)\right]$. As the data changes, $X'\hat\beta_t$ changes twice:
1. Data uncertainty: suppose that $\beta_t$ is fixed; when the data change, $x_i'\hat\beta_t$ and $\widehat{E[Y(1)]}$ change
2. Model uncertainty: when the data change, the function $\hat\beta_t(F_n)$ changes
$$\begin{aligned} \sqrt n\left(\widehat{E[Y(1)]} - E[Y(1)]\right) &= \sqrt n\left(\frac1n\sum_{i=1}^n x_i'\hat\beta_1 - E[Y(1)]\right) = \sqrt n\left(\frac1n\sum_{i=1}^n x_i'\hat\beta_1 - E[X'\beta_1]\right) \\ &= \frac{1}{\sqrt n}\sum_{i=1}^n\left(x_i'\beta_1 - E[X'\beta_1]\right) + \left(\frac1n\sum_{i=1}^n x_i'\right)\sqrt n\left(\hat\beta_1 - \beta_1\right) \\ &= \frac{1}{\sqrt n}\sum_{i=1}^n\left\{\left(x_i'\beta_1 - E[X'\beta_1]\right) + E[X]'M_1^{-1}t_ix_i\varepsilon_i\right\} + (s.o.) \end{aligned}$$
The first term is centered and scaled by $\sqrt n$, similar to a sample average (we can use the CLT). The second term captures $\hat\beta_1 \neq \beta_1$, showing that the first-step noise cannot be ignored when you go to the second step. This matters for inference. We plugged in the influence function here:
$$\sqrt n\left(\hat\beta_1 - \beta_1\right) = \frac{1}{\sqrt n}\sum_{i=1}^n E[TXX']^{-1}t_ix_i\varepsilon_i = \frac{1}{\sqrt n}\sum_{i=1}^n M_1^{-1}t_ix_i\varepsilon_i$$
The inference process becomes:
1. Run a regression of $Y$ on $X$ in $T = 1$
2. Compute $\frac1n\sum_i x_i'\hat\beta_1$
3. Inference: figure out the influence function
However, the second term is too complex, almost infeasible for models like DNNs. So instead, we use the following process for inference:
1. Run a regression of $Y$ on $X$ in $T = 1$ and estimate $\hat\beta$, $\hat\varepsilon$, and $\hat M$
2. Plug these estimates into the average
The cost is that we need to do more in the first two steps and have to know the form of the influence function. In practice, we will use automatic differentiation in the second step.
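A sketch of this plug-in recipe for $\widehat{E[Y(1)]}$ with a linear first step (simplified, illustrative DGP where the $Y(1)$ model holds for everyone): estimate $\hat\beta$, $\hat\varepsilon$, $\hat M$, assemble the influence function, and read off a standard error.

```python
# Sketch: plug-in influence-function inference for E[Y(1)], linear first step.
import numpy as np

rng = np.random.default_rng(6)
n = 50_000
x = rng.normal(0, 1, n)
t = rng.binomial(1, 1 / (1 + np.exp(-0.5 * x)))
X = np.column_stack([np.ones(n), x])
y = X @ np.array([1.0, 2.0]) + rng.normal(0, 1, n)   # simplified Y(1) model

# Step 1: regression of Y on X in T = 1 gives beta_hat, eps_hat, M_hat
b1 = np.linalg.lstsq(X[t == 1], y[t == 1], rcond=None)[0]
eps = y - X @ b1                                     # residuals (used where t=1)
M_hat = (X * t[:, None]).T @ X / n                   # estimate of E[T XX']

# Step 2: plug everything into the influence function
ey1_hat = (X @ b1).mean()
corr = (X * (t * eps)[:, None]) @ np.linalg.inv(M_hat) @ X.mean(axis=0)
psi = (X @ b1 - ey1_hat) + corr                      # psi_i, mean approx 0
se = psi.std(ddof=1) / np.sqrt(n)                    # standard error of ey1_hat
print(f"{ey1_hat:.3f} +/- {1.96 * se:.3f}")
```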
Connection to influence function
$$\widehat{E[Y(1)]} = E_{X\sim F_n}\left[X'\hat\beta(F_n)\right] = \mu_1\left(F_n, \hat\beta(F_n)\right) = \mu_1(F_n, z)\Big|_{z=\hat\beta(F_n)}$$
$$\frac{\partial}{\partial\varepsilon}\mu_1(F_\varepsilon)\Big|_{\varepsilon=0} = \underbrace{\frac{\partial}{\partial\varepsilon}\mu_1\left(F_\varepsilon, \hat\beta(F_n)\right)}_{\text{data uncertainty}} + \underbrace{\frac{\partial\mu_1}{\partial\hat\beta(F_n)}\,\frac{\partial\hat\beta(F_\varepsilon)}{\partial\varepsilon}}_{\text{model uncertainty}} = \left(x_i'\beta_1 - E[X'\beta_1]\right) + E[X]'M_1^{-1}t_ix_i\varepsilon_i$$
recovering the same two terms as in the expansion above.
Doubly Robust Estimation
$$\widehat{E[Y(1)]}_{DR} = \frac1n\sum_{i=1}^n\left(\hat\mu_1(x_i) + \frac{t_i\left(y_i - \hat\mu_1(x_i)\right)}{\hat p(x_i)}\right)$$
We require only that $\hat\mu_1$ or $\hat p$ is correct (or that both are very close to the truth).
Similar intuition: the second step is less sensitive to the first step.
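A minimal sketch of the estimator above; `mu1_hat` and `p_hat` stand in for any fitted first-stage models.

```python
# Sketch of the doubly robust estimator of E[Y(1)].
import numpy as np

def doubly_robust_ey1(y, t, mu1_hat, p_hat):
    """Imputation plus an IPW correction term.

    Consistent if mu1_hat OR p_hat is correct (hence 'doubly robust')."""
    return np.mean(mu1_hat + t * (y - mu1_hat) / p_hat)
```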
3.2 Nonparametric/ML Two-Step Estimation & Inference
Variance-bias decomposition
Suppose that $f(x)$ is the true function, $f_n(x)$ is the approximating sequence, and $\hat f(x)$ is the estimate. We have
$$|f(x) - \hat f(x)| \le \underbrace{|f(x) - f_n(x)|}_{\text{approximation bias}} + \underbrace{|f_n(x) - \hat f(x)|}_{\text{variance}}$$
Example: degree-1 polynomial in $J$ (quantile) bins
Let $f_n(x) = \beta_0 + \beta_1 x$ within a bin $x \in [\underline x, \bar x]$, with $\beta_1 = f'(\underline x)$ and $\beta_0 = f(\underline x) - f'(\underline x)\underline x$ (a first-order Taylor expansion at $\underline x$), which is accurate if the number of bins is large enough. We can compute the bias and variance as
Bias: $|f(x) - f_n(x)| = |f(x) - \beta_0 - \beta_1 x| = \left|\frac{f''(\tilde x)}{2}(x - \underline x)^2\right| = O(J^{-2})$
Variance: using the influence function we have $|f_n(x) - \hat f(x)| = O_p\left(\sqrt{J^d/n}\right)$ given $x \in \mathbb{R}^d$
Approximating a smooth function
Degree-$K$ polynomial in $J$ bins: $\sqrt{\frac{J^d}{n}} + J^{-(K+1)}$
Kernels of order $P$: $\sqrt{\frac{1}{nh^d}} + h^P$
Series: $\sqrt{\frac{K}{n}} + K^{-\alpha}$
In general:
$$\text{Var} = \frac{\#\text{ of params}}{n}, \qquad \text{Bias} = (\#\text{ of params})^{-\text{smoothness}}$$
Farrell, Liang, and Misra (2021, Econometrica) show that
$$\left|\hat f_{DNN}(x) - f(x)\right| = O_p\left(\sqrt{\frac{W\times L\,\log(W)\log(n)}{n}} + \epsilon_n\right)$$
where $W$ is the number of parameters, $L$ is the number of layers, and $\epsilon_n$ is the approximation bias (depends on the architecture). Theoretically this is not as good as traditional nonparametric rates, but in practice it is very powerful: given the adaptive ability of DNNs, the true error tends to be smaller than the theoretical bound. Further, we have
$$W = (d+1)H_1 + \sum_{l=2}^L (H_{l-1}+1)H_l + (H_L+1) \overset{H_l \equiv H}{\asymp} H^2 L$$
$$\epsilon_n \asymp \left(WL\log(W)\right)^{-\text{smoothness}/(2\times\dim)} \asymp \left(H^2L^2\log(H^2L)\right)^{-\text{smoothness}/(2\times\dim)}$$
This indicates that even if the number of parameters is fixed, a deeper neural network gives smaller bias.
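A small sketch of the parameter-count formula above for a fully connected network with scalar output; the layer widths are illustrative choices.

```python
# Sketch of W = (d+1)H_1 + sum_{l=2}^L (H_{l-1}+1)H_l + (H_L+1).
def num_params(d, hidden):
    """Parameter count for a fully connected net with input dim d."""
    W = (d + 1) * hidden[0]                       # input layer (+1 for bias)
    for l in range(1, len(hidden)):
        W += (hidden[l - 1] + 1) * hidden[l]      # hidden-to-hidden layers
    return W + hidden[-1] + 1                     # scalar output layer

print(num_params(d=10, hidden=[32, 32, 32]))      # grows like H^2 * L
```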
Inference challenges for nonparametric/ML methods
Two-step estimation fails for nonparametric/ML methods. Take a binned nonparametric model as an example:
$$\mu_n(x_i) = \sum_{j=1}^J \mathbb{1}\{x_i \in b_j\}\,x_i'\beta_j$$
To conduct inference, we compute
$$\widehat{E[Y(1)]} - E[Y(1)] = \frac1n\sum_i \hat\mu(x_i) - E[Y(1)] = \left[\frac1n\sum_i \hat\mu(x_i) - \frac1n\sum_i \mu_n(x_i)\right] + \left[\frac1n\sum_i \mu_n(x_i) - \frac1n\sum_i \mu(x_i)\right] + \left[\frac1n\sum_i \mu(x_i) - E[Y(1)]\right]$$
Then,
$$\sqrt n\left[\frac1n\sum_i \hat\mu(x_i) - \frac1n\sum_i \mu_n(x_i)\right] = \left(\frac1n\sum_i x_i'\right)\sum_{j=1}^J \sqrt n\left(\hat\beta_j - \beta_{n,j}\right)\mathbb{1}\{x_i \in b_j\} \approx \sqrt n\cdot\sqrt{\frac{J}{n}} = \sqrt J$$
$$\sqrt n\left[\frac1n\sum_i \mu_n(x_i) - \frac1n\sum_i \mu(x_i)\right] = \left(nJ^{-4}\right)^{1/2} = o_p(1) \text{ if } J \text{ grows very fast}$$
Plugging in to the original formula,
$$\sqrt n\left[\widehat{E[Y(1)]} - E[Y(1)]\right] = \underbrace{\left(nJ^{-4}\right)^{1/2}}_{\text{ignorable}} + \underbrace{\sqrt J}_{\text{goes to } +\infty} + \underbrace{\sqrt n\left[\frac1n\sum_i \mu(x_i) - E[Y(1)]\right]}_{\text{CLT}}$$
This shows that the variance would be very large if we use a flexible model in the first stage and let $J$ grow fast enough to reduce bias. Thus, inference would be problematic.
We use the influence function to deal with this issue.
Given that $\theta_\varepsilon(x) = E_\varepsilon[Y \mid T=1, X=x] = \int y\,dF_\varepsilon(y; x) = \int y\,d\left[(1-\varepsilon)F_{Y|X}(y; x) + \varepsilon G(y; x)\right]$ and $\mu(F) = \int \theta(x)\,dF_X$, we can compute the influence function:
$$\begin{aligned} \psi(x_i) &= \lim_{\varepsilon\to 0}\frac{\partial}{\partial\varepsilon}\mu(F_\varepsilon) = \lim_{\varepsilon\to 0}\frac{\partial}{\partial\varepsilon}\int \theta_\varepsilon(x)\,dF_\varepsilon(x) \\ &= \lim_{\varepsilon\to 0}\frac{\partial}{\partial\varepsilon}\int \theta_\varepsilon(x)\,d\left[(1-\varepsilon)F_X(x) + \varepsilon G(x)\right] \\ &= \int \theta_\varepsilon(x)\,d\left[G(x) - F_X(x)\right] + \int \frac{\partial}{\partial\varepsilon}\theta_\varepsilon(x)\,dF_X(x) \\ &= \theta(x_i) - E[Y(1)] + E\left[\frac{\partial}{\partial\varepsilon}\theta_\varepsilon(x)\right] \end{aligned}$$
How can we get the last term? Based on the moment condition
$$E_\varepsilon\left[T\left(Y - \theta_\varepsilon(X)\right) \mid X = x_i\right] = 0,$$
we have
$$\begin{aligned} 0 &= \frac{\partial}{\partial\varepsilon}\int t\left(y - \theta_\varepsilon(x)\right)dF_\varepsilon(y, t; x_i) \\ &= \int t\left(y - \theta_\varepsilon(x)\right)\frac{\partial}{\partial\varepsilon}dF_\varepsilon(y, t; x_i) - \int t\,\frac{\partial}{\partial\varepsilon}\theta_\varepsilon(x)\,dF_\varepsilon(y, t; x_i) \\ &= \int t\left(y - \theta_\varepsilon(x)\right)d\left[G(y, t; x_i) - F_{YT|X}(y, t; x_i)\right] - \frac{\partial}{\partial\varepsilon}\theta_\varepsilon(x)\int t\,dF_\varepsilon(y, t; x_i) \\ &= t_i\left(y_i - \theta(x_i)\right) - \underbrace{E[t(y - \theta(x))]}_{=0 \text{ by F.O.C.}} - \frac{\partial}{\partial\varepsilon}\theta_\varepsilon(x)\,E[T \mid X = x_i] \\ &= t_i\left(y_i - \theta(x_i)\right) - \frac{\partial}{\partial\varepsilon}\theta_\varepsilon(x)\,p(x_i) \end{aligned}$$
Therefore,
$$\frac{\partial}{\partial\varepsilon}\theta_\varepsilon(x) = \frac{t_i\left(y_i - \theta(x_i)\right)}{p(x_i)}$$
Now we can plug this result into the influence function:
$$\psi(x_i) = \theta(x_i) - E[Y(1)] + \frac{t_i\left(y_i - \theta(x_i)\right)}{p(x_i)}$$
and instead of using $\widehat{E[Y(1)]} = \frac1n\sum_{i=1}^n \hat\theta(x_i)$, whose variance blows up as shown previously, we use
$$\widehat{E[Y(1)]} = \frac1n\sum_{i=1}^n\left(\hat\theta(x_i) + \frac{t_i\left(y_i - \hat\theta(x_i)\right)}{\hat p(x_i)}\right)$$
In the first step we estimate $\hat\theta(x_i)$ and $\hat p(x_i)$ nonparametrically, and in the second step we just take the average. This procedure coincides with the doubly robust estimator.
$$\begin{aligned} &\sqrt n\left(\frac1n\sum_{i=1}^n\left(\hat\theta(x_i) + \frac{t_i(y_i - \hat\theta(x_i))}{\hat p(x_i)}\right) - E[Y(1)]\right) \\ &= \sqrt n\left(\frac1n\sum_{i=1}^n\left(\hat\theta(x_i) + \frac{t_i(y_i - \theta(x_i))}{\hat p(x_i)} + \frac{t_i(\theta(x_i) - \hat\theta(x_i))}{\hat p(x_i)}\right) - E[Y(1)]\right) \\ &= \sqrt n\left(\frac1n\sum_{i=1}^n\left(\theta(x_i) + \frac{t_i(y_i - \theta(x_i))}{p(x_i)}\right) - E[Y(1)]\right) + R_1 + R_2 + R_3 \end{aligned}$$
where
$$R_1 = \sqrt n\cdot\frac1n\sum_{i=1}^n t_i\left(y_i - \theta(x_i)\right)\left(\frac{1}{\hat p(x_i)} - \frac{1}{p(x_i)}\right)$$
$$R_2 = \sqrt n\cdot\frac1n\sum_{i=1}^n\left(\hat\theta(x_i) - \theta(x_i)\right)\left(\hat p(x_i) - p(x_i)\right)\frac{t_i}{\hat p(x_i)p(x_i)}$$
$$R_3 = \sqrt n\cdot\frac1n\sum_{i=1}^n\left(\hat\theta(x_i) - \theta(x_i)\right)\cdot\frac{p(x_i) - t_i}{p(x_i)}$$
The variance of $R_1$ is given by
$$Var(R_1) = \frac1n\sum_{i=1}^n\left(\frac{1}{\hat p(x_i)} - \frac{1}{p(x_i)}\right)^2 V[y_i \mid x_i] \to 0$$
For $R_3$, getting the asymptotic behavior of the variance is tricky. We have $E[t_i \mid x_i] = p(x_i)$, but because $\hat\theta$ is estimated from the same data, $\hat\theta(x_i) - \theta(x_i)$ is correlated with $t_i$, so $E[(\hat\theta(x_i) - \theta(x_i))(p(x_i) - t_i) \mid x_i] = 0$ might not hold. To break the connection with $t_i$, we need to use the sample-splitting trick. Specifically, we first estimate $\hat\theta(x_i)$ on part of the data, and compute $\frac1n\sum_{i\in C}\hat\theta(x_i)$ on the rest of the data. Then, we have
$$E\left[\left(\hat\theta(x_i) - \theta(x_i)\right)\cdot\left(p(x_i) - t_i\right) \,\big|\, x_i\right] = \left(\hat\theta(x_i) - \theta(x_i)\right)\cdot E\left[\left(p(x_i) - t_i\right) \mid x_i\right] = 0$$
For $R_2$, we have
$$\sqrt n\cdot\frac1n\sum_{i=1}^n\left(\hat\theta(x_i) - \theta(x_i)\right)\left(\hat p(x_i) - p(x_i)\right) \le \sqrt n\left(\frac1n\sum_{i=1}^n\left(\hat\theta(x_i) - \theta(x_i)\right)^2\right)^{1/2}\left(\frac1n\sum_{i=1}^n\left(\hat p(x_i) - p(x_i)\right)^2\right)^{1/2} \asymp \sqrt n\left(\sqrt{\frac{J^d}{n}} + J^{-2}\right)\left(\sqrt{\frac{J^d}{n}} + J^{-2}\right)$$
So we just need $J^d/\sqrt n \to 0$.
The influence function (with sample splitting) is less sensitive to the first-step estimation (this is the doubly robust structure at work), so the variance won't blow up.
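A sketch of the sample-splitting (cross-fitting) procedure, here with two folds; `fit_theta` and `fit_p` are placeholders for arbitrary first-stage learners (assumptions, not the course's exact implementation).

```python
# Sketch of cross-fitting for the doubly robust estimate of E[Y(1)].
import numpy as np

def crossfit_ey1(y, t, x, fit_theta, fit_p, n_folds=2, seed=0):
    """Fit nuisances on one fold, average the influence function on the
    other, then swap; returns the estimate and its standard error."""
    rng = np.random.default_rng(seed)
    folds = rng.integers(0, n_folds, size=len(y))
    psi = np.empty(len(y))
    for k in range(n_folds):
        train, hold = folds != k, folds == k
        theta_hat = fit_theta(x[train], y[train], t[train])  # x -> E[Y|T=1,x]
        p_hat = fit_p(x[train], t[train])                    # x -> P(T=1|x)
        th, p = theta_hat(x[hold]), p_hat(x[hold])
        psi[hold] = th + t[hold] * (y[hold] - th) / p
    return psi.mean(), psi.std(ddof=1) / np.sqrt(len(y))
```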
More general influence function (if we want to do inference for something other than the ATE):
$$\mu(F) = \mu = E[H(X, \theta(X))], \qquad \theta_0 = \arg\max_\theta \sum_{i=1}^n l\left(y_i, t_i, \theta(x_i)\right)$$
has influence function
$$H(x_i, \theta(x_i)) - \mu - \left(\partial_\theta H\right)E\left[l_{\theta\theta} \mid x_i\right]^{-1}l_\theta\left(y_i, t_i, \theta(x_i)\right)$$
We can use an automatic differentiation engine to compute this.
4 Deep Learning for Individual Heterogeneity
Consider a utility function $U = \alpha + \beta T + \varepsilon$ where $T$ is some treatment (for our example, say a targeted price). Then a "structural" choice model with heterogeneity and the usual extreme value Type I error gives
$$P(y_i = 1 \mid x_i, t_i) = \frac{\exp(\alpha_i + \beta_i t_i)}{1 + \exp(\alpha_i + \beta_i t_i)}$$
Change this to
$$P(y_i = 1 \mid x_i, t_i) = \frac{\exp\left(\alpha_{DNN}(x_i) + \beta_{DNN}(x_i)\,t_i\right)}{1 + \exp\left(\alpha_{DNN}(x_i) + \beta_{DNN}(x_i)\,t_i\right)}$$
and estimate
$$\hat\theta = \arg\max_{\theta\in\mathcal{F}_{DNN}}\frac1n\sum_i l\left(y_i, t_i, \theta(x_i)\right)$$
This retains the structural interpretation completely: $\beta(x)$ is still the price effect, and we can still use the usual tricks for WTP, elasticity, surplus, etc. In the first stage we use a DNN to estimate $\beta$; in the second stage we can do inference for these economic outcomes by computing the standard asymptotic variance of the influence function.
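A minimal PyTorch sketch of the first stage under the logit specification above; the architecture and training loop are illustrative choices, not the course's exact implementation. `x`, `t`, `y` are assumed to be float tensors, with `y` containing 0/1 outcomes.

```python
# Sketch: structural logit with DNN heads alpha(x) and beta(x).
import torch
import torch.nn as nn

class StructuralLogit(nn.Module):
    def __init__(self, d, hidden=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(d, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2),        # two heads: alpha(x), beta(x)
        )

    def forward(self, x, t):
        ab = self.body(x)
        alpha, beta = ab[:, 0], ab[:, 1]
        return alpha + beta * t          # logit index alpha(x) + beta(x) * t

def train(model, x, t, y, epochs=200, lr=1e-3):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.BCEWithLogitsLoss()     # negative Bernoulli log-likelihood
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(x, t), y)
        loss.backward()
        opt.step()
    return model
```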
Sample-splitting trick: we need to split the data into three parts. Usually parts A and B should be larger than C because fitting a DNN is harder than taking an average.
Part A for estimating $\theta(x)$ using a DNN
Part B for estimating $\Lambda(x) = E[l_{\theta\theta} \mid x]$ (possibly also using a DNN)
Part C for averaging the influence function
Advantages of this method:
If there is heterogeneity, we cannot get an accurate estimate of $\beta$ by simply regressing the outcome on the treatment (unless under full randomization).
While we cannot derive a confidence interval for $\beta(x)$ itself, we can use $\beta(x)$ to do targeting and compute a confidence interval for $E[\Pi(\text{targeting policy})]$.