Machine Learning with Graphs
Zijing Hu
June 22, 2024
Contents
1 Graph Features
1.1 Feature Engineering for Graph Data
1.2 Node Embeddings and Graph Representation Learning
1.3 Graph and Stochastic Matrix
2 Graph Learning
2.1 Collective Classification
2.2 Graph Neural Networks
2.3 Graph Augmentation
*This note is based on CS224W: Machine Learning with Graphs by Dr. Jure Leskovec, Stanford
1 Graph Features
1.1 Feature Engineering for Graph Data
Representation of a network
Objects (nodes, vertices): N
Interactions (links, edges): E (networks are sparse, i.e., |E| ≪ |E_max|)
System (network, graph): G(N, E)
Node degree, k_i: the number of edges adjacent to node i.
Undirected: \bar{k} = \frac{1}{N} \sum_{i=1}^{N} k_i = \frac{2|E|}{N}
Directed: \bar{k} = \frac{|E|}{N}, \quad \bar{k}^{in} = \bar{k}^{out}
Adjacency matrix A: A_{ij} = 1 if there’s a link from node i to j (directed) else A_{ij} = 0
More types: weighted, self-loops, multigraph
Block-diagonal form
Node and edge attributes
Node-level features
Importance-based features; structure-based features (topological properties)
Node degree counts (simple but ignores node importance)
Node centrality
Eigenvector: a node is important if surrounded by important neighboring nodes
c_i = \frac{1}{\lambda} \sum_{j \in N(i)} c_j \quad \Longleftrightarrow \quad \lambda c = Ac
Betweenness: a node is important if it lies on many shortest paths between other nodes
c_i = \sum_{j \neq i \neq k} \frac{\#(\text{shortest paths between } j \text{ and } k \text{ that contain } i)}{\#(\text{shortest paths between } j \text{ and } k)}
Closeness: a node is important if it has small shortest-path distances to all other nodes
c_i = \frac{1}{\sum_{j \neq i} d(i, j)}
where d(i, j) is the shortest-path distance between i and j
Clustering coefficient: how a node’s neighboring nodes are connected with each other; the clustering coefficient counts the #(triangles) in the ego-network
e_i = \frac{\#(\text{edges among neighboring nodes})}{\#(\text{all possible edges among neighboring nodes})} \in [0, 1]
Graphlet Degree Vector (GDV) counts the #(graphlets/pre-specified subgraphs) that a node touches.
It provides a measure of a node’s local network topology.
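As an illustration (not part of the original note), the centrality and clustering measures above can be computed with the networkx library; the karate-club graph below is just a stand-in example.

    import networkx as nx

    G = nx.karate_club_graph()             # stand-in example graph

    eig = nx.eigenvector_centrality(G)     # solves lambda*c = A*c
    btw = nx.betweenness_centrality(G)     # fraction of shortest paths passing through each node
    cls = nx.closeness_centrality(G)       # inverse of the average shortest-path distance
    clu = nx.clustering(G)                 # normalized #(triangles) in each node's ego-network

    node = 0
    print(eig[node], btw[node], cls[node], clu[node])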
Link-level features
Link prediction as a task: (1) links missing at random; (2) links over time
Methodology: compute score c(i, j) for each pair of nodes and predict top n pairs as new links
Distance-based feature (e.g., shortest-path distance)
Local neighborhood overlap
Common neighbors: |N(i) ∩ N(j)|
Jaccard’s coefficient: |N(i) ∩ N(j)| / |N(i) ∪ N(j)|
Adamic-Adar index (high if the common neighbor has small degree, useful in social networks):
\sum_{u \in N(i) \cap N(j)} \frac{1}{\log(k_u)}
Global neighborhood overlap: Let P^{(K)}_{ij} denote the #(paths of length K between i and j). We have P^{(K)} = A^K, where A is the adjacency matrix. Then we can compute the Katz index
S_{ij} = \sum_{l=1}^{\infty} \beta^l P^{(l)}_{ij} = \sum_{l=1}^{\infty} \beta^l (A^l)_{ij}
where β ∈ (0, 1) is the discount factor. The closed form of the Katz index matrix is given by
S = \sum_{l=1}^{\infty} \beta^l A^l = (I - \beta A)^{-1} - I
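A minimal numpy sketch of the Katz closed form above; the adjacency matrix and β are hypothetical, and β must be smaller than the reciprocal of the largest eigenvalue of A for the series to converge.

    import numpy as np

    A = np.array([[0, 1, 1],
                  [1, 0, 1],
                  [1, 1, 0]], dtype=float)   # hypothetical adjacency matrix
    beta = 0.1                               # discount factor

    I = np.eye(A.shape[0])
    S = np.linalg.inv(I - beta * A) - I      # Katz index matrix: sum over l >= 1 of beta^l A^l
    print(S)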
Graph-level features
Graphlet Kernel: given graph G and a graphlet list G_k = {g_1, g_2, ..., g_{n_k}}, define the graphlet count vector f_G ∈ R^{n_k} as (f_G)_i = #(g_i ⊆ G) for i = 1, 2, ..., n_k. Define the graphlet kernel as
K(G, G') = f_G^\top f_{G'} \quad \text{or} \quad K(G, G') = \left(\frac{f_G}{\text{sum}(f_G)}\right)^\top \frac{f_{G'}}{\text{sum}(f_{G'})}
The second one helps deal with the situation in which G and G' have different sizes. But counting graphlets is still very expensive.
Weisfeiler-Lehman Kernel: color refinement algorithm. The intuition is to use neighborhood
structure to design a more efficient graph feature descriptor.
Other kernels (e.g., random-walk kernel, shortest-path graph kernel, etc.)
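A minimal sketch of the Weisfeiler-Lehman color refinement described above, assuming graphs are given as neighbor dictionaries (a hypothetical input format); a full WL kernel would also accumulate the color histograms from every iteration.

    from collections import Counter

    def wl_color_refinement(adj, num_iters=3):
        # adj: dict mapping each node to its list of neighbors
        colors = {v: 1 for v in adj}                    # initial color: constant for every node
        for _ in range(num_iters):
            new_colors = {}
            for v in adj:
                signature = (colors[v], tuple(sorted(colors[u] for u in adj[v])))
                new_colors[v] = hash(signature)         # stands in for the injective color hash
            colors = new_colors
        return Counter(colors.values())                 # color histogram as the graph feature

    # Kernel value: dot product of the two color histograms
    g1 = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
    g2 = {0: [1], 1: [0, 2], 2: [1]}
    h1, h2 = wl_color_refinement(g1), wl_color_refinement(g2)
    print(sum(h1[c] * h2[c] for c in h1))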
1.2 Node Embeddings and Graph Representation Learning
The goal is to encode nodes so that similarity in the embedding space (e.g., dot product) approximates
similarity in the graph.
How to use embeddings of nodes?
Clustering/community detection
Node classification
Link prediction
Graph classification
Node embeddings
“Shallow” encoding: embedding-lookup. ENC(i) = z_i = Z · v_i, where Z ∈ R^{d×|V|} is a matrix in which each column is a node embedding and v_i ∈ R^{|V|} is a one-hot encoding of node i. The objective is to maximize z_i^\top z_j for similar nodes i and j.
Similarity decoding
Random walk: z_i^\top z_j ≈ probability of visiting j on a random walk starting from i. Algorithm: (1) run short fixed-length random walks starting from each node using some random walk strategy R (DeepWalk: unbiased random walks); (2) for each node i, collect N_R(i), the multiset of nodes visited; (3) optimize the embeddings
\mathcal{L} = \max_f \sum_{i \in V} \log P(N_R(i) \mid z_i)
We can equivalently optimize
\mathcal{L} = \sum_{i \in V} \sum_{j \in N_R(i)} \log P(j \mid z_i)
and parameterize P(j \mid z_i) as
P(j \mid z_i) = \frac{\exp(z_i^\top z_j)}{\sum_{k \in V} \exp(z_i^\top z_k)}
Normalizing over all nodes is expensive, so in practice we approximate the log-probability with negative sampling:
\log P(j \mid z_i) \approx \log \sigma(z_i^\top z_j) - \sum_{t=1}^{T} \log \sigma(z_i^\top z_{k_t})
where the negative samples z_{k_t} are drawn with probability proportional to their degree. In practice, T = 5–20 (see the random-walk sketch after this list).
node2vec: flexible, biased random walks. Three choices at each step: return back to the previous node (parameterized by p), move to a node at the same distance from the previous node, or move farther away (parameterized by q)
Method selection (Goyal and Ferrara 2017 survey)
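A minimal sketch of generating DeepWalk-style unbiased random walks (the neighbor-dictionary graph and walk lengths are hypothetical); optimizing the embeddings would then follow the negative-sampling objective above.

    import random

    def random_walks(adj, walk_length=10, walks_per_node=5, seed=0):
        # adj: dict mapping each node to its list of neighbors
        rng = random.Random(seed)
        walks = []
        for start in adj:
            for _ in range(walks_per_node):
                walk = [start]
                while len(walk) < walk_length:
                    neighbors = adj[walk[-1]]
                    if not neighbors:        # dead end: stop the walk early
                        break
                    walk.append(rng.choice(neighbors))
                walks.append(walk)
        return walks

    # N_R(i) is the multiset of nodes visited on walks that start from i
    walks = random_walks({0: [1, 2], 1: [0, 2], 2: [0, 1]})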
Graph embeddings
Sum up node embeddings
Introduce a “virtual node” to represent (connect) the (sub)graph
Anonymous walk
Represent the graph as a probability distribution over these walks (we need to decide how many random walks to sample)
Embed anonymous walks
1.3 Graph and Stochastic Matrix
PageRank
Intuition: the importance of a page is the sum of the importance flowing in from other pages. Given the stochastic adjacency matrix M and the rank vector r, the flow equations can be written as r = M · r
Stationary distribution of infinite random walk
Eigendecomposition: r is the principal eigenvector of M with eigenvalue 1.
We can solve r by power iteration.
Using teleports to deal with spider traps and dead ends
Google PageRank matrix (teleport with same probability)
\tilde{M} = \beta M + (1 - \beta) \left[\frac{1}{N}\right]_{N \times N}
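A minimal numpy sketch of power iteration on the teleporting matrix above; the adjacency matrix, β, and tolerance are hypothetical, and dead ends would additionally need their columns replaced by 1/N.

    import numpy as np

    A = np.array([[0, 0, 1],
                  [1, 0, 0],
                  [1, 1, 0]], dtype=float)   # hypothetical adjacency: A[j, i] = 1 if i links to j
    M = A / A.sum(axis=0)                    # column-stochastic adjacency matrix
    N = M.shape[0]
    beta = 0.85

    M_tilde = beta * M + (1 - beta) / N      # teleport to every node with equal probability
    r = np.full(N, 1.0 / N)
    for _ in range(100):
        r_next = M_tilde @ r                 # power iteration: r <- M~ r
        if np.abs(r_next - r).sum() < 1e-10:
            break
        r = r_next
    print(r)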
Random Walk with Restarts
Can be viewed as a special case of PageRank in which the teleport is always to the same node (start
node). It is a very powerful solution for computing “similarity” that considers:
Multiple connections
Multiple paths
Direct and indirect connections
Degree of the node
The above methods are connected to matrix factorization, but the optimization target might be different across algorithms (e.g., DeepWalk).
Limitations of node embeddings
Cannot deal with unseen nodes or evolving networks
Cannot capture structural similarity (node-level features)
Cannot utilize node, edge, and graph features
2 Graph Learning
2.1 Collective Classification
The goal is to use the network structure and labeled nodes to analyze unlabeled nodes (collective classification). The intuition is to exploit correlations in the network
Homophily: individual characteristics → social connections
Influence: social connections → individual characteristics
Relational classifier
Intuition: class probability of a node is a weighted average of class probabilities of its neighbors.
Algorithm: (1) initialize unlabeled nodes with random labels and (2) update sequentially until
convergence or maximum number of iterations based on
P(Y_i = c) = \frac{\sum_{(i,j) \in E} A_{ij} P(Y_j = c)}{\sum_{(i,j) \in E} A_{ij}}
Challenge: convergence is not guaranteed; cannot use node features
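A minimal numpy sketch of the relational classifier update above for a binary label; the graph, the initial probabilities, and the iteration count are hypothetical.

    import numpy as np

    A = np.array([[0, 1, 1, 0],
                  [1, 0, 1, 0],
                  [1, 1, 0, 1],
                  [0, 0, 1, 0]], dtype=float)   # hypothetical undirected graph
    p = np.array([1.0, 0.0, 0.5, 0.5])          # P(Y_i = 1); nodes 0 and 1 are labeled
    labeled = np.array([True, True, False, False])

    for _ in range(20):
        for i in np.where(~labeled)[0]:
            p[i] = A[i] @ p / A[i].sum()        # weighted average of neighbors' class probabilities
    print(p)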
Iterative classification
Intuition: improve by incorporating features
Algorithm: (1) train two classifiers ϕ_1(f_i) and ϕ_2(f_i, z_i) on the training set (a different dataset) to predict Y_i, where f_i is a vector of node features and z_i is a vector of the labels of i’s neighbors; (2) on the test set, set Y_j based on ϕ_1, compute z_j, and predict the labels with ϕ_2; (3) repeat (2) until convergence or the maximum number of iterations
Challenge: convergence is not guaranteed
Loopy belief propagation
Intuition: let nodes pass messages to each other
Algorithm:
m_{i \to j}(Y_j) = \sum_{Y_i \in \mathcal{L}} \psi(Y_i, Y_j)\, \phi_i(Y_i) \prod_{k \in N_i \setminus j} m_{k \to i}(Y_i)
where ψ(Y_i, Y_j) is the label-label potential, ϕ_i(Y_i) is the prior, \mathcal{L} is the label set, and \prod_{k \in N_i \setminus j} m_{k \to i}(Y_i) collects all messages sent by i’s neighbors (except j) in the previous round.
Challenge: convergence is not guaranteed, especially if there are many closed loops; requires training the label-label potential functions.
2.2 Graph Neural Networks
Notation
The vertex set V
The adjacency matrix (assume binary) A
A matrix of node features X ∈ R^{m×|V|}
A node i ∈ V; the set of neighbors of i: N(i)
Message-aggregation architecture
Message:
message from the neighboring nodes: m_i^{(l)} = \text{MSG}^{(l)}_{\text{other}}(h_i^{(l-1)}), e.g., m_i^{(l)} = W^{(l)} h_i^{(l-1)}
message from the node itself: m_j^{(l)} = \text{MSG}^{(l)}_{\text{self}}(h_j^{(l-1)}), e.g., m_j^{(l)} = B^{(l)} h_j^{(l-1)}
Aggregation:
h_j^{(l)} = \text{CONCAT}\left(\text{AGG}(\{m_i^{(l)}, i \in N(j)\}),\, m_j^{(l)}\right)
Graph convolutional networks
Architecture
Layer-0 transforms nodes to embeddings: h_i^{(0)} = x_i
Layer-l (0 ≤ l ≤ L−1) aggregates neighbor messages and incorporates non-linearity (e.g., ReLU)
h_j^{(l+1)} = \sigma\left(\frac{1}{|N(j)|} \sum_{i \in N(j)} W^{(l)} h_i^{(l)} + B^{(l)} h_j^{(l)}\right)
Matrix formulation: let H^{(l)} = [h_1^{(l)}, \ldots, h_{|V|}^{(l)}]^\top and D = \text{Diag}(D(1), \ldots, D(|V|)), where D(i) is the degree of node i. Then
H^{(l+1)} = \sigma(D^{-1} A H^{(l)} W^{(l)} + H^{(l)} B^{(l)})
Layer-L outputs the embedding of i after L layers of aggregation: z_i = h_i^{(L)}
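A minimal numpy sketch of the matrix-form GCN layer above; the graph, dimensions, and weights are hypothetical and untrained.

    import numpy as np

    rng = np.random.default_rng(0)
    A = np.array([[0, 1, 1],
                  [1, 0, 0],
                  [1, 0, 0]], dtype=float)          # hypothetical adjacency matrix
    D_inv = np.diag(1.0 / A.sum(axis=1))            # inverse degree matrix D^{-1}
    H = rng.normal(size=(3, 4))                     # H^(l): one row per node
    W = rng.normal(size=(4, 4))                     # neighbor-message weights W^(l)
    B = rng.normal(size=(4, 4))                     # self-message weights B^(l)

    H_next = np.maximum(0, D_inv @ A @ H @ W + H @ B)   # sigma = ReLU
    print(H_next.shape)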
Downstream tasks
Supervised (e.g., label prediction): \mathcal{L} = \sum_{i \in V} \text{Loss}(y_i, f(z_i))
Unsupervised (e.g., link prediction): \mathcal{L} = \sum_{i, j \in V} \text{CrossEntropyLoss}(y_{ij}, \text{Similarity}(z_i, z_j))
Inductive capability: can model unseen nodes
GraphSAGE
Two-stage aggregation
Stage 1: aggregate from the node’s neighbors: h_{N(j)}^{(l)} \leftarrow \text{AGG}(\{h_i^{(l-1)}, i \in N(j)\})
Pool: \text{AGG} = \text{MAX}(\{\text{MLP}(h_i^{(l-1)}), i \in N(j)\})
LSTM (sensitive to order, so applied to a random permutation π of the neighbors): \text{AGG} = \text{LSTM}([h_i^{(l-1)}, i \in \pi(N(j))])
Stage 2: aggregate over the node itself: h_j^{(l)} \leftarrow \sigma(W^{(l)} \cdot \text{CONCAT}(h_j^{(l-1)}, h_{N(j)}^{(l)}))
Stage 3 (optional): ℓ_2 normalization: h_j^{(l)} \leftarrow h_j^{(l)} / \|h_j^{(l)}\|_2
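A minimal numpy sketch of one GraphSAGE layer with the max-pool aggregator; the graph is hypothetical, the pooling MLP is reduced to a single untrained linear map, and σ is ReLU.

    import numpy as np

    rng = np.random.default_rng(0)
    adj = {0: [1, 2], 1: [0], 2: [0]}          # hypothetical graph
    H = rng.normal(size=(3, 4))                # h^(l-1) for each node
    W_pool = rng.normal(size=(4, 4))           # stands in for the pooling MLP
    W = rng.normal(size=(4, 8))                # W^(l), applied to CONCAT(self, neighborhood)

    H_next = np.zeros_like(H)
    for j, neighbors in adj.items():
        pooled = np.max(np.maximum(0, H[neighbors] @ W_pool), axis=0)  # stage 1: max-pool
        h = np.maximum(0, W @ np.concatenate([H[j], pooled]))          # stage 2: combine with self
        H_next[j] = h / (np.linalg.norm(h) + 1e-12)                    # stage 3: l2 normalization
    print(H_next)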
Graph isomorphism network
GCN and GraphSAGE could fail to distinguish certain graph structures (Xu et al., ICLR 2019)
GIN is the most expressive GNN in the class of message-passing GNNs
Two-MLP architecture:
\text{MLP}_{\Phi}\left(\sum_{x \in S} \text{MLP}_f(x)\right)
GIN is a “neural network” version of the WL graph kernel (Section 1.1)
c^{(k+1)}(i) = \text{GINConv}\left(c^{(k)}(i), \{c^{(k)}(j)\}_{j \in N(i)}\right) = \text{MLP}_{\Phi}\left((1 + \epsilon) \cdot c^{(k)}(i) + \sum_{j \in N(i)} c^{(k)}(j)\right)
where c is a differentiable color hash function.
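A minimal numpy sketch of the GINConv update above, with MLP_Φ reduced to a single hypothetical linear map plus ReLU.

    import numpy as np

    rng = np.random.default_rng(0)
    adj = {0: [1, 2], 1: [0], 2: [0]}          # hypothetical graph
    C = rng.normal(size=(3, 4))                # c^(k): current color embedding of each node
    W = rng.normal(size=(4, 4))                # stands in for MLP_Phi
    eps = 0.1

    C_next = np.zeros_like(C)
    for i, neighbors in adj.items():
        msg = (1 + eps) * C[i] + C[neighbors].sum(axis=0)   # (1 + eps) * own color + neighbor sum
        C_next[i] = np.maximum(0, msg @ W)                  # MLP_Phi reduced to one layer
    print(C_next)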
Graph attention networks
Intuition: use flexible, learned weights instead of fixed ones to implicitly assign different importance values to different neighbors.
h_j^{(l+1)} = \sigma\left(\sum_{i \in N(j) \cup \{j\}} \alpha_{ij} W^{(l)} h_i^{(l)}\right)
The attention weight α_{ij} is computed by normalizing the attention scores over the neighborhood of j:
\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k \in N(j) \cup \{j\}} \exp(e_{kj})}
and
e_{AB} = a(W^{(l)} h_A^{(l-1)}, W^{(l)} h_B^{(l-1)}), \quad \text{e.g.,} \quad e_{AB} = \text{Linear}\left(\text{Concat}(W^{(l)} h_A^{(l-1)}, W^{(l)} h_B^{(l-1)})\right)
We can also use multi-head attention to get multiple attention scores and then aggregate them by
concatenation or summation.
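A minimal numpy sketch of single-head attention scores for one target node; the graph, the shared weights W, and the scoring vector a are hypothetical and untrained.

    import numpy as np

    rng = np.random.default_rng(0)
    adj = {0: [1, 2], 1: [0], 2: [0]}          # hypothetical graph
    H = rng.normal(size=(3, 4))                # h^(l) for each node
    W = rng.normal(size=(4, 4))                # shared linear transform W^(l)
    a = rng.normal(size=8)                     # parameters of the Linear(Concat(...)) scoring function

    j = 0                                      # target node
    nbrs = adj[j] + [j]                        # attend over N(j) and j itself
    e = np.array([a @ np.concatenate([H[i] @ W, H[j] @ W]) for i in nbrs])  # scores e_ij
    alpha = np.exp(e) / np.exp(e).sum()        # softmax over the neighborhood
    h_next = np.maximum(0, (alpha[:, None] * (H[nbrs] @ W)).sum(axis=0))    # weighted aggregation
    print(alpha, h_next)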
Training tricks
General tricks
Batch normalization
Dropout: apply to W^{(l)}
Activation: PReLU performs better empirically
Train/test split
Transductive setting: only split the labels but the whole graph will be used in training
Inductive setting: break the edges between splits to get multiple graphs
Avoid oversmoothing
Carefully add GNN layers
Increase the complexity within GNN layers
Add layers between GNN layers that do not pass messages
Add skip connections (ResNet)
h_j^{(l+1)} = \sigma\left(\frac{1}{|N(j)|} \sum_{i \in N(j)} W^{(l)} h_i^{(l)} + B^{(l)} h_j^{(l)} + h_j^{(l)}\right)
Avoid pooling issues
Hierarchical pooling
DiffPool: two GNNs, one for node embeddings and another for clustering
2.3 Graph Augmentation
Implicit assumption: raw input graph = computational graph. But we might want to break this assumption when:
The input graph lacks features
The graph is too sparse → inefficient message passing
The graph is too dense → message passing is too costly
The graph is too large → high GPU requirements
Feature augmentation (graph lacks features)
Assign constant values to nodes (node fixed effects)
Assign one-hot encodings to nodes (good expressive power but low generalizability and scalability)
Other human generated graph features (Section 1.1)
Augment sparse graphs
Add virtual edges: use A + A^2 as the adjacency matrix
Add virtual nodes: the virtual node will connect to all the nodes in the graph
Node neighborhood sampling
Sample different neighbors in different layers