Machine Learning with Graphs
Zijing Hu
June 22, 2024
Contents
1 Graph Features
1.1 Feature Engineering for Graph Data
1.2 Node Embeddings and Graph Representation Learning
1.3 Graph and Stochastic Matrix
2 Graph Learning
2.1 Collective Classification
2.2 Graph Neural Networks
2.3 Graph Augmentation
*This note is based on CS224W: Machine Learning with Graphs by Dr. Jure Leskovec, Stanford
1 Graph Features
1.1 Feature Engineering for Graph Data
Representation of a network
Objects (nodes, vertices): N
Interactions (links, edges): E (networks are sparse, i.e., |E| ≪ |E_max|)
System (network, graph): G(N, E)
Node degree, k_i: the number of edges adjacent to node i.
Undirected: \bar{k} = \frac{1}{N} \sum_{i=1}^{N} k_i = \frac{2|E|}{N}
Directed: \bar{k} = \frac{|E|}{N}, \quad \bar{k}^{in} = \bar{k}^{out}
Adjacency matrix A: A_{ij} = 1 if there’s a link from node i to j (directed) else A_{ij} = 0
More types: weighted, self-loops, multigraph
Block-diagonal form
Node and edge attributes
Node-level features
Importance-based features; structure-based features (topological properties)
Node degree counts (simple but ignores node importance)
Node centrality
Eigenvector: a node is important if surrounded by important neighboring nodes
c_i = \frac{1}{\lambda} \sum_{j \in N(i)} c_j \quad \Longleftrightarrow \quad \lambda c = Ac
Betweenness: a node is important if it lies on many shortest paths between other nodes
c_i = \sum_{j \neq i \neq k} \frac{\#(\text{shortest paths between } j \text{ and } k \text{ that contain } i)}{\#(\text{shortest paths between } j \text{ and } k)}
Closeness: a node is important if it has small shortest-path distances to all other nodes
c_i = \frac{1}{\sum_{j \neq i} d(i, j)}
where d(i, j) is the shortest-path distance between i and j
Clustering coefficient: how a node’s neighboring nodes are connected with each other; the clustering coefficient counts the #(triangles) in the ego-network
e_i = \frac{\#(\text{edges among neighboring nodes})}{\#(\text{all possible edges among neighboring nodes})} \in [0, 1]
Graphlet Degree Vector (GDV) counts the #(graphlets/pre-specified subgraphs) that a node touches.
It provides a measure of a node’s local network topology.
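As an illustration (not part of the original note), the centrality and clustering measures above can be computed with the networkx library; the karate-club graph below is just a stand-in example.

    import networkx as nx

    G = nx.karate_club_graph()             # stand-in example graph

    eig = nx.eigenvector_centrality(G)     # solves lambda*c = A*c
    btw = nx.betweenness_centrality(G)     # fraction of shortest paths passing through each node
    cls = nx.closeness_centrality(G)       # inverse of the average shortest-path distance
    clu = nx.clustering(G)                 # normalized #(triangles) in each node's ego-network

    node = 0
    print(eig[node], btw[node], cls[node], clu[node])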
Link-level features
Link prediction as a task: (1) links missing at random; (2) links over time
Methodology: compute score c(i, j) for each pair of nodes and predict top n pairs as new links
Distance-based feature (e.g., shortest-path distance)
Local neighborhood overlap
Common neighbors: |N(i) ∩ N(j)|
Jaccard’s coefficient: |N(i) ∩ N(j)| / |N(i) ∪ N(j)|
Adamic-Adar index (high if the common neighbor has small degree, useful in social networks):
\sum_{u \in N(i) \cap N(j)} \frac{1}{\log(k_u)}
Global neighborhood overlap: Let P^{(K)}_{ij} denote the #(paths of length K between i and j). We have P^{(K)} = A^K, where A is the adjacency matrix. Then we can compute the Katz index
S_{ij} = \sum_{l=1}^{\infty} \beta^l P^{(l)}_{ij} = \sum_{l=1}^{\infty} \beta^l (A^l)_{ij}
where β ∈ (0, 1) is the discount factor. The closed form of the Katz index matrix is given by
S = \sum_{l=1}^{\infty} \beta^l A^l = (I - \beta A)^{-1} - I
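A minimal numpy sketch of the Katz closed form above; the adjacency matrix and β are hypothetical, and β must be smaller than the reciprocal of the largest eigenvalue of A for the series to converge.

    import numpy as np

    A = np.array([[0, 1, 1],
                  [1, 0, 1],
                  [1, 1, 0]], dtype=float)   # hypothetical adjacency matrix
    beta = 0.1                               # discount factor

    I = np.eye(A.shape[0])
    S = np.linalg.inv(I - beta * A) - I      # Katz index matrix: sum over l >= 1 of beta^l A^l
    print(S)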
Graph-level features
Graphlet Kernel: given graph G and a graphlet list G_k = {g_1, g_2, ..., g_{n_k}}, define the graphlet count vector f_G ∈ R^{n_k} as (f_G)_i = #(g_i ⊆ G) for i = 1, 2, ..., n_k. Define the graphlet kernel as
K(G, G') = f_G^\top f_{G'} \quad \text{or} \quad K(G, G') = \left(\frac{f_G}{\text{sum}(f_G)}\right)^\top \frac{f_{G'}}{\text{sum}(f_{G'})}
The second one helps deal with the situation in which G and G' have different sizes. But counting graphlets is still very expensive.
Weisfeiler-Lehman Kernel: color refinement algorithm. The intuition is to use neighborhood
structure to design a more efficient graph feature descriptor.
Other kernels (e.g., random-walk kernel, shortest-path graph kernel, etc.)
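A minimal sketch of the Weisfeiler-Lehman color refinement described above, assuming graphs are given as neighbor dictionaries (a hypothetical input format); a full WL kernel would also accumulate the color histograms from every iteration.

    from collections import Counter

    def wl_color_refinement(adj, num_iters=3):
        # adj: dict mapping each node to its list of neighbors
        colors = {v: 1 for v in adj}                    # initial color: constant for every node
        for _ in range(num_iters):
            new_colors = {}
            for v in adj:
                signature = (colors[v], tuple(sorted(colors[u] for u in adj[v])))
                new_colors[v] = hash(signature)         # stands in for the injective color hash
            colors = new_colors
        return Counter(colors.values())                 # color histogram as the graph feature

    # Kernel value: dot product of the two color histograms
    g1 = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
    g2 = {0: [1], 1: [0, 2], 2: [1]}
    h1, h2 = wl_color_refinement(g1), wl_color_refinement(g2)
    print(sum(h1[c] * h2[c] for c in h1))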
1.2 Node Embeddings and Graph Representation Learning
The goal is to encode nodes so that similarity in the embedding space (e.g., dot product) approximates
similarity in the graph.
How to use embeddings of nodes?
Clustering/community detection
Node classification
Link prediction
Graph classification
Node embeddings
“Shallow” encoding: embedding-lookup. ENC(i) = z_i = Z · v_i, where Z ∈ R^{d×|V|} is a matrix in which each column is a node embedding and v_i ∈ R^{|V|} is a one-hot encoding of node i. The objective is to maximize z_i^\top z_j for similar nodes i and j.
Similarity decoding
Random walk: z_i^\top z_j ≈ probability of visiting j on a random walk starting from i. Algorithm: (1) run short fixed-length random walks starting from each node using some random walk strategy R (DeepWalk: unbiased random walks); (2) for each node i, collect N_R(i), the multiset of nodes visited; (3) optimize the embeddings
\mathcal{L} = \max_f \sum_{i \in V} \log P(N_R(i) \mid z_i)
We can equivalently optimize
\mathcal{L} = \sum_{i \in V} \sum_{j \in N_R(i)} \log P(j \mid z_i)
and parameterize P(j \mid z_i) as
P(j \mid z_i) = \frac{\exp(z_i^\top z_j)}{\sum_{k \in V} \exp(z_i^\top z_k)}
Normalizing over all nodes is expensive, so in practice we approximate the log-probability with negative sampling:
\log P(j \mid z_i) \approx \log \sigma(z_i^\top z_j) - \sum_{t=1}^{T} \log \sigma(z_i^\top z_{k_t})
where the negative samples z_{k_t} are drawn with probability proportional to their degree. In practice, T = 5–20 (see the random-walk sketch after this list).
node2vec: flexible, biased random walks. Three choices at each step: return back to the previous node (parameterized by p), move to a node at the same distance from the previous node, or move farther away (parameterized by q)
Method selection (Goyal and Ferrara 2017 survey)
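A minimal sketch of generating DeepWalk-style unbiased random walks (the neighbor-dictionary graph and walk lengths are hypothetical); optimizing the embeddings would then follow the negative-sampling objective above.

    import random

    def random_walks(adj, walk_length=10, walks_per_node=5, seed=0):
        # adj: dict mapping each node to its list of neighbors
        rng = random.Random(seed)
        walks = []
        for start in adj:
            for _ in range(walks_per_node):
                walk = [start]
                while len(walk) < walk_length:
                    neighbors = adj[walk[-1]]
                    if not neighbors:        # dead end: stop the walk early
                        break
                    walk.append(rng.choice(neighbors))
                walks.append(walk)
        return walks

    # N_R(i) is the multiset of nodes visited on walks that start from i
    walks = random_walks({0: [1, 2], 1: [0, 2], 2: [0, 1]})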
Graph embeddings
Sum up node embeddings
Introduce a “virtual node” to represent (connect) the (sub)graph
Anonymous walk
Represent the graph as a probability distribution over these walks (we need to decide how many random walks to sample)
Embed anonymous walks
1.3 Graph and Stochastic Matrix
PageRank
Intuition: the importance of a page is the sum of the importance flowing in from other pages. Given the stochastic adjacency matrix M and the rank vector r, the flow equations can be written as r = M · r
Stationary distribution of infinite random walk
Eigendecomposition: r is the principal eigenvector of M with eigenvalue 1.
We can solve r by power iteration.
Using teleports to deal with spider traps and dead ends
Google PageRank matrix (teleport with same probability)
\tilde{M} = \beta M + (1 - \beta) \left[\frac{1}{N}\right]_{N \times N}
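A minimal numpy sketch of power iteration on the teleporting matrix above; the adjacency matrix, β, and tolerance are hypothetical, and dead ends would additionally need their columns replaced by 1/N.

    import numpy as np

    A = np.array([[0, 0, 1],
                  [1, 0, 0],
                  [1, 1, 0]], dtype=float)   # hypothetical adjacency: A[j, i] = 1 if i links to j
    M = A / A.sum(axis=0)                    # column-stochastic adjacency matrix
    N = M.shape[0]
    beta = 0.85

    M_tilde = beta * M + (1 - beta) / N      # teleport to every node with equal probability
    r = np.full(N, 1.0 / N)
    for _ in range(100):
        r_next = M_tilde @ r                 # power iteration: r <- M~ r
        if np.abs(r_next - r).sum() < 1e-10:
            break
        r = r_next
    print(r)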
Random Walk with Restarts
Can be viewed as a special case of PageRank in which the teleport is always to the same node (start
node). It is a very powerful solution for computing “similarity” that considers:
Multiple connections
Multiple paths
Direct and indirect connections
Degree of the node
The above methods are connected to matrix factorization, but the optimization target might be different across algorithms (e.g., DeepWalk).
Limitations of node embeddings
Cannot deal with unseen nodes or evolving networks
Cannot capture structural similarity (node-level features)
Cannot utilize node, edge, and graph features
2 Graph Learning
2.1 Collective Classification
The goal is to use the network structure and labeled nodes to analyze unlabeled nodes (collective classification). The intuition is to exploit correlations in the network
Homophily: individual characteristics → social connections
Influence: social connections → individual characteristics
Relational classifier
Intuition: class probability of a node is a weighted average of class probabilities of its neighbors.
Algorithm: (1) initialize unlabeled nodes with random labels and (2) update sequentially until
convergence or maximum number of iterations based on
P(Y_i = c) = \frac{\sum_{(i,j) \in E} A_{ij} P(Y_j = c)}{\sum_{(i,j) \in E} A_{ij}}
Challenge: convergence is not guaranteed; cannot use node features
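A minimal numpy sketch of the relational classifier update above for a binary label; the graph, the initial probabilities, and the iteration count are hypothetical.

    import numpy as np

    A = np.array([[0, 1, 1, 0],
                  [1, 0, 1, 0],
                  [1, 1, 0, 1],
                  [0, 0, 1, 0]], dtype=float)   # hypothetical undirected graph
    p = np.array([1.0, 0.0, 0.5, 0.5])          # P(Y_i = 1); nodes 0 and 1 are labeled
    labeled = np.array([True, True, False, False])

    for _ in range(20):
        for i in np.where(~labeled)[0]:
            p[i] = A[i] @ p / A[i].sum()        # weighted average of neighbors' class probabilities
    print(p)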
Iterative classification
Intuition: improve by incorporating features
Algorithm: (1) train two classifiers ϕ_1(f_i) and ϕ_2(f_i, z_i) on the training set (a different dataset) to predict Y_i, where f_i is a vector of node features and z_i is a vector of the labels of i’s neighbors; (2) on the test set, set Y_j based on ϕ_1, compute z_j, and predict the labels with ϕ_2; (3) repeat (2) until convergence or the maximum number of iterations
Challenge: convergence is not guaranteed
Loopy belief propagation
Intuition: let nodes pass messages to each other
Algorithm:
m_{i \to j}(Y_j) = \sum_{Y_i \in \mathcal{L}} \psi(Y_i, Y_j)\, \phi_i(Y_i) \prod_{k \in N_i \setminus j} m_{k \to i}(Y_i)
where ψ(Y_i, Y_j) is the label-label potential, ϕ_i(Y_i) is the prior, \mathcal{L} is the label set, and \prod_{k \in N_i \setminus j} m_{k \to i}(Y_i) collects all messages sent by i’s neighbors (except j) in the previous round.
Challenge: convergence is not guaranteed, especially if there are many closed loops; requires training the label-label potential functions.
2.2 Graph Neural Networks
Notation
The vertex set V
The adjacency matrix (assume binary) A
A matrix of node features X ∈ R^{m×|V|}
A node i ∈ V; the set of neighbors of i: N(i)
Message-aggregation architecture
Message:
message from the neighboring nodes: m_i^{(l)} = \text{MSG}^{(l)}_{\text{other}}(h_i^{(l-1)}), e.g., m_i^{(l)} = W^{(l)} h_i^{(l-1)}
message from the node itself: m_j^{(l)} = \text{MSG}^{(l)}_{\text{self}}(h_j^{(l-1)}), e.g., m_j^{(l)} = B^{(l)} h_j^{(l-1)}
Aggregation:
h_j^{(l)} = \text{CONCAT}\left(\text{AGG}(\{m_i^{(l)}, i \in N(j)\}),\, m_j^{(l)}\right)
Graph convolutional networks
Architecture
Layer-0 transforms nodes to embeddings: h_i^{(0)} = x_i
Layer-l (0 ≤ l ≤ L−1) aggregates neighbor messages and incorporates non-linearity (e.g., ReLU)
h_j^{(l+1)} = \sigma\left(\frac{1}{|N(j)|} \sum_{i \in N(j)} W^{(l)} h_i^{(l)} + B^{(l)} h_j^{(l)}\right)
Matrix formulation: let H^{(l)} = [h_1^{(l)}, \ldots, h_{|V|}^{(l)}]^\top and D = \text{Diag}(D(1), \ldots, D(|V|)), where D(i) is the degree of node i. Then
H^{(l+1)} = \sigma(D^{-1} A H^{(l)} W^{(l)} + H^{(l)} B^{(l)})
Layer-L outputs the embedding of i after L layers of aggregation: z_i = h_i^{(L)}
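A minimal numpy sketch of the matrix-form GCN layer above; the graph, dimensions, and weights are hypothetical and untrained.

    import numpy as np

    rng = np.random.default_rng(0)
    A = np.array([[0, 1, 1],
                  [1, 0, 0],
                  [1, 0, 0]], dtype=float)          # hypothetical adjacency matrix
    D_inv = np.diag(1.0 / A.sum(axis=1))            # inverse degree matrix D^{-1}
    H = rng.normal(size=(3, 4))                     # H^(l): one row per node
    W = rng.normal(size=(4, 4))                     # neighbor-message weights W^(l)
    B = rng.normal(size=(4, 4))                     # self-message weights B^(l)

    H_next = np.maximum(0, D_inv @ A @ H @ W + H @ B)   # sigma = ReLU
    print(H_next.shape)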
Downstream tasks
Supervised (e.g., label prediction): \mathcal{L} = \sum_{i \in V} \text{Loss}(y_i, f(z_i))
Unsupervised (e.g., link prediction): \mathcal{L} = \sum_{i, j \in V} \text{CrossEntropyLoss}(y_{ij}, \text{Similarity}(z_i, z_j))
Inductive capability: can model unseen nodes
GraphSAGE
Two-stage aggregation
Stage 1: aggregate from the node’s neighbors: h_{N(j)}^{(l)} \leftarrow \text{AGG}(\{h_i^{(l-1)}, i \in N(j)\})
Pool: \text{AGG} = \text{MAX}(\{\text{MLP}(h_i^{(l-1)}), i \in N(j)\})
LSTM (sensitive to order, so applied to a random permutation π of the neighbors): \text{AGG} = \text{LSTM}([h_i^{(l-1)}, i \in \pi(N(j))])
Stage 2: aggregate over the node itself: h_j^{(l)} \leftarrow \sigma(W^{(l)} \cdot \text{CONCAT}(h_j^{(l-1)}, h_{N(j)}^{(l)}))
Stage 3 (optional): ℓ_2 normalization: h_j^{(l)} \leftarrow h_j^{(l)} / \|h_j^{(l)}\|_2
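A minimal numpy sketch of one GraphSAGE layer with the max-pool aggregator; the graph is hypothetical, the pooling MLP is reduced to a single untrained linear map, and σ is ReLU.

    import numpy as np

    rng = np.random.default_rng(0)
    adj = {0: [1, 2], 1: [0], 2: [0]}          # hypothetical graph
    H = rng.normal(size=(3, 4))                # h^(l-1) for each node
    W_pool = rng.normal(size=(4, 4))           # stands in for the pooling MLP
    W = rng.normal(size=(4, 8))                # W^(l), applied to CONCAT(self, neighborhood)

    H_next = np.zeros_like(H)
    for j, neighbors in adj.items():
        pooled = np.max(np.maximum(0, H[neighbors] @ W_pool), axis=0)  # stage 1: max-pool
        h = np.maximum(0, W @ np.concatenate([H[j], pooled]))          # stage 2: combine with self
        H_next[j] = h / (np.linalg.norm(h) + 1e-12)                    # stage 3: l2 normalization
    print(H_next)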
Graph isomorphism network
GCN and GraphSAGE could fail to distinguish certain graph structures (Xu et al., ICLR 2019)
GIN is the most expressive GNN in the class of message-passing GNNs
Two-MLP architecture:
\text{MLP}_{\Phi}\left(\sum_{x \in S} \text{MLP}_f(x)\right)
GIN is a “neural network” version of the WL graph kernel (Section 1.1)
c^{(k+1)}(i) = \text{GINConv}\left(c^{(k)}(i), \{c^{(k)}(j)\}_{j \in N(i)}\right) = \text{MLP}_{\Phi}\left((1 + \epsilon) \cdot c^{(k)}(i) + \sum_{j \in N(i)} c^{(k)}(j)\right)
where c is a differentiable color hash function.
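A minimal numpy sketch of the GINConv update above, with MLP_Φ reduced to a single hypothetical linear map plus ReLU.

    import numpy as np

    rng = np.random.default_rng(0)
    adj = {0: [1, 2], 1: [0], 2: [0]}          # hypothetical graph
    C = rng.normal(size=(3, 4))                # c^(k): current color embedding of each node
    W = rng.normal(size=(4, 4))                # stands in for MLP_Phi
    eps = 0.1

    C_next = np.zeros_like(C)
    for i, neighbors in adj.items():
        msg = (1 + eps) * C[i] + C[neighbors].sum(axis=0)   # (1 + eps) * own color + neighbor sum
        C_next[i] = np.maximum(0, msg @ W)                  # MLP_Phi reduced to one layer
    print(C_next)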
Graph attention networks
Intuition: use flexible, learned weights instead of fixed ones to implicitly assign different importance values to different neighbors.
h_j^{(l+1)} = \sigma\left(\sum_{i \in N(j) \cup \{j\}} \alpha_{ij} W^{(l)} h_i^{(l)}\right)
The attention weight α_{ij} is computed by normalizing the attention scores over the neighborhood of j:
\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k \in N(j) \cup \{j\}} \exp(e_{kj})}
and
e_{AB} = a(W^{(l)} h_A^{(l-1)}, W^{(l)} h_B^{(l-1)}), \quad \text{e.g.,} \quad e_{AB} = \text{Linear}\left(\text{Concat}(W^{(l)} h_A^{(l-1)}, W^{(l)} h_B^{(l-1)})\right)
We can also use multi-head attention to get multiple attention scores and then aggregate them by
concatenation or summation.
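A minimal numpy sketch of single-head attention scores for one target node; the graph, the shared weights W, and the scoring vector a are hypothetical and untrained.

    import numpy as np

    rng = np.random.default_rng(0)
    adj = {0: [1, 2], 1: [0], 2: [0]}          # hypothetical graph
    H = rng.normal(size=(3, 4))                # h^(l) for each node
    W = rng.normal(size=(4, 4))                # shared linear transform W^(l)
    a = rng.normal(size=8)                     # parameters of the Linear(Concat(...)) scoring function

    j = 0                                      # target node
    nbrs = adj[j] + [j]                        # attend over N(j) and j itself
    e = np.array([a @ np.concatenate([H[i] @ W, H[j] @ W]) for i in nbrs])  # scores e_ij
    alpha = np.exp(e) / np.exp(e).sum()        # softmax over the neighborhood
    h_next = np.maximum(0, (alpha[:, None] * (H[nbrs] @ W)).sum(axis=0))    # weighted aggregation
    print(alpha, h_next)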
Training tricks
General tricks
Batch normalization
Dropout: apply to W^{(l)}
Activation: PReLU performs better empirically
Train/test split
Transductive setting: only split the labels but the whole graph will be used in training
Inductive setting: break the edges between splits to get multiple graphs
Avoid oversmoothing
Carefully add GNN layers
Increase the complexity within GNN layers
Add layers between GNN layers that do not pass messages
Add skip connections (ResNet)
h_j^{(l+1)} = \sigma\left(\frac{1}{|N(j)|} \sum_{i \in N(j)} W^{(l)} h_i^{(l)} + B^{(l)} h_j^{(l)} + h_j^{(l)}\right)
Avoid pooling issues
Hierarchical pooling
DiffPool: two GNNs, one for node embeddings and another for clustering
2.3 Graph Augmentation
Implicit assumption: raw input graph = computational graph. But we might want to break this assumption when:
The input graph lacks features
The graph is too sparse → inefficient message passing
The graph is too dense → message passing is too costly
The graph is too large → high GPU requirements
Feature augmentation (graph lacks features)
Assign constant values to nodes (node fixed effects)
Assign one-hot encodings to nodes (good expressive power but low generalizability and scalability)
Other human generated graph features (Section 1.1)
Augment sparse graphs
Add virtual edges: use A + A^2 as the adjacency matrix
Add virtual nodes: the virtual node will connect to all the nodes in the graph
Node neighborhood sampling
Sample different neighbors in different layers