SEMI-SUPERVISED CLASSIFICATION
WITH GRAPH CONVOLUTIONAL
NETWORKS
Thomas N. Kipf, Max Welling (ICLR 2017)
Presented by Devansh Shah
Semi-Supervised Learning
Goal: Learn a better prediction rule than one based on labeled data alone
Why bother?
• Unlabeled data is cheap
• Labeled data can be hard to get
• human annotation is boring
• labels may require experts
Can Unlabeled data help?
• Assuming each class forms a coherent group (e.g. a Gaussian cluster)
• Adding unlabeled data shifts the estimated decision boundary toward the true one
Can Unlabeled data help?
“Similar” data points have “similar” labels
Semi-supervised vs transductive learning
• labeled data (Xl, Yl) = {(x1, y1), ..., (xl, yl)}
• unlabeled data Xu = {xl+1, ..., xn}, available during training
• test data Xtest = {xn+1, ...}, not available during training
Inductive learning must ultimately generalize to the test data.
Transductive learning is only concerned with labeling the given unlabeled data.
Graph Convolutional Networks
Applications
• Social Networks
• Protein-Protein Interaction
• 3D Meshes
• Clustering
• Scene Graphs
Graph Learning Problem
Inputs:
• graph G = (V ,E )
• A feature description xi for every node i, summarized in an N × D feature matrix X (N: number of nodes, D: number of input features)
• Adjacency matrix A
Outputs:
• node-level output Z (an N×F feature matrix, where F is the
number of output features per node)
Understanding Graph Neural Networks
Every graph neural network layer can be written as a non-linear function
H^(l+1) = f(H^(l), A), with
• H^(0) = X
• H^(L) = Z, where L is the number of layers
Understanding Graph Neural Networks
f(H^(l), A) = σ(A H^(l) W^(l)), where
• W^(l) is the weight matrix for the l-th layer
• σ(·) is a non-linear activation function such as ReLU
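This simple layer is small enough to sketch directly. Below is an illustrative NumPy version; the function name, the toy graph, and the choice of ReLU for σ are our own, not from the paper:

```python
import numpy as np

def gcn_layer_naive(H, A, W):
    """One naive GCN layer: f(H, A) = ReLU(A @ H @ W).

    Multiplying by A sums the feature vectors of each node's neighbors,
    then a linear map W and a ReLU non-linearity are applied.
    """
    return np.maximum(A @ H @ W, 0.0)

# Toy path graph 0-1-2 with 2 input and 2 output features
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
H = np.eye(3, 2)        # node feature matrix (N x D)
W = np.ones((2, 2))     # layer weight matrix (D x F)
Z = gcn_layer_naive(H, A, W)
print(Z.shape)          # (3, 2)
```

Note that each output row depends only on the neighbors' features, which is exactly the limitation discussed next.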
Understanding Graph Neural Networks
Limitation I:
• Multiplying by A sums, for every node, the feature vectors of all neighboring nodes but not the node's own features
Fix:
• Enforce self-loops in the graph by adding the identity matrix to A
Understanding Graph Neural Networks
Limitation II:
• A is typically not normalized, so multiplying by A completely changes the scale of the feature vectors
Fix:
• Normalize A so that all rows sum to one, i.e. use D^(-1) A, where D is the diagonal node degree matrix. Multiplying by D^(-1) A then corresponds to taking the average of neighboring node features
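A tiny numeric check of this averaging behavior (the toy graph and feature values are our own):

```python
import numpy as np

# Row-normalizing A averages neighbor features: each row of D^{-1} A sums to one.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)   # path graph 0-1-2
D_inv = np.diag(1.0 / A.sum(axis=1))     # inverse degree matrix
X = np.array([[1.0], [3.0], [5.0]])      # one scalar feature per node

print((D_inv @ A) @ X)   # node 1 receives (1 + 5) / 2 = 3
```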
Understanding Graph Neural Networks
Propagation Rule: f(H^(l), A) = σ(D̂^(-1/2) Â D̂^(-1/2) H^(l) W^(l))
• Â = A + I, where I is the identity matrix
• D̂ is the diagonal node degree matrix of Â
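Both fixes combined give this renormalized rule. A minimal NumPy sketch, assuming ReLU for σ (names are our own):

```python
import numpy as np

def gcn_layer(H, A, W):
    """GCN propagation rule: ReLU(D^{-1/2} (A + I) D^{-1/2} @ H @ W).

    Self-loops make each node see its own features; symmetric degree
    normalization keeps the feature scale stable across layers.
    """
    A_hat = A + np.eye(A.shape[0])               # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    A_norm = np.diag(d_inv_sqrt) @ A_hat @ np.diag(d_inv_sqrt)
    return np.maximum(A_norm @ H @ W, 0.0)
```

On a regular graph (all degrees equal), the rows of the normalized adjacency sum to one, so constant features pass through unchanged.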
Semi-Supervised Node Classification
Cross-entropy error over all labeled examples:
Z = softmax(H^(L))
Loss = − Σ_{l ∈ Y_L} Σ_{f=1}^{F} Y_lf ln Z_lf
• H^(L) is the output of the last layer
• Y_L is the set of node indices that have labels
• F is the number of distinct output classes
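The semi-supervised aspect lives entirely in this loss: only labeled nodes contribute. A hedged NumPy sketch (function and variable names are our own):

```python
import numpy as np

def masked_cross_entropy(H_last, Y, labeled_idx):
    """Cross-entropy over labeled nodes only.

    H_last: (N, F) last-layer outputs; Y: (N, F) one-hot labels;
    labeled_idx: indices of the nodes that have labels.
    """
    # row-wise softmax with the usual max-shift for numerical stability
    logits = H_last - H_last.max(axis=1, keepdims=True)
    Z = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    # sum -Y_lf * ln Z_lf over labeled rows only
    return -np.sum(Y[labeled_idx] * np.log(Z[labeled_idx]))
```

Unlabeled nodes still shape the prediction through the graph convolutions, even though they never appear in the loss.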
Experiments
Datasets
Experiments
Baselines
• Label Propagation (LP)
• Semi-Supervised embedding (SemiEmb)
• Manifold regularization (ManiReg)
• skip-gram based graph embeddings (DeepWalk)
• Iterative classification algorithm (ICA)
Experiments
Results
Robust Graph Convolutional Networks Against
Adversarial Attacks
Dingyuan Zhu, Ziwei Zhang, Peng Cui, Wenwu Zhu (ACM SIGKDD 2019)
Presented by Devansh Shah
Adversarial Attacks on Graphs
RELATED WORK
• Adversarial Attack on Graph Structured Data
• Adversarial Attacks on Neural Networks for Graph Data
Graph adversarial attack
Transductive Node Classification Setting
• A single graph G0 = (V0, E0) is shared across the entire dataset
• Each target node ci ∈ V0 is associated with a corresponding node label yi ∈ Y
• Test nodes (but not their labels) are also observed during training
• D^(tra) = {(G0, ci, yi)}, i = 1, ..., N
Graph adversarial attack
Problem Definition
Given:
• a learned classifier f
• an instance from the dataset (G, c, y) ∈ D
the graph adversarial attacker g(·, ·): G × D → G modifies the graph G = (V, E) into G̃ = (Ṽ, Ẽ) such that

max_G̃ 1(f(G̃, c) ≠ y)
s.t. G̃ = g(f, (G, c, y)), Eq(G, G̃, c) = 1

Here Eq(·, ·, ·): G × G × V → {0, 1} is an equivalency indicator that tells whether the two graphs G and G̃ are semantically equivalent
Graph adversarial attack
Robust Graph Convolutional Network (RGCN)
Crux of the paper
• Instead of representing nodes as vectors, nodes are represented as Gaussian distributions in each convolutional layer
• When the graph is attacked, the model can automatically absorb the effects of adversarial changes into the variances of the Gaussian distributions
• To curb the propagation of adversarial effects through the GCN, a variance-based attention mechanism is used when performing convolutions
Gaussian-based Graph Convolution Layer
Latent representation of node vi in layer l:
h_i^(l) = N(μ_i^(l), diag(σ_i^(l)))
• μ_i^(l) ∈ R^{f_l} is the mean vector
• diag(σ_i^(l)) ∈ R^{f_l × f_l} is the diagonal variance matrix
Notation:
• M^(l) = [μ_1^(l), ..., μ_N^(l)] ∈ R^{N × f_l} is the mean matrix
• Cov^(l) = [σ_1^(l), ..., σ_N^(l)] ∈ R^{N × f_l} is the variance matrix
RGCN
RGCN
Theorem
If x_i ~ N(μ_i, diag(σ_i)), i = 1, ..., n, and they are independent, then for any fixed weights w_i we have:
Σ_{i=1}^{n} w_i x_i ~ N(Σ_{i=1}^{n} w_i μ_i, diag(Σ_{i=1}^{n} w_i² σ_i))
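This is the standard closed form for a fixed-weight sum of independent Gaussians, and it is easy to verify numerically by Monte Carlo (the toy means, variances, and weights below are our own):

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0, 0.5])   # means of three independent Gaussians
var = np.array([0.5, 1.0, 2.0])   # their variances
w = np.array([0.2, 0.5, 0.3])     # fixed aggregation weights

# Draw many samples of the weighted sum sum_i w_i x_i
samples = rng.normal(mu, np.sqrt(var), size=(200_000, 3)) @ w

print(samples.mean())  # ~ w @ mu        = -0.65
print(samples.var())   # ~ (w**2) @ var  =  0.45
```

The variance term (with w_i squared) is what lets RGCN aggregate uncertainty, not just means, across neighbors.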
RGCN Node Aggregation
To prevent the propagation of adversarial attacks in GCNs, an attention mechanism assigns different weights to neighbors based on their variances: a larger variance indicates more uncertainty in the latent representation and a higher probability that the node has been attacked.

α_j^(l) = exp(−γ σ_j^(l))

Here α_j^(l) is the attention weight of node vj in layer l and γ is a hyper-parameter
RGCN Node Aggregation
μ_i^(l+1) = ReLU(Σ_{j ∈ ne(i)} (1 / √(D_ii D_jj)) (μ_j^(l) ⊙ α_j^(l)) W_μ^(l))
σ_i^(l+1) = ReLU(Σ_{j ∈ ne(i)} (1 / (D_ii D_jj)) (σ_j^(l) ⊙ α_j^(l) ⊙ α_j^(l)) W_σ^(l))
where ⊙ is the element-wise product and ne(i) is the set of neighbors of node vi
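The two updates can be sketched together in NumPy. This is our paraphrase of the layer, not the authors' code; in particular, the self-loop handling and variable names are our own assumptions:

```python
import numpy as np

def rgcn_layer(mu, sigma, A, W_mu, W_sigma, gamma=1.0):
    """One Gaussian-based graph convolution (sketch of the RGCN update).

    mu, sigma: (N, f) per-node means and variances; A: (N, N) adjacency.
    alpha = exp(-gamma * sigma) down-weights high-variance (possibly
    attacked) neighbors; variances use the squared attention weights.
    """
    A_hat = A + np.eye(A.shape[0])           # self-loops (our assumption)
    deg = A_hat.sum(axis=1)
    norm = 1.0 / np.sqrt(np.outer(deg, deg)) # 1 / sqrt(D_ii * D_jj)
    alpha = np.exp(-gamma * sigma)           # variance-based attention
    mu_agg = (A_hat * norm) @ (mu * alpha)
    var_agg = (A_hat * norm**2) @ (sigma * alpha**2)
    return np.maximum(mu_agg @ W_mu, 0.0), np.maximum(var_agg @ W_sigma, 0.0)
```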
Loss Functions
Since the hidden representations of the method are Gaussian distributions, a sampling step is first applied in the last hidden layer:
z_i ~ N(μ_i^(L), diag(σ_i^(L)))
Each z_i is then passed through a softmax function to get the predicted labels:
Ŷ = softmax(Z), Z = [z_1, ..., z_n]
L_cls is the cross-entropy loss between the actual labels and the predicted probabilities for the labelled nodes
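The slide does not spell out how the sampling is implemented; the standard choice, sketched below, is the reparameterization trick z = μ + √σ · ε with ε ~ N(0, I), which keeps the step differentiable with respect to μ and σ:

```python
import numpy as np

def sample_and_predict(mu_L, sigma_L, rng):
    """Sample z_i from the last-layer Gaussians, then softmax for labels.

    mu_L, sigma_L: (N, F) means and variances of the last hidden layer.
    """
    eps = rng.standard_normal(mu_L.shape)
    Z = mu_L + np.sqrt(sigma_L) * eps        # reparameterized sample
    logits = Z - Z.max(axis=1, keepdims=True)  # stable softmax
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)
```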
Loss Functions
To ensure that the learned representations are indeed Gaussian distributions, an explicit regularizer constrains the latent representations in the first layer:
L_reg1 = Σ_{i=1}^{n} KL(N(μ_i, diag(σ_i)) || N(0, I))
where KL(·||·) is the KL-divergence between two distributions.
An L2 regularizer is also imposed on the parameters of the first layer:
L_reg2 = ||W_μ^(0)||_2² + ||W_σ^(0)||_2²
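For diagonal Gaussians the KL term has a well-known closed form, so no sampling is needed to evaluate it. A sketch (function name is our own):

```python
import numpy as np

def kl_to_standard_normal(mu, sigma):
    """KL( N(mu, diag(sigma)) || N(0, I) ), summed over all nodes.

    Closed form for diagonal Gaussians:
    0.5 * sum(sigma + mu^2 - 1 - log(sigma)).
    mu, sigma: (N, f) arrays of means and (positive) variances.
    """
    return 0.5 * np.sum(sigma + mu**2 - 1.0 - np.log(sigma))
```

The term is zero exactly when every node's distribution is already N(0, I), and grows as means drift from zero or variances from one.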
Loss Functions
L = L_cls + β_1 L_reg1 + β_2 L_reg2
where β_1 and β_2 are hyper-parameters that control the impact of the two regularizers
Results
Node Classification on Clean Datasets
RGCN slightly outperforms the baseline methods on Pubmed,
while having comparable performance on Cora and Citeseer
Results
Against Non-targeted Adversarial Attacks
Results
Against Targeted Adversarial Attacks
Thank You!