SEMI-SUPERVISED CLASSIFICATION
WITH GRAPH CONVOLUTIONAL
NETWORKS
Thomas N. Kipf, Max Welling (ICLR 2017)
Presented by Devansh Shah
Semi-Supervised Learning
Goal: Learn a better prediction rule than one based on labeled data alone
Why bother?
• Unlabeled data is cheap
• Labeled data can be hard to get
• human annotation is boring
• labels may require experts
Can Unlabeled data help?
• Assuming each class forms a coherent group (e.g. a Gaussian cluster)
• Adding unlabeled data shifts the estimated decision boundary toward the true one
Can Unlabeled data help?
“Similar” data points have “similar” labels
Semi-supervised vs transductive learning
• labeled data (Xl, Yl) = {(x1, y1), ..., (xl, yl)}
• unlabeled data Xu = {xl+1, ..., xn}, available during training
• test data Xtest = {xn+1, ...}, not available during training
Inductive learning must ultimately generalize to the test data.
Transductive learning is only concerned with labeling the given unlabeled data.
Graph Convolutional Networks
Applications
• Social Networks
• Protein-Protein Interaction
• 3D Meshes
• Clustering
• Scene Graphs
Graph Learning Problem
Inputs:
• graph G = (V ,E )
• A feature description xi for every node i, summarized in an N × D feature matrix X (N: number of nodes, D: number of input features)
• Adjacency matrix A
Outputs:
• node-level output Z (an N×F feature matrix, where F is the
number of output features per node)
Understanding Graph Neural Networks
Every graph neural network layer can be written as a non-linear function
H^(l+1) = f(H^(l), A), with
• H^(0) = X
• H^(L) = Z, where L is the number of layers
Understanding Graph Neural Networks
f(H^(l), A) = σ(A H^(l) W^(l)), where
• W^(l) is the weight matrix for the l-th layer
• σ(·) is a non-linear activation function such as ReLU
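This simple layer is small enough to sketch directly. Below is an illustrative NumPy version; the function name, the toy graph, and the choice of ReLU for σ are our own, not from the paper:

```python
import numpy as np

def gcn_layer_naive(H, A, W):
    """One naive GCN layer: f(H, A) = ReLU(A @ H @ W).

    Multiplying by A sums the feature vectors of each node's neighbors,
    then a linear map W and a ReLU non-linearity are applied.
    """
    return np.maximum(A @ H @ W, 0.0)

# Toy path graph 0-1-2 with 2 input and 2 output features
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
H = np.eye(3, 2)        # node feature matrix (N x D)
W = np.ones((2, 2))     # layer weight matrix (D x F)
Z = gcn_layer_naive(H, A, W)
print(Z.shape)          # (3, 2)
```

Note that each output row depends only on the neighbors' features, which is exactly the limitation discussed next.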
Understanding Graph Neural Networks
Limitation I:
• Multiplying by A sums, for every node, the feature vectors of all neighboring nodes but not the node's own features
Fix:
• Enforce self-loops in the graph by adding the identity matrix to A
Understanding Graph Neural Networks
Limitation II:
• A is typically not normalized, so multiplying by A completely changes the scale of the feature vectors
Fix:
• Normalize A so that all rows sum to one, i.e. use D^(-1) A, where D is the diagonal node degree matrix. Multiplying by D^(-1) A then corresponds to taking the average of neighboring node features
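A tiny numeric check of this averaging behavior (the toy graph and feature values are our own):

```python
import numpy as np

# Row-normalizing A averages neighbor features: each row of D^{-1} A sums to one.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)   # path graph 0-1-2
D_inv = np.diag(1.0 / A.sum(axis=1))     # inverse degree matrix
X = np.array([[1.0], [3.0], [5.0]])      # one scalar feature per node

print((D_inv @ A) @ X)   # node 1 receives (1 + 5) / 2 = 3
```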
Understanding Graph Neural Networks
Propagation Rule: f(H^(l), A) = σ(D̂^(-1/2) Â D̂^(-1/2) H^(l) W^(l))
• Â = A + I, where I is the identity matrix
• D̂ is the diagonal node degree matrix of Â
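Both fixes combined give this renormalized rule. A minimal NumPy sketch, assuming ReLU for σ (names are our own):

```python
import numpy as np

def gcn_layer(H, A, W):
    """GCN propagation rule: ReLU(D^{-1/2} (A + I) D^{-1/2} @ H @ W).

    Self-loops make each node see its own features; symmetric degree
    normalization keeps the feature scale stable across layers.
    """
    A_hat = A + np.eye(A.shape[0])               # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    A_norm = np.diag(d_inv_sqrt) @ A_hat @ np.diag(d_inv_sqrt)
    return np.maximum(A_norm @ H @ W, 0.0)
```

On a regular graph (all degrees equal), the rows of the normalized adjacency sum to one, so constant features pass through unchanged.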
Semi-Supervised Node Classification
Cross-entropy error over all labeled examples:
Z = softmax(H^(L))
Loss = − Σ_{l ∈ Y_L} Σ_{f=1}^{F} Y_lf ln Z_lf
• H^(L) is the output of the last layer
• Y_L is the set of node indices that have labels
• F is the number of distinct output classes
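The semi-supervised aspect lives entirely in this loss: only labeled nodes contribute. A hedged NumPy sketch (function and variable names are our own):

```python
import numpy as np

def masked_cross_entropy(H_last, Y, labeled_idx):
    """Cross-entropy over labeled nodes only.

    H_last: (N, F) last-layer outputs; Y: (N, F) one-hot labels;
    labeled_idx: indices of the nodes that have labels.
    """
    # row-wise softmax with the usual max-shift for numerical stability
    logits = H_last - H_last.max(axis=1, keepdims=True)
    Z = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    # sum -Y_lf * ln Z_lf over labeled rows only
    return -np.sum(Y[labeled_idx] * np.log(Z[labeled_idx]))
```

Unlabeled nodes still shape the prediction through the graph convolutions, even though they never appear in the loss.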
Experiments
Datasets
Experiments
Baselines
• Label Propagation (LP)
• Semi-Supervised embedding (SemiEmb)
• Manifold regularization (ManiReg)
• skip-gram based graph embeddings (DeepWalk)
• Iterative classification algorithm (ICA)
Experiments
Results
Robust Graph Convolutional Networks Against
Adversarial Attacks
Dingyuan Zhu, Ziwei Zhang, Peng Cui, Wenwu Zhu (ACM SIGKDD 2019)
Presented by Devansh Shah
Adversarial Attacks on Graphs
RELATED WORK
• Adversarial Attack on Graph Structured Data
• Adversarial Attacks on Neural Networks for Graph Data
Graph adversarial attack
Transductive Node Classification Setting
• A single graph G0 = (V0, E0) is shared across the entire dataset
• Each target node ci ∈ V0 is associated with a corresponding node label yi ∈ Y
• Test nodes (but not their labels) are also observed during training
• D^(tra) = {(G0, ci, yi)}, i = 1, ..., N
Graph adversarial attack
Problem Definition
Given:
• a learned classifier f
• an instance from the dataset (G, c, y) ∈ D
the graph adversarial attacker g(·, ·): G × D → G modifies the graph G = (V, E) into G̃ = (Ṽ, Ẽ) such that

max_G̃ 1(f(G̃, c) ≠ y)
s.t. G̃ = g(f, (G, c, y)), Eq(G, G̃, c) = 1

Here Eq(·, ·, ·): G × G × V → {0, 1} is an equivalency indicator that tells whether the two graphs G and G̃ are semantically equivalent
Graph adversarial attack
Robust Graph Convolutional Network (RGCN)
Crux of the paper
• Instead of representing nodes as vectors, nodes are represented as Gaussian distributions in each convolutional layer
• When the graph is attacked, the model can automatically absorb the effects of adversarial changes into the variances of the Gaussian distributions
• To curb the propagation of adversarial effects through the GCN, a variance-based attention mechanism is used when performing convolutions
Gaussian-based Graph Convolution Layer
Latent representation of node vi in layer l:
h_i^(l) = N(μ_i^(l), diag(σ_i^(l)))
• μ_i^(l) ∈ R^{f_l} is the mean vector
• diag(σ_i^(l)) ∈ R^{f_l × f_l} is the diagonal variance matrix
Notation:
• M^(l) = [μ_1^(l), ..., μ_N^(l)] ∈ R^{N × f_l} is the mean matrix
• Cov^(l) = [σ_1^(l), ..., σ_N^(l)] ∈ R^{N × f_l} is the variance matrix
RGCN
RGCN
Theorem
If x_i ~ N(μ_i, diag(σ_i)), i = 1, ..., n, and they are independent, then for any fixed weights w_i we have:
Σ_{i=1}^{n} w_i x_i ~ N(Σ_{i=1}^{n} w_i μ_i, diag(Σ_{i=1}^{n} w_i² σ_i))
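This is the standard closed form for a fixed-weight sum of independent Gaussians, and it is easy to verify numerically by Monte Carlo (the toy means, variances, and weights below are our own):

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0, 0.5])   # means of three independent Gaussians
var = np.array([0.5, 1.0, 2.0])   # their variances
w = np.array([0.2, 0.5, 0.3])     # fixed aggregation weights

# Draw many samples of the weighted sum sum_i w_i x_i
samples = rng.normal(mu, np.sqrt(var), size=(200_000, 3)) @ w

print(samples.mean())  # ~ w @ mu        = -0.65
print(samples.var())   # ~ (w**2) @ var  =  0.45
```

The variance term (with w_i squared) is what lets RGCN aggregate uncertainty, not just means, across neighbors.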
RGCN Node Aggregation
To prevent the propagation of adversarial attacks in GCNs, an attention mechanism assigns different weights to neighbors based on their variances: a larger variance indicates more uncertainty in the latent representation and a higher probability that the node has been attacked.

α_j^(l) = exp(−γ σ_j^(l))

Here α_j^(l) is the attention weight of node vj in layer l and γ is a hyper-parameter
RGCN Node Aggregation
μ_i^(l+1) = ReLU(Σ_{j ∈ ne(i)} (1 / √(D_ii D_jj)) (μ_j^(l) ⊙ α_j^(l)) W_μ^(l))
σ_i^(l+1) = ReLU(Σ_{j ∈ ne(i)} (1 / (D_ii D_jj)) (σ_j^(l) ⊙ α_j^(l) ⊙ α_j^(l)) W_σ^(l))
where ⊙ is the element-wise product and ne(i) is the set of neighbors of node vi
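The two updates can be sketched together in NumPy. This is our paraphrase of the layer, not the authors' code; in particular, the self-loop handling and variable names are our own assumptions:

```python
import numpy as np

def rgcn_layer(mu, sigma, A, W_mu, W_sigma, gamma=1.0):
    """One Gaussian-based graph convolution (sketch of the RGCN update).

    mu, sigma: (N, f) per-node means and variances; A: (N, N) adjacency.
    alpha = exp(-gamma * sigma) down-weights high-variance (possibly
    attacked) neighbors; variances use the squared attention weights.
    """
    A_hat = A + np.eye(A.shape[0])           # self-loops (our assumption)
    deg = A_hat.sum(axis=1)
    norm = 1.0 / np.sqrt(np.outer(deg, deg)) # 1 / sqrt(D_ii * D_jj)
    alpha = np.exp(-gamma * sigma)           # variance-based attention
    mu_agg = (A_hat * norm) @ (mu * alpha)
    var_agg = (A_hat * norm**2) @ (sigma * alpha**2)
    return np.maximum(mu_agg @ W_mu, 0.0), np.maximum(var_agg @ W_sigma, 0.0)
```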
Loss Functions
Since the hidden representations of the method are Gaussian distributions, a sampling step is first applied in the last hidden layer:
z_i ~ N(μ_i^(L), diag(σ_i^(L)))
Each z_i is then passed through a softmax function to get the predicted labels:
Ŷ = softmax(Z), Z = [z_1, ..., z_n]
L_cls is the cross-entropy loss between the actual labels and the predicted probabilities for the labelled nodes
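The slide does not spell out how the sampling is implemented; the standard choice, sketched below, is the reparameterization trick z = μ + √σ · ε with ε ~ N(0, I), which keeps the step differentiable with respect to μ and σ:

```python
import numpy as np

def sample_and_predict(mu_L, sigma_L, rng):
    """Sample z_i from the last-layer Gaussians, then softmax for labels.

    mu_L, sigma_L: (N, F) means and variances of the last hidden layer.
    """
    eps = rng.standard_normal(mu_L.shape)
    Z = mu_L + np.sqrt(sigma_L) * eps        # reparameterized sample
    logits = Z - Z.max(axis=1, keepdims=True)  # stable softmax
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)
```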
Loss Functions
To ensure that the learned representations are indeed Gaussian distributions, an explicit regularizer constrains the latent representations in the first layer:
L_reg1 = Σ_{i=1}^{n} KL(N(μ_i, diag(σ_i)) || N(0, I))
where KL(·||·) is the KL-divergence between two distributions.
An L2 regularizer is also imposed on the parameters of the first layer:
L_reg2 = ||W_μ^(0)||_2² + ||W_σ^(0)||_2²
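For diagonal Gaussians the KL term has a well-known closed form, so no sampling is needed to evaluate it. A sketch (function name is our own):

```python
import numpy as np

def kl_to_standard_normal(mu, sigma):
    """KL( N(mu, diag(sigma)) || N(0, I) ), summed over all nodes.

    Closed form for diagonal Gaussians:
    0.5 * sum(sigma + mu^2 - 1 - log(sigma)).
    mu, sigma: (N, f) arrays of means and (positive) variances.
    """
    return 0.5 * np.sum(sigma + mu**2 - 1.0 - np.log(sigma))
```

The term is zero exactly when every node's distribution is already N(0, I), and grows as means drift from zero or variances from one.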
Loss Functions
L = L_cls + β_1 L_reg1 + β_2 L_reg2
where β_1 and β_2 are hyper-parameters that control the impact of the two regularizers
Results
Node Classification on Clean Datasets
RGCN slightly outperforms the baseline methods on Pubmed,
while having comparable performance on Cora and Citeseer
Results
Against Non-targeted Adversarial Attacks
Results
Against Targeted Adversarial Attacks
Thank You!