Exploiting Interaction Links for Node Classiﬁcation …...overcome sparsity in the adjacency...

Exploiting Interaction Links for Node Classificationwith Deep Graph Neural Networks

Hogun Park and Jennifer NevilleDepartment of Computer Science Purdue University

hogun nevillepurdueedu

AbstractNode classification is an important problem in re-lational machine learning However in scenarioswhere graph edges represent interactions amongthe entities (eg over time) the majority of cur-rent methods either summarize the interaction in-formation into link weights or aggregate the linksto produce a static graph In this paper we proposea neural network architecture that jointly capturesboth temporal and static interaction patterns whichwe call Temporal-Static-Graph-Net (TSGNet) Ourkey insight is that leveraging both a static neigh-bor encoder which can learn aggregate neighborpatterns and a graph neural network-based recur-rent unit which can capture complex interactionpatterns improve the performance of node clas-sification In our experiments on node classifica-tion tasks TSGNet produces significant gains com-pared to state-of-the-art methodsmdashreducing clas-sification error up to 24 and an average of 10compared to the best competitor on four real-worldnetworks and one synthetic dataset

1 IntroductionNode classification is a central task in relational machinelearning In complex network domains node classificationmethods have used different types of relational informationsuch as features of direct neighbors [Lu and Getoor 2003]and autocorrelation of class labels [Jensen et al 2004]While these and other methods have shown the effective-ness of using neighbor information to improve node clas-sification many real-world network datasets have sparselink structure which limits the amount of neighbor infor-mation available for classification This is particularly truefor networks constructed from interactions over time Re-cent work on low-dimensional node embeddings and neuralnetwork architectures for graphs [Kipf and Welling 2016Grover and Leskovec 2016] has shown promising results foraddressing the sparsity problem in node classification taskson static graphs In particular GCN [Kipf and Welling 2016]learns individual node embeddings by passing transformingand aggregating node feature information across neighbors inan end-to-end fashion

While many real world social network domains consist ofinteractions that are changing and evolving over time it canbe difficult to leverage the dynamics of temporal interactionsin node classification methods For example users developconnections in social networks while interacting and commu-nicating among themselves over time The temporal patternsof interactions could be potentially useful for predicting theirclass labels (eg political view) However when the interac-tion edges are very sparse in each temporal snapshot or whena nodersquos neighborhood is biased toward a particular class la-bel in different snapshots it may be difficult to identify tem-poral patterns that are useful for prediction In these casesstatic aggregated patterns may be more informative Our pro-posed model can not only learn temporal interaction patternsbut also model the aggregated neighborhood and jointly learnhow to combine the two views for node classification

Specifically in this paper we propose a novel deep neu-ral network architecture for sequences of interaction graphscalled TSGNet Our TSGNet model learns a temporal en-coder that leverages the strengths of both the GCN to discoverinteraction patterns at each temporal snapshot and a recurrentunit LSTM [Hochreiter and Schmidhuber 1997] to capturecomplex long-term dependencies To learn the temporal rep-resentation more efficiently we propose a mini-batch trainingvia importance sampling which reduces the recursive neigh-borhood expansion across layers and helps to decrease timecomplexity while maintaining performance In addition TS-GNet learns a second neighbor encoder representation from astatic summary of each nodersquos neighborhood The static andtemporal components of the model are jointly estimated tooptimize node classification performance For evaluation weconduct extensive experiments on both one synthetic and fourdifferent real-world networks with and without attributesand we observe significant performance gains compared tostate-of-the-art methods Moreover we conduct a careful ab-lation study to show that our architecture design is most ro-bust compared to models that use different static andor tem-poral components

2 Problem Definition21 MotivationFigure 1a-b show examples of interactions in complex net-works Each node and edge indicate an author and a co-

Figure 1 Examples of interactions over time (k) (a) User A and B have same neighbors when aggregated but different patterns over time(b) User A and B have same patterns over time but different neighborhood patterns when aggregated (c) Architecture for TSGNet

author event respectively Note that colors represent topiclabels yellow for NLP and grey for Database In Figure 1auser A and B have the same coauthors (neighbors) when theyare aggregated In this context the existing graph embeddingapproaches (eg Node2Vec [Grover and Leskovec 2016] orGCN [Kipf and Welling 2016]) will end up learning simi-lar latent representations for the two nodes However theirtemporal interaction patterns may be different When authorA collaborates with other authors differently over time com-pared to author B their representations should be modifiedaccordingly Meanwhile if node A and B show similar inter-action patterns over time then it may be difficult to determinethe correct class labels through temporal patterns only InFigure 1b although the temporal coauthoring patterns aroundauthor A and B are similar their neighborhoods on the aggre-gated graph are entirely different In that case using the ag-gregated neighborhoods (which correspond to static features)the class labels of node A and B could be identified

In this paper our goal is to jointly learn patterns in bothinteractions over time and in static neighbor sets In order tomodel both properties we propose a neural network modelTSGNet The details are described in Section 3

22 NotationWe define a graph sequence as a set of graphs such thatG = [G1 G2 Gm] Each Gk has the same set of nodesvi isin V where foralli isin [1 n] but a different set of edgesEk sube V timesV such that Gk = 〈VEk〉 If eij isin Ek there is anedge between vi and vj at time k otherwise there is not Al-ternatively let A = [A1A2 Am] be the set of adjacencymatrices for G where Ak[i j] = 1 if eij isin Ek 0 otherwiseWhile the network structure is changing over time we assumethat the node attributes are not changing over time1 Let F be

1The setting is realistic for many social and interaction graphsbecause available node attributes are from basic profiles resourcesor fixed properties so they are static or changing very slowly Forexample in Facebook profile attributes like gender and religious

the feature (attribute) set over the nodes Each vi isin V hasa corresponding feature vector fi isin F Y is the label setover the nodes Only a subset of the nodes vi sube V have aclass label yi isin R|C| where C is a set of class labels Thegoal is to learn a model from the partially labeled networkand use the model to make predictions y for the unlabelednodes vi st yi isin Y In this work we assume that Y canbe multi-labeled Moreover each prediction yi for vi has anestimated probability

3 TSGNet for Node ClassificationFigure 1c represents the overall architecture for TSGNet TheTSGNet is composed of (1) a static neighbor encoder and (2)multiple layers of GCN for modeling interaction graphs ateach time step k The details of each are described below

31 GCN LayersWe use GCN as a basic component for modeling each tempo-ral graph A = [A1A2 Am] Before the data is fed intothe GCN we use the symmetric normalizing trick describedin [Kipf and Welling 2016] Dk is the diagonal degree ma-trix of Ak + I and I is an Identity matrix

Ak = Dkminus12(Ak + I)Dk

minus12 (1)

Each GCN layer produces node-level output H(l+1)k isin

R|V |timesZ(l)

where |V | is the number of nodes and Z(l) is thesize of output representation per a node which is determinedby W

(l)k The outputs are generated at each time step k

H(l+1)k = f(H

(l)k Ak)

= (ReLU(AkH(l)k )W

(l)k ))

(2)

views are used as attributes and in IMDB pre-determined valueslike contents-rating and budgets are used as attributes of movies

Note that each GCN layer has its own W(l)k at each time

stamp and ReLU is used only in the first GCN layer More-over GCN originally uses the attribute matrix for H(1)

k[Kipf

and Welling 2016] In this paper we also support a non-attribute option where the identity matrix is used for H(1)

k Let L be the number of GCN layers for each time step

then the final output for each time step will be an input for anLSTM cell Ie H(L+1)

k for each time step k is the tempo-ral input for the kth cell in the LSTM sequence In Eq (3)below oi returns the final output vector for vi in the LSTMThe output oi is projected a final time using the weight ma-trix Wlstm and bias vector blstm This will be added to theneighbor encoder for vi below If the interaction of vi endsbefore the last step m it still uses the same Wlstm and blstmto generate the output oprime

i

oi =LSTM(H(L+1)1m(vi))

oprimei =softmax(Wlstmoi + blstm)

(3)

32 Neighbor Encoder (NE)The neighbor encoder uses an aggregated matrix Aagg asinput Aagg is created by aggregating all elements in[A1A2 Am] and normalizing in the same way de-scribed above The static component reduces the dimension-ality for node vi from its neighbor vector in Aagg usingstacked fully-connected layers

NE(2)i = ReLU(W(1)(Aagg[i ]⊙ hi) + b(1))

=

NE(Lprime+1)i = softmax(W(Lprime)NE

(Lprime)i + b(Lprime))

(4)

where ⊙ refers to the Hadamard product hi = hij|V |j=1

Here if Aagg[i j] gt 0 hij = β (for β ge 1) Otherwisehij will be equal to 0 This Hadamard product is used toovercome sparsity in the adjacency matrix If β is 1 it isthe same as the adjacency matrix Otherwise it puts moreweight on non-zero elements in the matrix It is expected thatβ will make larger outputs and offset issues from sparsity Inexperiments we set β = 20 and using this we observe up to4 of improvement in accuracy

33 Addition LayerThe element-wise addition layer combines the outputs fromthe GCN-LSTM and the Neighbor Encoder Specifically wecompute vi as the element-wise addition of the outputs fromthe GCN-LSTM (Eq 3) and the Neighbor Encoder (Eq 4)

vi = NE(Lprime+1)i + oprime

i (5)

The addition layer enables a joint representation learnedfrom both the static and temporal neighborhoods around thenode An additional benefit is that the addition layer doesnot introduce extra parameters nor does it increase computa-tional complexity Then the output vi is put though anothersoftmax layer for classification

yi = softmax(Wfinalvi + bfinal) (6)

Algorithm 1 TSGNetrsquos mini-batched training (one epoch)

Generate a mini-batch set B from Vfor each mini-batch isin B do

Sample |S| vertices v1 vs isin B according to distri-bution q from mini-batchAssign H

(1)k = H

(1)k [S ]

For l = [1 Lminus1] assign A(l)k = Ak[S S]

Assign A(L)k = Ak[ S]

Initialize Aagg to |V |times |V | matrix of zerosfor each v isin S do

Assign Aagg[ v] = Aagg[ v]end forCompute the categorical cross-entropy in Eq (7)Update W

(l)k Wlstm W(1)(Lprime) b(1)(Lprime) Wfinal

and bfinal

end for

Here yi is the vector output of the softmax function and eachdimension yij represents the predicted probability of the cor-responding class j given the inputs For learning we usecategorical cross-entropy (over VL the set of labeled nodes)as a loss function at the final layer

L = minus983131

iisinVL

|C|983131

j

yij log(yij) (7)

Since all activation functions are differentiable learning issimply done via back-propagation

34 Importance SamplingNote that in Eq (2) the neighbor aggregation for node u

is computed as

(AkH(l)k )u = |V |

|V |983131

v=1

1

|V |Ak[u v]H(l)k [v ] (8)

which involves a sum over all other nodes in the graph Whenwe use multiple GCN layers the recursive neighborhood ex-pansion across layers poses time and memory challenges fortraining with large graphs To overcome this limitation wepropose a method for efficient sampling-based learning

Similar to [Chen et al 2018] we approximate the equa-tion above with importance sampling First we sam-ple a set of S nodes using an importance distributionbased on the overall number of interactions q(v) =

||Aagg[ v]||2983123

vprimeisinV ||Aagg[ vprime]||2 Then given the sam-

pled set S we set A(l)k = Ak[S S] isin R|S|times|S| for all layers

l lt L and A(l)k = Ak[ S] isin R|V |times|S| when l = L ie the

last GCN layer At the last layer GCN still returns the noderepresentation for all V Finally we approximate Eq (8) fornode u as follows

(AkH(l)k )u asymp |V |

|S|

|S|983131

v=1

1

q(v)A

(l)k [u v]H

(l)k [v ] (9)

Note the distribution q is only calculated once (ie beforetraining) given the normalized aggregated graph and the inputfeature matrix H(1)

k should also be updated according to S

via H(1)k = H

(1)k [S ]

Our overall mini-batched training procedure is describedin Algorithm 1 At every epoch all nodes are randomly di-vided to create a mini-batch set B which is composed ofmultiples of γ nodes We set γ = 1024 B provides a candi-date node set for sampling |S| later When it comes to a newmini-batch A(1)(L)

k is induced from Ak according to theS Similarity the input matrix of neighbor encoder Aagg isreplaced by Aagg The inner for-loop for Aagg retains onlythe edges in S but maintains the dimensionality of the NE

35 Complexity AnalysisRecall that m is the number of temporal GCNs and |F | isthe number of node attributes Then assuming the number ofhidden units in LSTM GCN and NE is constant the compu-tational complexity of TSGNet is O(|V | +m|F ||E|) whereO(|V |) is from neighbor encoder and O(m|F ||E|) is from theset of temporal GCNs Because m and |F | are typically muchsmaller than |E| the time complexity is linear in the numberof edges ie O(|E|) With importance sampling the com-plexity becomes O(|V | + |ES |) where |ES | is the numberof edges induced in S When the non-attribute option is cho-sen the complexity is still O(|V | + |ES |) because the inputidentity matrix is sparse thus |F | = 1 in the sparse represen-tation and can be considered as a constant with sparse-densematrix multiplication

4 Related WorkSupervised node classification There is some work on re-lational models that consider temporal patterns to improvenode classification TVRC [Sharan and Neville 2008] at-tempts to model temporal structures through a two-step pro-cess The key idea behind the TVRC is to model the temporalpatterns through an exponential weight decay kernel wherethe implicit assumption is that network structure in recent pastis more important than the structure in the earlier past Thework was extended to ensemble model in [Rossi and Neville2012] but it is still limited in learning complex temporal in-teraction patterns because it heavily relies on the graph sum-marization with kernel-based edge weighting This limits themodels ability to learn more diverse temporal interactionsand temporal information is lost when collapsing graphs Inaddition DDRC [Park et al 2017] proposed a convolutionalneural network architecture with max pooling for node classi-fication which models temporal interactions among a nodersquosneighbors DDRC shows stable performance in spite of dif-ferent variability of neighbor vectors However its effective-ness was partially shown in long and relatively denser graphsequences Our TSGNet is evaluated from more diverse andlarger graph datasets and shows better performance

Dynamic node embedding Recently dynamic networkembedding approaches were proposed by [Zhang et al 2018Zhu et al 2018 Ma et al 2018] which use spectral up-dates over time for general relational tasks However they

No Attributes Static Node Attrs

Dynamic Edges TSGNetDynamicTriad TSGNet DDRC

Static Edges Node2Vec LR NN NN GraphSAGEGCN ASNE LR

No Graph not applicable LR NN

Table 1 Models categorized wrt the types of relational inputs

are evaluated in a synthetic setting where two temporal snap-shots are created by assigning a random timestamp to eachedge Moreover attributes are also not exploited for learningtemporal representations In addition CTDNE [Nguyen etal 2018] learns node representations using temporal randomwalks While the method shows promising results on linkprediction tasks it is still limited for learning attributes ofnodes and is under-explored to supervised node classificationtasks DynamicTriad [Zhou et al 2018] also attempted tolearn node evolution through representation learning for gen-eral relational learning tasks using multiple temporal graphsnapshots but their effectiveness on node classification is stillnot clear For example they are limited to specific types ofevolution strategies and attributes are also not used

Model categorization Table 1 categorizes TSGNet andother baseline models according to the types of relationalinformation they use as inputs to their models (wrt edgesand attributes) TSGNet uses dynamic interaction edges withand without attributes In our experiments we compare TS-GNet to all the listed models Logistic Regression (LR) andMulti-layered Neural Network (NN) Node2Vec [Grover andLeskovec 2016] and an attributed node embedding ASNE[Liao et al 2018] are also employed to model static edgesSee Section 5 for more detail The colors in the table willbe used later in the experiments to highlight the performanceachieved using each type of relational input

5 Experimental Evaluation51 DataWe use four real-world network datasets for evaluation Ta-ble 2 reports brief statistics for each network

Facebook The Facebook network was scraped from thePurdue University network [Pfeiffer et al 2015] Each user(node) is associated with political views for their class labelsAn edge is formed when a user writes a post to his or herfriendrsquos wall Users who post more than once a week for atleast 8 weeks are chosen A time window is defined as twoweeks Node attributes are religious views and gender

|V | |E| m |F | |C|Facebook 2716 22712 55 2 2DBLP 17191 318735 18 2997 2IMDB G 5043 43494 65 73 2IMDB R 92611 472630 14 - 2

Table 2 Network data statistics m is the number of time windowsand other notations are from Section 22

DBLP For this dataset we extract a co-authorship networkfrom the papers published in DBLP from 2000-2017 Anedge is created to link two authors when they publish a papertogether with a time stamp based on the publication Thusnodes represent authors Publication venues are selected fromAINLP and DB conferences2 Authors are selected whenthey have at least 7 years of publication history and their classlabel is assigned as the area in which they publish the major-ity of their papers Node attributes correspond to term vectorsfrom the titles of each userrsquos published papersIMDB G (Gross Income) We use Kagglersquos IMDB (Inter-net Movie Database) 5000 movie dataset An edge is formedwhen two movies share an actor or actress at each year Allmovies have at least 2 temporal edges A movie has a positivelabel if the movie gross is larger than 10 million dollars Forthis work we choose budgets content rating the number offaces in a movie poster and genres as features The budgetsare quantized from 0 to 9 using percentiles Each feature istransformed into one-hot encoding representationIMDB R (Rating) This dataset is from the whole IMDBdatabase3 and all participants including actors and writersare imported An edge is formed when two movies share anycrew at each year When a moviersquos rating is larger than 70it is chosen as a positive label The periods of all movies arefrom 2005 and 2018 There are many missing values when allmovies are considered so node attributes are ignored in thisdataset All movies have at least 11 temporal edges

52 Comparison ModelsLogistic regression (LR) Logistic regression is performedusing neighbor vectors with L1 regularization This allowsus to compare how relational and temporal patterns improveperformance The aggregated (binary or degree normalized(weighted)) graph of all temporal graphs is used for trainingLSTM We use the TSGNetrsquos input representation for theLSTM but the GCN layers and the neighbor encoder are notused in the architecture For inputs the first-hop neighbors ateach time window are fed directly into the LSTM layerGCN and GraphSAGE To compare TSGNet with Graphneural networks GCN [Kipf and Welling 2016] and Graph-SAGE [Hamilton et al 2017] are evaluated with the aggre-gated (binary) static graph input Attributes are used in thesame way with TSGNet We used an LSTM aggregator forGraphSAGE Because other aggregators for the GraphSAGEsuch as GCN mean and pool are worse than LSTM the re-sults are not reported hereNode2Vec For learning a static node embedding methodNode2Vec we set d = [16 32 64] r = 10 l = 80 k = 10and p and q were searched over [05 1 2] The aggregated(binary) matrix is used for its trainingASNE ASNE [Liao et al 2018] is a recent attributednode embedding method We used the same hyper-parametersearch criteria as in [Liao et al 2018]

2AINLP IJCAI AAAI SIGIR ECIR CLEF CHIIR AIRSACL EMNLP and COLING DB ICDE VLDB SIGMODPODSand EDBT

3The IMDB dataset was downloaded November 2018

Figure 2 Dense Block Model used to generate synthetic networksSparse Block model is 10 times sparser than the Dense Block Model

DDRC DDRC [Park et al 2017] is a CNN-baed temporalclassifier which considers interactions over time This doesnot have a neighbor encoder or GCN component The inputsare used as in LSTM above

Multi-layered neural network (NN) For NN the neigh-bor encoder of TSGNet is used for training and testing

TempGCN (GCN+LSTM) This is a version of the TS-GNet without the neighbor encoder (NE) component wherewe use GCN to model the temporal graphs with an LSTM

DynamicTriad A dynamic node embedding method Dy-namicTriad is also tested with all combinations of param-eters as in [Zhou et al 2018] Both unweighted graphsand weighted graphs were used for learning and edges areweighted by the number of common neighbors

53 Evaluation MethodologyEvery result we report is the average of 10 trials using ran-domly shuffled node sets Note that the entire graph is knownbefore learning and 70 20 and 10 of node labels areused for training testing and validation respectively If theaccuracy on the validation dataset does not increase duringfive epochs learning stops We also use dropout regulariza-tion (02) and rectified linear units for activation functionsFor optimization we use the adam optimizer [Kingma andBa 2014] to update variables For TSGNet LSTM andGCN the number of hidden nodes is searched over [16 3264 128] and the numbers of hidden nodes in the neighborencoder are 512 128 and 16 at each layer In addition thereare three GCN layers for TSGNet For importance samplingthe sampling size |S| is chosen from [16 32 128 256 512]

54 Results Synthetic DataTo evaluate the concept of TSGNet we generate syntheticdata from two simplified-Dynamic Stochastic Block Models(DSBM) [Yang et al 2011] to evaluate our model We setthe number of users to 100 and the length of time-windowsfor each node is determined by 25+ uniform(0 1)times 25 Thefirst DSBM (Dense Block Model) is composed of 4 differentpartitions P1 P4 at time(k)2 == 1 Each partition iscomposed of 50times50 nodes In all other time-windows edgesare generated from 9 partitions P prime

1 P prime9 Each partition has

different edge probabilities as in Figure 2 The second model(Sparse Block Model) is designed to generate sparse DSBMwith low probability All other conditions are same but eachprobability is 10 times sparser than the dense block modelFor class labels 0 is assigned for the first half of nodes (thussenders of P1 and P2) and 1 is set to the second half

Dense Block Model Sparse Block ModelTSGNet 089 (00299) 0884 (00282)DynamicTriad 10 (00) 0615 (00316)DDRC 10 (00) 0705 (0026)LSTM 10 (00) 0530 (00376)GCN 0519 (00272) 0504 (00277)Node2Vec 0494 (00360) 0771 (00260)LR 0429 (00164) 063 (02270)NN 0485 (00212) 0688 (00326)

Table 3 Classification accuracy on synthetic data Values in ( )denote standard errors Colors indicate type of relational input usedby the model from Table 1

Node classification accuracy (and standard errors) on thesynthetic data are shown in Table 3 Bold numbers indicatestatistically significant top results For data from both sparseand dense block models our proposed model exhibited goodperformance While DynamicTriad DDRC and LSTM showbetter performance than TSGNet in the dense data they areworse in the sparse block model Meanwhile other classi-fiers (GCN LR and NN) which were originally designed forstatic graphs showed the worse performance than TSGNetrsquosresults Node2Vec works well for the sparse data but it isworse than TSGNet (p-value lt 005 in paired t-test)

55 Results Real-world DataPerformance Without Node AttributesTable 4 shows the classification results for the four differentreal-world datasets Note that bolded numbers indicate sta-tistically significant top results (Weighted) in LR refers toversions where the input matrices of the corresponding meth-ods are normalized by the number of edges per each node Inthe experiment TSGNet exhibited the best performance overother alternatives for all datasets and shows comparable per-formance to TSGNet wo IS (Importance Sampling) Whilesimple static classifiers such as LR and NN return good per-formance for Facebook (FB) and DBLP due to the high cor-relation between neighbor vectors and class labels howeverthey are still worse than our TSGNet These characteristicsmake TempGCN more difficult to model the data because itis too complex to learn the simple neighborhood Despitethat the neighbor encoder component of the TSGNet helps itlearn the hidden dependencies among nodes and their staticneighborhood well As a result it produces a significant gainin performance DDRC and LSTM showed poor performancebecause the data is also very sparse DynamicTriads are bet-ter than GCN and GraphSAGE in IMDB G but still worsethan TempGCN and TSGNet Overall TSGNet produces anaverage reduction in classification error of 16 compared toGraphSAGE which is the best competitor

Figure 3 shows learning curves on the four datasets as wevary the amount of training data The learning curves com-pare the performance as the number of training nodes in-creases Note that the set of nodes for testing and valida-tion is same across all range of x-axis Although the numberof training nodes was controlled to calculate the supervisedloss the complete adjacency matrices at each time step forthe GCN layers were fixed for the experiment The experi-mental assumption was also applied to all other alternatives

TempGCN Node2Vec and LR For Node2Vec the competenetwork structure is known for learning representation andthe number of nodes is controlled when its supervised clas-sifier is trained Therefore all results with the small trainingdata were not poor In the Facebook and DBLP datasets TS-GNet was consistently better than the others For IMDB Gdataset TSGNet improved in performance as the size of train-ing set increased

Performance With Node AttributesTable 5 shows classification results when node attributes areincorporated into the models The result for TSGNet with at-tributes was better than the other alternatives which used at-tributes in their input Moreover the performance of TSGNetwithout attributes was even better than the result of the bestmodel which uses attributes Note that for the DDRC with-out attributes an identity matrix is concatenated to the inputneighbor vector This result indicates that it can learn a goodrepresentation with only the structural interactions (attr +neighbor) refers to a concatenated input including both at-tribute and neighborhood vectors ASNE LR and NN withthe new input show good results in general but they are worsethan TSGNet ASNE performed poorly on DBLP because itcould not utilize labels to learn the embedding Also GCNdid not work well both with and without attributes GCN isbased on a 1-layer perceptron which is not a universal ap-proximator [Hornik 1991] The 1-layer perceptron in theGCN works like a linear mapping so the layers may degen-erate into simply summing over neighborhood features [Xuet al 2019] With this reason GraphSAGE with LSTM ag-gregator can model interaction better than GCN for Facebookand DBLP Overall TSGNet with or wo attributes reducesclassification error up to 24 and produces an average reduc-tion in classification error of 10 compared to GraphSAGE

Temporal Sequence Randomization Impact onPerformanceTo see the effect of temporal sequencersquos randomization thetime-windows were randomly shuffled and used for train-ing The order of words in language models for NLP andspeech recognition is quite important to represent sentencesbut the temporal order of social interactions could be reversed

FB DBLP IMDB G IMDB RTSGNet 068 097 078 078TSGNet wo IS 0688 097 0786 0771TempGCN 0646 0734 077 0591DynamicTriad W 0542 0652 0732 0657DynamicTriad 0534 0633 0730 0645DDRC 0554 0542 0717 -LSTM 0514 0538 0696 -GraphSAGE 0645 0963 0712 0752GCN 0521 0665 0719 0568Node2Vec 0515 096 07 0768NN 0623 083 0716 0726LR 0593 0939 0699 0665LR W 0613 0955 0689 0673

Table 4 Classification accuracy on real-world datasets Colors indi-cate relational input type from Table 1 Results of DDRC and LSTMfor IMDB R are ignored due to the learning time limit (ge1 day)

(a) Facebook (b) DBLP (c) IMDB G (d) IMDB R

Figure 3 Learning curves for each dataset as amount of training data is varied

and often spontaneously happen In this case the random-ized temporal sequences are likely to represent another in-stance of evolution As can be seen in Table 6 TSGNet andTempGCN also work well given the randomized inputs andare not significantly different from the results of original in-puts These results may be interpreted in the context of recentwork on Janossy pooling [Murphy et al 2019] for learningpermutation invariant functions with LSTMs via randomiza-tion The fact that the randomized inputs work well withinour LSTM architecture may indicate that the model is learn-ing a temporally-invariant function over the interactions

Ablation Study of Model ComponentsTSGNet uses GCNs for learning temporal interactions and aNN neighbor encoder for learning the aggregated static first-neighbors However we could have chosen other architec-tures for either component Table 7 shows the results fordifferent variants of the architecture with the original com-ponents of TSGNet in the first row Note that we did not useimportance sampling to see the true effect of each compo-nent Instead of the GCN in TSGNet when we use regulardensely-connected NN its performances decreases in DBLPas shown in the second row of the table When the GCN ismissing in TSGNet like the last row of the table it also doesnot work well Similarly when the NN in TSGNet is replacedwith GCN layers or an empty layer we can observe the sig-nificant drop in Facebook and DBLP This indicates that ourNN-based neighbor encoder helps to jointly learn the tempo-ral networkrsquos interaction well if we use GCN layers

FB DBLP IMDB GWith Static AttributesTSGNet 0675 096 0777DDRC 0554 0938 0749GraphSAGE 0655 0967 0717GCN 0483 0881 0720ASNE 0525 0601 0734LR (attr + neighbor) 0664 096 0744NN (attr + neighbor) 0645 0955 0759LR (attr only) 063 0891 0756NN (attr only) 063 0886 0735

Table 5 Classification accuracy on real-world datasets with nodeattributes Colors indicate relational input type from Table 1

FB DBLP IMDB G IMDB RTSGNet 0688 097 0786 078TSGNet (R) 0679 096 0774 0772TempGCN 0646 0734 0771 0591TempGCN (R) 0658 0735 0750 0573DDRC 0554 0542 0717 -DDRC (R) 0573 054 0718 -LSTM 0514 0538 0696 -LSTM (R) 0480 053 0693 -

Table 6 Classification accuracy with different temporal inputs(R) denotes that the sequence of inputs is randomized

N-En T-En FB DBLP IMDB G IMDB RNN GCN 0688 097 0786 0771NN NN 0676 0953 0776 0711GCN GCN 0672 0652 0788 0707ndash GCN 0646 0734 0771 0591ndash NN 0647 0732 0769 0707GCN ndash 0521 0665 0719 0658NN ndash 0623 083 0716 0726

Table 7 Effect of joint learning with different approaches used forthe neighbor encoder (N-En) and the temporal encoder (T-En)

6 ConclusionsIn this paper we described TSGNet a neural network archi-tecture that can learn jointly from static and temporal neigh-borhood structure The architecture exploits the interactionsamong local neighbors over time by learning the temporalevolution of a low-dimensional embedding from a GCN andmodels its static neighborhood with a densely connected NNTSGNet is able to improve classification performance by uti-lizing both patterns in social interactions over time and the setof nodes in the aggregate relational neighborhood

AcknowledgmentsWe thank the anonymous reviewers for their useful com-ments This research is supported by NSF and AFRL undercontract numbers IIS-1546488 IIS-1618690 and FA8650-18-2-7879 The US Government is authorized to reproduceand distribute reprints for governmental purposes notwith-standing any copyright notation hereon

References[Chen et al 2018] Jie Chen Tengfei Ma and Cao Xiao

Fastgcn fast learning with graph convolutional networksvia importance sampling In Proceedings of InternationalConference on Learning Representations (ICLR) 2018

[Grover and Leskovec 2016] Aditya Grover and JureLeskovec node2vec Scalable feature learning fornetworks In Proceedings of SIGKDD InternationalConference on Knowledge Discovery and Data Miningpages 855ndash864 2016

[Hamilton et al 2017] William L Hamilton Rex Ying andJure Leskovec Inductive representation learning on largegraphs In Proceedings of Conference on Neural Infor-mation Processing Systems (NeurIPS) pages 1024ndash10342017

[Hochreiter and Schmidhuber 1997] Sepp Hochreiter andJurgen Schmidhuber Long short-term memory NeuralComputation 9(8)1735ndash1780 1997

[Hornik 1991] Kurt Hornik Approximation capabilitiesof multilayer feedforward networks Neural networks4(2)251ndash257 1991

[Jensen et al 2004] David Jensen Jennifer Neville andBrian Gallagher Why collective inference improves re-lational classification In Proceedings of SIGKDD Inter-national Conference on Knowledge Discovery and DataMining pages 593ndash598 2004

[Kingma and Ba 2014] Diederik Kingma and Jimmy BaAdam A method for stochastic optimization In Proceed-ings of International Conference on Learning Representa-tions (ICLR) 2014

[Kipf and Welling 2016] Thomas N Kipf and Max WellingSemi-supervised classification with graph convolutionalnetworks In Proceedings of International Conference onLearning Representations (ICLR) 2016

[Liao et al 2018] Lizi Liao Xiangnan He Hanwang Zhangand Tat-Seng Chua Attributed social network embeddingIEEE Transactions on Knowledge and Data Engineering30(12)2257ndash2270 2018

[Lu and Getoor 2003] Qing Lu and Lise Getoor Link-basedclassification In Proceedings of International Conferenceon Machine Learning (ICML) pages 496ndash503 2003

[Ma et al 2018] Jianxin Ma Peng Cui and Wenwu ZhuDepthlgp learning embeddings of out-of-sample nodes indynamic networks In Proceedings of AAAI Conference onArtificial Intelligence 2018

[Murphy et al 2019] Ryan L Murphy BalasubramaniamSrinivasan Vinayak Rao and Bruno Ribeiro Janossypooling Learning deep permutation-invariant functionsfor variable-size inputs In Proceedings of InternationalConference on Learning Representations (ICLR) 2019

[Nguyen et al 2018] Giang Hoang Nguyen John Boaz LeeRyan A Rossi Nesreen K Ahmed Eunyee Koh andSungchul Kim Continuous-time dynamic network em-beddings In Proceedings of BigNet Workshop in The WebConference (WWW) 2018

[Park et al 2017] Hogun Park John Moore and JenniferNeville Deep dynamic relational classifiers Exploitingdynamic neighborhoods in complex networks In Proceed-ings of MAISoN Workshop in International Conference onWeb Search and Data Mining (WSDM) 2017

[Pfeiffer et al 2015] Joseph J Pfeiffer III Jennifer Nevilleand Paul N Bennett Overcoming relational learning bi-ases to accurately predict preferences in large scale net-works In Proceedings of International World Wide WebConference (WWW) pages 853ndash863 2015

[Rossi and Neville 2012] Ryan Rossi and Jennifer NevilleTime-evolving relational classification and ensemblemethods In Proceedings of Pacific-Asia Conference onKnowledge Discovery and Data Mining (PAKDD) pages1ndash13 2012

[Sharan and Neville 2008] Umang Sharan and JenniferNeville Temporal-relational classifiers for prediction inevolving domains In Proceedings of International Con-ference on Data Mining (ICDM) pages 540ndash549 2008

[Xu et al 2019] Keyulu Xu Weihua Hu Jure Leskovec andStefanie Jegelka How powerful are graph neural net-works In Proceedings of International Conference onLearning Representations (ICLR) 2019

[Yang et al 2011] Tianbao Yang Yun Chi Shenghuo ZhuYihong Gong and Rong Jin Detecting communities andtheir evolutions in dynamic social networks ndash a bayesianapproach Machine learning 82(2)157ndash189 2011

[Zhang et al 2018] Ziwei Zhang Peng Cui Jian Pei XiaoWang and Wenwu Zhu Timers Error-bounded svd restarton dynamic networks In Proceedings of AAAI Conferenceon Artificial Intelligence 2018

[Zhou et al 2018] Lekui Zhou Yang Yang Xiang Ren FeiWu and Yueting Zhuang Dynamic network embeddingby modeling triadic closure process In Proceedings ofAAAI Conference on Artificial Intelligence 2018

[Zhu et al 2018] Dingyuan Zhu Peng Cui Ziwei ZhangJian Pei and Wenwu Zhu High-order proximity pre-served embedding for dynamic networks IEEE Transac-tions on Knowledge and Data Engineering 30(11)2134ndash2144 2018

Figure 1 Examples of interactions over time (k) (a) User A and B have same neighbors when aggregated but different patterns over time(b) User A and B have same patterns over time but different neighborhood patterns when aggregated (c) Architecture for TSGNet

author event respectively Note that colors represent topiclabels yellow for NLP and grey for Database In Figure 1auser A and B have the same coauthors (neighbors) when theyare aggregated In this context the existing graph embeddingapproaches (eg Node2Vec [Grover and Leskovec 2016] orGCN [Kipf and Welling 2016]) will end up learning simi-lar latent representations for the two nodes However theirtemporal interaction patterns may be different When authorA collaborates with other authors differently over time com-pared to author B their representations should be modifiedaccordingly Meanwhile if node A and B show similar inter-action patterns over time then it may be difficult to determinethe correct class labels through temporal patterns only InFigure 1b although the temporal coauthoring patterns aroundauthor A and B are similar their neighborhoods on the aggre-gated graph are entirely different In that case using the ag-gregated neighborhoods (which correspond to static features)the class labels of node A and B could be identified

In this paper our goal is to jointly learn patterns in bothinteractions over time and in static neighbor sets In order tomodel both properties we propose a neural network modelTSGNet The details are described in Section 3

22 NotationWe define a graph sequence as a set of graphs such thatG = [G1 G2 Gm] Each Gk has the same set of nodesvi isin V where foralli isin [1 n] but a different set of edgesEk sube V timesV such that Gk = 〈VEk〉 If eij isin Ek there is anedge between vi and vj at time k otherwise there is not Al-ternatively let A = [A1A2 Am] be the set of adjacencymatrices for G where Ak[i j] = 1 if eij isin Ek 0 otherwiseWhile the network structure is changing over time we assumethat the node attributes are not changing over time1 Let F be

1The setting is realistic for many social and interaction graphsbecause available node attributes are from basic profiles resourcesor fixed properties so they are static or changing very slowly Forexample in Facebook profile attributes like gender and religious

the feature (attribute) set over the nodes Each vi isin V hasa corresponding feature vector fi isin F Y is the label setover the nodes Only a subset of the nodes vi sube V have aclass label yi isin R|C| where C is a set of class labels Thegoal is to learn a model from the partially labeled networkand use the model to make predictions y for the unlabelednodes vi st yi isin Y In this work we assume that Y canbe multi-labeled Moreover each prediction yi for vi has anestimated probability

3 TSGNet for Node ClassificationFigure 1c represents the overall architecture for TSGNet TheTSGNet is composed of (1) a static neighbor encoder and (2)multiple layers of GCN for modeling interaction graphs ateach time step k The details of each are described below

31 GCN LayersWe use GCN as a basic component for modeling each tempo-ral graph A = [A1A2 Am] Before the data is fed intothe GCN we use the symmetric normalizing trick describedin [Kipf and Welling 2016] Dk is the diagonal degree ma-trix of Ak + I and I is an Identity matrix

Ak = Dkminus12(Ak + I)Dk

minus12 (1)

Each GCN layer produces node-level output H(l+1)k isin

R|V |timesZ(l)

where |V | is the number of nodes and Z(l) is thesize of output representation per a node which is determinedby W

(l)k The outputs are generated at each time step k

H(l+1)k = f(H

(l)k Ak)

= (ReLU(AkH(l)k )W

(l)k ))

(2)

views are used as attributes and in IMDB pre-determined valueslike contents-rating and budgets are used as attributes of movies



k[Kipf





i



(3)



=



(4)





i (5)






(1)k = H

(1)k [S ]






and bfinal

end for


L = minus983131

iisinVL

|C|983131

j

yij log(yij) (7)



is computed as

(AkH(l)k )u = |V |

|V |983131

v=1

1

|V |Ak[u v]H(l)k [v ] (8)



||Aagg[ v]||2983123






|S|

|S|983131

v=1

1

q(v)A

(l)k [u v]H

(l)k [v ] (9)



via H(1)k = H

(1)k [S ]













































































k[Kipf





i



(3)



=



(4)





i (5)






(1)k = H

(1)k [S ]






and bfinal

end for


L = minus983131

iisinVL

|C|983131

j

yij log(yij) (7)



is computed as

(AkH(l)k )u = |V |

|V |983131

v=1

1

|V |Ak[u v]H(l)k [v ] (8)



||Aagg[ v]||2983123






|S|

|S|983131

v=1

1

q(v)A

(l)k [u v]H

(l)k [v ] (9)



via H(1)k = H

(1)k [S ]













































































via H(1)k = H

(1)k [S ]












































































































































































































































Date post:	11-Mar-2020
Category:	Documents
Upload:	others
View:	3 times
Download:	0 times

Exploiting Interaction Links for Node Classiﬁcation …...overcome sparsity in the adjacency...

Documents