N-gram Graph: Simple Unsupervised Representation for Graphs, with
Applications to Molecules
Shengchao Liu, Mehmet Furkan Demirel, Yingyu Liang
University of Wisconsin-Madison
Machine Learning Progress
• Significant progress in Machine Learning
  • Computer vision, machine translation, game playing, medical imaging
ML for Molecules?
• Molecule property prediction
[Diagram: molecule → Machine Learning Model → prediction: Toxic / Not Toxic]
Challenge: Representations
• Input to traditional ML models: vectors
• How to represent molecules as vectors?
  • Fingerprints: Morgan fingerprints, etc. (see the sketch after this list)
  • Graph kernels: Weisfeiler-Lehman kernel, etc.
  • Graph Neural Networks (GNNs): Graph CNN, Weave, etc.
• Fingerprints/kernels: unsupervised, fast to compute
• GNNs: supervised end-to-end, more expensive, but powerful
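For concreteness, a minimal sketch of the fingerprint route using RDKit (assuming RDKit is installed; the molecule is an arbitrary example, not from the paper):

```python
# Minimal sketch: Morgan fingerprint for one molecule with RDKit.
# The SMILES string is an arbitrary example.
from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.MolFromSmiles("CCO")  # ethanol
# Radius-2 Morgan fingerprint, folded into a fixed-length 0/1 bit vector
fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)
features = list(fp)  # plain vector input for a traditional ML model
```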
Our method: N-gram Graphs
• Unsupervised
• Relatively fast to compute
• Strong prediction performance: overall better than traditional fingerprints/kernels and popular GNNs
• Inspired by the N-gram approach in Natural Language Processing
N-gram Approach in NLP
• An n-gram is a consecutive sequence of n words in a sentence
• Example: "this molecule looks beautiful"
  • Its 2-grams: "this molecule", "molecule looks", "looks beautiful"
• The n-gram count vector c(n) is a numeric representation of the sentence
  • Coordinates correspond to all possible n-grams
  • Each coordinate value is the number of times the corresponding n-gram appears in the sentence
• Example: c(1) is just the histogram of the words in the sentence (see the sketch below)
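A minimal Python sketch of n-grams and the count vector, using the example sentence above (the helper names are ours):

```python
from collections import Counter
from itertools import product

sentence = "this molecule looks beautiful".split()

def ngrams(words, n):
    # All consecutive length-n subsequences of the sentence
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

print(ngrams(sentence, 2))
# [('this', 'molecule'), ('molecule', 'looks'), ('looks', 'beautiful')]

# Count vector c(2): one coordinate per possible 2-gram over the vocabulary
vocab = sorted(set(sentence))
counts = Counter(ngrams(sentence, 2))
c2 = [counts[g] for g in product(vocab, repeat=2)]  # dimension m^2
```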
Dimension Reduction by Embeddings
• The n-gram vector c(n) is high-dimensional: m^n coordinates for a vocabulary of m words
• Dimension reduction by word embeddings: f(1) = W c(1), where the i-th column of W is the embedding vector for the i-th word in the vocabulary
• So f(1) is just the sum of the word vectors in the sentence!
For general n:
• Embedding of an n-gram: entrywise product of its word vectors
• f(n): sum of the embeddings of the n-grams in the sentence (see the sketch below)
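A minimal NumPy sketch of this construction, with a small random W standing in for trained word embeddings (the dimension r = 4 is only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
sentence = "this molecule looks beautiful".split()
vocab = {w: i for i, w in enumerate(sorted(set(sentence)))}

r = 4                                      # embedding dimension (illustrative)
W = rng.standard_normal((r, len(vocab)))   # column i: embedding of word i

def f(words, n):
    # f(n): sum over all n-grams of the entrywise product of their word vectors
    total = np.zeros(r)
    for i in range(len(words) - n + 1):
        g = np.ones(r)
        for w in words[i:i + n]:
            g *= W[:, vocab[w]]            # entrywise product
        total += g
    return total

# Sanity check: f(1) is the sum of the word vectors in the sentence
assert np.allclose(f(sentence, 1),
                   W[:, [vocab[w] for w in sentence]].sum(axis=1))
```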
N-gram Graphs
• Sentence: a linear graph on words
• Molecule: a graph on atoms with attributes
Analogy:
• Atoms with different attributes: different words
• Walks of length n: n-grams
[Figure: a molecular graph and its 2-grams]
N-gram Graph Algorithm
Given the embeddings for the atoms (vertex vectors):
• Enumerate all n-grams (walks of length n)
• Embedding of an n-gram: entrywise product of its vertex vectors
• f(n): sum of the embeddings of the n-grams
• Final N-gram Graph embedding f_G: concatenation of f(1), ..., f(T) (see the sketch below)
• Vertex vectors: trained by an algorithm similar to node2vec
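A brute-force NumPy sketch of this algorithm, under the assumption that an n-gram is any walk over n vertices along edges (the function name and walk convention are ours; the paper's exact convention may differ):

```python
import numpy as np

def ngram_graph_embedding(A, F1, T):
    # Brute-force sketch of the N-gram Graph embedding.
    # A:  (V x V) NumPy 0/1 adjacency matrix
    # F1: (r x V) vertex vectors, one column per vertex
    num_v = A.shape[0]
    r = F1.shape[0]

    def walks(n):
        # All directed walks visiting n vertices along edges
        if n == 1:
            return [[v] for v in range(num_v)]
        return [w + [v] for w in walks(n - 1)
                for v in range(num_v) if A[w[-1], v]]

    parts = []
    for n in range(1, T + 1):
        fn = np.zeros(r)
        for w in walks(n):
            fn += np.prod(F1[:, w], axis=1)  # entrywise product along the walk
        parts.append(fn)
    return np.concatenate(parts)             # f_G = [f(1); ...; f(T)]
```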
N-gram Graphs as Simple GNNs
• Efficient dynamic-programming version of the algorithm (see the sketch below)
• Given vectors f_i for the vertices i and the graph adjacency matrix A
• Equivalent to a simple GNN without parameters!
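A sketch of one natural dynamic-programming form consistent with this slide, assuming the recurrence F(1) = vertex vectors as columns, F(n) = (F(n-1) A) ⊙ F(1) (⊙ = entrywise product), and f(n) = row-sum of F(n):

```python
import numpy as np

def ngram_graph_dp(A, F1, T):
    # Dynamic-programming form: message passing with no learned weights.
    # F[:, v] accumulates the embeddings of all walks ending at vertex v.
    F = F1.copy()
    parts = [F.sum(axis=1)]          # f(1)
    for _ in range(2, T + 1):
        F = (F @ A) * F1             # extend each walk by one neighbor, then
                                     # multiply entrywise by the endpoint vector
        parts.append(F.sum(axis=1))  # f(n)
    return np.concatenate(parts)     # f_G = [f(1); ...; f(T)]
```

For a 0/1 adjacency matrix this computes the same f_G as the brute-force sketch above, but avoids enumerating walks explicitly.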
Experimental Results
• 60 tasks on 10 datasets from [1]
• Methods:
  • Weisfeiler-Lehman kernel + SVM
  • Morgan fingerprints + Random Forest (RF) or XGBoost (XGB)
  • GNNs: Graph CNN (GCNN), Weave Neural Network (Weave), Graph Isomorphism Network (GIN)
  • N-gram Graphs + Random Forest (RF) or XGBoost (XGB) (pipeline sketched below)
• Vertex embedding dimension r = 100, and T = 6
[1] Wu, Zhenqin, et al. "MoleculeNet: a benchmark for molecular machine learning." Chemical Science 9.2 (2018): 513-530.
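For context, a hypothetical version of the downstream step, with random placeholder features standing in for precomputed N-gram Graph embeddings (r = 100 and T = 6 give a 600-dimensional f_G):

```python
# Hypothetical pipeline: N-gram Graph embeddings as features for XGBoost.
# X and y below are random placeholders, not real data.
import numpy as np
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 600))   # placeholder embeddings (r * T = 600)
y = rng.integers(0, 2, 200)           # placeholder binary property labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = XGBClassifier(n_estimators=100).fit(X_tr, y_tr)
print(clf.score(X_te, y_te))
```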
Experimental Results
• N-gram+XGB: top-1 on 21 of the 60 tasks, and top-3 on 48
• Overall better than the other methods
Runtime
• Relatively fast
Theoretical Analysis
• Recall f(1) = W c(1)
  • W is the vertex embedding matrix
  • c(1) is the count vector
• With sparse c(1) and random W, c(1) can be recovered from f(1)
  • Well-known in compressed sensing
• In general, f(n) = M(n) c(n) for some linear mapping M(n) depending on W
• With sparse c(n) and random W, c(n) can be recovered from f(n) (see the sketch below)
• So f(n) preserves the information in c(n)
• Furthermore, we can prove: a regularized linear classifier on f(n) is competitive with the best linear classifier on c(n)
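A quick numerical illustration of the n = 1 recovery claim, using Lasso as a standard sparse-recovery solver (all dimensions and constants here are illustrative, not from the paper):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
m, r = 1000, 100                              # vocabulary size, embedding dim
W = rng.standard_normal((r, m)) / np.sqrt(r)  # random embedding matrix

c1 = np.zeros(m)                              # sparse count vector c(1)
support = rng.choice(m, size=5, replace=False)
c1[support] = rng.integers(1, 4, size=5)      # a few small positive counts

f1 = W @ c1                                   # observed embedding f(1) = W c(1)

# Sparse recovery of c(1) from f(1) via L1-regularized regression
est = Lasso(alpha=1e-3, positive=True, fit_intercept=False,
            max_iter=100000).fit(W, f1)
print(np.round(est.coef_[support]), c1[support])  # recovered vs. true counts
```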
THANK YOU!