
N-gram Graph: Simple Unsupervised Representation for Graphs, with Applications to Molecules

Shengchao Liu, Mehmet Furkan Demirel, Yingyu Liang

University of Wisconsin-Madison, Madison

Machine Learning Progress

• Significant progress in Machine Learning

• Examples: computer vision, machine translation, game playing, medical imaging

ML for Molecules?

• Molecule property prediction

[Diagram: molecule → Machine Learning Model → "Toxic" / "Not Toxic"]

Challenge: Representations

• Input to traditional ML models: vectors

• How to represent molecules as vectors?

• Fingerprints: Morgan fingerprints, etc. (a small sketch follows this list)

• Graph kernels: Weisfeiler-Lehman kernel, etc.

• Graph Neural Networks (GNN): Graph CNN, Weave, etc.

• Fingerprints/kernels: unsupervised, fast to compute

• GNN: supervised end-to-end, more expensive; powerful
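
As a concrete reference point for the fingerprint baseline, here is a minimal sketch of computing a Morgan fingerprint vector. The slides do not name a library; RDKit is just one common choice, and the SMILES string and parameters are illustrative.

```python
# Minimal sketch of one baseline representation: a Morgan fingerprint bit vector.
# RDKit is an assumption here (not named on the slide); "CCO" is a toy molecule.
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.MolFromSmiles("CCO")  # ethanol

# 2048-bit Morgan fingerprint with radius 2 (roughly ECFP4)
fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=2048)
x = np.array(fp)                 # fixed-length 0/1 vector, ready for RF / XGB / SVM
print(x.shape, int(x.sum()))
```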

Our method: N-gram Graphs

• Unsupervised

• Relatively fast to compute

• Strong prediction performance

• Overall better than traditional fingerprints/kernels and popular GNNs

• Inspired by the N-gram approach in Natural Language Processing

N-gram Approach in NLP

• An n-gram is a consecutive sequence of n words in a sentence

• Example: "this molecule looks beautiful"

• Its 2-grams: "this molecule", "molecule looks", "looks beautiful"

• The n-gram count vector c(n) is a numeric representation vector

• Its coordinates correspond to all possible n-grams

• Each coordinate value is the number of times the corresponding n-gram shows up in the sentence

• Example: c(1) is just the histogram of the words in the sentence (see the sketch below)
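
A minimal, self-contained sketch of the n-gram count vector c(n); the function name and the toy sentence are mine, chosen to mirror the slide's example.

```python
from collections import Counter

def ngram_counts(sentence, n):
    """Count the n-grams (consecutive sequences of n words) in a sentence."""
    words = sentence.split()
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

# c(1) is just the word histogram; c(2) counts consecutive word pairs
print(ngram_counts("this molecule looks beautiful", 1))
print(ngram_counts("this molecule looks beautiful", 2))
# -> {('this','molecule'): 1, ('molecule','looks'): 1, ('looks','beautiful'): 1}
```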

Dimension Reduction by Embeddings

• The n-gram count vector c(n) has high dimension: V^n for vocabulary size V

• Dimension reduction by word embeddings: f(1) = W c(1)

• W is the word embedding matrix: its i-th column is the embedding vector for the i-th word in the vocabulary

• f(1) is just the sum of the word vectors in the sentence!

For general n:

• Embedding of an n-gram: entrywise product of its word vectors

• f(n): sum of the embeddings of the n-grams in the sentence (see the sketch below)

N-gram Graphs

• Sentence: linear graph on words

• Molecule: graph on atoms with attributes

Analogy:

• Atoms with different attributes: different words

• Walks of length n: n-grams

[Figure: a molecular graph and its 2-grams]

N-gram Graph Algorithm

• Sentence: linear graph on words

• Molecule: graph on atoms with attributes

Given the embeddings for the atoms (vertex vectors):

• Enumerate all n-grams (walks of length n)

• Embedding of an n-gram: entrywise product of its vertex vectors

• f(n): sum of the embeddings of the n-grams

• Final N-gram Graph embedding f_G: concatenation of f(1), ..., f(T) (a brute-force sketch follows below)

• Vertex vectors: trained by an algorithm similar to node2vec
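
A minimal, brute-force sketch of the procedure described above (not the authors' released code): it enumerates walks explicitly and uses random vectors in place of the node2vec-style vertex embeddings. It counts every walk of n vertices whose consecutive vertices are adjacent, including walks that backtrack, which is a simplification and may differ in such details from the paper's exact definition.

```python
import numpy as np

def walks(adj, n):
    """All walks of n vertices in a graph given as an adjacency list (dict)."""
    if n == 1:
        return [[v] for v in adj]
    return [w + [u] for w in walks(adj, n - 1) for u in adj[w[-1]]]

def ngram_graph_embedding(adj, F, T):
    """Concatenate f(1), ..., f(T), where F[:, v] is the embedding of vertex v."""
    parts = []
    for n in range(1, T + 1):
        f_n = np.zeros(F.shape[0])
        for w in walks(adj, n):
            f_n += np.prod(F[:, w], axis=1)   # entrywise product along the walk
        parts.append(f_n)
    return np.concatenate(parts)

# Toy 4-atom chain; random vectors stand in for the trained vertex embeddings
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
F = np.random.default_rng(0).normal(size=(16, 4))
print(ngram_graph_embedding(adj, F, T=3).shape)   # (48,) = 16 * 3
```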

N-gram Graphs as Simple GNNs

• Efficient dynamic programming version of the algorithm (sketch below)

• Given vectors f_i for the vertices i and the graph adjacency matrix A

• Equivalent to a simple GNN without parameters!
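
A sketch of the dynamic-programming view, assuming the same setup as above (columns of F are vertex vectors, A is the 0/1 adjacency matrix). Each iteration is one parameter-free message-passing round: a neighbor sum followed by an entrywise product, so no weights are learned. Under these assumptions it computes the same walk-based sums as the brute-force sketch above.

```python
import numpy as np

def ngram_graph_dp(A, F, T):
    """Dynamic-programming sketch of the N-gram Graph embedding.

    F_n[:, v] accumulates the entrywise-product embeddings of all walks of
    n vertices that end at v; summing its columns gives f(n)."""
    parts, F_n = [], F.copy()
    parts.append(F_n.sum(axis=1))          # f(1): sum of the vertex vectors
    for _ in range(2, T + 1):
        F_n = (F_n @ A) * F                # extend each walk by one adjacent vertex
        parts.append(F_n.sum(axis=1))      # f(n)
    return np.concatenate(parts)

# Same toy 4-vertex chain, now as a dense adjacency matrix
A = np.array([[0., 1., 0., 0.],
              [1., 0., 1., 0.],
              [0., 1., 0., 1.],
              [0., 0., 1., 0.]])
F = np.random.default_rng(0).normal(size=(16, 4))
print(ngram_graph_dp(A, F, T=3).shape)     # (48,)
```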

Experimental Results

• 60 tasks on 10 datasets from [1]

• Methods:

• Weisfeiler-Lehman kernel + SVM

• Morgan fingerprints + Random Forest (RF) or XGBoost (XGB)

• GNNs: Graph CNN (GCNN), Weave Neural Network (Weave), Graph Isomorphism Network (GIN)

• N-gram Graphs + Random Forest (RF) or XGBoost (XGB) (a minimal fitting sketch follows below)

• Vertex embedding dimension r = 100, and T = 6

[1] Wu, Zhenqin, et al. "MoleculeNet: a benchmark for molecular machine learning." Chemical Science 9.2 (2018): 513-530.

Experimental Results

• N-gram+XGB: top-1 on 21 of the 60 tasks, and top-3 on 48

• Overall better than the other methods

Runtime

• Relatively fast to compute

Theoretical Analysis

• Recall f(1) = W c(1)

• W is the vertex embedding matrix

• c(1) is the count vector

• With sparse c(1) and random W, c(1) can be recovered from f(1)

• This is well known in compressed sensing (toy demo below)

• In general, f(n) = T(n) c(n) for some linear mapping T(n) that depends on W

• With sparse c(n) and random W, c(n) can be recovered from f(n)

• So f(n) preserves the information in c(n)

• Furthermore, we can prove: a regularized linear classifier on f(n) is competitive with the best linear classifier on c(n)
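
A toy illustration of the n = 1 recovery claim (my own construction, not from the paper): a sparse count vector c(1) is recovered from the much shorter f(1) = W c(1) with a standard sparse-recovery solver. The sizes and the choice of orthogonal matching pursuit are illustrative.

```python
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

rng = np.random.default_rng(0)
V, r, k = 1000, 100, 10                        # vocabulary size, embedding dim, sparsity

c1 = np.zeros(V)                               # sparse, non-negative count vector
support = rng.choice(V, size=k, replace=False)
c1[support] = rng.integers(1, 4, size=k)

W = rng.normal(size=(r, V)) / np.sqrt(r)       # random embedding matrix
f1 = W @ c1                                    # compressed representation, dim r << V

omp = OrthogonalMatchingPursuit(n_nonzero_coefs=k, fit_intercept=False).fit(W, f1)
print(np.allclose(omp.coef_, c1, atol=1e-6))   # should print True with high probability
```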

THANK YOU!