Page 1:

Data-driven dictionary definition for diverse document domains

Michael W. Mahoney

Yahoo! Research

http://www.cs.yale.edu/homes/mmahoney

(Joint work with P. Drineas, S. Muthukrishnan and others as listed.)

Page 2:

Modeling data and documents

People studying "documents," or data more generally:
• put the data onto a graph or into a vector space,
• even if the data don't naturally or obviously live there,
• and perform graph operations or vector space operations
• to extract information from the data.

Such data "documents" often have structure unrelated to the graphical or linear algebraic structure implicit in the modeling.
• This non-modeled structure is difficult to formalize.

Practitioners often have extensive field-specific intuition about the data.
• This intuition is often used to choose "where the data live."
• The choice of where the data live may capture non-modeled structure.

Page 3:

Documents modeled as matrices

Matrices often arise since n objects (“documents,” genomes, images, web pages), each with m features, may be represented by an m x n matrix A.

Such "documents" often have structure:
• for linear structure, SVD or PCA is often used,
• for non-linear structure, kernel (e.g., diffusion-based) methods are often used.

Note: We know what the rows/columns “mean” from the application area.

Goal: Develop principled, provably-accurate algorithmic methods such that:
• they are agnostic with respect to any particular field,
• one can fruitfully couple them to the field-specific intuition,
• they perform well on complex non-toy data "documents".

Page 4:

SVD of a matrix

Theorem: Any m x n matrix A can be decomposed as A = U Σ V^T, where:

U (V): orthogonal matrix containing the left (right) singular vectors of A.

Σ: diagonal matrix containing the singular values of A, ordered non-increasingly.

ρ: rank of A, the number of non-zero singular values.

Data Application: Principal Components Analysis (PCA) is just the SVD.

Complexity: Exact computation of the SVD takes O(min{mn², m²n}) time.

Page 5:

SVD and low-rank approximations

Theorem: Truncate the SVD by keeping the top k ≤ ρ terms: A_k = U_k Σ_k V_k^T.

This gives another matrix A_k of the same dimensions that is the "best" approximation to A among all rank-k matrices.

Interesting property (for future reference, note):

• The rows of U_k (= U_{A,k}) are NOT orthogonal and are NOT unit length.

• The lengths/Euclidean norms of the rows of U_k capture a notion of information dispersal.
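
A minimal numpy sketch of the truncation above (the matrix sizes, the rank, and the random data are illustrative, not from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 100, 60, 5
A = rng.standard_normal((m, k)) @ rng.standard_normal((k, n)) + 0.01 * rng.standard_normal((m, n))

# Full SVD: A = U @ diag(s) @ Vt, singular values ordered non-increasingly.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Truncated SVD: keep the top k terms to get the best rank-k approximation A_k.
Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k, :]
Ak = Uk @ np.diag(sk) @ Vtk

# Frobenius error of the truncation equals the root of the sum of the discarded
# squared singular values.
print(np.linalg.norm(A - Ak, 'fro'), np.sqrt(np.sum(s[k:] ** 2)))

# The ROWS of Uk are neither orthogonal nor unit length; their lengths measure
# how much each of the m objects "participates" in the top-k subspace.
row_lengths = np.linalg.norm(Uk, axis=1)
print(row_lengths.min(), row_lengths.max())   # values between 0 and 1
```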

Page 6:

Rows of left singular vectors

What do the lengths of the rows of the n x d matrix U = U_A "mean"?

Consider possible n x d matrices U of d left singular vectors:

• I_{n|k} = k columns from the identity: row lengths are 0 or 1; I_{n|k} x -> x.

• H_{n|k} = k columns from the n x n Hadamard (real Fourier) matrix: row lengths are all equal; H_{n|k} x -> maximally dispersed.

• U_k = k columns from any orthogonal matrix: row lengths are between 0 and 1.

The lengths of the rows of U = U_A correspond to a notion of information dispersal.
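
A small sketch contrasting the three cases above (the Hadamard matrix is built by the standard Sylvester recursion; n and k are illustrative):

```python
import numpy as np

n, k = 16, 4

# k columns of the identity: row lengths are exactly 0 or 1 (fully localized).
I_nk = np.eye(n)[:, :k]

# k columns of the n x n Hadamard matrix (Sylvester construction), scaled so the
# columns are orthonormal: all row lengths are equal (fully dispersed).
H = np.array([[1.0]])
while H.shape[0] < n:
    H = np.block([[H, H], [H, -H]])
H_nk = H[:, :k] / np.sqrt(n)

# k columns of a "generic" orthogonal matrix: row lengths fall between 0 and 1.
Q, _ = np.linalg.qr(np.random.default_rng(1).standard_normal((n, k)))

for name, U in [("identity", I_nk), ("Hadamard", H_nk), ("generic", Q)]:
    print(name, np.round(np.linalg.norm(U, axis=1), 3))
```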

Page 7:

Problems with SVD/Eigen-Analysis

Problems arise since structure in the data is not respected by mathematical operations on the data:
• Reification - maximum-variance directions are just that: directions of maximum variance, not necessarily meaningful objects in the application domain.
• Interpretability - what does a linear combination of 6000 genes mean?
• Sparsity - is destroyed by orthogonalization.
• Non-negativity - is a convex and not a linear algebraic notion.

The SVD gives a low-rank matrix approximation with a very particular structure (think: rotation with truncation, rescaling, rotation back up).

Question: Do there exist "better" low-rank matrix approximations?
• "better" structural properties for certain applications,
• "better" at respecting relevant structure,
• "better" for interpretability and informing intuition.

Page 8:

Dictionaries for document analysis

Discrete Cosine Transform (DCT):
• f_j = Σ_{n=0,…,N-1} x_n cos[π j (n + 1/2) / N]
• the basis is fixed.
• O(N²) or O(N log N) computation to determine the coefficients.

Singular Value Decomposition (SVD):
• A = Σ_{i=1,…,ρ} σ_i U^(i) V^(i)T = Σ_{i=1,…,ρ} σ_i A^[i]
• O(N³) computation to determine the basis and the coefficients.

Many other more complex/expensive procedures depending on the application.
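
As a concrete instance of the fixed-basis case, a direct numpy implementation of the DCT formula written above (the O(N²) version; the signal x is an arbitrary example):

```python
import numpy as np

def dct_coefficients(x):
    """f_j = sum_{n=0..N-1} x_n * cos(pi * j * (n + 1/2) / N), computed directly in O(N^2)."""
    N = len(x)
    n = np.arange(N)
    return np.array([np.sum(x * np.cos(np.pi * j * (n + 0.5) / N)) for j in range(N)])

x = np.sin(2 * np.pi * np.arange(32) / 32)   # example signal
f = dct_coefficients(x)
print(np.round(f[:8], 3))
```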

Question: Can actual data points and/or feature vectors be the dictionary?
• "Core-sets" on graphs.
• "CUR-decompositions" on matrices.

Page 9:

Dictionaries & the SVD

A = U Σ V^T = Σ_{i=1,…,ρ} σ_i U^(i) V^(i)T,

• where U(i),V(i) = eigen-cols and eigen-rows.

Approximate: A^(j) ≈ Σ_{i=1,…,k} z_{ij} U^(i)

• by min_{z_{ij}} || A^(j) - Σ_{i=1,…,k} z_{ij} U^(i) ||_2

Z = U_k^T A --> A ≈ A_k = (U_k U_k^T) A

• project onto the space of the top k eigen-cols.

Z = Σ_k V_k^T --> A ≈ A_k = U_k (Σ_k V_k^T)

• approximate every column of A in terms of a small number of eigen-rows and a low-dimensional encoding matrix Σ_k.

Page 10:

Dictionaries & columns and rows

A = CUR = Σ_{ij} u_{ij} C^(i) R^(j), where U = W^+ and W = intersection of C and R,

• where C(i),R(i) = actual-cols and actual-rows.

Approximate: A^(j) ≈ Σ_{i=1,…,c} y_{ij} C^(i)

• by min_{y_{ij}} || A^(j) - Σ_{i=1,…,c} y_{ij} C^(i) ||_2

Y = C^+ A --> A ≈ P_C A = (C C^+) A

• project onto space of those c actual-cols.

Y ≈ W^+ R --> A ≈ P_C A ≈ C (W^+ R)

• approximate every column of A in terms of a small number of actual-rows and a low-dimensional encoding matrix U = W^+.
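
A minimal sketch of the construction above: pick actual columns C and rows R, set U = W^+ where W is their intersection, and compare the CUR error to the best rank-k error. The matrix and the index choices are illustrative and do not reflect the sampling scheme discussed later:

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, k = 80, 60, 4
A = rng.standard_normal((m, k)) @ rng.standard_normal((k, n))   # exactly rank k

cols = rng.choice(n, size=8, replace=False)   # actual columns kept
rows = rng.choice(m, size=8, replace=False)   # actual rows kept

C = A[:, cols]                    # m x c
R = A[rows, :]                    # r x n
W = A[np.ix_(rows, cols)]         # r x c intersection of C and R
U = np.linalg.pinv(W)             # c x r middle matrix, U = W^+

A_cur = C @ U @ R                 # low-rank approximation built from actual rows/columns

u, s, vt = np.linalg.svd(A, full_matrices=False)
Ak = u[:, :k] @ np.diag(s[:k]) @ vt[:k, :]

print("CUR error:        ", np.linalg.norm(A - A_cur, 'fro'))
print("best rank-k error:", np.linalg.norm(A - Ak, 'fro'))
```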

Page 11:

CX and CUR matrix decompositions

[Figure: A ≈ C · U · R, with C a set of O(1) columns of A, R a set of O(1) rows of A, and U a carefully chosen small matrix.]

Def: A CX matrix decomposition is a low-rank approximation explicitly expressed in terms of a small number of columns of the original matrix A.

Def: A CUR matrix decomposition is a low-rank approximation explicitly expressed in terms of a small number of rows and columns of the original matrix A.

Page 12:

Problem formulation (1 of 3)

Consider (for now) just columns:

• Could ask to find the “best” k of n columns of A (by whatever measure-of-merit).

• Combinatorial problem - the trivial algorithm takes roughly n^k time.

• Probably NP-hard if k is not fixed.

Instead:

• Fix a rank parameter k.

• Let's over-sample columns by a little (e.g., k+3, 10k, k², etc.).

• Get close (additive error or relative error) to the "best" rank-k approximation.

Note: Error and over-sampling are computational resources to exploit algorithmically.

Page 13:

Problem formulation (2 of 3)

Ques: Do there exist O(k), or O(k²), or …, columns s.t.:

||A - CC^+A||_{2,F} ≤ ||A - A_k||_{2,F} + ε ||A||_F ?

Ans: Yes - and can find them in O(m+n) space and time after two passes over the data! (DFKVV99,DKM04)

Ques: Do there exist O(k), or O(k²), or …, columns s.t.:

||A - CC^+A||_{2,F} ≤ (1-ε)^{-1} ||A - A_k||_{2,F} + ε^t ||A||_F ?

Ans: Yes - and can find them in O(m+n) space and time after t passes over the data! (RVW05,DM05)

Ques: Do there exist, and can we find, O(k), or O(k²), or …, columns s.t.:

||A - CC^+A||_F ≤ (1+ε) ||A - A_k||_F ?

Ans: Yes - existential proof - no non-exhaustive algorithm given! (RVW05,DRVW06)

Ans: ...

Page 14:

Problem formulation (3 of 3)

Back to columns and rows:

Ques: Do there exist O(k), or O(k²), or …, columns and rows s.t.:

||A - CUR||_{2,F} ≤ ||A - A_k||_{2,F} + ε ||A||_F ?

Ans: Yes - lots of them, and we can find them in O(m+n) space and time after two passes over the data! (DK03, DKM04)

Note: "lots of them" since these are randomized Monte Carlo algorithms!

Ques: Do there exist O(k), or O(k²), or …, columns and rows s.t.:

||A - CUR||_F ≤ (1+ε) ||A - A_k||_F ?

Ans: …

Page 15:

Algorithm to select U, R, given C

Thm: Given C, in O(c²m + cmn) = O(mn) time, we can construct D and R s.t.:

||A - CUR||_F ≤ (1+ε) ||A - CC^+A||_F

holds with probability at least 1-δ. We need to pick r = O(c² log(1/δ) / ε²) rows.

Idea: approximate all columns of A as linear combinations of the “basis” columns in C.

Algorithm:

• Compute a good set of probabilities p_i summing to 1; % DETAILS COMING UP

• Pick r rows i_1, i_2, …, i_r of A w.r.t. the p_i in i.i.d. trials.

• Let R be the r x n matrix containing these rows;

• Let D_{tt} = 1/(r p_{i_t})^{1/2} for t = 1,…,r;

• Let W be the intersection of C and R;
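
A sketch of the sampling and rescaling steps above, taking the probabilities p_i as given (they are defined on the next slide); the sizes, the already-chosen C, and the uniform placeholder probabilities are illustrative, and the final line shows one way to assemble U from D and W (the later comparison slide says U is "(almost) W^+"):

```python
import numpy as np

def sample_rows(A, C, p, r, rng):
    """Pick r rows of A i.i.d. w.r.t. p, rescale by 1/sqrt(r * p_i), and return
    R, the diagonal rescaling D, and W (the rows of C at the same indices)."""
    m = A.shape[0]
    idx = rng.choice(m, size=r, replace=True, p=p)   # i.i.d. trials
    D = np.diag(1.0 / np.sqrt(r * p[idx]))           # D_tt = 1 / (r * p_{i_t})^{1/2}
    R = A[idx, :]                                    # r x n matrix of the sampled rows
    W = C[idx, :]                                    # intersection of C and the sampled rows
    return R, D, W

rng = np.random.default_rng(3)
A = rng.standard_normal((50, 30))
C = A[:, :6]                                  # pretend these 6 actual columns were already chosen
p = np.full(A.shape[0], 1.0 / A.shape[0])     # placeholder: uniform probabilities
R, D, W = sample_rows(A, C, p, r=12, rng=rng)

# One way to build the middle matrix from the rescaled intersection (an assumption).
A_cur = C @ np.linalg.pinv(D @ W) @ (D @ R)
print(np.linalg.norm(A - A_cur, 'fro'))
```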

Page 16:

Row-sampling probabilities p_i

• Let U = U_C be the orthogonal matrix containing the left singular vectors of C.

• Let U_(i) denote the i-th row of U.

NOTE: U_(i) is NOT unit-length and is NOT orthogonal to U_(j) in general.

• We can compute these probabilities in O(c²m + cmn) = O(mn) time.
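
The slide shows the probability formula as a figure; the sketch below uses the squared row norms of U_C, normalized to sum to 1, which is the natural reading of the two bullets above. Treat that choice as an assumption rather than the talk's exact expression:

```python
import numpy as np

def row_sampling_probabilities(C):
    """p_i proportional to the squared Euclidean norm of the i-th row of U_C,
    where U_C holds the left singular vectors of C (an assumption; the slide's
    exact formula was shown as an image)."""
    U_C, _, _ = np.linalg.svd(C, full_matrices=False)   # thin SVD of the m x c matrix C
    lev = np.sum(U_C ** 2, axis=1)                      # squared row lengths ("leverage")
    return lev / lev.sum()                              # normalize so the p_i sum to 1

rng = np.random.default_rng(4)
C = rng.standard_normal((50, 6))
p = row_sampling_probabilities(C)
print(p.sum(), p.min(), p.max())
```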

Page 17:

Algorithm to select C

Theorem: For any k, let A_k be the "best" rank-k approximation to A.

Then, in O(SVD(A)) time we can construct a matrix C consisting of c = O(k² log(1/δ) / ε²) columns of A s.t.:

||A - CC^+A||_F ≤ (1+ε) ||A - A_k||_F

holds with probability at least 1-δ.

(D., M., & Muthukrishnan ’06)

Idea: express all columns of A as linear combinations of the “basis” columns in C.

Algorithm:

• Compute a good set of probabilities p_i summing to 1; % DETAILS COMING UP

• Pick c columns of A w.r.t. the p_i in i.i.d. trials.

• Let C be the m x c matrix containing these columns;

Page 18:

Column-sampling probabilities p_i

• We can compute these probabilities in O(SVD(A)) time.

k: rank parameter input to the algorithm

ρ: rank of A

V_k: top k right singular vectors of A

Σ_{ρ-k}: bottom ρ-k singular values of A

V_{ρ-k}: bottom ρ-k right singular vectors of A

NOTE: In general, (V_k)_(i) is NOT unit-length and is NOT orthogonal to (V_k)_(j).
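
The probability formula itself appears as a figure on the slide. Given the quantities it lists (V_k, Σ_{ρ-k}, V_{ρ-k}), one natural reading is to mix the leverage scores of the top-k right singular vectors with the squared column norms of the residual A - A_k; the 50/50 mixture below is an assumption, not the talk's stated formula:

```python
import numpy as np

def column_sampling_probabilities(A, k):
    """Mix of (i) leverage scores w.r.t. the top-k right singular vectors and
    (ii) squared column norms of the residual A - A_k (assumed form)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    lev = np.sum(Vt[:k, :] ** 2, axis=0) / k                 # ||(V_k)_(j)||^2 / k
    resid = A - U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]        # A - A_k
    res = np.sum(resid ** 2, axis=0) / max(np.sum(resid ** 2), 1e-300)
    p = 0.5 * lev + 0.5 * res
    return p / p.sum()                                       # guard against floating-point drift

rng = np.random.default_rng(5)
A = rng.standard_normal((40, 25))
p = column_sampling_probabilities(A, k=3)
cols = rng.choice(A.shape[1], size=10, replace=True, p=p)    # pick c columns in i.i.d. trials
C = A[:, cols]
print(p.sum(), C.shape)
```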

Page 19:

Theorem: (relative error) CUR

Theorem: Fix any k, ε, δ. Then, there exists a Monte Carlo algorithm that uses O(SVD(A)) time to find C and R and construct U s.t.:

||A - CUR||_F ≤ (1+ε) ||A - A_k||_F

holds with probability at least 1-δ, by picking c = O( k² log(1/δ) / ε² ) columns and r = O( k⁴ log²(1/δ) / ε⁶ ) rows.

Proof: Really nice. We disentangle “subspace” information and “size-of-A” information to get relative error bound. Skip for now.

(Current theory work: we can improve the sampling complexity to c, r = O(k · poly(1/ε, 1/δ)).)

(Current empirical work: we can usually choose c,r ≤ k+4.)

(Don't worry about ε: choose ε = 1 if you want!)

Page 20:

Previous CUR-type decompositions

Goreinov, Tyrtyshnikov, & Zamarashkin (LAA '97, …)
• C: columns that span max volume
• U: W^+
• R: rows that span max volume
• Existential result; error bounds depend on ||W^+||_2; spectral norm bounds!

Berry, Stewart, & Pulatova (Num. Math. '99, TR '04, …)
• C: variant of the QR algorithm
• R: variant of the QR algorithm
• U: minimizes ||A-CUR||_F
• No a priori bounds; A must be known to construct U; solid experimental performance.

Williams & Seeger (NIPS '01, …)
• C: uniformly at random
• U: W^+
• R: uniformly at random
• Experimental evaluation; A is assumed PSD; connections to the Nystrom method.

D., M., & Kannan (SODA '03, TR '04, SICOMP '06)
• C: w.r.t. column lengths
• U: in linear/constant time
• R: w.r.t. row lengths
• "Sketching" massive matrices; provable, a priori bounds; explicit dependency on A - A_k.

D., M., & Muthukrishnan (TR '06)
• C: depends on singular vectors of A
• U: (almost) W^+
• R: depends on singular vectors of C
• (1+ε) approximation to A - A_k; computable in low polynomial time (suffices to compute SVD(A)).

(For details see Drineas & Mahoney, “A Randomized Algorithm for a Tensor-Based Generalization of the SVD”, ‘05.)

Page 21:

Nonnegative Matrix Factorization (NMF)

Problem definition:

Given an m x n matrix A with non-negative entries and a number c << n: find an m x c matrix W and a c x n matrix H such that all entries of W and H are non-negative and A ≈ WH.

Typical (non-convex) optimization objective: min_{W,H} || A - WH ||_F²

References: Paatero & Tapper, Chemometrics ’94

Lee & Seung, Nature ’00

A lot of recent work by M. Berry, B. Plemmons, P. Hoyer, etc.
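
A minimal sketch of the multiplicative-update heuristic popularized by Lee & Seung for the Frobenius objective above; the initialization, sizes, and iteration count are illustrative:

```python
import numpy as np

def nmf_multiplicative(A, c, iters=200, eps=1e-9, seed=0):
    """Alternately update W (m x c) and H (c x n), keeping all entries non-negative."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    W = rng.random((m, c))
    H = rng.random((c, n))
    for _ in range(iters):
        H *= (W.T @ A) / (W.T @ W @ H + eps)   # Lee-Seung update for H
        W *= (A @ H.T) / (W @ H @ H.T + eps)   # Lee-Seung update for W
    return W, H

A = np.abs(np.random.default_rng(6).standard_normal((30, 20)))   # non-negative input
W, H = nmf_multiplicative(A, c=5)
print(np.linalg.norm(A - W @ H, 'fro') / np.linalg.norm(A, 'fro'))
```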

Motivation: respect the non-negative structure of the input matrix.

Observation: Why not use actual columns or rows in the decomposition?
Refs: Work with Lek-Heng Lim and Petros Drineas '05.

Page 22:

Applications of CX/CUR to diverse data documents

Current application areas for CUR-based analysis:

• Term-document matrices (with Yahoo people).
• User-group matrices (with Yahoo people).
• Recommendation systems (with Yahoo people).

• DNA microarray data (with O. Alter).
• Functional MRI data (with F. Meyer).
• DNA SNP data (with P. Paschou and K. Kidd).
• Hyperspectral image data (with M. Maggioni and R. Coifman).

Page 23:

CUR data application: DNA tagging-SNPs

Single Nucleotide Polymorphisms: the most common type of genetic variation in the genome across different individuals.

They are known locations in the human genome where two alternate nucleotide bases (alleles) are observed (out of A, C, G, T).

(data from K. Kidd’s lab at Yale University, joint work with Dr. Paschou at Yale University)

SNPs (columns) × individuals (rows):

… AG CT GT GG CT CC CC CC CC AG AG AG AG AG AA CT AA GG GG CC GG AG CG AC CC AA CC AA GG TT AG CT CG CG CG AT CT CT AG CT AG GG GT GA AG …

… GG TT TT GG TT CC CC CC CC GG AA AG AG AG AA CT AA GG GG CC GG AA GG AA CC AA CC AA GG TT AA TT GG GG GG TT TT CC GG TT GG GG TT GG AA …

… GG TT TT GG TT CC CC CC CC GG AA AG AG AA AG CT AA GG GG CC AG AG CG AC CC AA CC AA GG TT AG CT CG CG CG AT CT CT AG CT AG GG GT GA AG …

… GG TT TT GG TT CC CC CC CC GG AA AG AG AG AA CC GG AA CC CC AG GG CC AC CC AA CG AA GG TT AG CT CG CG CG AT CT CT AG CT AG GT GT GA AG …

… GG TT TT GG TT CC CC CC CC GG AA GG GG GG AA CT AA GG GG CT GG AA CC AC CG AA CC AA GG TT GG CC CG CG CG AT CT CT AG CT AG GG TT GG AA …

… GG TT TT GG TT CC CC CG CC AG AG AG AG AG AA CT AA GG GG CT GG AG CC CC CG AA CC AA GT TT AG CT CG CG CG AT CT CT AG CT AG GG TT GG AA …

… GG TT TT GG TT CC CC CC CC GG AA AG AG AG AA TT AA GG GG CC AG AG CG AA CC AA CG AA GG TT AA TT GG GG GG TT TT CC GG TT GG GT TT GG AA …

There are ∼10 million SNPs in the human genome, so this table could have ~10 million columns.

Page 24:

Why are SNPs important?

SNPs occur quite frequently within the genome, allowing the tracking of disease genes and population histories.

• Mapping the whole genome sequence of a single individual is very expensive.

• Mapping all the SNPs is also quite expensive, but the costs are dropping fast.

HapMap project (~$10^8 in funding from NIH and other sources):

Map all 10^7 SNPs for ~400 individuals from 4 different populations, in order to create a "genetic map" to be used by researchers.

Funding from pharmaceutical companies, NSF, the Department of Justice*, etc.

*Is it possible to identify the ethnicity of a suspect from his DNA?

Page 25:

Research directions

Research questions (working within a population):

(i) Are different SNPs correlated, within or across populations?

(ii) Find a “good” set of tagging-SNPs capturing the diversity of a chromosomal region of the human genome.

(iii) Find a set of individuals that capture the diversity of a chromosomal region.

(iv) Is extrapolation feasible?

Existing literature:

Pairwise metrics of SNP correlation, called LD (linkage disequilibrium) distance, based on nucleotide frequencies and co-occurrences.

Almost no metrics exist for measuring correlation between more than 2 SNPs and LD is very difficult to generalize.

Exhaustive and semi-exhaustive algorithms in order to pick “good” ht-SNPs that have small LD distance with all other SNPs.

Using Linear Algebra: an SVD based algorithm was proposed by Lin & Altman, Am. J. Hum. Gen. 2004.

Why?

- Understand structural properties of the human genome.

- Save time/money by assaying only the tSNPs and predicting the rest.

- Save time/money by running (drug) tests only on the cell lines of the selected individuals.

Page 26:

• Samples from 38 different populations.

• Average size 50 subjects/population.

• For each subject 63 SNPs were assayed.

• These SNPs were drawn from a chromosomal region roughly 900,000 bp long.

• This region is close to the end of the long arm of chromosome 17.

• At each SNP location two alternate nucleotide bases (alleles) are observed (so we use genotype and not haplotype information).

The SNP data we examined

SNPs (columns) × individuals (rows):

… AG CT GT GG CT CC CC CC CC AG AG AG AG AG AA CT AA GG GG CC GG AG CG AC CC AA CC AA GG TT AG CT CG CG CG AT CT CT AG CT AG GG GT GA AG …

… GG TT TT GG TT CC CC CC CC GG AA AG AG AG AA CT AA GG GG CC GG AA GG AA CC AA CC AA GG TT AA TT GG GG GG TT TT CC GG TT GG GG TT GG AA …

… GG TT TT GG TT CC CC CC CC GG AA AG AG AA AG CT AA GG GG CC AG AG CG AC CC AA CC AA GG TT AG CT CG CG CG AT CT CT AG CT AG GG GT GA AG …

… GG TT TT GG TT CC CC CC CC GG AA AG AG AG AA CC GG AA CC CC AG GG CC AC CC AA CG AA GG TT AG CT CG CG CG AT CT CT AG CT AG GT GT GA AG …

… GG TT TT GG TT CC CC CC CC GG AA GG GG GG AA CT AA GG GG CT GG AA CC AC CG AA CC AA GG TT GG CC CG CG CG AT CT CT AG CT AG GG TT GG AA …

… GG TT TT GG TT CC CC CG CC AG AG AG AG AG AA CT AA GG GG CT GG AG CC CC CG AA CC AA GT TT AG CT CG CG CG AT CT CT AG CT AG GG TT GG AA …

… GG TT TT GG TT CC CC CC CC GG AA AG AG AG AA TT AA GG GG CC AG AG CG AA CC AA CG AA GG TT AA TT GG GG GG TT TT CC GG TT GG GT TT GG AA …

Page 27:

[Figure: world map of the 38 sampled populations (marker legend: N > 50; N: 25-50), grouped by region - Africa, SW Asia, Europe, E Asia, NW Siberia, NE Siberia, Oceania, N America, S America. Populations: Yoruba, Biaka, Mbuti, Ibo, Hausa, Ethiopian Jews, African Americans, Chagga, Druze, Yemenite Jews, Samaritans, Adygei, Russians, Finns, Danes, Irish, mixed Europeans, Chuvash, Chinese (Taiwan), Chinese (Han), Hakka, Japanese, Atayal, Ami, Cambodians, Yakut, Komi-Zyrian, Khanty, Micronesians, Nasioi, Ashkenazi Jews, Ticuna, Surui, Karitiana, Pima (Arizona), Pima (Mexico), Cheyenne, Maya.]

Page 28:

Encoding the SNP data into a matrix

SNPs (columns) × individuals (rows):

0 0 0 1 0 -1 1 1 1 0 0 0 0 0 1 0 1 -1 -1 1 -1 0 0 0 1 1 1 1 -1 -1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0

-1 -1 -1 1 -1 -1 1 1 1 -1 1 0 0 0 1 0 1 -1 -1 1 -1 1 -1 1 1 1 1 1 -1 -1 1 -1 -1 -1 -1 -1 -1 1 -1 -1 -1 1 -1 -1 1

-1 -1 -1 1 -1 -1 1 1 1 -1 1 0 0 1 0 0 1 -1 -1 1 0 0 0 0 1 1 1 1 -1 -1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0

-1 -1 -1 1 -1 -1 1 1 1 -1 1 0 0 0 1 1 -1 1 1 1 0 -1 1 0 1 1 0 1 -1 -1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

-1 -1 -1 1 -1 -1 1 1 1 -1 1 -1 -1 -1 1 0 1 -1 -1 0 -1 1 1 0 0 1 1 1 -1 -1 -1 1 0 0 0 0 0 0 0 0 0 1 -1 -1 1

-1 -1 -1 1 -1 -1 1 0 1 0 0 0 0 0 1 0 1 -1 -1 0 -1 0 1 -1 0 1 1 1 -1 -1 0 0 0 0 0 0 0 0 0 0 0 1 -1 -1 1

-1 -1 -1 1 -1 -1 1 1 1 -1 1 0 0 0 1 -1 1 -1 -1 1 0 0 0 1 1 1 0 1 -1 -1 1 -1 -1 -1 -1 -1 -1 1 -1 -1 -1 0 -1 -1 1

• Exactly two nucleotides (out of A,G,C,T) appear in each column of the data matrix.

• Thus, the two alleles might be both equal to the first one (encode by +1), both equal to the second one (encode by -1), or different (encode by 0).

Note: The order of the alleles is irrelevant (i.e., TG is the same as GT).

Note: Encoding, e.g., GG to +1 and TT to -1 is not any different (for our purposes) from encoding GG to -1 and TT to +1.

(Flipping the signs of the columns of a matrix does not affect our techniques.)
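
A small sketch of the encoding described above: for each SNP (column), the two observed alleles are mapped to +1 / -1 for the two homozygotes and 0 for the heterozygote. The genotype strings below are made-up examples in the same format as the table:

```python
import numpy as np

def encode_snps(genotypes):
    """genotypes: list (one entry per individual) of lists of two-letter genotype
    strings ('AG', 'GG', ...).  Returns an individuals x SNPs matrix in {+1, 0, -1}."""
    rows = np.array([[''.join(sorted(g)) for g in person] for person in genotypes])
    encoded = np.zeros(rows.shape)
    for j in range(rows.shape[1]):
        alleles = sorted({a for g in rows[:, j] for a in g})   # the two alleles at this SNP
        hom_first, hom_second = alleles[0] * 2, alleles[-1] * 2
        encoded[rows[:, j] == hom_first, j] = +1    # both copies equal to the first allele
        encoded[rows[:, j] == hom_second, j] = -1   # both copies equal to the second allele
        # heterozygotes (one copy of each allele) stay 0
    return encoded

genotypes = [["AG", "CT", "GG"],
             ["GG", "TT", "GG"],
             ["AA", "CC", "GT"]]
print(encode_snps(genotypes))
```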

Page 29:

Evaluating (linear) structure

Expressing columns of A as linear combinations of the top k left singular vectors:

• For each population, we ran the SVD to determine the "optimal" number k of principal components necessary to cover (for example) 90% of its spectrum.

• That is, if we select the top k left singular vectors U_k, we can express every column (i.e., every SNP) of A as a linear combination of the top k left singular vectors (i.e., eigen-SNPs) and lose at most 10% of the "information."

Expressing columns of A as linear combinations of a few columns of A:

• BUT: we do NOT want eigen-SNPs, or eigen-people for that matter.

• Is it possible to pick a small number (e.g., roughly k) of columns of A and express every column (i.e., SNP) of A as a linear combination of the picked columns (matrix C), losing at most 10% of the information in the matrix?
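
A sketch of the k-selection step described above: pick the smallest k whose top singular values capture 90% of the spectrum. Whether the talk measures the spectrum with squared or unsquared singular values is not stated; squared (Frobenius) mass is assumed here:

```python
import numpy as np

def rank_for_coverage(A, coverage=0.90):
    """Smallest k such that the top-k singular values capture `coverage`
    of the squared spectrum (Frobenius mass) of A."""
    s = np.linalg.svd(A, compute_uv=False)
    frac = np.cumsum(s ** 2) / np.sum(s ** 2)
    return int(np.searchsorted(frac, coverage) + 1)

rng = np.random.default_rng(7)
A = rng.standard_normal((50, 8)) @ rng.standard_normal((8, 45))   # stand-in for one population's SNP matrix
print("k =", rank_for_coverage(A, 0.90))
```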

Page 30:

Fast algorithms to select good SNPs

We ran various algorithms to select “good” columns (i.e., ht-SNPs).

A greedy Multi-Pass heuristic scheme gave the best results.

• Select one column in each round, subtracting from A the projection of A on this column and repeating.

• Provable quality-of-approximation bounds exist for similar algorithms.

Nice feature: SVD provides a non-trivial (maybe not achievable) lower bound.

• In many cases, the lower bound is attained by the greedy heuristic!

• In our data, at most k+4 columns suffice to extract 90% of the structure.
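
A minimal sketch of a greedy multi-pass scheme of the kind described above: in each round pick the column that best explains the current residual, subtract its projection, and repeat. The specific scoring rule shown (largest reduction in the residual's Frobenius mass) is one plausible variant, not necessarily the exact heuristic used in the experiments:

```python
import numpy as np

def greedy_column_selection(A, num_cols):
    """Greedy multi-pass column selection: pick a column, project it out of the
    residual, repeat.  The scoring rule is one plausible variant of the heuristic."""
    E = A.astype(float).copy()          # current residual
    chosen = []
    for _ in range(num_cols):
        norms = np.linalg.norm(E, axis=0)
        norms[norms == 0] = 1e-300
        scores = np.linalg.norm(E.T @ (E / norms), axis=0) ** 2   # drop in ||E||_F^2 per column
        j = int(np.argmax(scores))
        chosen.append(j)
        u = E[:, j] / np.linalg.norm(E[:, j])
        E -= np.outer(u, u @ E)         # remove everything explained by column j
    return chosen

rng = np.random.default_rng(8)
A = rng.standard_normal((60, 5)) @ rng.standard_normal((5, 40))
cols = greedy_column_selection(A, num_cols=5)
C = A[:, cols]
print(cols, np.linalg.norm(A - C @ np.linalg.pinv(C) @ A, 'fro'))
```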

Page 31:

[Figure: tagging-SNP selection results for the populations, grouped by region - Africa, Europe, Asia, Oceania, America.]

Page 32:

Extrapolation using both SNPs and subjects

Given a small number of SNPs for all subjects, and all SNPs for some judiciously chosen subjects, extrapolate the values of the missing SNPs.

[Figure: the individuals x SNPs matrix. "Training" data: for a few judiciously chosen subjects we are given all SNPs, BUT we choose which subjects to keep. SNP sample: for all subjects we are given a small number of SNPs, BUT we choose these SNPs by looking at the whole matrix A.]
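
A toy sketch of the extrapolation idea: with R = all SNPs for the chosen subjects and C = the chosen SNP columns for all subjects, the missing entries can be filled in CUR-style via A ≈ C W^+ R. The random choices below are illustrative; in the study the subjects and SNPs come from the CUR machinery above:

```python
import numpy as np

rng = np.random.default_rng(9)
m, n, k = 40, 60, 4                     # individuals x SNPs (toy sizes)
A_true = rng.standard_normal((m, k)) @ rng.standard_normal((k, n))   # full (unknown) matrix

chosen_snps = rng.choice(n, size=8, replace=False)       # SNPs assayed for everybody
chosen_subjects = rng.choice(m, size=8, replace=False)   # subjects with all SNPs assayed

C = A_true[:, chosen_snps]                          # known: a few SNPs for all subjects
R = A_true[chosen_subjects, :]                      # known: all SNPs for a few subjects
W = A_true[np.ix_(chosen_subjects, chosen_snps)]    # intersection of the two samples

A_extrapolated = C @ np.linalg.pinv(W) @ R          # fill in every entry, including unobserved ones
print(np.linalg.norm(A_true - A_extrapolated, 'fro') / np.linalg.norm(A_true, 'fro'))
```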

Page 33:

Page 34:

CUR data application: image analysis

(with M. Maggioni and R. Coifman at Yale)

Goal: Extract structure from temporally-resolved images or spectrally-resolved images of medical interest using a small number of samples (images and/or pixels).

[Figure: the three modes (mode 1, mode 2, mode 3) of the m x n x p image tensor.]

Note: A temporally or spectrally resolved image may be viewed as a tensor (naively, a dataset subscripted by multiple indices) or as a matrix (whose columns have internal structure that is not modeled).

m x n x p tensor A or mn x p matrix A

Page 35:

CUR applied to resolved images

Let R consist of the sampled rows or “slabs”.

Express the remaining images as linear combinations of the sampled “slabs”.

Pick a constant number of columns or "fibers" of A (the red dotted lines in the figure).

Express the remaining slabs as linear combinations of the sampled slabs.

Note: The chosen images are a dictionary from the data to express every image.

Note: The chosen pixels are a dictionary from the data to express every pixel.

Page 36:

Page 37:

Absorption/transmittance and nonuniform sampling probabilities

Page 38:

Eigen-analysis of slabs and fibers

Page 39:

Look at the exact 65-th (or any other) slab.

Page 40:

The 65-th slab approximately reconstructed

This slab was reconstructed by an approximate least-squares fit to the basis from slabs 41 and 50, using 1000 (of 250K) pixels/fibers.
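
A sketch of that reconstruction step: fit a target slab to a small basis of sampled slabs using only a subset of pixels (fibers), then predict every pixel from the fitted coefficients. The image sizes, slab indices, and random data below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(10)
num_pixels, num_slabs = 5000, 80                     # flattened image size x number of slabs
X = rng.standard_normal((num_pixels, 3)) @ rng.standard_normal((3, num_slabs))   # toy image stack

basis_slabs = [41, 50]                               # slabs kept as the dictionary
target_slab = 65                                     # slab to reconstruct
sampled_pixels = rng.choice(num_pixels, size=1000, replace=False)   # fibers actually read

B = X[:, basis_slabs]                                # dictionary of sampled slabs
y = X[:, target_slab]

# Least-squares fit using ONLY the sampled pixels ...
coef, *_ = np.linalg.lstsq(B[sampled_pixels, :], y[sampled_pixels], rcond=None)
# ... then predict the target slab at EVERY pixel from those coefficients.
y_hat = B @ coef
print(np.linalg.norm(y - y_hat) / np.linalg.norm(y))
```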

Page 41:

Tissue Classification - Exact Data

Page 42:

Tissue Classification - Ns=12 & Nf=1000

Page 43:

Conclusions

CUR matrix decompositions provide a data-driven dictionary-definition mechanism for diverse "document" domains:
• They provide a low-rank approximation in terms of the actual columns and rows of the matrix.

• They take advantage of field-specific intuition for improved analysis of moderately large data.

• They perform approximate least-squares fitting to the dictionary of chosen columns/rows.

CUR has applications to lots of diverse data "documents":
• to DNA SNP data and DNA microarray data,

• to spectrally- and temporally-resolved image analysis,

• to recommendation systems and internet data.

Big Algorithm Question: How to better couple data/document analysis methods with field-dependent data generation, preprocessing, and modeling?

