pTree Text Mining
Transcript
Page 1: pTree Text Mining (Data Cube Text Mining)

[The slide diagrams the DocTrmPos pTreeSet over a toy corpus: reading positions 1..7..mdl, vocabulary terms ... a, again, all, always., an, and, apple, April, are ..., and documents JSE, HHS, LMM, ... The diagram itself does not survive transcription; the recoverable content follows.]

DocTrmPos pTreeSet (level-0): one bit per (document, term, reading-position) triple, set iff the term occupies that reading position of that document. The bits are laid out doc-major: reading positions 1..mdl for doc=1, term=a; then doc=1, term=again; then doc=1, term=all; ... then doc=2, doc=3, ... (mdl = max doc length). Length of this level-0 pTree = mdl*VocabLen*DocCount.

Level-1 TermExistencePTree (te): the predicate is NOTpure0, applied per mdl-stride of the level-0 pTree; one bit per (document, term) pair. Length of this level-1 TermExistencePTree = VocabLen*DocCount. On the toy data, the te bits over terms (a, again, all) are doc=1: 1 0 1; doc=2: 1 0 0; doc=3: 1 1 0.

For doc=1 the slide tabulates, per vocabulary term:

  term     te  tf  tf2 tf1 tf0  doc freq
  are      0   0   0   0   0    1
  April    0   0   0   0   0    1
  apple    0   0   0   0   0    1
  and      1   3   0   1   1    3
  an       0   0   0   0   0    1
  always.  0   3   0   0   0    1
  all      0   0   0   0   0    2
  again    0   0   0   0   0    1
  a        1   2   0   1   0    3

Level-1 TermFreqPTrees (tfPk): bit slices of the term-frequency counts (e.g., the predicate of tfP0 is mod(sum(mdl-stride), 2) = 1). On the toy data, tf over terms (a, again, all) is doc=1: 2 0 0; doc=2: 3 0 0; doc=3: 2 1 0, giving the slices tf0 = 0 0 0 | 1 0 0 | 0 1 0 and tf1 = 1 0 0 | 1 0 0 | 1 0 0.

Document frequency: df (count) over terms (a, again, all) is 3 1 2, with its bit slices dfP3 and dfP0 indicated on the slide. Note that dfk isn't a level-2 pTree, since it's not a predicate on level-1 te strides. The next slides show how to do it differently, so that even the dfk's come out as level-2 pTrees.
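The level-0-to-level-1 derivation above is mechanical enough to sketch in code. The following is a minimal illustration, not the slides' implementation: the toy bit data, the flat-list layout, and names like level0 and strides are ours; only the striding and the NOTpure0 / mod-2 predicates come from the slide.

```python
# Minimal sketch (not the slides' implementation) of deriving level-1
# pTrees from a level-0 DocTrmPos bit vector by mdl-length strides.
# Assumed layout: bits grouped doc-major, then term, then position,
# so each (doc, term) pair owns one contiguous stride of mdl bits.

mdl = 8            # max document length (stride size); illustrative value
vocab = ["a", "again", "all"]
doc_count = 3

# level-0 bits: 1 iff the term occurs at that reading position (toy data)
level0 = [
    1,0,0,1,0,0,0,0,  0,0,0,0,0,0,0,0,  0,0,0,0,0,0,0,0,   # doc1: a, again, all
    1,1,0,1,0,0,0,0,  0,0,0,0,0,0,0,0,  0,0,0,0,0,0,0,0,   # doc2
    0,1,0,0,1,0,0,0,  0,0,1,0,0,0,0,0,  0,0,0,0,0,0,0,0,   # doc3
]

strides = [level0[i:i + mdl] for i in range(0, len(level0), mdl)]

# level-1 TermExistence pTree: predicate NOTpure0 on each mdl-stride
te = [int(any(s)) for s in strides]

# term-frequency counts and their bit slices; e.g. the predicate of
# tfP0 is mod(sum(mdl-stride), 2) = 1
tf = [sum(s) for s in strides]
tfP0 = [c % 2 for c in tf]          # low-order bit
tfP1 = [(c >> 1) % 2 for c in tf]   # next bit

print(te)                # one bit per (doc, term): length VocabLen*DocCount
print(tf, tfP0, tfP1)    # doc-major: (a, again, all) for doc1, doc2, doc3
```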

Page 2: pTree Text Mining (data cube layout)

[The slide repeats the Page 1 vocabulary/te/tf/df panel (Corpus pTreeSet) and then reorders the pTrees into a data cube layout. Recoverable content:]

Data cube layout: concatenate the position pTrees term-major instead of doc-major: P(t=a,d=1), P(t=a,d=2), P(t=a,d=3), P(t=again,d=1), ... This one, overall, level-0 pTree, corpusP, has length = MaxDocLen*DocCount*VocabLen.

This one, overall, level-1 pTree, teP, has length = DocCount*VocabLen. It concatenates, per term, the te bits across documents: tePt=a (doc1 doc2 doc3), tePt=again (doc1 doc2 doc3), tePt=all (doc1 doc2 doc3), ...

Level-1 pTrees tfPk as before (e.g., the predicate of tfP0: mod(sum(mdl-stride), 2) = 1).

Level-2 pTree hdfP (Hi Doc Freq): predicate = NOTpure0, applied to tfP1 per DocCount-stride; one bit per term.

These level-2 pTrees, dfPk, have length = VocabLen: the df count over terms (a, again, all) is 1 1 2, with bit slices dfP3 and dfP0.
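A minimal sketch of the level-2 constructions in the cube layout. This is our own toy code, not the slides': the dict layout and the te/tfP1 values are illustrative; the predicates (count the te bits for df, NOTpure0 over each term's tfP1 stride for hdfP) are the slide's.

```python
# Sketch (ours) of level-2 pTrees over the data-cube (term-major) layout:
# each term owns one DocCount-long stride of level-1 bits.

doc_count = 3
vocab = ["a", "again", "all"]

# level-1 bits per term across docs: te[t][d], tfP1[t][d]  (toy values)
te   = {"a": [1, 1, 1], "again": [0, 0, 1], "all": [1, 0, 0]}
tfP1 = {"a": [1, 1, 1], "again": [0, 0, 0], "all": [0, 0, 0]}

# df count per term = number of docs whose te bit is 1;
# dfPk are the bit slices of those counts (level-2, length = VocabLen)
df   = [sum(te[t]) for t in vocab]            # e.g. [3, 1, 1]
dfP0 = [c & 1 for c in df]
dfP1 = [(c >> 1) & 1 for c in df]

# hdfP (Hi Doc Freq): predicate NOTpure0 applied to each term's tfP1
# stride -- 1 iff some doc uses the term at least twice
hdfP = [int(any(tfP1[t])) for t in vocab]

print(df, dfP0, dfP1, hdfP)
```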

Page 3: pTree Text Mining (data cube layout with section masks)

[The slide repeats the Page 2 cube layout: corpusP (the overall level-0 pTree, length = MaxDocLen*DocCount*VocabLen), teP (the overall level-1 pTree, length = DocCount*VocabLen), the level-1 tfPk slices, and the level-2 dfPk and hdfP pTrees. It adds position masks for document sections:]

  Preface pTree   1 1 1 0 0 0 0
  LastChpt pTree  0 0 0 0 0 1 0
  Refrncs pTree   0 0 0 0 0 0 1

Any of these masks can be ANDed into the P(t=..., d=...) pTrees before they are concatenated as above (or repetitions of the mask can be ANDed in after they are concatenated).
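A small sketch of the mask idea, under the assumption (ours) that each mask is a bit vector over the mdl reading positions; the toy position bits and names are illustrative.

```python
# Sketch (illustrative only) of ANDing a section mask into position
# pTrees: restrict term occurrences to, e.g., the Preface.

mdl = 7
preface_mask = [1, 1, 1, 0, 0, 0, 0]       # reading positions 1..mdl

# P(t=term, d=doc): toy position bits for one (term, doc) pair
p_t_d = [0, 1, 0, 0, 0, 1, 0]

# AND before concatenation: keeps only occurrences inside the Preface
p_preface = [a & b for a, b in zip(p_t_d, preface_mask)]

# ...or AND after concatenation, tiling the mask across every
# mdl-stride of the concatenated level-0 corpus pTree
corpusP = p_t_d * 3                        # pretend 3 (term, doc) strides
tiled = preface_mask * 3
corpus_preface = [a & b for a, b in zip(corpusP, tiled)]
print(p_preface, corpus_preface)
```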

Page 4: pTree Text Mining (a Mother Goose pBase)

I have put together a pBase of 75 Mother Goose rhymes and stories, and created a pBase of 15 of the documents, 30 words each (the Universal Document Length, UDL), using all white-space-separated strings as the vocabulary.
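The construction can be sketched as follows. This is our own reconstruction, not the author's pBase code: the document names, word lists, and helper names (pbase, UDL as a variable) are illustrative; the scheme (one UDL-long position bit vector per (doc, term) pair, with te/tf derived by striding) is the one shown on this page.

```python
# Minimal sketch (our reconstruction) of building a position pBase:
# for each doc and vocab term, a UDL-long bit vector whose bit p is
# set iff the term is the p-th word of the document.

docs = {
    "04LMM": "Little Miss Muffet sat on a tuffet eating her curds and whey".split(),
    "05HDS": "Humpty Dumpty sat on a wall Humpty Dumpty had a great fall".split(),
}
UDL = 30   # Universal Document Length: every doc read at positions 1..30

vocab = sorted({w for words in docs.values() for w in words})

pbase = {}   # (doc, term) -> list of UDL bits
for name, words in docs.items():
    for term in vocab:
        bits = [int(p < len(words) and words[p] == term) for p in range(UDL)]
        pbase[(name, term)] = bits

# the level-1 values fall out by striding, as on the earlier slides
te = {k: int(any(b)) for k, b in pbase.items()}
tf = {k: sum(b) for k, b in pbase.items()}
print(tf[("05HDS", "Humpty")])   # 2
```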

Lev-1 (te, tf and the tf bit slices tf1, tf0) and Lev-0 (reading positions 1..182) for the document "Little Miss Muffet sat on a tuffet eating ... of curds and whey. There came a big spider and sat down ...". Only the first eight position columns are shown; every vocabulary row not listed (again., all, always, an, apple, April, are, around, ashes,, away, away., baby, baby., bark!, beans, beat, bed,, Beggars, begins., between, ..., your) is zero for this document:

  te  tf  tf1 tf0  VOCAB    pos: 1 2 3 4 5 6 7 8 ... 182
  1   2   1   0    a             0 0 0 0 0 1 0 0 ...
  1   3   1   1    and           0 0 0 0 0 0 0 0 ...
  1   1   0   1    away.         0 0 0 0 0 0 0 0 ...
  1   1   0   1    beside        0 0 0 0 0 0 0 0 ...

Lev-0: Little Miss Muffet; Lev-1 (term freq/exist).

The same panel for doc 05HDS, "Humpty Dumpty sat on a wall. Humpty Dumpty ...". Nonzero vocabulary rows:

  te  tf  tf1 tf0  VOCAB    pos: 1 2 3 4 5 6 7 8 ... 182
  1   2   1   0    a             0 0 0 0 1 0 0 0 ...
  1   1   0   1    again.        0 0 0 0 0 0 0 0 ...
  1   2   1   0    all           0 0 0 0 0 0 0 0 ...
  1   1   0   1    and           0 0 0 0 0 0 0 0 ...

Lev-0: Humpty Dumpty; Lev-1 (term freq/exist).

Level-2 pTrees (document frequency): df is the number of the 15 documents whose te bit is 1 for the term, and df3..df0 are its bit slices. Sample te columns (te04, te05, te08, te09, te27, te29, te34) are shown:

  df3 df2 df1 df0  df  VOCAB    te04 te05 te08 te09 te27 te29 te34
  1   0   0   0    8   a         1    1    0    1    0    0    0
  0   0   0   1    1   again.    0    1    0    0    0    0    0
  0   0   1   1    3   all       0    1    0    0    0    0    0
  0   0   0   1    1   always    0    0    0    0    0    1    0
  0   0   0   1    1   an        0    0    0    0    0    0    0
  1   1   0   1    13  and       1    1    1    1    1    1    1
  0   0   0   1    1   apple     0    0    0    0    0    0    0
  0   0   0   1    1   April     0    0    0    0    0    0    0
  0   0   0   1    1   are       0    0    0    0    0    0    0
  0   0   0   1    1   around    0    0    0    0    0    0    0
  0   0   0   1    1   ashes,    0    0    0    0    0    0    0
  0   0   1   0    2   away      0    0    0    0    0    1    0
  0   0   1   0    2   away      0    0    0    0    0    1    0
  0   0   0   1    1   away.     1    0    0    0    0    0    0
  0   0   0   1    1   baby      0    0    0    0    1    0    0
  0   0   0   1    1   baby.     0    0    0    1    0    0    0
  0   0   0   1    1   bark!     0    0    0    0    0    0    0
  0   0   0   1    1   beans     0    0    0    0    0    0    1
  0   0   0   1    1   beat      0    0    0    0    0    0    0
  0   0   0   1    1   bed,      0    0    0    0    0    1    0
  0   0   0   1    1   Beggars   0    0    0    0    0    0    0
  0   0   0   1    1   begins.   0    0    0    0    0    0    0
  0   0   0   1    1   beside    1    0    0    0    0    0    0
  0   0   0   1    1   between   0    0    1    0    0    0    0
  ... (rows continue through the vocabulary to "your")
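One payoff of storing df as bit slices is that threshold queries never need the counts themselves. A sketch of that, with our own toy slice values (only the slicing convention comes from the table above):

```python
# Sketch (ours): answering "which terms have df >= 8?" directly from
# the level-2 bit slices dfP3..dfP0, without materializing df.
vocab = ["a", "again.", "all", "always", "an", "and"]
dfP3 = [1, 0, 0, 0, 0, 1]
dfP2 = [0, 0, 0, 0, 0, 1]
dfP1 = [0, 0, 1, 0, 0, 0]
dfP0 = [0, 1, 1, 1, 1, 1]

# for 4-bit counts, df >= 8 iff bit 3 is set: just read dfP3
hi_df_terms = [t for t, b in zip(vocab, dfP3) if b]
print(hi_df_terms)    # ['a', 'and']

# reconstruct counts when needed: df = 8*dfP3 + 4*dfP2 + 2*dfP1 + dfP0
df = [8*a + 4*b + 2*c + d for a, b, c, d in zip(dfP3, dfP2, dfP1, dfP0)]
print(df)             # [8, 1, 3, 1, 1, 13]
```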

Page 5: pTree Text Mining (Latent Semantic Indexing)

Latent semantic indexing (LSI) is an indexing and retrieval method that uses singular value decomposition to find patterns in the terms and concepts of a body of text. LSI is based on the principle that words used in the same contexts tend to have similar meanings. A key LSI feature is the ability to extract the conceptual content of a body of text by establishing associations between terms that occur in similar contexts. [1] LSI overcomes synonymy and polysemy, which cause mismatches in information retrieval [3] and cause Boolean keyword queries to fail. LSI performs automatic document categorization (assignment of docs to predefined categories based on similarity to the conceptual content of the categories). [5] LSI uses example docs to form the conceptual basis for each category: the concepts in the docs being categorized are compared to the concepts contained in the example items, and a category (or categories) is assigned to the docs based on the similarities between the concepts they contain and the concepts contained in the example docs.

Mathematics of LSI (linear algebra techniques to learn the conceptual correlations in a collection of text): construct a weighted term-document matrix, do a Singular Value Decomposition on it, and use that to identify the concepts contained in the text.

Term-document matrix, A: each of the m terms is represented by a row, and each of the n docs by a column, with each matrix cell a_ij initially representing the number of times the associated term appears in the indicated document, tf_ij. This matrix is usually large and very sparse. Once a term-document matrix is constructed, local and global weighting functions can be applied to it to condition the data. Local weighting functions: [13] Binary (term exists in the doc), TermFrequency; global weighting functions: Binary, Normal, GfIdf, Idf, Entropy.

SVD reduces the dimensionality of the matrix to a tractable size by finding the singular values. It involves matrix operations and may not be amenable to pTree operations (i.e., horizontal methods are highly developed and may be best). We should study it, though, to see if we can identify a pTree-based breakthrough for creating the reduction that SVD achieves. Is a new SVD run required for every new query, or is it a one-time thing? If it is one-time, there is probably little advantage in searching for pTree speedups. If and when it is not a one-time application to the original data, pTree speedups may hold promise. Even if it is one-time, we might take the point of view that we do the SVD reduction (using standard horizontal methods) and then convert the result to vertical pTrees for the data mining (which would be done over and over again). That pTree-ization of the end result of the SVD reduction could be organized as in the previous slides.
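A minimal sketch of that workflow, using numpy's standard (horizontal) SVD; the toy matrix, the rounding/quantization step, and the names are our illustration of what "pTree-izing the reduced result" could look like, not a prescribed method:

```python
# Sketch (ours): reduce the term-document matrix with an ordinary SVD,
# then slice the reduced result into vertical bit-position pTrees.
import numpy as np

A = np.array([[2, 0, 1],       # toy tf matrix: terms x docs
              [0, 3, 0],
              [1, 0, 2]])

T, s, Dt = np.linalg.svd(A, full_matrices=False)
k = 2                          # keep the k largest singular values
Ak = T[:, :k] @ np.diag(s[:k]) @ Dt[:k, :]

# quantize the reduced matrix, then slice into bit-position pTrees
Q = np.clip(np.rint(Ak), 0, 7).astype(np.uint8)   # 3-bit counts
pTrees = [(Q >> b) & 1 for b in range(3)]         # pTrees[b] = bit b of Q
print(Q)
print(pTrees[0])               # low-order bit slice
```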

Here is a good paper on the subject of LSI and SVD: http://www.cob.unt.edu/itds/faculty/evengelopoulos/dsci5910/LSA_Deerwester1990.pdf

Thoughts for the future:

I am now convinced we can do LSI using pTree processing. The heart of LSI is SVD, and the heart of SVD is Gaussian elimination, which is adding a constant times one matrix row to another row, something we can do with pTrees.

We will talk more about this next Saturday and during the week.

Page 6: pTree Text Mining (SVD for LSI)

SVD: Let X be the t by d TermFrequency (tf) matrix. It can be decomposed as $X = T_0 S_0 D_0^T$, where $T_0$ and $D_0$ have orthonormal columns and $S_0$ has only the singular values on its diagonal, in descending order. Remove from $T_0, S_0, D_0$ the rows and columns of all but the highest k singular values, giving T, S, D. Then $X \approx \hat{X} \equiv T S D^T$ ($\hat{X}$ is the rank-k matrix closest to X).

We have reduced the dimension from rank(X) to k, and we note $\hat{X}\hat{X}^T = T S^2 T^T$ and $\hat{X}^T\hat{X} = D S^2 D^T$.

There are three sorts of comparisons of interest: 1. terms (how similar are terms i and j? comparing rows); 2. documents (how similar are documents i and j? comparing columns); 3. terms and documents (how associated are term i and doc j? examining individual cells).

Comparing terms (how similar are terms i and j? comparing rows): the dot product between two rows of $\hat{X}$ reflects their similarity (similar occurrence pattern across the documents). $\hat{X}\hat{X}^T$ is the square t x t symmetric matrix containing all these dot products. Since $\hat{X}\hat{X}^T = T S^2 T^T$, the ij cell of $\hat{X}\hat{X}^T$ is the dot product of rows i and j of TS (so the rows of TS can be taken as coordinates of the terms).

Comparing documents (how similar are documents i and j? comparing columns): the dot product of two columns of $\hat{X}$ reflects their similarity (the extent to which the two documents have a similar profile of terms). $\hat{X}^T\hat{X}$ is the square d x d symmetric matrix containing all these dot products. Since $\hat{X}^T\hat{X} = D S^2 D^T$, the ij cell of $\hat{X}^T\hat{X}$ is the dot product of rows i and j of DS (taken as coordinates of the documents).

Comparing a term and a document (how associated are term i and document j? analyzing cell ij of $\hat{X}$): since $\hat{X} = T S D^T$, cell ij is the dot product of the i-th row of $T S^{1/2}$ and the j-th row of $D S^{1/2}$.
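The three comparisons are easy to check numerically. A self-contained sketch (our toy matrix and variable names; the identities are the ones stated above):

```python
# Sketch (ours) of the three LSI comparisons, using numpy's SVD on a
# small term x doc matrix X; k is the number of kept singular values.
import numpy as np

X = np.array([[1., 0., 1.],
              [0., 1., 0.],
              [1., 1., 0.],
              [0., 0., 1.]])         # 4 terms x 3 docs
T0, s0, D0t = np.linalg.svd(X, full_matrices=False)
k = 2
T, S, D = T0[:, :k], np.diag(s0[:k]), D0t[:k, :].T
Xhat = T @ S @ D.T                   # rank-k approximation of X

term_sims = (T @ S) @ (T @ S).T      # = Xhat @ Xhat.T  (term-term)
doc_sims = (D @ S) @ (D @ S).T       # = Xhat.T @ Xhat  (doc-doc)
Shalf = np.diag(np.sqrt(s0[:k]))
assoc = (T @ Shalf) @ (D @ Shalf).T  # = Xhat           (term-doc cells)

assert np.allclose(term_sims, Xhat @ Xhat.T)
assert np.allclose(doc_sims, Xhat.T @ Xhat)
assert np.allclose(assoc, Xhat)
```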

Page 7: pTree Text Mining (example term-document matrix)

  term\doc   c1 c2 c3 c4 c5 m1 m2 m3 m4
  human       1  0  0  1  0  0  0  0  0
  interface   1  0  1  0  0  0  0  0  0
  computer    1  1  0  0  0  0  0  0  0
  user        0  1  1  0  1  0  0  0  0
  system      0  1  1  2  0  0  0  0  0
  response    0  1  0  0  1  0  0  0  0
  time        0  1  0  0  1  0  0  0  0
  EPS         0  0  1  1  0  0  0  0  0
  survey      0  1  0  0  0  0  0  0  1
  trees       0  0  0  0  0  1  1  1  0
  graph       0  0  0  0  0  0  1  1  1
  minors      0  0  0  0  0  0  0  1  1

c1: Human machine interface for Lab ABC computer applications
c2: A survey of user opinion of computer system response time
c3: The EPS user interface management system
c4: System and human system engineering testing of EPS
c5: Relation of user-perceived response time to error measurement

m1: The generation of random, binary, unordered trees
m2: The intersection graph of paths in trees
m3: Graph minors IV: Widths of trees and well-quasi-ordering
m4: Graph minors: A survey
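The matrix is just occurrence counts of the twelve index terms in the nine titles, so it can be rebuilt mechanically. A sketch (ours; the titles are expanded from the slide's abbreviations, and the tokenization choice is an assumption):

```python
# Sketch (ours): rebuilding the term-document matrix above by counting
# index-term occurrences in the nine titles (case-insensitive).
import re

terms = ["human", "interface", "computer", "user", "system",
         "response", "time", "EPS", "survey", "trees", "graph", "minors"]
titles = {
    "c1": "Human machine interface for Lab ABC computer applications",
    "c2": "A survey of user opinion of computer system response time",
    "c3": "The EPS user interface management system",
    "c4": "System and human system engineering testing of EPS",
    "c5": "Relation of user-perceived response time to error measurement",
    "m1": "The generation of random, binary, unordered trees",
    "m2": "The intersection graph of paths in trees",
    "m3": "Graph minors IV: Widths of trees and well-quasi-ordering",
    "m4": "Graph minors: A survey",
}

# word-boundary matching so e.g. "user-perceived" counts for "user"
X = [[len(re.findall(rf"\b{t}\b", title, flags=re.IGNORECASE))
      for title in titles.values()]
     for t in terms]
for t, row in zip(terms, X):
    print(f"{t:9s}", row)       # reproduces the table above
```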

Page 8: pTree Text Mining (rank-2 approximation)

$X = T_0 S_0 D_0^T$, with $T_0$ and $D_0$ column-orthonormal. Approximate X by keeping only the first 2 singular values and the corresponding columns of $T_0$ and $D_0$, which are the coordinates used to position terms and docs in a 2-D representation. In this reduced model, $X \approx \hat{X} = T S D^T$.

(The term-document matrix X is as on Page 7.)

Page 9: pTree Text Mining

http://ocw.mit.edu/courses/mathematics/18-06-linear-algebra-spring-2010/tools/

Page 10: pTree Text Mining (classifying a query against the c and m classes)

The slide works a query-classification example on the term-document data, transposed so rows are documents and columns are terms; mc and mm are the columnwise means of the c class (c1..c5) and the m class (m1..m4):

  doc\term  human interface computer user system response time EPS  survey trees graph minors
  c1          1      1        1       0     0      0       0    0     0      0     0     0
  c2          0      0        1       1     1      1       1    0     1      0     0     0
  c3          0      1        0       1     1      0       0    1     0      0     0     0
  c4          1      0        0       0     2      0       0    1     0      0     0     0
  c5          0      0        0       1     0      1       1    0     0      0     0     0
  mc        0.4    0.4      0.4     0.6   0.8    0.4     0.4  0.4   0.2      0     0     0
  m1          0      0        0       0     0      0       0    0     0      1     0     0
  m2          0      0        0       0     0      0       0    0     0      1     1     0
  m3          0      0        0       0     0      0       0    0     0      1     1     1
  m4          0      0        0       0     0      0       0    0     1      0     1     1
  mm          0      0        0       0     0      0       0    0  0.25   0.75  0.75   0.5
  q           1      0        1       0     0      0       0    0     0      0     0     0

The worksheet rows below the matrix (labels as transcribed; D = mc - mm):

  D            0.4   0.4   0.4   0.6    0.8   0.4   0.4   0.4  -0.05 -0.75 -0.75 -0.50
  d            0.23  0.30  0.42  1.09 -16.00 -0.47 -0.32 -0.24   0.02  0.38  0.60  1.00
  (mc+mm)/2    0.09  0.12  0.17  0.65 -12.80 -0.19 -0.13 -0.10  -0.00 -0.28 -0.45 -0.50
  (mc+mm)/2*d  0.02  0.04  0.07  0.71 204.80  0.09  0.04  0.02  -0.00 -0.11 -0.27 -0.50   a = 204.92
  q*d          0.23  0.00  0.42  0.00   0.00  0.00  0.00  0.00   0.00  0.00  0.00  0.00

q dot d = 0.65, far less than a, so q is well inside the c class.

Componentwise squared differences (doc_i - q)^2 and the resulting Euclidean distances d(doc_i, q):

  d(doc_i,q)           human interface computer user system response time EPS survey trees graph minors
  1.00   (c1-q)^2        0      1        0       0     0      0       0    0    0      0     0     0
  2.45   (c2-q)^2        1      0        0       1     1      1       1    0    1      0     0     0
  2.45   (c3-q)^2        1      1        1       1     1      0       0    1    0      0     0     0
  2.45   (c4-q)^2        0      0        1       0     4      0       0    1    0      0     0     0
  2.24   (c5-q)^2        1      0        1       1     0      1       1    0    0      0     0     0
  1.73   (m1-q)^2        1      0        1       0     0      0       0    0    0      1     0     0
  2.00   (m2-q)^2        1      0        1       0     0      0       0    0    0      1     1     0
  2.24   (m3-q)^2        1      0        1       0     0      0       0    0    0      1     1     1
  2.24   (m4-q)^2        1      0        1       0     0      0       0    0    1      0     1     1

What this tells us is that c1 is closest to q in the full space and that the other c documents are no closer than the m documents. Therefore q would probably be classified as c (one voter in the 1.5-neighborhood), but not clearly. This shows the need for SVD or Oblique FAUST!
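The slide's arithmetic is easy to reproduce. A sketch (ours; the midpoint test along D = mc - mm is our reading of the worksheet, so treat it as an assumption):

```python
# Sketch (ours): Euclidean distances from the query q to each document,
# plus a class-mean midpoint test along D = mc - mm.
import numpy as np

C = np.array([[1,1,1,0,0,0,0,0,0,0,0,0],
              [0,0,1,1,1,1,1,0,1,0,0,0],
              [0,1,0,1,1,0,0,1,0,0,0,0],
              [1,0,0,0,2,0,0,1,0,0,0,0],
              [0,0,0,1,0,1,1,0,0,0,0,0]], dtype=float)   # c1..c5
M = np.array([[0,0,0,0,0,0,0,0,0,1,0,0],
              [0,0,0,0,0,0,0,0,0,1,1,0],
              [0,0,0,0,0,0,0,0,0,1,1,1],
              [0,0,0,0,0,0,0,0,1,0,1,1]], dtype=float)   # m1..m4
q = np.array([1,0,1,0,0,0,0,0,0,0,0,0], dtype=float)

for name, row in zip(["c1","c2","c3","c4","c5","m1","m2","m3","m4"],
                     np.vstack([C, M])):
    print(name, round(np.linalg.norm(row - q), 2))   # 1.0, 2.45, ... as above

mc, mm = C.mean(axis=0), M.mean(axis=0)
D = mc - mm                    # direction separating the class means
mid = (mc + mm) / 2            # midpoint between the class means
print("q side of midpoint:", "c" if (q - mid) @ D > 0 else "m")   # c
```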

