
Mathematical approach for Text Mining 1

Date post: 25-Jan-2015
Upload: kyunghoon-kim
Page 1: Mathematical approach for Text Mining 1

Kyunghoon Kim

Mathematical approach for Text Mining

- Standard Latent Semantic Indexing -

7/17/2014 Standard Latent Semantic Indexing 1

2014. 07. 17.

UNIST Mathematical Sciences

Kyunghoon Kim ( [email protected] )

Page 2: Mathematical approach for Text Mining 1

What is Indexing?

1. Google Glasses is a computer with a head-mounted display.
2. He wore thick glasses. He worked in google corporation.
3. He wore glasses to be able to read signs at a distance.

Term          Documents
google        1, 2
glasses       1, 2, 3
is            1
a             1, 3
computer      1
with          1
head-mounted  1
display       1
he            2, 3
wore          2, 3
thick         2
worked        2
in            2
corporation   2
to            3
be            3
able          3
read          3
…
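The term-to-document listing on this slide is an inverted index. A minimal sketch of how such an index could be built from the three sentences follows; the `tokenize` and `build_index` helpers and the tokenization rule (lowercase, alphabetic words, inner hyphens kept) are assumptions made for illustration, not something the slides specify.

```python
import re
from collections import defaultdict

# The three example sentences from the slide, numbered 1 to 3.
docs = {
    1: "Google Glasses is a computer with a head-mounted display.",
    2: "He wore thick glasses. He worked in google corporation.",
    3: "He wore glasses to be able to read signs at a distance.",
}

def tokenize(text):
    """Lowercase the text and keep alphabetic words, allowing inner hyphens."""
    return re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())

def build_index(documents):
    """Map each term to the sorted list of document ids that contain it."""
    postings = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

index = build_index(docs)
for term in ("google", "glasses", "a", "he"):
    print(term, index[term])
```

Storing postings as sets while building avoids duplicate ids when a term occurs twice in one sentence (for example "he" in sentence 2).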

Page 3: Mathematical approach for Text Mining 1

SVD with Numpy

>>> import numpy as np
>>> Original = np.matrix([[1, 1, 0, 1], [7, 0, 0, 7], [1, 1, 0, 1], [2, 5, 3, 6]])
>>> Original
matrix([[1, 1, 0, 1],
        [7, 0, 0, 7],
        [1, 1, 0, 1],
        [2, 5, 3, 6]])
>>> U, Sigma, VT = np.linalg.svd(Original)

Page 4: Mathematical approach for Text Mining 1

Singular Value Decomposition (SVD)

Harrington, Peter. Machine learning in action. Manning Publications Co., 2012.
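For reference, the factorization returned by `np.linalg.svd` can be written as below. The symbols $m$, $n$, $u_i$, $v_i$, and $A_k$ are notation assumed here rather than taken from the slides; $A_k$ denotes the approximation that keeps only the $k$ largest singular values.

```latex
A = U \Sigma V^{T}, \qquad
U \in \mathbb{R}^{m \times m},\;
\Sigma \in \mathbb{R}^{m \times n},\;
V \in \mathbb{R}^{n \times n},
\qquad
\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_{\min(m,n)} \ge 0

A_k = \sum_{i=1}^{k} \sigma_i \, u_i v_i^{T}
```

Here $u_i$ and $v_i$ are the $i$-th columns of $U$ and $V$, and $\sigma_i$ is the $i$-th diagonal entry of $\Sigma$.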

Page 5: Mathematical approach for Text Mining 1

Singular Values

>>> np.matrix(np.diag(Sigma))
matrix([[ 1.218e+01, 0.0e+00, 0.0e+00, 0.0e+00],
        [ 0.0e+00, 5.370e+00, 0.0e+00, 0.0e+00],
        [ 0.0e+00, 0.0e+00, 8.823e-01, 0.0e+00],
        [ 0.0e+00, 0.0e+00, 0.0e+00, 1.082e-15]])

Page 6: Mathematical approach for Text Mining 1

Full Recovery

>>> np.matrix(U)*np.matrix(np.diag(Sigma))*np.matrix(VT)
matrix([[ 1.0e+00, 1.0e+00, -5.296e-16, 1.0e+00],
        [ 7.0e+00, 4.302e-16, 7.979e-16, 7.0e+00],
        [ 1.0e+00, 1.0e+00, -2.542e-17, 1.0e+00],
        [ 2.0e+00, 5.0e+00, 3.0e+00, 6.0e+00]])

Original, for comparison:

matrix([[1, 1, 0, 1],[7, 0, 0, 7],[1, 1, 0, 1],[2, 5, 3, 6]])
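The recovery can also be checked numerically instead of by eye. The sketch below uses plain `np.array` and the `@` operator rather than `np.matrix` (a choice made here, not the slides'), and verifies three properties: the factors are orthogonal, the singular values come back in descending order, and the product reproduces the input up to rounding error.

```python
import numpy as np

Original = np.array([[1, 1, 0, 1],
                     [7, 0, 0, 7],
                     [1, 1, 0, 1],
                     [2, 5, 3, 6]], dtype=float)

U, Sigma, VT = np.linalg.svd(Original)

# U and VT are orthogonal: multiplying by the transpose gives the identity.
u_orthogonal = np.allclose(U.T @ U, np.eye(4))
v_orthogonal = np.allclose(VT @ VT.T, np.eye(4))

# Singular values are returned in descending order.
descending = bool(np.all(np.diff(Sigma) <= 0))

# Multiplying the factors back together recovers the input up to rounding error.
Recovered = U @ np.diag(Sigma) @ VT
recovered_ok = np.allclose(Recovered, Original)

print(u_orthogonal, v_orthogonal, descending, recovered_ok)
```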

Page 7: Mathematical approach for Text Mining 1

Recovering with some singular values

# Calculation with all singular values
[[1 1 0 1]
 [7 0 0 7]
 [1 1 0 1]
 [2 5 3 6]]

# Calculation with 3 of 4
[[1 1 0 1]
 [7 0 0 7]
 [1 1 0 1]
 [2 5 3 6]]

# Calculation with 2 of 4
[[1 1 0 1]
 [7 0 0 7]
 [1 1 0 1]
 [2 5 3 6]]

# Calculation with 1 of 4
[[1 0 0 1]
 [5 3 1 7]
 [1 0 0 1]
 [4 2 1 6]]
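The four reconstructions above can be produced with a small helper. This is a sketch under assumptions: `rank_k_approx` is a hypothetical name, and it slices the factors directly instead of zero-padding a singular-value matrix. It also prints the Frobenius-norm error of each reconstruction, which equals the square root of the sum of the discarded squared singular values.

```python
import numpy as np

Original = np.array([[1, 1, 0, 1],
                     [7, 0, 0, 7],
                     [1, 1, 0, 1],
                     [2, 5, 3, 6]], dtype=float)

def rank_k_approx(A, k):
    """Rebuild A from its k largest singular values (truncated SVD)."""
    U, sigma, VT = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(sigma[:k]) @ VT[:k, :]

sigma = np.linalg.svd(Original, compute_uv=False)
for k in (4, 3, 2, 1):
    approx = rank_k_approx(Original, k)
    # Frobenius norm of the difference between the input and its rank-k rebuild.
    error = np.linalg.norm(Original - approx)
    print("k =", k, "error =", round(float(error), 3))
    print(np.round(approx))
```

Because rows 1 and 3 of the matrix are identical, its rank is at most 3, which is why dropping the fourth singular value changes nothing beyond rounding error.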

Page 8: Mathematical approach for Text Mining 1

How many singular values to take?

>>> sig2 = Sigma**2
>>> sig2
array([1.48e+02, 2.88e+01, 7.78e-01, 1.17e-30])
>>> sum(sig2)
178.0
>>> sum(sig2)*0.9
160.20000000000002
>>> sum(sig2[:1])
148.375554981108
>>> sum(sig2[:2])
177.22150138532837

The first singular value alone accounts for 148.4 of 178.0, below the 90% threshold of 160.2; the first two account for 177.2, so two singular values are enough.
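The same check can be automated. In this sketch, `num_singular_values` and its `energy` parameter are hypothetical names; the 0.9 threshold mirrors the `sum(sig2)*0.9` line on the slide.

```python
import numpy as np

Original = np.array([[1, 1, 0, 1],
                     [7, 0, 0, 7],
                     [1, 1, 0, 1],
                     [2, 5, 3, 6]], dtype=float)

def num_singular_values(A, energy=0.9):
    """Smallest k whose squared singular values reach `energy` of the total."""
    sigma = np.linalg.svd(A, compute_uv=False)
    cumulative = np.cumsum(sigma ** 2)
    threshold = energy * cumulative[-1]
    # searchsorted returns the first index whose cumulative sum is >= threshold.
    k = int(np.searchsorted(cumulative, threshold)) + 1
    return min(k, len(sigma))

k = num_singular_values(Original, energy=0.9)
print(k)
```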

Page 9: Mathematical approach for Text Mining 1

Corpus

Page 10: Mathematical approach for Text Mining 1

Corpus

Page 11: Mathematical approach for Text Mining 1

Frequency Matrix

Page 12: Mathematical approach for Text Mining 1

Frequency Matrix

• Each term $t_i$ generates a row vector $(a_{i1}, a_{i2}, \cdots, a_{in})$, referred to as a term vector, and each document $d_j$ generates a column vector

$$d_j = \begin{pmatrix} a_{1j} \\ \vdots \\ a_{mj} \end{pmatrix}$$
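To make the term and document vectors concrete, here is a minimal sketch that builds a frequency matrix whose entry $a_{ij}$ counts how often term $t_i$ occurs in document $d_j$. Reusing the three sentences from the indexing slide as the corpus, the tokenization rule, and the names `tokenize`, `terms`, and `row_of` are all assumptions made for illustration.

```python
import re
import numpy as np

docs = [
    "Google Glasses is a computer with a head-mounted display.",
    "He wore thick glasses. He worked in google corporation.",
    "He wore glasses to be able to read signs at a distance.",
]

def tokenize(text):
    """Lowercase the text and keep alphabetic words, allowing inner hyphens."""
    return re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())

tokens = [tokenize(d) for d in docs]
terms = sorted({t for doc in tokens for t in doc})
row_of = {term: i for i, term in enumerate(terms)}

# A[i, j] is the number of times term i occurs in document j.
A = np.zeros((len(terms), len(docs)), dtype=int)
for j, doc in enumerate(tokens):
    for term in doc:
        A[row_of[term], j] += 1

print(A[row_of["glasses"], :])  # term vector: one row of A
print(A[:, 0])                  # document vector: one column of A
```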

Page 13: Mathematical approach for Text Mining 1

Frequency Matrix

>>> A = np.matrix([[1,0,0],[0,1,0],[1,1,1],[1,1,0],[0,0,1]])
>>> A
matrix([[1, 0, 0],
        [0, 1, 0],
        [1, 1, 1],
        [1, 1, 0],
        [0, 0, 1]])

Page 14: Mathematical approach for Text Mining 1

Example of SVD :: Full Singular values

U, Sigma, VT = np.linalg.svd(A)
# Place the singular values in a zero matrix shaped so that U*S*VT is defined
S = np.zeros((U.shape[1], VT.shape[0]))
S[:3, :3] = np.diag(Sigma)
Recon = U*S*VT
print(np.round(Recon))

[[ 1. 0. 0.]
 [ 0. 1. 0.]
 [ 1. 1. 1.]
 [ 1. 1. 0.]
 [ 0. 0. 1.]]

Page 15: Mathematical approach for Text Mining 1

Singular Value Decomposition (SVD)

Harrington, Peter. Machine learning in action. Manning Publications Co., 2012.

Page 16: Mathematical approach for Text Mining 1

Example of SVD :: 2 singular values

U, Sigma, VT = np.linalg.svd(A)
S = np.zeros((U.shape[1], VT.shape[0]))
# Keep only the two largest singular values; the third is left at zero
S[:2, :2] = np.diag(Sigma[:2])
Recon = U*S*VT
print(np.round(Recon, 5))

[[ 0.5 0.5 0.]
 [ 0.5 0.5 0.]
 [ 1. 1. 1.]
 [ 1. 1. 0.]
 [ 0. 0. 1.]]

Page 17: Mathematical approach for Text Mining 1

Kyunghoon Kim

array([[ 0.5, 0.5, 0. ],

[ 0.5, 0.5, 0. ],

[ 1. , 1. , 1. ],

[ 1. , 1. , 0. ],

[ 0. , 0. , 1. ]]) % rounded Matrix for convenience

% not rounded Matrix

matrix([[ 5.00000000e-01, 5.00000000e-01, 5.27355937e-16],

[ 5.00000000e-01, 5.00000000e-01, -1.94289029e-16],

[ 1.00000000e+00, 1.00000000e+00, 1.00000000e+00],

[ 1.00000000e+00, 1.00000000e+00, 3.33066907e-16],

[ 5.55111512e-16, -2.49800181e-16, 1.00000000e+00]])

Example of SVD :: 2 singular values

7/17/2014 Standard Latent Semantic Indexing 17
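The 0.5 entries can be traced by hand. The short derivation below uses $\sigma_3$, $u_3$, $v_3$ for the smallest singular value of $A$ and its singular vectors, and $A_2$ for the two-singular-value reconstruction; this notation is assumed here rather than taken from the slides.

```latex
A^{T}A = \begin{pmatrix} 3 & 2 & 1 \\ 2 & 3 & 1 \\ 1 & 1 & 2 \end{pmatrix},
\qquad
A^{T}A \, v_3 = 1 \cdot v_3
\quad\text{for}\quad
v_3 = \tfrac{1}{\sqrt{2}} \begin{pmatrix} 1 \\ -1 \\ 0 \end{pmatrix}

\text{The other eigenvalues are } \tfrac{7 \pm \sqrt{17}}{2} > 1,
\text{ so } \sigma_3 = 1 \text{ is the smallest singular value, and }
u_3 = \frac{A v_3}{\sigma_3}
    = \tfrac{1}{\sqrt{2}} \begin{pmatrix} 1 & -1 & 0 & 0 & 0 \end{pmatrix}^{T}

A_2 = A - \sigma_3 u_3 v_3^{T}
    = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 1 & 1 & 1 \\ 1 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}
    - \tfrac{1}{2} \begin{pmatrix} 1 & -1 & 0 \\ -1 & 1 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix}
    = \begin{pmatrix} 0.5 & 0.5 & 0 \\ 0.5 & 0.5 & 0 \\ 1 & 1 & 1 \\ 1 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}
```

Only the first two rows change, because $u_3$ is zero in the other three.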

Page 18: Mathematical approach for Text Mining 1

Query

Page 19: Mathematical approach for Text Mining 1

Query

Page 20: Mathematical approach for Text Mining 1

Example with Query

Case 1.

Case 2.

matrix([[ 5.00000000e-01, 5.00000000e-01, 5.27355937e-16],
        [ 5.00000000e-01, 5.00000000e-01, -1.94289029e-16],
        [ 1.00000000e+00, 1.00000000e+00, 1.00000000e+00],
        [ 1.00000000e+00, 1.00000000e+00, 3.33066907e-16],
        [ 5.55111512e-16, -2.49800181e-16, 1.00000000e+00]])

Page 21: Mathematical approach for Text Mining 1

Example with Query

Case 1.

query = np.matrix([[1,0,0,1,0]])
for i in range(Recon.shape[1]):
    q = query
    d = Recon[:,i]
    # Cosine similarity between the query and document column i
    dotproduct = np.dot(q, d).item()  # .item() replaces np.asscalar, which recent NumPy no longer provides
    normq = np.linalg.norm(q)
    normd = np.linalg.norm(d)
    print(dotproduct / (normq*normd))

Page 22: Mathematical approach for Text Mining 1

Example with Query

Case 1.

Case 2.

Page 23: Mathematical approach for Text Mining 1

What’s the feature of LSI?

Approximation of A:

matrix([[ 0.5, 0.5, 0. ],
        [ 0.5, 0.5, 0. ],
        [ 1. , 1. , 1. ],
        [ 1. , 1. , 0. ],
        [ 0. , 0. , 1. ]])
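One way to see what the approximation changes is to score the Case 1 query `[1,0,0,1,0]` against both the original frequency matrix and its two-singular-value approximation. This is a sketch under assumptions: `cosine_to_columns` is a hypothetical helper, and plain arrays are used instead of `np.matrix`.

```python
import numpy as np

A = np.array([[1, 0, 0],
              [0, 1, 0],
              [1, 1, 1],
              [1, 1, 0],
              [0, 0, 1]], dtype=float)

# Approximation of A that keeps the two largest singular values.
U, sigma, VT = np.linalg.svd(A, full_matrices=False)
A2 = U[:, :2] @ np.diag(sigma[:2]) @ VT[:2, :]

def cosine_to_columns(q, M):
    """Cosine similarity between vector q and every column of M."""
    return (q @ M) / (np.linalg.norm(q) * np.linalg.norm(M, axis=0))

query = np.array([1, 0, 0, 1, 0], dtype=float)
raw = cosine_to_columns(query, A)
lsi = cosine_to_columns(query, A2)
print(np.round(raw, 4))
print(np.round(lsi, 4))
```

Against the original matrix, documents 1 and 2 receive different scores; against the approximation their columns coincide, so they receive the same score.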

Page 24: Mathematical approach for Text Mining 1

Related work

Page 25: Mathematical approach for Text Mining 1

Demonstration of LSI

Page 26: Mathematical approach for Text Mining 1

Page 27: Mathematical approach for Text Mining 1

Page 28: Mathematical approach for Text Mining 1

Page 29: Mathematical approach for Text Mining 1

What’s Next?

• Probabilistic Latent Semantic Indexing

• Latent Dirichlet Allocation

Page 30: Mathematical approach for Text Mining 1

References

• Harrington, Peter. Machine learning in action. Manning Publications Co., 2012.

• Simovici, Dan A. Linear algebra tools for data mining. World Scientific, 2012.

• Berry, Michael W., Susan T. Dumais, and Gavin W. O'Brien. "Using linear algebra for intelligent information retrieval." SIAM review 37.4 (1995): 573-595.

