NLP and LSA getting started

Date post: 26-Jan-2015
Upload: innovation-engineering
An introduction to Natural Language Processing and Latent Semantic Analysis
Page 1: NLP and LSA getting started

Latent semantic analysis (LSA) is a technique in natural language processing, in particular in vectorial semantics, of analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms.

Source: Wikipedia


Latent semantic analysis Getting started

Page 2: NLP and LSA getting started

Natural language processing (NLP) is a field of computer science, artificial intelligence, and linguistics concerned with the interactions between computers and human (natural) languages.

Source: Wikipedia

Natural language processing can be divided into 4 phases:

Grammar analysis

Lexical analysis

Semantic analysis

Syntactic analysis

Apache OpenNLP: a machine-learning-based toolkit for the processing of natural language text.



LSA can be seen as a part of NLP.

Page 3: NLP and LSA getting started

Apache OpenNLP usage examples:

Lexical analysis

Grammar analysis

Syntactic analysis

Part-of-speech tagging


Chunker - Parser

NOTE: Before lexical analysis, it is possible to run a sentence analysis tool: the sentence detector (Apache OpenNLP).
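The order of the first two steps (sentence detection, then lexical analysis) can be illustrated with a toy sketch. This is not OpenNLP: OpenNLP uses trained models, while the two functions below are hypothetical rule-based stand-ins written with Python regular expressions, only to show how the output of one step feeds the next.

```python
import re

def detect_sentences(text):
    """Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]

def tokenize(sentence):
    """Split a sentence into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence)

text = "LSA finds hidden concepts. OpenNLP tags words!"
sentences = detect_sentences(text)            # step 1: sentence detection
tokens = [tokenize(s) for s in sentences]     # step 2: lexical analysis
```

A rule like this fails on abbreviations such as "e.g.", which is one reason a trained sentence detector is preferable on real text.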

Page 4: NLP and LSA getting started

Supervised machine learning concepts

INPUT DATA (ex: Wikipedia corpus)

Humans produce a finite set of (INPUT, OUTPUT) pairs. This is the training set. It can be seen as a discrete function.

Machine learning algorithm (ex: linear regression, maximum entropy, perceptron)

OUTPUT DATA (ex: POS-tagged corpus)

The machine produces a model. It can be seen as a continuous function.

INPUT DATA (ex: just a document)

OUTPUT DATA(that document


Input data are taken from an infinite set.

The machine, using the model and the input, produces the expected output.
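The workflow on this slide (finite training set, learning algorithm, model, new input) can be sketched with the perceptron, one of the algorithms named above. The training data (the logical AND function), the function names and the parameter values are assumptions chosen for illustration; they are not taken from OpenNLP.

```python
def train_perceptron(training_set, epochs=20, learning_rate=1.0):
    """Learn weights and a bias from a finite list of (inputs, target) pairs."""
    weights = [0.0] * len(training_set[0][0])
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in training_set:
            activation = bias + sum(w * x for w, x in zip(weights, inputs))
            predicted = 1 if activation > 0 else 0
            error = target - predicted  # 0 when the prediction is already right
            bias += learning_rate * error
            weights = [w + learning_rate * error * x
                       for w, x in zip(weights, inputs)]
    return weights, bias

def predict(model, inputs):
    """Apply the model to any input, including ones never seen in training."""
    weights, bias = model
    activation = bias + sum(w * x for w, x in zip(weights, inputs))
    return 1 if activation > 0 else 0

# Finite training set produced by "humans": the discrete function AND.
training_set = [((0.0, 0.0), 0), ((0.0, 1.0), 0),
                ((1.0, 0.0), 0), ((1.0, 1.0), 1)]
model = train_perceptron(training_set)

# The model is defined for every point of the plane, not just the four above.
new_prediction = predict(model, (0.9, 0.8))
```

The training set only defines the output at four points, while the returned model gives an answer for any pair of numbers, which is the discrete versus continuous distinction made on the slide.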

Page 5: NLP and LSA getting started

LSA assumes that words that are close in meaning will occur in similar pieces of text.

LSA is a method for discovering hidden concepts in document data.

LSA key concepts

[Figure: Doc 1, Doc 2, Doc 3, Doc 4]

Set of documents; each document contains several words.

The LSA algorithm takes documents and words and computes vectors in a semantic vector space using:
• A documents/words matrix
• Singular value decomposition (SVD)
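For reference, the decomposition can be written as follows. The symbols are assumptions introduced here, not notation from the slides: A is the matrix with m words as rows and n documents as columns, and k is the number of latent concepts that are kept.

```latex
A = U \Sigma V^{T}, \qquad
A \approx A_k = U_k \Sigma_k V_k^{T}, \quad k \ll \min(m, n)
```

Here U_k is m × k, Σ_k is the k × k diagonal matrix of the k largest singular values, and V_k is n × k. Under a common convention, the rows of U_k Σ_k serve as word vectors and the rows of V_k Σ_k as document vectors in the reduced space.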





Semantic vector space. Word1 and Word2 are close, which means that their (latent) meanings are related.
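"Close" is usually measured with cosine similarity, the cosine of the angle between two vectors. The sketch below assumes that measure and uses made-up two-dimensional vectors; real LSA vectors have many more dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v: 1 = same direction, 0 = orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention for the zero vector
    return dot / (norm_u * norm_v)

# Hypothetical coordinates in a 2-dimensional semantic space.
word1 = [0.9, 0.1]
word2 = [0.8, 0.3]
word3 = [-0.1, 0.9]

close = cosine_similarity(word1, word2)  # nearly the same direction
far = cosine_similarity(word1, word3)    # nearly orthogonal
```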

Page 6: NLP and LSA getting started


[Figure: Doc 1, Doc 2, Doc 3, Doc 4]

Words/document matrix:

        Doc1  Doc2  Doc3  Doc4
Word1   1     0     1     0
Word2   1     0     1     1
Word3   0     1     0     1
…

1: the i-th word occurs in the j-th document.
0: the i-th word does not occur in the j-th document.
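A matrix of this kind can be built in a few lines. In the sketch below the four document texts are invented so that the result reproduces the matrix above (Word1 = "cat", Word2 = "dog", Word3 = "bird"); the function name is hypothetical as well.

```python
def build_word_document_matrix(documents):
    """Return (vocabulary, matrix); matrix[i][j] is 1 if word i occurs in doc j."""
    vocabulary = []
    for text in documents.values():
        for word in text.lower().split():
            if word not in vocabulary:
                vocabulary.append(word)  # keep first-appearance order
    words_per_doc = {name: set(text.lower().split())
                     for name, text in documents.items()}
    matrix = [[1 if word in words_per_doc[name] else 0 for name in documents]
              for word in vocabulary]
    return vocabulary, matrix

# Invented toy corpus; dict order gives the column order Doc1..Doc4.
documents = {
    "Doc1": "cat dog",
    "Doc2": "bird",
    "Doc3": "dog cat",
    "Doc4": "dog bird",
}
vocabulary, matrix = build_word_document_matrix(documents)
```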

The matrix is very large (thousands of words, hundreds of documents).

SVD of the matrix, used to reduce the matrix dimension.
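To make the reduction concrete, the sketch below extracts only the largest singular value and its two singular vectors from the 3 × 4 example matrix, using power iteration in plain Python. Power iteration is swapped in here because it is short; it is not claimed to be what SemanticVectors or JLSI do internally, and all function names are hypothetical.

```python
import math

# Word/document matrix from the example: rows are words, columns are documents.
A = [[1, 0, 1, 0],
     [1, 0, 1, 1],
     [0, 1, 0, 1]]

def mat_vec(matrix, vector):
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]

def transpose(matrix):
    return [list(column) for column in zip(*matrix)]

def norm(vector):
    return math.sqrt(sum(x * x for x in vector))

def leading_singular_triplet(matrix, iterations=200):
    """Power iteration on A^T A; returns (u, sigma, v) for the top component."""
    matrix_t = transpose(matrix)
    v = [1.0] * len(matrix[0])
    for _ in range(iterations):
        w = mat_vec(matrix_t, mat_vec(matrix, v))  # w = A^T A v
        length = norm(w)
        v = [x / length for x in w]
    a_v = mat_vec(matrix, v)
    sigma = norm(a_v)                  # largest singular value
    u = [x / sigma for x in a_v]       # matching left singular vector
    return u, sigma, v

u, sigma, v = leading_singular_triplet(A)

# Best rank-1 approximation of A: every entry is sigma * u[i] * v[j].
rank1 = [[sigma * ui * vj for vj in v] for ui in u]
```

With one component kept, each word is described by a single number (its entry in u) and each document by a single number (its entry in v), instead of a full row or column of the matrix. A real LSA run keeps many more components than this.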

SemanticVectors or JLSI libraries:
• SVD decomposition.
• Build the semantic vector space.





UIMA to manage the solution

Page 7: NLP and LSA getting started

Online references:
http://opennlp.apache.org/documentation/manual/opennlp.html
https://code.google.com/p/semanticvectors/
http://hlt.fbk.eu/en/technology/jlsi
http://uima.apache.org/
http://en.wikipedia.org/wiki/Singular_value_decomposition
http://en.wikipedia.org/wiki/Eigenvalues_and_eigenvectors

Coursera video references:
http://www.coursera.org/course/nlangp
http://www.coursera.org/course/ml

Page 8: NLP and LSA getting started

Some snippets and console commands

OpenNLP has a command-line tool which is used to train the models.

Trained Model

Page 9: NLP and LSA getting started

Models and document to manage

This snippet takes 4 files as input and produces a new file that is sentence-detected, tokenized and POS-tagged.




The resulting document is sentence-detected, tokenized and POS-tagged, and could be, for example, indexed in a search engine such as Apache Solr.

Page 10: NLP and LSA getting started

Note that lucene-core is a transitive (hierarchical) dependency.

.bat file to load the classpath

SemanticVectors has two main functions:

1. Building wordSpace models.

To build the wordSpace model, SemanticVectors needs indexes created by Apache Lucene.

2. Searching through the vectors in such models.

Ex: Bible chapter indexed by Lucene

Page 11: NLP and LSA getting started

1. Building wordSpace models using the pitt.search.semanticvectors.LSA class from the index created by Apache Lucene (from a Bible chapter).

In this example the Bible chapter contains 29 documents, and in total there are 2460 terms.

SemanticVectors builds:
1. 29 vectors that represent the documents (docvector.bin)
2. 2460 vectors that represent the terms (termvector.bin)

These two files represent the wordSpace.

Note that it is also possible to use the pitt.search.semanticvectors.BuildIndex class, which uses Random Projection instead of LSA to reduce the dimensionality of the representation.

Page 12: NLP and LSA getting started

2. Searching through docVector and termVector

2.1 Searching for documents using terms

Search for document vectors closest to the vector "Abraham":

Page 13: NLP and LSA getting started

2.2 Using a document file as a source of queries

Find terms most closely related to Chapter 1 of Chronicles:

Page 14: NLP and LSA getting started

2.3 Searching for a general word

Find terms most closely related to "Abraham".

Page 15: NLP and LSA getting started

2.4 Comparing words

Compare “abraham” with “Isaac”.

Compare “abraham” with “massimo”.
