Natural Language Processing with Python

Post on 14-Apr-2017

Kodliuk Tetiana

www.vitech.com.ua

What is NLP?

Natural language processing (NLP) is the ability of a computer program to understand human language, both as it is spoken and as it is written.

Why NLP?

NUMBERS EVERYWHERE

In the beginning was THE WORD…

The most terrible – Statistics…

How do statistics lie?

World average:
• 6.1 trillion text messages / year
• 7 billion people
• 3 messages/day/person

But:
• Teenagers: 50 messages/day

How do statistics lie? 2050

• 9 billion people acting like teenagers
• 450 billion texts/day
• 164 trillion texts/year (6 trillion now)
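The arithmetic behind these two slides can be checked with a short sketch. The input figures are the ones quoted on the slides; the function names and the 365-day year are assumptions made for illustration. With these inputs the current average comes out at roughly 2.4 messages per person per day, which the slide rounds up to 3.

```python
# Figures quoted on the slides.
MESSAGES_PER_YEAR = 6.1e12   # 6.1 trillion text messages per year
WORLD_POPULATION = 7e9       # 7 billion people
TEEN_RATE = 50               # messages per day for a teenager
POPULATION_2050 = 9e9        # 9 billion people

DAYS_PER_YEAR = 365          # assumption: leap years are ignored


def per_person_per_day(messages_per_year, population):
    """Average number of messages one person sends per day."""
    return messages_per_year / population / DAYS_PER_YEAR


def yearly_total(population, rate_per_day):
    """Total messages per year if everyone sends rate_per_day messages."""
    return population * rate_per_day * DAYS_PER_YEAR


average_now = per_person_per_day(MESSAGES_PER_YEAR, WORLD_POPULATION)
daily_2050 = POPULATION_2050 * TEEN_RATE
yearly_2050 = yearly_total(POPULATION_2050, TEEN_RATE)

print(f"average now: {average_now:.1f} messages/day/person")
print(f"2050: {daily_2050 / 1e9:.0f} billion texts/day")
print(f"2050: {yearly_2050 / 1e12:.0f} trillion texts/year")
```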

Why Python?

WHAT?

Business problems

• Sentiment analysis

• Spam/non-spam detection

• Similar text searching

• Text specialization

If you are a scientist…

Liquid crystal suspensions of carbon nanotubes assisted by organically modified Laponite nanoplatelets

Articles Similarity

● How to find similar articles?
● How to find interesting news for you?
● How to say if these customers are similar?
● How to detect the theme of text?

Word2Vec - Tomas Mikolov, 2013

LDA2Vec - Christopher Moody, 2015

Doc2Vec - Quoc Le and Tomas Mikolov, 2014

Good solution: Doc2Vec

✓ “Oculist and eye-doctor … occur in almost the same environments”, Z. Harris (1954)

✓ “You shall know a word by the company it keeps!”, Firth (1957)

✓ “Tell me who your friends are and I will tell you who you are”, Ukrainian proverb

Word2Vec

Corpus reading

Vocabulary creation

Sub-sampling

Window moving

Feedforward neural network
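The first steps of this pipeline can be illustrated with a minimal sketch, assuming a CBoW-style setup in which the words inside the window are used to predict the word in the middle. The function names, the window size and the example sentence are assumptions for illustration; sub-sampling and the neural network itself are left out.

```python
from collections import Counter


def build_vocabulary(tokens, min_count=1):
    """Keep every word that occurs at least min_count times."""
    counts = Counter(tokens)
    return {word for word, count in counts.items() if count >= min_count}


def cbow_pairs(tokens, window=2):
    """Slide a window over the tokens and yield (context_words, target_word)."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        context = left + right
        if context:
            yield context, target


# Corpus reading: a single sentence split on whitespace.
tokens = "it is elementary my dear watson".split()
vocabulary = build_vocabulary(tokens)
pairs = list(cbow_pairs(tokens, window=2))

for context, target in pairs:
    print(context, "->", target)
```

Each pair is one training example: the network receives the context words as input and is trained to output the target word.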

Word2Vec: CBoW

[Figure: CBoW diagram with the words “It”, “is”, “elementary”, “my”, “dear”, “Watson”]

Word2Vec

it [0.23, 0.45, …… 0.71]

is [0.13, 0.50, …… 0.12]

elementary [0.05, 0.89, …… 0.08]

my [0.65, 0.15, …… 0.41]

dear [0.98, 0.21, …… 0.11]

watson [0.42, 0.12, …… 0.81]

Word2Vec

Sherlock Holmes cried: “Exactly, my dear Watson!”

Holmes said: “Elementary, my dear fellow! Ho! Elementary”

Then Psmith murmured: “Elementary, my dear Watson, elementary,”

[Figure labels: Holmes, Psmith, Watson, fellow, Elementary, Exactly, cried, said]
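Words that appear in similar contexts should end up with similar vectors, and a common way to measure that is cosine similarity. The sketch below is only an illustration: the three-dimensional vectors are hypothetical toy values, not the ones from the slides, and `most_similar` is a helper name chosen here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(word, vectors):
    """Return the other word whose vector is closest to the given word."""
    scores = {
        other: cosine_similarity(vectors[word], vector)
        for other, vector in vectors.items()
        if other != word
    }
    return max(scores, key=scores.get)


# Hypothetical toy vectors, chosen by hand for this example.
toy_vectors = {
    "holmes": [0.9, 0.1, 0.2],
    "psmith": [0.8, 0.2, 0.3],
    "cried": [0.1, 0.9, 0.4],
}

print(most_similar("holmes", toy_vectors))
```

With these toy values, "holmes" is closest to "psmith", mirroring the idea that two speakers used in the same kind of sentence get neighbouring vectors.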

Doc2Vec

Vector for Document
Titanic.txt [0.23, 0.45, …… 0.71]
Room.txt [0.13, 0.50, …… 0.12]
Sangredus.txt [0.05, 0.89, …… 0.08]
Umbriel.txt [0.65, 0.15, …… 0.41]
Dumped.txt [0.98, 0.21, …… 0.11]
Nessa.txt [0.42, 0.12, …… 0.81]

Vector for Word
titanic [0.03, 0.89, …… 0.71]
apartment [0.83, 0.50, …… 0.12]
room [0.55, 0.89, …… 0.08]
parrot [0.62, 0.15, …… 0.41]
nessa [0.08, 0.21, …… 0.11]
word [0.42, 0.12, …… 0.81]

Doc2Vec

[Figure labels: LDA, Doc2Vec, Word2Vec]

LDA

LDA2Vec

= 0.15*programming + 0.25*football + 0.60*beer
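The formula above reads as a weighted mixture of topics. A minimal sketch of that idea follows, assuming each topic is itself a vector and the document vector is their weighted sum. The weights are the ones on the slide; the topic vectors and the name `mix_topics` are hypothetical.

```python
def mix_topics(weights, topic_vectors):
    """Return the weighted sum of the topic vectors."""
    dimensions = len(next(iter(topic_vectors.values())))
    mixed = [0.0] * dimensions
    for topic, weight in weights.items():
        for i, value in enumerate(topic_vectors[topic]):
            mixed[i] += weight * value
    return mixed


# Topic weights from the slide; they sum to 1.
weights = {"programming": 0.15, "football": 0.25, "beer": 0.60}

# Hypothetical topic vectors: one axis per topic, for readability only.
topic_vectors = {
    "programming": [1.0, 0.0, 0.0],
    "football": [0.0, 1.0, 0.0],
    "beer": [0.0, 0.0, 1.0],
}

document_vector = mix_topics(weights, topic_vectors)
print(document_vector)
```

Because the weights sum to 1, they can be read directly as the share of each topic in the document.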

TIME FOR PYTHON

Why Python?

• NLTK

• Gensim

• TextBlob

• urllib

• Pattern

• Orange

• scikit-learn (sklearn)

Data Science Flow

Target formulation

Wikipedia parsing

Text cleaning

Model building

Results analysis

Target formulation

• Article similarity with Doc2Vec
• Topics with LDA2Vec

Wikipedia parsing
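The code from this slide did not survive in the transcript, so the following is only a sketch of one possible approach using the standard library. It assumes articles are fetched as HTML from `https://en.wikipedia.org/wiki/<title>` and that the body text lives in `<p>` elements; the class name, function names and User-Agent string are all made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import quote
from urllib.request import Request, urlopen


class ParagraphExtractor(HTMLParser):
    """Collect the text found inside <p> elements."""

    def __init__(self):
        super().__init__()
        self._inside = False
        self._chunks = []
        self.paragraphs = []

    def _flush(self):
        # Join the collected pieces and normalise whitespace.
        text = " ".join("".join(self._chunks).split())
        if text:
            self.paragraphs.append(text)
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush()
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self._chunks.append(data)


def extract_paragraphs(html):
    """Return the paragraph texts of an HTML page as a list of strings."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


def fetch_article(title):
    """Download the HTML of one article (assumed URL scheme, needs network)."""
    url = "https://en.wikipedia.org/wiki/" + quote(title)
    request = Request(url, headers={"User-Agent": "nlp-demo/0.1"})
    with urlopen(request) as response:
        return response.read().decode("utf-8")


# Offline example, so the sketch runs without network access.
sample = ("<html><body><h1>Ukraine</h1>"
          "<p>Ukraine is a <b>country</b> in Eastern Europe.</p><p></p>"
          "</body></html>")
print(extract_paragraphs(sample))
```

For a live page, `extract_paragraphs(fetch_article("Ukraine"))` would return the article's paragraphs, which then go to the text cleaning step.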

Text cleaning

Tokenization

Digit removal

Stopword removal

Punctuation cleaning

Coding

Stemming

['ukraine', 'ukrainian', 'ukraina', 'country', 'eastern', 'europe', 'bordered', 'russia', 'east', 'northeast', 'belarus', 'northwest', 'poland', 'slovakia', 'west', 'hungary', 'romania', 'moldova', 'southwest', 'black', 'azov', 'south', 'southeast', 'respectively', 'ukraine', 'currently', 'territorial', 'dispute', 'russia', 'crimean', 'peninsula', 'russia', 'annexed', 'ukraine', 'international', 'community', 'recognise', 'ukrainian', 'including', 'crimea', 'ukraine', 'area', 'making', 'largest', 'country', 'entirely', 'within', 'europe', 'largest', 'country', 'world', 'population', 'million', 'making', 'populous', 'country', 'world']
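A token list like the one above could be produced by a cleaning function along the following lines. This is a minimal sketch using only the standard library: the stopword list is a tiny hypothetical one, the steps run in a slightly different order than on the slide, and the coding and stemming steps are left out.

```python
import re
import string

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"a", "an", "and", "the", "is", "in", "of", "to", "by", "it", "with"}

# Translation table that turns every punctuation character into a space.
PUNCTUATION_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_text(text, stopwords=STOPWORDS):
    """Lowercase, drop digits and punctuation, tokenize, drop stopwords."""
    text = text.lower()
    text = re.sub(r"\d+", " ", text)           # digit removal
    text = text.translate(PUNCTUATION_TABLE)   # punctuation cleaning
    tokens = text.split()                      # tokenization on whitespace
    return [token for token in tokens if token not in stopwords]


print(clean_text("Ukraine is a country in Eastern Europe, bordered by Russia to the east."))
```

For the example sentence this gives `['ukraine', 'country', 'eastern', 'europe', 'bordered', 'russia', 'east']`.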

Doc2Vec: LabeledSentence

“Doc_12” “Robot” “Food_cat” LDA vec
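In gensim, `LabeledSentence` (called `TaggedDocument` in later versions) pairs a list of words with a list of tags, so one document can carry several labels at once. The sketch below imitates that shape with a plain namedtuple so it runs without gensim installed; `LabeledDocument` and `label_document` are stand-in names, and the example tokens are invented.

```python
from collections import namedtuple

# Stand-in for gensim's labelled-document type: a list of words plus a list of tags.
LabeledDocument = namedtuple("LabeledDocument", ["words", "tags"])


def label_document(doc_id, tokens, extra_tags=()):
    """Attach a unique document tag and any extra tags to a token list."""
    tags = ["Doc_%d" % doc_id] + list(extra_tags)
    return LabeledDocument(words=list(tokens), tags=tags)


doc = label_document(12, ["robot", "eats", "cat", "food"],
                     extra_tags=["Robot", "Food_cat"])
print(doc.tags)
```

Every tag gets its own vector during training, which is what lets a document, a title word and a category be compared in the same space.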

Doc2Vec

from gensim.models.doc2vec import Doc2Vec
model = Doc2Vec(size=300, window=10, min_count=10, workers=4, alpha=0.025, min_alpha=0.025)  # gensim 4.x renames size to vector_size

Doc2Vec as Word2Vec

[Figure labels: ARTICLE, WORD, MALWARE]

Doc2Vec as Word2Vec

[Figure labels: Robot, Hobbit, Programmer, Math]

LDA2Vec: LabeledSentence

“Doc_12” “Robot” “Food_cat” LDA vec

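LDA

import gensim
# modelled_corpus: bag-of-words corpus; dictionary: the id-to-word mapping built from the same texts
lda = gensim.models.ldamodel.LdaModel(modelled_corpus, num_topics=20, update_every=100, passes=20, id2word=dictionary, alpha='auto', eval_every=5)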
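`LdaModel` takes its corpus in bag-of-words form: each document is a list of `(token_id, count)` pairs, normally produced with `gensim.corpora.Dictionary` and its `doc2bow` method. The slide does not show how `modelled_corpus` and `dictionary` were built, so the sketch below only imitates that format in plain Python; `build_dictionary` and `to_bag_of_words` are hypothetical helper names.

```python
from collections import Counter


def build_dictionary(documents):
    """Map each distinct token to an integer id, in order of first appearance."""
    token_to_id = {}
    for tokens in documents:
        for token in tokens:
            token_to_id.setdefault(token, len(token_to_id))
    return token_to_id


def to_bag_of_words(tokens, token_to_id):
    """Turn a token list into sorted (token_id, count) pairs."""
    counts = Counter(token_to_id[t] for t in tokens if t in token_to_id)
    return sorted(counts.items())


documents = [
    ["ukraine", "country", "europe", "country"],
    ["europe", "russia"],
]
token_to_id = build_dictionary(documents)
corpus = [to_bag_of_words(doc, token_to_id) for doc in documents]
print(corpus)
```

Word order is discarded in this representation; only how often each word occurs in each document reaches the topic model.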
