
Sentiment Analysis: best practices and challenges

Vitalii Radchenko

Problem definition

• A company wants to build a sentiment analysis model

• The main task is to classify a review as positive or negative

• Metrics: accuracy / F1-score


Data sources

• Open datasets:
  • Amazon – 143.7 million reviews
  • IMDb (50k), RT, Twitter (1.5M)
• Parse data yourself:
  • iHerb, iTunes, RT, GoodReads, Expedia, Yelp, etc.
  • Remember about the Terms of Use

https://github.com/udsclub/Project1.-Sentiment-Analysis/tree/master/data
https://github.com/udsclub/Project1.-Sentiment-Analysis/tree/master/src/parsers


Data Analysis

• Very important – don't skip this step
• Calculate simple statistics (sketch below):
  • Count the reviews
  • Mean number of words per review, mean length in characters
  • Distribution of review lengths in words (<3, 4–10, 11–50, >51)
  • Count duplicates
  • Check languages (with spaCy)

https://github.com/udsclub/Project1.-Sentiment-Analysis/tree/master/data
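A minimal sketch of these statistics, assuming the reviews are already loaded into a pandas DataFrame (the file and column names are hypothetical):

```python
# Simple dataset statistics for a reviews DataFrame with a "text" column.
import pandas as pd

df = pd.read_csv("reviews.csv")                      # hypothetical file

word_counts = df["text"].str.split().str.len()
print("reviews:", len(df))
print("mean words per review:", word_counts.mean())
print("mean length (chars):", df["text"].str.len().mean())

# Distribution of review lengths in words
bins = pd.cut(word_counts, bins=[0, 3, 10, 50, float("inf")],
              labels=["<=3", "4-10", "11-50", ">50"])
print(bins.value_counts())

print("duplicates:", df["text"].duplicated().sum())
```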

Data preprocessing

• Text preprocessing

• Text —> Vector

• Embeddings

• Dimensionality reduction


Text preprocessing

• NLTK – over 50 corpora, WordNet, tokenization, stemming, tagging, parsing, semantic reasoning, and wrappers for industrial-strength NLP libraries
• TextBlob – part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, etc.
• Pattern – fast part-of-speech tagger for English, sentiment analysis, tools for English verb conjugation and noun singularization & pluralization, and a WordNet interface
• spaCy – tokenization, syntax-driven sentence segmentation, pre-trained word vectors, part-of-speech tagging, named entity recognition, labelled dependency parsing (Cython)
• Lemmatization examples: children → child, better → good (example below)

https://github.com/udsclub/workshop/blob/master/notebooks/UDS-workshop-NLP_Libraries.ipynb
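The lemmatization examples above can be reproduced with NLTK's WordNet lemmatizer (one option among several; spaCy or TextBlob would work as well):

```python
# WordNet lemmatization: map inflected forms to their lemma.
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet")                             # WordNet data, needed once
lemmatizer = WordNetLemmatizer()

print(lemmatizer.lemmatize("children"))              # -> child
print(lemmatizer.lemmatize("better", pos="a"))       # -> good (as an adjective)
```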


Text —> Vector

• Bag of Words (sketch below):
  • CountVectorizer (ngrams, max/min_df, max_features)
  • Tf-Idf (ngrams, max/min_df, max_features, norm, smooth_idf)
  • HashingVectorizer (ngrams, n_features, non_negative)
• Sentiment features:
  • polarity, subjectivity (TextBlob)
  • contrast conjunctions
  • positive and negative smileys
• Manual features:
  • count of exclamation and question marks
  • uppercase words
  • rating extracted from the text (“2/10”)

https://github.com/udsclub/workshop/blob/master/notebooks/UDS-workshop-feature-extraction-and-engineering.ipynb
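A minimal sketch of the bag-of-words and TextBlob sentiment features, using scikit-learn; the parameter values and example texts are purely illustrative:

```python
# Tf-idf n-gram features plus per-review polarity/subjectivity from TextBlob.
from sklearn.feature_extraction.text import TfidfVectorizer
from textblob import TextBlob

texts = ["I loved this movie!", "Worst film ever, 2/10"]

vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=1,
                             max_features=50000, smooth_idf=True)
X_bow = vectorizer.fit_transform(texts)              # sparse tf-idf matrix

sentiments = [(TextBlob(t).sentiment.polarity,
               TextBlob(t).sentiment.subjectivity) for t in texts]
print(X_bow.shape, sentiments)
```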


Embeddings

• Word2Vec
  • pre-trained: GoogleNews (6B×300)
  • gensim – the fastest implementation (also available in TensorFlow); loading sketch below
• GloVe
  • pre-trained: Stanford vectors
  • C (Stanford) / TensorFlow / numpy implementations
• HellingerPCA

https://github.com/3Top/word2vec-api
https://github.com/udsclub/workshop/blob/master/notebooks/USDC-workshop-word2vec_practice_gensim.ipynb
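A short sketch of loading the pre-trained GoogleNews vectors with gensim; the binary file has to be downloaded separately and the path is hypothetical:

```python
# Load pre-trained word2vec vectors and query them.
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

print(vectors["movie"].shape)                        # 300-dimensional vector
print(vectors.similarity("good", "great"))           # cosine similarity
```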


Dimensionality reduction

• PCA & SVD don't work with sparse matrices
• Use TruncatedSVD instead (sketch below)
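A minimal sketch: TruncatedSVD accepts the sparse output of a vectorizer directly, unlike scikit-learn's PCA (texts and dimensions are illustrative):

```python
# Reduce a sparse tf-idf matrix with TruncatedSVD.
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["great movie", "terrible movie", "not bad at all"]
X = TfidfVectorizer().fit_transform(texts)           # sparse matrix

svd = TruncatedSVD(n_components=2, random_state=42)
X_reduced = svd.fit_transform(X)                     # dense, low-dimensional
print(X_reduced.shape)
```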


Approaches

• Linear models:
  • SVM, Logistic Regression, Naive Bayes
• Trees, ensembles and boosting:
  • Random Forest, ExtraTrees, XGBoost, LightGBM
• FastText
• Word-based NN:
  • LSTM, GRU, CNN
• Char-based NN:
  • CNN & Dense, CNN & LSTM


Linear Models

• LinearSVC with small data, Logistic Regression with bigger data (pipeline sketch below)
• Count/Tf-Idf vectorizer with many n-grams and regularization (min/max_df, max_features)
• Stemming/lemmatization don't help
• Remove stopwords, validating the choice with cross-validation

https://github.com/udsclub/workshop/blob/master/notebooks/USDC-workshop-Linear_models__svm_logistic_regression.ipynb
https://github.com/udsclub/workshop/blob/master/notebooks/UDS-workshop-working-with-linear-models.ipynb
https://github.com/udsclub/xray-sentiment-analysis
https://github.com/udsclub/zulu-sentiment-analysis
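A minimal sketch of this linear baseline: tf-idf n-grams feeding a LinearSVC, scored with cross-validation (the toy data and hyperparameters are illustrative):

```python
# Linear baseline: tf-idf n-grams + LinearSVC, evaluated with cross-validation.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = ["loved it", "hated it", "best film of the year", "utter waste of time"]
labels = [1, 0, 1, 0]

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 3), max_features=100000, min_df=1),
    LinearSVC(C=1.0),
)
print(cross_val_score(model, texts, labels, cv=2, scoring="f1").mean())
```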


Trees, ensembles and boosting

• The worst models for sentiment analysis :)
• They overfit
• Work well in an ensemble with linear models (sketch below)

https://github.com/udsclub/kilo-sentiment-analysis
https://github.com/udsclub/workshop/blob/master/notebooks/UDS-workshop-trees-and-boosting.ipynb
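A hedged sketch of the "ensemble with linear models" point: soft voting between Logistic Regression and a Random Forest over the same tf-idf features (data and parameters are illustrative):

```python
# Soft-voting ensemble of a linear model and a tree-based model.
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["great acting", "awful plot", "loved every minute", "fell asleep twice"]
labels = [1, 0, 1, 0]

ensemble = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    VotingClassifier(
        estimators=[("logreg", LogisticRegression(max_iter=1000)),
                    ("rf", RandomForestClassifier(n_estimators=200))],
        voting="soft"),
)
ensemble.fit(texts, labels)
print(ensemble.predict(["what a great waste of time"]))
```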


FastText

• Very simple (training sketch below)
• Needs text preprocessing (spaCy.en + stopwords)
• Pre-trained vectors: wiki.en
• Tune the regularization parameters
• Gives a good result

https://github.com/udsclub/workshop/blob/master/notebooks/UDS-workshop-fastText.ipynb
https://github.com/udsclub/foxtrot-sentiment-analysis
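A minimal training sketch with the `fasttext` Python package; the training file uses fastText's `__label__` format, one review per line, and the file name and hyperparameters are illustrative:

```python
# Supervised fastText classifier.
import fasttext

# train.txt lines look like:
#   __label__positive i really enjoyed this film
#   __label__negative boring and far too long
model = fasttext.train_supervised(
    input="train.txt", lr=0.5, epoch=10, wordNgrams=2, dim=100)

print(model.predict("one of the best movies i have seen"))
```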


Word-based NN (LSTM)

• The best model: a simple LSTM (sketch below)
• Pre-trained Google word2vec as embeddings
• Truncate posts; a bigger maxlen is better
• Use masking, the Adam optimizer and plenty of dropout!
• You have to store a big vocabulary and the embeddings

https://github.com/udsclub/workshop/blob/master/notebooks/UDS-workshop-rnn.ipynb
https://github.com/udsclub/alpha-sentiment-analysis/tree/master/full_movie_reviews
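A minimal Keras sketch in the spirit of this slide: embeddings with masking, a single LSTM layer, dropout, Adam. Shapes and hyperparameters are illustrative, and the embedding matrix would normally be initialised from the pre-trained Google word2vec vectors:

```python
# One-layer word-based LSTM for binary sentiment classification.
from tensorflow.keras import layers, models

vocab_size, embed_dim, maxlen = 50000, 300, 400      # inputs padded/truncated to maxlen

model = models.Sequential([
    layers.Embedding(vocab_size, embed_dim, mask_zero=True),  # mask padded positions
    layers.LSTM(128, dropout=0.3, recurrent_dropout=0.3),
    layers.Dropout(0.5),
    layers.Dense(1, activation="sigmoid"),                    # positive vs negative
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(X_train, y_train, validation_split=0.1, epochs=5) would follow,
# where X_train holds integer word indices padded to maxlen.
```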


Word-based NN (CNN)

• 1D convolutions, max pooling, dropouts, dense (sketch below)
• Better to train your own embeddings
• Stemming, removing stopwords
• Works worse than LSTM

https://github.com/udsclub/workshop/blob/master/notebooks/UDS-workshop-CNN.ipynb
https://github.com/udsclub/charlie-sentiment-analysis
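A minimal Keras sketch of the word-based CNN recipe: 1D convolutions, max pooling, dropout and dense layers, with embeddings trained from scratch (hyperparameters are illustrative):

```python
# Word-based 1D CNN for binary sentiment classification.
from tensorflow.keras import layers, models

vocab_size, embed_dim = 50000, 128

model = models.Sequential([
    layers.Embedding(vocab_size, embed_dim),
    layers.Conv1D(128, 5, activation="relu"),
    layers.MaxPooling1D(2),
    layers.Conv1D(128, 5, activation="relu"),
    layers.GlobalMaxPooling1D(),
    layers.Dropout(0.5),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```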


Char-based NN

• Two approaches for preparing the data:
  • OHE (70 symbols)
  • Embeddings
• Two most popular architectures (the first is sketched below):
  • N × Conv1D + GlobalMaxPooling + Dense
  • N × (Conv1D + MaxPooling) + LSTM
• With OHE or embeddings there is no need to store a big vocabulary

https://github.com/udsclub/workshop/blob/master/notebooks/UDS-workshop-char-models.ipynb
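A sketch of the first char-based architecture, N × Conv1D + GlobalMaxPooling + Dense, using a small character embedding instead of one-hot input (alphabet size, depths and layer sizes are illustrative):

```python
# Char-based CNN: stacked Conv1D + GlobalMaxPooling + Dense.
from tensorflow.keras import layers, models

alphabet_size = 70                 # ~70 printable symbols, index 0 kept for padding

model = models.Sequential([
    layers.Embedding(alphabet_size + 1, 16),
    layers.Conv1D(256, 7, activation="relu"),
    layers.Conv1D(256, 7, activation="relu"),
    layers.GlobalMaxPooling1D(),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```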

My ranking (small data)

1. Word-based LSTM

2. Linear models

3. Char-based CNN + LSTM

4. FastText

5. Word-based CNN

6. Boosting


My ranking (big data)

1. Word-based LSTM

2. Char-based CNN + LSTM

3. FastText

4. Linear models (log reg)

5. Word-based CNN

6. Boosting



Observations 1

• Small number of short reviews:
  • linear models with BoW (many n-grams and strong regularization)
• Small number of long reviews:
  • a one-layer LSTM with pre-trained Google word2vec and plenty of dropout
• Many reviews:
  • LSTM and char-CNN


Observations 2

• Begin with simple models: a one-layer LSTM or Logistic Regression
• Complex LSTMs (bidirectional, stacked, merged, with attention) don't usually work better than a simple LSTM
• An LSTM with attention gives the biggest weights to the last words

https://github.com/udsclub/whiskey-sentiment-analysis/blob/master/test-attention.ipynb

Observations 3

• An imbalanced dataset leads to heavy overfitting on the smaller class on the test set
• Pay attention to the F1-score and the classification report
• If you have many reviews, just remove some samples from the bigger class (sketch below)

https://github.com/udsclub/alpha-sentiment-analysis/blob/master/amazonTv/scripts/validation_curves.ipynb
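A small sketch of the balancing step: downsample the majority class and then judge the model with the per-class classification report (the DataFrame and column names are hypothetical):

```python
# Downsample the majority class to the size of the minority class.
import pandas as pd
from sklearn.metrics import classification_report

df = pd.read_csv("reviews.csv")                      # hypothetical columns: text, label
counts = df["label"].value_counts()
minority = counts.idxmin()

balanced = pd.concat([
    df[df["label"] == minority],
    df[df["label"] != minority].sample(counts.min(), random_state=42),
])

# After training a model on `balanced`, check per-class precision/recall/F1:
# print(classification_report(y_true, y_pred))
```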

Observations 4

• Predicting other domains: use the Amazon dataset (works well)
• Trained an LSTM on 1.5M Amazon Movies and TV reviews (other models lose more than 1%):
  • “Digital Music” – 95.82%, “Office Products” – 95.76%, “Video Games” – 94.08%

https://github.com/udsclub/alpha-sentiment-analysis/blob/master/amazonTv/scripts/validation_curves.ipynb


Challenges

• Enrich the dataset with synonyms (sketch below):
  • synonymscrawler, word2vec (the closest vector by cosine distance), WordNet (works badly)
• Transfer learning to other languages:
  • train on English and transfer to other languages with the same characters (works well)

https://github.com/udsclub/alpha-sentiment-analysis/tree/master/Enrichment%20dataset
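A tiny sketch of the synonym-enrichment idea: take a word's nearest neighbours in a word2vec space by cosine similarity as replacement candidates (the vector file path is hypothetical, and the candidates would still need filtering):

```python
# Nearest-neighbour "synonyms" from a word2vec model.
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

print(vectors.most_similar("awful", topn=5))         # candidate replacements
```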

Contact me

• OpenDataScience – @vradchenko

• Facebook – https://www.facebook.com/vitaliyradchenko127

• Email – radchenko.vitaliy.o@gmail.com

• UDS Club – https://github.com/udsclub


Thank you