Dan Sullivan
October 21, 2015
Portland, OR
Text Mining Meets Neural Nets: Mining the Biomedical Literature
*Overview
* Introduction to Natural Language Processing and Text Mining
* Linguistic and Statistical Approaches
* Critiquing Classifier Results
* A New Dawn: Deep Learning
* What’s Next
*My Background
* Enterprise Architect, Big Data and Analytics
* Former Research Scientist, bioinformatics institute
* Completing PhD in Computational Biology with focus on text mining
*Author
*Contact
* [email protected]
* @dsapptech
* Linkedin.com/in/dansullivanpdx
*Introduction to Natural Language Processing and Text Mining
*“Text is unstructured”
*Unstructured?
Manual procedures are time consuming and costly
Volume of literature continues to grow
Commonly used search techniques (keyword search, similarity search, metadata filtering, etc.) can still yield volumes of literature that are difficult to analyze manually
Popular tools have had some success but have limitations
Challenges in Text Analysis
*Dominant Eras in NLP
* Linguistic (from 1960s)
  * Focus on syntax
  * Transformational Grammar
  * Sentence parsing
* Statistical (from 1990s)
  * Focus on words, n-grams, etc.
  * Statistics and Probability
  * Related work in Information Retrieval
  * Topic Modeling and Classification
* Deep Learning (from ~2006)
  * Focus on multi-layered neural nets computing non-linear functions
  * Light on theory, heavy on engineering
  * Multiple NLP tasks
*Symbolic vs Sub-Symbolic
*Linguistic and Statistical Approaches
http://www.slideshare.net/DanSullivan10/text-mining-meets-neural-nets
*Linguistic Approaches
*Linguistic Approaches - Syntax
Image: http://www.nltk.org/book_1ed/ch08.html
*Linguistic Approaches - Semantics
Stephen H. Chen et al. Physiol. Genomics 2005;22:257-267
*Statistical Approaches
*Statistical Approach: Topic Models
* Technique for identifying dominant themes in documents
* Does not require training
* Multiple Algorithms
  * Probabilistic Latent Semantic Indexing (PLSI)
  * Latent Dirichlet Allocation (LDA)
* Assumptions
  * Documents are about a mixture of topics
  * Words used in a document are attributable to a topic
Source: http://www.keepcalm-o-matic.co.uk/p/keep-calm-theres-no-training-today/
Debt, Law, Graduation
Debt, EU, Greece, Euro
Source: http://www.nytimes.com/pages/business/index.html April 27, 2015
EU, Greece, Negotiations, Varoufakis
*Topic Modeling Techniques
* Topics are represented by words; documents are about a set of topics
  * Doc 1: 50% politics, 50% presidential
  * Doc 2: 25% CPU, 30% memory, 45% I/O
  * Doc 3: 30% cholesterol, 40% arteries, 30% heart
* Learning Topics
  * Assign each word to a topic
  * For each word and topic, compute
    * Probability of topic given a document, P(topic|doc)
    * Probability of word given a topic, P(word|topic)
  * Reassign word to a new topic with probability P(topic|doc) * P(word|topic)
  * Reassignment is based on the probability that topic T generated the use of word W
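The reassignment loop above can be sketched in a few lines of Python. This is a toy illustration only, in the spirit of Gibbs sampling for LDA: the function name `learn_topics`, the smoothing constant `smooth`, and the three-document corpus are assumptions made for the example and do not come from the slides.

```python
import random
from collections import Counter

def learn_topics(docs, n_topics, n_iters=50, seed=0, smooth=0.1):
    """Toy topic learner: repeatedly reassign each word to a topic."""
    rng = random.Random(seed)
    # Start by assigning every word occurrence to a random topic.
    assign = [[rng.randrange(n_topics) for _ in doc] for doc in docs]
    vocab = {w for doc in docs for w in doc}
    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                # Count current assignments, leaving out the word being updated.
                doc_topics = Counter(t for j, t in enumerate(assign[d]) if j != i)
                word_topic, topic_total = Counter(), Counter()
                for d2, doc2 in enumerate(docs):
                    for j, w2 in enumerate(doc2):
                        if (d2, j) == (d, i):
                            continue
                        t = assign[d2][j]
                        topic_total[t] += 1
                        if w2 == word:
                            word_topic[t] += 1
                weights = []
                for t in range(n_topics):
                    # Smoothed estimates of P(topic|doc) and P(word|topic).
                    p_topic_doc = (doc_topics[t] + smooth) / (len(doc) - 1 + smooth * n_topics)
                    p_word_topic = (word_topic[t] + smooth) / (topic_total[t] + smooth * len(vocab))
                    weights.append(p_topic_doc * p_word_topic)
                # Reassign with probability proportional to the product.
                assign[d][i] = rng.choices(range(n_topics), weights=weights)[0]
    return assign

# Hypothetical toy corpus, already tokenized.
docs = [
    ["debt", "greece", "euro", "debt"],
    ["cpu", "memory", "cpu", "io"],
    ["greece", "euro", "eu"],
]
assignments = learn_topics(docs, n_topics=2)
```

The counts are recomputed from scratch for every word, which keeps the sketch short but is far too slow for real corpora; practical implementations update the counts incrementally.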
TOPICS
Image Source: David Blei, “Probabilistic Topic Models” http://yosinski.com/mlss12/MLSS-2012-Blei-Probabilistic-Topic-Models/
* 3 Key Components
  * Data
  * Representation scheme
  * Algorithms
* Data
  * Positive examples – examples from a representative corpus
  * Negative examples – randomly selected from the same publications
* Representation
  * TF-IDF
  * Vector space representation
  * Cosine of vectors as the measure of similarity
* Algorithms
  * Supervised learning
  * SVMs
  * Ridge Classifier
  * Perceptrons
  * kNN
  * SGD Classifier
  * Naïve Bayes
  * Random Forest
  * AdaBoost
*Training a Text Classifier
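As one concrete instance of the three components (data, representation, algorithm), the sketch below trains a perceptron, one of the algorithms listed above, on bag-of-words features. It is a minimal illustration: the four training sentences, the +1/-1 label encoding, and the function names are invented for the example, and binary word presence stands in for TF-IDF weights.

```python
def train_perceptron(examples, epochs=10):
    """examples: list of (tokens, label) pairs with label +1 or -1."""
    weights = {}   # one weight per word, created on demand
    bias = 0.0
    for _ in range(epochs):
        for tokens, label in examples:
            score = bias + sum(weights.get(t, 0.0) for t in tokens)
            if label * score <= 0:  # misclassified: nudge weights toward the label
                for t in tokens:
                    weights[t] = weights.get(t, 0.0) + label
                bias += label
    return weights, bias

def predict(tokens, weights, bias):
    score = bias + sum(weights.get(t, 0.0) for t in tokens)
    return 1 if score > 0 else -1

# Hypothetical data: +1 = positive example, -1 = negative example.
examples = [
    (["toxin", "promotes", "colonization"], 1),
    (["data", "were", "transformed"], -1),
    (["protease", "influences", "colonization"], 1),
    (["plasmid", "was", "cloned"], -1),
]
weights, bias = train_perceptron(examples)
```

On this toy data the perceptron separates the four training sentences within a few epochs; real corpora are rarely linearly separable, which is one reason large-margin classifiers such as SVMs are commonly preferred.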
*Text Classification Process
Source: Steven Bird, Ewan Klein, and Edward Loper. Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. http://www.nltk.org/book/
* Term Frequency (TF)
  tf(t,d) = number of occurrences of t in d
  t is a term
  d is a document
* Inverse Document Frequency (IDF)
  idf(t,D) = log(N / |{d in D : t in d}|)
  D is the set of documents
  N is the number of documents
* TF-IDF = tf(t,d) * idf(t,D)
* TF-IDF is
  * large when the term has a high frequency in the document and a low frequency across all documents
  * small when the term appears in many documents
*Representation: TF-IDF
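The definitions above translate directly into code. The sketch below is a plain-Python illustration on a made-up three-sentence corpus; returning 0.0 for a term that appears in no document is an assumption added to avoid division by zero and is not part of the definition.

```python
import math
from collections import Counter

def tf(term, doc):
    """Number of occurrences of term in a tokenized document."""
    return Counter(doc)[term]

def idf(term, docs):
    """log(N / number of documents that contain term)."""
    n_containing = sum(1 for d in docs if term in d)
    # Assumption: a term seen in no document gets idf 0.0.
    return math.log(len(docs) / n_containing) if n_containing else 0.0

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# Hypothetical corpus, lowercased and tokenized on whitespace.
docs = [
    "the espb gene is a known virulence factor".split(),
    "the gene is expressed in the host cell".split(),
    "the data were log transformed".split(),
]

common = tf_idf("the", docs[0], docs)        # appears in every document
rare = tf_idf("virulence", docs[0], docs)    # appears in one document
```

Here `common` is zero because "the" occurs in all three documents, while `rare` equals log(3), showing how the weighting favors terms that are concentrated in few documents.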
One Hot Representation
The        1 0 0 0 0 0 0
EspB       0 1 0 0 0 0 0
gene       0 0 1 0 0 0 0
is         0 0 0 1 0 0 0
a          0 0 0 0 1 0 0
known      0 0 0 0 0 1 0
virulence  0 0 0 0 0 0 1

TF-IDF Representation
            translocates  reduced  levels  of      EspB   host    cell
Sentence 1  0.193         0.2828   0.078   0.0001  0.389  0.0144  0.011
Sentence 2  0             0.0091   0.0621  0       0      0       0
Sentence 3  0             0        0       0       0.028  0.0113  0
Sentence 4  0.021         0        0       0       0      0       0
*Sparse Representations
* Bag of words model
* Ignores structure (syntax) and meaning (semantics) of sentences
* Representation vector length is the size of set of unique words in corpus
* Stemming used to remove morphological differences
* Each word is assigned an index in the representation vector, V
* The value V[i] is non-zero if word appears in sentence represented by vector
* The non-zero value is a function of the frequency of the word in the sentence and the frequency of the term in the corpus
*Representation: Vector Space
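A minimal sketch of the bag-of-words vector space follows: each unique word gets an index, each sentence becomes a vector of counts, and cosine measures similarity. The helper names and the three toy sentences are assumptions for illustration; raw counts are used in place of TF-IDF weights, and stemming is omitted.

```python
import math

def build_index(sentences):
    """Assign each unique word an index in the representation vector."""
    vocab = sorted({w for s in sentences for w in s})
    return {w: i for i, w in enumerate(vocab)}

def to_vector(sentence, index):
    """V[i] is the count of word i in the sentence (zero if absent)."""
    v = [0] * len(index)
    for w in sentence:
        if w in index:
            v[index[w]] += 1
    return v

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical tokenized sentences.
sentences = [
    ["gene", "encodes", "toxin"],
    ["toxin", "gene", "encodes"],      # same words, different order
    ["data", "were", "transformed"],   # no words in common
]
index = build_index(sentences)
vectors = [to_vector(s, index) for s in sentences]
same_words = cosine(vectors[0], vectors[1])
no_overlap = cosine(vectors[0], vectors[2])
```

The first two sentences get a cosine of approximately 1 even though their word order differs, which is exactly the loss of syntax the bag-of-words model accepts; the third shares no words with the first, so its cosine is 0.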
Support Vector Machine (SVM) is a large-margin classifier
Commonly used in text classification
Initial results based on life sciences sentence classifier
Image Source: http://en.wikipedia.org/wiki/File:Svm_max_sep_hyperplane_with_margin.png
*Classification Algorithms
*Critiquing Classifier Results
Non-VF, Predicted VF:
“Collectively, these data suggest that EPEC 30-5-1(3) translocates reduced levels of EspB into the host cell.”
“Data were log-transformed to correct for heterogeneity of the variances where necessary.”
“Subsequently, the kanamycin resistance cassette from pVK4 was cloned into the PstI site of pMP3, and the resulting plasmid pMP4 was used to target a disruption in the cesF region of EHEC strain 85-170.”
VF, Predicted Non-VF:
“Here, it is reported that the pO157-encoded Type V-secreted serine protease EspP influences the intestinal colonization of calves.”
“Here, we report that intragastric inoculation of a Shiga toxin 2 (Stx2)-producing E. coli O157:H7 clinical isolate into infant rabbits led to severe diarrhea and intestinal inflammation but no signs of HUS.”
“The DsbLI system also comprises a functional redox pair”
Virulence Factor (VF) - Misclassification Examples
Adding more examples is unlikely to improve results substantially, as the error curve shows
Preliminary Results - Training Error
[Figure: Training Error and Validation Error curves, panel "All"; horizontal axis 0 to 10,000, vertical axis 0 to 0.5]
8 Alternative Algorithms
Select the 10,000 most important features using chi-square
Alternative Supervised Learning Algorithms
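Chi-square feature selection scores each term by how strongly its presence is associated with the class label and keeps the top-scoring terms. The sketch below computes the standard statistic for a 2x2 table (term present or absent vs. class 1 or 0); the four toy documents, the 0/1 labels, and the function names are hypothetical, and k is 2 here rather than the 10,000 mentioned above.

```python
def chi_square(docs, labels, term):
    """Chi-square statistic for term presence vs. binary class label."""
    n11 = n10 = n01 = n00 = 0
    for doc, label in zip(docs, labels):
        present = term in doc
        if present and label == 1:
            n11 += 1            # term present, class 1
        elif present:
            n10 += 1            # term present, class 0
        elif label == 1:
            n01 += 1            # term absent, class 1
        else:
            n00 += 1            # term absent, class 0
    n = n11 + n10 + n01 + n00
    denom = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    return n * (n11 * n00 - n10 * n01) ** 2 / denom if denom else 0.0

def select_features(docs, labels, k):
    """Return the k terms with the highest chi-square score (ties broken alphabetically)."""
    vocab = {t for d in docs for t in d}
    ranked = sorted(vocab, key=lambda t: (-chi_square(docs, labels, t), t))
    return ranked[:k]

# Hypothetical documents as sets of terms, with 1 = VF and 0 = non-VF.
docs = [{"toxin", "gene"}, {"toxin", "host"}, {"data", "gene"}, {"data", "host"}]
labels = [1, 1, 0, 0]
top_terms = select_features(docs, labels, k=2)
```

In this toy set "toxin" and "data" perfectly track the label and are selected, while "gene" and "host" are evenly split across classes and score zero.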
* Increase quantity of data (not always helpful; see error curves)
* Improve quality of data
* Utilize multiple supervised algorithms, ensemble and non-ensemble
* Use unlabeled data and semi-supervised techniques
* Feature Selection
* Parameter Tuning
* Feature Engineering
* Given:
  * High quality data in sufficient quantity
  * State of the art machine learning algorithms
* How to improve results: Change Representation?
*Improving Quality
* TF-IDF
  * Loss of syntactic and semantic information
* No relation between term index and meaning
* No support for disambiguation
* Feature engineering extends the vector representation or substitutes more general terms for specific ones – a crude way to capture semantic properties
*Representation Schemes
Ideal Representation
◦ Captures semantic similarity of words
◦ Does not require feature engineering
◦ Minimal pre-processing, e.g. no mapping to ontologies
◦ Improves precision and recall
*A New Dawn: Deep Learning
*Word Embeddings
*Dense vector representation (n = 50 … 300 or more)
*Capture semantics – similar words close by cosine measure
*Captures language features
  * Syntactic relations
  * Semantic relations
*Dense Word Representation
[0.160610 -0.547976 -0.444522 -0.037896 0.044305 0.245423 -0.261498 0.000294 -0.275621 -0.021201 -0.432955 0.388905 0.106494 0.405797 -0.159357 -0.073897 0.177182 0.043535 0.600987 0.064762 -0.348964 0.189289 0.650318 0.112554 0.374456 -0.227780 0.208623 0.065362 0.235401 -0.118003 0.032858 -0.309767 0.024085 -0.055148 0.158807 0.171749 -0.153825 0.090301 0.033275 0.089936 0.187864 -0.044472 0.421533 0.209217 -0.142092 0.153070 -0.168291 -0.052823 -0.090984 0.018695 -0.265503 -0.055572 -0.212252 -0.326411 -0.083590 -0.009575 -0.125065 0.376738 0.059734 -0.005585 -0.085654 0.111499 -0.099688 0.147020 -0.419087 -0.042069 -0.241274 0.154339 -0.008625 -0.298928 0.060612 0.216670 -0.080013 -0.218985 -0.805539 0.298797 0.089364 0.071044 0.390878 0.167600 -0.101478 -0.017312 -0.260500 0.392749 0.184021 -0.258466 -0.222133 0.357018 -0.244508 0.221385 -0.012634 -0.073752 -0.409362 0.113296 0.048397 0.000424 0.146018 -0.060891 -0.139045 -0.180432 0.014984 0.023384 -0.032300 -0.161608 -0.188434 0.018036 0.023236 0.060335 -0.173066 0.053327 0.523037 -0.330135 -0.014888 -0.124564 0.046332 -0.124301 0.029865 0.144504 0.163142 -0.018653 -0.140519 0.060562 0.098858 -0.128970 0.762193 -0.230067 -0.226374 0.100086 0.367147 0.160035 0.148644 -0.087583 0.248333 -0.033163 -0.312134 0.162414 0.047267 0.383573 -0.271765 -0.019852 -0.033213 0.340789 0.151498 -0.195642 -0.105429 -0.172337 0.115681 0.033890 -0.026444 -0.048083 -0.039565 -0.159685 -0.211830 0.191293 0.049531 -0.008248 0.119094 0.091608 -0.077601 -0.050206 0.147080 -0.217278 -0.039298 -0.303386 0.543094 -0.198962 -0.122825 -0.135449 0.190148 0.262060 0.146498 -0.236863 0.140620 0.128250 -0.157921 -0.119241 0.059280 -0.003679 0.091986 0.105117 0.117597 -0.187521 -0.388895 0.166485 0.149918 0.066284 0.210502 0.484910 0.396106 -0.118060 -0.076609 -0.326138 -0.305618 -0.297695 -0.078404 -0.210814 0.423335 -0.377239 -0.323599 0.282586]
immune_system
* Large volume of data
  * Billions of words in context
  * Multiple passes over data
* Algorithms
  * Word2Vec
    * CBOW
    * Skip-gram
  * GloVe
* Linguistic terms with similar distributions have similar meaning
*Learning Word Representation
T. Mikolov, et al. “Efficient Estimation of Word Representations in Vector Space.” 2013. http://arxiv.org/pdf/1301.3781.pdf
*Skip-gram predicts surrounding words
Image: https://drive.google.com/file/d/0B7XkCwpI5KDYRWRnd1RzWXQ2TWc
*CBOW predicts current word
Image: https://drive.google.com/file/d/0B7XkCwpI5KDYRWRnd1RzWXQ2TWc
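The difference between the two objectives is easiest to see in the training examples each one extracts from a sentence. The sketch below only generates those examples and does not train a model; the function names, the four-word sentence, and the window size of 1 are illustrative assumptions.

```python
def skipgram_pairs(tokens, window=2):
    """(center, context) pairs: the center word predicts each neighbor."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        pairs.extend((center, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs

def cbow_examples(tokens, window=2):
    """(context words, center) examples: the neighbors predict the center word."""
    examples = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, center))
    return examples

# Hypothetical tokenized sentence.
tokens = ["malaria", "parasite", "infects", "cells"]
sg = skipgram_pairs(tokens, window=1)
cb = cbow_examples(tokens, window=1)
```

With a window of 1, skip-gram yields six single-word predictions such as ("parasite", "infects"), while CBOW yields four examples, one per position, such as (["malaria", "infects"], "parasite").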
*Word Similarity - Malaria
*Word Similarity: Alanine (Amino Acid)
*Word Similarity: Leukocyte
*Word Similarity: Shigella
*Analogy I (correct)
Heart : Cardiovascular as Kidney:
*Analogy II (near miss)
Salmonella : Proteobacteria as Staphylococcus :
*Analogy III (miss)
Salmonella : Enterobacteriaceae as Staphylococcus :
Staphylococcaceae
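Analogy queries of the form "a : b as c : ?" are commonly answered with vector arithmetic: compute b - a + c and return the nearest remaining word by cosine. The sketch below uses hand-made 3-dimensional vectors over an invented vocabulary purely to show the mechanics; it does not reproduce the embeddings or the results shown on these slides.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def analogy(a, b, c, vectors):
    """Return the word whose vector is closest to vec(b) - vec(a) + vec(c)."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # Exclude the three query words from the candidates.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

# Hypothetical toy embeddings (real ones have 50 to 300 or more dimensions).
vectors = {
    "liver":     [1.0, 0.0, 0.0],
    "hepatic":   [1.0, 0.0, 1.0],
    "lung":      [0.0, 1.0, 0.0],
    "pulmonary": [0.0, 1.0, 1.0],
    "malaria":   [0.2, 0.1, -1.0],
}
answer = analogy("liver", "hepatic", "lung", vectors)
```

With these toy vectors the query returns "pulmonary". On real embeddings the nearest neighbor can be a related but wrong term, which is what the near-miss and miss examples above illustrate.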
*Quick Intro to Neural Networks
*Feed forward neural network
Image: http://u.cs.biu.ac.il/~yogo/nnlp.pdf
*Calculating with Neural Nets
https://en.wikibooks.org/wiki/Artificial_Neural_Networks/Activation_Functions
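Calculating with a feed-forward network amounts to repeated weighted sums followed by a non-linear function. The sketch below runs one forward pass through a single tanh hidden layer and a linear output layer; the layer sizes and weight values are arbitrary numbers chosen for illustration, not taken from the slides.

```python
import math

def forward(x, W1, b1, W2, b2):
    """Forward pass: tanh hidden layer, then a linear output layer."""
    # Each hidden unit: weighted sum of the inputs plus a bias, passed through tanh.
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W1, b1)]
    # Each output unit: weighted sum of the hidden activations plus a bias.
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(W2, b2)]

# Hypothetical network: 2 inputs -> 2 hidden units -> 1 output.
x = [1.0, 2.0]
W1 = [[0.5, -0.25],
      [0.1, 0.2]]
b1 = [0.0, 0.0]
W2 = [[1.0, 2.0]]
b2 = [0.1]
output = forward(x, W1, b1, W2, b2)
```

The first hidden unit receives 0.5*1.0 - 0.25*2.0 = 0 and so outputs tanh(0) = 0; the second receives 0.5, so the output is 2*tanh(0.5) + 0.1.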
*Key Characteristics
* Non-linear Activation Function
  * Sigmoid
  * Hyperbolic tangent (tanh)
  * Rectifier (ReLU)
* Word embeddings
* Window size
* Loss function
  * Binary
  * Multiclass
  * Cross-entropy
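The activation functions and the cross-entropy loss named above are short enough to write out. The sketch below uses the standard definitions; pairing cross-entropy with a softmax over raw scores is a common choice for multiclass output and is assumed here rather than stated on the slide.

```python
import math

def sigmoid(z):
    """Squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    """Rectifier: passes positive values, clips negatives to zero."""
    return max(0.0, z)

def softmax(scores):
    """Turns raw class scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, true_index):
    """Negative log-probability assigned to the correct class."""
    return -math.log(probs[true_index])

probs = softmax([0.0, 0.0])          # two equally scored classes
loss = cross_entropy(probs, 0)       # log(2) for a 50/50 guess
```

The hyperbolic tangent is available directly as `math.tanh`. A confident correct prediction drives the loss toward 0, while a 50/50 guess between two classes costs log(2).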
*Training a Neural Network – Stochastic Gradient Descent
Images: http://u.cs.biu.ac.il/~yogo/nnlp.pdf; http://blog.datumbox.com/tuning-the-learning-rate-in-gradient-descent/
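Stochastic gradient descent updates the parameters after each training example, stepping against the gradient of the loss on that example. To keep the mechanics visible, the sketch below fits a one-variable linear model instead of a neural network; the data, learning rate, and epoch count are assumptions chosen so that the toy problem converges.

```python
import random

def sgd_linear(data, lr=0.05, epochs=500, seed=0):
    """Fit y ~ w*x + b by stochastic gradient descent on squared error."""
    rng = random.Random(seed)
    data = list(data)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(data)                 # visit examples in random order
        for x, y in data:
            err = (w * x + b) - y         # prediction error on one example
            w -= lr * err * x             # gradient of 0.5*err**2 w.r.t. w
            b -= lr * err                 # gradient of 0.5*err**2 w.r.t. b
    return w, b

# Hypothetical noise-free data generated from y = 2x + 1.
data = [(x, 2.0 * x + 1.0) for x in (0.0, 1.0, 2.0, 3.0)]
w, b = sgd_linear(data)
```

The learning rate matters: too large and the updates overshoot and diverge, too small and training crawls. The same update rule applies to a neural network, with the gradients supplied by backpropagation.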
*Convolutional Neural Network for Text
Image: https://aclweb.org/anthology/P/P14/P14-2105.xhtml
*Sentence Classification with Convolutional Networks
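The core operation in a convolutional sentence classifier of the kind described by Kim (2014) is a filter that slides over windows of consecutive word vectors, followed by max-over-time pooling. The sketch below shows a single filter; the 2-dimensional embeddings and filter weights are made up, and the function assumes the sentence has at least as many words as the filter is tall.

```python
def conv_max_pool(embeddings, filt, bias=0.0):
    """Apply one filter to every window of word vectors, ReLU, then max-pool."""
    h = len(filt)  # filter height = number of consecutive words covered
    features = []
    for i in range(len(embeddings) - h + 1):
        window = embeddings[i:i + h]
        # Element-wise product of filter and window, summed to one number.
        z = bias + sum(f * x
                       for frow, xrow in zip(filt, window)
                       for f, x in zip(frow, xrow))
        features.append(max(0.0, z))      # ReLU non-linearity
    return max(features)                  # max-over-time pooling

# Hypothetical sentence of four words with 2-dimensional embeddings.
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
filt = [[1.0, 0.0],
        [0.0, 1.0]]                       # covers two words at a time
feature = conv_max_pool(embeddings, filt)
```

Each filter contributes one number per sentence regardless of sentence length; a full model uses many filters of several heights and feeds the pooled features to a final classification layer.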
*What’s Next?
*Survey n-dimensional Word Embedding Space
Image: http://greg.org/archive/2010/07/05/the_planck_all-sky_survey.html
*Formalize a Mathematical Model of Semantics
http://riotwire.com/column/immigrants-socialists-and-semantics-oh-my/
*Tools and References
*Word Embedding Tools
* Word2Vec – command line tool
* Gensim – Python topic modeling tool with word2vec module
* GloVe (Global Vectors for Word Representation) – command line tool
*Deep Learning Tools
* Theano: Python CPU/GPU symbolic expression compiler
* Torch: Scientific framework for LuaJIT
* PyLearn2: Python deep learning platform
* Lasagne: lightweight framework on Theano
* Keras: Python library for working with Theano
* DeepDist: Deep Learning on Spark
* Deeplearning4J: Java and Scala, integrated with Hadoop and Spark
*References
*Deep Learning Bibliography - http://memkite.com/deep-learning-bibliography/
* Deep Learning Reading List – http://deeplearning.net/reading-list/
*Kim, Yoon. "Convolutional neural networks for sentence classification." arXiv preprint arXiv:1408.5882 (2014).
* Goldberg, Yoav. “A Primer on Neural Network Models for Natural Language Processing” http://u.cs.biu.ac.il/~yogo/nnlp.pdf
*Q & A