Date posted: 21-Jan-2018
Category: Technology
Uploaded by: rebecca-bilbro
Rebecca Bilbro
Lead Data Scientist at ByteCubed
Faculty at Georgetown Univ.
Partner at District Data Labs
@rebeccabilbro
tl;dr
● Text is the next frontier in big data.
● Language-aware data products are:
  ○ Not academia, but informed by it.
  ○ Not automagic, just feel that way.
● Machine learning is flexible; rules are not.
● Text comes with some unique requirements.
● Facilitate iteration with the model selection triple.
● Deployment is an opportunity to ingest more data.
● Pipelines are necessary for production.
“Two roads diverged in a yellow wood,
And sorry I could not travel both
And be one traveler, long I stood
And looked down one as far as I could
To where it bent in the undergrowth”
Natural Language Understanding (AI)
Models for semantic understanding, reasoning, and generation of natural languages for human-computer interaction.
vs.
Computational Linguistics (NLP)
Approaches to demonstrate how humans interpret and understand language and show how languages evolve.
negative
angry, bad, contempt, deceive, evil, fake, grim, hoarder, ignorant, joke,
kaput, lies, measly, nasty, obscure, pointless, quit, rampant, stupid, trivial,
unclean, venomous, weak, yell, zealot
positive
awesome, best, cool, dazzle, easy, friendly,
golden, happy, improve, joy, keen, lucky, marvel,
normal, original, peerless, quick, remedy, super,
tidy, upbeat, vivid, warm, yay, zenith
“It sucks I didn't take pictures of the food I ordered here because I really wanted to show it off.
The restaurant isn't the biggest. It's pretty small. I had people constantly run into my bag that I hung on the edge of my chair. Quite annoying honestly but it's my bad for carrying such a large bag.
It didn't take long for the food to come out. I've been disappointed with one of New York's best rated brunch spots that I waited 2+ hours for before so I decided not to have any expectations for this place at all. However, the food here actually tastes great.”
- 9/6/2017 Yelp Review
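A minimal sketch of the rule-based approach, assuming the two word lists above are used as a sentiment lexicon and a review is scored by counting hits. The function name `lexicon_hits` and the scoring rule (positive hits minus negative hits) are illustrative assumptions, not something the slides specify. It is run on three sentences taken verbatim from the review.

```python
import re

# Word lists copied from the slides; a real lexicon would be far larger.
NEGATIVE = {
    "angry", "bad", "contempt", "deceive", "evil", "fake", "grim", "hoarder",
    "ignorant", "joke", "kaput", "lies", "measly", "nasty", "obscure",
    "pointless", "quit", "rampant", "stupid", "trivial", "unclean",
    "venomous", "weak", "yell", "zealot",
}
POSITIVE = {
    "awesome", "best", "cool", "dazzle", "easy", "friendly", "golden",
    "happy", "improve", "joy", "keen", "lucky", "marvel", "normal",
    "original", "peerless", "quick", "remedy", "super", "tidy", "upbeat",
    "vivid", "warm", "yay", "zenith",
}


def lexicon_hits(text):
    """Return (positive_hits, negative_hits) found in text, in order.

    Hypothetical helper: lowercases, splits on non-letters, and looks
    each token up in the two word lists.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    pos = [t for t in tokens if t in POSITIVE]
    neg = [t for t in tokens if t in NEGATIVE]
    return pos, neg


# Three sentences quoted from the Yelp review above.
review = (
    "Quite annoying honestly but it's my bad for carrying such a large bag. "
    "I've been disappointed with one of New York's best rated brunch spots "
    "that I waited 2+ hours for before so I decided not to have any "
    "expectations for this place at all. "
    "However, the food here actually tastes great."
)

pos, neg = lexicon_hits(review)
score = len(pos) - len(neg)  # assumed scoring rule: hits cancel out
print(pos, neg, score)       # ['best'] ['bad'] 0
```

The only matches are "best", which describes a different restaurant, and "bad", which is part of an idiom. "Annoying", "disappointed", and "great" are not in the lists at all, so the fixed rules score a favorable review as neutral.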
Sample Sentiment Analysis Pipeline
Training: Training Data (Historic Reviews) + Training Labels (# Stars) → Feature Vectors → Classification Algorithm → Predictive Model
Prediction: New Data (New Review) → Feature Vector → Predictive Model → Predicted Label (# Stars)
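The pipeline above might be sketched with scikit-learn as follows. The four training reviews and their star ratings are toy data invented for illustration, and TF-IDF features with a multinomial naive Bayes classifier are assumed stand-ins for whatever vectorizer and algorithm a real project would select.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy training set: historic reviews and their star ratings.
train_reviews = [
    "The food tastes great and the staff were friendly.",
    "Awful service and the food was cold.",
    "Great brunch, friendly staff, would come back.",
    "Cold food, rude staff, never again.",
]
train_stars = [5, 1, 5, 1]

# Training data + labels -> feature vectors -> classification algorithm.
model = Pipeline([
    ("vectorize", TfidfVectorizer()),  # reviews -> feature vectors
    ("classify", MultinomialNB()),     # feature vectors -> fitted model
])
model.fit(train_reviews, train_stars)

# New review -> feature vector -> predictive model -> predicted stars.
predicted = model.predict(["However, the food here actually tastes great."])
print(predicted)
```

Because the fitted vectorizer lives inside the pipeline, the new review is transformed with the same vocabulary that was learned during training.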
Bag-of-words vector (vocabulary term, count):
at 0 · bat 2 · can 1 · door 0 · echolocation 1 · elephant 0 · of 0 · open 0 · potato 0 · see 2 · she 0 · sight 1 · sneeze 1 · studio 0 · the 1 · to 0 · via 1 · wonder 0
The elephant sneezed at the sight of potatoes.
Bats can see via echolocation. See the bat sight sneeze!
Wondering, she opened the door to the studio.
Bag-of-words · One-hot encoding · TFIDF · Distributed representation
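A minimal sketch of the first three encodings in plain Python, using the three example sentences. The `LEMMAS` lookup is a hypothetical stand-in for a real lemmatizer and covers only the inflected forms in this corpus. The TF-IDF variant shown (raw count times log(N / document frequency)) is one common formulation chosen here as an assumption. Distributed representations need a trained embedding model and are not covered.

```python
import math
import re
from collections import Counter

corpus = [
    "The elephant sneezed at the sight of potatoes.",
    "Bats can see via echolocation. See the bat sight sneeze!",
    "Wondering, she opened the door to the studio.",
]

# Hypothetical stand-in for a lemmatizer: maps only the inflected
# forms that occur in this three-sentence corpus.
LEMMAS = {
    "bats": "bat", "potatoes": "potato", "sneezed": "sneeze",
    "opened": "open", "wondering": "wonder",
}


def tokenize(text):
    """Lowercase, split on non-letters, and apply the lemma lookup."""
    return [LEMMAS.get(t, t) for t in re.findall(r"[a-z]+", text.lower())]


docs = [tokenize(text) for text in corpus]
vocab = sorted({term for doc in docs for term in doc})


def count_vector(doc):
    """Bag-of-words: how many times each vocabulary term occurs."""
    counts = Counter(doc)
    return [counts[term] for term in vocab]


def one_hot_vector(doc):
    """One-hot: 1 if the term occurs at all, else 0."""
    present = set(doc)
    return [1 if term in present else 0 for term in vocab]


def tfidf_vector(doc):
    """TF-IDF (assumed variant): count * log(N / document frequency)."""
    counts = Counter(doc)
    vector = []
    for term in vocab:
        df = sum(1 for other in docs if term in other)
        vector.append(counts[term] * math.log(len(docs) / df))
    return vector


bow = count_vector(docs[1])       # second sentence
onehot = one_hot_vector(docs[1])
tfidf = tfidf_vector(docs[1])
print(dict(zip(vocab, bow)))
```

For the second sentence, "bat" and "see" get a count of 2 in the bag-of-words vector but 1 in the one-hot vector. "the" appears in all three sentences, so its TF-IDF weight is 0 under this formulation.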
Vectorization
Feature Analysis
Algorithm Selection
Hyperparameter Tuning
The Model Selection Triple
Arun Kumar http://bit.ly/2abVNrI
Data Management Layer (diagram labels): Raw Data · Feature Engineering · Algorithm Selection · Hyperparameter Tuning · Model Selection Triples · Instance Database · Model Storage · Model Family · Model Form
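One way to iterate over model selection triples is a grid search in which each candidate combines a feature representation, an algorithm, and a hyperparameter setting. The sketch below assumes scikit-learn. The eight documents and their labels are toy data invented for illustration, and the two vectorizers, two classifiers, and parameter values are arbitrary choices.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: four positive and four negative snippets.
docs = [
    "great food and friendly staff", "awful food and rude staff",
    "friendly service, great brunch", "rude service, awful brunch",
    "the staff were great", "the staff were awful",
    "friendly place, would return", "rude place, never again",
]
labels = ["pos", "neg", "pos", "neg", "pos", "neg", "pos", "neg"]

# Placeholder steps; the grid below swaps in the real candidates.
pipeline = Pipeline([("vect", CountVectorizer()), ("clf", MultinomialNB())])

# Each dict spans (features, algorithm, hyperparameters) triples.
param_grid = [
    {
        "vect": [CountVectorizer(), TfidfVectorizer()],
        "clf": [MultinomialNB()],
        "clf__alpha": [0.1, 1.0],
    },
    {
        "vect": [CountVectorizer(), TfidfVectorizer()],
        "clf": [LogisticRegression(max_iter=1000)],
        "clf__C": [0.1, 1.0],
    },
]

search = GridSearchCV(pipeline, param_grid, cv=2)
search.fit(docs, labels)
print(search.best_params_, search.best_score_)
```

The search evaluates eight triples (2 vectorizers × 2 algorithms × 2 hyperparameter values) and keeps the per-candidate scores in `search.cv_results_`, which makes it easy to compare candidates between iterations.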
Partisan Discourse: Architecture
Diagram labels: Initial Model · Debate Transcripts · Submit URL · Preprocessing · Feature Extraction · Evaluate Model · Fit Model · Model Storage · Model Monitoring · Corpus Storage · Corpus Monitoring · Classification · Feedback · Model Selection (start here)
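The feedback loop in this architecture, where deployment becomes a way to ingest more labeled data, could be sketched roughly as below. Everything here is a hypothetical simplification: `WordVoteClassifier` is a toy stand-in for the real model, the in-memory lists stand in for corpus storage, and the documents and the `party_a` / `party_b` labels are invented.

```python
from collections import Counter, defaultdict


class WordVoteClassifier:
    """Hypothetical stand-in model: each known word votes for the
    labels it was seen with during fitting."""

    def fit(self, docs, labels):
        self.word_counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            for word in doc.lower().split():
                self.word_counts[word][label] += 1
        # Fall back to the most common label when no word is known.
        self.default = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, doc):
        votes = Counter()
        for word in doc.lower().split():
            votes.update(self.word_counts.get(word, {}))
        return votes.most_common(1)[0][0] if votes else self.default


# In-memory stand-in for corpus storage, seeded with an initial corpus.
corpus_docs = ["cut taxes and shrink government",
               "expand healthcare and raise wages"]
corpus_labels = ["party_a", "party_b"]
model = WordVoteClassifier().fit(corpus_docs, corpus_labels)


def classify_with_feedback(doc, user_label=None):
    """Classify doc; if the user supplies a label, ingest it and refit."""
    predicted = model.predict(doc)
    if user_label is not None:
        corpus_docs.append(doc)
        corpus_labels.append(user_label)
        model.fit(corpus_docs, corpus_labels)
    return predicted


first = classify_with_feedback("protect union jobs", user_label="party_b")
second = classify_with_feedback("union jobs matter")
print(first, second, len(corpus_docs))
```

The first document contains no known words, so the model can only fall back to its default. Once the user's label is ingested and the model is refit, a later document that shares those words is classified from the new evidence.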
Data Loader → Text Normalization → Text Vectorization → Feature Decomposition → Estimator
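These stages map naturally onto a scikit-learn `Pipeline`. In the sketch below, `load_data` and `TextNormalizer` are hypothetical helpers written for illustration, the documents are toy data, and TF-IDF, truncated SVD, and logistic regression are assumed choices for the vectorization, decomposition, and estimator stages.

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


def load_data():
    """Hypothetical data loader returning toy documents and labels."""
    docs = [
        "Great food and FRIENDLY staff",
        "Awful food   and rude staff",
        "Friendly service, great brunch",
        "Rude service, awful brunch",
        "The staff were great",
        "The staff were awful",
    ]
    labels = ["pos", "neg", "pos", "neg", "pos", "neg"]
    return docs, labels


class TextNormalizer(BaseEstimator, TransformerMixin):
    """Hypothetical normalizer: lowercases and collapses whitespace."""

    def fit(self, X, y=None):
        return self  # stateless, nothing to learn

    def transform(self, X):
        return [" ".join(doc.lower().split()) for doc in X]


model = Pipeline([
    ("normalize", TextNormalizer()),
    ("vectorize", TfidfVectorizer()),
    ("decompose", TruncatedSVD(n_components=2, random_state=0)),
    ("estimate", LogisticRegression()),
])

docs, labels = load_data()
model.fit(docs, labels)
print(model.predict(["friendly staff and great brunch"]))
```

Wrapping every stage in one object means the same normalization, vocabulary, and decomposition are applied at training time and at prediction time, which is the main reason pipelines matter in production.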
Feature Union Pipeline (diagram labels): Data Loader · Text Normalization · Text Extraction · Document Features · Summary Vectorization · Article Vectorization · Concept Features · Metadata Features · Dict Vectorizer · Estimator
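A feature union runs several feature extractors in parallel and concatenates their outputs before the estimator. The sketch below assumes scikit-learn and covers only three of the branches named in the diagram (summary text, article text, and metadata). The record fields, the `FieldSelector` helper, and the toy records are hypothetical.

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline


class FieldSelector(BaseEstimator, TransformerMixin):
    """Hypothetical helper: pulls one field out of each record dict."""

    def __init__(self, field):
        self.field = field

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [record[self.field] for record in X]


# Hypothetical toy records with a summary, an article body, and metadata.
records = [
    {"summary": "tax cuts proposed", "article": "the senator proposed tax cuts today",
     "metadata": {"source": "wire", "word_count": 6}},
    {"summary": "team wins final", "article": "the home team won the final match",
     "metadata": {"source": "blog", "word_count": 7}},
    {"summary": "budget vote delayed", "article": "the vote on the budget was delayed",
     "metadata": {"source": "wire", "word_count": 7}},
    {"summary": "coach praises squad", "article": "the coach praised the squad after the game",
     "metadata": {"source": "blog", "word_count": 8}},
]
labels = ["politics", "sports", "politics", "sports"]

features = FeatureUnion([
    ("summary", Pipeline([("select", FieldSelector("summary")),
                          ("tfidf", TfidfVectorizer())])),
    ("article", Pipeline([("select", FieldSelector("article")),
                          ("tfidf", TfidfVectorizer())])),
    ("metadata", Pipeline([("select", FieldSelector("metadata")),
                           ("dict", DictVectorizer())])),
])

model = Pipeline([("features", features), ("estimator", LogisticRegression())])
model.fit(records, labels)
print(model.predict(records[:1]))
```

Each branch keeps its own vocabulary, so a word in the summary and the same word in the article body become separate features, while the `DictVectorizer` branch one-hot encodes string metadata and passes numeric values through.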
tl;dr
● Text is the next frontier in big data.
● Language-aware data products are:
  ○ Not academia, but informed by it.
  ○ Not automagic, just feel that way.
● Machine learning is flexible; rules are not.
● Text comes with some unique requirements.
● Facilitate iteration with the model selection triple.
● Deployment is an opportunity to ingest more data.
● Pipelines are necessary for production.
Everyday NLP Applications
• Summarization
• Reference Resolution
• Machine Translation
• Language Generation
• Language Understanding
• Document Classification
• Author Identification
• Part of Speech Tagging
• Question Answering
• Information Extraction
• Information Retrieval
• Speech Recognition
• Sense Disambiguation
• Topic Recognition
• Relationship Detection
• Named Entity Recognition