Date post: | 23-Jan-2018 |
Category: |
Data & Analytics |
Upload: | olga-scrivner |
Visual Analytics for Linguistics - Day 3
Olga Scrivner
Course Info
Charts
Text Visualization
ITMS
Preprocessing Data
Data Visualization
Cluster Analysis
Topic Modeling
Google Books API
What You Will Learn
DAY 1 Introduction to Visual Analytics
DAY 2 Visualization Methods, Design, and Tools
DAY 3 Working with Unstructured Data
DAY 4 Working with Structured Data
DAY 5 Advanced Analytics
Our Materials - Web Site
http://obscrivn.wixsite.com/visualization
What We Need
- Interactive Text Mining Suite
- Voyant
- R and RStudio
- R libraries: ggplot2, plotly, reshape2
Quiz: Which Chart Are You?
https://www.sisense.com/blog/quiz-chart/
Creating a Bar Chart
Bar height can represent one of two things:
- The value of a column in the data set. This is done with stat="identity", which leaves the y values unchanged.
- The count of cases for each group, where each x value represents one group.
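The first case (bar height taken from a column) can be sketched as follows. This is a minimal illustration that assumes ggplot2 is installed; the data frame `word_freq` and its values are made up for the example.

```r
library(ggplot2)

# Hypothetical data: one row per word, with a precomputed frequency.
word_freq <- data.frame(
  word = c("language", "corpus", "syntax"),
  freq = c(12, 7, 4)
)

# stat = "identity" tells geom_bar() to use the y values as given
# instead of counting rows.
p_values <- ggplot(word_freq, aes(x = word, y = freq)) +
  geom_bar(stat = "identity")

print(p_values)
```

In ggplot2, `geom_col()` is a shorthand for `geom_bar(stat = "identity")`.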
Creating a Bar Chart - Sample
Creating a Bar Chart - Values
Creating a Bar Chart - Counts
To get a bar graph of counts, we do not map a variable to y, and we use stat="count".
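A minimal sketch of a count-based bar chart, assuming ggplot2 is installed. The `tokens` data frame and its part-of-speech labels are hypothetical.

```r
library(ggplot2)

# Hypothetical data: one row per token, tagged with its part of speech.
tokens <- data.frame(
  pos = c("noun", "verb", "noun", "adjective", "noun", "verb")
)

# Only x is mapped; stat = "count" (the geom_bar() default) counts
# the rows in each group and uses that count as the bar height.
p_counts <- ggplot(tokens, aes(x = pos)) +
  geom_bar(stat = "count")

print(p_counts)

# The same counts as a plain table, for comparison.
pos_counts <- table(tokens$pos)
print(pos_counts)
```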
Title
Creating Line Chart
Creating Area Chart
http://www.r-graph-gallery.com/136-stacked-area-chart/
Creating Scatter Plot
http://www.r-graph-gallery.com/272-basic-scatterplot-with-ggplot2/
Creating Bubble Plot
https://plot.ly/r/bubble-charts/
Creating Heatmap
http://www.r-graph-gallery.com/215-interactive-heatmap-with-plotly/
Creating Word Cloud
Word Cloud - Contest - 10 min
- Create your own word cloud
- Look at the function: type ?wordcloud2 and run
- Can you change the shape of your cloud?
- Save it (or take a screenshot) and post it on Twitter, Facebook, etc.
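One possible starting point for the exercise is sketched below. The input text is hypothetical, and the cloud is drawn only if the wordcloud2 package is installed. The `shape` argument is one of the options listed under ?wordcloud2.

```r
# Hypothetical input text; replace it with your own.
text <- "words words words cloud cloud shape text mining text words"

# Count how often each word occurs, most frequent first.
counts <- sort(table(strsplit(text, " ")[[1]]), decreasing = TRUE)
freq <- data.frame(
  word = names(counts),
  freq = as.integer(counts),
  stringsAsFactors = FALSE
)
print(freq)

# wordcloud2() expects a data frame with words in the first column and
# frequencies in the second; shape = "star" replaces the default circle.
if (requireNamespace("wordcloud2", quietly = TRUE)) {
  cloud <- wordcloud2::wordcloud2(freq, shape = "star")
  print(cloud)  # in an interactive session, opens in the RStudio Viewer or a browser
}
```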
Why Analyze Text?
The “epic transformation of archives”: shifting from print to digital archival form (Folsom, 2007)
“As our collective knowledge continues to be digitized and stored (...) it becomes more difficult to find and discover what we are looking for.” (Blei, 2012)
Text Mining Challenges
Sources: 1) Dan Jurafsky; 2) Text Mining with R for Social Science Research (Ryan Wesslen)
Basic Terminology
What is Bag of Words?
- Simplest way to quantify text
- Word order ignored
- Term counts per document
- N-grams (unigrams, bigrams)
Source: Chris Manning
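The bag-of-words idea can be sketched in base R as follows. The two documents are invented for illustration, and real corpora would need the preprocessing steps covered later.

```r
# Two hypothetical one-sentence documents.
docs <- c(
  doc1 = "the cat sat on the mat",
  doc2 = "the dog sat on the cat"
)

# Split each document into lowercase word tokens.
tokens <- strsplit(tolower(docs), "\\s+")
names(tokens) <- names(docs)

# Unigram counts per document: rows are documents, columns are terms.
# Word order plays no role here, only how often each term occurs.
vocab <- sort(unique(unlist(tokens)))
bow <- t(sapply(tokens, function(tok) table(factor(tok, levels = vocab))))
print(bow)

# Bigrams: pair each token with the one that follows it.
bigrams <- lapply(tokens, function(tok) paste(head(tok, -1), tail(tok, -1)))
print(bigrams$doc1)
```

Bigrams recover a little of the word order that the plain unigram counts discard.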
Preprocessing
- Tokenization (splitting words)
- Cleaning (lower case, punctuation)
- Stemming
  - works, worked → work
- Filter (stopwords)
  - and, the, a
Source: Wesslen
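The four steps above can be sketched in base R. The sentence is hypothetical, and `toy_stem` is a deliberately naive suffix stripper written for this example, not a real stemming algorithm such as Porter's.

```r
text <- "The parser works, and the tagger worked: a working example!"

# 1. Tokenization: split the text on whitespace.
tokens <- strsplit(text, "\\s+")[[1]]

# 2. Cleaning: lower-case each token and strip punctuation.
tokens <- gsub("[[:punct:]]", "", tolower(tokens))

# 3. Stemming: toy rule that drops a final "ing", "ed" or "s".
toy_stem <- function(w) sub("(ing|ed|s)$", "", w)
stems <- toy_stem(tokens)

# 4. Filtering: drop stopwords (the three from the slide).
stopwords <- c("and", "the", "a")
stems <- stems[!stems %in% stopwords]

print(stems)
```

In practice a package such as tm or SnowballC would replace the toy stemmer and the hand-made stopword list.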
Macro-analysis
Concept: Macro-analysis (Jockers, 2013), “the construction of abstract models” (Jasinski, 2001)
Methods: Tag clouds, heat maps, clusters, topics, network graphs
Tools: GUI: Voyant, Papermachine, ITMS; TUI: Mallet, Meta, R and Python packages
Visual Analytics
Visual Analytics: “The science of analytical reasoning facilitated by visual interactive interfaces” (Thomas et al., 2005)
- Graphs, maps and trees for literature analysis (Moretti, 2005)
Visualization Methods
- Word clouds to analyze a novel (Vuillemot et al., 2009)
- Social network graphs of characters in Greek tragedies (Rydberg-Cox, 2011)
- Literary fingerprint and summaries (Oelke et al., 2012)
- Tracking emotion and sentiment in fairy tales (Mohammad, 2012)
Topic Modeling
Discovering the underlying themes of a collection of articles from Science magazine, 1990-2000 (Blei, 2012)
Topics - Word Term
Wikipedia Topics
http://www.princeton.edu/~achaney/tmve/wiki100k/browse/topic-presence.html
Wikipedia Topics - Assignment - 10 min
1. Language Related Topic
2. Words: Dialect
3. Related Document: Macedonian Language
4. Related Document: Egyptian hieroglyphs
5. Go to Full article:
6. Find meaning:
Voyant
http://voyant-tools.org/
Voyant - 10 min
http://voyant-tools.org/
- Examine visualization charts (identify types and properties)
- Apply various filters and queries
Voyant Tools - Bubblelines - 7 min
http://docs.voyant-tools.org/tools/
- Delete top terms
- Search for man and woman
- Make sure to have “separate lines for terms” clicked
- Change search terms
Voyant Tools - Pair Work - 10 min
http://docs.voyant-tools.org/tools/
- Examine visualization methods
- Select 5 methods
- Look at the documentation and how to use them
Interactive Text Mining Suite
- A user-friendly tool for quantitative analysis and visualization of unstructured data
- Platform-independent
- Interactive
ITMS Structure
1. File Uploads
   - Upload files (txt, pdf, rdf and Google Books API)
2. Data Preparation
   - Data preprocessing (stopwords, stemming, metadata)
3. Data Visualization
   - Word frequencies, cluster analysis and topic modeling
Workshop Files
- Download 3 text files: https://iu.box.com/s/knua9af3bip7g63s3zdax9ti4z243ldz
  - NY Times articles (3 documents in plain text format)
- ITMS web site: http://www.interactivetextminingsuite.com
Upload File
Preprocessing Data
Before performing data analysis we should preprocess data.
Preprocessing Options
Select preprocessing options and click apply.
Stopwords
Stopwords (e.g. the, and): select Default for English
Manual Removal of Stopwords
Depending on your needs, remove any additional stopwords that you consider noise, e.g. paper, shows, etc.
Select apply.
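Outside the ITMS interface, the same idea (default English stopwords plus a custom list) can be sketched with the tm package. This assumes tm is installed. The two documents are invented, and whether ITMS performs the step this way internally is an assumption.

```r
library(tm)

# Two hypothetical documents containing the noise words "paper" and "shows".
docs <- c(
  "This paper shows that topic models work.",
  "The paper shows new results on clusters."
)
corpus <- VCorpus(VectorSource(docs))

# Lower-case first, because the stopword lists are lower-case.
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)

# Default English stopwords plus corpus-specific noise words.
my_stopwords <- c(stopwords("english"), "paper", "shows")
corpus <- tm_map(corpus, removeWords, my_stopwords)

# Build the document-term matrix and list the terms that survived.
dtm <- DocumentTermMatrix(corpus)
print(Terms(dtm))
```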
Stemming
To improve the analysis, you can stem all your tokens. For example, instead of worked, works, and working, you will have only one relevant stem: work.
Metadata Extraction
You can extract or upload metadata. You will need datestamp (year) information for chronological topic modeling.
Visualization
Word Cloud Representation
Customization
Cluster Analysis
You need to have at least three documents.
Documents will be grouped based on their term similarity measures.
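As a rough sketch of what grouping by term similarity can look like, the base R example below clusters three invented documents. Euclidean distance on raw term counts and average linkage are assumptions made for the example; ITMS may use other measures.

```r
# Three hypothetical documents (cluster analysis needs at least three).
docs <- c(
  syntax1 = "verb noun phrase verb clause",
  syntax2 = "noun phrase clause verb",
  cooking = "salt pepper onion garlic salt"
)

# Term counts per document (rows = documents, columns = terms).
tokens <- strsplit(docs, " ")
vocab <- sort(unique(unlist(tokens)))
dtm <- t(sapply(tokens, function(tok) table(factor(tok, levels = vocab))))
rownames(dtm) <- names(docs)

# Euclidean distance between term-count vectors, then
# agglomerative hierarchical clustering.
d <- dist(dtm)
hc <- hclust(d, method = "average")
plot(hc)  # dendrogram: similar documents join lower in the tree

# Cut the tree into two groups.
groups <- cutree(hc, k = 2)
print(groups)
```

The two syntax documents share most of their terms, so they end up in the same group, apart from the cooking document.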
Topic Modeling
- LDA (Latent Dirichlet Allocation)
- STM (Structural Topic Model)
- Chronological topic visualization (lda): requires metadata
Topic Modeling Tuning
- Selection of topics (how many different themes)
- Selection of words per theme (how many words per topic)
- Selection of iterations (how many sampling iterations to run)
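The three tuning choices map onto function arguments when LDA is run from R. The sketch below assumes the tm and topicmodels packages are installed. The four documents and the parameter values are invented, and ITMS itself may rely on different packages.

```r
library(tm)
library(topicmodels)

# Four hypothetical documents: two about grammar, two about cooking.
docs <- c(
  "verb noun phrase clause syntax verb",
  "noun phrase syntax clause grammar",
  "salt pepper onion garlic recipe salt",
  "onion garlic recipe pepper kitchen"
)
dtm <- DocumentTermMatrix(VCorpus(VectorSource(docs)))

n_topics <- 2         # how many themes to look for
words_per_topic <- 3  # how many top words to show per topic
n_iter <- 500         # number of Gibbs sampling iterations

# Fit the model; the seed makes the run repeatable.
fit <- LDA(dtm, k = n_topics, method = "Gibbs",
           control = list(seed = 42, iter = n_iter))

top_terms <- terms(fit, words_per_topic)
print(top_terms)    # one column per topic
print(topics(fit))  # most likely topic for each document
```

With a corpus this small the topics are only illustrative; on real data the number of topics usually needs several trial runs.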
Topic Model Selection
LDA Topic Model
STM Topic Model
Other Formats - Google Books
Before switching to other data formats, refresh your local browser.
Start with File Uploads and select Structured Data
Select your search terms and submit.
The current limit is 40 books.
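For reference, a search against the public Google Books API can be sketched in base R as below. The query string is hypothetical, and whether ITMS calls this exact endpoint is an assumption. The API's `maxResults` parameter accepts at most 40, which matches the limit mentioned above.

```r
# Build a Google Books API search URL for a (hypothetical) query.
query <- "visual analytics"
max_books <- 40  # the API returns at most 40 volumes per request

url <- paste0(
  "https://www.googleapis.com/books/v1/volumes",
  "?q=", utils::URLencode(query, reserved = TRUE),
  "&maxResults=", max_books
)
print(url)

# To fetch the results (needs internet access and the jsonlite package):
# books <- jsonlite::fromJSON(url)
# books$items$volumeInfo$title
```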
Resources
http://www.rdatamining.com/examples/text-mining
https://en.wikibooks.org/wiki/R_Programming/Text_Processing
http://data.library.virginia.edu/reading-pdf-files-into-r-for-text-mining/
http://www.katrinerk.com/courses/words-in-a-haystack-an-introductory-statistics-course/schedule-words-in-a-haystack/r-code-the-text-mining-package
tm package