Date posted: 23-Jan-2018
Category: Data & Analytics
Uploaded by: olga-scrivner
Visual Analytics for Linguistics - Day 4
Olga Scrivner
Course Info
tm Package
Statistical Visualization
Data Preparation
LVS
Working with Data
Visual Analytics
Inferential Analysis
What You Will Learn
DAY 1 Introduction to Visual Analytics
DAY 2 Visualization Methods, Design, and Tools
DAY 3 Working with Unstructured Data
DAY 4 Working with Structured Data
DAY 5 Advanced Analytics
Our Materials - Web Site
http://obscrivn.wixsite.com/visualization
What We Need
- Interactive Text Mining Suite
- Language Variation Suite
- Download R code
- Download files from Day 3 and Day 4
- R and RStudio
- R libraries: tm, party, partykit
Text Mining Review
- Intro to ITMS - slides Day 3
- R coding
1. Open RStudio
2. Open the R file textmining.R
3. Set up the working directory
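Step 3 can also be done from the R console. The lines below are a minimal sketch; the folder name `workshop_day4` is an assumption made for illustration, and a temporary directory is used so the snippet runs on any machine.

```r
# Hypothetical folder for the workshop files (the name is an assumption)
workdir <- file.path(tempdir(), "workshop_day4")
dir.create(workdir, showWarnings = FALSE)

old <- setwd(workdir)  # setwd() returns the previous working directory
print(getwd())         # confirm where R will now look for files
print(list.files())    # files visible from the working directory
```

In RStudio the same result is available from the menu (Session, Set Working Directory).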
Text Mining Package - tm
- Main structure: the corpus
- A corpus is constructed from a source: DirSource, VectorSource, or DataframeSource
mycorpus <- Corpus(VectorSource(readLines("file.txt")))
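As a small illustration of VectorSource, a corpus can also be built from a character vector held in memory. This is only a sketch: it assumes the tm package is installed, and the two sentences are made-up sample text.

```r
# Assumes the tm package is installed; the sentences are made-up sample text.
if (requireNamespace("tm", quietly = TRUE)) {
  library(tm)
  docs <- c("Visual analytics for linguistics.",
            "Text mining with the tm package.")
  mycorpus <- Corpus(VectorSource(docs))  # one document per vector element
  print(length(mycorpus))                 # number of documents in the corpus
  print(content(mycorpus[[1]]))           # text of the first document
}
```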
Preprocessing with tm
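- lower case (base functions such as tolower must be wrapped in content_transformer)
mycorpus <- tm_map(mycorpus, content_transformer(tolower))
- remove punctuation
mycorpus <- tm_map(mycorpus, removePunctuation)
- remove numbers
mycorpus <- tm_map(mycorpus, removeNumbers)
- remove stopwords
mycorpus <- tm_map(mycorpus, removeWords, stopwords("english"))
- strip extra whitespace
mycorpus <- tm_map(mycorpus, stripWhitespace)
- convert to plain text documents (legacy workaround, needed only when tolower is applied without content_transformer)
mycorpus <- tm_map(mycorpus, PlainTextDocument)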
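To see what each preprocessing step does, the same sequence can be approximated in base R on a single string. This is a rough sketch, not what tm does internally: the sentence is invented, and `my_stopwords` is a tiny hypothetical stand-in for `stopwords("english")`.

```r
# Base-R approximation of the preprocessing steps on one made-up sentence.
# my_stopwords is a hypothetical stand-in for tm's stopwords("english").
txt <- "The 3 Linguists analysed 25 texts, and the results were clear!"
my_stopwords <- c("the", "and", "were")

txt <- tolower(txt)                                   # lower case
txt <- gsub("[[:punct:]]", "", txt)                   # remove punctuation
txt <- gsub("[[:digit:]]", "", txt)                   # remove numbers
pattern <- paste0("\\b(", paste(my_stopwords, collapse = "|"), ")\\b")
txt <- gsub(pattern, "", txt, perl = TRUE)            # remove stopwords
txt <- trimws(gsub("\\s+", " ", txt, perl = TRUE))    # collapse extra whitespace
print(txt)
```

The order matters: lowercasing first lets the lowercase stopword list match "The" as well as "the".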
What is LVS?
Language Variation Suite
It is a Shiny web application originally designed for data analysis in sociolinguistic research.
It can be used for:
- Processing spreadsheet data
- Visualizing data
- Analyzing means, regression, conditional trees ... (and much more)
Background
LVS is built in R using the Shiny package:
1. R - a free programming language for statistical computing and graphics
2. Shiny - a web application framework for R
Computational power of R + Web interactivity
http://littleactuary.github.io/blog/Web-application-framework-with-Shiny/
Workspace
Browser
- Chrome, Firefox, Safari - recommended
- Internet Explorer may cause instability issues
Accessibility
- PC, Mac, Linux
  - Data files can be uploaded from any location on your computer
- Smartphone
  - Data files must be on a cloud platform connected to your phone account (e.g. Dropbox)
Server
Since LVS is hosted on a server, Shiny idle time-out settings may stop the application when it is left inactive (the page will grey out).
Solution: Click reload and re-upload your CSV file
Data Preparation
Important things to consider before data entry:
- File format:
  - Comma-separated values (CSV) - faster processing
  - Excel format will slow processing
- Column names should not contain spaces
  - Permitted: non-accented characters, numbers, underscore, hyphen, and period
- One column must contain your dependent variable
- The rest of the columns contain independent variables
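The column-name rule above can be checked in R before uploading a file. The helper `valid_name` below is hypothetical (it is not part of LVS), and the column names are invented examples.

```r
# Hypothetical helper (not part of LVS): TRUE for names made only of
# unaccented letters, digits, underscore, hyphen, and period.
valid_name <- function(x) grepl("^[A-Za-z0-9_.-]+$", x, perl = TRUE)

cols <- c("budget", "title_year", "actor 1", "g\u00e9nero")  # invented examples
ok <- valid_name(cols)
print(data.frame(column = cols, valid = ok))

# One possible repair for spaces: replace them with underscores
fixed <- gsub(" ", "_", cols)
print(fixed)
```

For a real file, `names(read.csv("yourfile.csv", check.names = FALSE))` would supply the vector to check; the file name here is a placeholder.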
Terminology Review
a. Categorical - non-numerical data with two values
   - yes - no; male - female
b. Continuous - numerical data
   - duration, age, year
c. Multinomial - non-numerical data with three or more values
   - regions, nationalities
d. Ordinal - scale: currently not supported
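These definitions can be turned into a quick check of your own columns. The function `variable_type` is a hypothetical helper written for this illustration, not an LVS function, and the data frame holds made-up values.

```r
# Hypothetical helper that mirrors the definitions above; not an LVS function.
variable_type <- function(x) {
  if (is.numeric(x)) return("continuous")
  n_values <- length(unique(x[!is.na(x)]))
  if (n_values >= 3) {
    "multinomial"
  } else if (n_values == 2) {
    "categorical"
  } else {
    "constant"   # fewer than two distinct values: not usable as a variable
  }
}

movies <- data.frame(            # made-up values
  budget = c(1.2e7, 3.0e6, 8.5e7, 4.0e7),
  sequel = c("yes", "no", "yes", "yes"),
  genre  = c("Action", "Drama", "Comedy", "Action")
)
types <- sapply(movies, variable_type)
print(types)
```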
Workshop Files
1. http://obscrivn.wixsite.com/visualization
   Download file Day 4: movie_metadata.csv
   Simplified set from https://www.kaggle.com/deepmatrix/imdb-5000-movie-dataset
2. LVS web site: http://languagevariationsuite.com/
Direct link to the file: http://cl.indiana.edu/~obscrivn/docs/movie_metadata.csv
Movie Data
- Budget
- Director
- Actor 1
- Director facebook likes
- Actor 1 facebook likes
- Genre
- Year
Language Variation Suite - Structure
1. Data
   - Upload file, data summary, adjust data, cross tabulation
2. Visual Analysis
   - Plotting, cluster classification
3. Inferential Statistics
   - Modeling, regression, conditional trees, random forest
Upload File
Upload movie_metadata.csv
Uploaded Dataset
The data content is imported as a table whose columns can be sorted.
Summary
Summary provides a quantitative summary for each variable, e.g. frequency count, mean, median.
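A comparable summary can be produced in plain R. The sketch below uses a few made-up rows that imitate two columns of the movie data; it is not the code LVS runs.

```r
# Made-up rows imitating two columns of the movie data.
movies <- data.frame(
  genre  = factor(c("Action", "Drama", "Comedy", "Action", "Drama")),
  budget = c(90, 12, 25, 150, 8)   # hypothetical values, in millions
)
print(summary(movies))        # level counts for factors; quartiles and mean for numbers
print(table(movies$genre))    # frequency count
print(mean(movies$budget))
print(median(movies$budget))
```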
Data Structure
1. Total number of observations (rows)
2. Number of variables (columns)
3. Variable types
   - Factor - categorical values
   - Num - numeric values (0.95, 1.05)
   - Int - integer values (1, 2, 3)
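The same three pieces of information are what `str()` reports for a data frame in R. A minimal sketch with invented values:

```r
# Invented values; one column of each type listed above.
movies <- data.frame(
  genre = factor(c("Action", "Drama", "Comedy")),  # Factor
  score = c(0.95, 1.05, 7.3),                      # num
  year  = c(2009L, 2012L, 2015L)                   # int
)
str(movies)                    # observations, variables, and type of each column
print(nrow(movies))            # number of observations (rows)
print(ncol(movies))            # number of variables (columns)
print(sapply(movies, class))   # variable types
```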
Cross Tabulation
Cross-tabulation examines the relationship between variables.
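In R, a cross-tabulation of two categorical variables is a call to `table()`. The sketch below uses made-up rows; the `color` column is an assumption added for illustration and is not among the workshop variables listed earlier.

```r
# Made-up rows; the color column is a hypothetical second categorical variable.
movies <- data.frame(
  genre = c("Action", "Action", "Drama", "Drama", "Comedy", "Action"),
  color = c("Color", "Color", "Black and White", "Color", "Color", "Color")
)
tab <- table(movies$genre, movies$color)
print(tab)                          # counts for each combination
print(prop.table(tab, margin = 1))  # proportions within each genre (row)
```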
Cross Tabulation Plot
Adjusting Browser - Layout
Shiny pages are fluid and reactive.
Language Variation Suite - Structure
1. Data
   - Upload file, data summary, adjust data, cross tabulation
2. Visual Analysis
   - Plotting, cluster classification
3. Inferential statistics
   - Modeling, regression, varbrul analysis, conditional trees, random forest
One Variable Plot
Customizing Plot
Saving Plot
Cluster Plot
- Classification of data into sub-groups is based on pairwise similarities
- Groups are clustered in the form of a tree-like dendrogram
Group 1: Animation, Biography. Group 2: Action, Drama, Comedy.
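Clustering by pairwise similarity can be sketched with base R's `dist()` and `hclust()`. The numbers below are invented per-genre averages, so the resulting groups reflect only these made-up values, and the distance and linkage settings are assumptions rather than the ones LVS uses.

```r
# Invented per-genre averages; settings are assumptions, not LVS defaults.
genre_means <- data.frame(
  budget = c(60, 75, 20, 25, 22),
  likes  = c(5, 6, 2, 3, 2),
  row.names = c("Action", "Animation", "Biography", "Comedy", "Drama")
)
d  <- dist(scale(genre_means))        # pairwise distances on standardized values
hc <- hclust(d, method = "complete")  # agglomerative clustering
plot(hc)                              # tree-like dendrogram
groups <- cutree(hc, k = 2)           # cut the tree into two groups
print(groups)
```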
Inferential Statistics
How to Create a Regression Model
Step 1 Modeling - create a model with dependent and independent variables
Step 2 Regression - specify the type of regression (fixed, mixed) and type of dependent variable (binary, continuous, multinomial)
Step 3 Stepwise Regression - compare models (Log-likelihood, AIC, BIC)
Step 4 Conditional Trees - apply non-parametric tests to the model
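Steps 1 to 3 can be sketched in plain R. The data below are simulated, the column names imitate the movie file, and the functions shown (`lm`, `AIC`, `BIC`, `step`) are standard R tools that may differ from what LVS calls internally.

```r
set.seed(1)  # simulated data; all values are made up
n <- 200
movies <- data.frame(
  genre      = factor(sample(c("Action", "Comedy", "Drama"), n, replace = TRUE)),
  title_year = sample(1990:2016, n, replace = TRUE)
)
base <- c(Action = 80, Comedy = 30, Drama = 25)
movies$budget <- unname(base[as.character(movies$genre)]) + rnorm(n, sd = 10)

# Steps 1-2: fixed-effect models with a continuous dependent variable
m1 <- lm(budget ~ genre, data = movies)
m2 <- lm(budget ~ genre + title_year, data = movies)

# Step 3: compare models
print(logLik(m1)); print(logLik(m2))
print(AIC(m1, m2))            # lower is better
print(BIC(m1, m2))            # lower is better, stronger penalty for extra terms
best <- step(m2, trace = 0)   # backward stepwise selection by AIC
print(formula(best))
```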
Modeling
Regression Types
- Model
  a.) Fixed effect
  b.) Mixed effect - individual speaker/token variation (within group)
- Type of Dependent Variable
  a.) Binary/categorical (only two values)
  b.) Continuous (numeric)
  c.) Multinomial - categorical with more than two values
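The type of dependent variable decides which fitting function applies. A minimal fixed-effect sketch on simulated data follows; the variable `big_budget` is invented for illustration, and the packages named in the final comment are common choices, not necessarily the ones LVS uses.

```r
set.seed(2)  # simulated data; all values are made up
n <- 150
d <- data.frame(genre = factor(sample(c("Action", "Drama"), n, replace = TRUE)))
d$budget <- ifelse(d$genre == "Action", 80, 25) + rnorm(n, sd = 10)           # continuous
d$big_budget <- factor(ifelse(d$budget + rnorm(n, sd = 20) > 50, "yes", "no")) # binary

fit_cont <- lm(budget ~ genre, data = d)                          # continuous outcome
fit_bin  <- glm(big_budget ~ genre, data = d, family = binomial)  # binary outcome
print(coef(fit_cont))
print(coef(fit_bin))   # log-odds of "yes" relative to the reference genre

# Multinomial outcomes and mixed effects need extra packages
# (for example nnet::multinom, lme4::lmer, lme4::glmer); not run here.
```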
Regression
Model Output
Interpretation: Budget and Genre
- Genre Action is the reference value: each genre coefficient is read relative to Action
- Positive coefficient - positive effect
- Negative coefficient - negative effect
http://www.free-online-calculator-use.com/scientific-notation-converter.html
Exponential notation: 1.46e-7 = 0.000000146
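R itself can do this conversion, so an online converter is optional. A short sketch:

```r
x <- 1.46e-7
fixed <- sprintf("%.9f", x)            # fixed notation with 9 decimal places
print(fixed)
print(format(x, scientific = FALSE))   # let R choose the number of digits
options(scipen = 999)                  # optional: discourage exponential notation when printing
print(x)
```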
Conditional Tree
Conditional tree: a simple non-parametric regression analysis, commonly used in social and psychological studies
- Linear regression: all information is combined linearly
- Conditional tree regression: visual splitting to capture interaction between variables
Recursive splitting (tree branches)
1. Genre is the significant factor for budget
2. Budget distribution is split in two groups:
   - Action and Animation
   - Biography, Comedy and Drama
3. Budget is significantly higher for Animation and Action
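A tree of this kind can be fitted in R with `ctree()` from partykit, one of the libraries listed under What We Need. The sketch below assumes partykit is installed and uses simulated data in which Action and Animation are given higher budgets on purpose; it does not reproduce the workshop output.

```r
# Assumes partykit is installed; data are simulated, not the workshop file.
if (requireNamespace("partykit", quietly = TRUE)) {
  set.seed(3)
  n <- 300
  genres <- c("Action", "Animation", "Biography", "Comedy", "Drama")
  movies <- data.frame(genre = factor(sample(genres, n, replace = TRUE)))
  high <- movies$genre %in% c("Action", "Animation")
  movies$budget <- ifelse(high, 80, 25) + rnorm(n, sd = 10)

  tree <- partykit::ctree(budget ~ genre, data = movies)
  print(tree)   # text view of the recursive splits
  plot(tree)    # tree diagram with the budget distribution in each terminal node
}
```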
Random Forest
1. Variable importance for predictors
2. Robust technique for small-n, large-p data (few observations, many predictors)
3. All predictors are considered jointly (allows for inclusion of correlated factors)
Let’s add more factors!
- Return to Modeling
- Add independent factors: director facebook likes, actor 1 facebook likes, title year
- Genre is the most important predictor for this model.
- Predictors with importance close to zero or near the red dotted line (cut-off value) are irrelevant for this model.
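Variable importance of this kind can be computed in R with `cforest()` and `varimp()` from partykit. The sketch below assumes partykit is installed; the data are simulated so that only genre drives budget, and the column names are assumptions modelled on the factors named above.

```r
# Assumes partykit is installed; simulated data in which only genre affects budget.
if (requireNamespace("partykit", quietly = TRUE)) {
  set.seed(4)
  n <- 300
  movies <- data.frame(
    genre = factor(sample(c("Action", "Animation", "Comedy", "Drama"), n, replace = TRUE)),
    director_facebook_likes = rpois(n, 500),
    actor_1_facebook_likes  = rpois(n, 1000),
    title_year = sample(1990:2016, n, replace = TRUE)
  )
  high <- movies$genre %in% c("Action", "Animation")
  movies$budget <- ifelse(high, 80, 25) + rnorm(n, sd = 10)

  forest <- partykit::cforest(budget ~ ., data = movies, ntree = 50)
  vi <- partykit::varimp(forest)        # permutation-based importance
  print(sort(vi, decreasing = TRUE))    # values near zero suggest an irrelevant predictor
}
```

A small `ntree` keeps the example fast; a real analysis would normally use more trees.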
Assignment
- Select LVS or ITMS
- Upload your own file (CSV for LVS or TXT/PDF for ITMS)
- Explore your data
Resources
http://www.rdatamining.com/examples/text-mining
https://en.wikibooks.org/wiki/R_Programming/Text_Processing
http://data.library.virginia.edu/reading-pdf-files-into-r-for-text-mining/
http://www.katrinerk.com/courses/words-in-a-haystack-an-introductory-statistics-course/schedule-words-in-a-haystack/r-code-the-text-mining-package
tm package