Date post: | 25-Dec-2015 |
Upload: | ferdinand-mcgee |
Digital Text and Data Processing
Introduction to R
□ Tools themselves are often based on specific assumptions / subjective decisions
□ There is subjectivity in the way in which tools are used
□ Reproducible results
□ Rockwell & Ramsay, in “Developing Things”: A tool is a theory
Objectivity of DH Research
Willard McCarty, Humanities Computing (Palgrave, 2005)
"The point of all modelling exercises, as of scholarly research generally, is the process seen in and by means of a developing product, not the definitive achievement" (p. 22).
Models, "however finely perfected, are better understood as temporary states in a process of coming to know rather than fixed structures of knowledge" (p. 27).
→ Clash between the scholar's tacit, intuitive knowledge and the computer's need for consistency and explicitness
□ Data creation
□ Data analysis
Two stages in text mining
□ Finding distinctive vocabulary
□ Finding stylistic or grammatical differences and similarities
□ Examining topics or themes
□ Clustering texts on the basis of quantifiable aspects
Types of analyses
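# $dir is assumed to hold the path of the directory to read
my @files;
opendir(my $dh, $dir) or die "Can't open directory '$dir': $!";
while (my $file = readdir($dh)) {
    # keep only file names ending in .txt
    push(@files, $file) if $file =~ /\.txt$/;
}
closedir($dh);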
Reading a directory
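For comparison, the same directory listing can be sketched in R with a single call to base R's list.files(). The temporary directory and the file names below are made up for illustration only.

```r
# Create a throwaway directory with a few hypothetical files
corpus_dir <- file.path(tempdir(), "corpus_demo")
dir.create(corpus_dir, showWarnings = FALSE)
invisible(file.create(file.path(corpus_dir,
                                c("emma.txt", "persuasion.txt", "notes.csv"))))

# List only the files whose names end in ".txt".
# The pattern is a regular expression, so the dot is escaped.
files <- list.files(corpus_dir, pattern = "\\.txt$")
print(files)
```

The result is a character vector of file names, which plays the role of @files in the Perl loop.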
Inverse document frequency
For an application, see Stephen Ramsay, Algorithmic Criticism
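A minimal sketch of inverse document frequency in R, assuming the common definition idf(t) = log(N / df(t)), where N is the number of documents and df(t) the number of documents that contain term t. Other variants exist (smoothing, a different logarithm base), and the three toy "documents" below are made up.

```r
# Toy corpus: each document is a vector of tokens (made-up data)
docs <- list(
  d1 = c("the", "whale", "the", "sea"),
  d2 = c("the", "ship", "sea"),
  d3 = c("the", "captain")
)
n_docs <- length(docs)
vocab <- sort(unique(unlist(docs)))

# Document frequency: in how many documents does each term occur?
doc_freq <- sapply(vocab, function(term) {
  sum(sapply(docs, function(d) term %in% d))
})

# Inverse document frequency (assumed variant: log(N / df))
idf <- log(n_docs / doc_freq)

# Term frequency in d1, weighted by idf
tf_d1 <- sapply(vocab, function(term) sum(docs$d1 == term))
tfidf_d1 <- tf_d1 * idf
print(round(tfidf_d1, 3))
```

A word that occurs in every document ("the") receives a weight of zero, while a word confined to one document ("whale") receives the highest weight. That is why the measure is useful for finding distinctive vocabulary.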
□ Both a programme and a programming language
□ Successor of “S”
□ “a free software environment for statistical computing and graphics”
□ The capabilities of R can be extended via external “packages”
□ Any combination of alphanumerical characters, underscore and dot
□ Unlike Perl, they do not begin with a $
□ The first character cannot be a digit or an underscore. The second character cannot be a digit if the first character is a dot
Variables in R
Allowed:          Not allowed:
data              3rdDataSet
my.data           .4thData.set
my_2ndDataSet
.myCsv
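The naming rules can be checked from within R itself. The sketch below relies on base R's make.names(), which rewrites a string into a syntactically valid name. A string that comes back unchanged was already valid. The variable names candidates and is_valid are chosen here for illustration.

```r
candidates <- c("data", "my.data", "my_2ndDataSet", ".myCsv",
                "3rdDataSet", ".4thData.set")

# A candidate is a valid name if make.names() leaves it untouched
is_valid <- candidates == make.names(candidates)

print(data.frame(name = candidates, valid = is_valid))
```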
□ A collection of indexed values
□ Can be created using the c() function, or by supplying a range
□ N.B. The assignment operator in R is <-
□ Examples:
Vectors
x <- c( 4, 5, 3, 7) ;
y <- 1:30 ;
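A short sketch of what can be done with these two vectors once they exist. It uses base R only. Note that R counts positions from 1, not from 0 as Perl does.

```r
x <- c(4, 5, 3, 7)
y <- 1:30

first   <- x[1]       # first element
middle  <- x[2:3]     # second and third element
n       <- length(y)  # number of elements in y
doubled <- x * 2      # arithmetic is applied to every element

print(first)
print(middle)
print(n)
print(doubled)
```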
□ A collection of vectors, all of the same length
□ Each column of the table is stored in R as a vector.
Data frame
     V1   V2   V3
R1    3    4    5
R2    1   21    8
R3   23    5    6
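The table above can be built by hand with data.frame(), one vector per column. This is a minimal sketch; the variable name tab is an arbitrary choice.

```r
# Each column is supplied as a vector; all vectors have length 3
tab <- data.frame(
  V1 = c(3, 1, 23),
  V2 = c(4, 21, 5),
  V3 = c(5, 8, 6),
  row.names = c("R1", "R2", "R3")
)

print(tab)
print(dim(tab))   # number of rows, number of columns
print(tab$V2)     # a single column is an ordinary vector
```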
Comma Separated Values
i,you,he
Emma,160416,3178,1994
Persuasion,77431,1284,918
PrideAndPrejudice,121812,2068,1356
N.B. The first row has one column fewer than the rows that follow
□ Use the read.csv function, with parameter header = TRUE
□ The CSV file will be represented as a data frame
□ Values on the first line are used as column names; because the first line has one field fewer than the other lines, the first value of each subsequent line is used as a row name
Reading data
data <- read.csv( "data.csv" , header = TRUE) ;
colnames(data)
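To try this without a file on disk, the sketch below passes the CSV lines from the earlier slide to read.csv() through its text argument. The variable csv_text stands in for data.csv and is introduced here only to keep the example self-contained.

```r
csv_text <- "i,you,he
Emma,160416,3178,1994
Persuasion,77431,1284,918
PrideAndPrejudice,121812,2068,1356"

data <- read.csv(text = csv_text, header = TRUE)

print(colnames(data))  # taken from the first line
print(rownames(data))  # taken from the first value of each later line
print(dim(data))       # three texts, three numeric columns
```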
□ Can be accessed using the $ operator
Data frame columns
data <- read.csv( "data.csv" , header = TRUE) ;
data$you
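The $ operator is one of several equivalent ways to reach a column. The sketch below rebuilds the example data by hand so that it runs on its own. The column total is a hypothetical addition for illustration.

```r
data <- data.frame(
  i   = c(160416, 77431, 121812),
  you = c(3178, 1284, 2068),
  he  = c(1994, 918, 1356),
  row.names = c("Emma", "Persuasion", "PrideAndPrejudice")
)

a <- data$you        # column by name with $
b <- data[["you"]]   # the same column with double brackets
c3 <- data[, "you"]  # the same column with [row, column] indexing

# $ can also be used to create a new column
data$total <- data$you + data$he

print(a)
print(data["Emma", "total"])
```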
□ max(), min(), mean(), sd()
Calculations
y <- data$you ;
max(y) ;
sd(y) ;
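A self-contained sketch of these functions, using the three "you" counts from the CSV example in place of a file. sum() is added because it is the natural companion to mean().

```r
y <- c(3178, 1284, 2068)  # the "you" column of the example data

highest <- max(y)
lowest  <- min(y)
average <- mean(y)
spread  <- sd(y)    # sample standard deviation
total   <- sum(y)

print(c(max = highest, min = lowest, mean = average, sd = spread, sum = total))
```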
□ Run the program “typeToken.pl”
□ Use the file “ratio.csv” that is created by this program.
□ Print a list of all the texts that have been read
□ Calculate the average number of tokens
□ Calculate the total number of tokens in the full corpus
□ Identify the lowest number in the column “types”
□ Identify the highest number in the column “ratio”
Exercise
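d <- read.csv("ratio.csv")     # the file from the exercise, which has a "ratio" column
cell <- d[1, 2]                # a single value: row 1, column 2
row2 <- d[2, ]                 # all columns of row 2
od <- d[order(d$ratio), ]      # all rows, sorted by ascending ratio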
Subsetting and sorting
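Two further patterns that build on the same bracket notation: selecting rows by a condition, and sorting in descending order. The data frame below is made up, with hypothetical text names and counts, so that the sketch runs without ratio.csv.

```r
# Made-up counts for three hypothetical texts
d <- data.frame(
  text   = c("TextA", "TextB", "TextC"),
  tokens = c(1000, 2500, 1800),
  types  = c(400, 700, 650)
)
d$ratio <- d$types / d$tokens

# Keep only the rows that satisfy a condition
long_texts <- d[d$tokens > 1500, ]

# Sort from the highest to the lowest ratio
by_ratio <- d[order(d$ratio, decreasing = TRUE), ]

print(long_texts)
print(by_ratio)
```

The expression before the comma selects rows and the expression after it selects columns. Leaving one of them empty means "all".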
□ Qualitative data (categorical)
  □ Nominal scale (unordered scale), e.g. eye colour, marital status
  □ Ordinal scale (ordered scale), e.g. educational level
□ Quantitative data
  □ Interval scale (no true zero point), e.g. temperature in degrees Celsius
  □ Ratio scale (true zero point, so values can be multiplied and divided), e.g. age
Quantitative and Qualitative
Source: Seminar Basic Statistics, Laura Bettens
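In R, these measurement scales map onto different data types. A minimal sketch, assuming the usual convention of factors for nominal data, ordered factors for ordinal data and numeric vectors for quantitative data. All values are made up.

```r
# Nominal: an unordered factor
eye <- factor(c("brown", "blue", "brown", "green"))

# Ordinal: a factor whose levels have an explicit order
edu <- factor(c("secondary", "primary", "tertiary"),
              levels = c("primary", "secondary", "tertiary"),
              ordered = TRUE)

# Ratio: a plain numeric vector
age <- c(23, 35, 41)

print(table(eye))        # counting is meaningful for nominal data
print(edu[1] > edu[2])   # comparison is meaningful for ordinal data
print(mean(age))         # arithmetic is meaningful for quantitative data
```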
□ The relationship between two quantitative variables can be visualised in a variety of ways (e.g. line chart, scatter plot)
□ A combination of one qualitative variable and one quantitative variable is best presented using a bar chart or a dot chart
Diagrams
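A sketch of these chart types with base R graphics, reusing the "you" and "he" counts from the CSV example. The plots are written to a PDF in the session's temporary directory. The file name is arbitrary, and the pdf() and dev.off() lines can be dropped in an interactive session to see the plots on screen.

```r
you <- c(Emma = 3178, Persuasion = 1284, PrideAndPrejudice = 2068)
he  <- c(Emma = 1994, Persuasion = 918,  PrideAndPrejudice = 1356)

out_file <- file.path(tempdir(), "diagrams_demo.pdf")
pdf(out_file)

# One qualitative variable (the text) and one quantitative variable (the count)
barplot(you, main = "Occurrences of 'you'", ylab = "Count")
dotchart(you, main = "Occurrences of 'you'", xlab = "Count")

# Two quantitative variables plotted against each other
plot(you, he, xlab = "Occurrences of 'you'", ylab = "Occurrences of 'he'",
     main = "'you' versus 'he'")

invisible(dev.off())
```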