Analysis of Bob Dylan’s Lyrics
jinseog Kim
2016-10-26
jinseog Kim Analysis of Bob Dylan’s Lyrics 2016-10-26 1 / 28
Bob Dylan
https://www.youtube.com/watch?v=rnKbImRPhTE&index=1&list=RDrnKbImRPhTE
https://www.google.co.kr/webhp?tab=Tw&ei=JXUQWMKfCMef8QX61qvoDg&ved=0EKkuCAYoAg#newwindow=1&tbm=nws&q=bob+dylan
Data and tools (R packages)
Web document
  http://www.azlyrics.com/d/dylan.html
R packages
  stringr: string processing
  XML: XML and HTML processing
  RCurl: HTTP request handling and related functions
R Functions
RCurl::getURL(url, .encoding="UTF-8")
  downloads the content of url
  .encoding: encoding of the contents
htmlParse(doc): parses an XML/HTML file into a tree-structured object
xpathSApply(d, '//*[@id="listAlbum"]/a', xmlValue)
  finds the nodes in the given XMLDocument that match the XPath expression
gsub("album:", "", list.album)
do.call("rbind", . . . )
strsplit: splits a string at a given delimiter
str_trim: removes whitespace from both ends of a string
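As a rough illustration of how these string helpers combine, the sketch below parses two example album headings. The input strings are assumptions about the heading format, and base R's trimws() stands in for str_trim() so that the example runs without stringr.

```r
# Example album headings, assumed to have the form 'album: "Title" (Year)'
list.album <- c('album: "Bob Dylan" (1962)', 'album: "Desire" (1976)')

# Drop the "album: " prefix and the double quotes
list.album <- gsub("album: ", "", list.album)
list.album <- gsub('"', "", list.album)

# Split each heading at "(" and stack the pieces into a two-column matrix
parts <- do.call("rbind", strsplit(list.album, split = "(", fixed = TRUE))

# trimws() is used here in place of stringr::str_trim()
album.title <- trimws(parts[, 1])
album.year <- as.integer(trimws(gsub(")", "", parts[, 2], fixed = TRUE)))

print(data.frame(album.title, album.year))
```

Splitting with fixed = TRUE treats "(" as a literal character rather than as regular-expression syntax.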
Extract meta information & lyrics URLs
library(XML)
library(RCurl)
library(stringr)  # provides str_trim(), used below

url <- "http://www.azlyrics.com/d/dylan.html"
doc <- getURL(url, .encoding="UTF-8")
d <- htmlParse(doc)

# song titles and lyric-page URLs from the album list
list.song.titles <- xpathSApply(d, '//*[@id="listAlbum"]/a', xmlValue)
list.song.urls <- xpathSApply(d, '//*[@id="listAlbum"]/a[@href]', xmlAttrs)[1,]
list.song.urls <- paste0("http://www.azlyrics.com", gsub("..", '', list.song.urls, fixed=TRUE))
list.album <- xpathSApply(d, '//*[@class="album"]', xmlValue)

# anchors with an empty title are used to count the songs per album
a.idx <- which(list.song.titles == "")
nsongs <- c(diff(a.idx)-1, 12, 9)  # the last two counts are hard-coded
ns <- length(nsongs)

# split each album heading into title and year
list.album <- gsub("album: ", "", list.album)
list.album <- gsub('\"', "", list.album)
list.album <- do.call("rbind", strsplit(list.album, split='(', fixed=TRUE))
list.album.title <- str_trim(list.album[,1])
list.album.year <- as.integer(str_trim(gsub(')', "", list.album[,2], fixed=TRUE)))

meta <- data.frame(album.no=rep(seq_along(list.album.title), nsongs),
                   album.title=rep(list.album.title, nsongs),
                   album.year=rep(list.album.year, nsongs),
                   song.title=list.song.titles[list.song.titles != ""],
                   html=list.song.urls, stringsAsFactors=FALSE)
write.csv(meta, file="/home/jskim/data/bobdylan/meta.csv")
Extract meta information & lyrics URLs
head(meta)
##   X album.no album.title album.year             song.title
## 1 1        1   Bob Dylan       1962          She's No Good
## 2 2        1   Bob Dylan       1962       Talkin' New York
## 3 3        1   Bob Dylan       1962    In My Time Of Dyin'
## 4 4        1   Bob Dylan       1962 Man Of Constant Sorrow
## 5 5        1   Bob Dylan       1962    Fixin' To Die Blues
## 6 6        1   Bob Dylan       1962         Pretty Peggy-O
##                                                               html
## 1          http://www.azlyrics.com/lyrics/bobdylan/shesnogood.html
## 2       http://www.azlyrics.com/lyrics/bobdylan/talkinnewyork.html
## 3      http://www.azlyrics.com/lyrics/bobdylan/inmytimeofdyin.html
## 4 http://www.azlyrics.com/lyrics/bobdylan/manofconstantsorrow.html
## 5     http://www.azlyrics.com/lyrics/bobdylan/fixintodieblues.html
## 6        http://www.azlyrics.com/lyrics/bobdylan/prettypeggyo.html
Web crawling from web site
library(stringr)
library(XML)
library(RCurl)

html <- meta$html  # assumed: the lyric-page URLs from the metadata table

write.lyrics <- function(i){
  # append to the title file for every song after the first
  append <- i != 1
  doc <- getURL(html[i], .encoding="UTF-8")
  d <- htmlParse(doc)

  # extract the title
  title <- xpathSApply(d, '/html/body/div[3]/div/div[2]/div[2]', xmlValue)
  title <- gsub("lyrics", "", title)
  title <- gsub("\"", "", title)
  title <- str_trim(title)

  # extract the lyrics
  lyric <- xpathSApply(d, '/html/body/div[3]/div/div[2]/div[6]', xmlValue)
  lyric <- gsub("[\r\n]", " ", lyric)

  filename <- sprintf("bobdylan/lyrics/%03d.txt", i)
  # save the title to a file
  write(title, file="bobdylan/lyrics/title", append=append)
  # save the lyrics to a file
  write(lyric, file=filename, append=FALSE)
}
Web crawling from web site
wt <- round(20+runif(405)*20)  # random waits of 20 to 40 seconds
for(i in c(51:395)){
  cat(i, html[i], "\n")
  write.lyrics(i)
  cat("\t\t.... waiting..", wt[i], "seconds\n")
  Sys.sleep(wt[i])
}
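The loop above pauses 20 to 40 seconds between requests. A minimal sketch of the same idea, with error handling so that one failed page does not stop the crawl, might look like the following. The helper name crawl_politely and its fetch and sleep arguments are hypothetical and not part of the original code; a stub fetcher replaces getURL so that the example runs offline.

```r
# Hypothetical helper: call `fetch` on each URL, then wait a random
# number of seconds (between min_wait and max_wait) after each request.
crawl_politely <- function(urls, fetch, min_wait = 20, max_wait = 40,
                           sleep = Sys.sleep) {
  wt <- round(runif(length(urls), min_wait, max_wait))
  results <- vector("list", length(urls))
  for (i in seq_along(urls)) {
    # A failed request is recorded as NA instead of aborting the loop
    results[[i]] <- tryCatch(fetch(urls[i]),
                             error = function(e) NA_character_)
    sleep(wt[i])
  }
  results
}

# Offline demonstration: a stub fetcher and a no-op sleep
stub_fetch <- function(u) {
  if (grepl("bad", u)) stop("request failed")
  paste("page for", u)
}
pages <- crawl_politely(c("a.html", "bad.html", "c.html"), stub_fetch,
                        sleep = function(s) invisible(NULL))
print(pages)
```

Passing the sleep function as an argument keeps the waiting behaviour in real runs while letting a test skip the delay.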
Corpus
library(tm)
Dir <- "/home/jskim/data/bobdylan/lyrics"
bob <- VCorpus(DirSource(Dir, pattern="txt"))
Pre-processing
library(SnowballC)
#install.packages("SnowballC")
# convert to lowercase
bob.dylan <- tm_map(bob, content_transformer(tolower)); bob.dylan[[1]]$content
# remove numbers (applied to bob.dylan so the lowercase step is kept)
bob.dylan <- tm_map(bob.dylan, removeNumbers); bob.dylan[[1]]$content
# stemming
bob.dylan <- tm_map(bob.dylan, stemDocument); bob.dylan[[1]]$content
# remove stopwords
bob.dylan <- tm_map(bob.dylan, removeWords, stopwords('english')); bob.dylan[[1]]$content
# remove punctuation
bob.dylan <- tm_map(bob.dylan, removePunctuation); bob.dylan[[1]]$content
# strip extra whitespace
bob.dylan <- tm_map(bob.dylan, stripWhitespace); bob.dylan[[1]]$content
POS tagging: part-of-speech classification
Table 1: Part-of-speech tags by category (JJ/NN/RB/VB: adjective/noun/adverb/verb)
TAG    Description
JJ     Adjective
JJR    Adjective, comparative
JJS    Adjective, superlative
NN     Noun, singular or mass
NNS    Noun, plural
NNP    Proper noun, singular
NNPS   Proper noun, plural
RB     Adverb
RBR    Adverb, comparative
RBS    Adverb, superlative
VB     Verb, base form
VBD    Verb, past tense
VBG    Verb, gerund or present participle
VBN    Verb, past participle
VBP    Verb, non-3rd person singular present
VBZ    Verb, 3rd person singular present
POS tagging: part-of-speech classification
# install.packages("openNLPmodels.en", repos = "http://datacube.wu.ac.at/", type = "source")
library(NLP); library(openNLP)
sent_token_annotator <- openNLP::Maxent_Sent_Token_Annotator()
word_token_annotator <- openNLP::Maxent_Word_Token_Annotator()
pos_tag_annotator <- openNLP::Maxent_POS_Tag_Annotator()

classifyPOS <- function(x, ...) {
  x <- as.String(x)
  # sentence and word tokenization, then POS tagging
  y1 <- NLP::annotate(x, list(sent_token_annotator, word_token_annotator))
  y2 <- NLP::annotate(x, pos_tag_annotator, y1)
  y2w <- subset(y2, type == "word")
  words <- x[y2w]
  tags <- sapply(y2w$features, `[[`, "POS")
  # keep nouns, adjectives, and verbs only
  selected <- grepl("^NN|^JJ|^VB", tags)
  word <- words[selected]
  paste(word, collapse=" ")
}
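The final step of classifyPOS keeps only the words whose tag starts with NN, JJ, or VB, so adverbs (RB) are dropped even though they appear in Table 1. The sketch below isolates that filter on a hand-tagged phrase; the words and tags are made up for illustration and were not produced by openNLP.

```r
# Hand-assigned tags for illustration only
words <- c("the", "times", "they", "are", "quickly", "changing")
tags  <- c("DT",  "NNS",   "PRP",  "VBP", "RB",      "VBG")

# Same pattern as in classifyPOS: nouns, adjectives, and verbs
selected <- grepl("^NN|^JJ|^VB", tags)
kept <- paste(words[selected], collapse = " ")
print(kept)
```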
POS tagging
bob.dy <- sapply(bob, function(x) x$content, USE.NAMES=FALSE)
system.time( x <- lapply(bob.dy[1:10], classifyPOS) )
gc(reset=TRUE)
# tag the remaining documents in batches of 10
# note: because the loop starts at i = 2, documents 11 to 20 are not tagged
for(i in 2:39){
  si <- 10*i + 1
  ei <- min(10*(i+1), 395)
  print(system.time( x2 <- lapply(bob.dy[si:ei], classifyPOS) ))
  gc(reset=TRUE)
  x <- c(x, x2)
  print(x[[length(x)]])
  cat(si, ei, "=====================\n")
}
POS tagging + pre-processing
bob.dy <- do.call(c, x)
bob.dy <- tolower(bob.dy); bob.dy[1]
bob.dy <- removeNumbers(bob.dy); bob.dy[1]
bob.dy <- removeWords(bob.dy, stopwords("english"))
bob.dy <- removePunctuation(bob.dy); bob.dy[1]
bob.dy <- stripWhitespace(bob.dy); bob.dy[1]
bob.dy <- stringr::str_trim(bob.dy); bob.dy[1]
Term-Document Matrix
load("bobdy.RData")
library(tm)
bob.dylan2 <- VCorpus(VectorSource(bob.dy))
# term-frequency weighting
tds1 <- TermDocumentMatrix(bob.dylan2,
                           control=list(wordLengths=c(2, Inf), weighting=weightTf))
# binary weighting
tds2 <- TermDocumentMatrix(bob.dylan2,
                           control=list(wordLengths=c(2, Inf), weighting=weightBin))
# tf-idf weighting
tds3 <- TermDocumentMatrix(bob.dylan2,
                           control=list(wordLengths=c(2, Inf), weighting=weightTfIdf))
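The three matrices differ only in how each cell is weighted: raw term frequency, binary presence, and tf-idf. The base R sketch below shows the idea on a toy term-by-document matrix. It assumes the textbook definition tf × log2(N / df) without any length normalization, so its values may differ from those produced by tm's weightTfIdf; the toy counts are invented.

```r
# Toy term-by-document counts (rows: terms, columns: documents)
tf <- matrix(c(2, 1, 1,
               1, 0, 0,
               0, 3, 1),
             nrow = 3, byrow = TRUE,
             dimnames = list(c("love", "road", "night"),
                             c("d1", "d2", "d3")))

# Binary weighting: 1 if the term occurs in the document, else 0
bin <- (tf > 0) * 1

# tf-idf weighting: down-weight terms that occur in many documents
df <- rowSums(tf > 0)          # number of documents containing each term
idf <- log2(ncol(tf) / df)     # inverse document frequency
tfidf <- tf * idf              # idf[i] is recycled along row i

print(bin)
print(round(tfidf, 3))
```

A term that appears in every document ("love" here) receives a tf-idf weight of zero, which is why tf-idf is a natural choice for the correlation step that follows later in the slides.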
Word Frequencies
library(wordcloud)
library(slam)
wordFreq <- row_sums(tds1)
wordFreq <- sort(wordFreq, decreasing=TRUE)
length(wordFreq)
## [1] 6961
wordFreq[1:20]
##   got  know    re  love  come    go gonna   see  said    ve   man   say
##   439   438   393   386   338   298   292   292   290   274   268   266
##  time   get  baby night   let   way  tell   day
##   261   242   223   207   190   188   185   182
Word cloud
library(wordcloud)
pal <- brewer.pal(8, "Dark2")
wordcloud(words=names(wordFreq), freq=wordFreq,
          min.freq=50, random.order=FALSE, random.color=TRUE, colors=pal)
Word cloud
Figure 1: Word cloud of Dylan’s Lyrics
Keyword association network
tds31 <- removeSparseTerms(tds3, sparse=0.9)
M <- t(as.matrix(tds31))
g <- cor(M)
diag(g) <- 0
g[is.na(g)] <- 0
g[g < 0.2 ] <- 0  # keep only correlations of at least 0.2
rownames(g) <- colnames(g) <- Terms(tds31)
library(sna)
sna::gplot(g, label=colnames(g), gmode="graph")
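The network is built by correlating term profiles across documents and keeping only the pairs whose correlation is at least 0.2. A small sketch of that thresholding step on an invented document-by-term matrix follows; the counts are made up, and the resulting adjacency matrix is printed rather than plotted so that the sna package is not required.

```r
# Invented document-by-term counts (rows: documents, columns: terms)
M <- matrix(c(3, 2, 0,
              1, 1, 2,
              4, 3, 0,
              0, 0, 5,
              2, 2, 1),
            ncol = 3, byrow = TRUE,
            dimnames = list(paste0("doc", 1:5), c("love", "baby", "road")))

g <- cor(M)          # term-by-term correlation matrix
diag(g) <- 0         # no self-loops
g[is.na(g)] <- 0     # constant columns would give NA correlations
g[g < 0.2] <- 0      # drop weak and negative associations

print(round(g, 2))
```

Every nonzero cell of g corresponds to an edge in the plotted graph; here only "love" and "baby" remain connected.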
Keyword association network
Figure 2: Keyword network
Topic analysis (latent Dirichlet allocation)
dtm <- as.DocumentTermMatrix(tds1)
library(topicmodels)
lda <- LDA(dtm, k = 4, control=list(seed=123456)) # find 4 topics
#plot(lda@loglikelihood, type="l")
(lda@loglikelihood[length(lda@loglikelihood)])
## [1] -261.6555
(term <- terms(lda, 5))
##      Topic 1 Topic 2 Topic 3 Topic 4
## [1,] "love"  "said"  "love"  "know"
## [2,] "got"   "man"   "re"    "re"
## [3,] "know"  "got"   "know"  "got"
## [4,] "baby"  "time"  "got"   "said"
## [5,] "go"    "see"   "come"  "say"
# assign each song to its most probable topic
tt <- apply(posterior(lda)$topics, 1, which.max)
table(tt)
## tt
##   1   2   3   4
## 126  93  81  85
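The table counts how many songs have each topic as their most probable one. The sketch below reproduces that assignment step on an invented posterior matrix, so it runs without topicmodels; the probabilities and song names are made up.

```r
# Invented document-by-topic posterior probabilities (each row sums to 1)
post <- matrix(c(0.70, 0.10, 0.10, 0.10,
                 0.05, 0.80, 0.10, 0.05,
                 0.60, 0.20, 0.10, 0.10,
                 0.10, 0.10, 0.20, 0.60),
               ncol = 4, byrow = TRUE,
               dimnames = list(paste0("song", 1:4), paste("Topic", 1:4)))

# Index of the largest posterior in each row
tt <- apply(post, 1, which.max)
print(tt)
print(table(tt))
```

Note that this hard assignment discards how close the runner-up topic was; a song with posteriors 0.26, 0.25, 0.25, 0.24 is counted the same as one with 0.97 on a single topic.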
Topic analysis (latent Dirichlet allocation)
x <- posterior(lda)$terms
library(ggplot2)
# keep the terms whose posterior exceeds 0.01 in at least one topic
y <- data.frame(t(x[, apply(x, 2, max) > 0.01]))
z <- data.frame(type=paste("Topic", 1),
                keyword=rownames(y), posterior=y[,1])
for(i in 2:4){
  z <- rbind(z, data.frame(type=paste("Topic", i),
                           keyword=rownames(y), posterior=y[,i]))
}
ggplot(z, aes(keyword, posterior, fill=as.factor(keyword))) +
  geom_bar(position="dodge", stat="identity") +
  coord_flip() +
  facet_wrap(~type, nrow=1) +
  theme(legend.position="none")
Keyword distribution by topic
Figure 3: Keyword distribution by topic (panels: Topic 1 to Topic 4; x-axis: posterior, 0.000 to 0.010; y-axis: keyword; keywords shown: baby, come, go, gonna, got, know, love, man, re, said, ve)
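Topic trends
# meta0: the song metadata table (not defined on these slides)
y <- unique(meta0$album.year)
pos <- posterior(lda)$topics
# mean posterior of topic 1 per album year
trends <- aggregate(pos[,1], list(meta0$album.year[1:385]), mean)
plot(trends[,1], trends[,2], type="b", xlab="year", ylab="posterior for topic 1")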
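The trend is simply the mean topic posterior of all songs released in the same album year. The sketch below shows that grouping step with aggregate on invented values; the years and probabilities are made up for illustration.

```r
# Invented album years and topic-1 posteriors for five songs
year <- c(1962, 1962, 1975, 1975, 1975)
topic1 <- c(0.2, 0.4, 0.9, 0.6, 0.3)

# One row per year, with the mean posterior in column x
trends <- aggregate(topic1, list(year = year), mean)
print(trends)
```

Years with few songs yield noisy means, which is worth keeping in mind when reading the trend plots.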
Topic trends
Figure 4: Topic 1 trend (x-axis: year, 1960 to 2010; y-axis: posterior for topic 1, 0.0 to 0.7)
Topic 2 trend
trends <- aggregate(pos[,2], list(meta0$album.year[1:385]), mean)
plot(trends[,1], trends[,2], type="b", xlab="year", ylab="posterior for topic 2")
(Plot: posterior for topic 2 by year; x-axis: year, 1960 to 2010; y-axis: 0.0 to 0.5)
Topic 3 trend
trends <- aggregate(pos[,3], list(meta0$album.year[1:385]), mean)
plot(trends[,1], trends[,2], type="b", xlab="year", ylab="posterior for topic 3")
(Plot: posterior for topic 3 by year; x-axis: year, 1960 to 2010; y-axis: 0.0 to 0.5)
Topic 4 trend
trends <- aggregate(pos[,4], list(meta0$album.year[1:385]), mean)
plot(trends[,1], trends[,2], type="b", xlab="year", ylab="posterior for topic 4")
(Plot: posterior for topic 4 by year; x-axis: year, 1960 to 2010; y-axis: 0.0 to 0.6)