Is that Dothraki or Valyrian? and other NLP tasks with Python and NLTK
Charlie Redmon | SupStat, Inc.
August 18, 2014
Dothraki
Astapori Valyrian
High Valyrian
Importing raw text
import codecs

dothraki_f = codecs.open(
    "/home/cr/Python/westeros/dothraki.txt",
    encoding='utf-8')
dothraki_raw = dothraki_f.read()
print dothraki_raw
Athchomar chomakaan, [zhey] khal vezhven. Azha
anhaan asshilat ... Itte oakah! Jadi, zhey Jora
Andahli. Khal vezhven. Ajjalan anha zalat vitiherat
yer hatif. Kash qoy qoyi thira disse. Hash shafka
zali addrivat mae, zhey Khaleesi? Ishish chare
...
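For readers on Python 3, the same import can be sketched with the built-in open, which accepts an encoding argument directly. The file name and sample text below are placeholders written to a temporary directory so that the snippet runs on its own; they are not the author's corpus file.

```python
import os
import tempfile

# Placeholder text standing in for the contents of dothraki.txt.
sample = "Athchomar chomakaan, [zhey] khal vezhven. Itte oakah!"

# Write the sample to a temporary UTF-8 file so the example is self-contained.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "dothraki_sample.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write(sample)

# Read it back as Unicode text; the with-block closes the file afterwards.
with open(path, encoding="utf-8") as f:
    dothraki_raw = f.read()

print(dothraki_raw)
```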
Text processing: Cleaning
import re

punct_re = re.compile(
    ur'[\.,;:\?!\u2014\u2019\u2026\[\]]',
    re.UNICODE)
dothraki_proc = punct_re.sub('', dothraki_raw)
dothraki_proc = dothraki_proc.lower()
print dothraki_proc
athchomar chomakaan zhey khal vezhven azha anhaan
asshilat itte oakah jadi zhey jora andahli khal
vezhven ajjalan anha zalat vitiherat yer hatif kash
qoy qoyi thira disse
...
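A Python 3 sketch of the same cleaning step, assuming the character class shown on the slide. The clean helper and the whitespace-collapsing step are additions for illustration (removing a free-standing ellipsis otherwise leaves a double space); the sample sentence is taken from the output above.

```python
import re

# Same character class as the slide: ASCII punctuation, em dash (U+2014),
# right single quote (U+2019), ellipsis (U+2026), and square brackets.
punct_re = re.compile(r"[.,;:?!\u2014\u2019\u2026\[\]]")

def clean(text):
    """Strip punctuation, collapse runs of whitespace, and lowercase."""
    no_punct = punct_re.sub("", text)
    return re.sub(r"\s+", " ", no_punct).strip().lower()

raw = "Athchomar chomakaan, [zhey] khal vezhven. Azha anhaan asshilat ... Itte oakah!"
print(clean(raw))
# athchomar chomakaan zhey khal vezhven azha anhaan asshilat itte oakah
```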
Text processing: Tokenizing
dothraki_tokens = re.split(ur'\s+', dothraki_proc)
dothraki_types = set(dothraki_tokens)
print dothraki_types
set([u'izzi', u'ale', u'morea', u'vesazhao',
u'yeri', u'ishish', u'dalen', u'vesazhae', u'yera',
u'afisi', u'rhae', u'mawizzi', u'vee', u'arrisse',
u'ti', u'ven', u'rizh', u'afichak', u'gache',
u'zigerek', u'zigereo', u'drivoe', u'maaz',
u'zigeree', u'ayyeyoon', u'maan', u'mahrazhi',
u'ma', u'vos', u'movekkhi', u'mahrazhis',
u'meshafka', u'qisi', u'sani', u'ville', u'vikeesi',
u'ifak', u'javrathi', u'zisa', u'chek', u'nem',
...
])
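The token/type distinction can be seen on a toy string. This Python 3 sketch uses made-up input built from words in the Dothraki sample; it also filters the empty strings that re.split returns when the text begins or ends with whitespace, which would otherwise count as a token and a type.

```python
import re

text = " anha vos anha me vos anha "  # toy input, not the corpus

# re.split yields '' at the edges when the text starts or ends with
# whitespace, so drop empty strings before counting.
tokens = [t for t in re.split(r"\s+", text) if t]
types = set(tokens)

print(len(tokens), len(types))  # 6 3
```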
Inspecting the lexical distribution in a text
from nltk import FreqDist

dothraki_freqdist = FreqDist(dothraki_tokens)
print dothraki_freqdist
<FreqDist: u'anha': 50, u'vos': 40, u'me': 39,
u'ma': 38, u'zhey': 29, u'mae': 27, u'anni': 26,
u'hash': 23, u'yer': 23, u'khal': 16,
u'khaleesi': 16, u'mori': 15, u'jin': 13,
u'kisha': 12, u'nem': 11, u'vo': 11, u'che': 10,
u'jini': 10, u'she': 10, ... >
dothraki_freqdist.plot(20, cumulative=True)
Cumulative frequency distribution (CFD) of Dothraki words
Top 10: anha, vos, me, ma, zhey, mae, anni, hash, yer, khal
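What a cumulative frequency plot draws can be sketched without NLTK. The Python 3 example below uses collections.Counter as a stand-in for FreqDist (in NLTK 3, FreqDist is a Counter subclass) and a toy token list rather than the corpus.

```python
from collections import Counter
from itertools import accumulate

tokens = "anha vos anha me vos anha".split()  # toy data
freq = Counter(tokens)

# Most frequent words first, as in the FreqDist printout.
top = freq.most_common(3)
print(top)         # [('anha', 3), ('vos', 2), ('me', 1)]

# Running totals over the ranked counts: the values a cumulative plot draws.
cumulative = list(accumulate(count for _, count in top))
print(cumulative)  # [3, 5, 6]
```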
Valyrian vocabulary distribution
Astapori Valyrian (Top 10):
ji, me, do, espo, si, mysa, eji, ez, ivetra, sa
High Valyrian (Top 10):
daor, se, issa, syt, ziry, hen, jemele, lue, yne, avy
Feature 1: Consonant proportion
def c_prop(word):
    c_num = 0
    for letter in u'bcdfgjklmnpqrstvxz\u00f1':
        c_num += word.count(letter)
    # float() avoids integer division under Python 2
    return c_num / float(len(word))
c_prop(u'z\u016bgusy')
0.5
Word-internal consonant proportions across languages
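A per-language figure presumably comes from averaging the per-word value over a word list. The Python 3 sketch below assumes that reading; mean_c_prop is a hypothetical helper, and the two-word list is only for illustration.

```python
CONSONANTS = "bcdfgjklmnpqrstvxz\u00f1"  # same inventory as c_prop on the slide

def c_prop(word):
    """Fraction of the letters in `word` that are in the consonant inventory."""
    return sum(word.count(c) for c in CONSONANTS) / len(word)

def mean_c_prop(words):
    """Average consonant proportion over a list of non-empty words."""
    return sum(c_prop(w) for w in words) / len(words)

print(c_prop("z\u016bgusy"))                 # 0.5, matching the slide
print(mean_c_prop(["z\u016bgusy", "anha"]))  # 0.375
```

Note that h and y are not in the inventory, so anha counts only n as a consonant.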
Feature 2: Obstruent proportion
def obstruent_prop(word):
    obstruent_num = 0
    for letter in u'bcdfgjkpqstvxz':
        obstruent_num += word.count(letter)
    # float() avoids integer division under Python 2
    return obstruent_num / float(len(word))
obstruent_prop(u'\u012blvi')
0.25
Word-internal obstruent proportions across languages
Feature 3: Coda presence
def c_coda(word):
    if word[-1] in u'bcdfgjklmnpqrstvxz\u00f1':
        return 1
    else:
        return 0
def obstruent_coda(word):
    if word[-1] in u'bcdfgjkpqstvxz':
        return 1
    else:
        return 0
c_coda(u'lysoon')
1
obstruent_coda(u'lysoon')
0
Mean coda consonant presence across languages
Mean coda obstruent presence across languages
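As a rough illustration of a per-language mean, this Python 3 sketch applies c_coda to the three top-10 word lists quoted earlier in the slides. These are only the ten most frequent words per language, not the full texts, so the numbers are not expected to match the figures.

```python
CONSONANTS = "bcdfgjklmnpqrstvxz\u00f1"  # inventory used by c_coda on the slide

def c_coda(word):
    """1 if the word ends in a consonant from the inventory, else 0."""
    return int(word[-1] in CONSONANTS)

# Top-10 word lists quoted earlier in the slides (not the full corpora).
top10 = {
    "Dothraki": "anha vos me ma zhey mae anni hash yer khal".split(),
    "Astapori Valyrian": "ji me do espo si mysa eji ez ivetra sa".split(),
    "High Valyrian": "daor se issa syt ziry hen jemele lue yne avy".split(),
}

mean_coda = {lang: sum(c_coda(w) for w in words) / len(words)
             for lang, words in top10.items()}
print(mean_coda)
# {'Dothraki': 0.3, 'Astapori Valyrian': 0.1, 'High Valyrian': 0.3}
```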
Feature 4: Consonant clusters
regex = ur'[bcdfghjklmnpqrstvxz\u00f1][bcdfghjklmnpqrstvxz\u00f1]+'
def c_cluster(word):
    cc_set = re.findall(regex, word, re.UNICODE)
    return len(cc_set)
c_cluster(u'avvirsosh')
3
Mean consonant cluster frequency across languages
Feature 5: Obstruent clusters
regex1 = ur'[bcdfghjkpqstvxz][bcdfghjkpqstvxz]+'
def obs_cluster(word):
    oo_set = re.findall(regex1, word, re.UNICODE)
    return len(oo_set)
obs_cluster(u'avvirsosh')
2
Mean obstruent cluster frequency across languages
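To see which substrings the two cluster counts are based on, the matches themselves can be printed. This Python 3 sketch writes each pattern with {2,}, which matches the same strings as a character class followed by the same class with +. Unlike the inventories in Features 1 to 3, both cluster classes include h.

```python
import re

C_CLUSTER = re.compile(r"[bcdfghjklmnpqrstvxz\u00f1]{2,}")  # consonant runs
O_CLUSTER = re.compile(r"[bcdfghjkpqstvxz]{2,}")            # obstruent runs

word = "avvirsosh"
print(C_CLUSTER.findall(word))  # ['vv', 'rs', 'sh']
print(O_CLUSTER.findall(word))  # ['vv', 'sh']
```

The run rs is a consonant cluster but not an obstruent cluster, because r is absent from the obstruent class.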
Feature 6: Vowel clusters
regex2 = ur'[bcdfghjklmnpqrstvxz\u00f1]+'
def v_cluster(word):
    # pass flags by keyword: the third positional argument of re.split is maxsplit
    v_set = re.split(regex2, word, flags=re.UNICODE)
    vv_set = [v for v in v_set if len(v) > 1]
    return len(vv_set)
v_cluster(u'haeshi')
1
Mean vowel cluster frequency across languages
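The vowel-cluster count works by splitting on consonant runs and keeping the leftover pieces longer than one character. In this Python 3 sketch the pattern is compiled first, which sidesteps the fact that the third positional parameter of re.split is maxsplit rather than flags; CONSONANT_RUN is a name introduced here for illustration.

```python
import re

CONSONANT_RUN = re.compile(r"[bcdfghjklmnpqrstvxz\u00f1]+")

def v_cluster(word):
    """Count vowel sequences of length 2 or more between consonant runs."""
    pieces = CONSONANT_RUN.split(word)
    return len([p for p in pieces if len(p) > 1])

print(CONSONANT_RUN.split("haeshi"))  # ['', 'ae', 'i']
print(v_cluster("haeshi"))            # 1
```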
Data from real languages
TDIL Assamese Corpus
Assamese corpus files
import os

directory = "/home/cr/Documents/NLPwP_pres/TDIL_assamese_corpus_data"
os.listdir(directory)
['subj_art2.txt', 'subj_politics1.txt', 'lit3.txt',
'drama.txt', 'religion2.txt', 'criticism2.txt',
'criticism1.txt', 'subj_science3.txt',
'ref_encyclopaedia-entry.txt', 'subj_science2.txt',
'subj_social-studies.txt', 'music.txt', 'subj_art1.txt',
'subj_science1.txt', 'subj_art3.txt', 'news.txt',
'subj_sociology.txt', 'criticism3.txt', 'lit8.txt',
'subj_history1.txt', 'lit4.txt', 'lit6.txt', 'religion3.txt',
'subj_law.txt', 'lit7.txt', 'religion1.txt', 'criticism4.txt',
'lit5.txt', 'subj_math.txt', 'lit1.txt', 'subj_science6.txt',
'subj_science_5.txt', 'subj_history2.txt', 'lit2.txt',
'subj_science4.txt', 'letter.txt']
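One way to go from a directory listing to usable text is to read every .txt file into a dictionary keyed by file name. The Python 3 sketch below builds a throwaway directory with two placeholder files so it can run anywhere; the file names and contents are invented, and UTF-8 encoding is an assumption about the corpus.

```python
import os
import tempfile

# Build a throwaway directory with two tiny files (placeholder names and text).
corpus_dir = tempfile.mkdtemp()
placeholders = [("lit_a.txt", "\u09b8\u09be\u0997\u09f0"),
                ("news_a.txt", "\u0986\u09b6\u09be")]
for name, text in placeholders:
    with open(os.path.join(corpus_dir, name), "w", encoding="utf-8") as f:
        f.write(text)

# Map each .txt file name to its decoded contents.
corpus = {}
for name in sorted(os.listdir(corpus_dir)):
    if name.endswith(".txt"):
        with open(os.path.join(corpus_dir, name), encoding="utf-8") as f:
            corpus[name] = f.read()

print(sorted(corpus))  # ['lit_a.txt', 'news_a.txt']
```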
Assamese sample: ‘lit5.txt’
Frequency of the sound /x/ in ‘lit5.txt’
len(re.findall(ur'[\u09b6\u09b7\u09b8]',
    assamese_sample_raw, re.UNICODE))
1313
len(re.findall(ur'\u09b6', assamese_sample_raw,
    re.UNICODE))
298
len(re.findall(ur'\u09b7', assamese_sample_raw,
    re.UNICODE))
195
len(re.findall(ur'\u09b8', assamese_sample_raw,
    re.UNICODE))
820
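The three per-letter counts sum to the combined count (298 + 195 + 820 = 1313), as they should. The same tallies can be collected in a single pass, sketched here in Python 3 with collections.Counter; sibilant_counts is a hypothetical helper and the sample string is a placeholder, not text from lit5.txt.

```python
from collections import Counter

SIBILANTS = "\u09b6\u09b7\u09b8"  # the three letters counted on the slide

def sibilant_counts(text):
    """Return per-letter counts and their total for the three letters."""
    counts = Counter(ch for ch in text if ch in SIBILANTS)
    return counts, sum(counts.values())

# Placeholder text: three short space-separated strings.
sample = "\u09b8\u09be\u0997\u09f0 \u0986\u09b6\u09be \u09b8\u09ae\u09df"
counts, total = sibilant_counts(sample)
print(counts["\u09b6"], counts["\u09b7"], counts["\u09b8"], total)  # 1 0 2 3
```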
Positional restrictions
Beginning a word:
len(re.findall(ur'\b[\u09b6\u09b7\u09b8]',
    assamese_sample_raw, re.UNICODE))
1129
Ending a word:
len(re.findall(ur'[\u09b6\u09b7\u09b8]\b',
    assamese_sample_raw, re.UNICODE))
895
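One thing worth checking with \b in this script: Python's re module does not treat dependent vowel signs such as U+09BE as word characters, so \b also matches between a consonant letter and a following vowel sign. The Python 3 sketch below shows the effect on a single placeholder word and compares it with a whitespace-based lookahead. Whether this affects the counts above has not been checked against the author's data, and the lookahead variant ignores punctuation attached to words.

```python
import re

SIBILANT = "[\u09b6\u09b7\u09b8]"

# One word: letter AA (U+0986) + letter SHA (U+09B6) + vowel sign AA (U+09BE).
word = "\u0986\u09b6\u09be"

# \b sees the vowel sign as a non-word character, so SHA looks word-final.
with_boundary = re.findall(SIBILANT + r"\b", word)
print(len(with_boundary))    # 1

# Whitespace-based alternative: the letter must not be followed by any
# non-space character, i.e. it is the last character of its token.
with_lookahead = re.findall(SIBILANT + r"(?!\S)", word)
print(len(with_lookahead))   # 0
```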
Positional restrictions
Following /a/:
len(re.findall(ur'\u09be[\u09b6\u09b7\u09b8]',
    assamese_sample_raw, re.UNICODE))
57
Following /i/:
len(re.findall(ur'[\u09bf\u09c0][\u09b6\u09b7\u09b8]',
    assamese_sample_raw, re.UNICODE))
70
Following /u/:
len(re.findall(ur'[\u09c1\u09c2][\u09b6\u09b7\u09b8]',
    assamese_sample_raw, re.UNICODE))
10
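The three vowel contexts can be counted in one loop. This Python 3 sketch groups the vowel signs exactly as in the patterns above; following_counts is a hypothetical helper and the sample string is a placeholder. As in the patterns above, only dependent vowel signs are matched, so independent vowel letters before the consonant are not counted.

```python
import re

SIBILANT = "[\u09b6\u09b7\u09b8]"
# Vowel signs grouped as in the patterns above:
# /a/ = U+09BE, /i/ = U+09BF or U+09C0, /u/ = U+09C1 or U+09C2.
VOWEL_SIGNS = {"a": "[\u09be]", "i": "[\u09bf\u09c0]", "u": "[\u09c1\u09c2]"}

def following_counts(text):
    """Count the three letters directly after each group of vowel signs."""
    return {vowel: len(re.findall(signs + SIBILANT, text))
            for vowel, signs in VOWEL_SIGNS.items()}

# Placeholder text: three short space-separated strings.
sample = "\u0995\u09be\u09b6 \u09ac\u09bf\u09b7 \u09ae\u09be\u09b8"
print(following_counts(sample))  # {'a': 2, 'i': 1, 'u': 0}
```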
Further work
- Incorporate segmental parameters into classifier (fix Unicode issues with NLTK’s classify module)
- Use classifier to predict assignment of random words from Westeros to Dothraki, Astapori Valyrian, and High Valyrian languages
- Isolate most important word-internal parameters in classification model (log-likelihood ranking in Naive Bayes model)
- Use full distributional account of select Assamese consonants as priors in acoustic classification model
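As a possible starting point for the classifier step, the measures from Features 1 to 6 can be bundled into one feature dictionary per word. The Python 3 sketch below is an assumption about how that might look, not the author's implementation; word_features is a hypothetical helper, and the two labeled words come from the top-10 lists above. NLTK's NaiveBayesClassifier.train accepts a list of (featureset, label) pairs of this shape.

```python
import re

CONS = "bcdfgjklmnpqrstvxz\u00f1"  # consonant inventory from Features 1 and 3
OBS = "bcdfgjkpqstvxz"             # obstruent inventory from Features 2 and 3
C_RUN = "[" + CONS + "h]"          # the cluster patterns also include h
O_RUN = "[" + OBS + "h]"

def word_features(word):
    """Hypothetical helper: collect the word-internal measures in one dict."""
    n = len(word)
    return {
        "c_prop": sum(word.count(c) for c in CONS) / n,
        "obstruent_prop": sum(word.count(c) for c in OBS) / n,
        "c_coda": int(word[-1] in CONS),
        "obstruent_coda": int(word[-1] in OBS),
        "c_cluster": len(re.findall(C_RUN + "{2,}", word)),
        "obs_cluster": len(re.findall(O_RUN + "{2,}", word)),
        "v_cluster": len([p for p in re.split(C_RUN + "+", word) if len(p) > 1]),
    }

# Labeled examples in (featureset, label) form.
labeled = [
    (word_features("khal"), "Dothraki"),
    (word_features("daor"), "High Valyrian"),
]
print(labeled[0][0]["c_cluster"], labeled[1][0]["v_cluster"])  # 1 1
```

NLTK's Naive Bayes implementation treats feature values as discrete categories, so the two proportion features would likely need to be binned before training.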
Thank you