Natural Language Processing (SupStat Inc)

Is that Dothraki or Valyrian? And other NLP tasks with Python and NLTK

Charlie Redmon | SupStat, Inc.

August 18, 2014

Dothraki

Astapori Valyrian

High Valyrian

Importing raw text

import codecs

dothraki_f = codecs.open(
    "/home/cr/Python/westeros/dothraki.txt",
    encoding='utf-8')
dothraki_raw = dothraki_f.read()
print dothraki_raw

Athchomar chomakaan, [zhey] khal vezhven. Azha
anhaan asshilat... Itte oakah! Jadi, zhey Jora
Andahli. Khal vezhven. Ajjalan anha zalat vitiherat
yer hatif. Kash qoy qoyi thira disse. Hash shafka
zali addrivat mae, zhey Khaleesi? Ishish chare
...
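In Python 3 the same read needs no `codecs` module, because the built-in `open` accepts an `encoding` argument. The following is a minimal sketch, assuming Python 3 (the slides use Python 2); the temporary file is a stand-in for `dothraki.txt`, and its content is just the first sentence of the output shown on this slide.

```python
import os
import tempfile

# Stand-in for dothraki.txt: write a short UTF-8 sample to a temporary file.
sample = "Athchomar chomakaan, [zhey] khal vezhven."
with tempfile.NamedTemporaryFile("w", encoding="utf-8", suffix=".txt",
                                 delete=False) as tmp:
    tmp.write(sample)
    path = tmp.name

# The context manager closes the file even if reading raises an error.
with open(path, encoding="utf-8") as dothraki_f:
    dothraki_raw = dothraki_f.read()

os.remove(path)  # clean up the stand-in file
print(dothraki_raw)
```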

Text processing: Cleaning

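import re

# Punctuation to strip: ASCII marks, em dash, right single quote,
# ellipsis, and square brackets
punct_re = re.compile(
    ur'[\.,;:\?!\u2014\u2019\u2026\[\]]',
    re.UNICODE)
dothraki_proc = punct_re.sub('', dothraki_raw)
dothraki_proc = dothraki_proc.lower()
print dothraki_proc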

athchomar chomakaan zhey khal vezhven azha anhaan

asshilat itte oakah jadi zhey jora andahli khal

vezhven ajjalan anha zalat vitiherat yer hatif kash

qoy qoyi thira disse

...
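The cleaning step can be wrapped in a small function so it is reusable for the Valyrian texts as well. This is a sketch assuming Python 3; the function name `clean` is hypothetical, and the sample string is the opening of the raw Dothraki text with its ellipsis written as U+2026.

```python
import re

# Same character class as on the slide: ASCII punctuation, em dash,
# right single quote, ellipsis, and square brackets.
punct_re = re.compile(r"[.,;:?!\u2014\u2019\u2026\[\]]")

def clean(text):
    """Delete punctuation characters, then lowercase the text."""
    return punct_re.sub("", text).lower()

raw = ("Athchomar chomakaan, [zhey] khal vezhven. "
       "Azha anhaan asshilat\u2026 Itte oakah!")
cleaned = clean(raw)
print(cleaned)
```

Because matches are deleted rather than replaced by a space, two words joined only by a dash would be fused into one token; whether that matters depends on how the transcripts are punctuated.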

Text processing: Tokenizing

dothraki_tokens = re.split(ur'\s+', dothraki_proc)
dothraki_types = set(dothraki_tokens)
print dothraki_types

set([u'izzi', u'ale', u'morea', u'vesazhao',
u'yeri', u'ishish', u'dalen', u'vesazhae', u'yera',
u'afisi', u'rhae', u'mawizzi', u'vee', u'arrisse',
u'ti', u'ven', u'rizh', u'afichak', u'gache',
u'zigerek', u'zigereo', u'drivoe', u'maaz',
u'zigeree', u'ayyeyoon', u'maan', u'mahrazhi',
u'ma', u'vos', u'movekkhi', u'mahrazhis',
u'meshafka', u'qisi', u'sani', u'ville', u'vikeesi',
u'ifak', u'javrathi', u'zisa', u'chek', u'nem',
...
])
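One detail of whitespace tokenizing is worth checking: `re.split` returns empty strings when the text begins or ends with whitespace, and those would be counted as tokens. A small sketch, assuming Python 3 and a made-up sample string, compares it with `str.split()`:

```python
import re

proc = " athchomar chomakaan zhey khal vezhven khal \n"

# re.split keeps an empty string at each end when the text starts
# or ends with whitespace.
tokens_re = re.split(r"\s+", proc)

# str.split() with no argument discards leading/trailing whitespace.
tokens = proc.split()
types = set(tokens)

print(tokens_re)
print(len(tokens), len(types))
```

Here `khal` occurs twice, so the token count and the type count differ by one.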

Inspecting the lexical distribution in a text

from nltk import FreqDist

dothraki_freqdist = FreqDist(dothraki_tokens)
print dothraki_freqdist

<FreqDist: u'anha': 50, u'vos': 40, u'me': 39,
u'ma': 38, u'zhey': 29, u'mae': 27, u'anni': 26,
u'hash': 23, u'yer': 23, u'khal': 16,
u'khaleesi': 16, u'mori': 15, u'jin': 13,
u'kisha': 12, u'nem': 11, u'vo': 11, u'che': 10,
u'jini': 10, u'she': 10, ... >

dothraki_freqdist.plot(20, cumulative=True)

Cumulative frequency distribution (CFD) of Dothraki words

Top 10: anha, vos, me, ma, zhey, mae, anni, hash, yer, khal
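The numbers behind a cumulative frequency plot can be reproduced without NLTK. The sketch below swaps in the standard library's `collections.Counter` for `FreqDist` and reports a cumulative share of tokens rather than a plot; it assumes Python 3, and the function name `top_cumulative` and the toy token list are hypothetical.

```python
from collections import Counter

def top_cumulative(tokens, n):
    """Return (word, count, cumulative share of all tokens)
    for the n most frequent words."""
    counts = Counter(tokens)
    total = sum(counts.values())
    running = 0
    rows = []
    for word, count in counts.most_common(n):
        running += count
        rows.append((word, count, running / total))
    return rows

# Toy input; the real input would be dothraki_tokens.
tokens = "anha vos anha me anha vos zhey".split()
rows = top_cumulative(tokens, 2)
for row in rows:
    print(row)
```

A steeply rising cumulative share means a handful of function words cover most of the text, which is the pattern the top-10 lists suggest.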

Valyrian vocabulary distribution

Astapori Valyrian (Top 10):

ji, me, do, espo, si, mysa, eji, ez, ivetra, sa

High Valyrian (Top 10):

daor, se, issa, syt, ziry, hen, jemele, lue, yne, avy

Feature 1: Consonant proportion

def c_prop(word):
    c_num = 0
    for letter in u'bcdfgjklmnpqrstvxz\u00f1':
        c_num += word.count(letter)
    # float() avoids Python 2 integer division, which would return 0
    return c_num / float(len(word))

c_prop(u'z\u016bgusy')

0.5

Word-internal consonant proportions across languages

Feature 2: Obstruent proportion

def obstruent_prop(word):
    obstruent_num = 0
    for letter in u'bcdfgjkpqstvxz':
        obstruent_num += word.count(letter)
    # float() avoids Python 2 integer division, which would return 0
    return obstruent_num / float(len(word))

obstruent_prop(u'\u012blvi')

0.25
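Features 1 and 2 differ only in the letter set, so they can share one helper. This is a sketch assuming Python 3, where `/` is always true division; the name `letter_prop` is hypothetical, and the guard for empty words is an addition, since whitespace splitting can produce empty tokens.

```python
# Letter sets copied from the slides.
CONSONANTS = "bcdfgjklmnpqrstvxz\u00f1"
OBSTRUENTS = "bcdfgjkpqstvxz"

def letter_prop(word, letters):
    """Share of the characters in `word` that belong to `letters`.
    Returns 0.0 for an empty word instead of dividing by zero."""
    if not word:
        return 0.0
    return sum(1 for ch in word if ch in letters) / len(word)

print(letter_prop("z\u016bgusy", CONSONANTS))
print(letter_prop("\u012blvi", OBSTRUENTS))
```

The two calls reproduce the values shown on the slides for `zūgusy` and `īlvi`.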

Word-internal obstruent proportions across languages

Feature 3: Coda presence

def c_coda(word):
    if word[-1] in u'bcdfgjklmnpqrstvxz\u00f1':
        return 1
    else:
        return 0

def obstruent_coda(word):
    if word[-1] in u'bcdfgjkpqstvxz':
        return 1
    else:
        return 0

c_coda(u'lysoon')

1

obstruent_coda(u'lysoon')

0

Mean coda consonant presence across languages

Mean coda obstruent presence across languages
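A per-language mean of a binary feature is just the average of its values over a word list. The sketch below assumes Python 3 and uses only the top-10 words listed earlier in the slides as input, so its numbers are illustrative and are not the values plotted in the figures; `mean_feature` is a hypothetical helper.

```python
CONSONANTS = "bcdfgjklmnpqrstvxz\u00f1"  # consonant set from the slide

def c_coda(word):
    """1 if the word ends in one of the listed consonants, else 0."""
    return 1 if word[-1] in CONSONANTS else 0

def mean_feature(words, feature):
    """Average a per-word feature, skipping empty strings."""
    words = [w for w in words if w]
    return sum(feature(w) for w in words) / len(words)

# Top-10 lists from the vocabulary slides (a tiny sample, for illustration).
top10 = {
    "Dothraki": "anha vos me ma zhey mae anni hash yer khal".split(),
    "Astapori Valyrian": "ji me do espo si mysa eji ez ivetra sa".split(),
    "High Valyrian": "daor se issa syt ziry hen jemele lue yne avy".split(),
}
means = {lang: mean_feature(words, c_coda) for lang, words in top10.items()}
for lang, value in means.items():
    print(lang, value)
```

Note that `h` and `y` are absent from the slide's consonant string, so words such as `hash` and `zhey` are scored as having no coda consonant.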

Feature 4: Consonant clusters

regex = ur'[bcdfghjklmnpqrstvxz\u00f1][bcdfghjklmnpqrstvxz\u00f1]+'

def c_cluster(word):
    cc_set = re.findall(regex, word, re.UNICODE)
    return len(cc_set)

c_cluster(u'avvirsosh')

3

Mean consonant cluster frequency across languages

Feature 5: Obstruent clusters

regex1 = ur'[bcdfghjkpqstvxz][bcdfghjkpqstvxz]+'

def obs_cluster(word):
    oo_set = re.findall(regex1, word, re.UNICODE)
    return len(oo_set)

obs_cluster(u'avvirsosh')

2

Mean obstruent cluster frequency across languages
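Features 4 and 5 both count maximal runs of two or more letters from a given set, so one parameterized function can cover both. A sketch assuming Python 3; `count_runs` is a hypothetical name, and `[x]{2,}` is used as an equivalent of the slides' `[x][x]+`.

```python
import re

# Letter sets copied from the cluster regexes on the slides
# (these include "h", unlike the sets used for Features 1-3).
CONS = "bcdfghjklmnpqrstvxz\u00f1"
OBS = "bcdfghjkpqstvxz"

def count_runs(word, letters, min_len=2):
    """Count maximal runs of at least `min_len` characters from `letters`."""
    pattern = "[" + letters + "]{" + str(min_len) + ",}"
    return len(re.findall(pattern, word))

print(count_runs("avvirsosh", CONS))  # clusters: vv, rs, sh
print(count_runs("avvirsosh", OBS))   # clusters: vv, sh
```

In `avvirsosh`, the sequence `rs` counts as a consonant cluster but not as an obstruent cluster, because `r` is not in the obstruent set.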

Feature 6: Vowel clusters

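regex2 = ur'[bcdfghjklmnpqrstvxz\u00f1]+'

def v_cluster(word):
    # flags must be passed by keyword: the third positional
    # argument of re.split is maxsplit, not flags
    v_set = re.split(regex2, word, flags=re.UNICODE)
    vv_set = [v for v in v_set if len(v) > 1]
    return len(vv_set)

v_cluster(u'haeshi')

1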

Mean vowel cluster frequency across languages
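The vowel-cluster count treats every character outside the consonant set as a vowel and splits the word on consonant runs. A sketch assuming Python 3, using a compiled pattern so that no flags need to be passed at call time; it also prints the integer value of `re.UNICODE`, which is what `re.split` would receive as `maxsplit` if the flag were passed as the third positional argument.

```python
import re

# Consonant runs act as separators; what remains are vowel stretches.
CONS_RUN = re.compile("[bcdfghjklmnpqrstvxz\u00f1]+")

def v_cluster(word):
    """Count stretches of two or more non-consonant characters."""
    chunks = CONS_RUN.split(word)
    return sum(1 for chunk in chunks if len(chunk) > 1)

print(v_cluster("haeshi"))  # "ae" is the only stretch longer than one
print(v_cluster("zhey"))    # "ey" counts, since "y" is not in the consonant set
print(int(re.UNICODE))      # the value a positional flag would pass as maxsplit
```

Since `y` and `w` are not in the consonant set, a word such as `zhey` is counted as containing a vowel cluster, which may or may not be the intended phonological analysis.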

Data from real languages

TDIL Assamese Corpus

Assamese corpus files

import os

directory = "/home/cr/Documents/NLPwP_pres/TDIL_assamese_corpus_data"
os.listdir(directory)

['subj_art2.txt', 'subj_politics1.txt', 'lit3.txt',
'drama.txt', 'religion2.txt', 'criticism2.txt',
'criticism1.txt', 'subj_science3.txt',
'ref_encyclopaedia-entry.txt', 'subj_science2.txt',
'subj_social-studies.txt', 'music.txt', 'subj_art1.txt',
'subj_science1.txt', 'subj_art3.txt', 'news.txt',
'subj_sociology.txt', 'criticism3.txt', 'lit8.txt',
'subj_history1.txt', 'lit4.txt', 'lit6.txt', 'religion3.txt',
'subj_law.txt', 'lit7.txt', 'religion1.txt', 'criticism4.txt',
'lit5.txt', 'subj_math.txt', 'lit1.txt', 'subj_science6.txt',
'subj_science_5.txt', 'subj_history2.txt', 'lit2.txt',
'subj_science4.txt', 'letter.txt']
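Loading every file in the listing into memory could look like the sketch below. It assumes Python 3 and assumes the corpus files are UTF-8 encoded, which the slides do not state; `read_corpus` is a hypothetical helper, and a throwaway directory with two made-up files stands in for the TDIL folder.

```python
import os
import tempfile

def read_corpus(directory, encoding="utf-8"):
    """Map each .txt file name in `directory` to its decoded text.
    The encoding is an assumption; adjust it to match the corpus."""
    texts = {}
    # os.listdir returns names in arbitrary order, so sort for stability.
    for name in sorted(os.listdir(directory)):
        if name.endswith(".txt"):
            with open(os.path.join(directory, name), encoding=encoding) as f:
                texts[name] = f.read()
    return texts

# Demo with a temporary directory in place of the real corpus folder.
with tempfile.TemporaryDirectory() as demo_dir:
    demo_files = [("lit5.txt", "\u09ad\u09be\u09b7\u09be"),
                  ("news.txt", "\u09b6\u09be\u09b8\u09a8")]
    for name, text in demo_files:
        with open(os.path.join(demo_dir, name), "w", encoding="utf-8") as f:
            f.write(text)
    corpus = read_corpus(demo_dir)

print(sorted(corpus))
```

With the real corpus, `corpus['lit5.txt']` would play the role of `assamese_sample_raw`.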

Assamese sample: ‘lit5.txt’

Frequency of the sound /x/ in ‘lit5.txt’

len(re.findall(ur'[\u09b6\u09b7\u09b8]', assamese_sample_raw, re.UNICODE))

1313

len(re.findall(ur'\u09b6', assamese_sample_raw, re.UNICODE))

298

len(re.findall(ur'\u09b7', assamese_sample_raw, re.UNICODE))

195

len(re.findall(ur'\u09b8', assamese_sample_raw, re.UNICODE))

820
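The three per-letter counts on this slide sum to the combined count (298 + 195 + 820 = 1313), and the same breakdown can be obtained in one pass over the text. A sketch assuming Python 3; `sibilant_counts` is a hypothetical name and the sample string is made up for illustration, not taken from the corpus.

```python
from collections import Counter

# The three letters from the slide's character class (U+09B6, U+09B7, U+09B8).
SIBILANTS = "\u09b6\u09b7\u09b8"

def sibilant_counts(text):
    """Count each of the three letters in a single pass over the text."""
    counts = Counter(ch for ch in text if ch in SIBILANTS)
    return {ch: counts[ch] for ch in SIBILANTS}

# Made-up sample; the real input would be assamese_sample_raw.
sample = ("\u09b8\u09be\u09b9\u09b8 "
          "\u09b6\u09be\u09b8\u09a8 "
          "\u09ad\u09be\u09b7\u09be")
counts = sibilant_counts(sample)
total = sum(counts.values())
print(counts, total)
```

Checking that the per-letter counts add up to the character-class count is a cheap sanity test for the regular expressions.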

Positional restrictions

Beginning a word:

len(re.findall(ur'\b[\u09b6\u09b7\u09b8]', assamese_sample_raw, re.UNICODE))

1129

Ending a word:

len(re.findall(ur'[\u09b6\u09b7\u09b8]\b', assamese_sample_raw, re.UNICODE))

895

Positional restrictions

Following /a/:

len(re.findall(ur'\u09be[\u09b6\u09b7\u09b8]', assamese_sample_raw, re.UNICODE))

57

Following /i/:

len(re.findall(ur'[\u09bf\u09c0][\u09b6\u09b7\u09b8]', assamese_sample_raw, re.UNICODE))

70

Following /u/:

len(re.findall(ur'[\u09c1\u09c2][\u09b6\u09b7\u09b8]', assamese_sample_raw, re.UNICODE))

10
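The three vowel-context queries follow one pattern, so they can be driven from a small table of contexts. A sketch assuming Python 3; `following_counts` and the `CONTEXTS` table are hypothetical names, and the sample string is made up for illustration.

```python
import re

SIBILANT = "[\u09b6\u09b7\u09b8]"

# Dependent vowel signs preceding the sibilant, grouped as on the slides.
CONTEXTS = {
    "a": "\u09be",
    "i": "[\u09bf\u09c0]",
    "u": "[\u09c1\u09c2]",
}

def following_counts(text):
    """Count sibilant letters directly preceded by each vowel-sign group."""
    return {label: len(re.findall(vowel + SIBILANT, text))
            for label, vowel in CONTEXTS.items()}

# Made-up sample; the real input would be assamese_sample_raw.
sample = ("\u09ad\u09be\u09b7\u09be "
          "\u09ac\u09bf\u09b6\u09c7\u09b7 "
          "\u09aa\u09c1\u09b7\u09cd\u09aa "
          "\u09ae\u09be\u09b8")
context_counts = following_counts(sample)
print(context_counts)
```

These patterns match only the dependent vowel signs used on the slides; independent vowel letters and the inherent vowel of a bare consonant are not covered, so the counts are lower bounds on the phonological contexts.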

Further work

- Incorporate segmental parameters into classifier (fix Unicode issues with NLTK's classify module)

- Use classifier to predict assignment of random words from Westeros to Dothraki, Astapori Valyrian, and High Valyrian languages

- Isolate most important word-internal parameters in classification model (log-likelihood ranking in Naive Bayes model)

- Use full distributional account of select Assamese consonants as priors in acoustic classification model
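As a starting point for the first item, the word-internal features could be packed into one dictionary per word, paired with a language label. This is only a sketch under stated assumptions: Python 3, a hypothetical `word_features` function, a hypothetical 0.5 threshold for binarizing the consonant proportion, and four labeled words taken from the top-10 lists purely for illustration.

```python
import re

CONS = "bcdfgjklmnpqrstvxz\u00f1"                          # set from Features 1 and 3
CLUSTER = re.compile("[bcdfghjklmnpqrstvxz\u00f1]{2,}")    # set from Feature 4

def word_features(word):
    """Pack a few word-internal features into a dict of discrete values."""
    c_prop = sum(1 for ch in word if ch in CONS) / len(word)
    return {
        "c_prop_high": c_prop >= 0.5,   # hypothetical threshold
        "c_coda": word[-1] in CONS,
        "c_clusters": len(CLUSTER.findall(word)),
    }

# Illustrative labels only; a real training set would use the full word lists.
labeled_words = [("anha", "Dothraki"), ("khal", "Dothraki"),
                 ("daor", "High Valyrian"), ("espo", "Astapori Valyrian")]
featuresets = [(word_features(w), lang) for w, lang in labeled_words]
for fs in featuresets:
    print(fs)
```

A list of `(feature dict, label)` pairs is the input shape that `nltk.NaiveBayesClassifier.train` expects. That classifier treats feature values as discrete categories, which is why the continuous proportion is reduced to a boolean here rather than passed through as a float.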

Thank you

Date post:	05-Dec-2014
Category:	Engineering
Upload:	vivian-s-zhang