Natural Language Processing(SupStat Inc)

Date post: 05-Dec-2014
Upload: vivian-s-zhang
Description:
SupStat Inc, Natural Language Processing, NYC data science academy
Is that Dothraki or Valyrian? and other NLP tasks with Python and NLTK Charlie Redmon | SupStat, Inc. August 18, 2014
Transcript
Page 1: Natural Language Processing(SupStat Inc)

Is that Dothraki or Valyrian? and other NLP tasks with Python and NLTK

Charlie Redmon | SupStat, Inc.

August 18, 2014

Page 2: Natural Language Processing(SupStat Inc)

Dothraki

Page 3: Natural Language Processing(SupStat Inc)

Astapori Valyrian

Page 4: Natural Language Processing(SupStat Inc)

High Valyrian

Page 5: Natural Language Processing(SupStat Inc)

Importing raw text

import codecs

dothraki_f = codecs.open(
    "/home/cr/Python/westeros/dothraki.txt",
    encoding='utf-8')
dothraki_raw = dothraki_f.read()
print dothraki_raw

Athchomar chomakaan, [zhey] khal vezhven. Azha
anhaan asshilat ... Itte oakah! Jadi, zhey Jora
Andahli. Khal vezhven. Ajjalan anha zalat vitiherat
yer hatif. Kash qoy qoyi thira disse. Hash shafka
zali addrivat mae, zhey Khaleesi? Ishish chare

...

Page 6: Natural Language Processing(SupStat Inc)

Text processing: Cleaning

import re

punct_re = re.compile(
    ur'[\.,;:\?!\u2014\u2019\u2026\[\]]',
    re.UNICODE)
dothraki_proc = punct_re.sub('', dothraki_raw)
dothraki_proc = dothraki_proc.lower()
print dothraki_proc

athchomar chomakaan zhey khal vezhven azha anhaan

asshilat itte oakah jadi zhey jora andahli khal

vezhven ajjalan anha zalat vitiherat yer hatif kash

qoy qoyi thira disse

...

Page 7: Natural Language Processing(SupStat Inc)

Text processing: Tokenizing

dothraki_tokens = re.split(ur'\s+', dothraki_proc)
dothraki_types = set(dothraki_tokens)
print dothraki_types

set([u'izzi', u'ale', u'morea', u'vesazhao',
u'yeri', u'ishish', u'dalen', u'vesazhae', u'yera',
u'afisi', u'rhae', u'mawizzi', u'vee', u'arrisse',
u'ti', u'ven', u'rizh', u'afichak', u'gache',
u'zigerek', u'zigereo', u'drivoe', u'maaz',
u'zigeree', u'ayyeyoon', u'maan', u'mahrazhi',
u'ma', u'vos', u'movekkhi', u'mahrazhis',
u'meshafka', u'qisi', u'sani', u'ville', u'vikeesi',
u'ifak', u'javrathi', u'zisa', u'chek', u'nem',
...
])
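The cleaning and tokenizing steps can be sketched in Python 3 as a single function. This is a minimal sketch rather than the presenter's code: `clean_and_tokenize` is a hypothetical name, the sample is the opening of the Dothraki text shown earlier, and `str.split()` stands in for `re.split` so that leading or trailing whitespace does not yield empty tokens.

```python
import re

# Same punctuation class as the slides: . , ; : ? ! plus the em dash (U+2014),
# right single quote (U+2019), ellipsis (U+2026), and square brackets.
PUNCT_RE = re.compile(r"[.,;:?!\u2014\u2019\u2026\[\]]")

def clean_and_tokenize(raw):
    """Strip punctuation, lowercase, and split on runs of whitespace."""
    cleaned = PUNCT_RE.sub("", raw).lower()
    # str.split() with no argument never returns empty strings.
    return cleaned.split()

sample = ("Athchomar chomakaan, [zhey] khal vezhven. "
          "Azha anhaan asshilat ... Itte oakah!")
tokens = clean_and_tokenize(sample)
types = set(tokens)
print(tokens)
print(len(types))
```

With `re.split(r'\s+', text)`, a string that starts or ends with whitespace produces an empty first or last token, which would then appear as a spurious type in the set.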

Page 8: Natural Language Processing(SupStat Inc)

Inspecting the lexical distribution in a text

from nltk import FreqDist

dothraki_freqdist = FreqDist(dothraki_tokens)
print dothraki_freqdist

<FreqDist: u'anha': 50, u'vos': 40, u'me': 39,
u'ma': 38, u'zhey': 29, u'mae': 27, u'anni': 26,
u'hash': 23, u'yer': 23, u'khal': 16,
u'khaleesi': 16, u'mori': 15, u'jin': 13,
u'kisha': 12, u'nem': 11, u'vo': 11, u'che': 10,
u'jini': 10, u'she': 10, ... >

dothraki_freqdist.plot(20, cumulative=True)
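For readers without NLTK, similar counts can be obtained from the standard library. The Python 3 sketch below swaps `FreqDist` for `collections.Counter` and computes the cumulative share of tokens covered by the top-ranked words, which is a normalized version of what a cumulative frequency plot shows. The helper name `cumulative_share` and the toy token list are assumptions for illustration, not data from the talk.

```python
from collections import Counter
from itertools import accumulate

def cumulative_share(tokens, n):
    """Return (word, cumulative fraction of all tokens) for the top n words."""
    counts = Counter(tokens)
    total = sum(counts.values())
    top = counts.most_common(n)
    running = accumulate(count for _, count in top)
    return [(word, cum / total) for (word, _), cum in zip(top, running)]

# Toy input, not the real corpus.
toy_tokens = "anha vos anha me anha vos zhey".split()
shares = cumulative_share(toy_tokens, 3)
for word, share in shares:
    print(word, round(share, 2))
```

A curve that rises steeply over the first few ranks indicates that a small set of function words accounts for most of the running text.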

Page 9: Natural Language Processing(SupStat Inc)

CFD of Dothraki words

Top 10: anha, vos, me, ma, zhey, mae, anni, hash, yer, khal

Page 10: Natural Language Processing(SupStat Inc)

Valyrian vocabulary distribution

Astapori Valyrian (Top 10):

ji, me, do, espo, si, mysa, eji, ez, ivetra, sa

High Valyrian (Top 10):

daor, se, issa, syt, ziry, hen, jemele, lue, yne, avy

Page 11: Natural Language Processing(SupStat Inc)

Feature 1: Consonant proportion

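def c_prop(word):
    # Count consonant letters (including \u00f1) in the word.
    c_num = 0
    for letter in u'bcdfgjklmnpqrstvxz\u00f1':
        c_num += word.count(letter)
    # float() avoids Python 2 integer division, which would return 0 here.
    return float(c_num) / len(word)

c_prop(u'z\u016bgusy')
0.5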

Page 12: Natural Language Processing(SupStat Inc)

Word-internal consonant proportions across languages

Page 13: Natural Language Processing(SupStat Inc)

Feature 2: Obstruent proportion

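def obstruent_prop(word):
    # Count obstruent letters in the word.
    obstruent_num = 0
    for letter in u'bcdfgjkpqstvxz':
        obstruent_num += word.count(letter)
    # float() avoids Python 2 integer division, which would return 0 here.
    return float(obstruent_num) / len(word)

obstruent_prop(u'\u012blvi')
0.25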
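Features 1 and 2 differ only in the letter set, so they can share one helper. The following Python 3 sketch makes that explicit; `letter_prop` and the constant names are hypothetical, and the letter sets are copied from the slides. In Python 3, `/` is true division, whereas under Python 2 the same expression needs a float operand (or `from __future__ import division`) to return 0.5 rather than 0.

```python
CONSONANTS = "bcdfgjklmnpqrstvxz\u00f1"
OBSTRUENTS = "bcdfgjkpqstvxz"

def letter_prop(word, letters):
    """Fraction of characters in `word` that belong to `letters`."""
    if not word:
        # Assumption: an empty token contributes 0.0 instead of raising.
        return 0.0
    return sum(word.count(letter) for letter in letters) / len(word)

c_value = letter_prop("z\u016bgusy", CONSONANTS)    # z, g, s out of 6 letters
obs_value = letter_prop("\u012blvi", OBSTRUENTS)    # v out of 4 letters
print(c_value, obs_value)
```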

Page 14: Natural Language Processing(SupStat Inc)

Word-internal obstruent proportions across languages

Page 15: Natural Language Processing(SupStat Inc)

Feature 3: Coda presence

def c_coda(word):
    if word[-1] in u'bcdfgjklmnpqrstvxz\u00f1':
        return 1
    else:
        return 0

def obstruent_coda(word):
    if word[-1] in u'bcdfgjkpqstvxz':
        return 1
    else:
        return 0

c_coda(u'lysoon')
1

obstruent_coda(u'lysoon')
0

Page 16: Natural Language Processing(SupStat Inc)

Mean coda consonant presence across languages

Page 17: Natural Language Processing(SupStat Inc)

Mean coda obstruent presence across languages

Page 18: Natural Language Processing(SupStat Inc)

Feature 4: Consonant clusters

regex = ur'[bcdfghjklmnpqrstvxz\u00f1][bcdfghjklmnpqrstvxz\u00f1]+'

def c_cluster(word):
    cc_set = re.findall(regex, word, re.UNICODE)
    return len(cc_set)

c_cluster(u'avvirsosh')
3

Page 19: Natural Language Processing(SupStat Inc)

Mean consonant cluster frequency across languages

Page 20: Natural Language Processing(SupStat Inc)

Feature 5: Obstruent clusters

regex1 = ur'[bcdfghjkpqstvxz][bcdfghjkpqstvxz]+'

def obs_cluster(word):
    oo_set = re.findall(regex1, word, re.UNICODE)
    return len(oo_set)

obs_cluster(u'avvirsosh')
2

Page 21: Natural Language Processing(SupStat Inc)

Mean obstruent cluster frequency across languages

Page 22: Natural Language Processing(SupStat Inc)

Feature 6: Vowel clusters

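regex2 = ur'[bcdfghjklmnpqrstvxz\u00f1]+'

def v_cluster(word):
    # Split on consonant runs; what remains are the vowel runs.
    # flags must be a keyword: the third positional argument of re.split is maxsplit.
    v_set = re.split(regex2, word, flags=re.UNICODE)
    vv_set = [v for v in v_set if len(v) > 1]
    return len(vv_set)

v_cluster(u'haeshi')
1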
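The three cluster features can also be written as `re.findall` counts, which avoids `re.split` altogether. This Python 3 sketch is an alternative formulation, not the presenter's code: `cluster_counts` is a hypothetical name, and the vowel pattern treats every non-consonant character as a vowel, so it assumes tokens have already been stripped of punctuation and whitespace.

```python
import re

CONS = "bcdfghjklmnpqrstvxz\u00f1"  # consonant letters from the Feature 4 regex
OBS = "bcdfghjkpqstvxz"             # obstruent letters from the Feature 5 regex

CC_RE = re.compile("[%s]{2,}" % CONS)   # two or more adjacent consonants
OO_RE = re.compile("[%s]{2,}" % OBS)    # two or more adjacent obstruents
VV_RE = re.compile("[^%s]{2,}" % CONS)  # two or more adjacent non-consonants

def cluster_counts(word):
    """Count consonant, obstruent, and vowel clusters in one word."""
    return {
        "c_cluster": len(CC_RE.findall(word)),
        "obs_cluster": len(OO_RE.findall(word)),
        "v_cluster": len(VV_RE.findall(word)),
    }

print(cluster_counts("avvirsosh"))
print(cluster_counts("haeshi"))
```

One reason to prefer `findall` here: the signature of `re.split` is `re.split(pattern, string, maxsplit=0, flags=0)`, so a flag passed as the third positional argument is read as `maxsplit`.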

Page 23: Natural Language Processing(SupStat Inc)

Mean vowel cluster frequency across languages

Page 24: Natural Language Processing(SupStat Inc)

Data from real languages

Page 25: Natural Language Processing(SupStat Inc)

TDIL Assamese Corpus

Page 26: Natural Language Processing(SupStat Inc)

TDIL Assamese Corpus

Page 27: Natural Language Processing(SupStat Inc)

Assamese corpus files

import os

directory = "/home/cr/Documents/NLPwP_pres/TDIL_assamese_corpus_data"
os.listdir(directory)

['subj_art2.txt', 'subj_politics1.txt', 'lit3.txt',
'drama.txt', 'religion2.txt', 'criticism2.txt',
'criticism1.txt', 'subj_science3.txt',
'ref_encyclopaedia-entry.txt', 'subj_science2.txt',
'subj_social-studies.txt', 'music.txt', 'subj_art1.txt',
'subj_science1.txt', 'subj_art3.txt', 'news.txt',
'subj_sociology.txt', 'criticism3.txt', 'lit8.txt',
'subj_history1.txt', 'lit4.txt', 'lit6.txt', 'religion3.txt',
'subj_law.txt', 'lit7.txt', 'religion1.txt', 'criticism4.txt',
'lit5.txt', 'subj_math.txt', 'lit1.txt', 'subj_science6.txt',
'subj_science_5.txt', 'subj_history2.txt', 'lit2.txt',
'subj_science4.txt', 'letter.txt']

Page 28: Natural Language Processing(SupStat Inc)

Assamese sample: ‘lit5.txt’

Page 29: Natural Language Processing(SupStat Inc)

Frequency of the sound /x/ in ‘lit5.txt’

len(re.findall(ur'[\u09b6\u09b7\u09b8]',
    assamese_sample_raw, re.UNICODE))
1313

len(re.findall(ur'\u09b6', assamese_sample_raw,
    re.UNICODE))
298

len(re.findall(ur'\u09b7', assamese_sample_raw,
    re.UNICODE))
195

len(re.findall(ur'\u09b8', assamese_sample_raw,
    re.UNICODE))
820

Page 30: Natural Language Processing(SupStat Inc)

Positional restrictions

Beginning a word:

len(re.findall(ur'\b[\u09b6\u09b7\u09b8]',
    assamese_sample_raw, re.UNICODE))
1129

Ending a word:

len(re.findall(ur'[\u09b6\u09b7\u09b8]\b',
    assamese_sample_raw, re.UNICODE))

895
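One caveat when reading these two counts: Python's `re` module defines `\w` in terms of alphanumeric characters, so dependent vowel signs such as U+09BE do not count as word characters, and `\b` can match inside a written word between a consonant and its vowel sign. The Python 3 sketch below contrasts `\b` with whitespace-based lookarounds on a two-word sample. The sample string and the names `X_LETTERS` and `positional_counts` are illustrative assumptions, and the lookaround version assumes punctuation has already been removed.

```python
import re

X_LETTERS = "\u09b6\u09b7\u09b8"  # the three letters counted on the slides

INITIAL_RE = re.compile(r"(?<!\S)[%s]" % X_LETTERS)  # nothing but space before
FINAL_RE = re.compile(r"[%s](?!\S)" % X_LETTERS)     # nothing but space after

def positional_counts(text):
    """Return (word-initial, word-final) counts using whitespace boundaries."""
    return len(INITIAL_RE.findall(text)), len(FINAL_RE.findall(text))

# Two whitespace-separated words: the first begins with U+09B8 followed by the
# vowel sign U+09BE; the second ends with U+09B6 preceded by U+09BE.
sample = "\u09b8\u09be\u0997\u09f0 \u0986\u0995\u09be\u09b6"

initial, final = positional_counts(sample)
b_initial = len(re.findall(r"\b[%s]" % X_LETTERS, sample))
b_final = len(re.findall(r"[%s]\b" % X_LETTERS, sample))
print(initial, final, b_initial, b_final)
```

On this sample the `\b` patterns count each letter as both word-initial and word-final, while the lookaround patterns count one of each. Whether this affects the figures on the slide depends on the corpus text, which is not available here.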

Page 31: Natural Language Processing(SupStat Inc)

Positional restrictions

Following /a/:

len(re.findall(ur'\u09be[\u09b6\u09b7\u09b8]',
    assamese_sample_raw, re.UNICODE))
57

Following /i/:

len(re.findall(ur'[\u09bf\u09c0][\u09b6\u09b7\u09b8]',
    assamese_sample_raw, re.UNICODE))
70

Following /u/:

len(re.findall(ur'[\u09c1\u09c2][\u09b6\u09b7\u09b8]',
    assamese_sample_raw, re.UNICODE))

10

Page 32: Natural Language Processing(SupStat Inc)

Further work

- Incorporate segmental parameters into classifier (fix Unicode issues with NLTK's classify module)

- Use classifier to predict assignment of random words from Westeros to Dothraki, Astapori Valyrian, and High Valyrian languages

- Isolate most important word-internal parameters in classification model (log-likelihood ranking in Naive Bayes model)

- Use full distributional account of select Assamese consonants as priors in acoustic classification model
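As a possible starting point for the first item, the six word-internal features could be bundled into one feature dictionary per word. The Python 3 sketch below is an assumption about how that might look, not the presenter's classifier: `word_features` and `bin_prop` are hypothetical names, the function expects a non-empty token, and the proportions are binned into coarse integer labels because NLTK's Naive Bayes classifier treats feature values as discrete.

```python
import re

CONS = "bcdfgjklmnpqrstvxz\u00f1"  # letter set used by Features 1 and 3
OBS = "bcdfgjkpqstvxz"             # letter set used by Features 2 and 3
CLUSTER_CONS = CONS + "h"          # the cluster regexes on the slides include h
CLUSTER_OBS = OBS + "h"

def bin_prop(p, n_bins=4):
    """Map a proportion in [0, 1] to an integer label in 0..n_bins-1."""
    return min(int(p * n_bins), n_bins - 1)

def word_features(word):
    """Build a discrete feature dictionary for one non-empty token."""
    n = len(word)
    return {
        "c_prop": bin_prop(sum(word.count(c) for c in CONS) / n),
        "obs_prop": bin_prop(sum(word.count(c) for c in OBS) / n),
        "c_coda": int(word[-1] in CONS),
        "obs_coda": int(word[-1] in OBS),
        "c_cluster": len(re.findall("[%s]{2,}" % CLUSTER_CONS, word)),
        "obs_cluster": len(re.findall("[%s]{2,}" % CLUSTER_OBS, word)),
        "v_cluster": len(re.findall("[^%s]{2,}" % CLUSTER_CONS, word)),
    }

features = word_features("avvirsosh")
print(features)
```

A list of `(word_features(w), label)` pairs is the input shape that `nltk.NaiveBayesClassifier.train` expects; the number of bins is a free choice that would need tuning against held-out words.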

Page 33: Natural Language Processing(SupStat Inc)

Thank you

