+ All Categories
Home > Documents > Programming for Linguists An Introduction to Python 22/12/2011.

Programming for Linguists An Introduction to Python 22/12/2011.

Date post: 02-Apr-2015
Category:
Upload: camryn-lamkin
View: 229 times
Download: 0 times
Share this document with a friend
Popular Tags:
28
Programming for Linguists An Introduction to Python 22/12/2011
Transcript
Page 1: Programming for Linguists An Introduction to Python 22/12/2011.

Programming for Linguists

An Introduction to Python22/12/2011

Page 2: Programming for Linguists An Introduction to Python 22/12/2011.

Ex. 1)Read in the texts of the State of the Union addresses, using the state_union corpus reader. Count occurrences of “men”, “women”, and “people” in each document. What has happened to the usage of these words over time?

Feedback

Page 3: Programming for Linguists An Introduction to Python 22/12/2011.

import nltk

from nltk.corpus import state_union

cfd = nltk.ConditionalFreqDist((fileid, word) for fileid in state_union.fileids( ) for word in state_union.words(fileids = fileid))

fileids = state_union.fileids( )

search_words = ["men", "women", "people"]

cfd.tabulate(conditions = fileids, samples = search_words)

Page 4: Programming for Linguists An Introduction to Python 22/12/2011.

Ex 2)According to Strunk and White's Elements of Style, the word “however”, used at the start of a sentence, means "in whatever way" or "to whatever extent", and not "nevertheless". They give this example of correct usage: However you advise him, he will probably do as he thinks best. Use the concordance tool to study actual usage of this word in 5 NLTK texts.

Page 5: Programming for Linguists An Introduction to Python 22/12/2011.

import nltk

from nltk.book import *

texts = [text1, text2, text3, text4, text5]

for text in texts:

print text.concordance("however")

Page 6: Programming for Linguists An Introduction to Python 22/12/2011.

Ex 3)Create a corpus of your own of minimum 10 files containing text fragments. You can take texts of your own, the internet,…Write a program that investigates the usage of modal verbs in this corpus using the frequency distribution tool and plot the 10 most frequent words.

Page 7: Programming for Linguists An Introduction to Python 22/12/2011.

import nltk

from nltk.corpus import PlaintextCorpusReader

corpus_root = “/Users/claudia/my_corpus”

#corpus_root = “C:\Users\...”

my_corpus = PlaintextCorpusReader(corpus_root, '.*’)

words = my_corpus.words( )

cfd = nltk.ConditionalFreqDist((fileid, word) for fileid in my_corpus.fileids( ) for word in my_corpus.words(fileid))

Page 8: Programming for Linguists An Introduction to Python 22/12/2011.

fileids = my_corpus.fileids( )modals = ['can', 'could', 'may',

'might', 'must', 'will’cfd.tabulate(conditions = fileids,

samples = modals)fd = nltk.FreqDist(words)all_tokens = fd.keys( )for t in all_tokens:

if re.match(r'[^a-zA-Z0-9]+', t):all_tokens.remove(t)

most_frequent=all_tokens[:10]most_frequent.plot( )

Page 9: Programming for Linguists An Introduction to Python 22/12/2011.

Ex 1)Choose a website. Read it in in Python using the urlopen function, remove all HTML mark-up and tokenize it. Make a frequency dictionary of all words ending with ‘ing’ and sort it on its values (decreasingly).

Ex 2) Write the raw text of the text in the previous exercise to an output file.

Page 10: Programming for Linguists An Introduction to Python 22/12/2011.

import nltk

import re

url= “website”

from urllib import urlopen

htmltext= urlopen(url).read( )

rawtext= nltk.clean_html(htmltext)

rawtext2= rawtext.lower( )

tokens= nltk.wordpunct_tokenize(rawtext2)

my_text= nltk.Text(tokens)

wordlist_ing= [w for w in tokens if re.search(r'^.*ing$',w)]

Page 11: Programming for Linguists An Introduction to Python 22/12/2011.

freq_dict= { }

for word in wordlist_ing:

if word not in freq_dict:

freq_dict[word] = 1

else:

freq_dict[word] = freq_dict[word]+1

from operator import itemgetter

sorted_wordlist_ing = sorted(freq_dict.iteritems(), key= itemgetter(1), reverse=True)

Page 12: Programming for Linguists An Introduction to Python 22/12/2011.

Ex 2)

output_file = open(“dir/output.txt","w")

output_file.write(str(rawtext2)+"\n")

output_file.close( )

Page 13: Programming for Linguists An Introduction to Python 22/12/2011.

Ex 3)Write a script that performs the same classification task as we saw today using word bigrams as features instead of single words.

Page 14: Programming for Linguists An Introduction to Python 22/12/2011.

Loading your own corpus in NLTK with no subcategories:

import nltk

from nltk.corpus import PlaintextCorpusReader

loc = “/Users/claudia/my_corpus” #Mac

loc = “C:\Users\claudia\my_corpus” #Windows 7

my_corpus = PlaintextCorpusReader(loc, “.*”)

Some Mentioned Issues

Page 15: Programming for Linguists An Introduction to Python 22/12/2011.

Loading your own corpus in NLTK with subcategories:

import nltk

from nltk.corpus import CategorizedPlaintextCorpusReader

loc=“/Users/claudia/my_corpus” #Mac

loc=“C:\Users\claudia\my_corpus” #Windows 7

my_corpus = CategorizedPlaintextCorpusReader(loc, '(?!\.svn).*\.txt', cat_pattern=

r'(cat1|cat2)/.*')

Page 16: Programming for Linguists An Introduction to Python 22/12/2011.

Dispersion Plotdetermine the location of a word in the text: how many

words from the beginning it appears

Page 17: Programming for Linguists An Introduction to Python 22/12/2011.

ExercisesWrite a program that reads a file,

breaks each line into words, strips whitespace and punctuation from the text, and converts the words to lowercase.You can get a list of all punctuation marks by:import stringprint string.punctuation

Page 18: Programming for Linguists An Introduction to Python 22/12/2011.

import nltk, string

def strip(filepath):f = open(filepath, ‘r’)text = f.read( )

tokens = nltk.wordpunct_tokenize(text)for token in tokens:

token = token.lower( )if token in string.punctuation:

tokens.remove(token)

return tokens

Page 19: Programming for Linguists An Introduction to Python 22/12/2011.

If you want to analyse a text, but filter out a stop list first (e.g. containing “the”, “and”,…), you need to make 2 dictionaries: 1 with all words from your text and 1 with all words from the stop list. Then you need to subtract the 2nd from the 1st. Write a function subtract(d1, d2) which takes dictionaries d1 and d2 and returns a new dictionary that contains all the keys from d1 that are not in d2. You can set the values to None.

Page 20: Programming for Linguists An Introduction to Python 22/12/2011.

def subtract(d1, d2): d3 = { }for key in d1.keys():

if key not in d2:d3[key] = None

return d3

Page 21: Programming for Linguists An Introduction to Python 22/12/2011.

Let’s try it out:

import nltk

from nltk.book import *

from nltk.corpus import stopwords

d1 = { }

for word in text7:

d1[word] = None

Page 22: Programming for Linguists An Introduction to Python 22/12/2011.

wordlist = stopwords.words(“english”)

d2 = { }

for word in wordlist:d2[word] = None

rest_dict = subtract(d1, d2)

wordlist_min_stopwords=rest_dict.keys( )

Page 23: Programming for Linguists An Introduction to Python 22/12/2011.

Questions?

Page 24: Programming for Linguists An Introduction to Python 22/12/2011.

Evaluation AssignmentDeadline = 23/01/2012

Conversation in the week of 23/01/12

If you need any explanation about the content of the assignment, feel free to e-mail me

Page 25: Programming for Linguists An Introduction to Python 22/12/2011.

Further ReadingSince this was only a short

introduction to programming in Python, if you want to expand your programming skills further: see chapters 15 – 18 about object-oriented programming

Page 26: Programming for Linguists An Introduction to Python 22/12/2011.

Think Python. How to Think Like a Computer Scientist?

NLTK book

Official Python documentation:http://www.python.org/doc/

There is a newer version of Python available, but it is not (yet) compatible with NLTK

Page 27: Programming for Linguists An Introduction to Python 22/12/2011.

Our research group:CLiPS: Computational Linguistics and Psycholinguistics Research Center

http://www.clips.ua.ac.be/

Our projects:http://www.clips.ua.ac.be/projects

Page 28: Programming for Linguists An Introduction to Python 22/12/2011.

Happy holidays and success with your exams


Recommended