NLTK & Python: Day 9
LING 681.02: Computational Linguistics
Harry Howard, Tulane University
14-Sept-2009 LING 681.02, Prof. Howard, Tulane University
Course organization
NLTK should be installed on the computers in this room!
NLPP §3 Processing raw text
§3.1 Accessing text from the Web and from disk
Using e-books
Download an e-book (it is of type 'str')
Tokenization: break up the string into words and punctuation
Convert to NLTK text
Remove headers
Download an e-book
>>> from urllib import urlopen
>>> url = "http://www.gutenberg.org/files/2554/2554.txt"
>>> raw = urlopen(url).read()
>>> type(raw)
<type 'str'>
>>> len(raw)
1176831
>>> raw[:75]
'The Project Gutenberg EBook of Crime and Punishment, by Fyodor Dostoevsky\r\n'
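The transcript above is Python 2: urlopen lives in the top-level urllib module and read() returns a str. In Python 3 the function moved to urllib.request and returns bytes that must be decoded. A minimal sketch of the modern equivalent; the decoding step is demonstrated on sample bytes so no network access is needed:

```python
from urllib.request import urlopen

def fetch_text(url):
    """Fetch a URL and decode the response bytes to str (Python 3)."""
    with urlopen(url) as resp:
        return resp.read().decode("utf-8")

# The decoding step, shown on sample bytes (no network required):
sample = b"The Project Gutenberg EBook of Crime and Punishment\r\n"
text = sample.decode("utf-8")
```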
Tokenize
>>> import nltk
>>> tokens = nltk.word_tokenize(raw)
>>> type(tokens)
<type 'list'>
>>> len(tokens)
255809
>>> tokens[:10]
['The', 'Project', 'Gutenberg', 'EBook', 'of', 'Crime', 'and', 'Punishment', ',', 'by']
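nltk.word_tokenize applies a trained tokenizer with many special cases. As a rough illustration of what "break up the string into words and punctuation" means (not NLTK's actual algorithm), a crude regex approximation:

```python
import re

def rough_tokenize(text):
    """Split text into word tokens and single punctuation tokens (crude sketch)."""
    return re.findall(r"\w+|[^\w\s]", text)

tokens = rough_tokenize("Crime and Punishment, by Fyodor Dostoevsky")
```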
Convert to NLTK text
>>> text = nltk.Text(tokens)
>>> type(text)
<type 'nltk.text.Text'>
>>> text[1020:1060]
['CHAPTER', 'I', 'On', 'an', 'exceptionally', 'hot', 'evening', 'early', 'in', 'July', 'a', 'young', 'man', 'came', 'out', 'of', 'the', 'garret', 'in', 'which', 'he', 'lodged', 'in', 'S', '.', 'Place', 'and', 'walked', 'slowly', ',', 'as', 'though', 'in', 'hesitation', ',', 'towards', 'K', '.', 'bridge', '.']
>>> text.collocations()
Katerina Ivanovna; Pulcheria Alexandrovna; Avdotya Romanovna; Pyotr Petrovitch; Project Gutenberg; Marfa Petrovna; Rodion Romanovitch; Sofya Semyonovna; Nikodim Fomitch; did not; Hay Market; etc.
Remove headers
>>> raw.find("PART I")
5303
>>> raw.rfind("End of Project Gutenberg's Crime")
1157681
>>> raw = raw[5303:1157681]
>>> raw.find("PART I")
0
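The same find/rfind slicing works on any string. A self-contained sketch using a made-up miniature e-book (the two markers are the ones used above):

```python
# A toy stand-in for a downloaded Project Gutenberg file.
raw = ("The Project Gutenberg EBook of Crime and Punishment\r\n"
       "*** some license boilerplate ***\r\n"
       "PART I\r\n"
       "CHAPTER I\r\n"
       "On an exceptionally hot evening...\r\n"
       "End of Project Gutenberg's Crime and Punishment\r\n")

start = raw.find("PART I")                            # first line of the body
end = raw.rfind("End of Project Gutenberg's Crime")   # start of the trailer
body = raw[start:end]                                 # header and trailer removed
```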
Dealing with HTML
Download a web page (it is of type 'str')
Tokenization: break up the string into words and punctuation
Convert to NLTK text
Remove headers
Download a web page
>>> url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
>>> html = urlopen(url).read()
>>> html[:60]
'<!doctype html public "-//W3C//DTD HTML 4.0 Transitional//EN'
>>> print html   # only if you want to see the HTML code
Tokenize
>>> raw = nltk.clean_html(html)
>>> tokens = nltk.word_tokenize(raw)
>>> tokens
['BBC', 'NEWS', '|', 'Health', '|', 'Blondes', "'", 'to', 'die', 'out', ...]
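Note that nltk.clean_html was removed in NLTK 3; later editions of the book recommend a dedicated HTML library (see the Beautiful Soup slide below). As a minimal stand-in, the standard library's html.parser can collect the text content of a page while ignoring tags, a sketch:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the text content of an HTML document, ignoring tags."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def strip_tags(html):
    """Return the tag-free text of an HTML string, whitespace-normalized."""
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(" ".join(parser.chunks).split())

plain = strip_tags("<h1>BBC NEWS | Health</h1><p>Blondes 'to die out'</p>")
```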
Convert to NLTK text
>>> tokens = tokens[96:399]
>>> text = nltk.Text(tokens)
>>> text.concordance('gene')
they say too few people now carry the gene for blondes to last beyond the next tw
t blonde hair is caused by a recessive gene . In order for a child to have blonde
to have blonde hair , it must have the gene on both sides of the family in the gra
there is a disadvantage of having that gene or by chance . They don ' t disappear
ondes would disappear is if having the gene was a disadvantage and I do not think
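concordance() just prints a fixed-width window of context around each hit. A toy version over a token list shows the idea (a sketch, not NLTK's implementation):

```python
def concordance(tokens, word, window=3):
    """Return the context window around each occurrence of word (case-insensitive)."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == word.lower():
            hits.append(" ".join(tokens[max(0, i - window):i + window + 1]))
    return hits

tokens = ["the", "recessive", "gene", "is", "carried", "by", "the", "gene", "pool"]
lines = concordance(tokens, "gene")
```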
Remove headers
Trial and error.
For more sophisticated processing of HTML
Use the Beautiful Soup package, available from http://www.crummy.com/software/BeautifulSoup/
Other Internet formats
Search engine results
Feeds/RSS
Search engine results
Advantages: large size; easy to do
Disadvantages: the search engine restricts search patterns; results vary according to time and place; content may be duplicated
Search engine API
What is the Google AJAX Search API?
The Google AJAX Search API lets you put Google Search in your web pages with JavaScript. You can embed a simple, dynamic search box and display search results in your own web pages, or use the results in innovative, programmatic ways.
http://code.google.com/apis/ajaxsearch/
RSS
What is it?
Use the Universal Feed Parser from http://feedparser.org/ to access the content of a blog, as in the following example.
RSS example
>>> import feedparser
>>> llog = feedparser.parse("http://languagelog.ldc.upenn.edu/nll/?feed=atom")
>>> llog['feed']['title']
u'Language Log'
>>> len(llog.entries)
15
>>> post = llog.entries[2]
>>> post.title
u"He's My BF"
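feedparser is a third-party package, but under the hood an Atom feed is just XML, so for illustration the same fields can be pulled out with the standard library's xml.etree.ElementTree. A sketch on an inline sample feed (the entries are made up, not real Language Log data):

```python
import xml.etree.ElementTree as ET

# Atom elements live in this XML namespace.
ATOM = "{http://www.w3.org/2005/Atom}"

sample_feed = """<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Language Log</title>
  <entry><title>He's My BF</title></entry>
  <entry><title>Another post</title></entry>
</feed>"""

root = ET.fromstring(sample_feed)
feed_title = root.find(ATOM + "title").text
entry_titles = [e.find(ATOM + "title").text for e in root.findall(ATOM + "entry")]
```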
RSS example, cont.
>>> content = post.content[0].value
>>> content[:70]
u'<p>Today I was chatting with three of our visiting graduate students f'
>>> nltk.word_tokenize(nltk.clean_html(content))
[u'Today', u'I', u'was', u'chatting', u'with', u'three', u'of', u'our', u'visiting', u'graduate', u'students', u'from', u'the', u'PRC', u'.', u'Thinking', u'that', u'I', u'was', u'being', u'au', u'courant', u',', u'I', u'mentioned', u'the', u'expression', u'DUI4XIANG4', u'\u5c0d\u8c61', u'("', u'boy', u'/', u'girl', u'friend', u'"', ...]
Reading local files
Plain text or ASCII
Binary formats
User input
Plain text or ASCII files
Use the functions from §2 that involve open(), repeated on the next slide.
Loading your own corpus: Table 2.3

Example             Description
abspath(fileid)     the location of the file on disk
encoding(fileid)    the encoding of the file (if known)
open(fileid)        open a stream for reading the given corpus file
root()              the path to the root of a locally installed corpus
readme()            the contents of the README file of the corpus
Your turn, p. 84
Create a file called document.txt using a text editor: type in a few lines of text and save it as plain text. If you are using IDLE, select the New Window command in the File menu, type the required text into this window, and then save the file as document.txt in the directory that IDLE offers in the pop-up dialogue box.
Next, in the Python interpreter, open the file with f = open('document.txt'), then inspect its contents with print f.read().
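The same exercise can be scripted end to end. A sketch that writes document.txt into a temporary directory and reads it back (Python 3 spelling, where print is a function):

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "document.txt")

    # Create the file, as the exercise asks.
    with open(path, "w") as f:
        f.write("A few lines of text.\nSaved as plain text.\n")

    # Open it and inspect its contents.
    with open(path) as f:
        contents = f.read()
```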
Text from binary
Reading a PDF file as plain text yields mostly unreadable binary data. The file begins with its header and a compressed stream object:
%PDF-1.2
2 0 obj <</Length 3205 /Filter /FlateDecode>> stream
...(undecodable compressed bytes)...
Text from binary
Open it with third-party libraries such as pypdf or pywin32, or
open it with the corresponding program and save it as text, or
if it is on the Internet, see if Google has an HTML version.
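A quick way to tell whether a file is such a binary format before trying to read it as text is to check its magic bytes; PDF files begin with %PDF, as the header shown above illustrates. A small sketch:

```python
def looks_like_pdf(data):
    """Return True if the byte string starts with the PDF magic number."""
    return data.startswith(b"%PDF-")

pdf_bytes = b"%PDF-1.2\n...compressed stream follows..."
text_bytes = b"The Project Gutenberg EBook of..."
```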
User input
Use raw_input() to capture what the user has typed or pasted (in Python 3, this function is called input()).
Summary: NLP pipeline (Fig. 3.1)
Next time
NLPP §3.2 Strings
NLPP §3.3 RegEx