NLTK & Python Day 9 LING 681.02 Computational Linguistics Harry Howard Tulane University.

NLTK & Python
Day 9

LING 681.02
Computational Linguistics

Harry Howard
Tulane University

14-Sept-2009 LING 681.02, Prof. Howard, Tulane University

Course organization

NLTK should be installed on the computers in this room!

NLPP §3 Processing raw text

§3.1 Accessing text from the Web and from disk

Using e-books

Download an e-book
    it is of type 'str'

Tokenization: break up the string into words and punctuation

Convert to NLTK text

Remove headers

Download an e-book

>>> from urllib import urlopen
>>> url = "http://www.gutenberg.org/files/2554/2554.txt"
>>> raw = urlopen(url).read()
>>> type(raw)
<type 'str'>
>>> len(raw)
1176831
>>> raw[:75]
'The Project Gutenberg EBook of Crime and Punishment, by Fyodor Dostoevsky\r\n'
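The session above is Python 2, where urlopen lives directly in urllib. As a minimal sketch of the same step under Python 3 (an assumption about the reader's interpreter, not something the slides cover), urlopen is imported from urllib.request and read() returns bytes that have to be decoded into a string. The helper name download_text and the UTF-8 default are illustrative choices, not part of the original material.

```python
from urllib.request import urlopen


def download_text(url, encoding="utf-8"):
    """Fetch a URL and return its body as a str (hypothetical helper)."""
    with urlopen(url) as response:
        raw_bytes = response.read()  # bytes in Python 3, not str
    # errors="replace" keeps going if a few bytes do not fit the assumed encoding
    return raw_bytes.decode(encoding, errors="replace")


# Example call (needs network access; the URL is the one from the slide
# and may have moved since 2009):
#   raw = download_text("http://www.gutenberg.org/files/2554/2554.txt")
#   print(type(raw), len(raw))
#   print(raw[:75])
```

The decoding step is the main difference from the slide: the length reported by len() then counts characters rather than bytes, so it need not match the 1176831 shown above.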

Tokenize

>>> import nltk
>>> tokens = nltk.word_tokenize(raw)
>>> type(tokens)
<type 'list'>
>>> len(tokens)
255809
>>> tokens[:10]
['The', 'Project', 'Gutenberg', 'EBook', 'of', 'Crime', 'and', 'Punishment', ',', 'by']
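To make the idea of tokenization concrete without installing anything, here is a rough sketch that uses only the standard library. The regular expression is, as far as I know, the one NLTK's WordPunctTokenizer is built on; it is a simplified stand-in for nltk.word_tokenize, which handles contractions and some punctuation differently. The function name simple_tokenize is made up for this example.

```python
import re

# Runs of word characters, or runs of punctuation that is not whitespace
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]+")


def simple_tokenize(text):
    """Split text into word and punctuation tokens (illustrative helper)."""
    return TOKEN_PATTERN.findall(text)


sample = "The Project Gutenberg EBook of Crime and Punishment, by Fyodor Dostoevsky\r\n"
tokens = simple_tokenize(sample)
print(type(tokens))
print(tokens[:10])
```

On this short sample the first ten tokens agree with the slide; on a whole book the two tokenizers will give different counts.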

Convert to NLTK text

>>> text = nltk.Text(tokens)
>>> type(text)
<type 'nltk.text.Text'>
>>> text[1020:1060]
['CHAPTER', 'I', 'On', 'an', 'exceptionally', 'hot', 'evening', 'early',
 'in', 'July', 'a', 'young', 'man', 'came', 'out', 'of', 'the', 'garret',
 'in', 'which', 'he', 'lodged', 'in', 'S', '.', 'Place', 'and', 'walked',
 'slowly', ',', 'as', 'though', 'in', 'hesitation', ',', 'towards', 'K',
 '.', 'bridge', '.']
>>> text.collocations()
Katerina Ivanovna; Pulcheria Alexandrovna; Avdotya Romanovna; Pyotr
Petrovitch; Project Gutenberg; Marfa Petrovna; Rodion Romanovitch;
Sofya Semyonovna; Nikodim Fomitch; did not; Hay Market; etc.

Remove headers

>>> raw.find("PART I")
5303
>>> raw.rfind("End of Project Gutenberg's Crime")
1157681
>>> raw = raw[5303:1157681]
>>> raw.find("PART I")
0
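The find/rfind/slice steps above can be wrapped in a small function, sketched here in Python 3. The function name strip_boilerplate and the sample string are invented for illustration. One assumption is added that the slide does not make: because find() and rfind() return -1 when a marker is missing, the sketch raises an error rather than slicing silently at the wrong place.

```python
def strip_boilerplate(raw, start_marker, end_marker):
    """Return raw from the first start_marker up to the last end_marker."""
    start = raw.find(start_marker)   # index of first occurrence, -1 if absent
    end = raw.rfind(end_marker)      # index of last occurrence, -1 if absent
    if start == -1 or end == -1 or end < start:
        raise ValueError("markers not found in the expected order")
    return raw[start:end]


# Made-up miniature of a Project Gutenberg file
sample = (
    "Gutenberg header\r\n"
    "PART I\r\n"
    "On an exceptionally hot evening early in July\r\n"
    "End of Project Gutenberg's Crime and Punishment\r\n"
    "licence text"
)
body = strip_boilerplate(sample, "PART I", "End of Project Gutenberg's Crime")
print(body.find("PART I"))  # prints 0: the text now starts at the marker
```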

Dealing with HTML

Download a web page
    it is of type 'str'

Tokenization: break up the string into words and punctuation

Convert to NLTK text

Remove headers

Download a web page

>>> url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
>>> html = urlopen(url).read()
>>> html[:60]
'<!doctype html public "-//W3C//DTD HTML 4.0 Transitional//EN'
>>> print html  # only if you want to see the HTML code

Tokenize

>>> raw = nltk.clean_html(html)
>>> tokens = nltk.word_tokenize(raw)
>>> tokens
['BBC', 'NEWS', '|', 'Health', '|', 'Blondes', "'", 'to', 'die', 'out', ...]
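For readers whose NLTK version no longer provides a working clean_html (to my knowledge, NLTK 3 dropped it and points users to Beautiful Soup instead), the following is a minimal sketch of tag stripping with Python 3's standard html.parser module. It is a stand-in, not the NLTK implementation; the class and function names are invented, and the sample page is a made-up fragment rather than the real BBC article.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Collapse the runs of whitespace left behind by removed markup
        return " ".join(" ".join(self._chunks).split())


def strip_html(html):
    """Return the visible text of an HTML string (illustrative helper)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


page = (
    "<html><head><title>BBC NEWS | Health</title>"
    "<style>p {color: red}</style></head>"
    "<body><p>Blondes 'to die out'</p></body></html>"
)
print(strip_html(page))
```

The result is a plain string, so it can be passed to a tokenizer in the same way as raw above.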

Convert to NLTK text

>>> tokens = tokens[96:399]
>>> text = nltk.Text(tokens)
>>> text.concordance('gene')
 they say too few people now carry the gene for blondes to last beyond the next tw
t blonde hair is caused by a recessive gene . In order for a child to have blonde
to have blonde hair , it must have the gene on both sides of the family in the gra
there is a disadvantage of having that gene or by chance . They don ' t disappear
ondes would disappear is if having the gene was a disadvantage and I do not think
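To show what a concordance does, here is a simplified sketch in plain Python 3. It is not how nltk.Text.concordance is implemented: this version counts context in tokens, whereas the NLTK output above is cut to a fixed character width. The function name, the width parameter and the sample sentence handling are illustrative assumptions.

```python
def concordance(tokens, target, width=3):
    """Return (left context, match, right context) for each occurrence of target."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == target.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines


tokens = "blonde hair is caused by a recessive gene . In order for a child".split()
for left, word, right in concordance(tokens, "gene"):
    # Right-align the left context so the matches line up in one column
    print(f"{left:>20} {word} {right}")
```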

Remove headers

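Trial and error: inspect the token list to find the indexes where the article text starts and ends, as with tokens[96:399] on the previous slide.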

For more sophisticated processing of HTML

Use the Beautiful Soup package, available from:
http://www.crummy.com/software/BeautifulSoup/

Other Internet formats

Search engine results

Feeds/RSS

Search engine results

Advantages
    large size
    easy to do

Disadvantages
    search engine restricts patterns
    results vary according to time and place
    content may be duplicated

Search engine API

What is the Google AJAX Search API?

The Google AJAX Search API lets you put Google Search in your web pages with JavaScript.

You can embed a simple, dynamic search box and display search results in your own web pages or use the results in innovative, programmatic ways.

http://code.google.com/apis/ajaxsearch/

RSS

What is it?

Use the Universal Feed Parser from http://feedparser.org/ to access the content of a blog, as in the following example.

RSS example

>>> import feedparser
>>> llog = feedparser.parse("http://languagelog.ldc.upenn.edu/nll/?feed=atom")
>>> llog['feed']['title']
u'Language Log'
>>> len(llog.entries)
15
>>> post = llog.entries[2]
>>> post.title
u"He's My BF"

RSS example, cont.

>>> content = post.content[0].value
>>> content[:70]
u'<p>Today I was chatting with three of our visiting graduate students f'
>>> nltk.word_tokenize(nltk.clean_html(content))
[u'Today', u'I', u'was', u'chatting', u'with', u'three', u'of', u'our',
 u'visiting', u'graduate', u'students', u'from', u'the', u'PRC', u'.',
 u'Thinking', u'that', u'I', u'was', u'being', u'au', u'courant', u',',
 u'I', u'mentioned', u'the', u'expression', u'DUI4XIANG4',
 u'\u5c0d\u8c61', u'("', u'boy', u'/', u'girl', u'friend', u'"', ...]
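To illustrate what a feed parser extracts, the sketch below reads an Atom document with Python 3's standard xml.etree.ElementTree instead of feedparser. It is a stand-in under stated assumptions: the feed is a tiny hand-written sample, not real Language Log data, and the function name read_atom is invented. Real feeds vary far more than this, which is the reason to prefer a dedicated library such as the Universal Feed Parser.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"  # Atom 1.0 XML namespace

# Hand-written sample feed, for illustration only
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example Blog</title>
  <entry>
    <title>First post</title>
    <content type="html">&lt;p&gt;Hello, feed readers.&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>Second post</title>
    <content type="html">&lt;p&gt;More text.&lt;/p&gt;</content>
  </entry>
</feed>"""


def read_atom(xml_text):
    """Return (feed title, [(entry title, entry content), ...])."""
    root = ET.fromstring(xml_text)
    feed_title = root.findtext(ATOM + "title")
    entries = [
        (entry.findtext(ATOM + "title"), entry.findtext(ATOM + "content"))
        for entry in root.findall(ATOM + "entry")
    ]
    return feed_title, entries


title, entries = read_atom(SAMPLE_FEED)
print(title, len(entries))
print(entries[0])
```

As in the slide, the entry content comes back as an HTML string, so markup still has to be removed before tokenizing.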

Reading local files

Plain text or ASCII

Binary formats

User input

Plain text or ASCII files

Use the functions from §2 that involve open(); they are repeated on the next slide.

Loading your own corpus
Table 2.3

Example             Description
abspath(fileid)     the location of the file on disk
encoding(fileid)    the encoding of the file (if known)
open(fileid)        open a stream for reading the given corpus file
root()              the path to the root of the locally installed corpus
readme()            the contents of the README file of the corpus

Your turn, p. 84

Create a file called document.txt using a text editor, type in a few lines of text, and save it as plain text. If you are using IDLE, select the New Window command in the File menu, type the required text into this window, and then save the file as document.txt inside the directory that IDLE offers in the pop-up dialogue box.

Next, in the Python interpreter, open the file using f = open('document.txt'), then inspect its contents using print f.read().
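The exercise is written for Python 2. A minimal Python 3 sketch of the same steps follows; so that it runs on its own, it first writes document.txt into a temporary directory (an assumption made for this example, since the exercise asks you to create the file by hand), and it names the encoding explicitly.

```python
from pathlib import Path
import tempfile

# Create the file programmatically so the example is self-contained
workdir = Path(tempfile.mkdtemp())
path = workdir / "document.txt"
path.write_text(
    "Time flies like an arrow.\nFruit flies like a banana.\n",
    encoding="utf-8",
)

# The with-statement closes the file even if reading fails
with open(path, encoding="utf-8") as f:
    contents = f.read()
print(contents)

# Iterating over the file object yields one line at a time
with open(path, encoding="utf-8") as f:
    lines = [line.strip() for line in f]
print(lines)
```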

Text from binary

Plain text of a PDF file:

%PDF-1.2
2 0 obj
<</Length 3205/Filter /FlateDecode>>
stream
[compressed binary data, displayed as unreadable characters]

Text from binary

Open it with third-party libraries such as pypdf or pywin32, or

open it with the corresponding program and save it as text, or

if it is on the Internet, see if Google has an HTML version.
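Before choosing one of these routes it can help to check whether a file is a PDF at all. The sketch below, in Python 3, relies on the fact that PDF files begin with the bytes %PDF-, as in the dump on the previous slide. The function name looks_like_pdf and the demo files are invented for illustration, and the check says nothing about whether the rest of the file is valid.

```python
import os
import tempfile


def looks_like_pdf(path):
    """Return True if the file starts with the PDF signature bytes."""
    with open(path, "rb") as f:  # binary mode: no text decoding is attempted
        return f.read(5) == b"%PDF-"


# Two throwaway files for demonstration
demo_dir = tempfile.mkdtemp()
pdf_path = os.path.join(demo_dir, "sample.pdf")
txt_path = os.path.join(demo_dir, "document.txt")
with open(pdf_path, "wb") as f:
    f.write(b"%PDF-1.2\n2 0 obj\n")  # header only, not a complete PDF
with open(txt_path, "wb") as f:
    f.write(b"Plain text, safe to tokenize.\n")

print(looks_like_pdf(pdf_path), looks_like_pdf(txt_path))
```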

User input

Use raw_input() to capture what the user has typed or pasted.
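In Python 3, raw_input() was renamed input(). The sketch below assumes Python 3 and wraps the call in a small function whose reader can be swapped out, so the example runs without a keyboard; the function name ask_for_text and the read parameter are illustrative, not part of the course material.

```python
def ask_for_text(prompt="Enter some text: ", read=input):
    """Prompt the user and return the typed line split on whitespace.

    read defaults to the built-in input(); passing another callable
    makes the function usable unattended, for example in tests.
    """
    line = read(prompt)
    return line.split()


# Simulated user input so the example runs without waiting for a keyboard
words = ask_for_text(read=lambda prompt: "On an exceptionally hot evening early in July")
print("You typed", len(words), "words.")
```

In an interactive session, calling ask_for_text() with no arguments waits for the user to type a line and press Enter.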

Summary: NLP pipeline
Fig. 3.1

Next time

NLPP §3.2 Strings

NLPP §3.3 RegEx

