NLTK & Python: Day 9
LING 681.02: Computational Linguistics
Harry Howard, Tulane University
14-Sept-2009 LING 681.02, Prof. Howard, Tulane University
Course organization
NLTK should be installed on the computers in this room!
NLPP §3 Processing raw text
§3.1 Accessing text from the Web and from disk
Using e-books
Download an e-book (it is of type 'str')
Tokenization: break up the string into words and punctuation
Convert to NLTK text
Remove headers
Download an e-book
>>> from urllib import urlopen
>>> url = "http://www.gutenberg.org/files/2554/2554.txt"
>>> raw = urlopen(url).read()
>>> type(raw)
<type 'str'>
>>> len(raw)
1176831
>>> raw[:75]
'The Project Gutenberg EBook of Crime and Punishment, by Fyodor Dostoevsky\r\n'
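The transcript above is Python 2: urlopen lives in the top-level urllib module and read() returns a str. In Python 3 the function moved to urllib.request and returns bytes that must be decoded. A minimal sketch of the modern equivalent; the decoding step is demonstrated on sample bytes so no network access is needed:

```python
from urllib.request import urlopen

def fetch_text(url):
    """Fetch a URL and decode the response bytes to str (Python 3)."""
    with urlopen(url) as resp:
        return resp.read().decode("utf-8")

# The decoding step, shown on sample bytes (no network required):
sample = b"The Project Gutenberg EBook of Crime and Punishment\r\n"
text = sample.decode("utf-8")
```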
Tokenize
>>> import nltk
>>> tokens = nltk.word_tokenize(raw)
>>> type(tokens)
<type 'list'>
>>> len(tokens)
255809
>>> tokens[:10]
['The', 'Project', 'Gutenberg', 'EBook', 'of', 'Crime', 'and', 'Punishment', ',', 'by']
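nltk.word_tokenize applies a trained tokenizer with many special cases. As a rough illustration of what "break up the string into words and punctuation" means (not NLTK's actual algorithm), a crude regex approximation:

```python
import re

def rough_tokenize(text):
    """Split text into word tokens and single punctuation tokens (crude sketch)."""
    return re.findall(r"\w+|[^\w\s]", text)

tokens = rough_tokenize("Crime and Punishment, by Fyodor Dostoevsky")
```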
Convert to NLTK text
>>> text = nltk.Text(tokens)
>>> type(text)
<type 'nltk.text.Text'>
>>> text[1020:1060]
['CHAPTER', 'I', 'On', 'an', 'exceptionally', 'hot', 'evening', 'early', 'in', 'July', 'a', 'young', 'man', 'came', 'out', 'of', 'the', 'garret', 'in', 'which', 'he', 'lodged', 'in', 'S', '.', 'Place', 'and', 'walked', 'slowly', ',', 'as', 'though', 'in', 'hesitation', ',', 'towards', 'K', '.', 'bridge', '.']
>>> text.collocations()
Katerina Ivanovna; Pulcheria Alexandrovna; Avdotya Romanovna; Pyotr Petrovitch; Project Gutenberg; Marfa Petrovna; Rodion Romanovitch; Sofya Semyonovna; Nikodim Fomitch; did not; Hay Market; etc.
Remove headers
>>> raw.find("PART I")
5303
>>> raw.rfind("End of Project Gutenberg's Crime")
1157681
>>> raw = raw[5303:1157681]
>>> raw.find("PART I")
0
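The same find/rfind slicing works on any string. A self-contained sketch using a made-up miniature e-book (the two markers are the ones used above):

```python
# A toy stand-in for a downloaded Project Gutenberg file.
raw = ("The Project Gutenberg EBook of Crime and Punishment\r\n"
       "*** some license boilerplate ***\r\n"
       "PART I\r\n"
       "CHAPTER I\r\n"
       "On an exceptionally hot evening...\r\n"
       "End of Project Gutenberg's Crime and Punishment\r\n")

start = raw.find("PART I")                            # first line of the body
end = raw.rfind("End of Project Gutenberg's Crime")   # start of the trailer
body = raw[start:end]                                 # header and trailer removed
```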
Dealing with HTML
Download a web page (it is of type 'str')
Tokenization: break up the string into words and punctuation
Convert to NLTK text
Remove headers
Download a web page
>>> url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
>>> html = urlopen(url).read()
>>> html[:60]
'<!doctype html public "-//W3C//DTD HTML 4.0 Transitional//EN'
>>> print html   # only if you want to see the HTML code
Tokenize
>>> raw = nltk.clean_html(html)
>>> tokens = nltk.word_tokenize(raw)
>>> tokens
['BBC', 'NEWS', '|', 'Health', '|', 'Blondes', "'", 'to', 'die', 'out', ...]
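Note that nltk.clean_html was removed in NLTK 3; later editions of the book recommend a dedicated HTML library (see the Beautiful Soup slide below). As a minimal stand-in, the standard library's html.parser can collect the text content of a page while ignoring tags, a sketch:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the text content of an HTML document, ignoring tags."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def strip_tags(html):
    """Return the tag-free text of an HTML string, whitespace-normalized."""
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(" ".join(parser.chunks).split())

plain = strip_tags("<h1>BBC NEWS | Health</h1><p>Blondes 'to die out'</p>")
```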
Convert to NLTK text
>>> tokens = tokens[96:399]
>>> text = nltk.Text(tokens)
>>> text.concordance('gene')
they say too few people now carry the gene for blondes to last beyond the next tw
t blonde hair is caused by a recessive gene . In order for a child to have blonde
to have blonde hair , it must have the gene on both sides of the family in the gra
there is a disadvantage of having that gene or by chance . They don ' t disappear
ondes would disappear is if having the gene was a disadvantage and I do not think
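concordance() just prints a fixed-width window of context around each hit. A toy version over a token list shows the idea (a sketch, not NLTK's implementation):

```python
def concordance(tokens, word, window=3):
    """Return the context window around each occurrence of word (case-insensitive)."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == word.lower():
            hits.append(" ".join(tokens[max(0, i - window):i + window + 1]))
    return hits

tokens = ["the", "recessive", "gene", "is", "carried", "by", "the", "gene", "pool"]
lines = concordance(tokens, "gene")
```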
Remove headers
Trial and error.
For more sophisticated processing of HTML
Use the Beautiful Soup package, available from http://www.crummy.com/software/BeautifulSoup/
Other Internet formats
Search engine results
Feeds/RSS
Search engine results
Advantages: large size; easy to do
Disadvantages: the search engine restricts search patterns; results vary according to time and place; content may be duplicated
Search engine API
What is the Google AJAX Search API?
The Google AJAX Search API lets you put Google Search in your web pages with JavaScript. You can embed a simple, dynamic search box and display search results in your own web pages, or use the results in innovative, programmatic ways.
http://code.google.com/apis/ajaxsearch/
RSS
What is it?
Use the Universal Feed Parser from http://feedparser.org/ to access the content of a blog, as in the following example.
RSS example
>>> import feedparser
>>> llog = feedparser.parse("http://languagelog.ldc.upenn.edu/nll/?feed=atom")
>>> llog['feed']['title']
u'Language Log'
>>> len(llog.entries)
15
>>> post = llog.entries[2]
>>> post.title
u"He's My BF"
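feedparser is a third-party package, but under the hood an Atom feed is just XML, so for illustration the same fields can be pulled out with the standard library's xml.etree.ElementTree. A sketch on an inline sample feed (the entries are made up, not real Language Log data):

```python
import xml.etree.ElementTree as ET

# Atom elements live in this XML namespace.
ATOM = "{http://www.w3.org/2005/Atom}"

sample_feed = """<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Language Log</title>
  <entry><title>He's My BF</title></entry>
  <entry><title>Another post</title></entry>
</feed>"""

root = ET.fromstring(sample_feed)
feed_title = root.find(ATOM + "title").text
entry_titles = [e.find(ATOM + "title").text for e in root.findall(ATOM + "entry")]
```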
RSS example, cont.
>>> content = post.content[0].value
>>> content[:70]
u'<p>Today I was chatting with three of our visiting graduate students f'
>>> nltk.word_tokenize(nltk.clean_html(content))
[u'Today', u'I', u'was', u'chatting', u'with', u'three', u'of', u'our', u'visiting', u'graduate', u'students', u'from', u'the', u'PRC', u'.', u'Thinking', u'that', u'I', u'was', u'being', u'au', u'courant', u',', u'I', u'mentioned', u'the', u'expression', u'DUI4XIANG4', u'\u5c0d\u8c61', u'("', u'boy', u'/', u'girl', u'friend', u'"', ...]
Reading local files
Plain text or ASCII
Binary formats
User input
Plain text or ASCII files
Use the functions from §2 that involve open(), repeated on the next slide.
Loading your own corpus: Table 2.3

Example             Description
abspath(fileid)     the location of the file on disk
encoding(fileid)    the encoding of the file (if known)
open(fileid)        open a stream for reading the given corpus file
root()              the path to the root of a locally installed corpus
readme()            the contents of the README file of the corpus
Your turn, p. 84
Create a file called document.txt using a text editor: type in a few lines of text and save it as plain text. If you are using IDLE, select the New Window command in the File menu, type the required text into this window, and then save the file as document.txt in the directory that IDLE offers in the pop-up dialogue box.
Next, in the Python interpreter, open the file with f = open('document.txt'), then inspect its contents with print f.read().
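The same exercise can be scripted end to end. A sketch that writes document.txt into a temporary directory and reads it back (Python 3 spelling, where print is a function):

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "document.txt")

    # Create the file, as the exercise asks.
    with open(path, "w") as f:
        f.write("A few lines of text.\nSaved as plain text.\n")

    # Open it and inspect its contents.
    with open(path) as f:
        contents = f.read()
```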
Text from binary
Reading a PDF file as plain text yields mostly unreadable binary data. The file begins with its header and a compressed stream object:
%PDF-1.2
2 0 obj <</Length 3205 /Filter /FlateDecode>> stream
...(undecodable compressed bytes)...
Text from binary
Open it with third-party libraries such as pypdf or pywin32, or
open it with the corresponding program and save it as text, or
if it is on the Internet, see if Google has an HTML version.
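A quick way to tell whether a file is such a binary format before trying to read it as text is to check its magic bytes; PDF files begin with %PDF, as the header shown above illustrates. A small sketch:

```python
def looks_like_pdf(data):
    """Return True if the byte string starts with the PDF magic number."""
    return data.startswith(b"%PDF-")

pdf_bytes = b"%PDF-1.2\n...compressed stream follows..."
text_bytes = b"The Project Gutenberg EBook of..."
```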
User input
Use raw_input() to capture what the user has typed or pasted (in Python 3, this function is called input()).
Summary: NLP pipeline (Fig. 3.1)
Next time
NLPP §3.2 Strings
NLPP §3.3 RegEx