Copyright (C) 2007, http://www.dabeaz.com
Python in Action
(Part II - Systems Programming)

David M. Beazley
http://www.dabeaz.com

Presented at the USENIX LISA Conference, November 16, 2007
Section Overview

• In this section, we're going to get dirty
• Systems Programming
• Files, I/O, file-system
• Text parsing, data decoding
• Processes and IPC
• Networking
• Threads and concurrency
Commentary
• I personally think Python is a fantastic tool for systems programming.
• Modules provide access to most of the major system libraries I used to access via C
• No enforcement of "morality"
• Decent performance
• It just "works" and it's fun
Approach
• I've thought long and hard about how I would present this part of the class.
• A reference manual approach would probably be long and very boring.
• So instead, we're going to focus on building something more in tune with the times
"To Catch a Slacker"

• Write a collection of Python programs that can quietly monitor Firefox browser caches to find out who has been spending their day reading Slashdot instead of working on their TPS reports.

• Oh yeah, and be a real sneaky bugger about it.
Why this Problem?
• Involves a real-world system and data
• Firefox already installed on your machine (?)
• Cross platform (Linux, Mac, Windows)
• Example of tool building
• Related to a variety of practical problems
• A good tour of "Python in Action"
Disclaimers

• I am not involved in browser forensics (or spyware for that matter).
• I am in no way affiliated with Firefox/Mozilla nor have I ever seen Firefox source code
• I have never worked with the cache data prior to preparing this tutorial
• I have never used any third-party tools for looking at this data.
More Disclaimers

• All of the code in this tutorial works with a standard Python installation
• No third party modules.
• All code is cross-platform
• Code samples are available online at http://www.dabeaz.com/action/

• Please look at that code and follow along
Assumptions
• This is not a tutorial on systems concepts
• You should be generally familiar with background material (files, filesystems, file formats, processes, threads, networking, protocols, etc.)
• Hopefully you can "extrapolate" from the material presented here to construct more advanced Python applications.
The Big Picture
• We want to write a tool that allows someone to locate, inspect, and perform queries across a distributed collection of Firefox caches.
• For example, the cache directories on all machines on the LAN of a quasi-evil corporation.
The Firefox Cache

• The Firefox browser keeps a disk cache of recently visited sites

% ls Cache/
-rw------- 1 beazley  111169 Sep 25 17:15 01CC0844d01
-rw------- 1 beazley  104991 Sep 25 17:15 01CC3844d01
-rw------- 1 beazley   47233 Sep 24 16:41 021F221Ad01
...
-rw------- 1 beazley   26749 Sep 21 11:19 FF8AEDF0d01
-rw------- 1 beazley   58172 Sep 25 18:16 FFE628C6d01
-rw------- 1 beazley 1939456 Sep 25 19:14 _CACHE_001_
-rw------- 1 beazley 2588672 Sep 25 19:14 _CACHE_002_
-rw------- 1 beazley 4567040 Sep 25 18:44 _CACHE_003_
-rw------- 1 beazley   33044 Sep 23 21:58 _CACHE_MAP_

• A bunch of cryptically named files.
Problem : Finding Files

• Find the Firefox cache

Write a program findcache.py that takes a directory name as input and recursively scans that directory and all subdirectories looking for Firefox/Mozilla cache directories.

• Example:

% python findcache.py /Users/beazley
/Users/beazley/Library/.../qs1ab616.default/Cache
/Users/beazley/Library/.../wxuoyiuf.slt/Cache
%

• Use case: Searching for things on the filesystem.
findcache.py

# findcache.py
# Recursively scan a directory looking for
# Firefox/Mozilla cache directories

import sys
import os

if len(sys.argv) != 2:
    print >>sys.stderr,"Usage: python findcache.py dirname"
    raise SystemExit(1)

caches = (path for path,dirs,files in os.walk(sys.argv[1])
               if '_CACHE_MAP_' in files)

for name in caches:
    print name
The sys module

The sys module has basic information related to the execution environment.

sys.argv      # A list of the command line options
sys.stdin     # Standard input
sys.stdout    # Standard output
sys.stderr    # Standard error

Example: sys.argv = ['findcache.py', '/Users/beazley']
Program Termination

SystemExit exception

Forces Python to exit. The value is the return code.
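A minimal sketch of this behavior (modern Python 3 syntax): the integer passed to SystemExit becomes the process return code, shown here by raising it inside a child interpreter.

```python
import subprocess
import sys

# Raise SystemExit(3) in a child interpreter; the parent
# observes 3 as the child's process return code.
rc = subprocess.call([sys.executable, "-c", "raise SystemExit(3)"])
print(rc)  # -> 3
```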
os Module

The os module contains useful OS related functions (files, processes, etc.)
os.walk()

os.walk(topdir)

Recursively walks a directory tree and generates a sequence of tuples (path,dirs,files):

path  = The current directory name
dirs  = List of all subdirectory names in path
files = List of all regular files (data) in path
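A runnable sketch of the same pattern (Python 3 syntax, on a throwaway directory tree; the helper name find_dirs_with is ours, not part of findcache.py):

```python
import os
import tempfile

def find_dirs_with(topdir, marker):
    # Collect every directory under topdir whose file list contains marker
    return [path for path, dirs, files in os.walk(topdir)
            if marker in files]

# Build a tiny fake profile tree to walk over
top = tempfile.mkdtemp()
cache = os.path.join(top, "profile", "Cache")
os.makedirs(cache)
open(os.path.join(cache, "_CACHE_MAP_"), "w").close()

# Only the Cache directory contains the marker file
print(find_dirs_with(top, "_CACHE_MAP_") == [cache])  # -> True
```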
A Sequence of Caches

caches = (path for path,dirs,files in os.walk(sys.argv[1])
               if '_CACHE_MAP_' in files)

This statement generates a sequence of directory names in which '_CACHE_MAP_' is contained in the file list. The expression before the for produces the directory name; the trailing if clause is the file name check.
Printing the Result

for name in caches:
    print name

This prints the sequence of cache directories generated by the previous statement.
Commentary
• Our solution is strongly based on a "declarative" programming style (again)
• We simply write out a sequence of operations that produce what we want
• Not focused on the underlying mechanics of how to traverse all of the directories.
Big Idea : Iteration

• Python allows iteration to be captured as a kind of object.

caches = (path for path,dirs,files in os.walk(sys.argv[1])
               if '_CACHE_MAP_' in files)

• This de-couples iteration from the code that uses the iteration

for name in caches:
    print name

• Another usage example:

for name in caches:
    print len(os.listdir(name)), name
Big Idea : Iteration

• Compare to this:

for path,dirs,files in os.walk(sys.argv[1]):
    if '_CACHE_MAP_' in files:
        print len(os.listdir(path)), path

• This code is simple, but the loop and the code that executes in the loop body are coupled together

• Not as flexible, but this is somewhat subtle to wrap your brain around at first.
Mini-Reference : sys, os

• sys module

sys.argv          # List of command line options
sys.stdin         # Standard input
sys.stdout        # Standard output
sys.stderr        # Standard error
sys.executable    # Full path of Python executable
sys.exc_info()    # Information on current exception

• os module

os.walk(dir)      # Recursively walk dir producing a
                  # sequence of tuples (path,dlist,flist)
os.listdir(dir)   # Return a list of all files in dir

• SystemExit exception

raise SystemExit(n)   # Exit with integer code n
Problem: Searching for Text

• Extract all URL requests from the cache

Write a program requests.py that scans the contents of the _CACHE_00n_ files and prints a list of URLs for documents stored in the cache.

• Example:

% python requests.py /Users/.../qs1ab616.default/Cache
http://www.yahoo.com/
http://us.js2.yimg.com/us.yimg.com/a/1-/java/promotions/js/ad_eo_1.1.js
http://us.i1.yimg.com/us.yimg.com/i/ww/atty_hsi_free.gif
http://us.i1.yimg.com/us.yimg.com/i/ww/thm/1/search_1.1.png
...
%
• Use case: Searching the contents of files for text patterns.
The Firefox Cache

• The cache directory holds two types of data
• Metadata (URLs, headers, etc.).
• Raw data (HTML, JPEG, PNG, etc.)
• This data is stored in two places
• Cryptic files in the Cache directory
• Blocks inside the _CACHE_00n_ files
• Metadata almost always in _CACHE_00n_
Possible Solution : Regex

• The _CACHE_00n_ files are encoded in a binary format, but URLs are embedded inside as null-terminated text:
\x00\x01\x00\x08\x92\x00\x02\x18\x00\x00\x00\x13F\xff\x9f\xceF\xff\x9f\xce\x00\x00\x00\x00\x00\x00H)\x00\x00\x00\x1a\x00\x00\x023HTTP:http://slashdot.org/\x00request-method\x00GET\x00request-User-Agent\x00Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en-US; rv:1.8.1.7) Gecko/20070914 Firefox/2.0.0.7\x00request-Accept-Encoding\x00gzip,deflate\x00response-head\x00HTTP/1.1 200 OK\r\nDate: Sun, 30 Sep 2007 13:07:29 GMT\r\nServer: Apache/1.3.37 (Unix) mod_perl/1.29\r\nSLASH_LOG_DATA: shtml\r\nX-Powered-By: Slash 2.005000176\r\nX-Fry: How can I live my life if I can't tell good from evil?\r\nCache-Control:
• Maybe the requests could just be ripped using a regular expression.
A Regex Solution

# requests.py
import re
import os
import sys

cachedir = sys.argv[1]
cachefiles = [ '_CACHE_001_', '_CACHE_002_', '_CACHE_003_' ]

# A regex for embedded URL strings
request_pat = re.compile(r'([a-z]+://.*?)\x00')

# Loop over all files and search for URLs
for name in cachefiles:
    data = open(os.path.join(cachedir,name),"rb").read()
    index = 0
    while True:
        m = request_pat.search(data,index)
        if not m: break
        print m.group(1)
        index = m.end()
The re module
Contains all functionality related to regular expression pattern matching, searching, replacing, etc.
Features are strongly influenced by Perl, but regexs are not directly integrated into the Python language.
Using re

Patterns are first specified as strings and compiled into a regex object:

pat = re.compile(pattern [,flags])

The pattern syntax is "standard":

pat*   pat+   pat?   (pat)   .
pat1|pat2   [chars]   [^chars]   pat{n}   pat{n,m}
Using re

All subsequent operations are methods of the compiled regex pattern:

m = pat.match(data [,start])    # Check for match
m = pat.search(data [,start])   # Search for match
newdata = pat.sub(repl, data)   # Pattern replace
Searching for Matches

pat.search(text [,start])

Searches the string text for the first occurrence of the regex pattern starting at position start. Returns a MatchObject if a match is found.

In requests.py, we're finding matches one at a time in a loop.
Match Objects

Regex matches are represented by a MatchObject:

m.group([n])   # Text matched by group n
m.start([n])   # Starting index of group n
m.end([n])     # End index of group n

In requests.py, m.group(1) is the matching text for just the URL, and m.end() marks the end of the match.
Groups

In patterns, parentheses () define groups, which are numbered left to right:

group 0   # The entire pattern
group 1   # Text in first ()
group 2   # Text in next ()
...
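For instance (a throwaway date pattern, not from the tutorial code; Python 3 syntax):

```python
import re

# Group 0 is the entire match; groups 1 and 2 are the parenthesized parts.
m = re.match(r'(\d+)/(\d+)', '09/25/2007')
print(m.group(0))  # -> '09/25'
print(m.group(1))  # -> '09'
print(m.group(2))  # -> '25'
```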
Mini-Reference : re

• re pattern compilation

pat = re.compile(r'patternstring')

• Pattern syntax

literal     # Match literal text
pat*        # Match 0 or more repetitions of pat
pat+        # Match 1 or more repetitions of pat
pat?        # Match 0 or 1 repetitions of pat
pat1|pat2   # Match pat1 or pat2
(pat)       # Match pat (group)
[chars]     # Match characters in chars
[^chars]    # Match characters not in chars
.           # Match any character except \n
\d          # Match any digit
\w          # Match alphanumeric character
\s          # Match whitespace
Mini-Reference : re

• Common pattern operations

pat.search(text)     # Search text for a match
pat.match(text)      # Search start of text for match
pat.sub(repl,text)   # Replace pattern with repl

• Match objects

m.group([n])   # Text matched by group n
m.start([n])   # Starting position of group n
m.end([n])     # Ending position of group n

• How to loop over all matches of a pattern

for m in pat.finditer(text):
    # m is a MatchObject that you process
    ...
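As a sketch, finditer() can replace the manual search/index loop used in requests.py (sample data made up here; Python 3 syntax with a text string rather than bytes):

```python
import re

request_pat = re.compile(r'([a-z]+://.*?)\x00')
data = 'xx\x00http://slashdot.org/\x00yy\x00ftp://example.org/f\x00'

# Each MatchObject's group(1) is the URL without the trailing null
urls = [m.group(1) for m in request_pat.finditer(data)]
print(urls)  # -> ['http://slashdot.org/', 'ftp://example.org/f']
```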
Mini-Reference : re

• An example of pattern replacement

# This replaces American dates of the form 'mm/dd/yyyy'
# with European dates of the form 'dd/mm/yyyy'.

# This function takes a MatchObject as input and returns
# replacement text as output.  (group() returns strings,
# so %s formatting is used.)

def euro_date(m):
    month = m.group(1)
    day   = m.group(2)
    year  = m.group(3)
    return "%s/%s/%s" % (day,month,year)

# Date re pattern and replacement operation
datepat = re.compile(r'(\d+)/(\d+)/(\d+)')
newdata = datepat.sub(euro_date,text)
Mini-Reference : re
• There are many more features of the re module
• Strongly influenced by Perl (feature set)
• Regexs are a library in Python, not integrated into the language.
• A book on regular expressions may be essential for advanced functions.
File Handling

What is going on in this statement?

data = open(os.path.join(cachedir,name),"rb").read()
os.path module

os.path has portable file related functions:

os.path.join(name1,name2,...)   # Join path names
os.path.getsize(filename)       # Get the file size
os.path.getmtime(filename)      # Get modification date
There are many more functions, but this is the preferred module for basic filename handling
os.path.join()

Creates a fully-expanded pathname:

dirname  = '/foo/bar'
filename = 'name'
os.path.join(dirname,filename)   # -> '/foo/bar/name'
Aware of platform differences ('/' vs. '\')
Mini-Reference : os.path
os.path.join(s1,s2,...)   # Join pathname parts together
os.path.getsize(path)     # Get file size of path
os.path.getmtime(path)    # Get modify time of path
os.path.getatime(path)    # Get access time of path
os.path.getctime(path)    # Get creation time of path
os.path.exists(path)      # Check if path exists
os.path.isfile(path)      # Check if regular file
os.path.isdir(path)       # Check if directory
os.path.islink(path)      # Check if symbolic link
os.path.basename(path)    # Return file part of path
os.path.dirname(path)     # Return dir part of path
os.path.abspath(path)     # Get absolute path
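A quick sketch of the name-manipulation calls (hypothetical path parts, Python 3 syntax):

```python
import os.path

# join() uses the platform separator; basename()/dirname() split it back
p = os.path.join('profile', 'Cache', '_CACHE_001_')
print(os.path.basename(p))  # -> '_CACHE_001_'
print(os.path.dirname(p) == os.path.join('profile', 'Cache'))  # -> True
```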
Binary I/O
For all binary files, use modes "rb","wb", etc.
Disables new-line translation (critical on Windows)
Common I/O Shortcuts
# Read an entire file into a string
data = open(filename).read()

# Write a string out to a file
open(filename,"w").write(text)

# Loop over all lines in a file
for line in open(filename):
    ...
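The same shortcuts, demonstrated on a throwaway file (Python 3 syntax):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.txt")

# Write a string out to a file
open(path, "w").write("line1\nline2\n")

# Read an entire file into a string
print(open(path).read() == "line1\nline2\n")  # -> True

# Loop over all lines in a file
print([line for line in open(path)])  # -> ['line1\n', 'line2\n']
```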
Commentary on Solution
• This regex approach is mostly a hack for this particular application.
• Reads entire cache files into memory as strings (may be quite large)
• Only finds URLs, no other metadata
• Some risk of false positives since URLs could also be embedded in data.
Commentary
• We have started to build a collection of very simple command line tools
• Very much in the "Unix tradition."
• Python makes it easy to create such tools
• More complex applications could be assembled by simply gluing scripts together
Working with Processes
• It is common to write programs that run other programs, collect their output, etc.
• Pipes
• Interprocess Communication
• Python has a variety of modules for supporting this.
subprocess Module
• A module for creating and interacting with subprocesses
• Consolidates a number of low-level OS functions such as system(), execv(), spawnv(), pipe(), popen2(), etc. into a single module
• Cross platform (Unix/Windows)
Example : Slackers
• Find slacker cache entries.
Using the programs findcache.py and requests.py as subprocesses, write a program that inspects cache directories and prints out all entries that contain the word 'slashdot' in the URL.
slackers.py

# slackers.py
import sys
import subprocess

# Run findcache.py as a subprocess
finder = subprocess.Popen(
    [sys.executable,"findcache.py",sys.argv[1]],
    stdout=subprocess.PIPE)

dirlist = [line.strip() for line in finder.stdout]

# Run requests.py as a subprocess
for cachedir in dirlist:
    searcher = subprocess.Popen(
        [sys.executable,"requests.py",cachedir],
        stdout=subprocess.PIPE)
    for line in searcher.stdout:
        if 'slashdot' in line:
            print line,
Launching a subprocess
This is launching a python script as a subprocess, connecting its stdout stream to a pipe.
Collection of output with newline stripping.
Python Executable

sys.executable is the full pathname of the python interpreter, used here so the child scripts run under the same Python.
Subprocess Arguments

Popen() takes a list of arguments for the subprocess, corresponding to what would appear on a shell command line.
More of the same idea. For each directory we found in the last step, we run requests.py to produce requests.
Commentary
• subprocess is a large module with many options.
• However, it takes care of a lot of annoying platform-specific details for you.
• Currently the "recommended" way of dealing with subprocesses.
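A self-contained sketch of the Popen pattern used in slackers.py, substituting a trivial python -c child for findcache.py (note: on Python 3 the pipe yields bytes, not str):

```python
import subprocess
import sys

# Launch a child interpreter with stdout connected to a pipe
proc = subprocess.Popen(
    [sys.executable, "-c", "print('Cache')"],
    stdout=subprocess.PIPE)

# Collect output lines with newline stripping, as slackers.py does
lines = [line.strip() for line in proc.stdout]
proc.wait()
print(lines)  # -> [b'Cache'] on Python 3
```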
Low Level Subprocesses
• Running a simple system command

os.system("shell command")

• Connecting to a subprocess with pipes

pout, pin = popen2.popen2("shell command")

• Exec/spawn

os.execv(), os.execl(), os.execle(), ...
os.spawnv(), os.spawnl(), os.spawnle(), ...

• Unix fork()

os.fork(), os.wait(), os.waitpid(), os._exit(), ...
Interactive Processes
• Python does not have built-in support for controlling interactive subprocesses (e.g., "Expect")
• Must install third party modules for this
• Example: pexpect
• http://pexpect.sourceforge.net
Commentary

• Writing small Unix-like utilities is fairly straightforward in Python
• Support for standard kinds of operations (files, regular expressions, pipes, subprocesses, etc.)
• However, our solution is also kind of clunky
• Only returns some information
• Not particularly memory efficient (reads large files into memory)
Interlude
• Python is well-suited to building libraries and frameworks.
• In the next part, we're going to take a totally different approach than simply writing simple utilities.
• Will build libraries for manipulating cache data and use those libraries to build tools.
Problem : Parsing Data

• Extract the cache data (for real)
Write a module ffcache.py that contains a set of functions for reading Firefox cache data into useful data structures that can be used by other programs.
Capture all available information including URLs, timestamps, sizes, locations, content types, etc.
• Use case: Blood and guts
Writing programs that can process foreign file formats. Processing binary encoded data. Creating code for later reuse.
The Firefox Cache

• There are four critical files

_CACHE_MAP_   # Cache index
_CACHE_001_   # Cache data
_CACHE_002_   # Cache data
_CACHE_003_   # Cache data
• All files are binary-encoded
• _CACHE_MAP_ is used by Firefox to locate data, but it is not updated until Firefox exits.
• We will ignore _CACHE_MAP_ since we want to observe caches of live Firefox sessions.
Firefox _CACHE_ Files

• _CACHE_00n_ file organization: a free/used block bitmap (4096 bytes), followed by the blocks themselves (up to 32768 blocks per file)

• The block size varies according to the file:

_CACHE_001_   256 byte blocks
_CACHE_002_   1024 byte blocks
_CACHE_003_   4096 byte blocks
Cache Entries

• Each cache entry:
• A maximum of 4 cache blocks
• Can either be data or metadata
• If >16K, written to a file instead

• Notice how all the "cryptic" files are >16K

-rw------- beazley 111169 Sep 25 17:15 01CC0844d01
-rw------- beazley 104991 Sep 25 17:15 01CC3844d01
-rw------- beazley  47233 Sep 24 16:41 021F221Ad01
...
-rw------- beazley  26749 Sep 21 11:19 FF8AEDF0d01
-rw------- beazley  58172 Sep 25 18:16 FFE628C6d01
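The 16K cutoff follows directly from the numbers above: at most 4 blocks per entry, and the largest block size is 4096 bytes.

```python
# Largest possible in-file cache entry: 4 blocks of 4096 bytes each,
# so anything larger than 16K spills out to its own file.
MAX_BLOCKS = 4
MAX_BLOCK_SIZE = 4096
print(MAX_BLOCKS * MAX_BLOCK_SIZE)  # -> 16384
```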
Cache Metadata

• Metadata is encoded as a binary structure with three parts: a 36-byte header, then a request string and request info, both of variable length (lengths given in the header)

• Header encoding (binary, big-endian)

Bytes    Field         Type
0-3      magic (???)   unsigned int (0x00010008)
4-7      location      unsigned int
8-11     fetchcount    unsigned int
12-15    fetchtime     unsigned int (system time)
16-19    modifytime    unsigned int (system time)
20-23    expiretime    unsigned int (system time)
24-27    datasize      unsigned int (byte count)
28-31    requestsize   unsigned int (byte count)
32-35    infosize      unsigned int (byte count)
Solution Outline
• Part 1: Parsing Metadata Headers
• Part 2: Getting request information (URL)
• Part 3: Extracting additional content info
• Part 4: Scanning of individual cache files
• Part 5: Scanning an entire directory
• Part 6: Scanning a list of directories
Part I - Reading Headers
• Write a function that can parse the binary metadata header and return the data in a useful format
Reading Headers

import struct

# This function parses a cache metadata header into a dict
# of named fields (listed in _headernames below)

_headernames = ['magic','location','fetchcount',
                'fetchtime','modifytime','expiretime',
                'datasize','requestsize','infosize']

def parse_meta_header(headerdata):
    head = struct.unpack(">9I",headerdata)
    meta = dict(zip(_headernames,head))
    return meta
Reading Headers

• How this is supposed to work:

>>> f = open("Cache/_CACHE_001_","rb")
>>> f.seek(4096)             # Skip the bit map
>>> headerdata = f.read(36)  # Read 36 byte header
>>> meta = parse_meta_header(headerdata)
>>> meta
{'fetchtime': 1190829792, 'requestsize': 27, 'magic': 65544,
 'fetchcount': 3, 'expiretime': 0, 'location': 2449473536L,
 'modifytime': 1190829792, 'datasize': 29448, 'infosize': 531}
>>>

• Basically, we're parsing the header into a useful Python data structure (a dictionary)
struct module

Parses binary encoded data into Python objects. You would use this module to pack/unpack raw binary data from Python strings. The format ">9I" above unpacks 9 unsigned 32-bit big-endian integers.
import struct
# This function parses a cache metadata header into a dict# of named fields (listed in _headernames below)
_headernames = ['magic','location','fetchcount', 'fetchtime','modifytime','expiretime', 'datasize','requestsize','infosize']
def parse_meta_header(headerdata): head = struct.unpack(">9I",headerdata) meta = dict(zip(_headernames,head)) return meta
struct module
69
The result of struct.unpack() is always a tuple of converted values:

head = (65544, 0, 1, 1191682051, 1191682051, 0, 8645, 190, 218)
Dictionary Creation
zip(s1,s2) makes a list of tuples:

zip(_headernames,head) -> [('magic',head[0]),
                           ('location',head[1]),
                           ('fetchcount',head[2]),
                           ...]

dict() then makes a dictionary from those (key, value) pairs.
Commentary

• Dictionaries as data structures

meta = { 'fetchtime'   : 1190829792,
         'requestsize' : 27,
         'magic'       : 65544,
         'fetchcount'  : 3,
         'expiretime'  : 0,
         'location'    : 2449473536L,
         'modifytime'  : 1190829792,
         'datasize'    : 29448,
         'infosize'    : 531 }

• Useful if the data has many parts

data = f.read(head[8])          # Huh?!?

vs.

data = f.read(meta['infosize']) # Better
Mini-reference : struct

• struct module

items = struct.unpack(fmt,data)
data  = struct.pack(fmt,item1,...,itemn)

• Sample format codes

'c'  char (1-byte string)
'b'  signed char (8-bit integer)
'B'  unsigned char (8-bit integer)
'h'  signed short (16-bit integer)
'H'  unsigned short (16-bit integer)
'i'  int (32-bit integer)
'I'  unsigned int (32-bit integer)
'f'  32-bit single-precision float
'd'  64-bit double-precision float
's'  char s[] (string)
'>'  big-endian modifier
'<'  little-endian modifier
'!'  network-order modifier
n    repetition count (a decimal prefix, e.g. the 9 in ">9I")
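As a quick check of these codes, the header format used above round-trips cleanly through pack() and unpack(). The values here are made up; the example is written in Python 3 syntax, where pack() returns bytes:

```python
import struct

# Nine made-up header fields, packed as 9 unsigned 32-bit
# big-endian integers and then unpacked again
values = (65544, 0, 3, 1190829792, 1190829792, 0, 29448, 27, 531)
data = struct.pack(">9I", *values)
assert len(data) == 36            # 9 integers x 4 bytes = 36-byte header
fields = struct.unpack(">9I", data)
assert fields == values           # the round-trip is lossless
```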
Part 2 : Parsing Requests
• Write a function that will read the URL request string and request information
• Request String : A Null-terminated string
• Request Info : A sequence of Null-terminated key-value pairs (like a dictionary)
Parsing Requests

import re
part_pat = re.compile(r'[\n\r -~]*$')

def parse_request_data(meta,requestdata):
    parts = requestdata.split('\x00')
    for part in parts:
        if not part_pat.match(part):
            return False
    request = parts[0]
    if len(request) != (meta['requestsize'] - 1):
        return False
    info = dict(zip(parts[1::2],parts[2::2]))
    meta['request'] = request.split(':',1)[1]
    meta['info'] = info
    return True
Usage : Requests
>>> f = open("Cache/_CACHE_001_","rb")
>>> f.seek(4096)             # Skip the bit map
>>> headerdata = f.read(36)  # Read 36 byte header
>>> meta = parse_meta_header(headerdata)
>>> requestdata = f.read(meta['requestsize']+meta['infosize'])
>>> parse_request_data(meta,requestdata)
True
>>> meta['request']
'http://www.yahoo.com/'
>>> meta['info']
{'request-method': 'GET',
 'request-User-Agent': 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en-US; rv:1.8.1.7) Gecko/20070914 Firefox/2.0.0.7',
 'charset': 'UTF-8',
 'response-head': 'HTTP/1.1 200 OK\r\nDate: Wed, 26 Sep 2007 18:03:17 ...' }
>>>
• Usage of the function:
String Stripping
The request data is a sequence of null-terminated strings. Splitting on '\x00' breaks the data up into parts:

requestdata = 'part\x00part\x00part\x00part\x00...'

parts = requestdata.split('\x00')
# parts = ['part','part','part','part',...]
String Validation
Individual parts should consist of printable ASCII characters (space through '~'), plus the newline characters '\n' and '\r'.

We use the re module to match each string against that pattern. This helps catch cases where we might be reading bad data (false headers, raw data, etc.).
URL Request String
The request string is the first part. The check that follows makes sure it's the right size (a further sanity check on the data integrity).
Request Info
Each request has a set of associated data represented as key/value pairs.

parts = ['request','key','val','key','val','key','val']

parts[1::2] -> ['key','key','key']
parts[2::2] -> ['val','val','val']

zip(parts[1::2],parts[2::2]) -> [('key','val'),
                                 ('key','val'),
                                 ('key','val')]

dict() makes a dictionary from the (key,val) tuples.
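The slicing-and-zip trick can be tried on its own with made-up keys and values (Python 3 shown, where zip() must be materialized through dict() or list()):

```python
# Hypothetical request parts: a request string followed by
# alternating keys and values
parts = ['request', 'method', 'GET', 'charset', 'UTF-8']
keys = parts[1::2]            # every other item, starting at index 1
vals = parts[2::2]            # every other item, starting at index 2
info = dict(zip(keys, vals))  # pair them up into a dictionary
assert info == {'method': 'GET', 'charset': 'UTF-8'}
```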
# Given a dictionary of header information and a file,
# this function extracts the request data from a cache
# metadata entry and saves it in the dictionary. Returns
# True or False depending on success.

def read_request_data(meta,f):
    request = f.read(meta['requestsize']).strip('\x00')
    infodata = f.read(meta['infosize']).strip('\x00')

    # Validate request and infodata here (nothing now)

    # Turn the infodata into a dictionary
    parts = infodata.split('\x00')
    info = dict(zip(parts[::2],parts[1::2]))

    meta['request'] = request.split(':',1)[1]
    meta['info'] = info
    return True
Fixing the Request
Cleaning up the request string:

request = "HTTP:http://www.google.com"

request.split(':',1)     -> ['HTTP','http://www.google.com']
request.split(':',1)[1]  -> 'http://www.google.com'
Commentary

• Python has very powerful list manipulation primitives

  • Indexing

  • Slicing

  • List comprehensions

  • Etc.

• Knowing how to use these leads to rapid development and compact code
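A few of those primitives in one place, on made-up data:

```python
data = [3, 1, 4, 1, 5, 9, 2, 6]
first3  = data[:3]                         # slicing
reverse = data[::-1]                       # negative-stride slice
evens   = [x for x in data if x % 2 == 0]  # list comprehension
assert first3  == [3, 1, 4]
assert reverse == [6, 2, 9, 5, 1, 4, 1, 3]
assert evens   == [4, 2, 6]
```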
Part 3: Content Info
• All documents on the internet have optional content-type, encoding, and character set information.
• Let's add this information since it will make it easier for us to determine the type of files that are stored in the cache (i.e., images, movies, HTML, etc.)
HTTP Responses

• The cache metadata includes an HTTP response header
>>> print meta['info']['response-head']
HTTP/1.1 200 OK
Date: Sat, 29 Sep 2007 20:51:37 GMT
Cache-Control: private
Vary: User-Agent
Content-Type: text/html; charset=utf-8
Content-Encoding: gzip

>>>

Note the content type, character set, and encoding.
Solution

# Given a metadata dictionary, this function adds additional
# fields related to the content type, charset, and encoding

import email

def add_content_info(meta):
    info = meta['info']
    if 'response-head' not in info:
        return
    rhead = info['response-head'].split("\n",1)[1]
    m = email.message_from_string(rhead)
    content  = m.get_content_type()
    encoding = m.get('content-encoding',None)
    charset  = m.get_content_charset()
    meta['content-type'] = content
    meta['content-encoding'] = encoding
    meta['charset'] = charset
Internet Data Handling
Python has a vast assortment of internet data handling modules.

email : parsing of email messages, MIME headers, etc.
In this code, we parse the HTTP response headers using the email module and extract content-type, encoding, and charset information.
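The same trick can be tried on a made-up response header. Note that the status line ("HTTP/1.1 200 OK") has to be stripped first, which is what the split("\n",1)[1] above does; the remaining lines are RFC 822-style headers that the email module can parse:

```python
import email

# A made-up HTTP response header with the status line already removed
rhead = ("Content-Type: text/html; charset=utf-8\r\n"
         "Content-Encoding: gzip\r\n"
         "\r\n")
m = email.message_from_string(rhead)
assert m.get_content_type() == 'text/html'
assert m.get_content_charset() == 'utf-8'
assert m.get('content-encoding') == 'gzip'   # header lookup is case-insensitive
```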
Commentary
• Python is heavily used in Internet applications
• There are modules for parsing common types of data (email, HTML, XML, etc.)
• There are modules for processing bits and pieces of internet data (URLs, MIME types, RFC822 headers, etc.)
Part 4: File Scanning
• Write a function that scans a single cache file and produces a sequence of records containing all of the cache metadata.
• This is just one more of our building blocks
• The goal is to hide some of the nasty bits
File Scanning

# Scan a single file in the firefox cache
def scan_cachefile(f,blocksize):
    maxsize = 4*blocksize   # Maximum size of an entry
    f.seek(4096)            # Skip the bit-map
    while True:
        headerdata = f.read(36)
        if not headerdata:
            break
        meta = parse_meta_header(headerdata)
        if (meta['magic'] == 0x00010008 and
            meta['requestsize'] + meta['infosize'] < maxsize):
            requestdata = f.read(meta['requestsize']+meta['infosize'])
            if parse_request_data(meta,requestdata):
                add_content_info(meta)
                yield meta
        # Move the file pointer to the start of the next block
        fp = f.tell()
        if (fp % blocksize):
            f.seek(blocksize - (fp % blocksize),1)
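The block-alignment arithmetic at the end can be checked in isolation (the numbers here are made up):

```python
blocksize = 256
fp = 700                             # pretend file position mid-block
pad = blocksize - (fp % blocksize)   # bytes to skip to the next boundary
assert pad == 68
assert (fp + pad) % blocksize == 0   # lands exactly on a block boundary
```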
Usage : File Scanning
>>> f = open("Cache/_CACHE_001_","rb")
>>> for meta in scan_cachefile(f,256):
...     print meta['request']
...
http://www.yahoo.com/
http://us.js2.yimg.com/us.yimg.com/a/1-/java/promotions/
http://us.i1.yimg.com/us.yimg.com/i/ww/atty_hsi_free.gif
...
• Usage of the scan function
• We can just open up a cache file and write a for-loop to iterate over all of the entries.
Python File I/O
File Objects

Modeled after ANSI C: files are just bytes, and a file pointer keeps track of the current position.

f.read()       # Read bytes
f.tell()       # Current fp
f.seek(n,off)  # Move fp
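These operations can be exercised without a real cache file by using an in-memory byte stream (io.BytesIO; Python 3 bytes shown). Note seek()'s second argument, which scan_cachefile() uses for a relative seek:

```python
import io

f = io.BytesIO(b"0123456789")   # stands in for a cache file
f.seek(4)                       # absolute seek (whence defaults to 0)
assert f.read(2) == b"45"
assert f.tell() == 6            # file pointer advanced past what we read
f.seek(-3, 1)                   # relative seek: whence=1 means "from here"
assert f.read(3) == b"345"
```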
Using Earlier Code
Here we are using our header parsing functions written in previous parts.
Note: We are progressively adding more data to a dictionary.
Data Validation
This is a sanity check to make sure the header data looks like a valid header.
Generating Results
We are using yield to produce data for a single cache entry. If someone uses a for-loop, they will get all of the entries.
Note: This allows us to process the cache without reading all of the data into memory.
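The same yield mechanics in miniature:

```python
def countdown(n):
    # A generator: nothing runs until someone iterates over it,
    # and only one value is materialized at a time
    while n > 0:
        yield n
        n -= 1

vals = list(countdown(3))
assert vals == [3, 2, 1]
```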
Commentary
• We have created a function that can scan a single _CACHE_00n_ file and produce a sequence of dictionaries with metadata.

• It's still somewhat low-level

• We just need to package it a little better
Part 5 : Scan a Directory
• Write a function that takes the name of a Firefox cache directory, scans all of the cache files for metadata, and produces a single sequence of records.
• Make it real easy to extract data
Solution : Directory Scan

# Given the name of a Firefox cache directory, the function
# scans all of the _CACHE_00n_ files for metadata. A sequence
# of dictionaries containing metadata is returned.

import os

def scan_cache(cachedir):
    files = [('_CACHE_001_',256),
             ('_CACHE_002_',1024),
             ('_CACHE_003_',4096)]
    for cname,blocksize in files:
        cfile = open(os.path.join(cachedir,cname),"rb")
        for meta in scan_cachefile(cfile,blocksize):
            meta['cachedir'] = cachedir
            meta['cachefile'] = cname
            yield meta
        cfile.close()
General idea:
We loop over the three _CACHE_00n_ files and produce a sequence of the cache records
We use the low-level file scanning function here to generate a sequence of records.
More Generation
By using yield here, we are chaining together the results obtained from all three cache files into one big long sequence of results.
The underlying mechanics and implementation details are hidden (user doesn't care)
Additional Data
Adding path and file information to the data (May be useful later)
Usage : Cache Scan
>>> for meta in scan_cache("Cache/"):
...     print meta['request']
...
http://www.yahoo.com/
http://us.js2.yimg.com/us.yimg.com/a/1-/java/promotions/
http://us.i1.yimg.com/us.yimg.com/i/ww/atty_hsi_free.gif
...
• Usage of the scan function
• Given the name of a cache directory, we can just loop over all of the metadata. Trivial!
• With work, could perform various kinds of queries and processing of the data
Another Example
>>> for meta in scan_cache("Cache/"):
...     if 'slashdot' in meta['request']:
...         print meta['request']
...
http://www.slashdot.org/
http://images.slashdot.org/topics/topiccommunications.gif
http://images.slashdot.org/topics/topicstorage.gif
http://images.slashdot.org/comments.css?T_2_5_0_176
...
• Find all requests related to Slashdot
• Well, that was pretty easy.
Another Example
>>> jpegs = (meta for meta in scan_cache("Cache/")
...          if meta['content-type'] == 'image/jpeg'
...          and meta['datasize'] > 100000)
>>> for j in jpegs:
...     print j['request']
...
http://images.salon.com/ent/video_dog/comedy/2007/09/27/cereal/story.jpg
http://images.salon.com/ent/video_dog/ifc/2007/09/28/apocalypse/story.jpg
http://www.lakesideinns.com/images/fallroadphoto2006.jpg
...
>>>
• Find all large JPEG images in the cache
• That was also pretty easy
Part 6 : Scan Everything
• Write a function that takes a list of cache directories and produces a sequence of all cache metadata found in all of them.
• A single utility function that lets us query everything.
Scanning Everything
# Scan an entire list of cache directories producing
# a sequence of records

def scan(cachedirs):
    if isinstance(cachedirs,str):
        cachedirs = [cachedirs]
    for cdir in cachedirs:
        for meta in scan_cache(cdir):
            yield meta
Type Checking
This bit of code is an example of type checking. If the argument is a string, we convert it to a list with one item. This allows the following usage:

scan("CacheDir")
scan(["CacheDir1","CacheDir2",...])
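The normalize-the-argument pattern in isolation (scan_dirs and its body are hypothetical stand-ins, not the real scanner):

```python
def scan_dirs(dirs):
    # Accept either a single string or a list of strings
    if isinstance(dirs, str):
        dirs = [dirs]
    return [d.upper() for d in dirs]   # stand-in for real scanning work

assert scan_dirs("cache") == ["CACHE"]
assert scan_dirs(["a", "b"]) == ["A", "B"]
```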
Putting it all together

# slack.py
# Find all of those slackers who should be working
import sys, os, ffcache

if len(sys.argv) != 2:
    print >>sys.stderr,"Usage: python slack.py dirname"
    raise SystemExit(1)

caches = (path for path,dirs,files in os.walk(sys.argv[1])
          if '_CACHE_MAP_' in files)

for meta in ffcache.scan(caches):
    if 'slashdot' in meta['request']:
        print meta['request']
        print meta['cachedir']
        print
Intermission
• Have written a simple library ffcache.py
• The library takes a moderately complex data processing problem and breaks it up into pieces.
• About 100 lines of code.
• Now, let's build an application...
Problem : CacheSpy

• Big Brother (make an evil sound here)
Write a program that first locates all of the Firefox cache directories under a given directory. Then have that program run forever as a network server, waiting for connections. On each connection, send back all of the current cache metadata.
• Big Picture
We're going to write a daemon that will find and quietly report on browser cache contents.
cachespy.py

import sys, os, pickle, SocketServer, ffcache

SPY_PORT = 31337
caches = [path for path,dname,files in os.walk(sys.argv[1])
          if '_CACHE_MAP_' in files]

def dump_cache(f):
    for meta in ffcache.scan(caches):
        pickle.dump(meta,f)

class SpyHandler(SocketServer.BaseRequestHandler):
    def handle(self):
        f = self.request.makefile()
        dump_cache(f)
        f.close()

SocketServer.TCPServer.allow_reuse_address = True
serv = SocketServer.TCPServer(("",SPY_PORT),SpyHandler)
print "CacheSpy running on port %d" % SPY_PORT
serv.serve_forever()
SocketServer Module
SocketServer
A module for easily creating low-level internet applications using sockets.
SocketServer Handlers
You define a simple class that implements handle().
This implements the server logic.
SocketServer Servers
Next, you just create a Server object, hook the handler up to it, and run the server.
Data Serialization
Here, we are turning a socket into a file and dumping cache data onto it. self.request is the socket corresponding to the client that connected.
pickle Module
The pickle module takes any Python object and serializes it into a byte string.

There are really only two operations:

pickle.dump(obj,f)    # Dump object
obj = pickle.load(f)  # Load object
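The dump-repeatedly/load-until-EOFError pattern used by cachespy.py and cachemon.py works the same over any file-like object; here it is over an in-memory stream (Python 3 shown, where pickle needs a binary file):

```python
import pickle, io

f = io.BytesIO()   # stands in for the socket's file object
for obj in [{'request': 'http://example.com/'},
            {'request': 'http://example.org/'}]:
    pickle.dump(obj, f)              # dump objects back-to-back

f.seek(0)
loaded = []
try:
    while True:                      # load until the stream runs out
        loaded.append(pickle.load(f))
except EOFError:
    pass
assert [m['request'] for m in loaded] == ['http://example.com/',
                                          'http://example.org/']
```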
Running our Server
% python cachespy.py /Users
CacheSpy running on port 31337
• Example:
• Server is just sitting there waiting
• You can try connecting with telnet

% telnet localhost 31337
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
(dp0
S'info'
p1
... bunch of cryptic data ...
Problem : CacheMon
• The Evil Overlord (make a more evil sound)
Write a program cachemon.py that contains a function for retrieving the cache contents from a remote machine.
• Big Picture
Writing network clients. Programs that make outgoing connections to internet services.
# cachemon.py
import pickle, socket

def scan_remote_cache(host):
    s = socket.socket(socket.AF_INET,socket.SOCK_STREAM)
    s.connect(host)
    f = s.makefile()
    try:
        while True:
            meta = pickle.load(f)
            meta['host'] = host    # Add host to metadata
            yield meta
    except EOFError:
        pass
    f.close()
    s.close()
cachemon.py
Solution : Socket Module
The socket module provides direct access to the low-level socket API.

s = socket.socket(family,type)
s.connect(host)
s.bind(addr)
s.listen(n)
s.accept()
s.recv(n)
s.send(data)
...
Unpickling a Sequence
Here we use pickle to repeatedly load objects off of the socket. We use yield to generate a sequence of received objects.
Example Usage
>>> rcache = scan_remote_cache(("localhost",31337))
>>> jpegs = (meta for meta in rcache
...          if meta['content-type'] == 'image/jpeg'
...          and meta['datasize'] > 100000)
>>> for j in jpegs:
...     print j['request']
...
http://images.salon.com/ent/video_dog/comedy/2007/09/27/cereal/story.jpg
http://images.salon.com/ent/video_dog/ifc/2007/09/28/apocalypse/story.jpg
http://www.lakesideinns.com/images/fallroadphoto2006.jpg
...
• Example: Find all JPEG images > 100K on a remote machine
• This looks almost identical to old code
Code Similarity
• A Remote Scan

rcache = scan_remote_cache(("localhost",31337))
jpegs = (meta for meta in rcache
         if meta['content-type'] == 'image/jpeg'
         and meta['datasize'] > 100000)
for j in jpegs:
    print j['request']

• A Local Scan

cache = ffcache.scan(cachedirs)
jpegs = (meta for meta in cache
         if meta['content-type'] == 'image/jpeg'
         and meta['datasize'] > 100000)
for j in jpegs:
    print j['request']
Big Picture
In cachespy.py, the server side dumps records onto the socket:

for meta in ffcache.scan(dirs):
    pickle.dump(meta,f)

In cachemon.py, the client side loads them back off the socket and regenerates the sequence:

while True:
    meta = pickle.load(f)
    yield meta

for meta in scan_remote_cache(host):
    # ...
Problem : Clusters
• Scan a whole cluster of machines
Write a function that can easily scan the caches of an entire collection of remote hosts.
• Big Picture
Collecting data from a group of machines on the network.
# cachemon.py
...
def scan_cluster(hostlist):
    for host in hostlist:
        try:
            for meta in scan_remote_cache(host):
                yield meta
        except (EnvironmentError,socket.error):
            pass
cachemon.py
A bit of exception handling to deal with dead machines, and other problems (would probably need to be expanded)
Example Usage
>>> hosts = [('host1',31337),('host2',31337),...]
>>> rcaches = scan_cluster(hosts)
>>> jpegs = (meta for meta in rcaches
...          if meta['content-type'] == 'image/jpeg'
...          and meta['datasize'] > 100000)
>>> for j in jpegs:
...     print j['request']
...
...
• Example: Find all JPEG images > 100K on a set of remote machines
• Think about the abstraction of "iteration" here. Query code is exactly the same.
Problem : Concurrency
• Collect data from a large set of machines
In the last section, the scan_cluster() function retrieves data from one machine at a time. However, a world-wide quasi-evil organization is likely to have at least several dozen machines.
• Your task
Modify the scanner so that it can manage concurrent client connections, reading data from multiple sources at once.
Concurrency
• Python provides full support for threads
• They are real threads (pthreads, system threads, etc.)
• However, a lock within the Python interpreter (the Global Interpreter Lock) prevents concurrency across more than one CPU.
Programming with Threads
• threading module provides a Thread object.
• A variety of synchronization primitives are provided (Locks, Semaphores, Condition Variables, Events, etc.)
• Can program very traditional kinds of threaded programs (multiple threads, lots of locking, race conditions, horrible debugging, etc.).
Threads with Queues
• One technique for thread programming is to have independent threads that share data via thread-safe message queues.
• Variations of "producer-consumer" problems.
• We'll use this in our solution. Keep in mind, it's not the only way to program threads.
# cachemon.py
...
import threading

class ScanThread(threading.Thread):
    def __init__(self,host,msg_q):
        threading.Thread.__init__(self)
        self.host = host
        self.msg_q = msg_q
    def run(self):
        for meta in scan_remote_cache(self.host):
            self.msg_q.put(meta)
A Cache Scanning Thread
threading Module
threading module.
Contains most functionality related to threads.
Thread Base Class
Threads are defined by inheriting from the Thread base class.
Thread Initialization
The __init__() method performs thread initialization and setup, saving the host and message queue for later use.
Thread Execution
The run() method contains the code that executes in the thread. Here, the thread performs a scan of a single host.
Launching a Thread
• You create a thread object and start it

t1 = ScanThread(("host1",31337), msg_q)
t1.start()

t2 = ScanThread(("host2",31337), msg_q)
t2.start()
• .start() starts the thread and calls .run()
Thread Safe Queues
• Queue module. Provides a thread-safe queue.

import Queue
msg_q = Queue.Queue()

• Queue insertion

msg_q.put(obj)

• Queue removal

obj = msg_q.get()
• Queue can be shared by as many threads as you want without worrying about locking.
Use of a Queue Object
msg_q is the Queue object where incoming metadata objects are placed.
run() gets data from the remote machine and puts it into the queue.
Primitive Use of a Queue
• You first create a queue, then launch the threads to insert data into it.
msg_q = Queue.Queue()
t1 = ScanThread(("host1",31337), msg_q)
t1.start()

t2 = ScanThread(("host2",31337), msg_q)
t2.start()

while True:
    meta = msg_q.get()   # Get metadata
    ...
Monitor Architecture
[Diagram: each monitored Host connects over a socket to its own scan Thread inside the Monitor; the threads .put() metadata into the shared msg_q, and a Consumer pulls items out with .get().]
Concurrent Monitor

import threading, Queue

def concurrent_scan(hostlist, msg_q):
    thr_list = []
    for host in hostlist:
        thr = ScanThread(host, msg_q)
        thr.start()
        thr_list.append(thr)
    for thr in thr_list:
        thr.join()
    msg_q.put(None)       # Sentinel

def scan_cluster(hostlist):
    msg_q = Queue.Queue()
    threading.Thread(target=concurrent_scan,
                     args=(hostlist, msg_q)).start()
    while True:
        meta = msg_q.get()
        if meta:
            yield meta
        else:
            break
Launching Threads
concurrent_scan() runs in a thread that launches the ScanThreads. It then waits for the threads to terminate by joining each of them. After all threads have terminated, a sentinel (None) is dropped into the queue.
Collecting Results
scan_cluster() creates a Queue and launches a thread that starts all of the scanning threads.
It then produces a sequence of cache data until the sentinel (None) is pulled off of the queue.
More on Threads
• There are many more issues in thread programming than we can discuss here.
• All issues concerning locking, synchronization, event handling, and race conditions apply to Python.
• Because of the global interpreter lock, threads are not (generally) a way to achieve higher computational performance.
Thread Synchronization
• threading module has various primitives

Lock()         # Mutex lock
RLock()        # Reentrant mutex lock
Semaphore(n)   # Semaphore

• Example use:

x = value          # Some kind of shared object
x_lock = Lock()    # A lock associated with x
...
x_lock.acquire()
# Modify or do something with x (critical section)
...
x_lock.release()
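A runnable variant of the acquire/release pattern above, as a sketch with made-up names: wrapping the critical section in try/finally guarantees the lock is released even if the code in between raises.

```python
import threading

counter = 0                       # shared object
counter_lock = threading.Lock()   # lock associated with counter

def increment(n):
    global counter
    for _ in range(n):
        counter_lock.acquire()
        try:
            counter += 1   # critical section
        finally:
            counter_lock.release()   # always released

threads = [threading.Thread(target=increment, args=(10000,))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # -> 40000
```

In Python 2.5 and later, the same pattern can be written more compactly as `with counter_lock: ...`.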
Story so Far
• Wrote a module ffcache.py that parsed contents of caches (~100 lines)
• Wrote cachespy.py that allows cache data to be retrieved by a remote client (~25 lines)
• Wrote a concurrent monitor for getting that data (~50 lines)
A subtle observation
• In none of our programs have we read the entire contents of any Firefox cache into memory.
• In cachespy.py, the contents are read iteratively and piped through a socket (not stored in memory).
• In cachemon.py, contents are received and routed through message queues. Processed iteratively (no temporary lists of results).
Another Observation
• For every connection, cachespy sends the entire contents of the Firefox cache metadata back to the monitor.
• Given that caches are ~50 MB by default, this could result in large network traffic.
• Question: Given that we're normally performing queries on the data, could we do any of this work on the remote machines?
Remote Filtering
• Distribute the work
Modify the cachespy program so that some of the query work can be performed remotely on each of the machines. Only send back a subset of the data to the monitor program.
• Big Picture
Distributed computation. Massive security nightmare.
The idea
• Modify scan_cluster() and all related functions to accept an optional filter specification. Pass this on to the remote machine and use it to process the data remotely before returning results.
filter = """if meta['content-type'] == 'image/jpeg'and meta['datasize'] > 100000"""
rcaches = scan_cluster(hostlist,filter)
Changes to the Monitor

# cachemon.py
def scan_remote_cache(host, filter=""):
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.connect(host)
    f = s.makefile()
    pickle.dump(filter, f)
    f.flush()
    try:
        while True:
            meta = pickle.load(f)
            meta['host'] = host
            yield meta
    except EOFError:
        pass
Send the filter to the remote host right after connecting.
Add a filter parameter
Changes to the Monitor

# cachemon.py
...
class ScanThread(threading.Thread):
    def __init__(self, host, msg_q, filter=""):
        threading.Thread.__init__(self)
        self.host = host
        self.msg_q = msg_q
        self.filter = filter
    def run(self):
        try:
            for meta in scan_remote_cache(self.host, self.filter):
                self.msg_q.put(meta)
        except (EnvironmentError, socket.error):
            pass
filter added to thread data
def concurrent_scan(hostlist, msg_q, filter):
    thr_list = []
    for host in hostlist:
        thr = ScanThread(host, msg_q, filter)
        thr.start()
        thr_list.append(thr)
    for thr in thr_list:
        thr.join()
    msg_q.put(None)       # Sentinel
Changes to the Monitor
filter passed to thread creation
Changes to the Monitor
# cachemon.py
...
def scan_cluster(hostlist, filter=""):
    msg_q = Queue.Queue()
    threading.Thread(target=concurrent_scan,
                     args=(hostlist, msg_q, filter)).start()
    while True:
        meta = msg_q.get()
        if not meta:
            break
        yield meta
filter added
Commentary
• We have modified the cache monitor program to accept a filter string and to pass that string to remote clients upon connecting.
• Next: how to use the filter in the spy server.
Changes to CacheSpy
# cachespy.py
...
def dump_cache(f, filter):
    values = """(meta for meta in ffcache.scan(caches) %s)""" % filter
    try:
        for meta in eval(values):
            pickle.dump(meta, f)
    except:
        pickle.dump({'error': traceback.format_exc()}, f)
Changes to CacheSpy
Filter added and used to create an expression string.
filter = "if meta['datasize'] > 100000"
values = """(meta for meta in ffcache.scan(caches) if meta['datasize'] > 100000)"""
eval()
eval(s) evaluates the string s as a Python expression.
The except clause adds a bit of error handling: the traceback module creates stack traces for exceptions.
Changes to the Server
# cachespy.py
...
class SpyHandler(SocketServer.BaseRequestHandler):
    def handle(self):
        f = self.request.makefile()
        filter = pickle.load(f)
        dump_cache(f, filter)
        f.close()
Get filter from the monitor
Putting it all Together
• A remote query to find slackers
# Find all of those slashdot slackers
import cachemon

hosts = [('host1',31337), ('host2',31337),
         ('host3',31337), ...]

filter = "if 'slashdot' in meta['request']"

rcaches = cachemon.scan_cluster(hosts, filter)
for meta in rcaches:
    print meta['request']
    print meta['host'], meta['cachedir']
    print
Putting it all Together
• Queries run remotely on all the hosts
• Only data of interest is sent back
• No temporary lists or large data structures
• Concurrent execution on monitor
• Concurrency is hidden from user
The Power of Iteration
• Loop over all entries in a cache file:
for meta in scan_cache_file(f, 256):
    ...

• Loop over all entries in a cache directory

for meta in scan_cache(dirname):
    ...

• Loop over all cache entries on a remote host

for meta in scan_remote_cache(host):
    ...

• Loop over all cache entries on many hosts

for meta in scan_cluster(hostlist):
    ...
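The layering works because each scan_* function is a generator, so the stages compose like a pipeline. A toy sketch with hypothetical stage names shows the shape:

```python
# Each stage is a generator that consumes the previous stage,
# mirroring scan_cache() looping over scan_cache_file().
def scan_file(entries):
    for entry in entries:
        yield entry

def scan_dir(files):
    for entries in files:
        for entry in scan_file(entries):
            yield entry

data = [[1, 2], [3], [4, 5]]     # made-up stand-in for cache files
result = list(scan_dir(data))
print(result)  # -> [1, 2, 3, 4, 5]
```

Nothing is materialized until the outermost loop asks for it, which is why the monitor never holds an entire cache in memory.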
Wrapping Up
• A lot of material has been presented
• Again, the goal was to do something interesting with Python, not to be just a reference manual.
• This is only a small taste of what's possible
• And it's only a small taste of why people like programming in Python
Other Python Examples
• Python makes many annoying tasks relatively easy.
• Will end by showing very simple examples of other modules.
Fetching a Web Page
• urllib and urllib2 modules
import urllib
w = urllib.urlopen("http://www.foo.com")
for line in w:
    # ...

page = urllib.urlopen("http://www.foo.com").read()
• Additional options support uploading of form values, cookies, passwords, proxies, etc.
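As a sketch of those extra options (the URL, form fields, and header values are invented): urllib2's Request object lets you attach form data and headers before opening the URL. Constructing the Request does no network I/O, so urlopen() is omitted here. The try/except import covers both the Python 2 names used in the deck and the Python 3 ones.

```python
try:
    from urllib.request import Request   # Python 3 names
    from urllib.parse import urlencode
except ImportError:
    from urllib2 import Request          # Python 2, as in the slides
    from urllib import urlencode

# Encode form values and attach a custom header
data = urlencode({'user': 'beazley', 'q': 'slashdot'}).encode('ascii')
req = Request("http://www.foo.com/search",
              data=data,
              headers={'User-Agent': 'cachemon/0.1'})
print(req.get_full_url())  # -> http://www.foo.com/search
```

Passing the Request to urlopen() would then perform a POST with those form values.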
A Web Server with CGI
• Serve files and allow CGI scripts

from BaseHTTPServer import HTTPServer
from CGIHTTPServer import CGIHTTPRequestHandler
import os
os.chdir("/home/docs/html")
serv = HTTPServer(("", 8080), CGIHTTPRequestHandler)
serv.serve_forever()
• Can easily throw up a server with just a few lines of Python code.
A Custom HTTP Server
• BaseHTTPServer module
from BaseHTTPServer import BaseHTTPRequestHandler, HTTPServer

class MyHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        ...
    def do_POST(self):
        ...
    def do_HEAD(self):
        ...
    def do_PUT(self):
        ...

serv = HTTPServer(("", 8080), MyHandler)
serv.serve_forever()
• Could be used to embed a web server in an application
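A minimal runnable version of the handler sketch above (the slides leave the method bodies out, so this do_GET body is hypothetical): bind to port 0 to let the OS pick a free port, serve from a background thread, and fetch one page. The try/except import covers both the Python 2 names from the slides and the Python 3 ones.

```python
import threading
try:
    from http.server import BaseHTTPRequestHandler, HTTPServer  # Py3
    from urllib.request import urlopen
except ImportError:
    from BaseHTTPServer import BaseHTTPRequestHandler, HTTPServer
    from urllib2 import urlopen

class MyHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = b"Hello from MyHandler\n"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
    def log_message(self, *args):
        pass   # silence per-request console logging

serv = HTTPServer(("127.0.0.1", 0), MyHandler)   # port 0: OS picks a port
port = serv.server_address[1]
t = threading.Thread(target=serv.serve_forever)
t.daemon = True
t.start()

reply = urlopen("http://127.0.0.1:%d/" % port).read()
serv.shutdown()
serv.server_close()
print(reply)  # -> b'Hello from MyHandler\n'
```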
XML-RPC Server/Client
• How to create a stand-alone server

from SimpleXMLRPCServer import SimpleXMLRPCServer

def add(x, y):
    return x + y

s = SimpleXMLRPCServer(("", 8080))
s.register_function(add)
s.serve_forever()

• How to test it (xmlrpclib)

>>> import xmlrpclib
>>> s = xmlrpclib.ServerProxy("http://localhost:8080")
>>> s.add(3,5)
8
>>> s.add("Hello","World")
"HelloWorld"
>>>
Where to go from here?
• Network/Internet programming. Python has a large user base developing network applications, web frameworks, and internet data handling tools.
• C/C++ extension building. Python is easily extended with C/C++ code. Can use Python as a high-level control application for existing systems software.
Where to go from here?
• GUI programming. There are several major GUI packages for Python (Tkinter, wxPython, PyQt, etc.).
• Jython and IronPython. Implementations of the Python interpreter for Java and .NET.
Where to go from here?
• Everything Pythonic:
http://www.python.org
• Get involved. PyCon'2008 (Chicago)
• Have an on-site course (shameless plug)
http://www.dabeaz.com/python.html
Thanks for Listening!
• Hope you got something out of the class
• Please give me feedback!