Python: a popular high-level language
Learning Python from scratch
http://www.codecademy.com/en/tracks/python
Free, interactive web-based tutorial. Great for new programmers.
http://learnpythonthehardway.org/book/
Free web book. Exercise based, comprehensive.
Python topics to be covered
- Virtual Environment & Anaconda
- IPython Notebook
- NumPy/SciPy
- Matplotlib
- Interactive plotting using Bokeh
- Turning a Python package into a web app
Online PySCA – An ongoing case study
A published Python package by Ranganathan and Reynolds from GCSB
- Command line based
- Nice IPython Notebook tutorial
- Detailed documentation
Challenges:
- Use from command line
- Dependencies
- Interactive presentation
- Broader impact
Solutions:
- A Python web app
Python: pros
A clean, easy to learn language
Huge number of community created packages
Booming popularity for scientific computing
Python bindings / API for a lot of other software
Open source – Free!
Python: cons
Dependency Hell
Affects all modern languages, especially interpreted ones.
Python especially challenging:
• Huge number of 3rd party packages
• Rapidly changing APIs
• Scientific packages need non-Python dependencies.
Solutions: Anaconda, virtualenv, etc.
Anaconda
Download for your own machine (free):
http://continuum.io/downloads
Use on BioHPC cluster or clients:
module load python/2.7.x-anaconda
module load python/3.4.x-anaconda
Other python modules are deprecated
Manages python packages AND their non-python dependencies.
Allows creation of multiple environments, with versions you need for specific projects.
Anaconda – Default Environment
162 packages, including full scientific python stack
Spyder scientific development environment
IPython notebook ready to run
$ conda list
$ ipython notebook
$ spyder
Python on BioHPC
Anaconda – Create Your Own Environment
The main module installation must be stable. We won't update packages in it frequently.
The conda tool lets you create your own environments with the versions you need. The virtual environment is installed in $HOME/.conda
# Create a new environment with the latest anaconda package set
conda create -n test anaconda
# See environments available
conda env list
# Start using / switch back to default
source activate test
source deactivate
# Remove environment
conda env remove --name test
Anaconda – Create Your Own Environment
# Create a minimal environment with specific python and numpy
# Won't install all of the conda package set
conda create -n test1 numpy scipy matplotlib bokeh
# Start using the environment
source activate test1
# Add ipython to this active environment
conda install ipython
# Update the numpy package to the latest version
conda update numpy
# Install a non-conda package from PyPI using pip
conda search colour
pip search colour
pip install colour
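After activating an environment, it can be worth confirming from inside Python which interpreter and package versions are actually in use. The following is a minimal sketch, assuming Python 3.8 or later (for `importlib.metadata`); the helper name `installed_version` is hypothetical, and the package list simply mirrors the `test1` example.

```python
import sys
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # sys.prefix points at the root directory of the active environment
    print("interpreter:", sys.executable)
    print("environment:", sys.prefix)
    for name in ("numpy", "scipy", "matplotlib", "bokeh"):
        print(name, installed_version(name) or "not installed")
```

Running this before and after `source activate test1` should show the interpreter path and versions changing with the environment.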
SCA – Statistical coupling analysis of protein families
Characterizes the pattern of evolutionary constraints on amino acid positions. Given sequence alignments, SCA measures the functional constraint at each position and the correlations between positions.
Four steps to use the pySCA software, e.g.:
./annotate_MSA.py Inputs/PF00186_full.txt \
    -o Outputs/PF00186_full.an -a 'pfam'
./scaProcessMSA.py Inputs/PF00186_full.an \
    -s 1RX2 -c A -f 'Escherichia coli' -t
./scaCore.py PF00186_full.db -n 'frob' -l 0.03 -t 10
./scaSectorid.py PF00186_full.db -p 0.95
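The four steps can also be chained from a Python script, which is roughly what a web front end has to do. This is only a sketch: `build_pipeline` and `run_pipeline` are hypothetical helpers (not part of pySCA), the scripts are assumed to sit in the current directory, and the flags are copied unchanged from the commands above (the structure, chain and species values are specific to the PF00186 example).

```python
import subprocess


def build_pipeline(family="PF00186_full"):
    """Return the four pySCA commands from the slide as argument lists."""
    return [
        ["./annotate_MSA.py", "Inputs/%s.txt" % family,
         "-o", "Outputs/%s.an" % family, "-a", "pfam"],
        ["./scaProcessMSA.py", "Inputs/%s.an" % family,
         "-s", "1RX2", "-c", "A", "-f", "Escherichia coli", "-t"],
        ["./scaCore.py", "%s.db" % family, "-n", "frob", "-l", "0.03", "-t", "10"],
        ["./scaSectorid.py", "%s.db" % family, "-p", "0.95"],
    ]


def run_pipeline(commands):
    """Run each step in order; check=True stops at the first failing step."""
    for argv in commands:
        subprocess.run(argv, check=True)


if __name__ == "__main__":
    for argv in build_pipeline():
        print(" ".join(argv))  # preview only; call run_pipeline() to execute
```

Passing argument lists rather than one shell string avoids quoting problems with values such as `Escherichia coli`.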
Ipython Notebook
Numpy
a = range(100000)
for i in range(100):
    for n in a:
        b = n**2
100 loops, lapsed 1.53494286537 sec
import numpy as np
a = np.arange(100000)
for i in range(100):
    b = a**2
100 loops, lapsed 0.0342528820038 sec
The numpy method is 44.8120793223 times faster
Much Faster !!!
Notebook-1,2
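The comparison can be reproduced with a small self-contained script. This is a sketch rather than the notebook's code: the function names are hypothetical, the pure-Python version keeps all squares in a list (so that the two results can be compared), and the measured speed-up will vary by machine.

```python
import time

import numpy as np


def time_python_loop(n, repeats):
    """Square n integers one at a time in pure Python."""
    values = range(n)
    start = time.perf_counter()
    for _ in range(repeats):
        squares = [v ** 2 for v in values]
    return time.perf_counter() - start, squares


def time_numpy(n, repeats):
    """Square n integers with one vectorized NumPy operation."""
    values = np.arange(n, dtype=np.int64)
    start = time.perf_counter()
    for _ in range(repeats):
        squares = values ** 2
    return time.perf_counter() - start, squares


if __name__ == "__main__":
    t_loop, _ = time_python_loop(100000, 20)
    t_numpy, _ = time_numpy(100000, 20)
    print("python loop: %.4f s" % t_loop)
    print("numpy:       %.4f s" % t_numpy)
    print("speed-up:    %.1fx" % (t_loop / t_numpy))
```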
Python: cons
Python is slooooooow…..
Trades execution speed for development speed.
Solution: Move critical portions closer to machine code.
• Directly call C code, e.g. with Cython
• Use modules built on optimized, compiled code, e.g. NumPy builds on BLAS / LAPACK
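As an illustration of the second option, the sketch below multiplies two matrices first with a textbook triple loop and then with NumPy's `@` operator, which hands the work to compiled code (a BLAS library where one is available). The matrix sizes and the helper name `matmul_pure_python` are arbitrary choices for the example.

```python
import numpy as np


def matmul_pure_python(a, b):
    """Textbook triple loop over nested lists."""
    rows, inner, cols = len(a), len(b), len(b[0])
    result = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            total = 0.0
            for k in range(inner):
                total += a[i][k] * b[k][j]
            result[i][j] = total
    return result


rng = np.random.default_rng(0)
a = rng.random((30, 40))
b = rng.random((40, 20))

slow = matmul_pure_python(a.tolist(), b.tolist())  # interpreted, element by element
fast = a @ b                                        # one call into compiled code

print(np.allclose(slow, fast))
```

Both give the same numbers; only the place where the loops run differs.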
NumPy
NumPy performs (multi-dimensional) array arithmetic much faster than native python objects, by using low-level contiguous arrays and compiled libraries:
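A short sketch of what "contiguous" means in practice: a 3 x 4 array of 64-bit floats is stored as one flat block of 96 bytes, and arithmetic is applied to the whole block at once. The array and variable names here are illustrative only.

```python
import numpy as np

grid = np.arange(12, dtype=np.float64).reshape(3, 4)

# One flat block of memory: 12 elements of 8 bytes each
print(grid.flags["C_CONTIGUOUS"])
print(grid.itemsize, grid.nbytes)

# Strides: bytes to step to reach the next row / next column
print(grid.strides)

# Arithmetic applies to every element without a Python-level loop
scaled = grid * 2.0 + 1.0
column_means = grid.mean(axis=0)
print(scaled)
print(column_means)
```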
SciPy and MatPlotLib
Notebook-3,4
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import cm
from scipy.special import jn, jn_zeros  # Bessel function and its zeros

def drumhead_height(n, k, distance, angle, t):
    # Height of drumhead mode (n, k) at a given radius, angle and time
    nth_zero = jn_zeros(n, k)[-1]  # k-th zero of the order-n Bessel function
    return np.cos(t) * np.cos(n * angle) * jn(n, distance * nth_zero)

theta = np.r_[0 : 2 * np.pi : 50j]
radius = np.r_[0 : 1 : 50j]
x = np.array([r * np.cos(theta) for r in radius])
y = np.array([r * np.sin(theta) for r in radius])
z = np.array([drumhead_height(1, 1, r, theta, 0.5) for r in radius])
fig = plt.figure()
ax = fig.add_subplot(projection='3d')  # 3D axes attached to the figure
ax.plot_surface(x, y, z, rstride=1, cstride=1, cmap=cm.jet)
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_zlabel('z')
plt.show()
Turn pySCA into a website
Tasks
- Web interface
- Multiple users
- Run analysis
- Display results to explore
Design
- Django framework
- Database model design
- Simple form interface
- Celery backend execution
- Direct call to pySCA
- Interactive plotting
Django – the web server that glues everything together
Forms for users
Django turns forms into database entries
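In Django itself this step is typically handled by a model plus a form class, inside a configured project. As a framework-free illustration of the same idea (validate the submitted fields, then store them as one database row), here is a sketch using only `sqlite3` from the standard library. It is a stand-in, not Django code: the table name, field names and validation rule are hypothetical and are not taken from the pySCA web app.

```python
import sqlite3

REQUIRED_FIELDS = ("pfam_id", "pdb_id", "chain")


def validate(form_data):
    """Return (cleaned_data, errors) for a submitted form dictionary."""
    cleaned, errors = {}, {}
    for field in REQUIRED_FIELDS:
        value = str(form_data.get(field, "")).strip()
        if value:
            cleaned[field] = value
        else:
            errors[field] = "This field is required."
    return cleaned, errors


def save_job(connection, cleaned):
    """Insert one validated submission and return its row id."""
    cursor = connection.execute(
        "INSERT INTO job (pfam_id, pdb_id, chain, status) VALUES (?, ?, ?, 'queued')",
        (cleaned["pfam_id"], cleaned["pdb_id"], cleaned["chain"]),
    )
    connection.commit()
    return cursor.lastrowid


connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE job (id INTEGER PRIMARY KEY, pfam_id TEXT, pdb_id TEXT,"
    " chain TEXT, status TEXT)"
)

submitted = {"pfam_id": "PF00186", "pdb_id": "1RX2", "chain": "A"}
cleaned, errors = validate(submitted)
job_id = save_job(connection, cleaned) if not errors else None
print(job_id, errors)
```

A web framework adds the HTML rendering, type conversion and schema management around this core, so the application code only has to declare the fields.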
Celery – distributed task queue
User submits form
Django converts command, issues async Celery tasks
Celery takes command, runs job in queue
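Celery needs a message broker and separate worker processes, so it is hard to show in a self-contained snippet. As a stand-in, the sketch below uses `concurrent.futures` from the standard library to illustrate the same submit-now, collect-later pattern. It is not Celery code; `run_analysis` and its arguments are hypothetical placeholders.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_analysis(job_id, command):
    """Placeholder for a long-running job; a real worker would run pySCA here."""
    time.sleep(0.1)
    return {"job_id": job_id, "command": command, "status": "done"}


executor = ThreadPoolExecutor(max_workers=2)  # plays the role of the worker pool

# The web request can return as soon as the task has been queued...
future = executor.submit(run_analysis, 1, "./scaCore.py PF00186_full.db")
print("queued, finished yet?", future.done())

# ...and the result is collected later, e.g. when the user opens the results page.
result = future.result(timeout=5)
print(result["status"])
executor.shutdown()
```

The key design point carries over: the web server never blocks on the analysis, it only records that a job exists and checks back for the result.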
Bokeh - Interactive plotting tool
Notebook – 5,6
Output can be an HTML file, a web page, or an IPython notebook
Take home messages
- Python is easy and fun to use.
- Use a separate virtual environment for each project.
- Use compiled libraries for speed.
- Python has interactive packages for data sharing and exploring.
- Python can do a lot!
Acknowledgements / License
pySCA is a package developed by Olivier Rivoire, Kimberly Reynolds, and Rama
Ranganathan. cf: Olivier Rivoire, Kimberly Reynolds, and Rama Ranganathan,
Evolution-Based Functional Decomposition of Proteins, PLOS Computational Biology,
12(6): e1004817.
NumPy, SciPy, mpi4py and multiprocessing code was taken from the Texas Advanced
Computing Center hpc-python course. Slides have been reformatted to UTSW style.
Code was changed accordingly. Source: https://portal.tacc.utexas.edu/-/hpc-python
"HPC Python", Texas Advanced Computing Center, 2015. Available under a Creative
Commons Attribution Non-Commercial 3.0 Unported License.
Bokeh medal example was taken from bokeh.pydata.org
Further reference: www.cism.ucl.ac.be/Services/Formations//python/2015
Python: cons
Global Interpreter Lock
You can create many threads, but only one runs at a time.
Solution – multiple processes
[Figure: timeline of Thread A and Thread B, each running a = list(); a.append(1); item = a.pop()]
Multiprocessing
One way around the GIL is to use the multiprocessing module.
Convenient methods to:
• Create individual child processes executing a function
• Create and use a pool of processes
• Perform a ‘map’ from inputs to output using multiple processes
• Share data between processes using shared memory objects
• Run a server process holding shared objects that can be manipulated by workers
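The shared-memory item in this list is not shown elsewhere in the slides, so here is a minimal sketch using `multiprocessing.Value`: four processes increment one counter that lives in shared memory. The function name `add_many` and the counts are arbitrary; the lock is needed because `counter.value += 1` is a read followed by a write.

```python
import multiprocessing


def add_many(counter, times):
    """Increment a shared counter, taking its lock for each update."""
    for _ in range(times):
        with counter.get_lock():
            counter.value += 1


if __name__ == "__main__":
    counter = multiprocessing.Value("i", 0)  # a C int held in shared memory
    workers = [
        multiprocessing.Process(target=add_many, args=(counter, 1000))
        for _ in range(4)
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    print(counter.value)
```

Unlike an ordinary list or integer passed to a child process, changes made to the shared `Value` are visible to the parent after the workers finish.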
Multiprocessing – Direct Creation & Management
multiproc_test.py
import multiprocessing
import os
import random

def list_append(count, out_list):
    # Appends a random number to the list 'count' number
    # of times. A CPU-heavy operation!
    print("%d is working" % os.getpid())
    for i in range(count):
        out_list.append(random.random())

if __name__ == "__main__":
    size = 10000000  # Number of random numbers to add
    procs = 2  # Number of processes to create
    # Create a list of processes and define work for each process
    process_list = []
    for i in range(0, procs):
        out_list = list()
        process = multiprocessing.Process(target=list_append,
                                          args=(size, out_list))
        process_list.append(process)
    # Start the processes (i.e. calculate the random number lists)
    for p in process_list:
        p.start()
    # Wait until all of the processes have finished
    for p in process_list:
        p.join()
    print("List processing complete.")
Multiprocessing – Map on an iterable object
multip.py
from multiprocessing import Pool
import time

def f(x):  # do some tedious work
    for i in range(10000):
        a = x * x
    return a

if __name__ == '__main__':
    pool = Pool(processes=10)  # start a pool of 10 workers
    n = 10000
    results = pool.apply_async(f, (n,))  # run f(n) once in a worker, without blocking
    print(results)  # apply_async returns an AsyncResult object immediately
    while not results.ready():
        print("results are coming ...")
        time.sleep(1)
    print("results are ready: %s" % results.ready())
    ts = time.time()  # we time how long it takes to run with one process
    a = list(map(f, range(n)))
    te = time.time()
    a = pool.map(f, range(n))  # use 10 pool workers to run the task
    tep = time.time()
    print("10k calls take %s sec" % (te - ts))
    print("10k calls with multiprocessing take %s sec" % (tep - te))
MPI
An interface for parallel computation using message passing between processes
Small set of instructions, but quite complex to use
mpi4py – MPI wrappers for Python
Install the module
$ pip install mpi4py
$ module add openmpi/gcc/64/1.6.5
Run the code
$ mpirun -n 10 python hello.py
hello.py
from mpi4py import MPI
import socket
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
msg = comm.bcast('hello')
print("rank %d of %d says %s from host %s" % (rank, comm.size, msg, socket.gethostname()))
mpi4py – Communication of python objects
$ mpirun -n 2 python p2p.py
SLOW! – Python objects must be serialized & deserialized.
Note – send and recv are lower case!
p2p.py
# send message p2p
from mpi4py import MPI
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
a = list(range(100))
if rank == 0:
    data = a
    comm.send(data, dest=1, tag=98)
else:
    data = comm.recv(source=0, tag=98)
if rank == 1:
    print(data)
mpi4py – Communication of numpy arrays
$ mpirun -n 2 python Bcast.py
Faster – numpy arrays can be sent / received as memory buffers, directly by the MPI layer
Note – buffer-based methods such as Bcast start with an upper-case letter!
Bcast.py
# send collective message
from mpi4py import MPI
import numpy as np
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
if rank == 0:
    data = np.arange(100, dtype='i')
else:
    data = np.empty(100, dtype='i')
comm.Bcast(data, root=0)
if rank == 1:
    print(data)
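The difference between the two styles can be seen without MPI at all. The lower-case mpi4py methods pickle a Python object before sending and unpickle it on arrival, while the upper-case methods pass along a buffer that already exists in memory. The sketch below only illustrates those two representations; it does not send anything, and the variable names are arbitrary.

```python
import pickle

import numpy as np

# Lower-case send/recv: the object is pickled, transmitted, then unpickled.
data = list(range(100))
payload = pickle.dumps(data)
restored = pickle.loads(payload)
print(len(payload), restored == data)

# Upper-case Send/Recv/Bcast: the array is already one flat block of bytes,
# so it can be handed to the MPI layer without any conversion step.
array = np.arange(100, dtype="i")
buffer = memoryview(array)
print(buffer.nbytes, buffer.contiguous)
```

The pickling round trip is extra work on both ends for every message, which is where the speed difference comes from.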