
Python for Scientific Development

1

BioHPC Training

10/19/2016

biohpc-help@utsouthwestern.edu

Python: a popular high-level language

2

Learning Python from scratch

3

http://www.codecademy.com/en/tracks/python

Free, interactive web-based tutorial. Great for new programmers.

http://learnpythonthehardway.org/book/

Free web book. Exercise based, comprehensive.

Python topics to be covered

4

- Virtual Environment & Anaconda

- IPython Notebook

- NumPy/SciPy

- Matplotlib

- Interactive plotting using Bokeh

- Turning a Python package into a web app

Online PySCA – An ongoing case study

5

A published Python package by Ranganathan and Reynolds from the GCSB:
- Command-line based
- Nice IPython Notebook tutorial
- Detailed documentation

Challenges:
- Use from the command line
- Dependencies
- Interactive presentation
- Broader impact

Solution:
- A Python web app

Python: pros

6

A clean, easy to learn language

Huge number of community-created packages

Booming popularity for scientific computing

Python bindings / API for a lot of other software

Open source – Free!

Python: cons

7

Dependency Hell

Affects all modern languages, especially interpreted ones.

Python is especially challenging:

• Huge number of 3rd party packages

• Rapidly changing APIs

• Scientific packages need non-python dependencies.

Solutions: Anaconda, virtualenv, etc.

Anaconda

8

Download for your own machine (free):

http://continuum.io/downloads

Use on BioHPC cluster or clients:

module load python/2.7.x-anaconda

module load python/3.4.x-anaconda

Other python modules are deprecated

Manages python packages AND their non-python dependencies.

Allows creation of multiple environments, with versions you need for specific projects.

Anaconda – Default Environment

9

162 packages, including full scientific python stack

Spyder scientific development environment

IPython notebook ready to run

$ conda list

$ ipython notebook

$ spyder

Python on BioHPC

10

Anaconda – Create Your Own Environment

11

The main module installation must be stable. We won't update packages in it frequently.

The conda tool lets you create your own environments with the versions you need. The virtual environment is installed in $HOME/.conda

# Create a new environment with the latest anaconda package set
conda create -n test anaconda

# See environments available
conda env list

# Start using / switch back to defaults
source activate test
source deactivate

# Remove environment
conda env remove --name test

Anaconda – Create Your Own Environment

12

# Create a minimal environment with only the packages you specify
# Won't install all of the conda package set
conda create -n test1 numpy scipy matplotlib bokeh

# Start using the environment
source activate test1

# Add ipython to this active environment
conda install ipython

# Update the numpy package to the latest version
conda update numpy

# Install a non-conda package from PyPI using pip
conda search colour
pip search colour
pip install colour

SCA – Statistical coupling analysis of protein families

13

Characterizes the pattern of evolutionary constraints on amino acid positions. Given a sequence alignment, SCA measures the functional constraint at each position and the correlations between positions.

Four steps to run the pySCA software, e.g.:

./annotate_MSA.py Inputs/PF00186_full.txt \
    -o Outputs/PF00186_full.an -a 'pfam'

./scaProcessMSA.py Inputs/PF00186_full.an \
    -s 1RX2 -c A -f 'Escherichia coli' -t

./scaCore.py PF00186_full.db -n 'frob' -l 0.03 -t 10

./scaSectorID.py PF00186_full.db -p 0.95

Ipython Notebook

14

NumPy

15

a = range(100000)
for i in range(100):
    for n in a:
        b = n**2

100 loops, elapsed 1.53494286537 sec

import numpy as np
a = np.arange(100000)
for i in range(100):
    b = a**2

100 loops, elapsed 0.0342528820038 sec

The numpy method is 44.8120793223 times faster - much faster!

Notebook-1,2

Python: cons

16

Python is slooooooow…..

Trades execution speed for development speed.

Solution: Move critical portions closer to machine code.

• Directly call C code - Cython (see the sketch below)

• Use modules built on optimized, compiled code, e.g. NumPy builds on BLAS / LAPACK
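A minimal Cython sketch of the "directly call C code" route (the file and function names here are made up for illustration, not course material): declaring C types lets the loop compile to plain C.

# square_sum.pyx - hypothetical example file
# Build in place with:  cythonize -i square_sum.pyx
def square_sum(double[:] x):          # typed memoryview: accepts a numpy float64 array
    cdef Py_ssize_t i
    cdef double total = 0.0
    for i in range(x.shape[0]):       # this loop compiles down to a C loop
        total += x[i] * x[i]
    return total

After building, import square_sum and call square_sum.square_sum(np.arange(1000000.0)); the typed loop runs at compiled speed instead of interpreting each iteration.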

NumPy

17

NumPy performs (multi-dimensional) array arithmetic much faster than native Python objects, by using low-level contiguous arrays and compiled libraries.
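As a quick illustration (not from the slides) of that array arithmetic, a whole 2-D grid can be computed without a single explicit Python loop:

import numpy as np

x = np.linspace(0.0, 1.0, 1000)     # 1-D grid of 1000 points
y = x[:, np.newaxis]                # same grid as a column vector, shape (1000, 1)
z = np.sqrt(x**2 + y**2)            # broadcasting: element-wise work on a (1000, 1000) array
print z.shape, z.mean()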

SciPy and Matplotlib

18

Notebook-3,4

import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from matplotlib import cm
from scipy import *
from scipy.special import jn, jn_zeros   # Bessel functions

def drumhead_height(n, k, distance, angle, t):
    nth_zero = jn_zeros(n, k)
    return cos(t) * cos(n * angle) * jn(n, distance * nth_zero)

theta = r_[0:2 * pi:50j]
radius = r_[0:1:50j]
x = array([r * cos(theta) for r in radius])
y = array([r * sin(theta) for r in radius])
z = array([drumhead_height(1, 1, r, theta, 0.5) for r in radius])

fig = plt.figure()
ax = Axes3D(fig)
ax.plot_surface(x, y, z, rstride=1, cstride=1, cmap=cm.jet)
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_zlabel('z')
plt.show()

Turn pySCA into a website

19

Tasks

- Web interface
- Multiple users
- Run analyses
- Display results to explore

Design

- Django framework
- Database model design
- Simple form interface
- Celery backend execution
- Direct calls to pySCA
- Interactive plotting

Django – the web framework that glues everything together

20

Forms for users

Django turns forms into database entries
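A minimal sketch of that idea (model and field names are made up for illustration, not the actual pySCA web app schema): a Django ModelForm ties the HTML form directly to a database table.

from django.db import models
from django import forms

class ScaJob(models.Model):                          # one database row per submitted analysis
    alignment = models.FileField(upload_to='alignments/')
    pdb_id = models.CharField(max_length=8)          # e.g. '1RX2'
    chain = models.CharField(max_length=2, default='A')
    created = models.DateTimeField(auto_now_add=True)

class ScaJobForm(forms.ModelForm):                   # Django renders and validates the HTML form
    class Meta:
        model = ScaJob
        fields = ['alignment', 'pdb_id', 'chain']

In the view, a single form.save() call creates the ScaJob database entry.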

Celery – distributed task queue

21

User submits a form

Django converts the form into a command and issues an asynchronous Celery task

Celery takes the command and runs the job in its queue
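A minimal sketch of the Celery side (broker URL, module and task names are assumptions, not the actual pySCA web app code): the web view only enqueues the task, and a separate worker process runs it.

# tasks.py (hypothetical)
from celery import Celery
import subprocess

app = Celery('pysca_web', broker='amqp://localhost')   # assumes a local RabbitMQ broker

@app.task
def run_sca(job_id, msa_path):
    # Call the command-line pySCA scripts directly (paths and arguments are illustrative)
    subprocess.check_call(['./scaProcessMSA.py', msa_path, '-s', '1RX2', '-c', 'A', '-t'])
    subprocess.check_call(['./scaCore.py', 'PF00186_full.db'])
    return job_id

# Inside a Django view the task is queued without blocking the request:
#   run_sca.delay(job.id, job.alignment.path)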

Bokeh - Interactive plotting tool

22

Notebook – 5,6

Output can be an HTML file, a page served on the web, or an IPython notebook (see the sketch below)
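A minimal Bokeh sketch (the output filename is an assumption): the same figure object can be written to a standalone HTML page, or shown inline in a notebook via bokeh.io.output_notebook().

from bokeh.plotting import figure, output_file, show

output_file('lines.html')       # write a standalone HTML page
p = figure(title='Simple line', x_axis_label='x', y_axis_label='y')
p.line([1, 2, 3, 4, 5], [6, 7, 2, 4, 5], line_width=2)
show(p)                         # opens the page in a browser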

Take home messages

23

- Python is easy and fun to use.

- Use virtual environments for different projects.

- Use compiled libraries for speed.

- Python has interactive packages for data sharing and exploring.

- Python can do a lot!

Acknowledgements / License

25

pySCA is a package developed by Olivier Rivoire, Kimberly Reynolds, and Rama Ranganathan. cf: Olivier Rivoire, Kimberly Reynolds, and Rama Ranganathan, "Evolution-Based Functional Decomposition of Proteins", PLOS Computational Biology, 12(6): e1004817.

NumPy, SciPy, mpi4py and multiprocessing code was taken from the Texas Advanced Computing Center "HPC Python" course ("HPC Python", Texas Advanced Computing Center, 2015; available under a Creative Commons Attribution Non-Commercial 3.0 Unported License). Slides have been reformatted to UTSW style and the code was changed accordingly. Source: https://portal.tacc.utexas.edu/-/hpc-python

The Bokeh medal example was taken from bokeh.pydata.org

More references: www.cism.ucl.ac.be/Services/Formations//python/2015

Python: cons

26

Global Interpreter Lock

Python can create many threads, but the GIL lets only one execute Python bytecode at a time.

Solution – use multiple processes (a timing sketch follows the diagram below).

Time ↓      Thread A            Thread B

            a = list()          a = list()
            a.append(1)         a.append(1)
            item = a.pop()      item = a.pop()
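A quick timing sketch (not from the slides) makes the point: CPU-bound work in two threads takes roughly as long as running it twice in one thread, because the GIL never lets both threads execute Python bytecode at the same time.

import threading, time

def burn():
    # pure-Python CPU work; the GIL allows only one thread to run it at a time
    s = 0
    for i in xrange(5000000):
        s += i * i

ts = time.time()
burn(); burn()                                     # two calls in a single thread
serial = time.time() - ts

ts = time.time()
threads = [threading.Thread(target=burn) for _ in range(2)]
for t in threads: t.start()
for t in threads: t.join()
threaded = time.time() - ts

print "serial: %.2f s   two threads: %.2f s" % (serial, threaded)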

Multiprocessing

27

One way around the GIL is to use the multiprocessing module.

Convenient methods to:

• Create individual child processes executing a function

• Create and use a pool of processes

• Perform a ‘map’ from inputs to output using multiple processes

• Share data between processes using shared memory objects (see the sketch after this list)

• Run a server process holding shared objects that can be manipulated by workers
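A small sketch of the shared-memory case (not taken from the course notebooks): a Value and an Array live in shared memory, so a child process can update them and the parent sees the result.

import multiprocessing

def worker(counter, data):
    with counter.get_lock():            # the shared Value carries its own lock
        counter.value += 1
    for i in range(len(data)):
        data[i] = data[i] * 2

if __name__ == '__main__':
    counter = multiprocessing.Value('i', 0)              # shared integer
    data = multiprocessing.Array('d', [1.0, 2.0, 3.0])   # shared array of doubles
    p = multiprocessing.Process(target=worker, args=(counter, data))
    p.start()
    p.join()
    print counter.value, list(data)     # -> 1 [2.0, 4.0, 6.0]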

Multiprocessing – Direct Creation & Management

28

multiproc_test.py

import random, os, multiprocessing

def list_append(count, out_list):
    # Appends a random number to the list 'count' number
    # of times. A CPU-heavy operation!
    print os.getpid(), 'is working'
    for i in range(count):
        out_list.append(random.random())

if __name__ == "__main__":
    size = 10000000   # Number of random numbers to add
    procs = 2         # Number of processes to create

    # Create a list of processes and define work for each process
    process_list = []
    for i in range(0, procs):
        out_list = list()
        process = multiprocessing.Process(target=list_append,
                                          args=(size, out_list))
        process_list.append(process)

    # Start the processes (i.e. calculate the random number lists)
    for p in process_list:
        p.start()

    # Wait until all of the processes have finished
    for p in process_list:
        p.join()

    print "List processing complete."

Multiprocessing – Map on an iterable object

29

multip.py

from multiprocessing import Pool
import time

def f(x):
    # do some tedious work
    for i in range(10000):
        a = x * x
    return a

if __name__ == '__main__':
    pool = Pool(processes=10)            # start a pool of 10 workers
    n = 10000
    results = pool.apply_async(f, (n,))  # run one call of f asynchronously
    print results                        # apply_async returns an AsyncResult object, non-blocking

    while not results.ready():
        print "results are coming ..."
        time.sleep(1)
    print "results are ready: ", results.ready()

    ts = time.time()            # time how long 10k calls take with one process
    a = map(f, range(n))
    te = time.time()
    a = pool.map(f, range(n))   # use the 10 pool workers to run the same 10k calls
    tep = time.time()

    print "a 10k calls takes ", te - ts, ' sec'
    print "a multiprocessing 10k calls takes ", tep - te, ' sec'

MPI

30

An interface for parallel computation using message passing between processes

A small set of instructions, but quite complex to use

Mpi4py – MPI wrappers for python

31

Install the module:

$ pip install mpi4py
$ module add openmpi/gcc/64/1.6.5

Run the code:

$ mpirun -n 10 python hello.py

hello.py

from mpi4py import MPI
import socket

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

msg = comm.bcast('hello')
print "rank %d of %d says %s from host %s" % (rank, comm.size, msg, socket.gethostname())

mpi4py – Communication of python objects

32

$ mpirun -n 2 python p2p.py

SLOW! – Python objects must be serialized & deserialized.

Note – send and recv are lower case!

p2p.py

# send message p2p
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = MPI.COMM_WORLD.Get_rank()
a = range(100)

if rank == 0:
    data = a
    comm.send(data, dest=1, tag=98)
else:
    data = comm.recv(source=0, tag=98)

if rank == 1:
    print data

mpi4py – Communication of numpy arrays

33

$ mpirun -n 2 python Bcast.py

Faster – numpy arrays can be sent / received as a memory buffer, directly by the MPI layer

Note – the buffer-based methods (Bcast, Send, Recv) are capitalized!

Bcast.py

# send a collective message
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = MPI.COMM_WORLD.Get_rank()

if rank == 0:
    data = np.arange(100, dtype='i')
else:
    data = np.empty(100, dtype='i')

comm.Bcast(data, root=0)

if rank == 1:
    print data