Page 1: for Scientific Developing - UT Southwestern · 2019-06-27 · for Scientific Developing 1 BioHPC Training 10/19/2016 biohpc-help@utsouthwestern.edu

for Scientific Developing

BioHPC Training

10/19/2016

biohpc-help@utsouthwestern.edu

Python: a popular high-level language

Learning Python from scratch

http://www.codecademy.com/en/tracks/python

Free, interactive web-based tutorial. Great for new programmers.

http://learnpythonthehardway.org/book/

Free web book. Exercise based, comprehensive.

Python topics to be covered

- Virtual Environment & Anaconda
- IPython Notebook
- NumPy/SciPy
- Matplotlib
- Interactive plotting using Bokeh
- Turn a Python package into a web app

Online PySCA – An ongoing case study

A published Python package by Ranganathan and Reynolds from GCSB:
- Command line based
- Nice IPython Notebook tutorial
- Detailed documentation

Challenges:
- Use from command line
- Dependencies
- Interactive presentation
- Broader impact

Solutions:
- A Python web app

Python: pros

A clean, easy to learn language

Huge number of community created packages

Booming popularity for scientific computing

Python bindings / API for a lot of other software

Open source – Free!

Python: cons

Dependency Hell

Affects all modern languages, especially interpreted ones.

Python especially challenging:

• Huge number of 3rd party packages

• Rapidly changing APIs

• Scientific packages need non-python dependencies.

Solutions - Anaconda / virtualenv etc…

Anaconda

Download for your own machine (free):

http://continuum.io/downloads

Use on BioHPC cluster or clients:

module load python/2.7.x-anaconda

module load python/3.4.x-anaconda

Other python modules are deprecated

Manages python packages AND their non-python dependencies.

Allows creation of multiple environments, with versions you need for specific projects.

Anaconda – Default Environment

162 packages, including full scientific python stack

Spyder scientific development environment

IPython notebook ready to run

$ conda list

$ ipython notebook

$ spyder

Python on BioHPC

Anaconda – Create Your Own Environment

The main module installation must be stable. We won't update packages in it frequently.

The conda tool lets you create your own environments with the versions you need. The virtual environment is installed in $HOME/.conda

# Create a new environment with latest anaconda package set
conda create -n test anaconda

# See environments available
conda env list

# Start using / switch back to default
source activate test
source deactivate

# Remove environment
conda env remove --name test

Anaconda – Create Your Own Environment

# Create a minimal environment containing only the listed packages
# Won't install all of the conda package set
conda create -n test1 numpy scipy matplotlib bokeh

# Start using the environment
source activate test1

# Add ipython to this active environment
conda install ipython

# Update the numpy package to the latest version
conda update numpy

# Install a non-conda package from PyPI using pip
conda search colour
pip search colour
pip install colour
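After activating an environment, it can help to confirm from inside Python that the interpreter and packages really come from it. The sketch below is an illustration and not part of the original slides. It assumes Python 3.8 or newer (for importlib.metadata), and the helper name describe_environment is made up for this example.

```python
import sys
from importlib import metadata  # standard library since Python 3.8


def describe_environment(packages):
    """Return interpreter details and installed versions for the given package names."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {
        "prefix": sys.prefix,  # root directory of the active environment
        "python": "%d.%d.%d" % sys.version_info[:3],
        "packages": versions,
    }


if __name__ == "__main__":
    info = describe_environment(["numpy", "scipy", "matplotlib", "bokeh"])
    print("environment prefix:", info["prefix"])
    print("python version:", info["python"])
    for name, version in sorted(info["packages"].items()):
        print("  %-12s %s" % (name, version or "not installed"))
```

If the prefix does not point inside $HOME/.conda (or wherever the environment lives), the environment is probably not active in this shell.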

SCA – Statistical coupling analysis of protein families

Characterize the pattern of evolutionary constraints on amino acid positions. Given sequence alignments, SCA measures functional constraints at each position and the correlations between positions.

4 steps to use pySCA software, e.g.:

./annotate_MSA.py Inputs/PF00186_full.txt \
    -o Outputs/PF00186_full.an -a 'pfam'

./scaProcessMSA.py Inputs/PF00186_full.an \
    -s 1RX2 -c A -f 'Escherichia coli' -t

./scaCore.py PF00186_full.db -n 'frob' -l 0.03 -t 10

./scaSectorid.py PF00186_full.db -p 0.95

IPython Notebook

NumPy

a = range(100000)
for i in range(100):
    for n in a:
        b = n**2

100 loops, lapsed 1.53494286537 sec

import numpy as np
a = np.arange(100000)
for i in range(100):
    b = a**2

100 loops, lapsed 0.0342528820038 sec

The numpy method is 44.8120793223 times faster

Much Faster !!!

Notebook-1,2
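The slide shows the loops but not the timing code. A minimal way to reproduce the comparison is sketched here with the standard timeit module. The function names are invented for this example, the loop version builds a list (the slide only assigns to b), and the numbers will differ from the slide on every machine.

```python
import timeit

import numpy as np


def squares_loop(values):
    """Square each element with an explicit Python-level loop."""
    return [v ** 2 for v in values]


def squares_numpy(array):
    """Square every element in a single vectorized NumPy operation."""
    return array ** 2


if __name__ == "__main__":
    n = 100000
    py_values = list(range(n))
    np_values = np.arange(n, dtype=np.int64)  # int64 avoids overflow when squaring

    # Run each version 100 times, mirroring the 100 loops on the slide.
    t_loop = timeit.timeit(lambda: squares_loop(py_values), number=100)
    t_numpy = timeit.timeit(lambda: squares_numpy(np_values), number=100)

    print("pure Python: %.4f sec" % t_loop)
    print("NumPy:       %.4f sec" % t_numpy)
    print("speed-up:    %.1fx" % (t_loop / t_numpy))
```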

Python: cons

Python is slooooooow…..

Trades execution speed for development speed.

Solution: Move critical portions closer to machine code.

• Directly call C code - Cython

• Use modules built on optimized, compiled code, e.g. NumPy builds on BLAS / LAPACK

NumPy

NumPy performs (multi-dimensional) array arithmetic much faster than native python objects, by using low-level contiguous arrays and compiled libraries.
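What "low-level contiguous array" means can be inspected directly. This small sketch is not from the slides; it only uses standard ndarray attributes, and the array shape is an arbitrary choice.

```python
import numpy as np

# A Python list stores references to separate int objects;
# a NumPy array stores the raw values side by side in one memory block.
values = np.arange(12, dtype=np.int64).reshape(3, 4)

print(values.flags["C_CONTIGUOUS"])   # True when rows are laid out back to back
print(values.dtype, values.itemsize)  # one fixed-size type for every element
print(values.strides)                 # bytes to step to the next row / column
print(values.nbytes)                  # total size of the data buffer

# Arithmetic runs over the whole block in compiled loops, with no Python-level loop.
doubled = values * 2
row_sums = values.sum(axis=1)
```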

SciPy and MatPlotLib

Notebook-3,4

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # registers the '3d' projection on older Matplotlib
from matplotlib import cm
from scipy.special import jn, jn_zeros  # Bessel function and its zeros

def drumhead_height(n, k, distance, angle, t):
    # jn_zeros returns the first k zeros; take the k-th one
    nth_zero = jn_zeros(n, k)[-1]
    return np.cos(t) * np.cos(n * angle) * jn(n, distance * nth_zero)

theta = np.r_[0:2 * np.pi:50j]
radius = np.r_[0:1:50j]
x = np.array([r * np.cos(theta) for r in radius])
y = np.array([r * np.sin(theta) for r in radius])
z = np.array([drumhead_height(1, 1, r, theta, 0.5) for r in radius])

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.plot_surface(x, y, z, rstride=1, cstride=1, cmap=cm.jet)
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_zlabel('z')
plt.show()

Turn pySCA into a website

Tasks
- Web interface
- Multiple users
- Run analysis
- Display results to explore

Design
- Django framework
- Database model design
- Simple form interface
- Celery backend execution
- Direct call to pySCA
- Interactive plotting

Django – web server glues everything

Forms for users

Django turns forms into database entries

Celery – distributed task queue

User submits form

Django converts command, issues async Celery tasks

Celery takes command, runs job in queue
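The submit-then-run-in-the-background flow can be sketched without Django or Celery. The example below is a stand-in, not Celery code: it uses concurrent.futures from the standard library. The names run_analysis and submit_job and the job parameters are hypothetical. In the real app, a Celery worker would presumably pick the task up from a message broker instead of a local thread pool.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Stand-in for the task queue: a local pool of worker threads.
# (Celery would use a message broker and separate worker processes.)
executor = ThreadPoolExecutor(max_workers=2)


def run_analysis(params):
    """Hypothetical long-running job; here it only sleeps and echoes its input."""
    time.sleep(0.1)
    return {"status": "done", "params": params}


def submit_job(form_data):
    """Stand-in for the web view: turn validated form data into a queued job."""
    params = {"alignment": form_data["alignment"], "species": form_data.get("species")}
    return executor.submit(run_analysis, params)  # returns immediately with a Future


if __name__ == "__main__":
    future = submit_job({"alignment": "PF00186_full.txt", "species": "Escherichia coli"})
    while not future.done():  # a web request would return here instead of waiting
        print("job is still running ...")
        time.sleep(0.05)
    print(future.result())
```

The point of the pattern is that submit_job returns at once, so the web page stays responsive while the analysis runs elsewhere.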

Bokeh - Interactive plotting tool

Notebook – 5,6

Output can be an HTML file, a web page, or an IPython notebook
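As a rough illustration of the HTML-file output route (this is not the medal example from the notebooks), the sketch below builds a small line plot and renders it to a standalone page with bokeh.embed.file_html. It assumes Bokeh is installed in the active environment; the function name, data, and output file name are arbitrary choices for this example.

```python
import math

from bokeh.embed import file_html
from bokeh.plotting import figure
from bokeh.resources import CDN


def build_plot_html():
    """Build a small interactive line plot and return it as a standalone HTML page."""
    x = [i * 0.1 for i in range(100)]
    y = [math.sin(v) for v in x]

    plot = figure(title="sin(x)", x_axis_label="x", y_axis_label="sin(x)",
                  tools="pan,wheel_zoom,box_zoom,reset")
    plot.line(x, y, line_width=2)

    # CDN tells Bokeh to load its JavaScript from the public content delivery network.
    return file_html(plot, CDN, "sin(x)")


if __name__ == "__main__":
    with open("sine.html", "w") as handle:
        handle.write(build_plot_html())
    print("wrote sine.html; open it in a browser to pan and zoom")
```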

Take home messages

- Python is easy and fun to use.

- Use virtual environments for different projects.

- Use compiled libraries for speed.

- Python has interactive packages for data sharing and exploring.

- Python can do a lot!

Acknowledgements / License

pySCA is a package developed by Olivier Rivoire, Kimberly Reynolds, and Rama Ranganathan. cf: Olivier Rivoire, Kimberly Reynolds, and Rama Ranganathan, Evolution-Based Functional Decomposition of Proteins, PLOS Computational Biology, 12(6): e1004817.

NumPy, SciPy, mpi4py and multiprocessing code was taken from the Texas Advanced Computing Center hpc-python course. Slides have been reformatted to UTSW style. Code was changed accordingly. Source: https://portal.tacc.utexas.edu/-/hpc-python "HPC Python", Texas Advanced Computing Center, 2015. Available under a Creative Commons Attribution Non-Commercial 3.0 Unported License.

Bokeh medal example was taken from bokeh.pydata.org

More reference: www.cism.ucl.ac.be/Services/Formations//python/2015

Python: cons

Global Interpreter Lock

Can create many threads, but only runs 1 at a time.

Solution – multiple processes

[Figure: timeline of Thread A and Thread B, each executing a = list(); a.append(1); item = a.pop()]
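The effect of the lock can be observed with a small experiment. The sketch below is not from the slides. It runs the same CPU-bound function twice, once sequentially and once in two threads. On a standard CPython build with the GIL, the two timings are usually close, because only one thread executes Python bytecode at a time. Exact numbers depend on the machine and the interpreter build.

```python
import threading
import time


def count_down(n):
    """CPU-bound busy loop written in pure Python."""
    while n > 0:
        n -= 1


def run_sequential(n, repeats):
    """Run count_down `repeats` times, one after another; return elapsed seconds."""
    start = time.perf_counter()
    for _ in range(repeats):
        count_down(n)
    return time.perf_counter() - start


def run_threaded(n, repeats):
    """Run count_down in `repeats` threads at once; return elapsed seconds."""
    threads = [threading.Thread(target=count_down, args=(n,)) for _ in range(repeats)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


if __name__ == "__main__":
    n = 5000000
    print("sequential: %.2f sec" % run_sequential(n, 2))
    print("2 threads:  %.2f sec" % run_threaded(n, 2))
```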

Multiprocessing

One way around the GIL is to use the multiprocessing module.

Convenient methods to:

• Create individual child processes executing a function

• Create and use a pool of processes

• Perform a ‘map’ from inputs to output using multiple processes

• Share data between processes using shared memory objects *

• Run a server process holding shared objects that can be manipulated by workers
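The last two items in the list are the least obvious, so here is a minimal sketch of each. It is not taken from the course material, and the function names are invented. The background is that a plain Python object passed to a child process is copied, so the parent never sees the child's changes. multiprocessing.Value (shared memory) and multiprocessing.Manager (server process) are two standard ways around that.

```python
import multiprocessing


def add_to_total(total, amount):
    """Add `amount` to a shared integer, holding its lock during the update."""
    with total.get_lock():
        total.value += amount


def record_square(results, x):
    """Store x*x in a dict that lives in the manager's server process."""
    results[x] = x * x


if __name__ == "__main__":
    # Shared memory object: a single C integer visible to all child processes.
    total = multiprocessing.Value("i", 0)
    workers = [multiprocessing.Process(target=add_to_total, args=(total, n))
               for n in range(1, 5)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    print("shared total:", total.value)

    # Server process: a Manager holds the dict, workers change it through proxies.
    with multiprocessing.Manager() as manager:
        results = manager.dict()
        workers = [multiprocessing.Process(target=record_square, args=(results, n))
                   for n in range(4)]
        for w in workers:
            w.start()
        for w in workers:
            w.join()
        print("managed results:", dict(results))
```

Shared memory is the lighter option for simple numeric data. The manager is slower but works with richer objects such as dicts and lists.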

Multiprocessing – Direct Creation & Management

multiproc_test.py

import random, os, multiprocessing

def list_append(count, out_list):
    # Appends a random number to the list 'count' number
    # of times. A CPU-heavy operation!
    print("%d is working" % os.getpid())
    for i in range(count):
        out_list.append(random.random())

if __name__ == "__main__":
    size = 10000000  # Number of random numbers to add
    procs = 2        # Number of processes to create
    # Create a list of processes and define work for each process
    process_list = []
    for i in range(0, procs):
        out_list = list()  # each process works on its own copy of this list
        process = multiprocessing.Process(target=list_append,
                                          args=(size, out_list))
        process_list.append(process)

    # Start the processes (i.e. calculate the random number lists)
    for p in process_list:
        p.start()

    # Wait until all of the processes have finished
    for p in process_list:
        p.join()

    print("List processing complete.")

Multiprocessing – Map on an iterable object

multip.py

from multiprocessing import Pool
import time

def f(x):  # do some tedious work
    for i in range(10000):
        a = x * x
    return a

if __name__ == '__main__':
    pool = Pool(processes=10)  # start a pool of 10 workers
    n = 10000
    results = pool.apply_async(f, (n,))  # run f(n) once, asynchronously, in one worker
    print(results)  # apply_async returns an AsyncResult object at once (non-blocking)

    while not results.ready():
        print("results are coming ...")
        time.sleep(1)

    print("results are ready: %s" % results.ready())

    ts = time.time()  # we time how long it takes to run with one process
    a = list(map(f, range(n)))  # list() forces evaluation on Python 3
    te = time.time()
    a = pool.map(f, range(n))  # use 10 pool workers to run the task
    tep = time.time()

    print("10k calls take %s sec" % (te - ts))
    print("10k multiprocessing calls take %s sec" % (tep - te))

MPI

An interface for parallel computation using message passing between processes

Small set of instructions, but quite complex to use

mpi4py – MPI wrappers for Python

Install the module

$ pip install mpi4py
$ module add openmpi/gcc/64/1.6.5

Run the code

$ mpirun -n 10 python hello.py

hello.py

from mpi4py import MPI
import socket

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

msg = comm.bcast('hello')
print("rank %d of %d says %s from host %s" % (rank, comm.size, msg, socket.gethostname()))

mpi4py – Communication of Python objects

$ mpirun -n 2 python p2p.py

SLOW! – Python objects must be serialized & deserialized.

Note – send and receive lower case!

p2p.py

# send message p2p
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
a = list(range(100))
if rank == 0:
    data = a
    comm.send(data, dest=1, tag=98)
else:
    data = comm.recv(source=0, tag=98)

if rank == 1:
    print(data)

mpi4py – Communication of NumPy arrays

$ mpirun -n 2 python Bcast.py

Faster – NumPy arrays can be sent / received as memory buffers, directly by the MPI layer

Note – buffer methods such as Bcast are upper case!

Bcast.py

# send collective message
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
if rank == 0:
    data = np.arange(100, dtype='i')
else:
    data = np.empty(100, dtype='i')

comm.Bcast(data, root=0)

if rank == 1:
    print(data)

