
Lecture 9: Back to the Basics: Python and Application in Bioinformatics Y.Z. Chen Department of...

Date post: 21-Dec-2015
Page 1: Lecture 9: Back to the Basics: Python and Application in Bioinformatics Y.Z. Chen Department of Pharmacy National University of Singapore Tel: 65-6616-6877;

Lecture 9: Back to the Basics: Python and Application in Bioinformatics

Y.Z. Chen
Department of Pharmacy
National University of Singapore
Tel: 65-6616-6877; Email: [email protected]; Web: http://bidd.nus.edu.sg

Content

• What is Python?

• Python basics

• Application in bioinformatics

Page 2:

Why Programming?

Programming skills are needed for tasks such as:

• Write a program to run the same PubMed search every week and list the new hits for molecular interactions and network regulation.

• Do a BLAST search against sequences which are on your list of proteins with known kinetic data

• Merge results from different searches

• Import data into Excel for plotting

Page 3:

What Programming Tools?

• Popularly used programming tools:

• Programming languages - Perl, Python, C, C++, Java, Visual Basic, PHP, Fortran

• Software libraries - BioPerl, Biopython, and BioJava

• Databases - MySQL, Postgres, Oracle

Page 4:

Statistics of Software Usage

Nature Biotech 25, 390 (2007)

Page 5:

Why Python?

• Suitable for relatively small automated tasks such as search-and-replace over a large number of text files, renaming and rearranging files, writing a small database or a specialized GUI application, and developing simple games

• Faster and easier to develop in than C, C++, or Java

• Simple to use, and available on Windows, Mac OS X, and Unix operating systems

• A real programming language: it offers more structure and support than shell scripts or batch files, more error checking than C, and built-in high-level data types. It is applicable to a much larger problem domain than Awk or even Perl, yet in many cases equally easy to use

• An interpreted language, which can save considerable time during program development because no compilation and linking is necessary

Page 6:

Why Python?

• Lets you split a program into modules that can be reused in other Python programs, and comes with a large collection of standard modules for file I/O, system calls, sockets, and interfaces to graphical user interface toolkits.

• Enables programs to be written compactly and readably, typically much shorter than equivalent C, C++, or Java programs, for several reasons:
  • the high-level data types allow you to express complex operations in a single statement;
  • statement grouping is done by indentation instead of beginning and ending brackets;
  • no variable or argument declarations are necessary.

• Extensible: if you know how to program in C, it is easy to add a new built-in function or module to the interpreter. You can also link the Python interpreter into an application written in C and use it as an extension or command language for that application.
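The three reasons for compactness listed above can be illustrated with a small sketch. It is written in Python 3 syntax (the slides use Python 2), and the names `count_bases` and `distinct` are chosen here purely for illustration.

```python
# Count bases with a dictionary: no declarations, blocks marked by indentation.
def count_bases(seq):
    counts = {}                      # a built-in high-level data type
    for base in seq:                 # the loop body is delimited by indentation
        counts[base] = counts.get(base, 0) + 1
    return counts

# A single statement can express a compound operation:
# build the sorted list of distinct bases in a sequence.
distinct = sorted(set("AATGCCG"))

print(count_bases("AATGCCG"))
print(distinct)
```

An equivalent C program would need type declarations, a hand-written lookup table, and explicit braces around each block.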

Page 7:

What is Python?

Python is a Programming Language

• Started by Guido van Rossum around 1990 (work began in late 1989, with the first public release in 1991) as a way to write software for the Amoeba operating system. Influenced by ABC, which was designed to be easy to learn. It is also very useful for large programs written by expert programmers.

• The word "Python" comes from the comedy troupe "Monty Python." Words and jokes from the skits and movies appear often in Python software, including "spam," "idle," and "grail."

Page 8:

What is Python?

Python Properties

• Interpreted Language
• Interactive mode
• Imperative and "Object-Oriented"
• Cross-platform
• Doesn't try to guess what you mean
• Great for team projects
• Popular for web applications, testing, and XML
• Extremely popular for chemical informatics (but not so much in bioinformatics)

Page 9:

What is Python?

Interactive Mode

• Python has an interactive mode. You can type Python code and see the results immediately. To start Python, open a Unix shell and type "python".

> python
Python 2.3.3 (#1, Jan 29 2004, 22:55:13)
[GCC 3.3.3 [FreeBSD] 20031106] on freebsd5
Type "help", "copyright", "credits" or "license" for more information.
>>>

• At the >>> prompt you can enter Python code.

Page 10:

Python Resources http://python.org/

Page 11:

Python Resources http://www.pasteur.fr/recherche/unites/sis/formation/python/index.html


Page 13:

Example: Using Python as a calculator

>>> 2+3
5
>>> 4+6*8
52
>>> abs(-4)
4
>>> help(abs)
Help on built-in function abs:

abs(...)
    abs(number) -> number
    Return the absolute value of the argument.

>>> 89**34
1902217732808760980190430983601716818363305103120555045416541165041L
>>> print 89**34
1902217732808760980190430983601716818363305103120555045416541165041
>>> "What... is the air-speed velocity of an unladen swallow?"
'What... is the air-speed velocity of an unladen swallow?'
>>> print "What do you mean? An African or European swallow?"
What do you mean? An African or European swallow?

Page 14:

What is Python?

Example: Importing a module

>>> import math
>>> help(math)

Help on module math:

NAME
    math

FILE
    /usr/local/lib/python2.3/lib-dynload/math.so

DESCRIPTION
    This module is always available. It provides access to the
    mathematical functions defined by the C standard.

>>> math.pi
3.1415926535897931
>>> math.sin(math.pi/2.0)
1.0
>>>

Page 15:

What is Python?

Example: Print the Time of Day

>>> import datetime
>>> now = datetime.datetime.now()
>>> now
datetime.datetime(2008, 2, 2, 19, 23, 28, 809434)
>>> print now
2008-02-02 19:23:28.809434
>>> print "Now is", now.strftime("%d-%m-%Y"), "at", now.strftime("%H:%M")
Now is 02-02-2008 at 19:23
>>>

• The notation name1.name2 is called an attribute lookup. In this case, name2 is an attribute of name1 and has some value.

>>> now.day
2
>>> now.year
2008
>>> now.hour
19

Page 16:

Simple Python script

Code:

# file: simple_code.py
import math
import datetime
print "log(1e23) =", math.log(1e23)
print "2*sin(3.1414) = ", 2*math.sin(3.1414)
now = datetime.datetime.now()
print "Now is", now.strftime("%d-%m-%Y"), "at", now.strftime("%H:%M")
print "or, more precisely, %s" % now

Output:

> python simple_code.py
log(1e23) = 52.9594571389
2*sin(3.1414) = 0.000385307177203
Now is 02-02-2008 at 19:55
or, more precisely, 2008-02-02 19:55:43.046953
>

Page 17:

Python Script

Creating Python Script

• A Python program is just a text file. You can use any text (programmer's) editor. There are several on the Linux machines, including vi, XEmacs, Kate, xvim, and nedit. You can also use one of the free IDEs like idle, PyShell, or (under Microsoft Windows) Pythonwin.

Running Python Script

• Option 1: Run the python program from the command line, giving it the name of the script file to run.

> python now.py
Now is 02-02-2004 at 19:55
or, more precisely, 2004-02-02 19:55:43.046953
>

Page 18:

Python Script

Running Python Script

• Option 2: Put the magic comment #!/usr/bin/env python as the very first line in the program.

Code:

#!/usr/bin/env python
# now.py
import datetime
now = datetime.datetime.now()
print "Now is", now.strftime("%d-%m-%Y"), "at", now.strftime("%H:%M")
print "or, more precisely, %s" % now

Make the script executable with chmod +x now.py

> chmod +x now.py

Then run the program as if it's any other Unix program

> now.py
Now is 02-02-2004 at 19:55
or, more precisely, 2004-02-02 19:55:43.046953

Page 19:

Python Statements

Statement examples:

sum = 2 + 2 # this is a statement

name = raw_input("What is your name?") # these are two statements
print "Hello,", name

print "Did you know that your name has", \
      len(name), "letters?" # This is one statement spread across 2 lines

# Another way to extend a statement across several lines
print "Here is your name repeated 7 times:", (
      name * 7 )

Page 20:

Python Statements

Blocks, If and for statements

EcoRI = "GAATTC"
sequence = raw_input("Enter a DNA sequence:")
if EcoRI in sequence:
    print "Sequence contains an EcoRI site" # This is a one-line block

import sys
sequence2 = raw_input("Enter another sequence:")
if len(sequence2) < 100:
    print "Sequence is too small. Throw it back." # a two-line block
    sys.exit(0)

sequences = (sequence, sequence2)
for seq in sequences:
    print "sequence length =", len(seq) # a block ...
    for c in "ATCG":
        print "#%s = %d" % (c, seq.count(c)) # ... with a block inside it
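The `in` test above only says whether a site is present. As a rough sketch of how the same blocks and loops could report where each site lies, the following Python 3 example uses a hypothetical helper `find_sites` and a made-up test sequence; neither comes from the lecture.

```python
def find_sites(sequence, site="GAATTC"):
    """Return the 0-based start position of every occurrence of site."""
    positions = []
    start = sequence.find(site)
    while start != -1:                       # find() returns -1 when nothing is left
        positions.append(start)
        start = sequence.find(site, start + 1)
    return positions

seq = "TTGAATTCAAGGAATTCG"                   # made-up sequence with two EcoRI sites
for pos in find_sites(seq):
    print("EcoRI site at position", pos)
for base in "ATCG":                          # a block with a one-line body
    print("#%s = %d" % (base, seq.count(base)))
```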

Page 21:

Python Objects and Literals

String Literals

# single quotes
'Who said "to be or not to be"?'

# double quotes
"DNA goes from 5' to 3'."

# escaped quotes
"\"That's not fair!\" yelled my sister."
# creates: "That's not fair!" yelled my sister.

# triple quoted strings, with single quotes
'''This one string can go
over several lines'''

# "raw" strings, mostly used for regular expressions
r"\"That's not fair!\" yelled my sister."
# creates: \"That's not fair!\" yelled my sister.

# You can even have raw triple double quoted strings!
r"""So there!"""

Page 22:

Python Objects and Literals

Numeric Literals

123          # an integer
1.23         # a floating point number
-1.23        # a negative floating point number
1.23E45      # scientific notation
0x7b         # hexadecimal notation (decimal 123)
0173         # octal notation (decimal 123)
12+3j        # complex number 12 + 3i (Note that Python uses "j"!)
2147483648L  # a long integer

Page 23:

Python Objects and Literals

List literal

>>> data = [1, 4, 9, 16]
>>> data[0]
1
>>> data[1]
4
>>> data[2] = 7
>>> data
[1, 4, 7, 16]
>>> data[1:3]
[4, 7]
>>>

Page 24:

Python Objects and Literals

Tuple literal

>>> data = (1, 4, 9, 16)
>>> data[1]
4
>>> data[2] = 7
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
TypeError: object doesn't support item assignment
>>>

Dictionary literal

>>> d = {"A": "ALA", "C": "CYS", "D": "ASP"}
>>> print d["A"]
ALA
>>>
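A dictionary like this is typically used as a lookup table. The Python 3 sketch below converts a one-letter peptide string to three-letter codes; the function name `to_three_letter` and the `"???"` placeholder for unknown residues are assumptions made for this example.

```python
three_letter = {"A": "ALA", "C": "CYS", "D": "ASP"}

def to_three_letter(peptide, table=three_letter):
    # dict.get() returns a fallback instead of raising KeyError for unknown keys
    return "-".join(table.get(residue, "???") for residue in peptide)

print(to_three_letter("ACDA"))
print(to_three_letter("ACX"))   # "X" is not in the table
```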

Page 25:

Python Operators

Some operations using numbers

>>> (1+2)**2
9
>>> (2+3*4)/2
7
>>> 7%3 # % is the modulo operator
1
>>> 7 == 7
True
>>>

Page 26:

Python Operators

Some operations using strings

>>> "Andrew" + " " + "Dalke"
'Andrew Dalke'
>>> "*" * 10
'**********'
>>> "My name is %s. What's your name?" % "Andrew"
"My name is Andrew. What's your name?"
>>> "My first name is %s and family name is %s" % ("Andrew", "Dalke")
'My first name is Andrew and family name is Dalke'
>>> "My first name is %(first)s. Is yours also %(first)s?" % \
... {"first": "Andrew", "family": "Dalke"}
'My first name is Andrew. Is yours also Andrew?'
>>> "Andrew" == "Dalke"
False
>>>
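The `%` operator shown above still works in Python 3, which also offers `str.format` and, from Python 3.6, f-strings. The sketch below builds the same sentence three ways; the variable names are illustrative only.

```python
first, family = "Andrew", "Dalke"

# The % operator, as on the slide
with_percent = "My first name is %s and family name is %s" % (first, family)

# str.format fills the {} placeholders in order
with_format = "My first name is {} and family name is {}".format(first, family)

# f-strings (Python 3.6+) evaluate the expressions inside {} in place
with_fstring = f"My first name is {first} and family name is {family}"

print(with_percent)
print(with_percent == with_format == with_fstring)
```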

Page 27:

Python Functions

http://python.org/doc/current/lib/built-in-funcs.html

Page 28:

Python Functions

String Methods

>>> seq = "AATGCCG"
>>> seq.lower()
'aatgccg'
>>> seq.count("A")
2
>>> seq.find("GC")
3
>>> seq.find("gc")
-1
>>> seq.replace("C", "U")
'AATGUUG'
>>> import string
>>> seq.translate(string.maketrans("ATCG", "TAGC"))
'TTACGGC'
>>> # Make the reverse complement
>>> seq.translate(string.maketrans("ATCG", "TAGC"))[::-1]
'CGGCATT'
>>>
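The session above is Python 2. In Python 3, `string.maketrans` no longer exists and the translation table is built with `str.maketrans` instead. A minimal sketch of the same reverse complement follows, with `COMPLEMENT` and `reverse_complement` as names chosen for this example.

```python
# Translation table mapping each base to its complement
COMPLEMENT = str.maketrans("ATCG", "TAGC")

def reverse_complement(seq):
    # translate() swaps each base; [::-1] reverses the resulting string
    return seq.translate(COMPLEMENT)[::-1]

print(reverse_complement("AATGCCG"))
```

Characters not in the table, such as "N", pass through unchanged.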

Page 29:

Python Functions

Special Methods

Some methods are used so often that they have special syntax.

>>> s = "AATGCCGTTTAT"
>>> s[0] # index
'A'
>>> s[1:4] # slice from start index (inclusive) to end index (exclusive)
'ATG'
>>> s[:4] # default beginning is position 0
'AATG'
>>> s[-1] # index from the end
'T'
>>> s[-3:] # default end includes the last character
'TAT'
>>> s[3:-3]
'GCCGTT'
>>> s[::2] # the optional third parameter is the stride
'ATCGTA'
>>> s[::-1] # returns the string, reversed
'TATTTGCCGTAA'
>>>
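Slicing combines naturally with a loop. As an illustration that is not part of the lecture, the Python 3 sketch below cuts a sequence into codons with `s[i:i + 3]`; the helper name `codons` is hypothetical, and dropping a trailing incomplete codon is a choice made for this example.

```python
def codons(seq):
    """Split seq into consecutive 3-letter slices, dropping any incomplete tail."""
    usable = len(seq) - len(seq) % 3          # length rounded down to a multiple of 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]

print(codons("AATGCCGTTTAT"))
print(codons("AATGC"))                        # the trailing "GC" is dropped
```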

Page 30:

Python Processing Command Line Arguments

• When a Python script is run, its command-line arguments (if any) are stored in the list sys.argv.

Code:

#!/usr/bin/env python
# file: echo.py
import sys
print sys.argv

Output:

> chmod +x echo.py
> echo.py tuna
['echo.py', 'tuna']
> echo.py tuna fish
['echo.py', 'tuna', 'fish']
> echo.py "tuna fish"
['echo.py', 'tuna fish']
> echo.py
['echo.py']
>

Page 31:

Python Processing Command Line Arguments

Computing the Hypotenuse of a Right Triangle

Code:

#!/usr/bin/env python
# file: hypotenuse.py

import sys, math

if len(sys.argv) != 3: # the program name and the two arguments
    # stop the program and print an error message
    sys.exit("Must provide two positive numbers")

# Convert the two arguments from strings into numbers
x = float(sys.argv[1])
y = float(sys.argv[2])
print "Hypotenuse =", math.sqrt(x**2+y**2)

Output:

> hypotenuse.py 5 12
Hypotenuse = 13.0
>
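The script above checks only the number of arguments. A slightly more defensive variant, sketched here in Python 3, also rejects non-numeric and non-positive input. The function name `hypotenuse_from_args` and the extra error messages are assumptions for this example, and `math.hypot` replaces the explicit square root.

```python
import math

def hypotenuse_from_args(argv):
    """argv has the same layout as sys.argv: program name, then two numbers."""
    if len(argv) != 3:
        raise SystemExit("Must provide two positive numbers")
    try:
        x, y = float(argv[1]), float(argv[2])
    except ValueError:
        raise SystemExit("Arguments must be numbers")
    if x <= 0 or y <= 0:
        raise SystemExit("Arguments must be positive")
    return math.hypot(x, y)       # same value as math.sqrt(x**2 + y**2)

# In a real script the argument would be sys.argv
print("Hypotenuse =", hypotenuse_from_args(["hypotenuse.py", "5", "12"]))
```

Passing the argument list in explicitly, instead of reading `sys.argv` inside the function, makes the logic easy to try from the interactive prompt.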

Page 32:

Python I/O (Input / Output)

Input

• Text input comes from sys.stdin. It has a method called readline which reads a line of input.

>>> import sys
>>> s = sys.stdin.readline()
This is a line of text. The line ends when I press 'Enter'.
>>> s
"This is a line of text. The line ends when I press 'Enter'.\n"
>>>

• You can also use the raw_input function to get a string from sys.stdin. This function takes an optional argument which is used as the prompt.

>>> name = raw_input("What is your name? ")
What is your name? Andrew
>>> print name, "is a nice name"
Andrew is a nice name
>>>
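In Python 3, `raw_input` was renamed `input`, while `sys.stdin` works as before. File-like objects can also be iterated line by line. The sketch below uses `io.StringIO` as a stand-in for `sys.stdin` so that it runs without keyboard input; `read_lines` is a hypothetical helper.

```python
import io

def read_lines(stream):
    """Return the non-blank lines of a file-like object, without line endings."""
    lines = []
    for line in stream:            # as with readline(), each line keeps its "\n"
        line = line.rstrip("\n")
        if line:
            lines.append(line)
    return lines

fake_stdin = io.StringIO("AATGCCG\n\nGAATTC\n")   # stand-in for sys.stdin
print(read_lines(fake_stdin))
```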

Page 33:

Python I/O (Input / Output)

Output

• Most Python text output goes to the sys.stdout file object. You've been using the print statement, which uses sys.stdout under the covers. Output file handles have a write method which writes a string to the file with no extra interpretation.

>>> a, b, c = 1, 4, 9
>>> print "The first three squares are", a, b, "and", c
The first three squares are 1 4 and 9
>>> print "The first three squares are", a, ",", b, "and", c, "."
The first three squares are 1 , 4 and 9 .
>>> print "The first three squares are %s, %s and %s." % (a, b, c)
The first three squares are 1, 4 and 9.
>>> import sys
>>> sys.stdout.write("The first three squares are %s, %s and %s.\n" %
... (a, b, c))
The first three squares are 1, 4 and 9.
>>>
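In Python 3, `print` is a function, and its `sep`, `end`, and `file` keyword arguments cover most of the spacing issues seen above. The following sketch mirrors the session; the `buffer` object is only there to show the `file=` argument.

```python
import io
import sys

a, b, c = 1, 4, 9

# Default behaviour: items separated by one space, newline appended
print("The first three squares are", a, b, "and", c)

# sep="" gives full control over spacing, so punctuation sits next to the numbers
print("The first three squares are ", a, ", ", b, " and ", c, ".", sep="")

# file= sends the output somewhere other than sys.stdout (here, an in-memory buffer)
buffer = io.StringIO()
print("The first three squares are %s, %s and %s." % (a, b, c), file=buffer)

# write() adds nothing of its own; the newline is already in the buffered text
sys.stdout.write(buffer.getvalue())
```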

Page 34:

Python Applications in Bioinformatics

BLAST output parsing

• BLAST is the most widely used bioinformatics tool for searching large sequence databases. The original BLAST authors expected the output to be read by people only. But many users run BLAST as one step of a larger algorithm and want to automate it, which requires parsers for the various BLAST output flavors (BLASTN, BLASTP, TBLASTX, WU-BLAST, and so on). Such parsers have been developed and included in libraries such as BioPerl, Biopython, and BioJava.

First few lines of the BLASTP output

Page 35:

Python Applications in Bioinformatics

BLAST output parsing

• Getting program version information

• Program reporting the version information of a BLAST file
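The slide's own program is not preserved in this transcript. As a rough Python 3 sketch of the idea, the code below assumes the classic NCBI plain-text layout, in which the first non-blank line looks like `BLASTP 2.2.10 [Oct-19-2004]`. The function name `blast_version` and the sample lines are made up for illustration; other flavors such as WU-BLAST use different version strings and would need their own pattern.

```python
import re

# Program name, whitespace, then a dotted version number such as 2.2.10
VERSION_LINE = re.compile(r"(\S+)\s+(\d+(?:\.\d+)+)")

def blast_version(lines):
    """Return (program, version) from the first non-blank line, or None."""
    for line in lines:
        line = line.strip()
        if not line:
            continue                      # skip leading blank lines
        match = VERSION_LINE.match(line)
        if match:
            return match.group(1), match.group(2)
        return None                       # first real line is not a version line
    return None

# Made-up sample in the assumed layout; a real run would pass an open file
sample = ["\n", "BLASTP 2.2.10 [Oct-19-2004]\n", "\n"]
print(blast_version(sample))
```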

Page 36:

Python Applications in Bioinformatics

BLAST output parsing

• Getting the number of sequences in the database and the number of letters
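The original code for this slide is not in the transcript. One possible approach, sketched in Python 3, is a regular expression over the database summary line. It assumes the classic plain-text wording `N sequences; M total letters` with comma-separated thousands; the name `database_size` and the sample text are invented for this example.

```python
import re

# Matches e.g. "1,234 sequences; 567,890 total letters"
DB_STATS = re.compile(r"([\d,]+)\s+sequences;\s+([\d,]+)\s+total letters")

def database_size(text):
    """Return (number_of_sequences, number_of_letters), or None if not found."""
    match = DB_STATS.search(text)
    if match is None:
        return None
    # Strip the thousands separators before converting to int
    n_sequences, n_letters = (int(g.replace(",", "")) for g in match.groups())
    return n_sequences, n_letters

# Made-up sample in the assumed layout
sample = "Database: demo_db\n           1,234 sequences; 567,890 total letters\n"
print(database_size(sample))
```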

Page 37:

Python Applications in Bioinformatics

BLAST output parsing

• Reading description lines
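The code shown on this slide did not survive extraction. The Python 3 sketch below shows one way to read the description lines, assuming the classic plain-text layout: a header line starting with `Sequences producing significant alignments`, a blank line, then one hit per line ending in a bit score and an E-value. The function name `read_descriptions` and the sample hits are hypothetical, and newer BLAST output with extra columns would need a different split.

```python
def read_descriptions(lines):
    """Return a list of (description, bit_score, e_value) tuples."""
    hits = []
    in_section = False
    for line in lines:
        line = line.rstrip()
        if line.startswith("Sequences producing significant alignments"):
            in_section = True
            continue
        if not in_section:
            continue
        if not line:
            if hits:
                break                 # a blank line after the hits ends the section
            continue                  # the blank line right after the header
        # The last two whitespace-separated fields are the score and the E-value
        description, score, evalue = line.rsplit(None, 2)
        if evalue.startswith("e"):    # older output may print "e-100" for 1e-100
            evalue = "1" + evalue
        hits.append((description.strip(), float(score), float(evalue)))
    return hits

# Made-up sample in the assumed layout
sample = [
    "Sequences producing significant alignments:              (bits) Value",
    "",
    "id1 hypothetical protein one                               100   2e-20",
    "id2 hypothetical protein two                              50.1   e-100",
    "",
    ">id1 hypothetical protein one",
]
for hit in read_descriptions(sample):
    print(hit)
```

For production work, the parsers shipped with Biopython or BioPerl handle far more edge cases than this sketch.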
