Perl for Bioinformatics, 140.636
F. Pineda
1) Please fill out the questionnaire while waiting for class to start.
2) If you did not get permission from the instructor to register for credit, please talk to me after class... it could end badly...
Course mechanics
• The course website is authoritative
  • http://www.pinedalab.org/perl
  • Check the schedule regularly for updates
  • Lecture notes, homeworks, etc. will all be online
• Homework
  • Read and understand the homework policies on the course website
  • Homework submission consists of:
    • HTML page: description of code and results
    • Code
  • Homework is NOT accepted on paper or via email
• Today you must fill out the questionnaire
  • Needed to get a user account
  • Only registered students get user accounts
  • Mark Miller for system issues (not help with class)
This afternoon
• Mark Miller will get you oriented on the cluster and will give out passwords
Historical background
• Perl (Practical Extraction & Report Language) was originally developed by Larry Wall as a system administration tool. His tasks were 90% text manipulation and 10% everything else. Version 1.0 was released in 1987.
• Linux (open source Unix) was originally developed by Linus Torvalds (CS graduate student in Finland). First kernel released in 1991. Allegedly done for the fun of it.
• Both are released under open source licenses (Linux under the GNU GPL; Perl under the GNU GPL or the Artistic License)
Some typical bioinformatics tasks
• I have a huge output file from a BLAST search. What are all the Mus musculus matches with high scores?
• I want to digest (with trypsin) each of the ~900 proteins in a FASTA file and calculate the mass of each fragment. Oh, I also want the results in a format that I can feed to the MASCOT proteomics search engine.
• What’s the most common 6-mer consisting of only A’s and T’s in the P. falciparum genome (14 chromosomes and ~22.9 million nucleotides)?
• I want to automatically update the database on our local BLAST server every weekend with the latest sequences deposited at NCBI.
• Search for potential RNA hairpins in the 14 chromosomes of the Plasmodium falciparum genome.
The common characteristic of these tasks is that they involve ~90% text manipulation and ~10% everything else.
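To make the "mostly text manipulation" point concrete, here is a minimal sketch of the trypsin digestion task. It assumes the common simplified rule (cleave after K or R unless the next residue is P), leaves out the mass calculation, and uses a made-up peptide; the name trypsin_digest is ours, not part of any library.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Cleave a protein sequence after every K or R, except when the next
# residue is P (a simplified model of trypsin specificity).
sub trypsin_digest {
    my ($protein) = @_;
    # Split at zero-width positions: preceded by K or R, not followed by P.
    return split /(?<=[KR])(?!P)/, $protein;
}

# Made-up sequence, for illustration only.
my @fragments = trypsin_digest("MKRPAKLLR");
print join("\n", @fragments), "\n";
```

The entire digestion is a single pattern match, which is the kind of job Perl was designed for.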
A quickie task
---------
A: 40.31 %
T: 40.30 %
C: 9.70 %
G: 9.69 %
total(ATCG only ) = 22853497 nucleotides
total(everything) = 22853764 nucleotides
---------
 1 ATATAT 570577
 2 TATATA 529826
 3 AAAAAA 420232
 4 TTTTTT 417439
 5 TATTAT 147506
 6 ATAATA 146948
 7 TATATT 146711
 8 AATATA 146695
-- snip --
57 ATTTAA 50617
58 ATTAAT 49856
59 TTAAAT 49731
60 TAATTA 45781
61 AATTAA 45278
62 TTAATT 45212
63 AATTTA 45101
64 TAAATT 44553
Question:
What 6-mer containing only A’s and T’s occurs most commonly in the P. falciparum genome?
Answer:
Scan through the ~22.9 million nucleotides of the 14 chromosomes using a 6-nt window and find that ATATAT occurs most frequently.
Runs in a minute or so on a slow laptop
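The sliding-window scan can be sketched roughly as follows. This is not the script that produced the output above: the sub name count_at_kmers is hypothetical, and a short toy string stands in for the chromosomes, which in practice would be read from FASTA files.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count every k-mer made up only of A's and T's, sliding a window one
# nucleotide at a time along the sequence.
sub count_at_kmers {
    my ($sequence, $k) = @_;
    my %count;
    for my $i (0 .. length($sequence) - $k) {
        my $kmer = uc substr($sequence, $i, $k);
        $count{$kmer}++ if $kmer =~ /^[AT]+$/;
    }
    return %count;
}

# Toy sequence standing in for a chromosome (not real genome data).
my %count = count_at_kmers("ATATATATGCTTTTTTAT", 6);

# Report the k-mers from most to least frequent (ties in alphabetical order).
for my $kmer (sort { $count{$b} <=> $count{$a} || $a cmp $b } keys %count) {
    print "$kmer\t$count{$kmer}\n";
}
```

A hash keyed by the 6-mer does all the bookkeeping; with only A and T allowed there are at most 2^6 = 64 distinct keys, which matches the 64 ranked rows in the output above.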
A bigger task: “middleware” for a “pipeline”
• Applications for processing raw data from DNA sequencers
  1. A trace editor to analyze, display, and allow biologists to edit the short DNA read chromatograms from DNA-sequencing machines.
  2. A read assembler to find overlaps between the reads and assemble them in long contiguous sections.
  3. An assembly editor to view the assemblies and make changes in places where the assembler went wrong.
  4. A database to keep track of everything.
• Write a set of scripts that run the applications sequentially and convert the output format of one application into the input format of the next.
see e.g. L. Stein “How Perl Saved the Human Genome Project”
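One glue step of such a pipeline might look like the sketch below. The two formats are stand-ins chosen for illustration (FASTA in, one tab-separated line per record out), and the sub name fasta_to_tab is hypothetical; a real pipeline would convert between whatever formats its applications read and write.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Convert FASTA text into one "identifier<TAB>sequence" line per record.
sub fasta_to_tab {
    my ($fasta) = @_;
    my (@ids, %seq);
    for my $line (split /\r?\n/, $fasta) {
        if ($line =~ /^>(\S+)/) {      # header line: start a new record
            push @ids, $1;
            $seq{$1} = '';
        }
        elsif (@ids) {                 # sequence line: append to current record
            $line =~ s/\s+//g;
            $seq{ $ids[-1] } .= $line;
        }
    }
    return join '', map { "$_\t$seq{$_}\n" } @ids;
}

# Made-up records, for illustration only.
my $fasta = ">read1 first read\nACGT\nTTGA\n>read2\nGGCC\n";
print fasta_to_tab($fasta);
```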
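MBL/TIGR Assembly Pipeline (from the Advanced Genomics and Bioinformatics course)
F. Pineda, 10/31/2002
[Figure: flowchart of the assembly pipeline, with a legend distinguishing files, applications, and format conversions. Labels recoverable from the figure: sample DNA, PUC19, PUC19.splice, ABI 3700 sequencer, .chr, PreTA (Lucy, Phred), .x, .y, phd2fasta, .qual, .seq, RunTA (TIGR Assembler), .asm, ta2ace, .ace, Consed, ace2contig, .contig, Traceviewer, .fasta, RepeatFinder, .repeats, Bambus, .stats, .details, .dot, .mates, Closure, Lab. A "Closed?" decision ends the flow: "No" loops back for more work, "Yes" sends the finished sequence to the database for annotation.]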
Step 1: Got perl?
• Unix, Linux and MacOS systems come with a Perl interpreter.
  • Open a shell window for entering command lines
  • Find the perl interpreter by typing: “whereis perl” (remember this)
  • To find out which version of perl you have, type “perl -v”
• If you insist on working on a Windows PC, you have two choices (neither of which we will support):
  • Cygwin (a “Linux-like” environment with a limited set of tools)
    • http://www.cygwin.com/
  • ActiveState Perl (a native perl interpreter)
    • http://www.activestate.com/activeperl/
Invoking the perl interpreter
• Perl is an application that processes perl statements.
• Feeding perl a one-line command
  • perl -e 'print "hello world\n"'
• Feeding perl a sequence of commands
  • step 1. invoke perl
  • step 2. write the statement(s)
  • step 3. tell perl you are finished writing statements (control-D)
    (this causes perl to interpret the statements)
• Example: use a print statement to print a haiku:

    an old pond
    the sound of a frog jumping
    into water
    - Basho

    print "an old pond\nthe sound of a frog jumping\ninto water\n";
Perl application = Compiler+Virtual Machine
• The perl interpreter is an application consisting of a compiler and a “virtual machine.”
• The perl interpreter accepts input consisting of perl instructions and executes them in a two-step process:
  • A compiler converts source code into instructions for a “virtual machine”
    • Bottom-up parser
    • Top-down optimizer & peephole optimizer
    • Generates opcodes (instructions)
    • Code is compiled each time it is executed: no binaries
  • The Perl “virtual machine” executes the instructions
    • Executes the opcodes
    • Emulates a stack-based computer (like an HP calculator or Forth)
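The two phases can be observed from inside a program. In this small sketch (the variable name @phases is ours), a BEGIN block runs while the file is still being compiled, so its entry is recorded first even though it appears later in the source.

```perl
#!/usr/bin/perl
use strict;
use warnings;

our @phases;

# An ordinary statement: executed by the virtual machine at run time.
push @phases, "run time";

# A BEGIN block is executed as soon as the compiler has parsed it,
# before any ordinary statement in the file is run.
BEGIN { push @phases, "compile time"; }

print "$_\n" for @phases;
```

This prints "compile time" and then "run time".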
Step 2: Got editor?
• The more normal way of using perl is to put perl statements into a text file and then tell perl to process the text file.
• Create programs using a text editor, which allows you to create plain text (ASCII) documents.
• On UNIX and Linux computers, use a powerful text editor like vi, vim, or emacs. If you are working in a GUI environment, you can use gedit. Alternatively, if you are in a command line environment, you can use a basic text editor like pico or nano.
• On a Mac, I highly recommend TextWrangler
  • http://www.barebones.com
• Under Microsoft Windows, use notepad or editplus
  • http://www.editplus.com
• Do not use Word, WordPerfect, LaTeX or any other word processor for writing your programs. Word processors embed formatting commands into the text.
• Both TextWrangler and editplus will allow you to edit files on the teaching server (recommended).
1st perl program: Hello World

    # my first perl program
    print("hello world\n");

    bash> perl helloworld.pl

• Text between ‘#’ and the end of the line is a comment
• print() is a function that prints the text string
• Double quotes delimit a string
• The two characters \n are not interpreted literally. Instead they represent a newline.
• Run the program by invoking the perl interpreter with helloworld.pl as an argument.
Stand-alone perl script
• Three things to remember
  • Add a special first line that will tell bash where to find perl
  • Tell bash it has permission to execute this script (just once)
  • Invoke the script

    #!/usr/bin/perl
    # my first perl script
    print("hello world\n");

    bash> chmod u+x helloperl.pl
    bash> ./helloperl.pl
Character data, Sequences & codes
• 99% of what you will do with perl is to manipulate characters and text files, so you might as well get intimately familiar with character data and how it is represented (coded) on computers.
• Many important biological compounds are polymers. These are compounds of usually high molecular weight consisting of up to millions of repeated linked units, each a relatively light and simple molecule. Different single-letter codes are used to represent the simple molecules in different classes of compounds (e.g. DNA, RNA and proteins).
Biological sequences: representing information contained in biological polymers
[Images: DNA, protein, and RNA (RNA polymerase II transcribing DNA into RNA)]
The fundamental “dogma”
[Diagram: DNA to DNA (replication), DNA to RNA (transcription), RNA to proteins (translation); after the discovery of retroviruses, RNA to DNA (reverse transcription) was added.]
DNA
• Double stranded molecule with a spiral arrangement
• Nucleotides pair by hydrogen bonding
• Purines (A,G) bind to pyrimidines (T,C)
  • A pairs to T with 2 hydrogen bonds
  • G pairs to C with 3 hydrogen bonds
• The strands run opposite to each other (antiparallel)
• 4 letter alphabet: A,T,G,C
• A human genome is about 3 billion base-pairs long spread across 23 chromosomes
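Because the strands are antiparallel and the pairing is fixed (A with T, G with C), the sequence of one strand determines the other. A minimal sketch, assuming the sequence contains only A, C, G and T (the sub name reverse_complement is ours):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the opposite strand, written 5' to 3'.
sub reverse_complement {
    my ($dna) = @_;
    my $rc = scalar reverse $dna;     # antiparallel: read the strand backwards
    $rc =~ tr/ACGTacgt/TGCAtgca/;     # complementary pairing: A<->T, C<->G
    return $rc;
}

print reverse_complement("ATGCGT"), "\n";   # prints ACGCAT
```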
RNA
• Single stranded molecule with complex secondary structure
• Nucleotides pair by hydrogen bonding
• Purines bind to pyrimidines
  • A pairs to U with 2 hydrogen bonds
  • G pairs to C with 3 hydrogen bonds
  • G pairs with U (a wobble pair)
• 4 letter alphabet: A,U,G,C
Proteins
20 letter amino acid alphabet of proteins
Summary of the important polymers & their “alphabets”
• DNA
  • Double stranded (helical)
  • deoxyribose-phosphate backbone
  • 4 letter alphabet (A,T,C,G) represents nucleotides
  • Complementary base pairing (A-T), (C-G)
• RNA
  • Single stranded (messenger RNA)
  • ribose-phosphate backbone
  • 4 letter alphabet (A,U,C,G)
  • Complementary base pairing (A-U), (C-G), (U-G)
• Proteins
  • Single stranded (complex folding structure)
  • 20 letter alphabet (represents amino acid residues)
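Since each polymer has its own alphabet, a character class is enough to make a first guess at what kind of sequence a string holds. This is only a sketch: the sub name guess_sequence_type is hypothetical, the 20 letters are the standard one-letter amino acid codes, and a string such as ACG fits all three alphabets, so the order of the tests (DNA first) is an arbitrary choice.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Guess the polymer type from the letters that appear in the sequence.
sub guess_sequence_type {
    my ($seq) = @_;
    $seq = uc $seq;
    return 'DNA'     if $seq =~ /^[ACGT]+$/;
    return 'RNA'     if $seq =~ /^[ACGU]+$/;
    return 'protein' if $seq =~ /^[ACDEFGHIKLMNPQRSTVWY]+$/;
    return 'unknown';
}

# Made-up sequences, for illustration only.
for my $seq (qw(GATTACA GAUUACA MKWVTFISLL ACGT123)) {
    print "$seq\t", guess_sequence_type($seq), "\n";
}
```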
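Notational and directional conventions
[Diagram: a “gene” on double-stranded DNA, with the sense (+) strand written 5’ to 3’ and the anti-sense (-) strand (the template) written 3’ to 5’; upstream and downstream directions, the promoter, and RNA polymerase are marked. Transcription yields RNA, written 5’ to 3’ with a UTR at each end. Translation by the ribosome yields a protein, written from the N-terminal (amino end) to the C-terminal (carboxyl end).]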
Character sequences in computing
How are characters represented on computers?
‘a’ = 0 1 1 0 0 0 0 1
8 bits = 1 byte = 1 character
Up to 2^8 = 256 characters can be represented with one byte (standard ASCII itself defines only the first 128).
The map from bit patterns to characters is a convention: ASCII
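Perl can show this mapping directly with its built-in functions ord, chr, sprintf and oct. A small sketch (the two sub names are ours):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Character -> 8-bit pattern, via its numeric ASCII code.
sub char_to_bits {
    my ($char) = @_;
    return sprintf("%08b", ord($char));
}

# 8-bit pattern -> character.
sub bits_to_char {
    my ($bits) = @_;
    return chr(oct("0b$bits"));
}

print "a = ", char_to_bits('a'), "\n";                  # a = 01100001
print "01000001 = ", bits_to_char('01000001'), "\n";    # 01000001 = A
```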
Mega, Giga, Tera, Peta and all that
• kilobyte – 10^3 bytes
• megabyte – 10^6 bytes
• gigabyte – 10^9 bytes
• terabyte – 10^12 bytes
• petabyte – 10^15 bytes
ASCII codes
Decimal  Octal  Hex  Binary    Value
-------  -----  ---  --------  -----
048      060    030  00110000  0
049      061    031  00110001  1
050      062    032  00110010  2
051      063    033  00110011  3
052      064    034  00110100  4
053      065    035  00110101  5
054      066    036  00110110  6
055      067    037  00110111  7
056      070    038  00111000  8
057      071    039  00111001  9
058      072    03A  00111010  : (colon)
059      073    03B  00111011  ; (semi-colon)
060      074    03C  00111100  < (less than)
061      075    03D  00111101  = (equal sign)
062      076    03E  00111110  > (greater than)
063      077    03F  00111111  ? (question mark)
064      100    040  01000000  @ (AT symbol)
065      101    041  01000001  A
066      102    042  01000010  B
067      103    043  01000011  C
068      104    044  01000100  D
069      105    045  01000101  E
070      106    046  01000110  F
071      107    047  01000111  G
072      110    048  01001000  H
073      111    049  01001001  I
074      112    04A  01001010  J
075      113    04B  01001011  K
076      114    04C  01001100  L
077      115    04D  01001101  M
078      116    04E  01001110  N
079      117    04F  01001111  O
ASCII codes
Decimal  Octal  Hex  Binary    Value
-------  -----  ---  --------  -----
000      000    000  00000000  NUL (Null char.)
001      001    001  00000001  SOH (Start of Header)
002      002    002  00000010  STX (Start of Text)
003      003    003  00000011  ETX (End of Text)
004      004    004  00000100  EOT (End of Transmission)
005      005    005  00000101  ENQ (Enquiry)
006      006    006  00000110  ACK (Acknowledgment)
007      007    007  00000111  BEL (Bell)
008      010    008  00001000  BS (Backspace)
009      011    009  00001001  HT (Horizontal Tab)
010      012    00A  00001010  LF (Line Feed)
011      013    00B  00001011  VT (Vertical Tab)
012      014    00C  00001100  FF (Form Feed)
013      015    00D  00001101  CR (Carriage Return)
014      016    00E  00001110  SO (Shift Out)
015      017    00F  00001111  SI (Shift In)
016      020    010  00010000  DLE (Data Link Escape)
017      021    011  00010001  DC1 (XON) (Device Control 1)
018      022    012  00010010  DC2 (Device Control 2)
019      023    013  00010011  DC3 (XOFF) (Device Control 3)
020      024    014  00010100  DC4 (Device Control 4)
021      025    015  00010101  NAK (Negative Acknowledgement)
022      026    016  00010110  SYN (Synchronous Idle)
023      027    017  00010111  ETB (End of Trans. Block)
024      030    018  00011000  CAN (Cancel)
025      031    019  00011001  EM (End of Medium)
026      032    01A  00011010  SUB (Substitute)
027      033    01B  00011011  ESC (Escape)
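Rows like these need not be memorized; sprintf can regenerate them for any printable character. A sketch, with a hypothetical helper named ascii_row and the same column order as the tables (decimal, octal, hex, binary, character):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Format one table row for a given character code.
sub ascii_row {
    my ($code) = @_;
    return sprintf("%03d  %03o  %03X  %08b  %s",
                   $code, $code, $code, $code, chr($code));
}

# Print the rows for the digits 0-9 (codes 48 to 57).
print ascii_row($_), "\n" for 48 .. 57;
```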
newline and end-of-line conventions
• How is newline ‘\n’ represented in ASCII? It depends!
  • DOS: carriage-return (ascii 13) AND linefeed (ascii 10)
  • MACINTOSH (classic Mac OS): carriage-return only (ascii 13)
  • UNIX: linefeed only (ascii 10)
• Moving text files across platforms must be done carefully!
  • It will cause mysterious and unexpected problems unless you “fix” the end-of-lines
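Before fixing a file it helps to know which convention it uses. The sketch below counts the three kinds of line ending in a string; the sub name line_ending_style and its return labels are our own, and on a real file you would first read the contents into a string.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Classify the end-of-line convention used in a piece of text.
sub line_ending_style {
    my ($text) = @_;
    my $crlf = () = $text =~ /\r\n/g;        # carriage-return + linefeed pairs (DOS)
    my $cr   = () = $text =~ /\r(?!\n)/g;    # lone carriage returns (classic Mac)
    my $lf   = () = $text =~ /(?<!\r)\n/g;   # lone linefeeds (Unix)
    my $kinds = grep { $_ > 0 } ($crlf, $cr, $lf);
    return 'none'  if $kinds == 0;
    return 'mixed' if $kinds > 1;
    return $crlf ? 'DOS' : $cr ? 'Mac' : 'Unix';
}

print line_ending_style("line one\r\nline two\r\n"), "\n";   # DOS
print line_ending_style("line one\nline two\n"), "\n";       # Unix
```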
Fixing end-of-line conventions
• DOS to UNIX (delete the carriage-return)
  • bash> tr -d '\r' < input.txt > output.txt
• MAC to UNIX (change carriage-return to line-feed)
  • bash> tr '\r' '\n' < input.txt > output.txt
• UNIX to MAC (change line-feed to carriage-return)
  • bash> tr '\n' '\r' < input.txt > output.txt
• MAC to UNIX (using perl)
  • bash> perl -pi -e 's/\r/\n/g' input.txt
You will understand these commands in a few weeks, for now just remember the magical incantations. For more details: http://www.answers.com/topic/newline
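For reference, the same fix can be written as a small Perl sub that handles DOS and classic Mac files in one pass. This is a sketch with a hypothetical name, to_unix_newlines; unlike the plain s/\r/\n/g above, it treats a carriage-return followed by a linefeed as a single line ending, so DOS files do not end up with blank lines.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Convert DOS (CR LF) and classic Mac (CR) line endings to Unix (LF).
sub to_unix_newlines {
    my ($text) = @_;
    $text =~ s/\r\n?/\n/g;   # CR LF or a lone CR becomes a single LF
    return $text;
}

# Text with all three conventions mixed together; prints three clean lines.
print to_unix_newlines("one\r\ntwo\rthree\n");
```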