Perl for Bioinformatics
Lecture 4
Variables - review
• A variable name starts with a $• It contains a number or a text string• Use my to define a variable• Use = to assign a value• Use \ to stop the variable being
interpolated• Take care with variable names and with
changing the contents of variables
Conditional Blocks, review• An if test can be used to control a
command in a conditional block, according to the outcome of a decision made by comparing variables.
• It’s important to keep track of whether variables are strings or numbers. Numbers are compared with ==, strings with eq.
• It’s usual to indent the block to make it easier to read the code
Arrays• An array can store multiple pieces of data. • They are essential for the most useful
functions of Perl. They can store data such as:
– the lines of a text file (e.g. primer sequences)– a list of numbers (e.g. BLAST e values)
• Arrays are designated with the symbol @
my @bases = (“A”, “C”, “G”, “T”);
Converting a variable to an array
split splits a variable into parts and puts them in an array.
my $dnastring = "ACGTGCTA";
my @dnaarray = split //, $dnastring;
@dnaarray is now (A, C, G, T, G, C, T, A)
@dnaarray = split /T/, $dnastring;
@dnaarray is now (ACG, GC, A)
• join combines the elements of an array into a single scalar variable (a string)
$dnastring = join('', @dnaarray);
Converting an array to a variable
which arrayspacer(empty here)
Loops• A loop repeats a bunch of functions until it is done.
The functions are placed in a BLOCK – some code delimited with curly brackets {}
• Loops are really useful with arrays.
• The “foreach” loop is probably the most useful of all:
foreach my $base (@dnaarray) {
print "$base “;
}
• String comparison (is the text the same?)
• eq (equal ) • ne (not equal )
There are others but beware of them!
Comparing strings
Getting part of a string
• substr takes characters out of a string
$letter = substr($dnastring, $position, 1)
which string where in the string
how many letters to take
Combining strings
• Strings can be concatenated (joined).
• Use the dot . operator$seq1= “ACTG”;
$seq2= “GGCTA”;
$seq3= $seq1 . $seq2;
print $seq3;ACTGGGCTA
Making Decisions - review
• The if operator is generally used together with numerical or string comparison operators, inside an (expression).
numerical: ==, !=, >, <, ≥, ≤strings: eq, ne
• You can make decisions on each member of an array using a loop which puts each part of the array through the test, one at a time
More healthy exercise
• Write a program that asks the user for a DNA restriction site, and then tells them whether that particular sequence matches the site for the restriction enzyme EcoRI, or Bam HI, or Hind III.
• Site for EcoR1: GAATTC• Bam H1: GGATCC• Hind III: AAGCTT
• Read in restriction site to variable
• Remove newline character
• Check if variable contains “GAATTC”
• Check if variable contains “GGATCC”
• ..etc.
pseudocode
• Read in sequence to variable
• Remove newline character
• Split sequence in variable to array using “GAATTC”.
• Count and report number of fragments.
• Measure length of fragments and report site positions, adding six for missing sites
What about longer sequences?
Arrays and loops - review
• An array starts with @. It contains multiple bits of data in a list-like format.
• @bases = (“A”, “C”, “G”, “T”);
• You can make decisions on each member of an array using a foreach loop which puts each part of the array through the test, one at a time
Test time, again
• Remember –
keep track of what’s in a variable
don’t over-write a variable with another value, unless you intend to
syntax and case are critical
lines end with a semicolon
brackets and quotes must match.
Opening and closing files
• So we can input large amounts of data, Perl has to read data out of files, and write results into output files
• This is done in two steps
• First, you must give the file a name within the script - this is known as a filehandle
• Use the open command:open MYFILE, ‘exampleprotein.txt’;
Reading a file• Once the file is open, you can read from it, line by
line, using the readline <> operator again
– (put the filehandle between the angle brackets)
• Perl reads files one line at a time, each time you input data from the file, the next line is read:
open FILE1,’exampleprotein.txt’;$line1 = <FILE1>;chomp $line1;$line2 = <FILE1>;
Using loops to read in a file
• The while loop just keeps doing an expression while it’s true. So it will keep reading lines from the file until it runs out.
my $longsequence;open FILE, ‘exampleprotein.txt’;while (my $line = <FILE>){
chomp $line;$longsequence = $longsequence . $line;
}close FILE;
• This reads the whole file, and puts each line into the variable $longsequence one at a time.
• Think about what happens to the ID line….
Now More Fun Excercises
• Read a DNA sequence from a fasta format file
• Calculate the GC content.• What about the non-DNA characters in
the file?>header lines with the name of the sequencecarriage returns !! You know this one.blank spacesN’s or X’s or unexpected letters
• Open file
• Load ID line and sequence into different variables
• Split the sequence into an array of individual letters
• For each letter in the array: – if A, increment “A” counting number variable– If T, increment “T” counting variable– …etc
Pseudocode
Writing to a File• Writing to a file is similar to reading from it• Use the > operator to open a file for writing:
open OUTPUT, ‘>/home/class30/output.txt’;
• This creates a new file with that name, or overwrites an existing file
• Use >> to append text to an existing file• print to the file using the filehandle:
print OUTPUT $myoutputdata;
else
• Instead of just letting the script go on if it fails an if test, you can get it to execute a second block of code if the statement in brackets isn’t true.
Some more stuff you need to know
elsif• You can string a lot of “if”s together using elsif
if ($site eq “GAATTC” {print “EcoR1 site\n”;
} elsif ($site eq “CCATGG” {print “BamHI site\n”;
} elsif ($site eq “AAGCTT”) {print “HindIII site\n”;
} else { #only happens if none of the preceeding are truedie “I can’t find any of the sites I know\n”;
}
Subscripts
• Bioinformatics data often can be made into array format:– multi-line sequence files– Microarray or statistics data in “tab delimited”
format
• You can address part of the array as if it was a variable using a subscript
@numbers = (8, 8, 8, 23984092, 8);print “$numbers[3]\n”;
Please note – the first element is number zero! Second is 1!
Regular Expressions• Sounds odd, doesn’t it? It means a pattern that
the computer can match, in a standard format.
• Very useful in bioinformatics work
• DNA patterns• restriction sites• promoters/transcription factor binding sites• intron splice site
• Protein patterns• conserved domains (motifs)• active sites• structural motifs (membrane spanning, signal peptide, etc.)
The Binding and Match Operators: =~ / /
• The =~ operator binds functions together• The // operator matches things to patterns
It can be translated as “contains”
• The forward slashes contain the pattern to be matched, like this:
if ($dnaseq =~ /GAATTC/) {print “EcoRI site found\n”}
A regular expression..is a joy forever. And a pattern to match:
can be just a text string, such as: /GATC/
it can have alternative characters: /G[AT]TC/
or contain a wildcard that matches any character: /G.TC/
Or be something bizzare: /\/[^\/]*\/\.\./
Perl Regular Expressions
• It never ceases to amaze me what people can do with regular expressions, but you can match pretty much anything you can think of and a lot you can’t:
#man perlrequick
Alternative Characters• Square brackets within the match expression allow
for alternative characters:if ($dna =~ /CAG[AT]CAG/)
• This will match an DNA string that starts with CAG; has A or T in the 4th position, followed by another CAG.
• A vertical line within the /expression/ means “or”; it allows you to look for either of two completely different patterns:
if ($dna =~ /GAAT|ATTC/)
Special characters• Perl has a large set of special characters to use
in regular expressions:
– the dot (.) matches any character– \d matches any digit (a number from 0-9)– \w matches any “word” character
(a letter or a number, not punctuation or space)
– \s matches white space (any amount)– \t matches a tab (useful for tab delimited files)– ^ matches the beginning of a line– $ matches the end of a line
– Knowing this makes you lots of fun at parties.
“Special” characters
• What if you need to match text that contains a special character?
• Aren’t there dots at the end of sentences?
• Now you have to use a backslash (\) to “escape” the special meaning of that character:
if $onewordsentence =~ /\w+ \ ./
-This would match any text that has one or more text characters, followed by a dot.
Finished.
Bringing it together
• So now, when you think about it, you can:
Open a fileCheck whether each line of the file contains a particular
patternRecover part of that lineWrite it out to another file
So.. isn’t that what you wanted to know?But really, it’s very useful combined with the UNIX
command line.
A last exercise?...
• Now we’re getting up to speed with Perl, lets try something more fun:
• Open up a BLAST output file
• Spit out the name of the query sequence, the top hit, and how many hits there were.
Only the beginning
• Sadly, there is much, much more than this to the Power of Perl.
• You can make, create and download other people’s websites
• Make Linux and Windows graphical programs• Do almost anything on the internet• Interact with databases• And much much more
Why won’t I teach you more stuff?
• Whoa!• Programming takes time to learn properly• You’ve got the tools now to get started on a
programming project
• We will go through some more Perl functions in the later classes, especially modules such as Bioperl.
Practice makes perfect
• You can now practice your Perl skills and understand a lot of the books and help files, which are probably more useful.
#man perlintro
#man perlrequick
#perldoc bioperl
• Also, check out Radhika’s resource page
http://compbio.sph.harvard.edu/chb/training/training-resources