Programming Part 3
Introduction to Perl
Perl
• Like so many things in computer science, Perl is an acronym: Practical Extraction and Report Language (you may now forget this)
• Perl has several advantages for us:
– can handle large amounts of data
– includes rich set of functions for analysis of string data, and in particular pattern detection
– Syntax (language rules) relatively flexible; more forgiving of variation than many other programming languages
A simple Perl program
#!/usr/bin/perl -w
# Chapter 1 - Exercise 1
print "Enter single DNA strand: ";
my $dnaseq = <STDIN>;
chomp $dnaseq;
print "\nOpposite strand: ";
for (my $i=0;$i<length($dnaseq);$i++) {
    my $nucleo = substr($dnaseq, $i, 1);
    if ($nucleo eq "A") {print "T";}
    elsif ($nucleo eq "C") {print "G";}
    elsif ($nucleo eq "G") {print "C";}
    else {print "A";}
}
Running the program
What’s going on here?
• Lines that begin with ‘#’ character are comments:
– they provide opportunity to explain something
– they are not code – computer ignores them
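– examples (the first line is a special case: it also tells the operating system to run the file with perl, and -w turns on warnings):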
#!/usr/bin/perl -w
# Chapter 1 - Exercise 1
Output statements
• The print command sends output to the screen
• The data to be printed appears in quotes after the name of the command; examples:
print "Enter single DNA strand: ";
print "\nOpposite strand: ";
print "T";
Input statements and assignment
• Input statements read data from external sources; in our example, we’re reading input from the keyboard; example:
my $dnaseq = <STDIN>;
• Assignment statements assign values to variables; the statement above includes an assignment operation, as do the statements below:
my $i=0;
my $nucleo = substr($dnaseq, $i, 1);
Declaring variables
• There are three kinds of variables in Perl; they include:
– Scalar variables (declared with $)
– Arrays (declared with @)
– Hashes (declared with %)
• Variables may be declared and assigned values in the same statement; this is the case with all of the examples so far
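As a minimal sketch of the three kinds of variables, the lines below declare one of each and assign a value in the same statement. The names @offsets and %complement and the values stored in them are made up for illustration; only $dnaseq comes from the example program.

```perl
use strict;
use warnings;

# Scalar: holds a single value (name starts with $)
my $dnaseq = "ATTAGCAG";

# Array: holds an ordered list of values (name starts with @)
my @offsets = (3, 17, 42);

# Hash: holds key/value pairs (name starts with %)
my %complement = (A => "T", C => "G", G => "C", T => "A");

print "Sequence: $dnaseq\n";
print "Second offset: $offsets[1]\n";          # prints 17
print "Complement of G: $complement{G}\n";     # prints C
```

Note that a single element of an array or hash is itself a scalar, which is why it is written with $ rather than @ or %.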
Declaring variables
• The line of code below declares a local scalar variable named dnaseq, then assigns it the line typed in at the keyboard (read by the input operator <STDIN>):
my $dnaseq = <STDIN>;
– “my” makes the variable local to the block or file in which it is declared
– “$” makes the variable a scalar – that is, a variable that holds a single value (like a Scratch variable)
– “=” is the assignment operator – we read the symbol as “gets”
– This instruction ends with the “;” character
Data types in Perl
• A scalar variable in Perl can store two kinds of data:
– strings
– numbers (integers and real numbers)
• We can assign either kind of data to any scalar variable, although it is useful to store only one type or the other in an individual variable, which is given a name that reflects the kind of data it will hold - this makes the code less confusing
The chomp command
• When a Perl program reads data from the keyboard, every character entered by the user is read, including the newline character created by pressing the Enter key
• The “chomp” command removes the extraneous newline character from the end of the data
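The effect of chomp can be seen by measuring a string before and after. This sketch does not read from the keyboard; instead it assumes a value of the kind <STDIN> would deliver, namely the typed text followed by a newline.

```perl
use strict;
use warnings;

# Stand-in for a line read from <STDIN>: the text plus the newline from Enter
my $line = "ATTAGCAG\n";

my $before = length($line);   # the newline counts as one character
chomp $line;                  # remove the trailing newline
my $after = length($line);

print "Length before chomp: $before\n";   # prints 9
print "Length after chomp: $after\n";     # prints 8
```

Without chomp, the loop in the example program would also process the newline as if it were a nucleotide.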
Perl control structures
• Perl supports both loops and selection structures
• Our example program contains both; in this case, a multiway selection structure contained within a loop:
for (my $i=0;$i<length($dnaseq);$i++) {
    my $nucleo = substr($dnaseq, $i, 1);
    if ($nucleo eq "A") {print "T";}
    elsif ($nucleo eq "C") {print "G";}
    elsif ($nucleo eq "G") {print "C";}
    else {print "A";}
}
Operations on strings
• Perl is ideally suited for bioinformatics programming because of its rich set of built-in operations on string data; two of these operations are used in the loop: length($dnaseq) and substr($dnaseq, $i, 1)
• The length operation tells the program the number of characters in the string; we use this to tell when the loop should end
• The substr operation tells the program the content of a segment of the original string
Substrings
• A substring is a section of a string
• Substrings can be any length, from 1 character to the entire length of the original string
• The substr operation (or function) takes in 3 data items (the original string, the starting position of the substring, and the length of the substring) and gives back one: the actual substring found at the given position, of the given length; positions are counted from 0, so the first character is at position 0
Examples
• Suppose we have the following variable:
my $name = "Cathleen Mary Ruth Sheller";
# it really is!
then these expressions:      represent these substrings:
substr($name, 1, 3)          "ath"
substr($name, 9, 4)          "Mary"
substr($name, 12, 5)         "y Rut"
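The substring examples can be checked by running them. The short script below is only a sketch; the variable names $first, $second and $third are introduced here for illustration.

```perl
use strict;
use warnings;

my $name = "Cathleen Mary Ruth Sheller";

# substr(string, starting position, length); positions start at 0
my $first  = substr($name, 1, 3);
my $second = substr($name, 9, 4);
my $third  = substr($name, 12, 5);

print "[$first]\n";    # prints [ath]
print "[$second]\n";   # prints [Mary]
print "[$third]\n";    # prints [y Rut]
```

The square brackets in the output make any spaces inside a substring visible.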
The for loop
• A for loop is an example of a count-controlled loop; that is, one that repeats a certain number of times
• The structure of the loop is as follows:
for (my $i=0;$i<length($dnaseq);$i++) {
    # body of loop here
}
– we start by declaring and initializing the counter, or control variable: my $i=0;
– we then test the loop condition; in this case, we want to know if the counter is still less than the number of characters in the dnaseq string: $i<length($dnaseq);
– if the test succeeds, we perform the code in the body of the loop – that is, the statements between the two braces: { … }
– finally, we increment the counter: $i++ and repeat the test; the loop ends the first time the test fails
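To watch the counter change, the body of the loop can simply print it. This is a sketch using a short made-up sequence; the variable $visited is an illustrative addition that records each counter value in turn.

```perl
use strict;
use warnings;

my $dnaseq = "ATTAG";
my $visited = "";   # collects the counter values, for illustration only

for (my $i = 0; $i < length($dnaseq); $i++) {
    my $nucleo = substr($dnaseq, $i, 1);
    print "i = $i, nucleotide = $nucleo\n";
    $visited .= $i;
}

print "Counter values seen: $visited\n";   # prints 01234
```

The string has 5 characters, so the body runs for counter values 0 through 4 and the loop stops when $i reaches 5.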
The selection structure
• The code below:
if ($nucleo eq "A") {print "T";}
elsif ($nucleo eq "C") {print "G";}
elsif ($nucleo eq "G") {print "C";}
else {print "A";}
represents a selection structure; the first part of the expression: if ($nucleo eq "A") tests to see if the value in variable nucleo is equal to the string value “A”
if the expression tests true, a T is output to the screen; otherwise, the next expression: elsif ($nucleo eq "C") is tested and, if true, a G is printed
and so on – until the last: else {print "A";} prints out an A if none of the previous expressions tested true
The logic
• As the loop runs, different statements within the selection structure execute
• The next slide shows the loop and selection structure as it executes on the following input string: ATTAGCAG
The logic
dnaseq: ATTAGCAG
for (my $i=0;$i<length($dnaseq);$i++) {
    my $nucleo = substr($dnaseq, $i, 1);
    if ($nucleo eq "A") {print "T";}
    elsif ($nucleo eq "C") {print "G";}
    elsif ($nucleo eq "G") {print "C";}
    else {print "A";}
}
Variable values:     Output:
i    nucleo
0    A               T
1    T               A
2    T               A
3    A               T
4    G               C
5    C               G
6    A               T
7    G               C
Making improvements
• The program correctly produces a DNA string’s complement, provided it is given good data
• What happens if the user types a letter that isn’t A, C, G or T?
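In the program as written, the final else branch prints an A for any character that is not A, C or G, so a typing mistake silently becomes an A in the output. One possible improvement, sketched below, is to test for T explicitly and reserve the else branch for bad data. The helper name complement_base and the choice of "?" as a marker are assumptions made for this sketch, not part of the original program.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the complement of a single base,
# or "?" for any character that is not A, C, G or T.
sub complement_base {
    my ($nucleo) = @_;
    if    ($nucleo eq "A") { return "T"; }
    elsif ($nucleo eq "C") { return "G"; }
    elsif ($nucleo eq "G") { return "C"; }
    elsif ($nucleo eq "T") { return "A"; }
    else                   { return "?"; }
}

my $dnaseq = "ATXG";    # made-up input containing one invalid letter
my $opposite = "";
for (my $i = 0; $i < length($dnaseq); $i++) {
    $opposite .= complement_base(substr($dnaseq, $i, 1));
}
print "Opposite strand: $opposite\n";   # prints TA?C
```

Other designs are equally reasonable, such as stopping with an error message at the first invalid character.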
Another example
#!/usr/bin/perl -w
# Source: Gibbs, Cynthia and Per Jambeck, Developing Bioinformatics Computer Skills,
# O'Reilly, 2001, page 334
my $target = "ACCCTG";
my $search_string =
    'CCACACCACACCCACACACCCACACACCACACCACACACCACACCACACCCACACACA'.
    'CATCTAACACTACCCTAACACAGCCCTAATCTAACCCTGGCCACCTGTCTCTCAACTT'.
    'ACCCTCCATTACCCTGCCTCCACTCGTTACCCTGTCCCATTCAACCATACCATCCGAAC';
my @matches;
foreach my $i (0..length $search_string) {
    if ($target eq substr($search_string, $i, length $target)) {
        push @matches, $i;
    }
}
print "My matches occurred at the following offsets: @matches.\n";
print "done\n";
exit;
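The same search is easier to follow by hand on a short string. The sketch below wraps the loop in a helper named find_offsets (a name invented here) and runs it on made-up data. Unlike the program above, it stops at the last offset where a complete match could still begin; that is a small change for this sketch, not the book's code.

```perl
use strict;
use warnings;

# Hypothetical helper: returns every offset at which $target occurs in $text.
sub find_offsets {
    my ($target, $text) = @_;
    my @matches;
    # Stop at the last offset where a full-length match could still begin.
    foreach my $i (0 .. length($text) - length($target)) {
        if ($target eq substr($text, $i, length($target))) {
            push @matches, $i;
        }
    }
    return @matches;
}

my @found = find_offsets("CG", "ACGTACGT");
print "Matches at offsets: @found\n";   # prints 1 5
```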
Output from example
Extending strings
• This example introduces some new aspects of Perl programming; consider this line:
my $search_string =
'CCACACCACACCCACACACCCACACACCACACCACACACCACACCACACCCACACACA'.
'CATCTAACACTACCCTAACACAGCCCTAATCTAACCCTGGCCACCTGTCTCTCAACTT'.
'ACCCTCCATTACCCTGCCTCCACTCGTTACCCTGTCCCATTCAACCATACCATCCGAAC';
• A string that is too long to fit on a single line is created by concatenation – three lines (in this case) are glued together (with the ‘.’ character) to make a single string
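A smaller version of the same idea, using three short made-up pieces, shows that the result is one ordinary string:

```perl
use strict;
use warnings;

# Three pieces joined with the '.' operator into a single string
my $search_string =
    'ACCCTG' .
    'GGTTAA' .
    'CCATGC';

print "$search_string\n";            # prints ACCCTGGGTTAACCATGC
print length($search_string), "\n";  # prints 18
```

The line breaks and indentation in the source code are not part of the string; only the characters inside the quotes are.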
Arrays
• You may remember the term “array” from our brief discussion of their use in Scratch
– an array is a variable that holds a collection of data
– each individual data element can be accessed using a subscript, or index number (although that isn’t done here)
– index values start at 0, so an array with n elements has indexes 0 .. n-1
• The array variable in this program is declared in the following line of code: my @matches;
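Although the program never uses a subscript, a short sketch shows how one would be written. The array contents here are made up, and the names $count, $first and $last are introduced for illustration.

```perl
use strict;
use warnings;

my @matches = (10, 25, 40);

my $count = scalar(@matches);        # number of elements in the array
my $first = $matches[0];             # index values start at 0
my $last  = $matches[$count - 1];    # the last index is n-1

print "There are $count matches\n";  # prints 3
print "First: $first\n";             # prints 10
print "Last: $last\n";               # prints 40
```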
Arrays
• Two array operations are illustrated in this program:
– The push operation adds a value to the array:
push @matches, $i;
– The print operation, when given an entire array inside a double-quoted string, prints the array contents as a list of values separated by spaces:
print "My matches occurred at the following offsets: @matches.\n";
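Both operations can be tried on their own. In this sketch the pushed values are made up. It also uses the built-in join function, which the example program does not use, to show one way of getting commas between the values: an array placed inside a double-quoted string is printed with a single space between elements by default.

```perl
use strict;
use warnings;

my @matches;             # starts out empty
push @matches, 4;        # each push adds one value to the end
push @matches, 19;
push @matches, 33;

my $spaced = "@matches";             # elements separated by spaces
my $commas = join(", ", @matches);   # elements separated by ", "

print "Offsets: $spaced\n";   # prints Offsets: 4 19 33
print "Offsets: $commas\n";   # prints Offsets: 4, 19, 33
```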