Managing complexity (Advanced Perl)
  Using Perl for specific tasks with help from Bioperl and others
Login
  Username: bioinfouser
  Password: loginbioinfo
Funny?
Goals
  I assume you already know Perl basics; this covers some more advanced features
  Learn how to write OO code
    More flexible modules
    Understand other modules
  Some APIs that you may need
    Bioperl
    Perl DBI
What I assume you already know
  Scalars
  Arrays
  Hashes
  Control structures (if-then, for, foreach, while, etc.)
  File I/O
Managing complexity
  By managing complexity:
    Make hard tasks easy(er)
    Perl itself does this
      Regular expressions, text manipulations
    Extensions (modules) do this
  May come at the expense of execution speed
    You may not care
    Consider the big picture
      Development time
      Errors
  Extremely custom software
    Some things need speed
How complex is it now?
  Perl is a very compact language compared with human languages
  Perl is large compared with other programming languages
    TMTOWTDI (There's More Than One Way To Do It)
    Perl has approximately 233 reserved words
    Java has approximately 47 reserved words
  Both are easy to learn, harder to use effectively
General practices
  Always use #!/usr/bin/perl -w or use warnings;
  Consider use strict; for scripts longer than 10 lines
  You can’t have too many comments
    #
    =head ... =cut
    perldoc
Getting values into the program or subroutine
  Perl behaves as pass by value once the arguments are copied out of @_ (the elements of @_ themselves are aliases to the caller's variables)
    A scalar can have as a value a reference (a "pointer") to an array, hash, function, etc.
  The args to a function arrive in a special variable called @_ (command-line args to a program arrive in @ARGV)
    my $first_value = shift @_;
    my $first_value = $_[0];
    my $first_value = shift;
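
A minimal sketch of the three ways of reading the first argument listed above, plus the difference between copying a value out of @_ and writing through it. The subroutine names are hypothetical and chosen only for illustration.

```perl
use strict;
use warnings;

# Three equivalent ways to read the first argument of a subroutine.
sub first_by_explicit_shift { my $first_value = shift @_; return $first_value; }
sub first_by_index          { my $first_value = $_[0];    return $first_value; }
sub first_by_shift          { my $first_value = shift;    return $first_value; }

# Copying out of @_ gives the sub its own value; the caller's variable is untouched.
sub shout_copy {
    my $text = shift;
    $text = uc $text;
    return $text;
}

# Writing through $_[0] changes the caller's variable, because @_ aliases it.
sub shout_in_place {
    $_[0] = uc $_[0];
    return;
}

my $dna  = "atg";
my $copy = shout_copy($dna);    # $dna is still "atg" here
print "$dna $copy\n";
shout_in_place($dna);           # $dna itself is now upper case
print "$dna\n";
```

Copying with shift (or my ($x, $y) = @_;) is the usual habit, since it keeps a subroutine from changing its caller's data by accident.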
References
  my @array = ("one", "two", "three", "four");
  function_call(@array);
  function_call(\@array);
  function_call(["one", "two", "three"]);
  sub function_call {
      my $passed = shift @_;
      print $passed;
  }
  Output
  oneARRAY(0x80601a0)ARRAY(0x804c9a0)
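
Printing a reference only shows its type and address, so the next step is usually to dereference it. The following is a small sketch of that step; the subroutine names and the %gene hash are assumptions made for the example (the gene name and aliases are borrowed from the schema slide later in these notes).

```perl
use strict;
use warnings;

my @array = ("one", "two", "three", "four");

# @{ $ref } dereferences the whole array; $ref->[N] reaches one element.
sub count_items {
    my $array_ref = shift;
    return scalar @{$array_ref};
}

sub second_item {
    my $array_ref = shift;
    return $array_ref->[1];
}

# References nest: a hash value can itself be an array reference.
my %gene = (
    name    => "ATP7B",
    aliases => [ "Wilson disease-associated protein", "Copper-transporting ATPase 2" ],
);

sub alias_count {
    my $gene_ref = shift;
    return scalar @{ $gene_ref->{aliases} };
}

print count_items(\@array), "\n";                    # named array passed by reference
print second_item([ "one", "two", "three" ]), "\n";  # anonymous array
print alias_count(\%gene), "\n";
```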
Debugging complex data structures
  Print the reference
    It will tell you a little bit of information
  Use the Data::Dumper module
    This will give you a snapshot of the whole data structure
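
A short sketch contrasting the two debugging approaches above. Data::Dumper ships with Perl; the %gene structure is made up for the example.

```perl
use strict;
use warnings;
use Data::Dumper;

# Sort hash keys so the dump looks the same on every run.
$Data::Dumper::Sortkeys = 1;

my %gene = (
    name       => "ATP7B",
    aliases    => [ "Wilson disease-associated protein", "Copper-transporting ATPase 2" ],
    references => { RefSeq => "NM_000053", LocusLink => 540 },
);

# Printing the reference only shows its type and memory address.
print "Reference: ", \%gene, "\n";

# Dumper returns a string containing the whole nested structure.
my $snapshot = Dumper(\%gene);
print $snapshot;
```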
Some more advanced features
Regular expressions
  Not Perl specific
  Very useful
  What they do:
    String comparisons
    String substitutions
    Substring selection
Regex
  $string =~ /find/
  $string =~ /find$/
  $string =~ /^find/
  $string =~ /^find$/
  (Could put 'm' before the first slash, as in m/find/)

  .   Match any character
  \w  Match "word" character (alphanumeric plus "_")
  \W  Match non-word character
  \s  Match whitespace character
  \S  Match non-whitespace character
  \d  Match digit character
  \D  Match non-digit character
  \t  Match tab
  \n  Match newline
  \r  Match return
Repetition
  $string =~ /(ti){2}/
  $string =~ /A*T+G?C{3}A{3,}T{4,6}/
Character classes
  $string =~ /[ATGCN]/
  $string =~ /[^ATGCNatgcn]/i
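
A small sketch that exercises the anchors, repetition counts and character classes from the last three slides. The test strings and the has_only_dna_letters helper are hypothetical, picked only to make each pattern succeed or fail visibly.

```perl
use strict;
use warnings;

# Anchors: ^ ties the match to the start of the string, $ to the end.
my $exact  = ("find"   =~ /^find$/) ? 1 : 0;   # whole string is "find"
my $prefix = ("finder" =~ /^find/)  ? 1 : 0;   # starts with "find"
my $suffix = ("finder" =~ /find$/)  ? 1 : 0;   # does not end with "find"

# Repetition: {2} means exactly twice; * + ? {3,} {4,6} set other counts.
my $two_ti = ("titin" =~ /(ti){2}/) ? 1 : 0;
my $motif  = ("TCCCAAATTTT" =~ /A*T+G?C{3}A{3,}T{4,6}/) ? 1 : 0;

# Negated character class: true only if no character falls outside A, T, G, C, N.
sub has_only_dna_letters {
    my $sequence = shift;
    return ($sequence =~ /[^ATGCN]/i) ? 0 : 1;
}

print "exact=$exact prefix=$prefix suffix=$suffix two_ti=$two_ti motif=$motif\n";
print has_only_dna_letters("ATGNatg"), " ", has_only_dna_letters("ATGX"), "\n";
```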
Selection/Replacement
  $string =~ /(A{3,8})/; print $1;
  $string =~ s/a/A/
  $string =~ tr/atgc/ATGC/
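
A sketch of capture, substitution and transliteration on one made-up sequence. Note that tr takes plain character lists rather than a regex character class, so no square brackets are needed.

```perl
use strict;
use warnings;

my $string = "GCAAAAATTgca";

# Selection: parentheses capture the matched run of 3 to 8 A's into $1.
my $poly_a = "";
if ($string =~ /(A{3,8})/) {
    $poly_a = $1;
}

# Replacement: s/// without /g changes only the first match (case-sensitive).
(my $first_only = $string) =~ s/a/A/;

# Transliteration: tr maps characters one-for-one.
(my $upper = $string) =~ tr/atgc/ATGC/;

# With an empty replacement list, tr just counts the listed characters.
my $gc_count = ($string =~ tr/GCgc//);

print "$poly_a $first_only $upper $gc_count\n";
```

The (my $copy = $string) =~ ... form copies the string first, so the original is left unchanged.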
Additional syntax
  $string =~ /AT*?AT/
  $string =~ m#/var/log/messages#
  $_ = "ATATATAGTGTGCGTGATATGGG";
  ($one,$two,$three) = /AT..AT/g;
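
A sketch of the three ideas on this slide: non-greedy quantifiers, alternative delimiters, and a /g match in list context, which returns every non-overlapping match. The short test strings are invented for the example.

```perl
use strict;
use warnings;

# Greedy .* grabs as much as possible; adding ? makes it stop at the first fit.
my ($greedy) = "ATTTATTTAT" =~ /(AT.*AT)/;
my ($lazy)   = "ATTTATTTAT" =~ /(AT.*?AT)/;

# m#...# avoids escaping every slash in a path.
my $path   = "/var/log/messages";
my $is_log = ($path =~ m#^/var/log/#) ? 1 : 0;

# In list context, /g returns all non-overlapping matches.
my $sequence = "ATATATAGTGTGCGTGATATGGG";
my @hits = $sequence =~ /AT..AT/g;

print "$greedy $lazy $is_log ", scalar(@hits), "\n";
```

Because matches may not overlap, this sequence yields a single hit; assigning the result to three scalars, as on the slide, would leave the second and third undefined.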
What is a module
  Two types
    Object-oriented type
      Provides something similar to a class definition
    Remote function call
      Provides a method to import subroutines or variables for the main program to use
Howto: Making a module
  Create a file called workSaver.pm
  ###########
  package workSaver;
  sub doStuff {
      print "Stuff done\n";
  }
  1; # statement that evaluates to true
  ###########
  Now you can use it with "use workSaver;"*
  *Some restrictions apply
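
A single-file sketch of calling into the workSaver package. As an assumption for the sake of a runnable example, the package is declared inline in a block instead of in its own workSaver.pm, and doStuff returns its message rather than printing it so the result can be checked.

```perl
use strict;
use warnings;

# Normally this block would be the contents of workSaver.pm.
{
    package workSaver;

    sub doStuff {
        return "Stuff done\n";
    }
}

# Without an import mechanism, the caller uses the fully qualified name.
my $result = workSaver::doStuff();
print $result;
```

With the code in a separate workSaver.pm on Perl's module search path, the caller would start with use workSaver; and call workSaver::doStuff() in the same way.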
Howto: Making a module cont.
  This method works very well for subroutines that are used in several programs.
  Reduces the "clutter" in your program
  Provides one maintenance point instead of an unknown number.
    Eases bug fixes
  Careful of boundaries
More complete method:
  Allows you to "pollute" the namespace of the original program selectively.
  Makes the use of functions and variables easier
  Still used about the same way as the simple method, but things are clearer
More complete
  package functional;
  use strict;
  use Exporter;
  our @ISA = ("Exporter");
  our @EXPORT = qw();
  our @EXPORT_OK = qw($variable1 $variable2 printout);
  our $VERSION = 2.0;

  our $variable1 = "var1";
  our $variable2 = "var2";
  my $variable3 = "var3";

  sub printout {
      my $passed_variable = shift;
      print "Your variable is $passed_variable mine are $variable1, $variable2, $variable3\n";
  }
  1;
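
A sketch of the calling side: only names listed in @EXPORT_OK and requested on the use line land in the caller's namespace. To keep this a single runnable file, the module is inlined in a BEGIN block and registered in %INC by hand; that trick, the use of Exporter's import method in place of @ISA, and having printout return its string are all assumptions of this example rather than part of the module above.

```perl
use strict;
use warnings;

# Stand-in for functional.pm, compiled before the 'use' line below runs.
BEGIN {
    package functional;
    use Exporter 'import';
    our @EXPORT    = ();
    our @EXPORT_OK = qw($variable1 $variable2 printout);
    our $variable1 = "var1";
    our $variable2 = "var2";
    my  $variable3 = "var3";    # lexical: never visible outside the module

    sub printout {
        my $passed_variable = shift;
        return "Your variable is $passed_variable, mine are $variable1, $variable2, $variable3\n";
    }

    $INC{'functional.pm'} = __FILE__;    # mark the module as already loaded
}

# Ask for exactly the names wanted; $variable2 stays out of this namespace.
use functional qw(printout $variable1);

my $message = printout("mine");
print $message;
print "Imported: $variable1\n";
print "Not imported, still reachable: $functional::variable2\n";
```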
CPAN
  Wouldn’t it be nice to have a place where:
    You could find a bunch of Perl modules
    It would be browsable
    Searchable
    Big pipe for people to download stuff
    Other people would be encouraged to submit fixes and updates
    And it was all free
Sources of modules/information
  www.CPAN.org
  www.bioperl.org
  www.perl.com
  www.cetus-links.org/oo_infos.html
Bioperl
  Set of modules that are extremely useful for working with biological data. Actively maintained.
  www.bioperl.org is a very good place to get the basics of Bioperl
  We will go through an example to see a typical use
  Bioperl has several basic types of objects:
    Seq: a sequence, the most common type (Bio::Seq)
    Location objects: where it is, how long it is, etc.
    Interface objects: Bio::xyzI; no implementation, mostly documentation
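
A minimal sketch of working with a Bio::Seq object, assuming Bioperl is installed. The sequence and its id are made up for the example.

```perl
use strict;
use warnings;
use Bio::Seq;    # part of Bioperl; must be installed separately

# Build a sequence object from a string (hypothetical 12-base sequence).
my $seq = Bio::Seq->new(
    -id       => 'demo_seq',
    -seq      => 'ATGGCCAAGTAA',
    -alphabet => 'dna',
);

my $length  = $seq->length;            # number of bases
my $revcom  = $seq->revcom->seq;       # reverse complement, as a string
my $protein = $seq->translate->seq;    # translation, as a string

print join("\t", $seq->id, $length, $revcom, $protein), "\n";
```

Methods such as revcom and translate return new sequence objects, which is why the example calls ->seq on the result to get the plain string.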
Bioperl documentation
  Several different ways to find out about a module
    perldoc Bio::Seq
    bioperl.org
    /usr/lib/perl5/site_perl/5.8.0/bptutorial.pl 100 Bio::Seq
    Data::Dumper to print the data structure
    Print the variable
Bioperl demo
Why use a database
  Transaction control - only one user can modify the data at any one time.
  Access control - some people can modify data, some can read data, others can create data structures.
  Fast handling of lots of data
  Precise definition of data (mostly).
  Easy to share data resources with others
Many choices
  There are many types: MS Access, Excel (sort of), Sybase, Oracle, Postgres, mSQL, MySQL …
  They each have their niche and function best in certain cases; there is also considerable overlap.
  SQL (Structured Query Language) is a common thread
MySQL is better than YourSQL
  Free on Unix
  Good developer support
    Constant bug fixes and feature additions
  Good scalability to medium size and load, OK performance.
  Easy to install.
  Used by the Ensembl and UCSC genome browsers, so a lot of information is readily available in that format.
Table structure - Schema
  Gene table:       Gene_ID, Name
  Alias table:      Alias_ID, Gene_ID, Alias
  Reference table:  Reference_ID, Gene_ID, Reference, DataSource

  Gene: ATP7B
  Aliases:
    Wilson disease-associated protein
    Copper-transporting ATPase 2
  References:
    Enzyme Commission: 3.6.3.4
    UniGene: Hs.84999
    AffyProbeU133: 204624_at
    AffyProbeU95: 37930_at
    RefSeq: NM_000053
    GenBank: AF034838
    GenBank: U11700
    LocusLink: 540
SQL (MySQL dialect)
  SELECT col_name FROM table WHERE col_name = value;
  SELECT COUNT(*) FROM table WHERE col_name LIKE '%value%';
  SELECT COUNT(DISTINCT col_name) FROM table WHERE col_name IS NOT NULL;
  CREATE, UPDATE, DELETE, INSERT have similar forms
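
A sketch that runs the three SELECT forms above against a few rows of the Reference table from the schema slide. Two assumptions are made so it can run without a server: the statements are sent from Perl through DBI, and an in-memory SQLite database (the DBD::SQLite driver) stands in for MySQL.

```perl
use strict;
use warnings;
use DBI;

# Assumption: in-memory SQLite instead of a MySQL server.
my $dbh = DBI->connect("dbi:SQLite:dbname=:memory:", "", "", { RaiseError => 1 });

$dbh->do("CREATE TABLE Reference (Reference_ID INTEGER PRIMARY KEY, Gene_ID INT,
                                  Reference VARCHAR(50), DataSource VARCHAR(50))");

# A few of the ATP7B references from the schema slide.
my @references = (
    [ 'Hs.84999',  'UniGene' ],
    [ 'NM_000053', 'RefSeq'  ],
    [ 'AF034838',  'GenBank' ],
    [ 'U11700',    'GenBank' ],
);
for my $ref (@references) {
    $dbh->do("INSERT INTO Reference (Gene_ID, Reference, DataSource) VALUES (1, ?, ?)",
             undef, @$ref);
}

# Exact match on a column value.
my ($source) = $dbh->selectrow_array(
    "SELECT DataSource FROM Reference WHERE Reference = 'U11700'");

# Pattern match: % stands for any run of characters.
my ($like_count) = $dbh->selectrow_array(
    "SELECT COUNT(*) FROM Reference WHERE DataSource LIKE '%Gen%'");

# Number of different non-NULL values in a column.
my ($source_count) = $dbh->selectrow_array(
    "SELECT COUNT(DISTINCT DataSource) FROM Reference WHERE DataSource IS NOT NULL");

print "$source $like_count $source_count\n";
$dbh->disconnect;
```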
SQL cont.
  USE database_name
    Also can be specified on the command line with -D
  SHOW TABLES - lists all the tables in that database (also SHOW DATABASES).
  DESCRIBE table_name - lists the columns and datatypes for each column
    or SHOW COLUMNS FROM table_name
More advanced SELECTs
  SELECT (column_list) FROM (table_list) WHERE (constraints) GROUP BY (grouping columns) ORDER BY (sorting columns) LIMIT (limit number);
  SELECT col_name FROM table1, table2 WHERE table1_val = table2_val AND table1_val2 > value;
    Example of an equi-join
Getting the names right
  If you only have one table, you only need to use the column name
  When you are using joins, this may not be adequate.
    If two tables both have a column named primary, you would need to call the column table1.primary or table2.primary
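
A sketch of an equi-join with qualified column names, using the Gene and Alias tables from the schema slide; both have a Gene_ID column, so it has to be written as Gene.Gene_ID or Alias.Gene_ID. As an assumption, the query is run from Perl through DBI against an in-memory SQLite database (DBD::SQLite) standing in for MySQL.

```perl
use strict;
use warnings;
use DBI;

# Assumption: in-memory SQLite instead of a MySQL server.
my $dbh = DBI->connect("dbi:SQLite:dbname=:memory:", "", "", { RaiseError => 1 });

$dbh->do("CREATE TABLE Gene  (Gene_ID INTEGER PRIMARY KEY, Name VARCHAR(50))");
$dbh->do("CREATE TABLE Alias (Alias_ID INTEGER PRIMARY KEY, Gene_ID INT, Alias VARCHAR(100))");

$dbh->do("INSERT INTO Gene (Gene_ID, Name) VALUES (1, 'ATP7B')");
$dbh->do("INSERT INTO Alias (Alias_ID, Gene_ID, Alias)
          VALUES (1, 1, 'Wilson disease-associated protein')");
$dbh->do("INSERT INTO Alias (Alias_ID, Gene_ID, Alias)
          VALUES (2, 1, 'Copper-transporting ATPase 2')");

# Equi-join: rows are paired where the Gene_ID values are equal.
# An unqualified Gene_ID here would be ambiguous.
my $rows = $dbh->selectall_arrayref(
    "SELECT Gene.Name, Alias.Alias
       FROM Gene, Alias
      WHERE Gene.Gene_ID = Alias.Gene_ID
      ORDER BY Alias.Alias_ID
      LIMIT 10");

print join("\t", @$_), "\n" for @$rows;
$dbh->disconnect;
```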
Data types
  INT
    Tinyint    -128 to 127
    Smallint   -32768 to 32767
    Mediumint  -8388608 to 8388607
    Int        -2147483648 to 2147483647
    Bigint     -9223372036854775808 to 9223372036854775807
  FLOAT
    Float      4 bytes
    Double     8 bytes
  CHAR
    Char(n)     character string of length n; n bytes
    Varchar(n)  character string up to n long; L+1 bytes, where L is the actual length
    Text        up to 2^16 - 1 bytes
  BLOBs: Binary Large OBjects
Perl DBI
  Method for Perl to connect to a database (virtually any database) and read or modify data.
  The statements are constructed very similarly to SQL statements that would be entered on the command line, so learning SQL is still necessary
Statements in DBI
  Connect
    Used to establish the initial connection
  Prepare
    Prepare a statement to execute
  Execute
    Execute the statement
  Do
    Prepare a statement that does not return results and execute it
  Fetch
    Several types, used to get returned data
  Disconnect
    Disconnect from the server
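
The steps above can be sketched end to end as follows. As an assumption, an in-memory SQLite database (DBD::SQLite) stands in for MySQL so that no server or login is needed; with MySQL only the connect line would differ. The table and rows come from the schema slide.

```perl
use strict;
use warnings;
use DBI;

# Connect (assumption: in-memory SQLite instead of a MySQL server).
my $dbh = DBI->connect("dbi:SQLite:dbname=:memory:", "", "", { RaiseError => 1 });

# Do: statements that return no rows are prepared and executed in one call.
$dbh->do("CREATE TABLE Reference (Reference_ID INTEGER PRIMARY KEY, Gene_ID INT,
                                  Reference VARCHAR(50), DataSource VARCHAR(50))");
$dbh->do("INSERT INTO Reference (Gene_ID, Reference, DataSource) VALUES (1, 'NM_000053', 'RefSeq')");
$dbh->do("INSERT INTO Reference (Gene_ID, Reference, DataSource) VALUES (1, 'U11700', 'GenBank')");
$dbh->do("INSERT INTO Reference (Gene_ID, Reference, DataSource) VALUES (1, 'AF034838', 'GenBank')");

# Prepare, then execute.
my $sth = $dbh->prepare(
    "SELECT Reference FROM Reference WHERE DataSource = 'GenBank' ORDER BY Reference");
$sth->execute();

# Fetch: one row per call until the result set is exhausted.
my @genbank_ids;
while (my @row = $sth->fetchrow_array) {
    push @genbank_ids, $row[0];
}

# Disconnect.
$dbh->disconnect;

print "@genbank_ids\n";
```

RaiseError makes DBI die on any database error, which saves checking the return value of every call in a short script.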
Types of fetch
  "fetchrow_array"
    Used to fetch an array of scalars each time
    Can also use "fetchrow_arrayref"
  "fetchrow_hashref"
    Used to fetch a reference to a hash indexed by column name (DBI has no plain "fetchrow_hash").
    Slower but cleaner code.
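
A sketch comparing the fetch methods on the same one-row query. As an assumption, an in-memory SQLite database (DBD::SQLite) stands in for MySQL; the 'NAME_lc' argument asks DBI for lower-case hash keys so the code does not depend on how the driver reports column names.

```perl
use strict;
use warnings;
use DBI;

# Assumption: in-memory SQLite instead of a MySQL server.
my $dbh = DBI->connect("dbi:SQLite:dbname=:memory:", "", "", { RaiseError => 1 });
$dbh->do("CREATE TABLE Gene (Gene_ID INTEGER PRIMARY KEY, Name VARCHAR(50))");
$dbh->do("INSERT INTO Gene (Gene_ID, Name) VALUES (1, 'ATP7B')");

my $sth = $dbh->prepare("SELECT Gene_ID, Name FROM Gene ORDER BY Gene_ID");

# fetchrow_array: a list of column values, addressed by position.
$sth->execute();
my @row = $sth->fetchrow_array;
$sth->finish;

# fetchrow_arrayref: the same values behind a reference. DBI may reuse the
# array for the next row, so copy it if it needs to be kept.
$sth->execute();
my @copy = @{ $sth->fetchrow_arrayref };
$sth->finish;

# fetchrow_hashref: values addressed by column name.
$sth->execute();
my $gene = $sth->fetchrow_hashref('NAME_lc');
$sth->finish;

print "$row[1] $copy[1] $gene->{name}\n";
$dbh->disconnect;
```

The hash form survives a change in column order in the SELECT, which is what makes the code cleaner at some cost in speed.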
More advanced statements
  Quote
    Used to properly quote data for use with a prepare statement
    $value = $dbh->quote($blast_result);
  Placeholders
    Speeds up execution, optional
    my $prep = $dbh->prepare("select x from y where z = ?");
    loop_start
      $prep->bind_param(1, $z);
      $prep->execute();
    loop_end
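
A runnable sketch of the placeholder loop and of quote. The Alias table comes from the schema slide; the in-memory SQLite database (DBD::SQLite) standing in for MySQL, the loop values and the quoted string are assumptions of the example.

```perl
use strict;
use warnings;
use DBI;

# Assumption: in-memory SQLite instead of a MySQL server.
my $dbh = DBI->connect("dbi:SQLite:dbname=:memory:", "", "", { RaiseError => 1 });
$dbh->do("CREATE TABLE Alias (Alias_ID INTEGER PRIMARY KEY, Gene_ID INT, Alias VARCHAR(100))");

# Placeholders: prepare once, then execute with different values.
my $insert = $dbh->prepare("INSERT INTO Alias (Gene_ID, Alias) VALUES (?, ?)");
for my $alias ("Wilson disease-associated protein", "Copper-transporting ATPase 2") {
    $insert->execute(1, $alias);    # values can also be passed straight to execute
}

my $prep = $dbh->prepare("SELECT Alias FROM Alias WHERE Gene_ID = ? ORDER BY Alias_ID");
my %aliases_for;
for my $z (1, 2) {
    $prep->bind_param(1, $z);       # fill the first (and only) placeholder
    $prep->execute();
    $aliases_for{$z} = [ map { $_->[0] } @{ $prep->fetchall_arrayref } ];
}

# Quote: escapes a value so it can be pasted into an SQL string safely.
my $value = $dbh->quote("5' UTR");
my ($matches) = $dbh->selectrow_array("SELECT COUNT(*) FROM Alias WHERE Alias = $value");

print scalar(@{ $aliases_for{1} }), " ", scalar(@{ $aliases_for{2} }), " $value $matches\n";
$dbh->disconnect;
```

Values bound to a placeholder are quoted by the driver, so quote is only needed when a value is interpolated directly into the SQL text.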