Post on 11-Jan-2016
Unix for Bioinformaticists: Unix Tools, Emacs, and Perl
helpdesk at stat.rice.edu
Aug 2004
Some slides are borrowed from Dr. Woely’s (BCM) presentation.
Do I Have to Know/Use Unix?
Simple answer: no. Windows can do almost everything.
Complicated answer: yes, if you
  are lazy (would like to automate things)
  are good at reading manuals and writing scripts
  want to make better use of your machine
  are as poor as I am (cannot afford pricey Windows software)
  especially if you will be a bioinformaticist
Why Unix Is Useful in Bioinformatics
Many tasks involve processing large text-based datasets. Unix tools in many cases are better than their Windows counterparts.
You may need to use several tools to accomplish a task. Windows is not particularly good at gluing them together.
When you need more CPU power, servers and clusters are usually *nix-based.
Many tools are available only under Unix-like systems.
Outline
Unix in general
Unix tools
Emacs
Perl
Unix Commands
Single command:
> sort -k1 file.txt
Combine with other commands:
> sort -k1 file.txt | grep "Tag=Mouse" > output.txt
Operate on multiple files:
> foreach file (*.txt)
sort -k1 $file > ${file:r}_sorted.txt
end
More commands
> rename .html .htm *.html
There are many such convenient tools. If you cannot find one, a script can do the job:
> foreach f (*.html)
mv $f $f:r.htm
end
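The loop above is written in csh syntax. For readers on a Bourne-style shell (sh, bash), a minimal sketch of the same rename might look like the following. The scratch directory and file names are made up for illustration.

```shell
# Work in a throwaway directory so no real files are touched (illustrative setup).
workdir=$(mktemp -d)
touch "$workdir/index.html" "$workdir/about.html"

# Bourne-shell counterpart of the csh foreach loop:
# ${f%.html} strips the trailing ".html", playing the role of csh's $f:r.
for f in "$workdir"/*.html; do
    mv "$f" "${f%.html}.htm"
done

ls "$workdir"
```

The quoting around "$f" keeps the loop safe for file names that contain spaces.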
More commands
> convert -rotate 90 file.jpg file.png
Convert a .jpg file to .png format after rotating 90 degrees.
> wget -r -l1 --no-parent -A.tar.gz -Ppackages http://cran.r-project.org/src/contrib/PACKAGES.html
Download all .tar.gz files to the packages directory. This command can do everything that 'teleport' and similar tools can do under Windows.
A shell script: lyx2pdf
#!/bin/csh
set file = $1:r
lyx --export latex $file.lyx
latex $file.tex
dvips -o $file.ps $file.dvi
ps2pdf $file.ps
> lyx2pdf myfile.lyx
A Makefile
%.html: %.tex
	latex2html -local_icons -no_subdir -split 0 $*.tex
%.tex: %.lyx
	lyx2tex $*.lyx
%.dvi: %.tex
	latex $*.tex
%.ps: %.dvi
	dvips -o $*.ps $*.dvi
%.pdf: %.ps
	ps2pdf $*.ps
> make file.dvi
> make file.ps
> make file.pdf
A Perl Script
#!/usr/bin/perl
# read all the things at once
undef $/;
# read in the file and look for /* */
($comm) = <> =~ /.*\/\*(.*)\*\//ms;
# print comments
print $comm, "\n";
crontab
# do not forget to renew your library books
0 0 15 7 * mail bpeng@rice.edu %subject reminder Renew all the books!
# backup your files to server every day at 6AM
0 6 * * * /usr/local/bin/rsync -avz /home/bpeng thor.stat.rice.edu::backup > logfile
Graphviz
digraph G
{
A->B->C
B->D->C
}
File: try.dot
> dot -Tps try.dot -o try.eps
Useful (and free) tools
Servers: Apache, openssh, openldap
Web: Mozilla/firefox, Konqueror, lynx
Mail clients: Pine, Mutt, Mozilla/thunderbird, kmail, evolution
Text processing: tetex/lyx, open office, koffice
Languages: gcc, Perl, python, gmake, kdevelop
Scientific libraries and tools: GNU Scientific Library, bioPython, bioPerl, R, Graphviz, gnuplot, octave
Misc: VNC, wget,
Unix text-processing tools
Access to Unix
Mac OSX + developers kit
Linux
Stat and ruf/owlnet servers (Solaris)
Windows + cygwin
Tools - in contrast to Excel, faster, operate on larger files
Grep, Pipes, Sort, Comm, Diff, Join
Sed - regular expression substitution editor, replaced by perl in most contexts
Man - to list manual pages with options for most commands (if installed and for the current version)
Grep
Grab lines that match a text phrase
Only the line that matches
Lines before or after the matched line
Lines that do not match
Piping multiple searches
GenBank Files
Grab the Locus, Definition and Keyword lines
[Figure: phase2.txt.out and temp]
Select Non-Human Definition Lines and Use Pipe
[Figure: temp]
kworley% grep -v Homo temp | grep DEF
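Specify Lines to Return
grep -1 (one line of context before and after each match)
grep -B1 (one line before each match)
grep -A1 (one line after each match)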
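The grep options above can be tried on a tiny file. This is only a sketch: the records below are made up and merely imitate the LOCUS/DEFINITION/KEYWORDS layout of a GenBank file, and the file name temp is reused from the slides.

```shell
workdir=$(mktemp -d)
# Made-up records in a GenBank-like layout (not real entries).
cat > "$workdir/temp" <<'EOF'
LOCUS       FAKE0001
DEFINITION  Homo sapiens made-up record
KEYWORDS    example
LOCUS       FAKE0002
DEFINITION  Mus musculus made-up record
KEYWORDS    example
EOF

# Drop every line that mentions Homo, then keep only definition lines.
non_human=$(grep -v Homo "$workdir/temp" | grep DEF)
echo "$non_human"

# One line of context before, and one line after, each matching line.
before=$(grep -B1 'Mus musculus' "$workdir/temp")
after=$(grep -A1 'Mus musculus' "$workdir/temp")
echo "$before"
echo "$after"
```

Here -B1 returns the LOCUS line together with the match, and -A1 returns the match together with the KEYWORDS line that follows it.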
Sort
In dictionary (-d), month (-M), or numerical (-n) order
Ignore case (-f)
Specify output file (-o)
Specify the separator between fields (-t)
Unique lines only (-u)
Specify field on which to sort (-k POS1[,POS2]), fields numbered starting from 1; can specify which character in the field (field.char)
Merge more than one sorted file (-m)
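A short sketch of several of these sort options working together. The file name and its contents are invented for illustration, and LC_ALL=C is set only to make the ordering independent of the local language settings.

```shell
workdir=$(mktemp -d)
# Made-up data: a name and a count separated by ':'.
printf 'geneC:12\ngeneA:7\ngeneB:100\ngeneA:7\n' > "$workdir/counts.txt"

# Sort on the second field as a number (-k2,2n), with ':' as the separator (-t),
# dropping duplicate keys (-u) and writing the result to a file (-o).
LC_ALL=C sort -t: -k2,2n -u -o "$workdir/by_count.txt" "$workdir/counts.txt"
cat "$workdir/by_count.txt"

# Plain text sort on the first field, for comparison.
by_name=$(LC_ALL=C sort -t: -k1,1 "$workdir/counts.txt")
echo "$by_name"
```

Without the n modifier the counts would be compared as text, and 100 would sort before 12 and 7.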
Comm
Select or reject lines in common between two sorted files
Options suppress printing of columns
comm [-123] file1 file2
  Column 1 is lines only in file 1
  Column 2 is lines only in file 2
  Column 3 is lines in both files
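As a rough illustration of the column-suppressing options, the sketch below compares two small sorted lists. The gene names and file names are placeholders.

```shell
workdir=$(mktemp -d)
# Both inputs must already be sorted; these made-up lists are.
printf 'geneA\ngeneB\ngeneC\n' > "$workdir/file1"
printf 'geneB\ngeneC\ngeneD\n' > "$workdir/file2"

# Suppress columns 1 and 2: only lines present in both files remain.
both=$(LC_ALL=C comm -12 "$workdir/file1" "$workdir/file2")
# Suppress columns 2 and 3: only lines unique to file1 remain.
only1=$(LC_ALL=C comm -23 "$workdir/file1" "$workdir/file2")
# Suppress columns 1 and 3: only lines unique to file2 remain.
only2=$(LC_ALL=C comm -13 "$workdir/file1" "$workdir/file2")

echo "both: $both"
echo "only in file1: $only1"
echo "only in file2: $only2"
```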
Diff
Compares two files (or sets of files in a directory) and outputs the lines that differ
Compare as text (-a)
Ignore changes in white space (-b), blank lines (-B), or case (-i)
For directory comparisons
  Report only which files differ, not the details (-q)
  Compare subdirectories recursively (-r)
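A small sketch of how these diff options change the verdict. The directories run1 and run2 and their contents are invented. The example relies on diff's exit status, which is 0 when no differences are found and 1 when some are.

```shell
workdir=$(mktemp -d)
mkdir "$workdir/run1" "$workdir/run2"
# Two made-up files that differ only in spacing and letter case.
printf 'gene  score\nabc 1\n' > "$workdir/run1/a.txt"
printf 'gene score\nABC 1\n' > "$workdir/run2/a.txt"

# A plain comparison sees the spacing and case changes.
if diff "$workdir/run1/a.txt" "$workdir/run2/a.txt" > /dev/null; then
    plain=same
else
    plain=differ
fi

# Ignoring white-space changes (-b) and case (-i), the files match.
if diff -b -i "$workdir/run1/a.txt" "$workdir/run2/a.txt" > /dev/null; then
    relaxed=same
else
    relaxed=differ
fi

# Directory comparison: recurse (-r) and only name the files that differ (-q).
if diff -r -q "$workdir/run1" "$workdir/run2"; then
    dirs=same
else
    dirs=differ
fi

echo "plain=$plain relaxed=$relaxed dirs=$dirs"
```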
Join
Combines lines from two files based on a common field (-1 field -2 field)
Specify the fields from each file and the order to output (-o file_number.field file_number.field file_number.field)
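A minimal sketch of join on two made-up, space-separated files that share a gene ID in their first field. The file names annot.txt and score.txt are hypothetical, and both inputs are already sorted on the join field, which join expects.

```shell
workdir=$(mktemp -d)
printf 'g1 kinase\ng2 transporter\ng3 receptor\n' > "$workdir/annot.txt"
printf 'g1 0.9\ng3 0.4\n' > "$workdir/score.txt"

# Default join: match on the first field of each file; unmatched lines (g2) are dropped.
joined=$(LC_ALL=C join "$workdir/annot.txt" "$workdir/score.txt")
echo "$joined"

# Name the join fields explicitly and choose the output order with -o:
# ID from file 1, score from file 2, annotation from file 1.
picked=$(LC_ALL=C join -1 1 -2 1 -o 1.1,2.2,1.2 "$workdir/annot.txt" "$workdir/score.txt")
echo "$picked"
```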
What is Emacs?
A Unix text editor with additional functionality
Column functions
Settings for DNA mode
Settings for programming mode
Seamless integration with matlab, R, S-Plus, SAS etc.
Emacs Demonstrations
Search and replace
  By query
  All
  New lines
  Counting things
Column functions
  Select
  Kill
  Copy
  Paste
Query replace
Esc %
Replace phrase
With phrase
Designate carriage return with Control-Q Control-J
Y or N
! to replace all
Starting File
Query Replace
End file
Rectangle functions
Mark, select rectangle
Control-x r r a: register (copy) the rectangle as register a
Control-x r k: kill the rectangle
Control-x r i a: insert the previously registered rectangle from register a
Select Rectangle, Kill
Select Rectangle, Mark, Insert
What is Perl?
A general purpose programming language.
Invented to replace awk, sed, and sh.
A scripting language.
Practical Extraction and Report Language
Pathologically Eclectic Rubbish Lister
“There is more than one way to do it” TIMTOWTDI
How to Use Perl
Perl “scripts” (programs) are text and are interpreted by the perl program.
TIMTOWTDI:
You can put the script on the command line:
>perl -e 'print "Hello, world!\n";'
You can pass it as an argument to perl:
>perl my_program.pl
You can make the script self-executing:
>my_program.pl
print, ", ', \n
'print "Hello, world!\n";'
In most programming languages, "print" means "display" or "output".
The single and double quote characters ( " ' ) are used to set apart blocks of "text". In this example, the single quotes set apart the perl script, and the double quotes set apart the text to display. (Perl has other ways to quote.)
The backslash, '\', is used to change the meaning of a character, e.g. to generate special characters. \n means "start a new line" (like pressing Carriage Return, Return, or Enter).
Example of a One Liner
(Thanks to Dr. Wheeler)
perl -nle '@f=split/\t/; print if ($f[2] > 95 );' blast_tbl_in.txt > blast_tbl_out.txt
perl -nle
'@f=split/\t/; print if ($f[2] > 95);'
blast_tbl_in.txt >blast_tbl_out.txt
A One Liner: TIMTOWTDI
1. perl -nle '@f=split/\t/; print if ($f[2] > 95 );' blast_tbl_in.txt > blast_tbl_out1.txt
2. perl -ne '@f=split; print if ($f[2] > 95 );' blast_tbl_in.txt > blast_tbl_out2.txt
3. perl -ane 'print if ($F[2] > 95 );' blast_tbl_in.txt > blast_tbl_out3.txt
split, if, variables
@f=split/\t/; print if ($f[2] > 95);
split is a function. It can be written with parens as in most languages, and takes UP TO three arguments:
split( where_to_split, what_to_split, how_many_to_split)
split, like many Perl statements, uses defaults for missing arguments.
Special characters mark @whole_arrays, $array_members[1], %whole_hashes, $hash_members{'one'}, $simple_variables.
if acts like its common English meaning. It can go before a block or at the end of a statement (as above).
Perl converts between numbers and text. '>' is a numeric operator so 95 and $f[2] are treated as numbers. If gt replaced >, they would be treated as strings.
FASTA to XML
perl -pi.bak -e 's"^>(.*)$"</seq><title>\1</title><seq>";' test.fa
[localhost:~/test] steffen% ls
test.fa test.fa.bak
[localhost:~/test] steffen% perl -pi.bak -e 's"^>(.*)$"</seq><title>\1</title><seq>";' test.fa
[localhost:~/test] steffen% ls
test.fa test.fa.bak
[localhost:~/test] steffen% more test.fa
</seq><title>CSTAP1E0101A</title><seq>
gttgcctgcgtcttcggxaacaacgtagttctcagGCCGCCCGACCAGGTACTTTTTTGCTTTTTTTTTTTTTATTTTTTACAAATTATCAAAAGTTCTTGTGCTTTCAGGAGCGATTAACATTCTCATGGGCCATACCCTTGTCAGGTTTCATAAACTAAGTTAGATGGACCTGCTTGGTATTGTGGTGGAAGACCTCCAAGAAAACAAAGTCCCGGAATCTCAACGTCCTCTGTCTTCTGGCATTTCATCTTCAAGAAACAATGTCTTATAGTTATTATTGCATGTTTTGGGAGGTTAAAGGGTAAAGTTTGTAATGCCTTGACTAAAAACTTCCAGTTGTTATGGTGcacaacaatttttggtatgctaacttatacttgtgcctaatccttaaggaaaagaaagagccatatacctaaaactgactttatttttcaaaaggta
</seq><title>CSTAP1E0102A</title><seq>
tttttgctggcgaactatcaggagactacagxaactacttttcagtxcgaactcacatcatcactggccgtcgttttacaacgtcgtgattgggaaaaccctggcgttacccaacttaatcgccttgcagcacatccccctttcgccagctggcgtaatagcgaagaggcccgcaccgatcgcccttcccaacagttgcgcagcctgaatggcgaatggcgcctgatgcggtattttctccttacgctttcaatgatgagcacttxtaaaggtctgx
</seq><title>CSTAP1E0103A</title><seq>
atttgagcagcatctattgaaaactaxcgxagxtcttcaggcgcgCCCACCCGAGGTACTACCAAGCCAGTGTCCTGCCCGGTTTTAAGCCCTCGTCCTCTCCCTTCGCTCTCCTCCAAACTGAGCAGCATTAGTTCCACAAGCACAGAAGTTAAACGAAAAACTGTCTTGCTCCACGGTCTCCTACAGTAGAATGCTGGATAATAATGCTTTCAGAAGCCACTTCTACAACCAGAACATTCTGACCACCACAATCATCAGGTTTACACACACCCTACGAAACACTAGCGAGTTAACAAGactgatgaactacttgcagtcgaactccaatcattactggccgtcgttttaa
Executing a Perl Script in a File
$line = <>;
$line =~ s">(.*)"<title>\1</title><seq>";
print $line;
while( $line = <> ) {
    $line =~ s">(.*)"</seq><title>\1</title><seq>";
    print $line;
}
print "</seq>\n";
File Reading, Binding, while
$line = <>;
<> reads one line from the "current file"
$line =~ s">(.*)"<title>\1</title><seq>";
=~ makes the preceding string the "current line" (Binding)
while( $line = <> ) {
    print $line;
}
Repeats the statements between { and } while there is another line.
Self-executing Perl Scripts
You need to know the path to your Perl program:
>which perl
/usr/bin/perl
The first line of your script must be:
#!/usr/bin/perl
Permissions need to allow execution:
>chmod 755 my_program.pl
FASTA to XML Fleshed Out
#!/usr/bin/perl
#
# fasta2xml by David Steffen 6/2/2004
# - Converts fasta file to mini-xml format
$inpfile = shift( @ARGV );
if( not( $inpfile =~ m/^(.*)\.fa$/ ) ) {
    die( "Input file, $inpfile, must be a fasta file and end in .fa\n" );
}
$basefile = $1;
open( INPFILE, $inpfile ) or die( "Can't open $inpfile: $!\n" );
$outfile = '>' . $basefile . '.xml';
open( OUTFILE, $outfile ) or die( "Can't open $outfile: $!\n" );
$line = <INPFILE>;
$line =~ s">(.*)"<title>\1</title><seq>";
print OUTFILE $line;
while( $line = <INPFILE> ) {
    $line =~ s">(.*)"</seq><title>\1</title><seq>";
    print OUTFILE $line;
}
print OUTFILE "</seq>\n";
Running Other Programs from Perl
$files = `ls`;
The "backtick" (` `) characters execute the text in between as a command to the operating system, returning the output of that command (e.g. to the $files variable).
$error = system( "mv $file ${basefile}.abi" );
The system statement executes its argument as a command to the operating system, returning the command's exit status (0 on success, non-zero on failure), not its output. (Output is printed as usual.) There are other, subtle differences between ` ` and system.