Lecture 5
File Manipulation and the Command Line
- OR -
Intro to Using Unix
Richard Rauscher, PSU CoM
So [not] easy that…
• Graphical user interfaces (GUIs) were developed in the 80’s to significantly reduce the learning curve of applications
• The benefit of GUIs is a small learning curve at the cost of perpetual inefficiency
• Textual interfaces (specifically “line” interfaces) come from the early days of computing
• Much steeper learning curve but much greater long-term efficiencies
..that a caveman can[n’t] do it
Quick examples
• You should have the file: saccharomyces_cerevisiae.gff
• Using the best way you know how, do the following:
– Calculate the number of lines in the file
– Count the number of lines with the word “Dubious” in the file
– Extract only the third and fifth columns from the file
• You have five minutes
Example – Showing off with Unix commands
Quick Outline of Hour
1. Install Cygwin / Getting to a terminal on a Mac
2. Introduction to the shell (bash, tcsh)
3. Editing the command history (emacs style line-editing)
4. The Unix file system
5. Basic file commands (e.g. ls, cat, more, less, cp, mv, grep, egrep, diff, sort, cut)
6. Shell-based redirects (>, <, |), stdin, stdout, stderr
7. Background operations
8. Transferring files, remote terminals (ssh, scp, sftp)
9. XWindows (if time permits)
Cygwin or Darwin
• Macs
– Macs are based on Unix (OS X)
– Macs have a “terminal” program that you may run
• PCs
– PCs can run Linux natively (instead of Windows)
– Or you can run a Unix-like environment (a semi “virtual computer”) under Windows using Cygwin
– Cygwin is free
The Shell
• Command interpreter
• Different shells
• Examples: Bourne Shell, Korn Shell, C-Shell, TC-Shell, Bourne Again Shell (Bash)
• Most current systems default to Bash
• Can be used to create “scripts”
• Scripts: small interpreted programs
How the shell works
• You type something into the shell
• Other than a few special characters, the first word you type on a line causes the shell to find a program (executable file) with that name and run it; the remaining words are handed to that program as arguments
• The special characters (|, >, <) connect a program’s input and output to files or to other programs
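To see the lookup described above in action, the following minimal sketch asks the shell where it would find the program for a given name. `command -v` is a standard shell facility; `PATH` is not mentioned on the slide but is the variable that lists the directories the shell searches.

```shell
# Ask the shell which executable file it would run for the name "cat".
cat_path=$(command -v cat)
echo "$cat_path"

# PATH is the colon-separated list of directories searched for programs.
echo "$PATH"
```

The exact path printed differs from system to system.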
Some Shell Basics
• Every command usually has three “handles” attached to it: stdin, stdout, and stderr
– Stdin = standard input; by default, most Unix programs read their input from the terminal’s keyboard
– Stdout = standard output; by default, most Unix programs write their output to the terminal’s screen
– Stderr = standard error; like standard output, but it is redirected separately so that normal output and error output aren’t confused
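A minimal sketch of why stdout and stderr are kept apart. The function `report` and the file names are made up for illustration; `>` redirects stdout and `2>` redirects stderr.

```shell
# Hypothetical demo function that writes one line to each output stream.
report() {
    echo "normal output"        # goes to stdout (file descriptor 1)
    echo "error output" >&2     # goes to stderr (file descriptor 2)
}

workdir=$(mktemp -d)            # scratch directory so no real files are touched

# Send the two streams to two different files.
report > "$workdir/out.txt" 2> "$workdir/err.txt"

cat "$workdir/out.txt"          # holds only the normal output
cat "$workdir/err.txt"          # holds only the error output
```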
Simple Example
• Let’s create a file
% cat > f1
This is
a test
of the
keyboard system.
^D    This means Control-D, which signals end of input in Unix
%
What did you just do?
• You started the program called “cat” and redirected its stdout to a file called f1.
• You typed on your keyboard into its stdin
• “cat” (short for concatenate) just takes input and puts it to output.
• You probably wouldn’t do what we just did for anything practical but it was an easy way to create a file for demonstration purposes.
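The same file can be created without typing at the keyboard. This sketch uses a "here-document", a shell feature that feeds the lines between the two `EOF` markers to the command's stdin; the scratch directory is only there to keep the demo tidy.

```shell
workdir=$(mktemp -d)    # scratch directory for the demo
cd "$workdir"

# Equivalent to typing the four lines and then pressing Control-D.
cat > f1 <<'EOF'
This is
a test
of the
keyboard system.
EOF

# Given a file name, cat copies that file to stdout.
cat f1
```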
Finding help
• In most Unix systems, there is something called the “manual” installed
• Thus, you can usually find out more than you ever wanted to know by saying “man <command>”
• For example, if you type “man cat” you find out more about cat
Command-Line Editing
• You can usually also edit your command line
– ^P = previous line
– ^N = next line
– ^B = back up (one character)
– ^F = go forward (one character)
– ^D = delete a character
• This can be very useful if you’re not an extraordinary typist and type one character wrong
• Cool feature: arrow keys usually also work
Unix-style directories
• In Unix, everything is a file (programs, data, email, devices, memory, sound cards)
• Unix directory structures are completely hierarchical (unlike DOS, which has drive letters)
• Everything starts at / (forward slash)
• On a Unix system, your home directory can usually be reached by “cd ~/”
• [You can use similar constructs to get to the home directory of others, e.g. ~rlr27]
Basic File Commands
• ls – lists the contents of a directory
• pwd – tells you the present working directory
• cat – output one or many files
• more – output a file such that it doesn’t fly off the top of the screen
• less – a more functional version of more
• cp – copy a file
• mv – move or rename a file
• grep, egrep, fgrep – search for a specific string in one or more files
• diff – compare two files
• sort – sort the lines of a file in alphabetical or numerical order
More commands
• cut – extract columns from a file
• wc -l – count the number of lines in a file
• wc -c – count the number of bytes (characters, for plain ASCII text) in a file
• rm – remove one or more files
• more tricks: cat file | sed 's/HELLO/GOODBYE/g' > file2
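With these commands, the three tasks from the "Quick examples" slide each take one line. The sketch below runs them on a tiny made-up, tab-delimited file (`sample.gff` and its contents are assumptions, not the real course data); on the real file, substitute `saccharomyces_cerevisiae.gff`.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Tiny made-up stand-in for the real file: six tab-separated columns.
printf 'chrI\tSGD\tgene\t335\t649\tVerified\n'   >  sample.gff
printf 'chrI\tSGD\tgene\t538\t792\tDubious\n'    >> sample.gff
printf 'chrX\tSGD\tCDS\t1807\t2169\tDubious\n'   >> sample.gff
printf 'chrX\tSGD\tgene\t2480\t2707\tVerified\n' >> sample.gff

# Number of lines in the file.
line_count=$(wc -l < sample.gff)

# Number of lines that contain the word "Dubious".
dubious_count=$(grep -c 'Dubious' sample.gff)

# Keep only the third and fifth columns (cut splits on tabs by default).
cut -f 3,5 sample.gff > columns.txt

echo "lines: $line_count, dubious: $dubious_count"
cat columns.txt
```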
Commands on Directories
• mkdir = create a subdirectory
• cd = change directory
• rmdir = remove an empty directory
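A short round trip through the three directory commands, as a sketch; the directory name `project` is made up and everything happens inside a scratch directory.

```shell
workdir=$(mktemp -d)
cd "$workdir"

mkdir project        # create a subdirectory
cd project           # move into it
pwd                  # prints the full path, which ends in /project
cd ..                # '..' always means the parent directory
rmdir project        # succeeds only because the directory is empty
```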
Shell substitutions (wildcards)
• * = match anything in the cwd except files that start with a .
• f* = match anything in the cwd that starts with an f
• f?? = match anything that starts with an f and has exactly two characters after the f
• ~userid refers to the home directory of userid
• ~/ refers to your own home directory
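The wildcard rules above can be tried safely with `echo`, which simply prints whatever the shell expanded the pattern into. The file names here are made up for the demo.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Empty demo files, including one hidden file whose name starts with a dot.
touch f1 f22 foo fab other .hidden

echo *       # every name except those starting with a dot
echo f*      # every name starting with f
echo f??     # f followed by exactly two more characters
```

The shell replaces each pattern with the matching file names before `echo` ever runs, which is why this works with any command, not just `echo`.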
Redirecting
• c1 | c2
– Takes the output from c1 and makes it the input into c2
• c1 > f1
– Takes the output from c1 and puts it into a file
• c1 < f1
– Makes c1 get its input from f1
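A minimal sketch that uses all three redirections on a made-up file (`fruit.txt` is an assumption for the demo):

```shell
workdir=$(mktemp -d)
cd "$workdir"
printf 'pear\napple\nfig\napple\n' > fruit.txt

# '<' feeds the file to sort's stdin; '>' saves sort's stdout in a new file.
sort < fruit.txt > sorted.txt

# '|' sends grep's stdout straight into wc's stdin.
apple_count=$(grep 'apple' fruit.txt | wc -l)

cat sorted.txt
echo "apple lines: $apple_count"
```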
Remote Terminals
• Connecting to other computers is done through a remote terminal
– Old ways include “telnet, rsh, rlogin”
• Not encrypted, not secure and not used much these days
• Listed here for completeness and in case someone still uses the old lingo
– “New” way is ssh
• Syntax is “ssh userid@hostname”
– E.g. ssh [email protected]
• Warning: performance can vary with the options you choose and the power of your computer
Transferring Files
• Old way: ftp, rcp, rdist
– ftp is still used for anonymous transfers; it is not considered to be secure for password-protected transfers
• New ways: scp, sftp
– Uses the same protocol as SSH
– Use scp if you know what you want to move; sftp if you don’t
• Other: http, https
– There are command-line tools for accessing web sites and files from websites (e.g. curl)
Problem Set
• Select all genes from chromosome X and put them into a file called new.txt
• Select all non-gene features from chromosome X and put them into a file called new1.txt
• Remove all of the dubious findings from these results and put them into a file called new2.txt
Lecture 6a
Regular Expressions, Access to Applications
Richard Rauscher, PSU CoM
More advanced look
• Suppose we wanted to pick out only a range within the file
• In the GFF format, the range is specified in columns 4 (beginning) and 5 (end)
• cat saccharomyces_cerevisiae.gff | awk '{print $4 $5}'
• What’s wrong with that?
• cat saccharomyces_cerevisiae.gff | awk '{ if (($4 > 10) && ($5 < 1000)) {print $0} }'
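One likely answer to "What's wrong with that?" is that awk glues adjacent fields together when no comma separates them, and that the first command does no filtering at all. The sketch below shows both points on a made-up three-line file (`sample.gff` here is an assumption, not the course data); `-F '\t'` tells awk to split on tabs only.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Made-up tab-delimited lines; columns 4 and 5 are the start and end.
printf 'chrI\tSGD\tgene\t335\t649\n'  >  sample.gff
printf 'chrI\tSGD\tgene\t538\t1792\n' >> sample.gff
printf 'chrI\tSGD\tCDS\t5\t90\n'      >> sample.gff

# No comma: the two fields are concatenated, so the first line prints 335649.
awk -F '\t' '{ print $4 $5 }' sample.gff

# With a comma, awk puts its output separator (a space) between the fields.
awk -F '\t' '{ print $4, $5 }' sample.gff

# Keep whole lines that start after 10 and end before 1000.
awk -F '\t' '($4 > 10) && ($5 < 1000)' sample.gff > in_range.txt
cat in_range.txt
```

A bare condition with no action prints the whole matching line, so it behaves the same as the `if (...) {print $0}` form on the slide.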
Matching/Regular Expressions
• This can get confusing so try to follow…
• There are “shell substitutions”
– Wildcards (*, ?)
– Expansions (~/, ~userid, environment variables, local variables, aliases)
• The wildcards match filenames
Regular Expressions
• Regular expressions match patterns in files
• The concept is (almost) consistently implemented throughout Unix utilities and programs
• Reasonably easy to write (once you know what you’re doing)
• Difficult to read what is intended
Know they exist
• But you don’t have to be an expert
• Basic ideas
– ^ matches the beginning of a line
– $ matches the end of the line
– . matches “any” character
– * matches any number of repetitions of the character preceding it (0 to infinity)
– + matches any number of repetitions but at least one
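The basic ideas above can be tried with `grep -E`, which uses extended regular expressions (so `+` works without a backslash). The file `names.txt` and its contents are made up for the demo.

```shell
workdir=$(mktemp -d)
cd "$workdir"
printf 'chrI\nchrII\nchrX\nmy_chrX_copy\nchr\n' > names.txt

# ^ and $ pin the pattern to the start and end of the line.
grep -E '^chrX$' names.txt

# . stands for exactly one character, whatever it is.
grep -E '^chr.$' names.txt

# * allows zero or more I's, so the bare "chr" line matches too.
grep -E '^chrI*$' names.txt

# + requires at least one I, so the bare "chr" line no longer matches.
grep -E '^chrI+$' names.txt
```

Without the anchors, `chrX` would also match the line `my_chrX_copy`, because grep looks for the pattern anywhere in the line.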
Other Funky Stuff
• Searching for invisible things can be difficult
• How do you search for a tab for example?
• Some systems will use “C” style representations (\t)
• In some, you actually need to put the tab character
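One portable way to hand grep a real tab character is to let `printf` produce it and store it in a variable. This is a sketch on a made-up file (`mixed.txt`); whether `\t` is understood directly depends on the tool, as noted above.

```shell
workdir=$(mktemp -d)
cd "$workdir"
printf 'one\ttwo\n' >  mixed.txt    # this line contains a tab
printf 'one two\n'  >> mixed.txt    # this line contains a space instead

# printf understands \t, so the variable ends up holding a literal tab.
tab=$(printf '\t')

# Count the lines that contain that tab character.
tab_lines=$(grep -c "$tab" mixed.txt)
echo "lines containing a tab: $tab_lines"
```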
Practical Bioinformatics
Week 3, Lecture 6b
István Albert
Bioinformatics Consulting Center, Penn State
Short read mapping
Challenge: instruments generate a very high number (hundreds of millions) of short (50-100bp) reads.
Numerous software packages have been developed to map (align) these reads against known genomes.
Every method applies “heuristics”
• Heuristics: “experience-based”, “trial and error” techniques
• Each method has a huge number of tunable parameters!
• The inputs are sequencing reads (fasta, fastq or colorspace)
• The outputs are in a format that contains genomic coordinates for each input read.
A sample of short read mappers
• BFAST
• Bowtie
• BWA
• ELAND
• GMAP
• MAQ
• SHRiMP
• SOAP
• SWIFT
… etc …
Each has strengths and weaknesses that may not be readily apparent.
Everything is about tradeoffs between resource consumption, specificity, sensitivity and usability.
1. Tool A is fast, uses very little memory but cannot find insertions or deletions.
2. Tool B is very efficient but it requires us to generate and precompute a genome index that takes one week.
…etc..
Very few people have a deep understanding of all methods
• Usually they are experts in one particular method
• Some approaches are better suited for some types of experiments, or more often some approaches are ill suited for certain tasks.
• It is important to be aware of past accumulated knowledge in a field of study. Publications often explain a particular choice.
Bowtie
• Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 10:R25.
• Download it onto your computer (see the lecture webpage for links)
• http://bowtie-bio.sourceforge.net/index.shtml
Download the genomic index from the course webpage
• Copy the content of this file into the bowtie/index folder
• Copy the data.fastq file to your bowtie installation directory
Run the program with no parameters
tunable parameters
Run it on your dataset
Investigate results.txt
So-called Sequence Alignment/Map – SAM file (more info in later lectures)
Tab delimited: 1st read_id, 2nd column, 3rd chromosome, 4th coordinate
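Because the file is tab delimited, the commands from Lecture 5 are enough to summarize it. The sketch below works on a made-up, heavily shortened stand-in for results.txt; the chromosome name `chr10`, the header line, and the simplification that every non-header line counts as one alignment are all assumptions that may not hold for your actual output.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Made-up SAM-like lines: read_id, 2nd column, chromosome, coordinate.
# SAM header lines start with @ and are not alignments.
printf '@HD\tVN:1.0\n'            >  results.txt
printf 'read1\t0\tchr10\t1500\n'  >> results.txt
printf 'read2\t16\tchr2\t72000\n' >> results.txt
printf 'read3\t0\tchr10\t98000\n' >> results.txt

# Count the non-header lines.
alignments=$(grep -v '^@' results.txt | wc -l)

# Take the 3rd column and count the lines that are exactly chr10.
on_chr10=$(grep -v '^@' results.txt | cut -f 3 | grep -c '^chr10$')

echo "alignments: $alignments, on chr10: $on_chr10"
```

Anchoring the pattern with `^` and `$` keeps a name such as `chr100` from being counted as `chr10`.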
A few exercises
• Run the mapping on your computer.
• How many alignments did you get?
• How many reads map to chromosome 10?
Read the Bowtie manual on the web.
• What is the flag that makes bowtie only report reads that map uniquely (in a single location) to the genome?
• How many alignments do you get now?
• How many reads map to chromosome 10?
Where to go next?
• The manual and publication.
• They describe a substantial number of use cases
• Bowtie is also the base building block software for several higher order pipelines:
Crossbow: Genotyping
TopHat: RNA-Seq splice junction mapper
Cufflinks: Isoform assembly, quantitation