Date post: 13-Jul-2020
Lecture 5 File Manipulation and the Command Line - OR - Intro to Using Unix Richard Rauscher, PSU CoM
So [not] easy that…

• Graphical user interfaces (GUIs) were developed in the 1980s to significantly reduce the learning curve of applications

• The benefit of GUIs is a small learning curve at the cost of perpetual inefficiency

• Textual interfaces (specifically “line” interfaces) come from the early days of computing

• Much steeper learning curve but much greater long-term efficiencies

..that a caveman can[n’t] do it

Quick examples

• You should have the file: saccharomyces_cerevisiae.gff

• Using the best way you know how, do the following:

– Calculate the number of lines in the file

– Count the number of lines with the word “Dubious” in the file

– Extract only the third and 5th columns from file3

• You have five minutes

Example – Showing off with Unix commands
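
Each of the three tasks from the quick examples can be done with a single command. What follows is a minimal sketch, not necessarily what was shown in class. It builds a tiny made-up tab-delimited file (sample.gff, a hypothetical stand-in) so the commands can be run anywhere; substitute saccharomyces_cerevisiae.gff to work on the real file.

```shell
# Build a tiny, made-up tab-delimited file to practise on (hypothetical data).
printf 'chrI\tSGD\tgene\t335\t649\tVerified\n'   >  sample.gff
printf 'chrI\tSGD\tgene\t538\t792\tDubious\n'    >> sample.gff
printf 'chrX\tSGD\tCDS\t1807\t2169\tVerified\n'  >> sample.gff
printf 'chrX\tSGD\tgene\t2480\t2707\tDubious\n'  >> sample.gff

# 1. Number of lines in the file
wc -l < sample.gff

# 2. Number of lines containing the word "Dubious"
grep -c Dubious sample.gff

# 3. Only the 3rd and 5th tab-separated columns
cut -f 3,5 sample.gff > columns.txt
cat columns.txt
```

cut splits on tabs by default, which is why no delimiter option is needed for a tab-delimited file.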

Quick Outline of Hour

1. Install Cygwin / Getting to a terminal on a Mac
2. Introduction to the shell (bash, tcsh)
3. Editing the command history (emacs-style line editing)
4. The Unix file system
5. Basic file commands (e.g. ls, cat, more, less, cp, mv, grep, egrep, diff, sort, cut)
6. Shell-based redirects (>, <, |), stdin, stdout, stderr
7. Background operations
8. Transferring files, remote terminals (ssh, scp, sftp)
9. XWindows (if time permits)

Cygwin or Darwin

• Macs

– Macs are based on Unix (OS X)

– Macs have a “terminal” program that you may run

• PCs

– PCs can run Linux natively (instead of Windows)

– Or you can run a Unix-like environment (a semi “virtual computer”) under Windows using Cygwin

– Cygwin is free

The Shell

• Command interpreter

• Different shells

• Examples: Bourne Shell, Korn Shell, C-Shell, TC-Shell, Bourne Again Shell (Bash)

• Most current systems default to Bash

• Can be used to create “scripts”

• Scripts: small interpreted programs

How the shell works

• You type something into the shell

• Apart from a few special characters and commands built into the shell itself, the first word you type causes the shell to find a program (executable file) with that name and run it; the remaining words are passed to that program as arguments

• There are special characters (|, >, <) that connect a program’s input and output to files or to other programs
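
To see what the shell will actually run for a given name, you can ask it. A small sketch using two standard facilities, command -v and type; the exact path printed for cat depends on your system.

```shell
# Ask the shell what it will run for a given name.
command -v cat      # prints the path of the executable, e.g. /bin/cat
type cd             # reports that cd is built into the shell itself

# The shell looks for programs in the directories listed in PATH, in order.
echo "$PATH"
```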

Some Shell Basics

• Every command usually has three “handles” attached to it: stdin, stdout, and stderr

– Stdin = standard input; by default, most Unix programs take their input from the terminal’s keyboard

– Stdout = standard output; by default, most Unix programs write their output to the terminal’s screen

– Stderr = standard error; like standard output, but it is redirected separately so that normal output and error output aren’t confused
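
The difference between stdout and stderr shows up as soon as they are sent to different places. A minimal sketch: > captures stdout, and 2> (the redirect for handle number 2, stderr) captures error messages. The file names are made up, and no_such_file is assumed not to exist.

```shell
# One file that exists, and one name (no_such_file) that does not.
touch exists.txt

# Normal output goes to out.txt, error messages go to err.txt.
# "|| true" only keeps a script going after ls reports the failure.
ls exists.txt no_such_file > out.txt 2> err.txt || true

cat out.txt    # the normal output: exists.txt
cat err.txt    # the complaint about no_such_file

# stdin: wc reads from the keyboard unless its input is redirected.
wc -l < out.txt
```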

Simple Example

• Let’s create a file

% cat > f1

This is

a test

of the

keyboard system.

^D      (Control-D: the end-of-file marker in Unix)

%

What did you just do?

• You started the program called “cat” and redirected its stdout to a file called f1.

• You typed on your keyboard into its stdin

• “cat” (short for concatenate) just takes input and puts it to output.

• You probably wouldn’t do what we just did for anything practical but it was an easy way to create a file for demonstration purposes.
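
The same file can be created without typing at the keyboard. This sketch swaps in a "here-document" (not covered on the slides): the lines between the two EOF markers are fed to cat's stdin, and the final EOF plays the role of Control-D.

```shell
# Same effect as typing the four lines and pressing Control-D:
# the here-document supplies cat's stdin, and > sends its stdout to f1.
cat > f1 <<'EOF'
This is
a test
of the
keyboard system.
EOF

cat f1        # with a file name as argument, cat prints that file
wc -l < f1    # f1 has 4 lines
```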

Finding help

• In most Unix systems, there is something called the “manual” installed

• Thus, you can usually find out more than you ever wanted to know by saying “man <command>”

• For example, if you type “man cat” you find out more about cat

Command-Line Editing

• You can usually also edit your command line

– ^P = previous line

– ^N = next line

– ^B = back up (one character)

– ^F = go forward (one character)

– ^D = delete a character

• This can be very useful if you’re not an extraordinary typist and type one character wrong

• Cool feature: arrow keys usually also work

Unix-style directories

• In Unix, everything is a file (programs, data, email, devices, memory, sound cards)

• Unix directory structures are completely hierarchical (unlike DOS, which has drive letters)

• Everything starts at / (forward slash)

• On a unix system, your home directory can usually be reached by “cd ~/”

• [You can use similar constructs to get to the home directory of others, e.g. ~rlr27]

Basic File Commands

• ls – lists the contents of a directory

• pwd – prints the current working directory

• cat – outputs one or more files

• more – outputs a file one screen at a time, so that it doesn’t fly off the top of the screen

• less – a more functional version of more

• cp – copy a file

• mv – move or rename a file

• grep, egrep, fgrep – search for a specific string in one or more files

• diff – compare two files

• sort – sort the lines of a file in alphabetical or numerical order

More commands

• cut – extract columns from a file

• wc -l – count the number of lines in a file

• wc -c – count the number of bytes in a file (the same as characters for plain ASCII text)

• rm – remove one or more files

• more tricks: cat file | sed 's/HELLO/GOODBYE/g' > file2
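
A short session that exercises several of these commands together. This is only a sketch on made-up file contents; the file names other than file and file2 are hypothetical.

```shell
# Made-up input file (hypothetical contents).
printf 'pear\nHELLO apple\nfig\n' > file

cp file copy.txt                 # copy.txt is now identical to file
sort file > sorted.txt           # the same lines, sorted
diff file copy.txt && echo "file and copy.txt are identical"

# sed can read the file directly, so the leading "cat file |" is optional.
sed 's/HELLO/GOODBYE/g' file > file2
grep GOODBYE file2               # prints: GOODBYE apple

mv copy.txt renamed.txt          # copy.txt no longer exists afterwards
rm renamed.txt sorted.txt        # clean up
```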

Commands on Directories

• mkdir = create a subdirectory

• cd = change directory

• rmdir = remove an empty directory
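
A quick sketch of the three directory commands in sequence. The directory name unix_demo_dir is made up and assumed not to exist yet.

```shell
mkdir unix_demo_dir              # create a subdirectory
cd unix_demo_dir                 # move into it
pwd                              # the path now ends in /unix_demo_dir
touch notes.txt                  # put one (empty) file inside
cd ..                            # .. is the parent directory

# rmdir only removes empty directories, so this first attempt is refused.
rmdir unix_demo_dir || echo "rmdir refused: the directory is not empty"

rm unix_demo_dir/notes.txt       # empty the directory first
rmdir unix_demo_dir              # now it succeeds
```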

Shell substitutions (wildcards)

• * = match anything in the cwd except files that start with a .

• f* = match anything in the cwd that starts with an f

• f?? = match anything that starts with an f and has exactly two characters after the f

• ~userid refers to the home directory of userid

• ~/ refers to your own home directory
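
The shell expands wildcards before the command runs, so echo is an easy way to see what a pattern matches. A sketch in a scratch directory with made-up file names (glob_demo is hypothetical):

```shell
# Work in a scratch directory with a few made-up files.
mkdir -p glob_demo
cd glob_demo
touch f1 f22 foo fudge .hidden notes.txt

echo *        # f1 f22 foo fudge notes.txt   (.hidden is skipped)
echo f*       # f1 f22 foo fudge
echo f??      # f22 foo
echo ~/       # your own home directory, followed by a slash

cd ..
# glob_demo can be deleted afterwards with: rm -r glob_demo
```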

Redirecting

• c1 | c2

– Takes the output from c1 and makes it the input into c2

• c1 > f1

– Takes the output from c1 and puts it into a file

• c1 < f1

– Makes c1 get its input from f1
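
The three forms side by side, with sort standing in for c1 and head for c2. A sketch on a made-up file (fruit.txt is hypothetical):

```shell
printf 'banana\napple\ncherry\n' > fruit.txt   # c1 > f1: output goes into a file
sort < fruit.txt                                # c1 < f1: input comes from a file
sort fruit.txt | head -n 1                      # c1 | c2: prints apple
sort < fruit.txt > sorted_fruit.txt             # both redirects at once
```

Note that > overwrites the target file without asking, so double-check the name on the right-hand side.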

Remote Terminals

• Connecting to other computers is done through a remote terminal

– Old ways include “telnet, rsh, rlogin”

• Not encrypted, not secure and not used much these days

• Listed here for completeness and in case someone still uses the old lingo

– “New” way is ssh

• Syntax is “ssh userid@hostname”

– E.g. ssh [email protected]

• Warning: performance can vary with the options you choose and the power of your computer

Transferring Files

• Old way: ftp, rcp, rdist

– ftp is still used for anonymous transfers; it is not considered secure for password-protected transfers

• New ways: scp, sftp

– Both use the same protocol as SSH

– Use scp if you know what you want to move; sftp if you don’t

• Other: http, https

– There are command-line tools for accessing web sites and files from websites (e.g. curl)

Problem Set

• Select all genes from chromosome X and put them into a file called new.txt

• Select all non-gene features from chromosome X and put them into a file called new1.txt

• Remove all of the dubious findings from these results and put them into a file called new2.txt

Lecture 6a: Regular Expressions, Access to Applications

Richard Rauscher, PSU CoM

More advanced look

• Suppose we wanted to pick out only a range within the file

• In the GFF format, the range is specified in columns 4 (beginning) and 5 (end)

• cat saccharomyces_cerevisiae.gff | awk '{print $4 $5}'

• What’s wrong with that?

• cat saccharomyces_cerevisiae.gff | awk '{ if (($4 > 10) && ($5 < 1000)) {print $0} }'
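
On the question just asked: print $4 $5 glues the two fields together with nothing between them, and it prints them for every line instead of selecting a range. A comma inserts the output separator, and a condition does the filtering. The sketch runs on made-up data (mini.gff is hypothetical); the -F '\t' option is an addition that assumes the file is tab-delimited.

```shell
printf 'chrI\tSGD\tgene\t335\t649\n'     >  mini.gff
printf 'chrI\tSGD\tgene\t5000\t6200\n'   >> mini.gff

awk -F '\t' '{ print $4 $5 }' mini.gff     # 335649 and 50006200: fields run together
awk -F '\t' '{ print $4, $5 }' mini.gff    # 335 649 and 5000 6200

# Keep only features that start after 10 and end before 1000.
# A bare condition with no action prints the whole matching line.
awk -F '\t' '$4 > 10 && $5 < 1000' mini.gff > in_range.gff
cat in_range.gff
```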

Matching/Regular Expressions

• This can get confusing so try to follow…

• There are “shell substitutions”

– Wildcards (*, ?)

– Expansions (~/, ~userid, environment variables, local variables, aliases)

• The wildcards match filenames

Regular Expressions

• Regular expressions match patterns in files

• The concept is (almost) consistently implemented throughout Unix utilities and programs

• Reasonably easy to write (once you know what you’re doing)

• Difficult to read what is intended

Know they exist

• But you don’t have to be an expert

• Basic ideas

– ^ matches the beginning of a line

– $ matches the end of the line

– . matches “any” character

– * matches any number of repetitions of the character preceding it (0 to infinity)

– + matches any number of repetitions of the preceding character, but at least one
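
Each of these can be tried with grep on a small made-up word list (words.txt is hypothetical). One detail to be aware of: in plain grep the + is an ordinary character, so the last example uses grep -E (the same as egrep), which enables the extended syntax.

```shell
printf 'gene\ngenes\nmy gene\ngne\ngeeene\n' > words.txt

grep    '^gene'  words.txt   # lines starting with gene: gene, genes
grep    'gene$'  words.txt   # lines ending with gene: gene, my gene
grep    'g.ne'   words.txt   # any one character between g and ne: gene, genes, my gene
grep    'ge*ne'  words.txt   # zero or more e's: all five lines
grep -E 'ge+ne'  words.txt   # one or more e's: every line except gne
```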

Other Funky Stuff

• Searching for invisible things can be difficult

• How do you search for a tab for example?

• Some systems will use “C” style representations (\t)

• In some, you actually need to put the tab character
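
Two ways to search for a tab, as a sketch on a made-up file (tabs.txt is hypothetical). The first hands grep a real tab character produced by printf; the second relies on awk understanding the C-style \t inside a regular expression.

```shell
printf 'one\ttwo\nno tabs here\n' > tabs.txt

# Let printf produce a literal tab and pass it to grep as the pattern.
tab=$(printf '\t')
grep -c "$tab" tabs.txt          # prints 1: one line contains a tab

# awk understands the C-style escape \t inside a regular expression.
awk '/\t/' tabs.txt              # prints the line containing the tab
```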

Practical Bioinformatics

Week 3, Lecture 6b

István Albert

Bioinformatics Consulting Center, Penn State

Short read mapping

Challenge: instruments generate a very high number (hundreds of millions) of short (50-100bp) reads.

Numerous software packages have been developed to map (align) these reads against known genomes.

Every method applies “heuristics”

• Heuristics: “experience-based”, “trial and error” techniques

• Each method has a huge number of tunable parameters!

• The inputs are sequencing reads (fasta, fastq or colorspace)

• The outputs are in a format that contains genomic coordinates for each input read.

A sample of short read mappers

• BFAST
• Bowtie
• BWA
• ELAND
• GMAP
• MAQ
• SHRiMP
• SOAP
• SWIFT
… etc …

Each has strengths and weaknesses that may not be readily apparent.

Everything is about tradeoffs between resource consumption, specificity, sensitivity and usability.

1. Tool A is fast, uses very little memory but cannot find insertions or deletions.

2. Tool B is very efficient but requires us to precompute a genome index, which takes one week.

…etc..

Very few people have a deep understanding of all methods

• Usually they are experts in one particular method

• Some approaches are better suited for some types of experiments, or more often some approaches are ill suited for certain tasks.

• It is important to be aware of past accumulated knowledge in a field of study. Publications often explain why a particular choice was made.

Bowtie

• Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 10:R25.

• Download it onto your computer (see the lecture webpage for links)

• http://bowtie-bio.sourceforge.net/index.shtml

Download the genomic index from the course webpage

• Copy the content of this file into the bowtie/index folder

• Copy the data.fastq file to your bowtie installation directory

Run the program with no parameters

[Screenshot: program output, annotated “tunable parameters”]

Run it on your dataset

Investigate results.txt

So-called Sequence Alignment/Map (SAM) file (more info in later lectures)

Tab delimited: 1st read_id, 2nd column, 3rd chromosome, 4th coordinate
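
Because the file is tab-delimited, the tools from the Unix lecture apply directly. The sketch below uses a made-up stand-in for results.txt (sample_results.txt, with hypothetical read names, chromosome names and coordinates). The value in the second column is only a placeholder, since the slide does not say what that column holds. Header lines in a SAM file start with @, so they are skipped before counting.

```shell
# Made-up stand-in for results.txt: read_id, a placeholder 2nd column,
# chromosome, coordinate (all values hypothetical).
printf '@HD\tVN:1.0\n'              >  sample_results.txt
printf 'read1\t0\tchr10\t1200\n'    >> sample_results.txt
printf 'read2\t0\tchr2\t88\n'       >> sample_results.txt
printf 'read3\t0\tchr10\t4310\n'    >> sample_results.txt

# Skip header lines (they start with @), then count the alignment lines.
grep -v '^@' sample_results.txt | wc -l

# Alignments per chromosome: take column 3, sort, count repeated values.
grep -v '^@' sample_results.txt | cut -f 3 | sort | uniq -c

# Count the lines whose 3rd column is exactly chr10.
awk -F '\t' '$3 == "chr10"' sample_results.txt | wc -l
```

Comparing the whole column with == avoids the trap of a plain grep for chr1 also matching chr10, chr11 and so on.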

A few exercises

• Run the mapping on your computer.

• How many alignments did you get?

• How many reads map to chromosome 10?

Read the Bowtie manual on the web.

• What is the flag that makes bowtie report only reads that map uniquely (in a single location) to the genome?

• How many alignments do you get now?

• How many reads map to chromosome 10?

Where to go next?

• The manual and publication.

• These describe a substantial number of use cases

• Bowtie is also the base building block software for several higher order pipelines:

– Crossbow: Genotyping

– TopHat: RNA-Seq splice junction mapper

– Cufflinks: Isoform assembly, quantitation

