Essential Skills for Bioinformatics: Unix/Linux

Page 1: Essential Skills for Bioinformatics: Unix/Linux

Essential Skills for Bioinformatics: Unix/Linux

Page 2: Essential Skills for Bioinformatics: Unix/Linux

Text processing with awk

• While awk is designed to work with whitespace-separated tabular data, it is easy to set a different field separator.

• Simply specify which separator to use with the -F argument. For example, we could work with a CSV file in awk by starting with awk -F","
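• For example, a quick sketch (data.csv is a hypothetical file) that prints the second column of a CSV:

awk -F"," '{ print $2 }' data.csv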

Page 3: Essential Skills for Bioinformatics: Unix/Linux

Text processing with awk

• We have illustrated two ways awk can come in handy:

• For filtering data using rules that can combine regular expressions and arithmetic

• Reformatting the columns of data using arithmetic

• We will learn more advanced use cases by introducing two special patterns: BEGIN and END.

• The BEGIN pattern specifies what to do before the first record is read in. It is useful to initialize and set up variables.

• The END pattern specifies what to do after the last record’s processing is complete. It is useful for printing data summaries at the end of file processing.

Page 4: Essential Skills for Bioinformatics: Unix/Linux

Text processing with awk

• Suppose we want to calculate the mean feature length in Homo_sapiens.GRCh38.87.bed:
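• A minimal sketch of that command (in standard BED, the start and end coordinates are columns 2 and 3):

awk 'BEGIN{ s = 0 } { s += ($3 - $2) } END { print "mean: " s/NR }' Homo_sapiens.GRCh38.87.bed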

• NR is the current record number, so on the last record NR is set to the total number of records processed.

Page 5: Essential Skills for Bioinformatics: Unix/Linux

Text processing with awk

• We can use NR to extract ranges of lines:
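• For example, to print only records 3 through 5 (example.bed is a hypothetical file):

awk 'NR >= 3 && NR <= 5' example.bed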

Page 6: Essential Skills for Bioinformatics: Unix/Linux

Text processing with awk

• awk makes it easy to convert between bioinformatics files like BED and GTF.

• Our previous solution using grep and cut has an error: cut can extract and reorder columns, but it cannot modify them, so the start positions remain 1-based instead of being converted to BED’s 0-based coordinates.

Page 7: Essential Skills for Bioinformatics: Unix/Linux

Text processing with awk

• We can generate a three-column BED file from the GTF file as follows.
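• A sketch, assuming a GTF file named Homo_sapiens.GRCh38.87.gtf (in GTF, column 1 is the chromosome, column 4 the start, and column 5 the end; lines beginning with # are header lines):

awk '!/^#/ { print $1 "\t" $4-1 "\t" $5 }' Homo_sapiens.GRCh38.87.gtf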

• Note that we subtract 1 from the start position to convert to BED format. This is because BED uses zero-indexing while GTF uses 1-indexing.

Page 8: Essential Skills for Bioinformatics: Unix/Linux

Text processing with awk

• Using sort | uniq -c, we counted the number of features belonging to a particular gene.

• awk also has a very useful data structure known as an associative array, which behaves like Python’s dictionaries or hashes in other languages.

Page 9: Essential Skills for Bioinformatics: Unix/Linux

Text processing with awk

• We can create an associative array by simply assigning a value to a key.
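• For example, this sketch counts how many features of each type appear in the hypothetical GTF from before, keyed by the feature-type column ($3):

awk '{ feature[$3] += 1 } END { for (k in feature) print k "\t" feature[k] }' Homo_sapiens.GRCh38.87.gtf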

Page 10: Essential Skills for Bioinformatics: Unix/Linux

Text processing with awk

• This example illustrates that awk is a programming language: within our action blocks, we can use standard programming statements like if, for, and while.

• However, when awk programs become complex or start to span multiple lines, it is usually better to switch to Python.

Page 11: Essential Skills for Bioinformatics: Unix/Linux

Stream editing with sed

• We learned how Unix pipes are fast because they operate on streams of data rather than data written to disk. Additionally, pipes don’t require that we load an entire file in memory at once. Instead, we can operate one line at a time.

• Often we need to make trivial edits to a stream, usually to prepare it for the next step in a Unix pipeline.

• The stream editor, sed, allows us to do exactly that.

Page 12: Essential Skills for Bioinformatics: Unix/Linux

Stream editing with sed

• sed reads data from a file or standard input and can edit a line at a time. Let’s look at a very simple example: converting a file containing a single column of chromosomes in the format “chrom1” to the format “chr1”.
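• A minimal sketch, assuming the chromosome names live in a hypothetical file chroms.txt:

sed 's/chrom/chr/' chroms.txt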

Page 13: Essential Skills for Bioinformatics: Unix/Linux

Stream editing with sed

• We can edit it without opening the entire file in memory. Our edited output stream is then easy to redirect to a new file.

• In the previous example, we used sed’s substitute command, by far the most popular use of sed. sed’s substitute takes the first occurrence of the pattern between the first two slashes and replaces it with the string between the second and third slashes. In other words, the syntax of sed’s substitute is s/pattern/replacement/.

Page 14: Essential Skills for Bioinformatics: Unix/Linux

Stream editing with sed

• By default, sed only replaces the first occurrence of a match on each line. We can replace all occurrences of strings that match our pattern by setting the global flag g after the last slash: s/pattern/replacement/g
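• For example, with the global flag both occurrences are replaced:

echo "chrom1 chrom2" | sed 's/chrom/chr/g'    # prints "chr1 chr2"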

• If we need matching to be case-insensitive, we can enable this with the flag I after the last slash (a GNU sed extension): s/pattern/replacement/I

• By default, sed’s substitutions use POSIX BRE (basic regular expressions). As with grep, we can use the -E option to enable POSIX ERE (extended regular expressions).

Page 15: Essential Skills for Bioinformatics: Unix/Linux

Stream editing with sed

• Most important is the ability to capture chunks of text that match a pattern, and use these chunks in the replacement.

• Suppose we want to capture the chromosome name, and start and end positions in a string containing a genomic region in the format “chr1:28427874-28425431”, and output this as three columns.
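• One way to do this (shown with GNU sed; BSD sed would need a literal tab rather than \t in the replacement):

echo "chr1:28427874-28425431" | sed -E 's/^(chr[^:]+):([0-9]+)-([0-9]+)/\1\t\2\t\3/'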

Page 16: Essential Skills for Bioinformatics: Unix/Linux

Stream editing with sed

• ^(chr[^:]+): This matches text beginning at the start of the line (^ enforces this) and captures everything between ( and ). The captured pattern begins with “chr” and then matches one or more characters that are not “:”, our delimiter. We match up to the first “:” using a character class that matches anything that’s not “:”, [^:]+

Page 17: Essential Skills for Bioinformatics: Unix/Linux

Stream editing with sed

• ([0-9]+): Match and capture one or more digits.

• Finally, our replacement is these three captured groups, interspersed with tabs, \t.

• Regular expressions are tricky and take time and practice to master.

Page 18: Essential Skills for Bioinformatics: Unix/Linux

Stream editing with sed

• Explicitly capturing each component of our genomic region is one way to tackle this, and nicely demonstrates sed’s ability to capture patterns. However, there are numerous ways to use sed or other Unix tools to parse strings like this.

Page 19: Essential Skills for Bioinformatics: Unix/Linux

Stream editing with sed

• sed 's/[:-]/\t/g': We just replace both delimiters with a tab. Note that we’ve enabled the global flag, which is necessary for this approach to work.

• sed 's/:/\t/' | sed 's/-/\t/': For complex substitutions, it can be much easier to use two or more calls to sed rather than trying to do everything with one regular expression.

• tr ':-' '\t': tr translates all occurrences of the characters in its first argument into the characters in its second.

Page 20: Essential Skills for Bioinformatics: Unix/Linux

Stream editing with sed

• By default, sed prints every line, making replacements to matching lines. Suppose we want to capture all transcript names from the last column of a GTF file.

• Some lines of the last column of the GTF file don’t contain transcript_id, so sed prints the entire line rather than the captured group.

Page 21: Essential Skills for Bioinformatics: Unix/Linux

Stream editing with sed

• One way to solve this would be to use grep “transcript_id” before sed.

• A cleaner way is to disable sed from printing all lines with -n. Then, by appending p after the last slash, sed will print only the lines on which it made a replacement.
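• Putting this together (the GTF file name is an assumed example):

sed -E -n 's/.*transcript_id "([^"]+)".*/\1/p' Homo_sapiens.GRCh38.87.gtf | head -n 3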

Page 22: Essential Skills for Bioinformatics: Unix/Linux

Stream editing with sed

• This example uses an important and broadly useful regular expression idiom: capturing text between delimiters.

1. First, match zero or more of any character (.*) before the string “transcript_id”

2. Then, match and capture one or more characters that are not a quote ([^"]+). This is an important idiom. The brackets make up a character class. Character classes specify which characters the expression is allowed to match. Here, we use a caret (^) inside the brackets to match anything except the characters that follow it. The end result is that we match and capture one or more nonquote characters.

Page 23: Essential Skills for Bioinformatics: Unix/Linux

Stream editing with sed

• The following approach ((.*) rather than ([^"]+)) will not work: .* is greedy, so it matches as much as possible, and the capture runs past the first closing quote out to the last quote on the line.

Page 24: Essential Skills for Bioinformatics: Unix/Linux

Stream editing with sed

• It is also possible to select and print certain ranges of lines with sed. In this case, we are not doing pattern matching, so we don’t need slashes. To print lines 20 through 50 of a file, we combine -n with a line range and the p command.
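• A sketch, with an assumed file name:

sed -n '20,50p' Homo_sapiens.GRCh38.87.gtf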

• sed has features that allow you to make any type of edit to a stream of text, but for complex stream processing tasks it can be easier to write a Python script than a long and complicated sed command.

Page 25: Essential Skills for Bioinformatics: Unix/Linux

ADVANCED SHELL TRICKS

Page 26: Essential Skills for Bioinformatics: Unix/Linux

Subshells

• Remember the difference between sequential commands (connected with && or ;) and piped commands (connected with |).

• Sequential commands are simply run one after the other. The previous command’s standard output does not get passed to the next program’s standard in. In contrast, connecting two programs with pipes means the first program’s standard out will be piped into the next program’s standard in.

• If we run two commands with command1; command2, command2 will always run, regardless of whether command1 exits successfully. In contrast, if we use command1 && command2, command2 will only run if command1 completed with a zero exit status.

Page 27: Essential Skills for Bioinformatics: Unix/Linux

Subshells

• Subshells allow us to execute sequential commands together in a separate shell process. This is useful primarily to group sequential commands together such that their output is a single stream. This gives us a new way to construct clever one-liners and has practical uses in command-line data processing.
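• The two examples discussed on the next slide, reconstructed from the description there (“this command” and “that command” are just placeholder strings):

echo "this command"; echo "that command" | sed 's/command/step/'
(echo "this command"; echo "that command") | sed 's/command/step/'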

Page 28: Essential Skills for Bioinformatics: Unix/Linux

Subshells

• In the first example, only the second command’s standard out is piped into sed. This is because your shell interprets echo "this command" as one command and echo "that command" | sed 's/command/step/' as a second, separate command.

• In the second example, grouping both echo commands together with parentheses causes these two commands to be run in a separate subshell, and both commands’ combined standard output is passed to sed.

• Combining two sequential commands’ standard output into a single stream with a subshell is a useful trick, and one we can apply to shell problems in bioinformatics.

Page 29: Essential Skills for Bioinformatics: Unix/Linux

Subshells

• Consider the problem of sorting a GTF file with a metadata header. We want to sort everything except the header, but still include the header at the top of the final sorted file.
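• A sketch of this approach (file names are assumed examples; GTF metadata lines begin with #, and sort -k1,1 -k4,4n orders by chromosome, then numerically by start position):

(grep "^#" Homo_sapiens.GRCh38.87.gtf; grep -v "^#" Homo_sapiens.GRCh38.87.gtf | sort -k1,1 -k4,4n) > Homo_sapiens.GRCh38.87.sorted.gtf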

• Because we’ve used a subshell, all standard output from these sequential commands will be combined into a single stream, which is redirected to a file.

Page 30: Essential Skills for Bioinformatics: Unix/Linux

Named pipes

• We use pipes to connect command-line tools to build custom data-processing pipelines. However, some programs won’t interface with Unix pipes. For example, certain bioinformatics tools read in multiple input files and write to multiple output files:

processing_tool --in1 in1.fq --in2 in2.fq --out1 out1.fq --out2 out2.fq

• This imaginary program requires two separate input files and produces two separate output files. Many bioinformatics programs process two paired-end files together like this.

Page 31: Essential Skills for Bioinformatics: Unix/Linux

Named pipes

• In addition to the inconvenience of not being able to use a Unix pipe to interface processing_tool with other programs, there’s a more serious problem. We would have to write and read four intermediate files to disk to use this program.

• A named pipe, also known as a FIFO (first in, first out), is a special sort of file. Regular pipes are anonymous: they don’t have a name, and they only persist while both processes are running. Named pipes behave like files and are persistent on your filesystem. We can create a named pipe with the program mkfifo:
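• For example (the ls output in the comment is illustrative):

mkfifo fqin
ls -l fqin    # prw-r--r--  1 user  group  0 Mar  1 12:00 fqin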

Page 32: Essential Skills for Bioinformatics: Unix/Linux

Named pipes

• Notice that this is a special type of file: the p before the file permissions is for pipe. Just like regular pipes, one process writes data into the pipe, and another process reads data out of the pipe.

Page 33: Essential Skills for Bioinformatics: Unix/Linux

Named pipes

• echo "hello, the named pipe." > fqin &

Write some lines to the named pipe we created earlier. (The trailing & runs the command in the background, because writing to a named pipe blocks until another process reads from it.)

• cat fqin

Treating the named pipe just as we would any other file, we can access the data we wrote to it earlier. Any process can access this file and read from this pipe. Like a standard Unix pipe, data that has been read from the pipe is no longer there.

• rm -f fqin

We clean up by using rm to remove it.

Page 34: Essential Skills for Bioinformatics: Unix/Linux

Named pipes

• Although the syntax is similar to shell redirection to a file, we are not actually writing anything to our disk. Named pipes provide all of the computational benefits of pipes with the flexibility of interfacing with files.

• In our earlier example, in1.fq and in2.fq could be named pipes that other processes are writing input to. Additionally, out1.fq and out2.fq could be named pipes that processing_tool is writing to, that other processes read from.

Page 35: Essential Skills for Bioinformatics: Unix/Linux

Process substitution

• Creating and removing these file-like named pipes is a bit tedious. There is a way to use named pipes without having to create them explicitly. This is called process substitution, sometimes known as anonymous named pipes.
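• For example:

cat <(echo "hello, process substitution.")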

• Remember that cat takes file arguments. The chunk <(echo "hello, process substitution.") runs the echo command and pipes the output to an anonymous named pipe.

Page 36: Essential Skills for Bioinformatics: Unix/Linux

Process substitution

• Your shell then replaces this chunk with the path to this anonymous named pipe. No named pipes need to be explicitly created, but you get the same functionality.

• Process substitution allows us to connect two programs, even if one doesn’t take input through standard input.

Page 37: Essential Skills for Bioinformatics: Unix/Linux

Process substitution

• In the previous example of processing_tool, we can use process substitution instead of creating two explicit named pipes.

processing_tool --in1 <(makein raw1.txt) --in2 <(makein raw2.txt) --out1 out1.fq --out2 out2.fq

• It can also be used to capture an output stream:

processing_tool --in1 <(makein raw1.txt) --in2 <(makein raw2.txt) --out1 >(gzip > out1.fq.gz) --out2 >(gzip > out2.fq.gz)

