Essential Skills for Bioinformatics: Unix/Linux

Page 1: Essential Skills for Bioinformatics: Unix/Linux

Essential Skills for Bioinformatics: Unix/Linux

Page 2: Essential Skills for Bioinformatics: Unix/Linux

Text processing with awk

• While awk is designed to work with whitespace-separated tabular data, it is easy to set a different field separator.

• Simply specify which separator to use with the -F argument. For example, we could work with a CSV file in awk by starting with awk -F","
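• For example, a quick sketch (data.csv is a hypothetical file) that prints the second column of a CSV:

awk -F"," '{ print $2 }' data.csv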

Page 3: Essential Skills for Bioinformatics: Unix/Linux

Text processing with awk

• We have illustrated two ways awk can come in handy:

• For filtering data using rules that can combine regular expressions and arithmetic

• Reformatting the columns of data using arithmetic

• We will learn more advanced use cases by introducing two special patterns: BEGIN and END.

• The BEGIN pattern specifies what to do before the first record is read in. It is useful to initialize and set up variables.

• The END pattern specifies what to do after the last record’s processing is complete. It is useful for printing data summaries at the end of file processing.

Page 4: Essential Skills for Bioinformatics: Unix/Linux

Text processing with awk

• Suppose we want to calculate the mean feature length in Homo_sapiens.GRCh38.87.bed:
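• A minimal sketch of that command (in standard BED, the start and end coordinates are columns 2 and 3):

awk 'BEGIN{ s = 0 } { s += ($3 - $2) } END { print "mean: " s/NR }' Homo_sapiens.GRCh38.87.bed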

• NR is the current record number, so on the last record NR is set to the total number of records processed.

Page 5: Essential Skills for Bioinformatics: Unix/Linux

Text processing with awk

• We can use NR to extract ranges of lines:
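• For example, to print only records 3 through 5 (example.bed is a hypothetical file):

awk 'NR >= 3 && NR <= 5' example.bed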

Page 6: Essential Skills for Bioinformatics: Unix/Linux

Text processing with awk

• awk makes it easy to convert between bioinformatics files like BED and GTF.

• Our previous solution using grep and cut has an error: cut can extract and reorder columns, but it cannot modify them, so the start positions remain 1-based instead of being converted to BED’s 0-based coordinates.

Page 7: Essential Skills for Bioinformatics: Unix/Linux

Text processing with awk

• We can generate a three-column BED file from the GTF file as follows.
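• A sketch, assuming a GTF file named Homo_sapiens.GRCh38.87.gtf (in GTF, column 1 is the chromosome, column 4 the start, and column 5 the end; lines beginning with # are header lines):

awk '!/^#/ { print $1 "\t" $4-1 "\t" $5 }' Homo_sapiens.GRCh38.87.gtf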

• Note that we subtract 1 from the start position to convert to BED format. This is because BED uses zero-indexing while GTF uses 1-indexing.

Page 8: Essential Skills for Bioinformatics: Unix/Linux

Text processing with awk

• Using sort | uniq -c, we counted the number of features belonging to a particular gene.

• awk also has a very useful data structure known as an associative array, which behaves like Python’s dictionaries or hashes in other languages.

Page 9: Essential Skills for Bioinformatics: Unix/Linux

Text processing with awk

• We can create an associative array by simply assigning a value to a key.
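• For example, this sketch counts how many features of each type appear in the hypothetical GTF from before, keyed by the feature-type column ($3):

awk '{ feature[$3] += 1 } END { for (k in feature) print k "\t" feature[k] }' Homo_sapiens.GRCh38.87.gtf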

Page 10: Essential Skills for Bioinformatics: Unix/Linux

Text processing with awk

• This example illustrates that awk is a programming language: within our action blocks, we can use standard programming statements like if, for, and while.

• However, when awk programs become complex or start to span multiple lines, it is usually better to switch to Python.

Page 11: Essential Skills for Bioinformatics: Unix/Linux

Stream editing with sed

• We learned how Unix pipes are fast because they operate on streams of data rather than data written to disk. Additionally, pipes don’t require that we load an entire file in memory at once. Instead, we can operate one line at a time.

• Often we need to make trivial edits to a stream, usually to prepare it for the next step in a Unix pipeline.

• The stream editor, sed, allows us to do exactly that.

Page 12: Essential Skills for Bioinformatics: Unix/Linux

Stream editing with sed

• sed reads data from a file or standard input and can edit a line at a time. Let’s look at a very simple example: converting a file containing a single column of chromosomes in the format “chrom1” to the format “chr1”.
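• A minimal sketch, assuming the chromosome names live in a hypothetical file chroms.txt:

sed 's/chrom/chr/' chroms.txt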

Page 13: Essential Skills for Bioinformatics: Unix/Linux

Stream editing with sed

• We can edit it without opening the entire file in memory. Our edited output stream is then easy to redirect to a new file.

• In the previous example, we used sed’s substitute command, by far the most popular use of sed. sed’s substitute takes the first occurrence of the pattern between the first two slashes and replaces it with the string between the second and third slashes. In other words, the syntax of sed’s substitute is s/pattern/replacement/.

Page 14: Essential Skills for Bioinformatics: Unix/Linux

Stream editing with sed

• By default, sed only replaces the first occurrence of a match on each line. We can replace all occurrences of strings that match our pattern by setting the global flag g after the last slash: s/pattern/replacement/g
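• For example, with the global flag both occurrences are replaced:

echo "chrom1 chrom2" | sed 's/chrom/chr/g'    # prints "chr1 chr2"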

• If we need matching to be case-insensitive, we can enable this with the flag I after the last slash (a GNU sed extension): s/pattern/replacement/I

• By default, sed’s substitutions use POSIX BRE (basic regular expressions). As with grep, we can use the -E option to enable POSIX ERE (extended regular expressions).

Page 15: Essential Skills for Bioinformatics: Unix/Linux

Stream editing with sed

• Most important is the ability to capture chunks of text that match a pattern, and use these chunks in the replacement.

• Suppose we want to capture the chromosome name, and start and end positions in a string containing a genomic region in the format “chr1:28427874-28425431”, and output this as three columns.
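• One way to do this (shown with GNU sed; BSD sed would need a literal tab rather than \t in the replacement):

echo "chr1:28427874-28425431" | sed -E 's/^(chr[^:]+):([0-9]+)-([0-9]+)/\1\t\2\t\3/'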

Page 16: Essential Skills for Bioinformatics: Unix/Linux

Stream editing with sed

• ^(chr[^:]+): This matches text beginning at the start of the line (^ enforces this) and captures everything between ( and ). The captured pattern begins with “chr” and then matches one or more characters that are not “:”, our delimiter. We match up to the first “:” using a character class that matches anything that’s not “:”, [^:]+

Page 17: Essential Skills for Bioinformatics: Unix/Linux

Stream editing with sed

• ([0-9]+): Match and capture one or more digits.

• Finally, our replacement is these three captured groups, interspersed with tabs, \t.

• Regular expressions are tricky and take time and practice to master.

Page 18: Essential Skills for Bioinformatics: Unix/Linux

Stream editing with sed

• Explicitly capturing each component of our genomic region is one way to tackle this, and nicely demonstrates sed’s ability to capture patterns. However, there are numerous ways to use sed or other Unix tools to parse strings like this.

Page 19: Essential Skills for Bioinformatics: Unix/Linux

Stream editing with sed

• sed 's/[:-]/\t/g': We just replace both delimiters with a tab. Note that we’ve enabled the global flag, which is necessary for this approach to work.

• sed 's/:/\t/' | sed 's/-/\t/': For complex substitutions, it can be much easier to use two or more calls to sed rather than trying to do everything with one regular expression.

• tr ':-' '\t': tr translates all occurrences of the characters in its first argument into the characters in its second.

Page 20: Essential Skills for Bioinformatics: Unix/Linux

Stream editing with sed

• By default, sed prints every line, making replacements to matching lines. Suppose we want to capture all transcript names from the last column of a GTF file.

• Some lines of the last column of the GTF file don’t contain transcript_id, so sed prints the entire line rather than the captured group.

Page 21: Essential Skills for Bioinformatics: Unix/Linux

Stream editing with sed

• One way to solve this would be to use grep “transcript_id” before sed.

• A cleaner way is to disable sed from printing all lines with -n. Then, by appending p after the last slash, sed will print only the lines on which it made a replacement.
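• Putting this together (the GTF file name is an assumed example):

sed -E -n 's/.*transcript_id "([^"]+)".*/\1/p' Homo_sapiens.GRCh38.87.gtf | head -n 3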

Page 22: Essential Skills for Bioinformatics: Unix/Linux

Stream editing with sed

• This example uses an important and broadly useful regular expression idiom: capturing text between delimiters.

1. First, match zero or more of any character (.*) before the string “transcript_id”

2. Then, match and capture one or more characters that are not a quote ([^"]+). This is an important idiom. The brackets make up a character class. Character classes specify which characters the expression is allowed to match. Here, we use a caret (^) inside the brackets to match anything except the characters that follow it. The end result is that we match and capture one or more nonquote characters.

Page 23: Essential Skills for Bioinformatics: Unix/Linux

Stream editing with sed

• The following approach ((.*) rather than ([^"]+)) will not work: .* is greedy, so it matches as much as possible, and the capture runs past the first closing quote out to the last quote on the line.

Page 24: Essential Skills for Bioinformatics: Unix/Linux

Stream editing with sed

• It is also possible to select and print certain ranges of lines with sed. In this case, we are not doing pattern matching, so we don’t need slashes. To print lines 20 through 50 of a file, we combine -n with a line range and the p command.
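• A sketch, with an assumed file name:

sed -n '20,50p' Homo_sapiens.GRCh38.87.gtf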

• sed has features that allow you to make any type of edit to a stream of text, but for complex stream processing tasks it can be easier to write a Python script than a long and complicated sed command.

Page 25: Essential Skills for Bioinformatics: Unix/Linux

ADVANCED SHELL TRICKS

Page 26: Essential Skills for Bioinformatics: Unix/Linux

Subshells

• Remember the difference between sequential commands (connected with && or ;) and piped commands (connected with |).

• Sequential commands are simply run one after the other. The previous command’s standard output does not get passed to the next program’s standard in. In contrast, connecting two programs with pipes means the first program’s standard out will be piped into the next program’s standard in.

• If we run two commands with command1; command2, command2 will always run, regardless of whether command1 exits successfully. In contrast, if we use command1 && command2, command2 will only run if command1 completed with a zero exit status.

Page 27: Essential Skills for Bioinformatics: Unix/Linux

Subshells

• Subshells allow us to execute sequential commands together in a separate shell process. This is useful primarily to group sequential commands together such that their output is a single stream. This gives us a new way to construct clever one-liners and has practical uses in command-line data processing.
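• The two examples discussed on the next slide, reconstructed from the description there (“this command” and “that command” are just placeholder strings):

echo "this command"; echo "that command" | sed 's/command/step/'
(echo "this command"; echo "that command") | sed 's/command/step/'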

Page 28: Essential Skills for Bioinformatics: Unix/Linux

Subshells

• In the first example, only the second command’s standard out is piped into sed. This is because your shell interprets echo "this command" as one command and echo "that command" | sed 's/command/step/' as a second, separate command.

• In the second example, grouping both echo commands together with parentheses causes these two commands to be run in a separate subshell, and both commands’ combined standard output is passed to sed.

• Combining two sequential commands’ standard output into a single stream with a subshell is a useful trick, and one we can apply to shell problems in bioinformatics.

Page 29: Essential Skills for Bioinformatics: Unix/Linux

Subshells

• Consider the problem of sorting a GTF file with a metadata header. We want to sort everything except the header, but still include the header at the top of the final sorted file.
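• A sketch of this approach (file names are assumed examples; GTF metadata lines begin with #, and sort -k1,1 -k4,4n orders by chromosome, then numerically by start position):

(grep "^#" Homo_sapiens.GRCh38.87.gtf; grep -v "^#" Homo_sapiens.GRCh38.87.gtf | sort -k1,1 -k4,4n) > Homo_sapiens.GRCh38.87.sorted.gtf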

• Because we’ve used a subshell, all standard output from these sequential commands will be combined into a single stream, which is redirected to a file.

Page 30: Essential Skills for Bioinformatics: Unix/Linux

Named pipes

• We use pipes to connect command-line tools to build custom data-processing pipelines. However, some programs won’t interface with Unix pipes. For example, certain bioinformatics tools read in multiple input files and write to multiple output files:

processing_tool --in1 in1.fq --in2 in2.fq --out1 out1.fq --out2 out2.fq

• This imaginary program requires two separate input files and produces two separate output files. Many bioinformatics programs process two paired-end files together like this.

Page 31: Essential Skills for Bioinformatics: Unix/Linux

Named pipes

• In addition to the inconvenience of not being able to use a Unix pipe to interface processing_tool with other programs, there’s a more serious problem. We would have to write and read four intermediate files to disk to use this program.

• A named pipe, also known as a FIFO (first in, first out), is a special sort of file. Regular pipes are anonymous: they don’t have a name, and they only persist while both processes are running. Named pipes behave like files and are persistent on your filesystem. We can create a named pipe with the program mkfifo:
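• For example (the ls output in the comment is illustrative):

mkfifo fqin
ls -l fqin    # prw-r--r--  1 user  group  0 Mar  1 12:00 fqin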

Page 32: Essential Skills for Bioinformatics: Unix/Linux

Named pipes

• Notice that this is a special type of file: the p before the file permissions is for pipe. Just like regular pipes, one process writes data into the pipe, and another process reads data out of the pipe.

Page 33: Essential Skills for Bioinformatics: Unix/Linux

Named pipes

• echo "hello, the named pipe." > fqin &

Write some lines to the named pipe we created earlier. (The trailing & runs the command in the background, because writing to a named pipe blocks until another process reads from it.)

• cat fqin

Treating the named pipe just as we would any other file, we can access the data we wrote to it earlier. Any process can access this file and read from this pipe. Like a standard Unix pipe, data that has been read from the pipe is no longer there.

• rm -f fqin

We clean up by using rm to remove it.

Page 34: Essential Skills for Bioinformatics: Unix/Linux

Named pipes

• Although the syntax is similar to shell redirection to a file, we are not actually writing anything to our disk. Named pipes provide all of the computational benefits of pipes with the flexibility of interfacing with files.

• In our earlier example, in1.fq and in2.fq could be named pipes that other processes are writing input to. Additionally, out1.fq and out2.fq could be named pipes that processing_tool is writing to, that other processes read from.

Page 35: Essential Skills for Bioinformatics: Unix/Linux

Process substitution

• Creating and removing these file-like named pipes is a bit tedious. There is a way to use named pipes without having to create them explicitly. This is called process substitution, sometimes known as anonymous named pipes.
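• For example:

cat <(echo "hello, process substitution.")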

• Remember that cat takes file arguments. The chunk <(echo "hello, process substitution.") runs the echo command and pipes the output to an anonymous named pipe.

Page 36: Essential Skills for Bioinformatics: Unix/Linux

Process substitution

• Your shell then replaces this chunk with the path to this anonymous named pipe. No named pipes need to be explicitly created, but you get the same functionality.

• Process substitution allows us to connect two programs, even if one doesn’t take input through standard input.

Page 37: Essential Skills for Bioinformatics: Unix/Linux

Process substitution

• In the previous example of processing_tool, we can use process substitution instead of creating two explicit named pipes.

processing_tool --in1 <(makein raw1.txt) --in2 <(makein raw2.txt) --out1 out1.fq --out2 out2.fq

• It can also be used to capture an output stream:

processing_tool --in1 <(makein raw1.txt) --in2 <(makein raw2.txt) --out1 >(gzip > out1.fq.gz) --out2 >(gzip > out2.fq.gz)

