
Parsing data records. >sp|P31946|1433B_HUMAN 14-3-3 protein beta/alpha OS=Homo sapiens...

Date post: 28-Dec-2015
Upload: kristian-newton
Page 1: Parsing data records. >sp|P31946|1433B_HUMAN 14-3-3 protein beta/alpha OS=Homo sapiens MTMDKSELVQKAKLAEQAERYDDMAAAMKAVTEQGHELSNEERNLLSVAYKNVVGARRSS WRVISSIEQKTERNEKKQQMGKEYREKIEAELQDICNDVLELLDKYLIPNATQPESKVFY.

Parsing data records

>sp|P31946|1433B_HUMAN 14-3-3 protein beta/alpha OS=Homo sapiens
MTMDKSELVQKAKLAEQAERYDDMAAAMKAVTEQGHELSNEERNLLSVAYKNVVGARRSS
WRVISSIEQKTERNEKKQQMGKEYREKIEAELQDICNDVLELLDKYLIPNATQPESKVFY
LKMKGDYFRYLSEVASGDNKQTTVSNSQQAYQEAFEISKKEMQPTHPIRLGLALNFSVFY
YEILNSPEKACSLAKTAFDEAIAELDTLNEESYKDSTLIMQLLRDNLTLWTSENQGDEGD
AGEGEN

A sequence record in FASTA format

seq = ">sp|P31946|1433B_HUMAN 14-3-3 protein beta/alpha OS=Homo sapiens \
MTMDKSELVQKAKLAEQAERYDDMAAAMKAVTEQGHELSNEERNLLSVAYKNVVGARRSS\
WRVISSIEQKTERNEKKQQMGKEYREKIEAELQDICNDVLELLDKYLIPNATQPESKVFY\
LKMKGDYFRYLSEVASGDNKQTTVSNSQQAYQEAFEISKKEMQPTHPIRLGLALNFSVFY\
YEILNSPEKACSLAKTAFDEAIAELDTLNEESYKDSTLIMQLLRDNLTLWTSENQGDEGD\
AGEGEN"

for i in seq:
    print i

seq = open("SingleSeq.fasta")

for line in seq:
    print line

seq = open("SingleSeq.fasta")
seq_2 = open("SingleSeq-2.fasta", "w")    # the output file must be opened in write mode

for line in seq:
    seq_2.write(line)

seq_2.close()
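The copy loop above can also be written with `with` blocks, which close both files automatically even if an error occurs halfway. This is a minimal sketch in Python 3 syntax (the slides use Python 2); the function name `copy_text_file` is hypothetical and not part of the original material.

```python
def copy_text_file(src_path, dst_path):
    """Copy a text file line by line and return the number of lines copied."""
    n_lines = 0
    # Reading is the default mode; the destination needs "w" to be writable.
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            dst.write(line)
            n_lines += 1
    # Both files are already closed here.
    return n_lines
```

A call such as `copy_text_file("SingleSeq.fasta", "SingleSeq-2.fasta")` would reproduce the behaviour of the loop above.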


Writing different things depending on a condition

Read a sequence in FASTA format and print only the header of the sequence


seq = open("SingleSeq.fasta")

for line in seq:
    if line[0] == '>':
        print line


Making choices: The if/elif/else statements

if <condition 1>:        # if the expression in <condition 1> is TRUE,
    <statements 1>       # execute statements 1
[elif <condition 2>:     # else if the expression in <condition 2> is TRUE,
    <statements 2>]      # execute statements 2
[elif <condition 3>:     # etc.
    pass]
...
[else:
    <statements N>]
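As an illustration of the syntax above, the sketch below labels a single line of a FASTA file. It is written in Python 3 syntax (the slides use Python 2), and the function name `classify_line` and its three labels are made up for this example. Only the first branch whose condition is true is executed.

```python
def classify_line(line):
    """Return a label describing what kind of FASTA line this is."""
    if line.startswith(">"):
        kind = "header"
    elif line.strip() == "":
        kind = "blank"
    else:
        kind = "sequence"
    return kind


for text in [">sp|P31946|1433B_HUMAN", "", "MTMDKSELVQ"]:
    print(repr(text), "->", classify_line(text))
```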

>>> s = "MGSNKSKPKDASQRRRSLEPAENVHGAGGGAFPASQTPSKPASADGHRGPSAAFAPAAAE"
>>> s_len = float(len(s))
>>> G_num = s.count('G')
>>> A_num = s.count('A')
>>> freq_G = G_num/s_len
>>> freq_A = A_num/s_len
>>> print freq_G
0.116666666667
>>> print freq_A
0.216666666667

Write different things depending on a condition

>>> if freq_G > freq_A:
...     print "Gly is more frequent than Ala"
... elif freq_G < freq_A:
...     print "Ala is more frequent than Gly"
... else:
...     print "The frequency of Gly and Ala is the same"
...
Ala is more frequent than Gly


The if/elif/else construct produces different effects compared with the use of a series of if conditions
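The difference can be seen by running the same test value through both forms. This is a small sketch in Python 3 syntax (the slides use Python 2); the function names and the thresholds 10 and 5 are chosen only for illustration.

```python
def with_elif(n):
    """if/elif/else: only the first true branch runs."""
    labels = []
    if n > 10:
        labels.append("greater than 10")
    elif n > 5:
        labels.append("greater than 5")
    else:
        labels.append("5 or less")
    return labels


def with_separate_ifs(n):
    """Series of ifs: every condition is tested, so several branches may run."""
    labels = []
    if n > 10:
        labels.append("greater than 10")
    if n > 5:
        labels.append("greater than 5")
    if n <= 5:
        labels.append("5 or less")
    return labels


print(with_elif(12))          # ['greater than 10']
print(with_separate_ifs(12))  # ['greater than 10', 'greater than 5']
```

With `elif`, the second condition is never evaluated once the first one succeeds; with independent `if` statements, both messages are produced for the same value.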

seq = open("SingleSeq.fasta")

for line in seq:
    if line[0] != '>':
        print line

seq = open("SingleSeq.fasta")

for line in seq:
    if line[0] == '>':
        print line

seq = open("SingleSeq.fasta")

for line in seq:
    if line[0] != '>':
        print line

Comparison operators: ==  !=  >=  <=  >  <

Exercises 1, 2, and 3

1) Read a file in FASTA format and write to a new file only the header of the record.
2) Read a file in FASTA format and write to a new file only the sequence (without the header).
3) Merge 1) and 2). In other words, read a file in FASTA format and write the header to one file and the sequence to a different one.

fasta = open('SingleSeq.fasta')
header = open('header.txt', 'w')

for line in fasta:
    if line[0] == '>':
        header.write(line)

header.close()

fasta = open('SingleSeq.fasta')
seq = open('seq.txt', 'w')

for line in fasta:
    if line[0] != '>':
        seq.write(line)

seq.close()

fasta = open('SingleSeq.fasta')
header = open('header.txt', 'w')
seq = open('seq.txt', 'w')

for line in fasta:
    if line[0] == '>':
        header.write(line)
    else:
        seq.write(line)

header.close()
seq.close()


Let’s increase the difficulty just a bit…

seq_fasta = open("SingleSeq.fasta")

seq = ''

for line in seq_fasta:
    if line[0] == '>':
        header = line
    else:
        seq = seq + line.strip()

num_cys = seq.count("C")

print header, seq, num_cys
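The same logic can be wrapped in a function that accepts any iterable of lines, so it can be tried without a file on disk. This is a sketch in Python 3 syntax (the slides use Python 2); the name `read_single_record` is hypothetical, and the example input is just the header and the first two sequence lines of the record shown earlier.

```python
def read_single_record(lines):
    """Return (header, sequence) for an iterable of lines holding one FASTA record."""
    header = ""
    seq = ""
    for line in lines:
        if line.startswith(">"):
            header = line.strip()
        else:
            # strip() removes the newline so the sequence becomes one long string
            seq = seq + line.strip()
    return header, seq


record = [
    ">sp|P31946|1433B_HUMAN 14-3-3 protein beta/alpha OS=Homo sapiens\n",
    "MTMDKSELVQKAKLAEQAERYDDMAAAMKAVTEQGHELSNEERNLLSVAYKNVVGARRSS\n",
    "WRVISSIEQKTERNEKKQQMGKEYREKIEAELQDICNDVLELLDKYLIPNATQPESKVFY\n",
]
header, seq = read_single_record(record)
num_cys = seq.count("C")
print(header)
print(len(seq), num_cys)
```

An open file object can be passed in place of the list, since iterating over a file also yields its lines one by one.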


Exercise 4

4) Read a file in FASTA format. Print or write the record to a file only if the sequence is from Homo sapiens.

seq_fasta = open("SingleSeq.fasta")

seq = ''
header = ''

for line in seq_fasta:
    if line[0] == '>':
        if "Homo sapiens" in line:
            header = line
    else:
        if header:
            seq = seq + line

if header:
    print header + seq
else:
    print "The record is not from H. sapiens"

In general, you will need to analyse several sequences…

>sp|P31946|1433B_HUMAN 14-3-3 protein beta/alpha OS=Homo sapiens
MTMDKSELVQKAKLAEQAERYDDMAAAMKAVTEQGHELSNEERNLLSVAYKNVVGARRSS
WRVISSIEQKTERNEKKQQMGKEYREKIEAELQDICNDVLELLDKYLIPNATQPESKVFY
LKMKGDYFRYLSEVASGDNKQTTVSNSQQAYQEAFEISKKEMQPTHPIRLGLALNFSVFY
YEILNSPEKACSLAKTAFDEAIAELDTLNEESYKDSTLIMQLLRDNLTLWTSENQGDEGD
AGEGEN
>sp|P62258|1433E_HUMAN 14-3-3 protein epsilon OS=Homo sapiens
MDDREDLVYQAKLAEQAERYDEMVESMKKVAGMDVELTVEERNLLSVAYKNVIGARRASW
RIISSIEQKEENKGGEDKLKMIREYRQMVETELKLICCDILDVLDKHLIPAANTGESKVF
YYKMKGDYHRYLAEFATGNDRKEAAENSLVAYKAASDIAMTELPPTHPIRLGLALNFSVF
YYEILNSPDRACRLAKAAFDDAIAELDTLSEESYKDSTLIMQLLRDNLTLWTSDMQGDGE
EQNKEALQDVEDENQ
>sp|Q04917|1433F_HUMAN 14-3-3 protein eta OS=Homo sapiens GN=YWHAH
MGDREQLLQRARLAEQAERYDDMASAMKAVTELNEPLSNEDRNLLSVAYKNVVGARRSSW
RVISSIEQKTMADGNEKKLEKVKAYREKIEKELETVCNDVLSLLDKFLIKNCNDFQYESK
VFYLKMKGDYYRYLAEVASGEKKNSVVEASEAAYKEAFEISKEQMQPTHPIRLGLALNFS
VFYYEIQNAPEQACLLAKQAFDDAIAELDTLNEDSYKDSTLIMQLLRDNLTLWTSDQQDE
EAGEGN
.........

SwissProt-Human.fasta

Read the records from a file and write them to a new file

fasta = open('SwissProt-Human.fasta')
fasta_2 = open('SwissProt-Human_2.fasta', 'w')

for line in fasta:
    fasta_2.write(line)    # the argument of write() must be a string

fasta_2.close()

Strings can be concatenated

>>> print "ACTGGTA" + "ATGTAACTT"
ACTGGTAATGTAACTT

Strings can be indexed and sliced

>>> s = "ACTGGTA"
>>> s[0]
'A'
>>> s[1:3]
'CT'

String elements cannot be re-assigned

>>> s[2] = 'Z'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'str' object does not support item assignment
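Because string elements cannot be re-assigned, a "modified" sequence has to be built as a new string. The sketch below (Python 3 syntax, whereas the slides use Python 2) shows two common ways of doing this; the variable names are chosen only for illustration.

```python
s = "ACTGGTA"

# Build a new string from slices instead of assigning to s[2]
mutated = s[:2] + "Z" + s[3:]
print(mutated)  # ACZGGTA

# Alternative: convert to a list (which is mutable), change the element, join back
chars = list(s)
chars[2] = "Z"
also_mutated = "".join(chars)
print(also_mutated)  # ACZGGTA

# The original string is untouched in both cases
print(s)  # ACTGGTA
```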

Read the sequences from a file and write them to a new file

fasta = open('SwissProt-Human.fasta')
fasta_2 = open('SwissProt-Human_2.fasta', 'w')

n = 0
for line in fasta:
    n = n + 1
    l_n = str(n)
    fasta_2.write(l_n + "\t" + line)

fasta_2.close()

Number the lines starting from 1
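Numbering from 1 can also be delegated to the built-in `enumerate()`, which takes the starting value as its second argument and removes the need for a manual counter. This is a sketch in Python 3 syntax (the slides use Python 2); the function name `number_lines` is hypothetical.

```python
def number_lines(lines, start=1):
    """Return the lines prefixed with a running number and a tab."""
    numbered = []
    for n, line in enumerate(lines, start):
        # n is an integer, so it must be converted before concatenation
        numbered.append(str(n) + "\t" + line)
    return numbered


for out_line in number_lines([">sp|P31946|1433B_HUMAN\n", "MTMDKSELVQ\n"]):
    print(out_line, end="")
```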



Exercise 5

5) Download a Uniprot multiple sequence FASTA file. Write the record headers to a new file.

fasta = open('SwissProt-Human.fasta')
headers = open('headers.txt', 'w')

for line in fasta:
    if line[0] == '>':
        headers.write(line)

headers.close()



Exercise 6

6) Read a multiple sequence FASTA file and write the sequences to a new file separated by a blank line

fasta = open('SwissProt-Human.fasta')
seqs = open('seqs.txt', 'w')

for line in fasta:
    if line[0] == '>':
        seqs.write('\n')
    elif line[0] != '>':
        seqs.write(line)    # alternative: seqs.write(line.strip() + '\n')

seqs.close()


Exercise 7

7) Read a multiple sequence FASTA file and write to a new file only the records from Homo sapiens.

fasta = open('sprot_prot.fasta')
output = open('homo_sapiens.fasta', 'w')

seq = ''

for line in fasta:
    if line[0] == '>' and seq == '':
        header = line
    elif line[0] != '>':
        seq = seq + line
    elif line[0] == '>' and seq != '':
        if "Homo sapiens" in header:
            output.write(header + seq)
        header = line
        seq = ''

if "Homo sapiens" in header:
    output.write(header + seq)

output.close()


Exercise 8

8) Read FASTA records from a file and count the cysteine residues in each sequence.

fasta = open('sprot_prot.fasta')

seq = ''

for line in fasta:
    if line[0] == '>' and seq == '':
        header = line[4:10]
    elif line[0] != '>':
        seq = seq + line.strip()
    elif line[0] == '>' and seq != '':
        cys_num = seq.count('C')
        print header, ': ', cys_num
        header = line[4:10]
        seq = ''

# count and print the last record, which is not followed by another header
cys_num = seq.count('C')
print header, ': ', cys_num

Read the records from a file and count the cysteine residues in each sequence
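The record-by-record logic used here recurs in several exercises, so it can be factored into a reusable generator. This is one possible sketch in Python 3 syntax (the slides use Python 2), not the author's solution; the name `read_fasta` and the two short example records are made up for illustration.

```python
def read_fasta(lines):
    """Yield (header, sequence) tuples from an iterable of FASTA lines."""
    header = None
    chunks = []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record, if there was one
            if header is not None:
                yield header, "".join(chunks)
            header = line
            chunks = []
        elif line and header is not None:
            chunks.append(line)
    # The last record is not followed by another '>' line, so emit it here
    if header is not None:
        yield header, "".join(chunks)


# Hypothetical two-record input used only to exercise the function
example = [
    ">sp|AAAAA1|X_HUMAN example one\n",
    "MCCA\n",
    "CW\n",
    ">sp|BBBBB2|Y_HUMAN example two\n",
    "MKW\n",
]
for header, seq in read_fasta(example):
    ac = header.split("|")[1]
    print(ac, ":", seq.count("C"))
```

Handling the final record inside the generator avoids repeating the counting and printing code after the loop.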

Exercises 9, 10, and 11

9) Read a multiple sequence file in FASTA format and write to a new file only the records of the sequences starting with a methionine ('M').
10) Read a multiple sequence file in FASTA format and write to a new file only the records of the sequences having at least two tryptophan residues ('W').
11) Read a multiple sequence file in FASTA format and write to a new file only the records whose sequences start with a methionine ('M') and have at least two tryptophans ('W').

outfile = open('SwissProtHuman-Filtered.fasta','w')
fasta = open('SwissProtHuman.fasta','r')

seq = ''

for line in fasta:
    if line[0:1] == '>' and seq == '':
        header = line
    elif line[0:1] != '>':
        seq = seq + line
    elif line[0:1] == '>' and seq != '':
        TRP_num = seq.count('W')
        if seq[0] == 'M' and TRP_num > 1:
            outfile.write(header + seq)
        seq = ''
        header = line

TRP_num = seq.count('W')
if seq[0] == 'M' and TRP_num > 1:
    outfile.write(header + seq)
outfile.close()
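The two conditions of exercises 9 to 11 can be kept in small functions so that the test is written only once instead of twice (inside the loop and after it). This is a sketch in Python 3 syntax (the slides use Python 2); the three function names are hypothetical.

```python
def starts_with_met(seq):
    """True if the sequence starts with a methionine (exercise 9)."""
    # startswith() is safe on an empty string, unlike seq[0]
    return seq.startswith("M")


def has_two_trp(seq):
    """True if the sequence has at least two tryptophans (exercise 10)."""
    return seq.count("W") >= 2


def keep_record(seq):
    """True if both conditions hold (exercise 11)."""
    seq = seq.replace("\n", "")  # tolerate sequences that still contain newlines
    return starts_with_met(seq) and has_two_trp(seq)


print(keep_record("MKWAAW"))  # True
print(keep_record("MKWAA"))   # False: only one W
print(keep_record("AKWAAW"))  # False: does not start with M
```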


In many cases you will need to compare data from different files


SwissProt-Human.fasta

cancer-expressed.txt

1) Read 10 SwissProt ACs from a file
2) Store them into a data structure

cancer_file = open('cancer-expressed.txt')

cancer_list = []

for line in cancer_file:
    AC = line.strip()
    cancer_list.append(AC)

print cancer_list


List data structure

A list is a mutable ordered collection of objects

L = [1, [2,3], 4.52, 'DNA']

The elements of a list can be any kind of object: numbers, strings, tuples, lists, dictionaries, function calls, etc.

L = []    # the empty list

>>> L = [1, "hello", 12.1, [1, 2, "three"], "seq", (1, 2)]
>>> L[0]        # indexing
1
>>> L[3]        # indexing
[1, 2, 'three']
>>> L[3][2]     # indexing
'three'
>>> L[-1]       # negative indexing
(1, 2)
>>> L[2:4]      # slicing
[12.1, [1, 2, 'three']]
>>> L[2:]       # slicing shorthand
[12.1, [1, 2, 'three'], 'seq', (1, 2)]

The elements of a list can be changed/replaced after the list has been defined

l[i] = x
l[i:j] = t
del l[i:j]
del l[i:j:k]
l.append(x)
l.extend(x)

>>> l = [2,3,5,7,8,['a','b'],'a','b','cde']
>>> l[0] = 1
>>> l
[1, 3, 5, 7, 8, ['a', 'b'], 'a', 'b', 'cde']
>>> l[0:3] = 'DNA'
>>> l
['D', 'N', 'A', 7, 8, ['a', 'b'], 'a', 'b', 'cde']
>>> del l[0:5]
>>> l
[['a', 'b'], 'a', 'b', 'cde']
>>> l.append('DNA')
>>> l
[['a', 'b'], 'a', 'b', 'cde', 'DNA']
>>> l.extend('dna')
>>> l
[['a', 'b'], 'a', 'b', 'cde', 'DNA', 'd', 'n', 'a']

These operations CHANGE the list

l.count(x)
l.index(x)
l.insert(i, x)
l.pop(i)
l.remove(x)

>>> l = [1,3,5,7,8,['a','b'],'a','b','cde']
>>> l.count('a')
1
>>> l.index(8)
4
>>> l.insert(4, 80)
>>> l
[1, 3, 5, 7, 80, 8, ['a', 'b'], 'a', 'b', 'cde']
>>> l.pop(4)
80
>>> l
[1, 3, 5, 7, 8, ['a', 'b'], 'a', 'b', 'cde']
>>> l.pop()
'cde'
>>> l
[1, 3, 5, 7, 8, ['a', 'b'], 'a', 'b']
>>> l.remove(8)
>>> l
[1, 3, 5, 7, ['a', 'b'], 'a', 'b']

l.reverse()
l.sort()
sorted(l)

>>> l = [4, 3, 2, 1, 5, 6, 7, 8]
>>> l.reverse()
>>> l
[8, 7, 6, 5, 1, 2, 3, 4]
>>> new = sorted(l)
>>> new
[1, 2, 3, 4, 5, 6, 7, 8]
>>> l
[8, 7, 6, 5, 1, 2, 3, 4]
>>> l.sort()
>>> l
[1, 2, 3, 4, 5, 6, 7, 8]
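A frequent source of confusion is that `l.sort()` changes the list in place and returns `None`, whereas `sorted(l)` leaves the list alone and returns a new one. The sketch below (Python 3 syntax, whereas the slides use Python 2) applies both to the three ACs that appear in the example FASTA file; the variable names are chosen only for illustration.

```python
acs = ["Q04917", "P31946", "P62258"]

ordered = sorted(acs)   # returns a new, sorted list
print(ordered)          # ['P31946', 'P62258', 'Q04917']
print(acs)              # ['Q04917', 'P31946', 'P62258'] (unchanged)

result = acs.sort()     # sorts in place and returns None
print(result)           # None
print(acs)              # ['P31946', 'P62258', 'Q04917']
```

Writing `acs = acs.sort()` would therefore replace the list with `None`.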

Putting together lists and loops
range() and xrange() built-in functions

>>> range(10)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> range(1, 11)
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
>>> range(0, 30, 5)
[0, 5, 10, 15, 20, 25]
>>> range(0, 10, 3)
[0, 3, 6, 9]
>>> range(0, -10, -1)
[0, -1, -2, -3, -4, -5, -6, -7, -8, -9]
>>> range(0)
[]
>>> range(1, 0)
[]
# the xrange() function is more commonly used in for loops than range()
>>> for i in xrange(5):
...     print i
...
0
1
2
3
4

The xrange() function generates the values on demand, i.e. it does not build and store the whole list in memory
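The session above is Python 2. In Python 3, `xrange()` no longer exists and `range()` itself produces its values on demand, so it has to be wrapped in `list()` to see them all at once. A short sketch in Python 3 syntax; the variable names are chosen only for illustration.

```python
r = range(5)
print(r)          # range(0, 5): a lazy range object, not a list
print(list(r))    # [0, 1, 2, 3, 4]

# In a for loop the values are produced one at a time
values = []
for i in range(0, 30, 5):
    values.append(i)
print(values)     # [0, 5, 10, 15, 20, 25]
```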


Exercise 12

12) Create a list containing Uniprot ACs extracted from a FASTA file. Print the list.

InputFile = open("SwissProtHuman.fasta","r")
AC_list = []
for line in InputFile:
    if line[0] == '>':
        fields = line.split('|')
        AC_list.append(fields[1])
print AC_list


By the way…. Exercise 13

13) Read a file in FASTA format and copy to a new file the record ACs.

human_fasta = open('SwissProt-Human.fasta')
Outfile = open('SwissProt-Human-AC.txt', 'w')    # write mode is needed for the output file

for line in human_fasta:
    if line[0] == '>':
        AC = line.split('|')[1]
        Outfile.write(AC + '\n')

Outfile.close()

Selectively extract ACs from a FASTA file


Exercise 14

14) Read the human FASTA file one record after the other. Check if the record header contains one of the 10 ACs. If YES, copy the header to a new file.


Read the human FASTA file one record after the other.

Check if the record header contains one of the 10 ACs.

If YES, copy the header to a new file.

cancer_file = open('cancer-expressed.txt')
human_fasta = open('SwissProt-Human.fasta')
Outfile = open('cancer-expressed.fasta', 'w')
cancer_list = []
for line in cancer_file:
    AC = line.strip()
    cancer_list.append(AC)

for line in human_fasta:
    if line[0] == '>':
        AC = line.split('|')[1]
        if AC in cancer_list:
            Outfile.write(line)
Outfile.close()

We are not writing the whole record but the header line only


Exercise 15

15) Read a multiple sequence file in FASTA format and write to a new file only the records whose UniProt ACs are present in the list created in 12).

cancer_file = open('cancer-expressed.txt')
human_fasta = open('SwissProt-Human.fasta')
Outfile = open('cancer_expressed.fasta','w')

cancer_list = []

for line in cancer_file:
    AC = line.strip()
    cancer_list.append(AC)

for line in human_fasta:
    if line[0] == ">":
        field = line.split("|")
        AC = field[1]
        if AC in cancer_list:
            Outfile.write(line)
    else:
        if AC in cancer_list:
            Outfile.write(line)

Outfile.close()

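The same, but with more control: the header and the sequence of each record are collected first and then written together.

cancer_file = open('cancer-expressed.txt')
human_fasta = open('SwissProt-Human.fasta')
Outfile = open('cancer_expressed.fasta', 'w')

cancer_list = []
seq = ''

for line in cancer_file:
    AC = line.strip()
    cancer_list.append(AC)

for line in human_fasta:
    if line[0] == '>' and seq == '':
        # First header of the file
        header = line
        AC = line.split('|')[1]
    elif line[0] != '>':
        # Sequence line: append it to the current record
        seq = seq + line
    elif line[0] == '>' and seq != '':
        # New header: write the previous record if wanted, then reset
        if AC in cancer_list:
            Outfile.write(header + seq)
        header = line
        AC = line.split('|')[1]
        seq = ''

# The last record is not followed by a header, so handle it here
if AC in cancer_list:
    Outfile.write(header + seq)

Outfile.close()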
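As an illustration only, the record-by-record filtering above can also be sketched with a helper that groups lines into records before any test is made. The names `read_fasta_records` and `filter_by_ac` are hypothetical and do not come from the slides. The example reads from in-memory text (Python 3 `io.StringIO`) with truncated sequences so that it runs without SwissProt-Human.fasta, and it uses a set rather than a list for the accession lookup, which is a substitution and not the approach shown in the slides.

```python
import io


def read_fasta_records(handle):
    """Yield (header, sequence_text) pairs from an open FASTA file."""
    header, seq_lines = None, []
    for line in handle:
        if line.startswith('>'):
            # A new header closes the previous record, if there was one
            if header is not None:
                yield header, ''.join(seq_lines)
            header, seq_lines = line, []
        elif header is not None:
            seq_lines.append(line)
    # The last record is not followed by another header
    if header is not None:
        yield header, ''.join(seq_lines)


def filter_by_ac(handle, wanted_acs):
    """Return the text of all records whose UniProt AC is in wanted_acs."""
    kept = []
    for header, seq in read_fasta_records(handle):
        ac = header.split('|')[1]
        if ac in wanted_acs:
            kept.append(header + seq)
    return ''.join(kept)


# Sequences truncated for brevity; only the header layout matters here.
example = io.StringIO(
    ">sp|P31946|1433B_HUMAN 14-3-3 protein beta/alpha\n"
    "MTMDKSELVQ\n"
    "KAKLAEQAER\n"
    ">sp|P62258|1433E_HUMAN 14-3-3 protein epsilon\n"
    "MDDREDLVYQ\n"
)
wanted = {"P62258"}
result = filter_by_ac(example, wanted)
print(result)
```

Keeping the grouping step separate from the test means the last record needs no special handling outside the helper.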

Extract the nucleotide sequence from the Candida albicans genomic DNA record (chromosome 7, complete sequence; file ap006852.gbk) and write it to a file.

Try to write it in FASTA format:

>AP006852
CcactgtccaatggctcaacacgccaatcatcatacaatacccccaacaggaatcaccaaagtactgatgcttctcactatcaatagtttgtactttcaccacacaatagcagatgatccatctaaatccaccttcctatcgatcgtgaccacccccataaaataggtcaactccataaacacctccatcaccaacgctagactcacaacccagaacatgttaatcaaccggtgggccaaGtaccgttgtagctctctcgtaaacacaagaaccaacaccaaacaacatactacaactga......
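A FASTA record is a header line starting with '>' followed by the sequence, usually wrapped over several lines. A minimal sketch of writing that layout is given below. The function name `format_fasta` and the default width of 60 characters are assumptions (60 is a common convention, not a requirement of the format), and the sequence is just a short fragment used for the demonstration.

```python
def format_fasta(name, sequence, width=60):
    """Return a FASTA record with the sequence wrapped at `width` characters."""
    lines = ['>' + name]
    # range() with a step gives the start position of each output line
    for start in range(0, len(sequence), width):
        lines.append(sequence[start:start + width])
    return '\n'.join(lines) + '\n'


# A 20-character fragment wrapped at 10 characters per line
record = format_fasta('AP006852', 'ccactgtccaatggctcaac', width=10)
print(record)
```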

Exercise 16

16) Read a GenBank record and write its nucleotide sequence to a file in FASTA format.

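InputFile = open("ap006852.gbk")
OutputFile = open("ap006852.fasta", "w")
flag = 0

for line in InputFile:
    if line[0:9] == 'ACCESSION':
        # Use the accession number as the FASTA header
        AC = line.split()[1].strip()
        OutputFile.write('>' + AC + '\n')
    if line[0:6] == 'ORIGIN':
        # The sequence starts on the line after ORIGIN
        flag = 1
        continue
    if line[0:2] == '//':
        # '//' marks the end of the record
        break
    if flag == 1:
        fields = line.split()
        if fields != []:
            # fields[0] is the position number; the rest is sequence
            seq = ''.join(fields[1:])
            OutputFile.write(seq + '\n')

InputFile.close()
OutputFile.close()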
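To try the flag-based approach without the real ap006852.gbk file, the same logic can be sketched as a function that works on any open file-like object. The function name `genbank_to_fasta` is hypothetical, and the input is a tiny hand-made text that only imitates the ACCESSION, ORIGIN and '//' layout of a GenBank record; it is not real data. The sketch assumes Python 3 (`io.StringIO` with ordinary strings).

```python
import io


def genbank_to_fasta(handle):
    """Return a FASTA string built from the ACCESSION and ORIGIN sections."""
    out = []
    in_origin = False
    for line in handle:
        if line.startswith('ACCESSION'):
            out.append('>' + line.split()[1] + '\n')
        elif line.startswith('ORIGIN'):
            in_origin = True
        elif line.startswith('//'):
            # End of the record
            break
        elif in_origin:
            fields = line.split()
            if fields:
                # fields[0] is the position number; the rest is sequence
                out.append(''.join(fields[1:]) + '\n')
    return ''.join(out)


# Hand-made example imitating the GenBank layout (not a real record)
example = io.StringIO(
    "LOCUS       EXAMPLE\n"
    "ACCESSION   AP006852\n"
    "ORIGIN\n"
    "        1 ccactgtcca atggctcaac\n"
    "       21 acgccaatca\n"
    "//\n"
)
fasta = genbank_to_fasta(example)
print(fasta)
```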

Parsing data records

• Start by visually inspecting the file you want to parse

• Identify the information you want to extract

• Identify separators to select your information using if conditions

• Use lists if you have to compare data from different files
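The steps listed above can be sketched on a single UniProt header line. The separators chosen here ('|' and whitespace) follow the header shown in the slides. The two accession lists stand in for data read from two different files, and 'O00000' is a made-up placeholder, not a real entry.

```python
header = ">sp|P31946|1433B_HUMAN 14-3-3 protein beta/alpha OS=Homo sapiens"

# Separator 1: '|' splits the database tag, the accession and the rest
fields = header.split('|')
accession = fields[1]

# Separator 2: whitespace ends the entry name and starts the description
entry_name = fields[2].split()[0]

# Lists make it easy to compare accessions coming from two sources
from_file_a = ['P31946', 'P62258', 'Q04917']
from_file_b = ['Q04917', 'P31946', 'O00000']  # 'O00000' is a placeholder
shared = []
for ac in from_file_a:
    if ac in from_file_b:
        shared.append(ac)

print(accession, entry_name, shared)
```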

cancer_file = open('cancer-expressed.txt')

cancer_list = []
line = cancer_file.readline()
while line:
    AC = line.strip()
    cancer_list.append(AC)
    line = cancer_file.readline()

We can use while loops to read files (but usually we won't do it)
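For comparison, here is a small sketch that reads the same lines once with a while loop and once with a for loop. It uses in-memory text (Python 3 `io.StringIO`) in place of cancer-expressed.txt so that it runs on its own. The while loop ends because `readline()` returns an empty string at the end of the file, whereas a blank line is returned as '\n' and does not stop the loop.

```python
import io

text = "P31946\nP62258\nQ04917\n"

# while loop with readline(), as in the slide
handle = io.StringIO(text)
while_list = []
line = handle.readline()
while line:
    while_list.append(line.strip())
    line = handle.readline()

# for loop over the file object, the form used in the exercises
handle = io.StringIO(text)
for_list = []
for line in handle:
    for_list.append(line.strip())

print(while_list == for_list)
```

Both loops build the same list; the for loop simply needs fewer lines and no explicit `readline()` calls.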

You can repeat all the exercises using ncbi_gene.fasta as the input file

Summary

• Parsing sequence records in FASTA format

• Lists

• Making choices: if/elif/else

• range() and xrange()

