
P3 2017 python_regexes

Date post: 23-Jan-2018
Upload: prof-wim-van-criekinge
Page 2: P3 2017 python_regexes

FBW

17-10-2017

Wim Van Criekinge

Page 4: P3 2017 python_regexes

Recap

if condition:
    statements
[elif condition:
    statements] ...
else:
    statements

while condition:
    statements

for var in sequence:
    statements

break
continue

Strings

Page 5: P3 2017 python_regexes

Lists

• Flexible arrays, not Lisp-like linked lists
    a = [99, "bottles of beer", ["on", "the", "wall"]]
• Same operators as for strings
    a+b, a*3, a[0], a[-1], a[1:], len(a)
• Item and slice assignment
    a[0] = 98
    a[1:2] = ["bottles", "of", "beer"]
      -> [98, "bottles", "of", "beer", ["on", "the", "wall"]]
    del a[-1] # -> [98, "bottles", "of", "beer"]

Page 6: P3 2017 python_regexes

Dictionaries

• Hash tables, "associative arrays"
    d = {"duck": "eend", "water": "water"}
• Lookup:
    d["duck"] -> "eend"
    d["back"] # raises KeyError exception
• Delete, insert, overwrite:
    del d["water"] # {"duck": "eend"}
    d["back"] = "rug" # {"duck": "eend", "back": "rug"}
    d["duck"] = "duik" # {"duck": "duik", "back": "rug"}

Page 7: P3 2017 python_regexes

Reverse Complement Revisited

Page 8: P3 2017 python_regexes


REGULAR EXPRESSIONS

Page 9: P3 2017 python_regexes

Regular Expressions

http://en.wikipedia.org/wiki/Regular_expression

In computing, a regular expression, also referred to as "regex" or "regexp", provides a concise and flexible means for matching strings of text, such as particular characters, words, or patterns of characters. A regular expression is written in a formal language that can be interpreted by a regular expression processor.

Really clever "wild card" expressions for matching and parsing strings.

Page 10: P3 2017 python_regexes

Understanding Regular Expressions

• Very powerful and quite cryptic
• Fun once you understand them
• Regular expressions are a language unto themselves
• A language of "marker characters" - programming with characters
• It is kind of an "old school" language - compact

Page 11: P3 2017 python_regexes

Regular Expression Quick Guide

^         Matches the beginning of a line
$         Matches the end of the line
.         Matches any character (except a newline, by default)
\s        Matches a whitespace character
\S        Matches any non-whitespace character
*         Repeats the preceding character zero or more times
*?        Repeats the preceding character zero or more times (non-greedy)
+         Repeats the preceding character one or more times
+?        Repeats the preceding character one or more times (non-greedy)
[aeiou]   Matches a single character in the listed set
[^XYZ]    Matches a single character not in the listed set
[a-z0-9]  The set of characters can include a range
(         Indicates where string extraction is to start
)         Indicates where string extraction is to end
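A few entries from the quick guide, tried out in a short sketch. The sample text is invented for illustration.

```python
import re

line = "Gene BRCA1 on chr17, gene tp53 on chr17"   # invented sample text

print(re.findall(r"^\S+", line))          # ^ plus \S+: first run of non-whitespace
print(re.findall(r"[0-9]+$", line))       # $ anchors the digits to the end of the line
print(re.findall(r"[aeiou]", "regex"))    # one character from the listed set
print(re.findall(r"[^aeiou]", "regex"))   # one character not in the listed set
print(re.findall(r"chr([0-9]+)", line))   # ( ) extract only the digits after "chr"
```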

Page 12: P3 2017 python_regexes

The Regular Expression Module

• Before you can use regular expressions in your program, you must import the library using "import re"
• You can use re.search() to see if a string matches a regular expression, similar to using the find() method for strings
• You can use re.findall() to extract portions of a string that match your regular expression, similar to a combination of find() and slicing: var[5:10]
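A minimal sketch of the two comparisons above, putting find() and slicing next to re.search() and re.findall(). The sample line and the variable names are made up for illustration.

```python
import re

line = "From: alice@example.org"   # hypothetical sample line

# str.find() returns an index (-1 when absent); re.search() returns a
# match object (None when absent), so both can drive an if-test.
found_with_find = line.find("From:") >= 0
found_with_search = re.search(r"From:", line) is not None

# Slicing needs the positions worked out by hand...
at = line.find("@")
host_by_slice = line[at + 1:]

# ...while re.findall() returns every piece that matches the pattern.
hosts_by_regex = re.findall(r"@(\S+)", line)

print(found_with_find, found_with_search)
print(host_by_slice, hosts_by_regex)
```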

Page 13: P3 2017 python_regexes

Wild-Card Characters

• The dot character matches any character
• If you add the asterisk character, the preceding character is matched "any number of times" (zero or more)

^X.*:

^    Match the start of the line
.    Match any character
*    Many times
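The pattern ^X.*: can be tried on a few lines. The header lines below are invented for this sketch; only the first two start with X and contain a colon.

```python
import re

# Hypothetical mail-header lines, invented for this sketch.
lines = [
    "X-Sieve: CMU Sieve 2.3",
    "X-DSPAM-Result: Innocent",
    "Received: from mail.example.org",
    "Subject: X marks the spot:",
]

# ^X  -> the line must start with X
# .*  -> any character, zero or more times
# :   -> followed by a literal colon
x_headers = [line for line in lines if re.search(r"^X.*:", line)]
print(x_headers)
```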

Page 14: P3 2017 python_regexes

Matching and Extracting Data

• re.search() returns a match object if the string matches the regular expression and None if it does not, so the result can be used as a True/False test
• If we actually want the matching strings to be extracted, we use re.findall()

>>> import re
>>> x = 'My 2 favorite numbers are 19 and 42'
>>> y = re.findall('[0-9]+', x)
>>> print(y)
['2', '19', '42']
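A short sketch of what re.search() gives back on the same string: a match object for the first hit, or None when nothing matches. Either value works directly in an if-test.

```python
import re

x = 'My 2 favorite numbers are 19 and 42'

m = re.search(r'[0-9]+', x)    # first match only, as a match object
if m:
    print(m.group(), m.start(), m.end())

no_match = re.search(r'[0-9]+', 'no digits here')
print(no_match)                # None

print(re.findall(r'[0-9]+', x))   # all matches, as a list of strings
```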

Page 15: P3 2017 python_regexes

Warning: Greedy Matching

• The repeat characters (* and +) push outward in both directions (greedy) to match the largest possible string

>>> import re
>>> x = 'From: Using the : character'
>>> y = re.findall('^F.+:', x)
>>> print(y)
['From: Using the :']

^F.+:

^F    First character in the match is an F
.+    One or more characters
:     Last character in the match is a :

Page 16: P3 2017 python_regexes

Non-Greedy Matching

• Not all regular expression repeat codes are greedy! If you add a ? character - the + and * chill out a bit...

>>> import re
>>> x = 'From: Using the : character'
>>> y = re.findall('^F.+?:', x)
>>> print(y)
['From:']

^F.+?:

^F     First character in the match is an F
.+?    One or more characters but not greedily
:      Last character in the match is a :
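Greedy and non-greedy matching side by side on the same string. The third pattern, using a negated character class, is an alternative added here for comparison and is not from the slides.

```python
import re

x = 'From: Using the : character'

greedy = re.findall(r'^F.+:', x)         # .+ runs on to the last colon
lazy = re.findall(r'^F.+?:', x)          # .+? stops at the first colon
negated = re.findall(r'^F[^:]+:', x)     # "anything but a colon" also stops early

print(greedy)
print(lazy)
print(negated)
```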

Page 17: P3 2017 python_regexes

Fine Tuning String Extraction

• Parentheses are not part of the match - but they tell where to start and stop what string to extract

From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008

>>> import re
>>> x = 'From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008'
>>> y = re.findall(r'\S+@\S+', x)
>>> print(y)
['stephen.marquard@uct.ac.za']
>>> y = re.findall(r'^From (\S+@\S+)', x)
>>> print(y)
['stephen.marquard@uct.ac.za']

^From (\S+@\S+)

Page 18: P3 2017 python_regexes

The Double Split Version

• Sometimes we split a line one way and then grab one of the pieces of the line and split that piece again

From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008

line = 'From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008'
words = line.split()
email = words[1]             # 'stephen.marquard@uct.ac.za'
pieces = email.split('@')    # ['stephen.marquard', 'uct.ac.za']
print(pieces[1])             # uct.ac.za

Page 19: P3 2017 python_regexes

The Regex Version

From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008

import re
lin = 'From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008'
y = re.findall('@([^ ]*)', lin)
print(y)

['uct.ac.za']

'@([^ ]*)'

@       Look through the string until you find an at-sign
[^ ]    Match non-blank character
*       Match many of them
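The double split and the regex can be wrapped in two small functions and checked against each other. The function names and the second sample line are assumptions made for this sketch.

```python
import re

# Lines in the same shape as on the slide; the second address is invented.
lines = [
    'From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008',
    'From alice@example.org Fri Jan  4 18:10:48 2008',
]

def host_by_split(line):
    """Double split: take the second word, then the part after the @."""
    email = line.split()[1]
    return email.split('@')[1]

def host_by_regex(line):
    """Regex: capture the non-blank characters that follow the @."""
    return re.findall(r'@([^ ]*)', line)[0]

for line in lines:
    print(host_by_split(line), host_by_regex(line))
```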

Page 20: P3 2017 python_regexes

Escape Character

• If you want a special regular expression character to just behave normally (most of the time) you prefix it with '\'

>>> import re
>>> x = 'We just received $10.00 for cookies.'
>>> y = re.findall(r'\$[0-9.]+', x)
>>> print(y)
['$10.00']

\$[0-9.]+

\$        A real dollar sign
[0-9.]    A digit or period
+         One or more
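What happens without the backslash, as a short sketch: an unescaped $ is read as "end of line", so nothing is found. The last lines use re.escape(), a standard-library helper that is not mentioned on the slide, to add the backslashes automatically.

```python
import re

x = 'We just received $10.00 for cookies.'

unescaped = re.findall(r'$[0-9.]+', x)    # $ means "end of line" here
escaped = re.findall(r'\$[0-9.]+', x)     # \$ is a literal dollar sign

print(unescaped)
print(escaped)

# re.escape() puts a backslash in front of every special character,
# which helps when the text to search for comes from a variable.
price = '$10.00'
literal = re.findall(re.escape(price), x)
print(literal)
```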

Page 21: P3 2017 python_regexes

Real world problems

• Match IP Addresses, email addresses, URLs

• Match balanced sets of parentheses

• Substitute words

• Tokenize

• Validate

• Count

• Delete duplicates

• Natural Language processing
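Four of the tasks in the list (substitute, tokenize, count, delete duplicates) can be sketched with the re module on one invented sentence. These are small illustrations under simple assumptions, not general-purpose solutions.

```python
import re

text = "the cat sat on the the mat"   # invented sample sentence

# Substitute words
substituted = re.sub(r"\bcat\b", "dog", text)
print(substituted)

# Tokenize: split on runs of whitespace
tokens = re.split(r"\s+", text)
print(tokens)

# Count occurrences of a word
count_the = len(re.findall(r"\bthe\b", text))
print(count_the)

# Delete an immediately repeated word ("the the" becomes "the")
deduplicated = re.sub(r"\b(\w+)( \1\b)+", r"\1", text)
print(deduplicated)
```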

Page 24: P3 2017 python_regexes

RE in Python

• Unleash the power - built-in re module
• Functions
  – to compile patterns
    • compile
  – to perform matches
    • match, search, findall, finditer
  – to perform operations on match object
    • group, start, end, span
  – to substitute
    • sub, subn
• Metacharacters
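One call to each of the functions listed above, as a minimal sketch. The pattern and the sample text are invented for illustration.

```python
import re

pattern = re.compile(r"(\d+)-(\d+)")     # compile once, reuse many times
text = "pages 10-12 and 30-45"           # invented sample text

print(pattern.match(text))               # None: match() only tries position 0
m = pattern.search(text)                 # search() scans the whole string
print(m.group(0), m.group(1), m.group(2))
print(m.start(), m.end(), m.span())

print(pattern.findall(text))             # list of tuples, one per match
print([mm.span() for mm in pattern.finditer(text)])

print(pattern.sub("N-M", text))          # the string with replacements made
print(pattern.subn("N-M", text))         # (new string, number of replacements)
```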

Page 25: P3 2017 python_regexes

Examples 1

import re

pattern = re.compile(r"tes")
print(pattern.findall("test testing"))  # ['tes', 'tes']

Page 26: P3 2017 python_regexes

Examples 2

import re

dna = "ATCGCGAATTCAC"
if re.search(r"GAATTC", dna):
    print("restriction site found!")

Page 27: P3 2017 python_regexes

Examples 3

import re

scientific_name = "Homo sapiens"
m = re.search(r"(.+) (.+)", scientific_name)
if m:
    genus = m.group(1)
    species = m.group(2)
    print("genus is " + genus + ", species is " + species)

Page 28: P3 2017 python_regexes

Examples 4

import re

regex = r"([a-zA-Z]+) \d+"
# finditer() returns an iterator that produces Match instances instead of the strings returned by findall()
matches = re.finditer(regex, "June 24, August 9, Dec 12")
for match in matches:
    print(match)
    # whole match, first group (the month), start index, end index
    print("Match at index:", match.group(0), match.group(1), match.start(), match.end())

Page 29: P3 2017 python_regexes

Examples 5

import re

text = 'abbaaabbbbaaaaa'
pattern = 'ab'
for match in re.finditer(pattern, text):
    s = match.start()
    e = match.end()
    print('Found "%s" at %d:%d' % (text[s:e], s, e))

Page 30: P3 2017 python_regexes

Exercise 1

1. Which of the following 4 sequences (seq1/2/3/4)
   a) contains a "Galactokinase signature"?
   b) How many of them?

http://us.expasy.org/prosite/

Page 31: P3 2017 python_regexes

>SEQ1

MGNLFENCTHRYSFEYIYENCTNTTNQCGLIRNVASSIDVFHWLDVYISTTIFVISGILNFYCLFIALYT YYFLDNETRKHYVFVLSRFLSSILVIISLLVLESTLFSESLSPTFAYYAVAFSIYDFSMDTLFFSYIMIS LITYFGVVHYNFYRRHVSLRSLYIILISMWTFSLAIAIPLGLYEAASNSQGPIKCDLSYCGKVVEWITCS LQGCDSFYNANELLVQSIISSVETLVGSLVFLTDPLINIFFDKNISKMVKLQLTLGKWFIALYRFLFQMT NIFENCSTHYSFEKNLQKCVNASNPCQLLQKMNTAHSLMIWMGFYIPSAMCFLAVLVDTYCLLVTISILK SLKKQSRKQYIFGRANIIGEHNDYVVVRLSAAILIALCIIIIQSTYFIDIPFRDTFAFFAVLFIIYDFSILSLLGSFTGVAM MTYFGVMRPLVYRDKFTLKTIYIIAFAIVLFSVCVAIPFGLFQAADEIDGPIKCDSESCELIVKWLLFCI ACLILMGCTGTLLFVTVSLHWHSYKSKKMGNVSSSAFNHGKSRLTWTTTILVILCCVELIPTGLLAAFGK SESISDDCYDFYNANSLIFPAIVSSLETFLGSITFLLDPIINFSFDKRISKVFSSQVSMFSIFFCGKR

>SEQ2

MLDDRARMEA AKKEKVEQIL AEFQLQEEDL KKVMRRMQKE MDRGLRLETH EEASVKMLPT YVRSTPEGSE VGDFLSLDLG GTNFRVMLVK VGEGEEGQWS VKTKHQMYSI PEDAMTGTAE MLFDYISECI SDFLDKHQMK HKKLPLGFTF SFPVRHEDID KGILLNWTKG FKASGAEGNN VVGLLRDAIK RRGDFEMDVV AMVNDTVATM ISCYYEDHQC EVGMIVGTGC NACYMEEMQN VELVEGDEGR MCVNTEWGAF GDSGELDEFL LEYDRLVDES SANPGQQLYE KLIGGKYMGE LVRLVLLRLV DENLLFHGEA SEQLRTRGAF ETRFVSQVES DTGDRKQIYN ILSTLGLRPS TTDCDIVRRA CESVSTRAAH MCSAGLAGVI NRMRESRSED VMRITVGVDG SVYKLHPSFK ERFHASVRRL TPSCEITFIE SEEGSGRGAA LVSAVACKKA CMLGQ

>SEQ3

MESDSFEDFLKGEDFSNYSYSSDLPPFLLDAAPCEPESLEINKYFVVIIYVLVFLLSLLGNSLVMLVILY SRVGRSGRDNVIGDHVDYVTDVYLLNLALADLLFALTLPIWAASKVTGWIFGTFLCKVVSLLKEVNFYSGILLLACISVDRY LAIVHATRTLTQKRYLVKFICLSIWGLSLLLALPVLIFRKTIYPPYVSPVCYEDMGNNTANWRMLLRILP QSFGFIVPLLIMLFCYGFTLRTLFKAHMGQKHRAMRVIFAVVLIFLLCWLPYNLVLLADTLMRTWVIQET CERRNDIDRALEATEILGILGRVNLIGEHWDYHSCLNPLIYAFIGQKFRHGLLKILAIHGLISKDSLPKDSRPSFVGSSSGH TSTTL

>SEQ4

MEANFQQAVK KLVNDFEYPT ESLREAVKEF DELRQKGLQK NGEVLAMAPA FISTLPTGAE TGDFLALDFG GTNLRVCWIQ LLGDGKYEMK HSKSVLPREC VRNESVKPII DFMSDHVELF IKEHFPSKFG CPEEEYLPMG FTFSYPANQV SITESYLLRW TKGLNIPEAI NKDFAQFLTE GFKARNLPIR IEAVINDTVG TLVTRAYTSK ESDTFMGIIF GTGTNGAYVE QMNQIPKLAG KCTGDHMLIN MEWGATDFSC LHSTRYDLLL DHDTPNAGRQ IFEKRVGGMY LGELFRRALF HLIKVYNFNE GIFPPSITDA WSLETSVLSR MMVERSAENV RNVLSTFKFR FRSDEEALYL WDAAHAIGRR AARMSAVPIA SLYLSTGRAG KKSDVGVDGS LVEHYPHFVD MLREALRELI GDNEKLISIG IAKDGSGIGA ALCALQAVKE KKGLA MEANFQQAVK KLVNDFEYPT ESLREAVKEF DELRQKGLQK NGEVLAMAPA FISTLPTGAE TGDFLALDFG GTNLRVCWIQ LLGDGKYEMK HSKSVLPREC VRNESVKPII DFMSDHVELF IKEHFPSKFG CPEEEYLPMG FTFSYPANQV SITESYLLRW TKGLNIPEAI NKDFAQFLTE GFKARNLPIR IEAVINDTVG TLVTRAYTSK ESDTFMGIIF GTGTNGAYVE QMNQIPKLAG KCTGDHMLIN MEWGATDFSC LHSTRYDLLL DHDTPNAGRQ IFEKRVGGMY LGELFRRALF HLIKVYNFNE GIFPPSITDA WSLETSVLSR MMVERSAENV RNVLSTFKFR FRSDEEALYL WDAAHAIGRR AARMSAVPIA SLYLSTGRAG KKSDVGVDGS LVEHYPHFVD MLREALRELI GDNEKLISIG IAKDGSGIGA ALCALQAVKE KKGLA

Exercise 1
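One possible way to approach the exercise is to translate a PROSITE-style pattern into a Python regex and count the hits per sequence. The sketch below handles only the basic PROSITE notation (x, [..], {..}, (n), -, < and >). The pattern and the sequences in it are toys invented for illustration, not the galactokinase signature; the real pattern has to be looked up on the PROSITE site. The function name prosite_to_regex is also an assumption.

```python
import re

def prosite_to_regex(pattern):
    """Translate a simple PROSITE-style pattern into Python regex syntax."""
    regex = pattern.rstrip(".")                          # a trailing period ends a pattern
    regex = regex.replace("-", "")                       # '-' only separates elements
    regex = regex.replace("x", ".")                      # x = any residue
    regex = regex.replace("{", "[^").replace("}", "]")   # {P} = any residue except P
    regex = regex.replace("(", "{").replace(")", "}")    # x(2) = repeat twice
    regex = regex.replace("<", "^").replace(">", "$")    # start / end of the sequence
    return regex

# Toy pattern and toy sequences, invented for this sketch.
toy_pattern = "C-x(2)-[LIV]-{P}-G"
sequences = {
    "toy1": "AACQQLAGTT CQQVTGAA",
    "toy2": "MKKLLPPGG",
}

regex = re.compile(prosite_to_regex(toy_pattern))
for name, seq in sequences.items():
    seq = seq.replace(" ", "")     # drop the spaces used for layout
    hits = regex.findall(seq)
    print(name, len(hits), hits)
```

Removing the spaces first matters for sequences such as SEQ2 and SEQ4, which are laid out in blocks of ten residues.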

Page 32: P3 2017 python_regexes

http://www.pythonchallenge.com

