Lab 1: Introduction to Python Programming
Adapted from Nicole Rockweiler
01/09/2019
Overview
• Logistics
• Getting Started
• Intro to Unix
• Intro to Python
• Assignment 1
Getting the most out of this course
1. Start the homework EARLY
2. Collaborate
3. Use your resources – TAs, professors, labmates, Piazza discussions, the internet
Logistics
• Office Hours: Wednesdays, 11:30 am - 12:30 pm (right after class)
• Contact TAs:
  • For assignment-related questions: Piazza
  • For other questions: [email protected]
• Register for 4 credits
• Course website: http://genetics.wustl.edu/bio5488/
• Bring your laptop to every lab
• NO extensions on homeworks
• Late penalty is 50% per day
Assignments
• Assignments are posted on the course website Wednesdays
• We will send out emails when assignments are posted
• Assignments are due the following Friday at 10am (before lab)
• Assignment format
  • Given a bioinformatics problem
  • Write/complete a Python script
  • Analyze data with your script
  • Answer biological questions about your results
• Turn in format
  • More on this later ☺
Assignment policies
• See the Course Information → Assignment policies document on the course website
• There are 13 assignments
  • You must turn in all assignments
  • All assignments are weighted equally
• Collaboration
  • Group work is encouraged, but plagiarism is unacceptable
  • Try to “Google it” first
  • Cite your sources
• Read the assignment before coming to lab
Grading
• Each assignment is out of 10 points
• Graded on
  • Does the code work?
    • It doesn’t have to be the “fastest” or “most efficient” to get full credit
    • If it doesn’t work, describe where you had problems
  • Is the code well commented and readable? (more on commenting later ☺)
  • Are the answers correct?
• Grades will be returned in a file called grades.txt on the class server
  • Only you and the TAs will be able to read this file
Getting started
Remote computers
• We will be doing all of our work on a remote computer, a server
• This is a Unix-based computer that we can securely connect to through a protocol called secure shell (SSH).
• The shell is a program that takes commands from the keyboard and gives them to the operating system to execute
How do I access the server?
• We will access the server through a command-line interface (CLI)
• A terminal emulator is a program that allows you to interact with the shell through a CLI
• There are different terminal programs that vary across operating systems
• We’ll be using PuTTY (Windows) or Terminal (Mac, Ubuntu)
[Figures: a PuTTY window; a Terminal window]
How to log onto the remote computer (PuTTY users)
1. Launch PuTTY
2. In the host name field, enter <username>@genomic.wustl.edu
3. In the port field, enter 22
4. Enter a session nickname, e.g., bio5488 (whatever name you want!)
5. Click Save
6. Click Open
How to log onto the remote computer (Mac/Ubuntu users)
1. Open Terminal (found in /Applications/Utilities)
2. SSH to the remote computer. Type:
ssh <username>@genomic.wustl.edu
where <username> is replaced with your username
3. A security message may be printed. Type yes and hit enter.
4. Enter your password - it will not show that you are typing! Hit enter.
A couple of notes
• When you log onto the class server you will be located in YOUR home directory.
• Every command that you run after logging onto a remote computer will be run on that computer.
Exercise: changing your password (passwd)
• To change your password, type the command
$ passwd
• This will launch the interactive password changer
  • It will ask you for your current password, then your new password twice
  • When typing your password, it will not show that you are typing!
• Example
$ passwd
Changing password for xinxin.wang.
(current) UNIX password:
Enter new UNIX password:
Retype new UNIX password:
passwd: password updated successfully
Sublime Text
• Sublime Text is a text editor for writing and editing scripts
• We’ll use Sublime to edit both local and remote files
• Installation: https://www.sublimetext.com/3
• Documentation: http://www.sublimetext.com/support
Useful commands:
• View > Syntax > Python
• Set Tab Size = 4 spaces
• Comment multiple lines
• Find and replace multiple selections
• Split frames
Cyberduck
• Cyberduck is a secure file transfer client and will allow you to transfer files from your local computer to a remote computer
Exercise: setting up Cyberduck
• Create a bookmark
  • Launch the Cyberduck application
  • Click Bookmark → New Bookmark
  • Select SFTP (SSH File Transfer Protocol) from the drop down menu
  • Enter a nickname for the bookmark, e.g., bio5488
  • Enter genomic.wustl.edu as the server name
  • Click the X
• Set the default text editor
  • Click Edit → Preferences → Editor
  • Select Sublime Text from the drop down menu. (You may need to browse your computer for the editor)
  • Check Always use this application
  • Restart Cyberduck
Exercise: transferring files with Cyberduck
• To download a file to your local computer
• Drag and drop a file from Cyberduck to your Finder/File Explorer window
• Or, double-click
• To upload a file to the remote computer
  • Drag and drop a file from Finder/File Explorer to Cyberduck
Exercise: editing remote files with Sublime Text and Cyberduck
• New files
  • Click File → New file
  • Enter a filename
  • Click Edit
  • Sublime Text should now launch
  • Add some text to the file
  • Click File → Save or ctrl+S
• Existing files
  • Select the file by clicking the filename once
  • Click the Edit button in the navigation bar
  • Edit the file
  • Click File → Save or ctrl+S
Cyberduck
Points to watch when using Cyberduck:
• When clicking on [screenshot omitted]
  o Make sure you see this [screenshot omitted]
• When saving the file, make sure you see the following, which confirms that the upload is complete, before you close the editor [screenshot omitted]
• Before closing the editor, check the time stamp of the file
FileZilla
• FileZilla is an alternative to Cyberduck
• It can be downloaded for free here: https://filezilla-project.org/
• Follow the instructions
• Finally, we should see this [screenshot omitted]
Basic Unix
A few preliminary words…
A lot of Unix skills revolve around the file system
• This concept is similar to using Apple Finder or the Windows File Explorer GUIs, only this time, we can’t use a mouse or see any fancy graphics
The file system
• The file system is the part of the operating system (OS) responsible for managing files and folders
• In Unix, folders are called directories.
• Unix keeps files arranged in a hierarchical structure
  • The topmost directory is called the root directory
  • Each directory can contain
    • Files
    • Subdirectories
• You will always be “in” a directory
  • When you open a terminal you will be in your own home directory.
  • Only you can modify things in your home directory
Determining where you are (pwd)
• If you get lost in the file system, you can determine where you are by typing:
$ pwd
/home/user
• pwd stands for print working directory
• pwd prints the full path of the current working directory
Listing directory contents (ls)
• To list the contents of a directory:
$ ls
assignment1 foo
• ls stands for list directory contents
Changing directories (cd)
• To change to a different directory
$ cd <directory_name>
where
<directory_name> = the path you want to move to
• A path is a location in the file system
• cd stands for change directory
• To get back to your home directory
$ cd ~
• ~ is shorthand for your home directory
Changing directories (cont.)
• To move one directory above the current directory
$ cd ..
• To move two directories above the current directory
$ cd ../../
• You can string as many ../ as you need to
Making directories (mkdir)
• To make a directory
$ mkdir <new_directory_name>
where
<new_directory_name> = name of the directory to create
• mkdir stands for make directory
• Do not use spaces or “/” in directory or file names
Exercise: create some directories
Try to create this directory structure:
Hints
• Use pwd to determine where you are in the directory structure
• Use cd to navigate through the directory structure.
• Use mkdir to create new directories
Copying things (cp)
• To create a copy of a file
$ cp -i <filename> <copy_of_filename>
where
<filename> = file you want to copy
<copy_of_filename> = name of copied file
The -i flag is a safety feature to make sure you do not overwrite a file that already exists
• To create a copy of a directory
$ cp -r <directory> <copy_of_directory>
where
<directory> = directory you want to copy
<copy_of_directory> = name of copied directory
The -r flag is required to copy all of the directory’s files and subdirectories
Copying things (cont.) (cp)
• cp stands for copy files/directories
• To copy a file into the current directory and keep the name the same
$ cp -i <filename> .
where
<filename> = file you want to copy
(the . stands for the current directory)
• The shortcut is the same for directories, just remember to include the -r flag
Exercise: copying things
Copy /home/assignments/assignment1/README.txt to your work directory. Keep the name the same.
Renaming/moving things (mv)
• To rename/move a file/directory
$ mv -i <original_filename> <new_filename>
where
<original_filename> = name of file/dir you want to rename
<new_filename> = name you want to rename it to
• mv stands for move files/directories
Printing contents of files (cat)
• To print a file
$ cat <filename>
where
<filename> = name of file you want to print
• cat stands for concatenate file and print to the screen
• Other useful commands for printing parts of files:
  • more
  • less
  • head
  • tail
Deleting things (rm)
• To delete a file
$ rm <file_to_delete>
where
<file_to_delete> = name of the file you want to delete
• To delete a directory
$ rm -r -i <directory_to_delete>
where
<directory_to_delete> = name of the directory you want to delete
• rm stands for remove files/directories
IMPORTANT: there is no recycle bin/trash folder on Unix!!
Once you delete something, it is gone forever.
Be very careful when you use rm!!
TIP: Check that you’re going to delete the correct files by first testing with 'ls' and then committing to 'rm'
Exercise: deleting things
Delete the test directory that you created in a previous exercise.
Saving output to files
• Save the output to a file
$ <cmd> > <output_file>
where
<cmd> = command
<output_file> = name of output file
• WARNING: this will overwrite the output file if it already exists!
• Append the output to the end of a file
$ <cmd> >> <output_file>
Note that there are 2 “>”
Learning more about a command (man)
• To view a command’s documentation
$ man <cmd>
where
<cmd> = command
• man stands for manual page
• Use the ↑ and ↓ arrow keys to scroll through the manual page
• Type “q” to exit the manual page
Getting yourself out of trouble
• Abort a command: press Ctrl+C
• Temporarily stop a command: press Ctrl+Z
  To bring the job back just run fg
Unix commands cheatsheet--your new bestie
https://ubuntudanmark.dk/filer/fwunixref.pdf
Python in minutes*
*not really
• Programming language
• Freely usable, even for commercial use
• Cross platform
• Created in 1991 by Guido van Rossum
NOTE
• There are 2 widely used versions of Python: Python 2.7 and Python 3.x
• We’ll use Python 3
• Many help forums still refer to Python 2, so make sure you’re aware which version is being referenced
How do I program in Python?
• Two main ways:
  • Normal mode
    • Write all your code in a file and save it with a .py extension
    • Execute it using python3 <file name> on the terminal.
  • Interactive mode
    • Start interactive mode by typing python3 on the terminal and pressing enter/return.
    • Start writing your Python code
Python Variables
• The most basic components of any programming language are "things," also called variables
• Variables can be integers, decimal numbers (floats), words and sentences (strings), lists, etc.
  • Int: -5, 0, 1000000
  • Float: -2.0, 3.14159, 453.234
  • Boolean: True, False
  • String: "Hello world!", "K3WL", "AGCTGCTAGTAGCT"
  • List: [1, 2, 3, 4], ["Hello", "world!"], [1, "Hello", True, 0.2], ["A", "T", "C", "G"]
How do I create a variable and assign it a value?
• x = 2
  • This creates a variable named x with value 2
  • 2 = x is not a valid command; the variable name needs to be on the left.
• print(x)
  • This prints the value stored in x (2 in this case) on the terminal.
a = 3
b = 4
c = a + b
print(c)
Prints 7 on the terminal
a = "Hello"
b = " "
c = "World"
print(a+b+c)
Prints Hello World on the terminal
Variables naming rules
• Must start with a letter or an underscore
• Can contain letters, numbers, and underscores ← no spaces!
• Python is case-sensitive: x ≠ X
• Variable names should be descriptive and of reasonable length (this is style advice rather than a rule)
• Use ALL CAPS for constants, e.g., PI
• Do not use names already reserved for other purposes (min, max, int)
Want to learn more tips? Check out http://www.makinggoodsoftware.com/2009/05/04/71-tips-for-naming-variables/
Cool, what else can I do in python?
• Conditionals• If a condition is TRUE do something, if it is FALSE do something else
if(boolean-expression-1):
    code-block-1
else:
    code-block-2
CODE BLOCKS ARE INDENTED, USE 4 SPACES
x = 2
if(x == 2):
    print("x is 2")
else:
    print("x is not 2")
Prints x is 2 on the terminal
x = 3
if(x == 2):
    print("x is 2")
else:
    print("x is not 2")
Prints x is not 2 on the terminal
• Conditionals with multiple conditions
grade = 89.2
if grade >= 80:
    print("A")
elif grade >= 65:
    print("B")
elif grade >= 55:
    print("C")
else:
    print("E")
Prints A on the terminal
Operator   Description                Example      Result
<          Less than                  >>> 2 < 3    True
<=         Less than or equal to      >>> 2 <= 3   True
>          Greater than               >>> 2 > 3    False
>=         Greater than or equal to   >>> 2 >= 3   False
==         Equal to                   >>> 2 == 3   False
!=         Not equal to               >>> 2 != 3   True
Loops
For loop
[Flowchart: start with a list of items → have we reached the last item? → if no, do stuff and check again; if yes, exit the loop]
• Useful for repeating code!
for <counter> in <collection_of_stuff>:
    code-block-1
genes = ["GATA4", "GFP", "FOXA1", "UNC-21"]
for i in genes:
    print(i)
print("printed all genes")
Output:
GATA4
GFP
FOXA1
UNC-21
printed all genes
More examples
my_string = "Hello"
for i in my_string:
    print(i)
Output:
H
e
l
l
o
my_number = 2500
for i in str(my_number):  # an int is not iterable, so convert it to a string first
    print(i)
Output:
2
5
0
0
FURTHER READING: while loops in python http://learnpythonthehardway.org/book/ex33.html
Functions
input → [does some stuff] → output
def <function name>(<input variables>):
    do some stuff
    return <output>
def celsius_to_fahrenheit(celsius):
    fahrenheit = celsius * 1.8 + 32.0
    return fahrenheit
But how do I use a function?
temp1 = celsius_to_fahrenheit(37) #sets temp1 to 98.6
temp2 = celsius_to_fahrenheit(100) #sets temp2 to 212
temp3 = celsius_to_fahrenheit(0) #sets temp3 to 32
def addition(num1, num2):
    num3 = num1 + num2
    return num3
sum1 = addition(4, 5) #sets sum1 to 9
A = 2
B = 3
sum2 = addition(A, B) #sets sum2 to 5
sum3 = addition(5) #throws an error because num2 is missing
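The two functions above can be tried together in one short script. This is only a sketch: the list temps_c is an illustrative name that is not in the slides, and the try/except block is a Python feature not covered in this lab, used here just so the script keeps running after the error.

```python
def celsius_to_fahrenheit(celsius):
    """Convert a temperature from Celsius to Fahrenheit."""
    fahrenheit = celsius * 1.8 + 32.0
    return fahrenheit


def addition(num1, num2):
    """Return the sum of two numbers."""
    return num1 + num2


# Call the same function once per item in a list (temps_c is a made-up name).
temps_c = [0, 37, 100]
for temp_c in temps_c:
    print(temp_c, "C is", round(celsius_to_fahrenheit(temp_c), 1), "F")

# Calling a function with too few arguments raises a TypeError.
try:
    addition(5)
except TypeError as error:
    print("TypeError:", error)
```

Defining a function once and calling it in a loop avoids copying the same formula for every value.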
Python functions: where can I learn more?
• Python.org tutorial
  • User-defined functions: https://docs.python.org/3/tutorial/controlflow.html#defining-functions
• Python.org documentation
  • Built-in functions: https://docs.python.org/3/library/functions.html
Commenting your code
• Why is this concept useful?
  • Makes it easier for you, your future self, TAs ☺, and anyone unfamiliar with your code to understand what your script is doing
• Comments are human readable text. They are ignored by Python.
• Add comments for
  The how
  • What the script does
  • How to run the script
  • What a function does
  • What a block of code does
  The why
  • Biological relevance
  • Rationale for design and methods
  • Alternatives
TREAT YOUR CODE LIKE A LAB NOTEBOOK
Commenting rule of thumb
Always code [and comment] as if the guy who ends up maintaining your code will be a violent psychopath who knows where you live. Code for readability.
-- John Woods
• Points will be deducted if you do not comment your code
• If you use code from a resource, e.g., a website, cite it
Comment syntax
Syntax
• Block comment:
  # <your_comment>
  # <your_comment>
• In-line comment:
  <code> # <your_comment>
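As a small illustration of both comment styles, here is a sketch that reuses the Celsius conversion from earlier in the lab. The variable names body_temp_c and body_temp_f are made up for this example.

```python
# Block comment: convert a body temperature from Celsius to Fahrenheit.
# The factor 1.8 and the offset 32.0 come from the standard conversion formula.
body_temp_c = 37.0
body_temp_f = body_temp_c * 1.8 + 32.0  # in-line comment: about 98.6 F

print(round(body_temp_f, 1))
```

The block comment explains what the next few lines do and why, while the in-line comment annotates a single line.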
Python modules
• A module is a file containing Python definitions and statements for a particular purpose, e.g.,
  • Generating random numbers
  • Plotting
• Modules must be imported at the beginning of the script
  • This loads the variables and functions from the module into your script, e.g.,
import sys
import random
• To access a module’s features, type <module>.<feature>, e.g., sys.exit()
Random module
• Contains functions for generating random numbers for various distributions
• TIP: will be useful for assignment 1
Function         Description
random.choice    Return a random element from a list
random.randint   Return a random integer in a given range
random.random    Return a random float in the range [0, 1)
random.seed      Initialize the (pseudo) random number generator
https://docs.python.org/3.4/library/random.html
Example
import random
numberList = [111, 222, 333, 444, 555]
# assigns a value from numberList to x at random
x = random.choice(numberList)
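The four functions in the table can be tried in a few lines. This is a minimal sketch: the seed value 5488 and the variable names are arbitrary choices for illustration, not something the assignment requires.

```python
import random

# Seeding makes the "random" results repeatable from run to run.
random.seed(5488)  # arbitrary seed chosen for this sketch

nucleotides = ["A", "C", "G", "T"]
base = random.choice(nucleotides)  # one element picked from the list
roll = random.randint(1, 6)        # integer from 1 to 6, both ends included
fraction = random.random()         # float in the range [0, 1)

print(base, roll, fraction)
```

Running the script twice with the same seed prints the same three values, which helps when debugging code that uses random numbers.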
Strings
• A string is a sequence of characters, like "Python is cool"
• Each character has an index
P  y  t  h  o  n     i  s     c  o  o  l
0  1  2  3  4  5  6  7  8  9  10 11 12 13
• Accessing a character: string[index]
x = "Python is cool"
print(x[10])
Prints c
• Accessing a substring via slicing: string[start:finish]
print(x[2:5])
Prints tho and not thon (the character at index finish is not included)
More string stuff
>>> x = "Python is cool"
>>> "cool" in x             # membership
>>> len(x)                  # length of string x
>>> x + "?"                 # concatenation
>>> x.upper()               # to upper case
>>> x.replace("c", "k")     # replace characters in a string
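To see what these string operations return, the sketch below prints each result, with the expected output in the comments. Because strings are immutable, every operation returns a new string and x itself never changes.

```python
x = "Python is cool"

print(x[10])                # c
print(x[2:5])               # tho
print("cool" in x)          # True
print(len(x))               # 14
print(x + "?")              # Python is cool?
print(x.upper())            # PYTHON IS COOL
print(x.replace("c", "k"))  # Python is kool
print(x)                    # Python is cool (unchanged)
```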
Lists
• If a string is a sequence of characters, then a list is a sequence of items!
• A list is usually enclosed by square brackets [ ]
• Unlike strings, which cannot be changed once created (they are immutable), lists can be modified (they are mutable).
x = [1, 2, 3, 4]
x[0] = 4
x.append(5)
print(x) # [4, 2, 3, 4, 5]
More lists stuff
>>> x = [ "Python", "is", "cool" ]
>>> x.sort()                # sort elements in x
>>> x[0:2]                  # slicing
>>> len(x)                  # length of list x
>>> x + ["!"]               # concatenation
>>> x[2] = "hot"            # replace element at index 2 with "hot"
>>> x.remove("Python")      # remove the first occurrence of "Python"
>>> x.pop(0)                # remove the element at index 0
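The list operations above can be followed step by step in a script. In this sketch the comments show the state of x after each line; the variable name first is made up for the example.

```python
x = ["Python", "is", "cool"]

x.sort()            # sorts in place; uppercase letters sort before lowercase
print(x)            # ['Python', 'cool', 'is']
print(x[0:2])       # ['Python', 'cool']
print(len(x))       # 3
print(x + ["!"])    # ['Python', 'cool', 'is', '!'] (x itself is unchanged)

x[2] = "hot"        # x is now ['Python', 'cool', 'hot']
x.remove("Python")  # x is now ['cool', 'hot']
first = x.pop(0)    # removes and returns 'cool'; x is now ['hot']
print(first, x)     # cool ['hot']
```

Note that sort, remove, and pop change the list itself, while slicing and + build a new list.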
Lists: where can I learn more?
• Python.org tutorial: https://docs.python.org/3.4/tutorial/datastructures.html#more-on-lists
• Python.org documentation: https://docs.python.org/3.4/library/stdtypes.html#list
Command-line arguments
• Why are they useful?
  • Passing command-line arguments to a Python script allows a script to be customized
• Example
  • make_nuc.py can create a random sequence of any length
  • If the length wasn’t a command-line argument, the length would be hard-coded
    • To make a 10bp sequence, we would have to 1) edit the script, 2) save the script, and 3) run the script.
    • To make a 100bp sequence, we’d have to 1) edit the script, 2) save the script, and 3) run the script.
  • This is tedious & error-prone
  • Remember: be a lazy programmer!
Command-line arguments
• Python stores the command-line arguments as a list called sys.argv
  • sys.argv[0] # script name
  • sys.argv[1] # 1st command-line argument
  • …
• IMPORTANT: arguments are passed as strings!
• If the argument should not be a string, convert it, e.g., with int() or float()
• sys.argv is a list of variables
• The values of the variables are not “plugged in” until the script is run
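The string-versus-number point is easy to miss, so here is a minimal sketch. To keep it runnable anywhere, it uses a hard-coded list called example_argv as a stand-in for sys.argv; the function name parse_length is also made up for this example.

```python
import sys


def parse_length(argv):
    """Return the first command-line argument converted to an integer."""
    # argv[0] is the script name; argv[1] is the first argument, as a string.
    return int(argv[1])


# Stand-in for what sys.argv would hold after: python3 make_nuc.py 10
example_argv = ["make_nuc.py", "10"]
length = parse_length(example_argv)

print(example_argv[1] * 2)  # the string "10" repeated twice: 1010
print(length * 2)           # the integer 10 doubled: 20

# In a real script you would write: length = parse_length(sys.argv)
```

Forgetting the int() conversion does not always raise an error, which is why the first print silently gives 1010 instead of 20.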
Reading (and writing) to files in Python
Why is this concept useful?
• Often your data is much larger than just a few numbers:• Billions of base pairs• Millions of sequencing reads• Thousands of genes
• It may not be feasible to write all of this data in your Python script
• Memory• Maintenance
How do we solve this problem?
Reading (and writing) to files in Python
The solution:
• Store the data in a separate file
• Then, in your Python script
  • Read in the data (line by line)
  • Analyze the data
  • Write the results to a new output file or print them to the terminal
• When the results are written to a file, other scripts can read in the results file to do more analysis
[Diagram: Input file → Python script 1 → Output file 1 → Python script 2 → Output file 2]
Reading a file syntax
Syntax
with open(<file>, 'r') as <file_handle>:
    for <current_line> in <file_handle>:
        <current_line> = <current_line>.rstrip()
        # Do something
Example output
>chr1
ACGTTGATACGTA
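A self-contained version of the read loop might look like the sketch below. It first writes a tiny two-line file so there is something to read; the file name example.fa, the temporary directory, and the variable names are all assumptions made for this illustration.

```python
import os
import tempfile

lines = []

# Work in a temporary directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    fasta_path = os.path.join(tmp_dir, "example.fa")

    # Write a small input file ("w" opens the file for writing).
    with open(fasta_path, "w") as out_file:
        out_file.write(">chr1\nACGTTGATACGTA\n")

    # Read the file back one line at a time ("r" opens it for reading).
    with open(fasta_path, "r") as in_file:
        for line in in_file:
            line = line.rstrip()  # strip the trailing newline
            lines.append(line)
            print(line)
```

The with statement closes each file automatically when its indented block ends, even if an error occurs inside it.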
The anatomy of a (simple) script
• The first line should always be
#!/usr/bin/env python3
  • This special line is called a shebang
  • The shebang tells the computer how to run the script
  • It is NOT a comment
• Next comes a special type of comment called a doc string, or documentation string
  • Doc strings are used to explain 1) what the script does and 2) how to run it
  • ALWAYS include a doc string
  • Doc strings are enclosed in triple quotes, """
• A line that starts with # is a comment
  • Comments help the reader better understand the code
  • Always comment your code!
• Next is an import statement
  • An import statement loads variables and functions from an external Python module
  • The sys module contains system-specific parameters and functions
• The script then grabs the command-line argument using sys.argv and stores it in a variable called name
• Finally, the script prints a statement to the terminal using the print function
  • The first arguments are the items to print
  • The argument sep="" says do not print a delimiter (i.e., a separator) between the items
  • The default separator is a space.
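The script that these slides annotate is not reproduced in this transcript, so the following is only a guess at what such a script could look like. The file name hello.py, the greeting text, and the fallback value "world" (used so the sketch also runs without an argument) are all assumptions.

```python
#!/usr/bin/env python3
"""
Print a greeting for the name given on the command line.

Usage: python3 hello.py <name>
"""

# Load the sys module, which provides sys.argv
import sys

# Grab the first command-line argument; fall back to "world" if none is given
# (the fallback is an addition for this sketch).
name = sys.argv[1] if len(sys.argv) > 1 else "world"

# sep="" prints the items with nothing between them
print("Hello, ", name, "!", sep="")
```

With the default separator the output would contain extra spaces, for example "Hello,  world !", which is why sep="" is used here.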
Python resources
• Documentation
  • https://docs.python.org/3/
• Tutorials
  • https://www.learnpython.org/
  • https://www.w3schools.com/python/
  • https://www.codecademy.com/learn/learn-python-3
Assignment 1
How to complete & “turn in” assignments
1. Create a separate directory for each assignment
2. Create “submission” and “work” subdirectories
  • work = scratch work
  • submission = final version
  • The TAs will only grade content that is in your submission directory
3. Copy the starter scripts and README to your work directory
4. Copy the final version of the files to your submission directory
  • Do not edit your submission files after 10 am on the due date (always Friday)
README files
• The README.txt file contains information on how to run your code and answers to any of the questions in the assignment
• A template will be provided for each assignment
• Copy the template to your work folder
• Replace the text in {} with your answers
• Leave all other lines alone ☺
README.txt template
Question 1:
{nuc_count.py nucleotide count output}
-
Comments:
{Things that went wrong or you can not figure out}
-
Completed README.txt
Question 1:
A: 10
C: 15
G: 20
T: 12
-
Comments:
The wording for part 2 was confusing.
-
Usage statements in README and scripts
• Purpose
• Tells a user (you, TA, anyone unfamiliar with the script) how to run the script
• Documents how you created your results
• In your README
• Write out exactly how you ran the script:
python3 foo.py 10 bar
• In your scripts
• Write out how to run the script in general with placeholders for command-line arguments
python3 foo.py <#_of_genes> <gene_of_interest>
• TIP: copy and paste your commands into your README
• TIP: use the command history to view previous commands
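Inside a script, the usage statement can also be printed when the script is run with the wrong number of arguments. The sketch below assumes the hypothetical foo.py from this slide takes a number of genes and a gene name; parse_args and USAGE are made-up names, and hard-coded lists stand in for sys.argv so the example runs anywhere.

```python
USAGE = "Usage: python3 foo.py <#_of_genes> <gene_of_interest>"


def parse_args(argv):
    """Return (number_of_genes, gene_of_interest), or None if argv is malformed."""
    # Expect the script name plus exactly two arguments.
    if len(argv) != 3:
        return None
    return int(argv[1]), argv[2]


# Try one correct and one incorrect argument list.
for argv in (["foo.py", "10", "bar"], ["foo.py"]):
    parsed = parse_args(argv)
    if parsed is None:
        print(USAGE)
    else:
        print(parsed)
```

In a real script you would pass sys.argv to parse_args and could stop with sys.exit(USAGE) when the arguments are wrong.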
Assignment 1 Set Up
• Create assignment1 directory
• Create work, submission subdirectories
• Copy assignment material (README, starter scripts) to work directory
• Download human chromosome 20 with wget or FTP
Fasta file format
• Standard text-based file format used to define sequences
• Common file extensions: .fa, .fasta, .fna, …
• Each sequence is defined by multiple lines
  • Line 1: Description of sequence. Starts with “>”
  • Lines 2-N: Sequence
• A fasta file can contain one or more sequences
Example fasta file
>chr22
ACGGTACGTACCGTAGATNAGTAN
>chr23
ACCGATGTGTGTAGGTACGTNACGTAGTGATGTAT
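To make the format concrete, here is a sketch of how a script might group the lines of a fasta file into named sequences. The function name read_fasta is made up, the chr23 sequence is split across two lines at an arbitrary point to show the multi-line case, and the code assumes the first non-empty line is a description line. It is not the assignment's starter script.

```python
# Example fasta text; the line break inside chr23 is arbitrary.
fasta_text = """>chr22
ACGGTACGTACCGTAGATNAGTAN
>chr23
ACCGATGTGTGTAGGTACGTNACG
TAGTGATGTAT
"""


def read_fasta(lines):
    """Return a dictionary mapping each sequence name to its full sequence."""
    sequences = {}
    name = None
    for line in lines:
        line = line.rstrip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            name = line[1:]        # description line: drop the leading ">"
            sequences[name] = ""
        else:
            sequences[name] += line  # sequence line: append to current record
    return sequences


records = read_fasta(fasta_text.splitlines())
for name, seq in records.items():
    print(name, len(seq))
```

The same function works on an open file handle, because looping over a file also yields one line at a time.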
Assignment 1 To-Do’s
• Given a starter script (nuc_count.py) that counts the total number of A, C, G, T nucleotides
  • Modify the script to calculate the nucleotide frequencies
  • Modify the script to calculate the dinucleotide frequencies
• Complete a starter script (make_seq.py) to generate a random sequence given nucleotide frequencies
• Use make_seq.py to generate a random sequence with the same nucleotide frequencies as chr20
• Compare the chr20 di/nucleotide frequencies (observed) with the random model (expected)
• Answer conceptual questions in README
Requirements
• Due next Friday (1/24) at 10am
• Your submission folder should contain:
  □ A Python script to count nucleotides (nuc_count.py)
  □ A Python script to make a random sequence file (make_seq.py)
  □ An output file with a random sequence (random_seq_1M.txt)
  □ A README.txt file with instructions on how to run your programs and answers to the questions.
• Remember to comment your script!