STAT 157 Homework #1
Problem #4
Team 1: Jiejing Zhao, Mei Zhou, Yang Yang Liu, Robin Hestir, Henry Ye Li, Yi (Eve) Song
Problem #4 Part 2 Team 1: Jiejing Zhao, Mei Zhou, Yang Yang Liu, Robin Hestir, Henry Ye Li, Yi (Eve) Song
Agenda
• Problem
• Method
– Random sampling
– Alignment
– Ordering reads
– Construction of contigs
• Summary of findings
• Comparison
Problem
• Download the set of 40 sequence reads (8 nucleotides long each). Perform the sequence assembly using the graph greedy algorithm.
• Suppose that you were told that reads 2 and 23* are from the same clone and around 70 bp apart. How would this affect your sequence assembly?
• Now we sample 25 reads without replacement from the set of 40 reads. Perform the assembly again as above on the 25 reads. How does this affect the sizes of the assembled contigs and the number of contigs?
Method
• Alignment
– Find weights and their directions of overlap
• Ordering reads
– Find reads to be used in contig construction (greedy algorithm)
• Construction of contigs
– Alignment of base pairs and linking of reads
Random sampling
Write code to select a random sample of 25 reads without replacement
Our PROBLEM !!!
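A minimal sketch of the sampling step on read indices rather than the actual reads. The seed value and the names read_ids and sampled_ids are illustrative assumptions; the seed is there only so the draw can be repeated.

```r
set.seed(157)                       # arbitrary seed, only for repeatability
read_ids <- 1:40                    # stand-in for the 40 reads
# draw 25 distinct indices (sampling without replacement)
sampled_ids <- sample(read_ids, 25, replace = FALSE)
print(sort(sampled_ids))
```

Because replace = FALSE, no read can appear twice in the sample.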
Alignment
Find the number of overlapping base pairs for each combination of reads in both directions.
Example:
Read 1 followed by read 2 (Weight: 3)
ACTGGTTG
     TTGAGCAC
Read 2 followed by read 1 (Weight: 2)
TTGAGCAC
      ACTGGTTG
Weight matrix (row = first read, column = second read)
      1   2
1     8   3
2     2   8
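The overlap weights in this example can be checked with a short sketch. The helper name overlap_weight is hypothetical (it is not from the slides), and the sketch compares substrings directly instead of using the regular-expression approach in the appendix. It leaves the diagonal at 8 (a read overlaps itself completely), whereas the appendix code sets the diagonal to 0.

```r
# Longest suffix of `a` that equals a prefix of `b` (0 if there is none).
# Assumes both reads are non-empty strings.
overlap_weight <- function(a, b) {
  max_len <- min(nchar(a), nchar(b))
  for (k in max_len:1) {
    if (substring(a, nchar(a) - k + 1) == substring(b, 1, k)) {
      return(k)
    }
  }
  0
}

reads <- c("ACTGGTTG", "TTGAGCAC")
n <- length(reads)
# weight_matrix[i, j]: overlap when read i is placed before read j
weight_matrix <- matrix(0, n, n)
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    weight_matrix[i, j] <- overlap_weight(reads[i], reads[j])
  }
}
print(weight_matrix)
```

Read 1 before read 2 gives weight 3 (TTG), and read 2 before read 1 gives weight 2 (AC), so the matrix is not symmetric and the direction of each overlap has to be kept.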
Ordering reads
Order the reads by decreasing weights
If a tie occurs, pick randomly
Follow the greedy algorithm
Example (W = Weight, S = Start, E = End):
      W   S   E
1     8   2   3
2     8   3   2
3     7   8   5
4     7   6   5
Reads with the same weight are ordered randomly; the greedy algorithm then picks the reads used for contig construction.
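A sketch of the ordering and selection step, under two assumptions. First, the W/S/E values are the slide's example read column by column. Second, the greedy rule is taken to be the common formulation (accept an overlap only if its start read has no successor yet, its end read has no predecessor yet, and it does not close a cycle); the slides do not spell the rule out. The names edges, used and reaches are illustrative.

```r
# Candidate overlaps: W = weight, S = start read, E = end read
edges <- data.frame(W = c(8, 8, 7, 7), S = c(2, 3, 8, 6), E = c(3, 2, 5, 5))

set.seed(1)  # only so the random tie-breaking is repeatable
# sort by decreasing weight; a random key breaks ties
edges <- edges[order(-edges$W, runif(nrow(edges))), ]

# Follow the chain of accepted overlaps from `from`; TRUE if it reaches `to`
reaches <- function(from, to, used) {
  cur <- from
  repeat {
    nxt <- used$E[used$S == cur]
    if (length(nxt) == 0) return(FALSE)
    if (nxt == to) return(TRUE)
    cur <- nxt
  }
}

used <- edges[0, ]  # accepted overlaps, initially empty
for (r in seq_len(nrow(edges))) {
  e <- edges[r, ]
  ok <- !(e$S %in% used$S) &&      # read S has no successor yet
        !(e$E %in% used$E) &&      # read E has no predecessor yet
        !reaches(e$E, e$S, used)   # S -> E would not close a cycle
  if (ok) used <- rbind(used, e)
}
print(used)
```

Whichever way the ties fall, one weight-8 overlap and one weight-7 overlap survive: 2 to 3 and 3 to 2 would form a cycle, and 8 to 5 and 6 to 5 compete for the same end read.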
Construction of contigs
Match the read starts with read ends and vice versa
Example (S = Start, E = End):
      S   E
1     2   3
2     8   5
3     5   9
4     6   2
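A sketch of the start/end matching, assuming the slide's S/E example reads column by column as the pairs (2, 3), (8, 5), (5, 9), (6, 2), and assuming the accepted overlaps form simple chains (each read has at most one successor and there are no cycles). The names links, heads and contigs are illustrative.

```r
# Accepted overlaps as (start read, end read) pairs
links <- data.frame(S = c(2, 8, 5, 6), E = c(3, 5, 9, 2))

contigs <- list()
# a contig begins at a read that is never the end of another overlap
heads <- setdiff(links$S, links$E)
for (h in heads) {
  path <- h
  repeat {
    # the read that follows the current last read of the contig, if any
    nxt <- links$E[links$S == path[length(path)]]
    if (length(nxt) == 0) break
    path <- c(path, nxt)
  }
  contigs[[length(contigs) + 1]] <- path
}
print(contigs)
```

For these pairs the matching yields two contigs, reads 8, 5, 9 and reads 6, 2, 3.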
Computer Simulation
Number of Contigs
          40 Reads   Sample 25 Reads
Max       17.00      10.00
Mean      15.47       9.45
Min.      14.00       9.00
Length of contigs
          40 Reads   Sample 25 Reads
Max       173.00     102.00
Mean      166.67      98.15
Min.      161.00      95.00
Note: Total number of bp is 320 (40 reads of 8 bp each)
Average Length of contigs
          40 Reads   Sample 25 Reads
Max       11.50      10.56
Mean      10.80      10.40
Min.      10.18      10.20
APPENDIX
Sample 25 reads w/o replacement
### input
reads <- read.table("http://www.cmb.usc.edu/deonier/data/data_files/r.reads.txt", skip=1)
# keep 25 of the rows, drawn without replacement
reads <- reads[sample(nrow(reads), 25, replace=FALSE), ]
Getting the weights
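# Assumes reads.v is the character vector of 8-nucleotide reads and weights is a
# pre-allocated square matrix with one row and one column per read.
# weights[x, y] = length of the longest prefix of read x that equals a suffix of read y
for (x in 1:dim(weights)[1]) {
  for (y in 1:dim(weights)[2]) {
    if (x == y) {
      weights[x, y] <- 0
    } else {
      # match.length is ss when the two ss-long substrings are equal, else -1
      weights[x, y] <- max(sapply(1:8, function(ss)
        attr(gregexpr(substring(reads.v[x], 1, ss),
                      substring(reads.v[y], 9 - ss))[[1]], 'match.length')))
    }
  }
}
# pairs with no overlap at any length are left at -1; record them as weight 0
weights <- replace(weights, weights == -1, 0)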
Ordering the reads
ordered.weights <- ss.weights[order(ss.weights$weights, decreasing=TRUE), ]
Tie breaker
# order by decreasing weight; a random key breaks ties between equal weights
rand.order.weights <- ordered.weights[order(-ordered.weights$weights,
                                            runif(nrow(ordered.weights))), ]
Constructing the Contig
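dataframename <- 0
repeat {
  for (i in 1:nrow(reads.use)) {
    # seed a new contig with the first unused overlap
    contig <- reads.use[1, ]
    repeat {
      # extend on the left (an overlap ending where the contig starts) and on
      # the right (an overlap starting where the contig ends)
      contig <- rbind(reads.use[which(contig$start[1] == reads.use$end), ],
                      contig,
                      reads.use[which(contig$end[nrow(contig)] == reads.use$start), ])
      # drop the overlaps that are now part of the contig
      reads.use <- reads.use[-(which(rownames(reads.use) %in% rownames(contig))), ]
      if (length(which(contig$start[1] == reads.use$end)) == 0 &&
          length(which(contig$end[nrow(contig)] == reads.use$start)) == 0)
        break
    }
    dataframename[i] <- paste("contigseq", i, sep='')
    assign(dataframename[i], contig)
  }
  if (dim(reads.use)[1] == 0)
    break
}
# keep only the contigs that contain reads (iterations after reads.use is empty yield NA rows)
nona <- sapply(1:length(dataframename), function(x) length(which(!is.na(get(dataframename[x])))))
nona <- which(nona > 0)
sapply(nona, function(x) print(get(dataframename[x])))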
Getting the letterings
contig.name <- 0
for (cc in 1:length(nona)) {
  # start from the first read of the contig
  contig.letter <- reads.v[get(dataframename[cc])$start[1]]
  for (i in 1:nrow(get(dataframename[cc]))) {
    # trim the overlapping bases from the end of the contig, then append the next read
    contig.letter <- paste(substring(contig.letter, 1,
                                     nchar(contig.letter) - get(dataframename[cc])$weight[i]),
                           reads.v[get(dataframename[cc])$end[i]], sep='')
  }
  contig.name[cc] <- paste("full.contig.letter", cc, sep='')
  assign(contig.name[cc], contig.letter)
}
sapply(1:length(contig.name), function(x) get(contig.name[x]))
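A small worked instance of the lettering step, using the two reads from the alignment example. The helper merge_reads is a hypothetical name; it drops the shared bases from the start of the incoming read, which gives the same sequence as trimming them from the end of the contig as the code above does.

```r
# Append `nxt` to `contig`, skipping the `w` bases they share at the junction.
merge_reads <- function(contig, nxt, w) {
  paste0(contig, substring(nxt, w + 1))
}

reads <- c("ACTGGTTG", "TTGAGCAC")
contig <- merge_reads(reads[1], reads[2], 3)  # overlap TTG has weight 3
print(contig)         # ACTGGTTGAGCAC
print(nchar(contig))  # 13 = 8 + 8 - 3
```

Each overlap of weight w shortens the contig by w bases relative to simply concatenating the reads.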
Summary Statistics: 25 reads
Average size of contigs
Min. 1st Qu. Median Mean 3rd Qu. Max.
10.20 10.20 10.56 10.40 10.56 10.56
Total number of base pairs
Min. 1st Qu. Median Mean 3rd Qu. Max.
95.00 95.00 95.00 98.15 102.00 102.00
Number of contigs
Min. 1st Qu. Median Mean 3rd Qu. Max.
9.00 9.00 9.00 9.45 10.00 10.00
Summary Statistics: 40 reads
Average size of contigs
Min. 1st Qu. Median Mean 3rd Qu. Max.
10.18 10.38 10.87 10.80 10.93 11.50
Total number of base pairs
Min. 1st Qu. Median Mean 3rd Qu. Max.
161.0 163.8 166.0 166.7 170.0 173.0
Number of contigs
Min. 1st Qu. Median Mean 3rd Qu. Max.
14.00 15.00 15.00 15.47 16.00 17.00