STAT 157 Homework #1 (nolan/vigre/reports/Team1.pdf) Problem #4 Part 2 Team 1: Jiejing Zhao, Mei Zhou,...

Post on 17-Mar-2018


STAT 157 Homework #1

Problem #4

Team 1: Jiejing Zhao, Mei Zhou, Yang Yang Liu, Robin Hestir, Henry Ye Li, Yi (Eve) Song

Problem Method 40 reads 25 reads Comparison

Problem #4 Part 2 Team 1: Jiejing Zhao, Mei Zhou, Yang Yang Liu, Robin Hestir, Henry Ye Li, Yi (Eve) Song

Agenda

• Problem

• Method

– Random sampling

– Alignment

– Ordering reads

– Construction of contigs

• Summary of findings

• Comparison


Problem

Download the set of 40 sequence reads (each 8 nucleotides long). Perform the sequence assembly using the graph greedy algorithm.

Suppose that you were told that reads 2 and 23* are from the same clone and around 70 bp apart. How would this affect your sequence assembly?

Now we sample 25 reads without replacement from the set of 40 reads. Perform the assembly again as above on the 25 reads. How does this affect the sizes of the assembled contigs and the number of contigs?



Method

• Alignment

– Find weights and their directions of overlap

• Ordering reads

– Find reads to be used in contig construction (greedy algorithm)

• Construction of contigs

– Alignment of base pairs and linking of reads


Random sampling

Write code to select a random sample of 25 reads without replacement

Our PROBLEM !!!


Alignment

Find the number of overlapping base pairs for each combination of reads in both directions.

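Example:

Read 1: ACTGGTTG
Read 2: TTGAGCAC

Read 1 followed by read 2:
ACTGG TTG
      TTG AGCAC
Weight: 3

Read 2 followed by read 1:
TTGAGC AC
       AC TGGTTG
Weight: 2

Weight matrix entries: 8 and 8 (each read against itself), 3 (read 1 followed by read 2), 2 (read 2 followed by read 1)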
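The weight between two reads can be sketched as a small function. This is a minimal illustration, not the code the team ran: `overlap_weight` is a hypothetical helper name, and it assumes the weight is the longest suffix of the first read that equals a prefix of the second.

```r
# Longest suffix of read `a` that equals a prefix of read `b`.
# overlap_weight() is a hypothetical helper written for this illustration.
overlap_weight <- function(a, b) {
  max_len <- min(nchar(a), nchar(b))
  # Try the longest candidate overlap first and stop at the first match.
  for (k in rev(seq_len(max_len))) {
    if (substring(a, nchar(a) - k + 1) == substring(b, 1, k)) {
      return(k)
    }
  }
  0  # no overlap at any length
}

read1 <- "ACTGGTTG"
read2 <- "TTGAGCAC"
w12 <- overlap_weight(read1, read2)  # read 1 followed by read 2
w21 <- overlap_weight(read2, read1)  # read 2 followed by read 1
```

With the two reads from the example, `w12` is 3 and `w21` is 2, so the weight depends on the direction of the overlap.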


Ordering reads

Order the reads by decreasing weights
If a tie occurs, pick randomly

Follow the greedy algorithm

Example (W = Weight, S = Start, E = End):

     W  S  E
1    8  2  3
2    8  3  2
3    7  8  5
4    7  6  5

Randomly order reads w/ same weight

Reads used for contig construction
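The greedy step can be sketched on the four example rows (W, S, E) from this slide. The rejection rules below are an assumption, since the slides do not spell them out: a row is skipped if its start read already has a successor, its end read already has a predecessor, or accepting it would close a cycle. `greedy_select` is a hypothetical name.

```r
# Greedy selection over rows already sorted by decreasing weight.
# Assumed rules (not stated in the slides): each read gets at most one
# successor and one predecessor, and no cycles are allowed.
greedy_select <- function(edges) {
  n <- max(edges$start, edges$end)
  next_of <- rep(NA_integer_, n)  # next_of[i]: read placed right after read i
  prev_of <- rep(NA_integer_, n)  # prev_of[i]: read placed right before read i
  keep <- logical(nrow(edges))
  for (r in seq_len(nrow(edges))) {
    s <- edges$start[r]
    e <- edges$end[r]
    if (s == e || !is.na(next_of[s]) || !is.na(prev_of[e])) next
    # Walk forward from e; ending up at s means this row would close a cycle.
    node <- e
    while (!is.na(next_of[node])) node <- next_of[node]
    if (node == s) next
    next_of[s] <- e
    prev_of[e] <- s
    keep[r] <- TRUE
  }
  edges[keep, ]
}

example <- data.frame(weight = c(8, 8, 7, 7),
                      start  = c(2, 3, 8, 6),
                      end    = c(3, 2, 5, 5))
used <- greedy_select(example)
```

Under these assumed rules, rows 1 and 3 are kept: row 2 would link read 3 back to read 2, and row 4 competes for read 5, which already has a predecessor.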


Construction of contigs

Match the read starts with read ends and vice versa

Example (S = Start, E = End):

     S  E
1    2  3
2    8  5
3    5  9
4    6  2
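Matching starts with ends amounts to following each chain of reads from its first read to its last. A minimal sketch on the S/E example from this slide, assuming every read has at most one successor and there are no cycles; `build_contigs` is a hypothetical name, and reads that appear in no pair would be single-read contigs that this sketch does not list.

```r
# Chain start -> end pairs into ordered lists of read numbers.
build_contigs <- function(edges) {
  # A contig begins at a read that is never the end of another pair.
  heads <- setdiff(edges$start, edges$end)
  lapply(heads, function(h) {
    path <- h
    repeat {
      nxt <- edges$end[edges$start == path[length(path)]]
      if (length(nxt) == 0) break  # nothing follows the last read
      path <- c(path, nxt[1])
    }
    path
  })
}

edges <- data.frame(start = c(2, 8, 5, 6),
                    end   = c(3, 5, 9, 2))
contigs <- build_contigs(edges)
```

For the example rows this gives two chains of reads, 8, 5, 9 and 6, 2, 3.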


Computer Simulation


Number of Contigs

         Sample 25 Reads    40 Reads
Max      10.00              17.00
Mean      9.45              15.47
Min.      9.00              14.00


Length of contigs

         Sample 25 Reads    40 Reads
Max      102.00             173.00
Mean      98.15             166.67
Min.      95.00             161.00

Note: Total number of bp is 320


Average Length of contigs

         Sample 25 Reads    40 Reads
Max      10.56              11.50
Mean     10.4               10.80
Min.     10.2               10.18
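The three quantities reported for each run (number of contigs, total length in bp, average length) can be computed from the assembled sequences in one step. A minimal sketch, assuming the contigs are held in a character vector; the two sequences below are made up for illustration and are not the team's output, and `contig_stats` is a hypothetical name.

```r
# Summaries for one assembly run, given the contig sequences as strings.
contig_stats <- function(contigs) {
  lens <- nchar(contigs)
  c(n_contigs = length(contigs),  # number of contigs
    total_bp  = sum(lens),        # total length of all contigs
    mean_len  = mean(lens))       # average contig length
}

# Illustrative input only: one 13 bp contig and one unmerged 8 bp read.
stats <- contig_stats(c("ACTGGTTGAGCAC", "TTGACCGT"))
```

Here the average length is simply the total length divided by the number of contigs, so fewer, longer contigs raise it.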

APPENDIX


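Sample 25 reads w/o replacement

### input
# Read the 40 reads, skipping the first line of the file
reads <- read.table("http://www.cmb.usc.edu/deonier/data/data_files/r.reads.txt", skip=1)

# Keep 25 rows chosen at random, without replacement
reads <- reads[sample(nrow(reads), 25, replace=FALSE), ]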

APPENDIX

Getting the weights

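# weights[x, y]: longest prefix of read x that equals a suffix of read y
for(x in 1:dim(weights)[1]) {
  for(y in 1:dim(weights)[2]) {
    if(x==y) {
      weights[x,y] <- 0
    } else {
      # match.length is ss when the first ss bases of read x equal the
      # last ss bases of read y, and -1 otherwise
      weights[x,y] <- max(sapply(1:8, function(ss)
        attr(regexpr(substring(reads.v[x], 1, ss), substring(reads.v[y], 9-ss)), 'match.length')))
    }
  }
}

# No overlap at any length leaves -1; record it as weight 0
weights <- replace(weights, weights==-1, 0)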

APPENDIX

Ordering the reads

ordered.weights <- ss.weights[order(ss.weights$weights, decreasing=TRUE), ]

APPENDIX

Tie breaker

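# Shuffle the rows that share a weight, keeping the weights in decreasing order
rand.order.weights <- ordered.weights[unlist(rev(lapply(
  min(ordered.weights$weights):max(ordered.weights$weights),
  function(x) {
    idx <- which(ordered.weights$weights==x)
    # Permute positions rather than calling sample(idx, ...): when idx holds
    # a single row number k, sample(k, 1) would draw from 1:k instead
    idx[sample.int(length(idx))]
  }))), ]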
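An alternative way to break ties at random is to sort on the weight and a random key in a single call to `order()`. This is a sketch of a different technique from the one above, shown on a made-up weight vector; `set.seed` is there only so the illustration is repeatable.

```r
set.seed(1)  # illustration only: makes the random tie-break repeatable
w <- c(8, 7, 8, 2, 7)  # made-up weights, not taken from the data

# Primary key: decreasing weight. Secondary key: a random number per row,
# so rows with equal weight land in random relative order.
shuffled <- order(-w, runif(length(w)))
w_sorted <- w[shuffled]
```

The weights always come out in decreasing order; only the relative order of the tied rows changes from run to run.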

APPENDIX

Constructing the Contig

dataframename <- 0
repeat {
  for(i in 1:nrow(reads.use)) {
    # Seed a new contig with the first unused row
    contig <- reads.use[1, ]
    repeat {
      # Extend on the left (a row ending where the contig starts) and on
      # the right (a row starting where the contig ends)
      contig <- rbind(reads.use[which(contig$start[1]==reads.use$end), ],
                      contig,
                      reads.use[which(contig$end[nrow(contig)]==reads.use$start), ])
      # Drop the rows that are now part of the contig
      reads.use <- reads.use[-(which(rownames(reads.use) %in% rownames(contig))), ]
      # Stop when nothing extends either end
      if(length(which(contig$start[1]==reads.use$end))==0 &
         length(which(contig$end[nrow(contig)]==reads.use$start))==0)
        break
    }
    dataframename[i] <- paste("contigseq", i, sep='')
    assign(dataframename[i], contig)
  }
  if(dim(reads.use)[1]==0)
    break
}

# Keep only the contigs with at least one non-NA entry, then print them
nona <- sapply(1:length(dataframename), function(x) length(which(!is.na(get(dataframename[x])))))
nona <- which(nona>0)
sapply(nona, function(x) print(get(dataframename[x])))

APPENDIX

Getting the letterings

contig.name <- 0

for(cc in 1:length(nona)) {
  # Start from the first read of the contig
  contig.letter <- reads.v[get(dataframename[cc])$start[1]]
  for(i in 1:nrow(get(dataframename[cc]))) {
    # Trim the overlapping bases from the end, then append the next read
    contig.letter <- paste(substring(contig.letter, 1,
                                     nchar(contig.letter)-get(dataframename[cc])$weight[i]),
                           reads.v[get(dataframename[cc])$end[i]], sep='')
  }
  contig.name[cc] <- paste("full.contig.letter", cc, sep='')
  assign(contig.name[cc], contig.letter)
}

sapply(1:length(contig.name), function(x) get(contig.name[x]))
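The merging step is easier to follow on a single pair of reads. A minimal sketch, assuming the reads are already in contig order and `weights[i]` is the overlap between read i and read i + 1; `merge_reads` is a hypothetical helper, and the reads are the two from the alignment example.

```r
# Join ordered reads into one sequence, skipping the overlapping bases
# at the start of each read after the first.
merge_reads <- function(reads, weights) {
  contig <- reads[1]
  for (i in seq_along(weights)) {
    contig <- paste0(contig, substring(reads[i + 1], weights[i] + 1))
  }
  contig
}

merged <- merge_reads(c("ACTGGTTG", "TTGAGCAC"), 3)
```

The two 8 bp reads share 3 bases, so `merged` is the 13 bp sequence ACTGGTTGAGCAC.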


Summary Statistics: 25 reads

Average size of contigs

Min. 1st Qu. Median Mean 3rd Qu. Max.

10.20 10.20 10.56 10.40 10.56 10.56

Total number of base pairs

Min. 1st Qu. Median Mean 3rd Qu. Max.

95.00 95.00 95.00 98.15 102.00 102.00

Number of contigs

Min. 1st Qu. Median Mean 3rd Qu. Max.

9.00 9.00 9.00 9.45 10.00 10.00


Summary Statistics: 40 reads

Average size of contigs

Min. 1st Qu. Median Mean 3rd Qu. Max.

10.18 10.38 10.87 10.80 10.93 11.50

Total number of base pairs

Min. 1st Qu. Median Mean 3rd Qu. Max.

161.0 163.8 166.0 166.7 170.0 173.0

Number of contigs

Min. 1st Qu. Median Mean 3rd Qu. Max.

14.00 15.00 15.00 15.47 16.00 17.00