CSIS-400: Bioinformatics Dr. Eric Breimer. Good News & Bad News Remind me to tell you the good news.

CSIS-400: Bioinformatics

Dr. Eric Breimer

Good News & Bad News

Remind me to tell you the good news

Quizzes

At least 6 pop-quizzes– The lowest one gets dropped

20 minutes long– Not difficult if you come to class

I may announce them ahead of time– if attendance is not a problem

Projects

3 Projects– Kind of hard

Open due dates– Submit early and I’ll give you feedback– Then, submit again to get a better grade– Final submission a week before classes end

Exams

First exam (early October)– Fail it and you should probably drop the course– Things are only going to get harder

Second exam is a warm up for the final– (i.e. late November)

Third exam is cumulative– during finals week.

Lecture

My PowerPoint slides are sometimes bare– Not enough detail to study from

Textbook is a reference manual– Look things up, not a tutorial

Most of the material comes from lecture.– Miss lecture a lot and you will be lost

Grading

25% Cumulative final exam 15% Higher exam grade 10% Lower exam grade 20% Quiz average

– 5 quizzes (equally weights)

30% Project average– 3 projects (different weights)

Bioinformatics

Computer Science/Math meets Biology/Genetics Data-driven Science

– Data collection – wet lab – 5%– Analyzing the data – computer lab – 95%

Human Genome Project– best example of bioinformatics

Machine Learning & Data Mining– used heavily in bioinformatics

Science(remind me to tell you a story)

What is science? Make an observation Develop a hypothesis

– explains the observation Run some experiments

– to test your hypothesis Collect enough data to

support your hypothesis– and it becomes a Theory

Science(again)

1. Observation

2. Hypothesis

3. Experiments

4. Supportive Data

5. Theory

Bad Science

Collect some data Look at the data exhaustively and try to find

some property/hypothesis Develop tools and run experiments to help

you discover some property/hypothesis Then develop a hypothesis that ‘makes

sense’ (scientifically speaking)

Bad Science(example)

For years, material science and engineering followed the ‘bad science’ model.

The search for better goop 1. Mix materials to make new types of goop

2. Test all the new types of goop

3. Pick the best goop

4. Then, try to explain scientifically:Why is this goop so good?

Bad Science(again)

1. Collect Data

2. Experiment

3. Design Tools

4. Analyze Data

5. Hypothesis/Theory

Some History

Researcher were shunned for conducting bad science

– Shunned by academia– Not shunned by industry

Then came the Human Genome Project Industry (Celera) kicked Academia’s (NIH)

butt. Q:What was their secret weapon? A: Bad Science

Please don’t shun me!

Analogy

Problem: You are in a cage with 500 lbs. hungry tiger

Only options:1. Reason with the tiger2. Shoot the tiger

Nobody wants to hurt a tiger, but…

Some problems can’t be reasoned with…

Analogy

The Human Genome data is like a 500 lbs. hungry tiger. A high-throughput computer is like a shotgun. Pure scientists will die reasoning with the problem Meanwhile, Industry will use computers and ‘bad

science’ to perform amazing genetic experiments.

The Genome Project(overly simplified version)

Biologists/Chemists figured out a way of reading the genetic material in a cell (sequencing)

Unfortunately, they couldn’t read it from beginning to end

They could only read it in small chunks And, the process was prone to error.


Over time, Biologists/Chemists sequenced a lot of DNA.

Lots of different organisms Lots of different segments Before any ‘real’ hypotheses could be made they

had to1. Assemble the data2. Correct the errors3. Put the segments together


Thus, algorithms, techniques, and methodologies were pioneered just to put together the segments.

Assembling entire Genomes is just the beginning

The next goal is to generate a complete mapping between sequence and function

Heartformation

Mapping

Which pieces do What function?

Glucosemetabolism

Liverfunction

Brain development

Vision

Cancerprob.

BlaaBlaa

I’m not good at making stuff up

Analogy

Imagine a C++ Program Imagine a really big one 4 million lines long…

int main(void) {

int x = 0;

int y = 0;

cout << "Enter two numbers: ";

cin >> x >> y;

int z = x + y;

cout >> z;

}

Analogy

Imagine compiling it into assembly language

00424D90 mov eax,dword ptr [ecx]00424D92 mov edx,7EFEFEFFh00424D97 add edx,eax00424D99 xor eax,0FFh00424D9C xor eax,edx00424D9E add ecx,400424DA1 test eax,81010100h00424DA6 je main_loop (00424d90)00424DA8 mov eax,dword ptr [ecx-4]

Analogy

Imagine what the machine code would look like

Shown here in Hex format

Analogy

Now picture the hex code as a binary sequence

– 10101010100011011010111100110001010101000110010011010101101101101010101010100011010101000110110101111001100010101010001100100101010100011011010111010100011011010111100110001010101000110010011010101101101101010101010100011101010100011011010111100110001010101000110010010101010001101101011101010001101101011110011000101010100011001001101010110110110101010101010001101010100011011010111100110001010101000110010010101010001101101011101010001101101011110011000101010100011001001101010110110111101010100011011010111100110001010101000110010010101010001101101011101010001101101011110011000101010100011001001101010110110110101010101010001101010100011011010111100110001010101000110010010101010001101101011101010001101101011110011000101010100011001001

Analogy

Now change some of the bits (2% error rate)

– 10101010100011011000111100110001010101000111010011010101101101101010101010100011000101000110110101111001100010100010001100101101010100011011010111010100011011010111100110001010101000110010011010101101101101010101010100011101010100011011010111100110001011101000110010010101010001101101011101010001101101011010011000101010100001001001001010010110110101010101010001101110100011011010111100110001010101000110010010101010001101101011101010001101101011010011000101010100011001001101010110010111101010100011011010111100110001010101000110010010101110001101101011101010001101101010010011000101010100011001001101010111110110101010101010001101010100011011010111100110001010101000110010010101010001101101011101010101101101011110011000101010100001001001

Analogy

Lets randomly sample segments of the sequences

– 10101010100011011000111100110001010101000111010011010101101101101010101010100011000101000110110101111001100010100010001100101101010100011011010111010100011011010111100110001010101000110010011010101101101101010101010100011101010100011011010111100110001011101000110010010101010001101101011101010001101101011010011000101010100001001001001010010110110101010101010001101110100011011010111100110001010101000110010010101010001101101011101010001101101011010011000101010100011001001101010110010111101010100011011010111100110001010101000110010010101110001101101011101010001101101010010011000101010100011001001101010111110110101010101010001101010100011011010111100110001010101000110010010101010001101101011101010101101101011110011000101010100001001001

1000110110001111001100010101010001

1110101000110110101001001

Analogy

Finally, lets collect 1,000,000 random segments

1010001000110011011010101000110110

101000100101010101000110110

00101010101010011010101000110110

011010111010101000110110

00101010101010011010101101010101

…

Analogy Complete

Problem:– Given 1,000,000 random samples.

Remember there are random errors

– Try to re-construct the original 4 million line program

You don’t have to really re-construct it You just have to tell me EXACTLY what it does.

Analogy Complete

Here’s the catch:– I’m not even going to tell you what programming

language I used– In fact, the only thing I’ll tell you is that it’s a

language you’ve never seen in your entire life.

Date post:	20-Dec-2015
Category:	Documents
View:	217 times
Download:	0 times

CSIS-400: Bioinformatics Dr. Eric Breimer. Good News & Bad News Remind me to tell you the good news.

Documents