
The Shocking Details of Genome.ucsc.edu. History of the Code Started in 1999 in C after Java proved...

Date post: 20-Dec-2015
Transcript
Page 1: The Shocking Details of Genome.ucsc.edu. History of the Code Started in 1999 in C after Java proved hopelessly unportable across browsers. Early modules.

The Shocking Details of Genome.ucsc.edu


Page 2

History of the Code

• Started in 1999 in C after Java proved hopelessly unportable across browsers.

• Early modules include a worm genome browser (Intronerator) and GigAssembler, which produced the working draft of the human genome.

• In 2001 a few other grad students started working on the code.

• In 2002 hired staff to help with Genome Browser

• Currently project employs ~20 full-time people.

Page 3

The Genome Browser Staff

• 5 programmers - Mark, Angie, Hiram, Kate, Rachel, Fan, Jim

• 4 quality assurance engineers - Heather, Bob, Mike, Galt

• 3 post-docs - Terry, Gill, Katie

• 9 grad students - Chuck, Daryl, Brian, Robert, Yontao, Krish, Adam, Ryan, Andy

• 3 system administrators - Paul, Jorge, Patrick

• 1 writer - Donna

• David Haussler and CBSE Staff

• About 1/3 of staff (including me 3 days a week) telecommutes.

Page 4

The Goal

Make the human genome understandable by humans.

Page 5

Prognosis

Maybe we’ll understand it one of these days

Page 7

Cardiac Troponin T2

Page 8

Comparative Genomics at BMP10

Page 9

Normalized eScores

Page 10

Conservation Levels of Regulatory Regions

Page 11

Complex Transcription

Page 12

Add Your Own Tracks

• Users can extend the browser with their own tracks.

• User tracks can be private or public.

• No programming required.

• GFF, GTF, PSL or BED formats supported

#chrom start end [name score strand …]
chr1 1302347 1302357 SP1 800 +
chr1 1504778 1504787 SP2 980 -
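As a rough illustration of how little structure a BED line has, here is a minimal sketch of a parser for the three required columns plus the optional name, score and strand, in the standard BED column order. The struct and function names (bedLine, bedLineParse) are hypothetical and are not taken from the browser's code base; the fixed buffer sizes are also just assumptions.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* One line of a BED-style custom track. Hypothetical struct for illustration. */
struct bedLine
    {
    char chrom[32];   /* Chromosome name, e.g. "chr1". */
    long start;       /* Start position. */
    long end;         /* End position. */
    char name[64];    /* Optional item name; empty if absent. */
    int score;        /* Optional score; 0 if absent. */
    char strand;      /* Optional strand, '+' or '-'; '.' if absent. */
    };

int bedLineParse(const char *line, struct bedLine *bed)
/* Parse one whitespace-separated line into bed. Returns the number of
 * fields read (3 to 6), or 0 if the line is a comment or malformed. */
{
int count;
assert(line != NULL && bed != NULL);
memset(bed, 0, sizeof(*bed));
bed->strand = '.';
if (line[0] == '#')
    return 0;
count = sscanf(line, "%31s %ld %ld %63s %d %c",
    bed->chrom, &bed->start, &bed->end, bed->name, &bed->score, &bed->strand);
if (count < 3)
    return 0;
return count;
}
```

Calling bedLineParse on "chr1 1302347 1302357 SP1 800 +" fills all six fields; a three-column line leaves name empty, score 0 and strand '.'.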

Page 13

The Underlying Database

• Power users and bioinformaticians sometimes want the underlying database.

• There is a table for each track.

• Larger tracks have a table for each chromosome.

• Format of a track table generally similar to add-your-own track formats.

• Pieces of database available from ‘tables’ browser.

• Whole database available as tab-separated files.

• Most of database served via DAS.

Page 14

Parasol and Kilo Cluster

• UCSC cluster has 1000 CPUs running Linux

• 1,000,000 BLASTZ jobs in 25 hours for mouse/human alignment

• We wrote Parasol job scheduler to keep up.

– Very fast and free.

– Jobs are organized into batches.

– Error checking at job and at batch level.

Page 15

Science is Hard

Page 16

Coding: Discipline Is Required

• "While software development is immune from almost all physical laws, entropy hits us hard." - The Pragmatic Programmer

• To keep the system from devolving into disorder we have to follow code conventions and insist on a lot of testing.

• We use CVS (Concurrent Versions System) to help all of us work on the same code at once.

Page 17

Obtaining the Code from CVS

• See http://genome.ucsc.edu/admin/cvs.html

• This gets you a ‘sandbox’ - a local copy of the source to compile and edit.

• Type ‘make’ in the lib and utilities directory.

• You can do a ‘cvs update’ to get our updates to the code base.

• To add permanently to code base email me to enable ‘cvs commit’

Page 18

Expand Your Mental Capacity With…


Page 19

Lagging Edge Software

• C language - compilers still available!

• CGI Scripts - portable if not pretty.

• SQL database - at least MySQL is free.

Page 20

Problems with C

• Missing booleans and strings.

• No real objects.

• Must free things

Page 21

Advantages of C

• Very fast at runtime.

• Very portable.

• Language is simple.

• No tangled inheritance hierarchy.

• Excellent free tools are available.

• Libraries and conventions can compensate for language weaknesses.

Page 22

Coping with Missing Data Types in C

• #define boolean int

• Fixing lack of real string type much harder

– lineFile/common modules and autoSql code generator make parsing files relatively painless

– dyString module not a horrible string ‘class’
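To make the idea of a string ‘class’ in C concrete, here is a minimal sketch of a growable string buffer alongside the boolean define. It is written from scratch for illustration: strBuf and its functions are hypothetical names and do not reproduce the actual dyString module or its signatures.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define boolean int
#ifndef TRUE
#define TRUE 1
#define FALSE 0
#endif

struct strBuf
/* Hypothetical growable string; not the real dyString definition. */
    {
    char *string;     /* Null-terminated contents. */
    size_t size;      /* Current length, excluding terminator. */
    size_t capacity;  /* Allocated bytes, including terminator. */
    };

struct strBuf *strBufNew(size_t initialCapacity)
/* Allocate an empty buffer. A capacity of 0 picks a small default. */
{
struct strBuf *sb = calloc(1, sizeof(*sb));
if (sb == NULL)
    abort();
if (initialCapacity == 0)
    initialCapacity = 16;
sb->string = calloc(initialCapacity, 1);
if (sb->string == NULL)
    abort();
sb->capacity = initialCapacity;
return sb;
}

void strBufAppend(struct strBuf *sb, const char *text)
/* Append text, at least doubling the allocation when it runs out. */
{
size_t len = strlen(text);
size_t needed = sb->size + len + 1;
assert(sb != NULL);
if (needed > sb->capacity)
    {
    size_t newCapacity = sb->capacity * 2;
    char *bigger;
    if (newCapacity < needed)
        newCapacity = needed;
    bigger = realloc(sb->string, newCapacity);
    if (bigger == NULL)
        abort();
    sb->string = bigger;
    sb->capacity = newCapacity;
    }
memcpy(sb->string + sb->size, text, len + 1);
sb->size += len;
}

boolean strBufIsEmpty(const struct strBuf *sb)
/* Return TRUE if nothing has been appended yet. */
{
return sb->size == 0 ? TRUE : FALSE;
}

void strBufFree(struct strBuf **pSb)
/* Free buffer and set the caller's pointer to NULL. */
{
struct strBuf *sb = *pSb;
if (sb != NULL)
    {
    free(sb->string);
    free(sb);
    *pSb = NULL;
    }
}
```

The caller never manages lengths or reallocations directly, which is most of what a string type buys you.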

Page 23

Object Oriented Programming in C

• Build objects around structures.

• Make families of functions with names that start with the structure name, and that take the structure as the first argument.

• Implement polymorphism/virtual functions with function pointers in structure.

• Inheritance is still difficult. Perhaps this is not such a bad thing.

Page 24

struct dnaSeq
/* A dna sequence in one-letter-per-base format. */
    {
    struct dnaSeq *next;  /* Next in list. */
    char *name;           /* Sequence name. */
    char *dna;            /* a's c's g's and t's. Null terminated */
    int size;             /* Number of bases. */
    };

struct dnaSeq *dnaSeqFromString(char *string);
/* Convert string containing sequence and possibly
 * white space and numbers to a dnaSeq. */

void dnaSeqFree(struct dnaSeq **pSeq);
/* Free dnaSeq and set pointer to NULL. */

void dnaSeqFreeList(struct dnaSeq **pList);
/* Free list of dnaSeq's. */

Page 25

struct screenObj
/* A two dimensional object in a sleazy video game. */
    {
    struct screenObj *next;               /* Next in list. */
    char *name;                           /* Object name. */
    int x,y,width,height;                 /* Bounds of object. */
    void (*draw)(struct screenObj *obj);  /* Draw object */
    boolean (*in)(struct screenObj *obj, int x, int y);
                                          /* Return true if x,y is in object */
    void *custom;                         /* Custom data for a particular type */
    void (*freeCustom)(struct screenObj *obj);
                                          /* Free custom data. */
    };

#define screenObjDraw(obj) ((obj)->draw(obj))
/* Draw object. */

void screenObjFree(struct screenObj **pObj);
/* Free up screen object including custom part. */
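To show how the function pointers give polymorphism, here is a minimal sketch with two object types sharing a trimmed copy of the screenObj structure (only the fields the example needs). The rectangle and circle types, the screenObjIn macro, and the constructor names are all hypothetical additions for illustration.

```c
#include <assert.h>
#include <stdlib.h>

#define boolean int

struct screenObj
/* Trimmed copy of the slide's structure: name and draw omitted. */
    {
    struct screenObj *next;   /* Next in list. */
    int x, y, width, height;  /* Bounds of object. */
    boolean (*in)(struct screenObj *obj, int x, int y);
    void *custom;             /* Custom data for a particular type. */
    void (*freeCustom)(struct screenObj *obj);
    };

#define screenObjIn(obj, x, y) ((obj)->in((obj), (x), (y)))

struct circleData
/* Custom data for the hypothetical circle type. */
    {
    int radius;
    };

static boolean rectIn(struct screenObj *obj, int x, int y)
/* Hit test against the bounding box. */
{
return x >= obj->x && x < obj->x + obj->width
    && y >= obj->y && y < obj->y + obj->height;
}

static boolean circleIn(struct screenObj *obj, int x, int y)
/* Hit test against a circle inscribed in the bounding box. */
{
struct circleData *circle = obj->custom;
int dx = x - (obj->x + circle->radius);
int dy = y - (obj->y + circle->radius);
return dx*dx + dy*dy <= circle->radius * circle->radius;
}

static void circleFreeCustom(struct screenObj *obj)
{
free(obj->custom);
obj->custom = NULL;
}

static struct screenObj *screenObjNew(int x, int y, int width, int height)
/* Allocate a zeroed object with the given bounds. */
{
struct screenObj *obj = calloc(1, sizeof(*obj));
if (obj == NULL)
    abort();
obj->x = x; obj->y = y; obj->width = width; obj->height = height;
return obj;
}

struct screenObj *rectNew(int x, int y, int width, int height)
{
struct screenObj *obj = screenObjNew(x, y, width, height);
obj->in = rectIn;
return obj;
}

struct screenObj *circleNew(int x, int y, int radius)
{
struct screenObj *obj = screenObjNew(x, y, 2*radius, 2*radius);
struct circleData *circle = calloc(1, sizeof(*circle));
if (circle == NULL)
    abort();
circle->radius = radius;
obj->custom = circle;
obj->in = circleIn;
obj->freeCustom = circleFreeCustom;
return obj;
}

void screenObjFree(struct screenObj **pObj)
/* Free up screen object including custom part. */
{
struct screenObj *obj = *pObj;
if (obj != NULL)
    {
    if (obj->freeCustom != NULL)
        obj->freeCustom(obj);
    free(obj);
    *pObj = NULL;
    }
}
```

Code that walks a list of screenObj can call screenObjIn on each element without knowing whether it is a rectangle or a circle; the per-type behavior lives entirely in the pointers set by the constructor.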

Page 26

Naming Conventions

• Code is constrained by few natural laws.

• There are many ways to do things, so programmers make arbitrary decisions.

• Arbitrary decisions are hard to remember.

• Conventions make decisions less arbitrary.

• varName vs. VarName vs varname vs var_name. We use varName.

• variable vs. var vs. vrbl vs. vble vs varible: if you need to abbreviate, keep it short.

Page 27

Commenting Conventions

• Each module has a comment describing its overall purpose.

• Each function also has an overall comment.

• Each field in a structure has a comment.

• Longer functions broken into ‘paragraphs’ that each begin with a comment.

• The module, function, and structure comments are replicated in the .h file, which serves as an index to the module.

Page 28

Error Handling

• Code prints out a message and aborts (via the errAbort function) when there is a problem.

• This saves loads of error handling code and is generally the right thing to do.

• You can ‘catch’ an errAbort if necessary, though it rarely is.
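One way an abort-on-error routine with an optional ‘catch’ could be built in plain C is with setjmp/longjmp. The sketch below is an assumption about the general technique, not the library's implementation: sketchErrAbort, tryParsePositive and the other names are hypothetical, and the real errAbort may work differently.

```c
#include <assert.h>
#include <setjmp.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

static jmp_buf *catchPoint = NULL;  /* Non-NULL while a catch is active. */
static char lastError[256];         /* Most recent error message. */

const char *sketchLastError(void)
/* Return the message from the most recent abort. */
{
return lastError;
}

void sketchErrAbort(const char *format, ...)
/* Format a message, then jump to the active catch if there is one.
 * With no catch active, print the message and exit. */
{
va_list args;
assert(format != NULL);
va_start(args, format);
vsnprintf(lastError, sizeof(lastError), format, args);
va_end(args);
if (catchPoint != NULL)
    longjmp(*catchPoint, 1);
fprintf(stderr, "%s\n", lastError);
exit(1);
}

int parsePositive(const char *text)
/* Typical caller: no error codes to return or check. */
{
int value = atoi(text);
if (value <= 0)
    sketchErrAbort("Expecting positive number, got '%s'", text);
return value;
}

int tryParsePositive(const char *text, int *retValue)
/* The rare 'catch': returns 1 on success, 0 if the abort fired. */
{
jmp_buf env;
jmp_buf *saved = catchPoint;  /* Allow nested catches. */
catchPoint = &env;
if (setjmp(env) != 0)
    {
    catchPoint = saved;
    return 0;
    }
*retValue = parsePositive(text);
catchPoint = saved;
return 1;
}
```

Note how parsePositive and anything that calls it stay free of error plumbing; only the one place that wants to recover pays for it.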

Page 29

Memory

• Uninitialized memory leads to difficult bugs.

• Compiler set to warn of uninitialized vars

• Dynamic memory goes through needMem. It is always zeroed.

• Memory usually freed with freez(), which sets pointer to null as well as freeing it.

• ‘Careful’ memory handler can be pushed to help track down memory bugs:

– Sentinel values to detect writing past end of array

– Detects memory freed twice or not freed

– Detects heap corruption in general.
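The two habits described here, always-zeroed allocation and free-then-null, can be sketched in a few lines. The functions below are stand-ins with hypothetical names (sketchNeedMem, sketchFreez); they only illustrate the behavior the slide describes and leave out the pushable ‘careful’ handler.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

void *sketchNeedMem(size_t size)
/* Allocate zeroed memory or exit with a message; callers never see NULL. */
{
void *pt;
if (size == 0)
    size = 1;   /* calloc may return NULL for zero bytes; avoid the ambiguity. */
pt = calloc(1, size);
if (pt == NULL)
    {
    fprintf(stderr, "Out of memory: %lu bytes\n", (unsigned long)size);
    exit(1);
    }
return pt;
}

void sketchFreez(void *ppt)
/* Pass the address of a pointer. Frees what it points to and sets it to
 * NULL. Assumes all object pointers share one representation, which holds
 * on common platforms. */
{
void **pp = (void **)ppt;
assert(ppt != NULL);
free(*pp);
*pp = NULL;
}
```

Because the pointer is nulled, a later use crashes immediately instead of silently reading freed memory, and a second sketchFreez on the same variable is harmless.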

Page 31

Generally Useful Modules

• String handling - common, dystring, wildcmp

• Collections - common (singly linked lists), hash, dlist, binRange, rbTree

• DNA - dnautils, dnaseq

• Web - htmshell, cheapcgi, htmlPage

• I/O - linefile, xap (XML), fa, nib, twoBit, blastParse, blastOut, maf, chain, gff

• Graphics - memgfx, gifwrite, psGfx, vGfx

Page 32

Anatomy of a CGI Script

• Gets called by Web Server when user clicks submit or follows a cgi link.

• Input is in environment variables and sometimes also stdin. Routines in cheapCgi move this to a hash table.

• Output is to stdout. Routines in htmshell help with output formatting.

• In the middle often access a database.
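The input and output flow above can be sketched without any library: read QUERY_STRING from the environment, pull out one variable, and print a header plus HTML. The function names (cgiSketchVar, cgiSketchRespond) and the ‘position’ variable are hypothetical; this does not use cheapCgi or htmshell, and it skips URL decoding and POST input on stdin.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int cgiSketchVar(const char *query, const char *name, char *value, size_t valueSize)
/* Look up name in a query string like "db=hg17&position=chr22". Copies the
 * raw (not URL-decoded) value, truncated to fit. Returns 1 if found. */
{
size_t nameLen = strlen(name);
const char *pos = query;
assert(valueSize > 0);
while (pos != NULL && *pos != '\0')
    {
    const char *end = strchr(pos, '&');
    size_t pairLen = (end != NULL) ? (size_t)(end - pos) : strlen(pos);
    if (pairLen > nameLen && strncmp(pos, name, nameLen) == 0 && pos[nameLen] == '=')
        {
        size_t valueLen = pairLen - nameLen - 1;
        if (valueLen >= valueSize)
            valueLen = valueSize - 1;
        memcpy(value, pos + nameLen + 1, valueLen);
        value[valueLen] = '\0';
        return 1;
        }
    pos = (end != NULL) ? end + 1 : NULL;
    }
return 0;
}

static void htmlEscapeTo(FILE *out, const char *text)
/* Write text with HTML special characters escaped, so user input
 * cannot inject markup into the page. */
{
for (; *text != '\0'; ++text)
    {
    switch (*text)
        {
        case '<': fputs("&lt;", out); break;
        case '>': fputs("&gt;", out); break;
        case '&': fputs("&amp;", out); break;
        case '"': fputs("&quot;", out); break;
        default: fputc(*text, out); break;
        }
    }
}

void cgiSketchRespond(FILE *out)
/* Input from the environment, output to the given stream (stdout in a
 * real CGI). A database lookup would go between the two steps. */
{
const char *query = getenv("QUERY_STRING");
char position[128];
if (query == NULL || !cgiSketchVar(query, "position", position, sizeof(position)))
    strcpy(position, "(none)");
fputs("Content-Type: text/html\r\n\r\n", out);
fputs("<html><body><p>Position: ", out);
htmlEscapeTo(out, position);
fputs("</p></body></html>\n", out);
}
```

A real CGI program would call cgiSketchRespond(stdout) from main. Setting QUERY_STRING in the shell before running the binary is a simple way to exercise it from the command line.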

Page 33

Challenges of CGI

• Each click launches program anew.

– User state can be kept in ‘cart’ variables

• Run from Web Server, harder to debug

– Use cgiSpoof to run from command line

– Push an error handler that will close out web page, so can see your error messages. htmShell does this, but webShell may not….

• Ideally should run in less than 2 seconds.

Page 34

Relational Databases

• Relational databases consist of tables, indices, and the Structured Query Language (SQL).

• Tables are much like tab-separated files:

#chrom start end name strand score
chr22 14600000 14612345 ldlr + 0.989
chr21 18283999 18298577 vldlr - 0.998

Fields are simple - no lists or substructures.

• Can join tables based on a shared field. This is flexible, but only as fast as the index.

• Tables and joins are accessed a row at a time.

• The row is represented as an array of strings.

Page 35

Converting A Row to Object

struct exoFish *exoFishLoad(char **row)
/* Load a exoFish from row fetched with select * from exoFish
 * from database. Dispose of this with exoFishFree(). */
{
struct exoFish *ret;

AllocVar(ret);
ret->chrom = cloneString(row[0]);
ret->chromStart = sqlUnsigned(row[1]);
ret->chromEnd = sqlUnsigned(row[2]);
ret->name = cloneString(row[3]);
ret->score = sqlUnsigned(row[4]);
return ret;
}

Page 36

Motivation for AutoSql

• Row to object code is tedious at best.

• Also have save object, free object code to write.

• SQL create statement needs to match C structure.

• Lack of lists without doing a join can seriously impact performance and complicate schema.

Page 37

AutoSql Data Declaration

table exoFish
"An evolutionarily conserved region (ecore) with Tetraodon"
    (
    string chrom;     "Human chromosome or FPC contig"
    uint chromStart;  "Start position in chromosome"
    uint chromEnd;    "End position in chromosome"
    string name;      "Ecore name in Genoscope database"
    uint score;       "Score from 0 to 1000"
    )

See autoSql.doc for more details. See also autoXml.
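For a sense of what such a declaration saves, here is a hand-written approximation of the C that corresponds to the exoFish table: a struct with one member per column, a row loader, and a free routine. It is a sketch, not actual autoSql output. The helper functions are simplified stand-ins for the library's cloneString and sqlUnsigned, and every name ending in Sketch is hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct exoFish
/* An evolutionarily conserved region (ecore) with Tetraodon. */
    {
    struct exoFish *next;  /* Next in singly linked list. */
    char *chrom;           /* Human chromosome or FPC contig */
    unsigned chromStart;   /* Start position in chromosome */
    unsigned chromEnd;     /* End position in chromosome */
    char *name;            /* Ecore name in Genoscope database */
    unsigned score;        /* Score from 0 to 1000 */
    };

static char *cloneStringSketch(const char *s)
/* Stand-in for cloneString: heap copy of s. */
{
size_t size = strlen(s) + 1;
char *copy = malloc(size);
if (copy == NULL)
    abort();
memcpy(copy, s, size);
return copy;
}

static unsigned sqlUnsignedSketch(const char *s)
/* Stand-in for sqlUnsigned: aborts unless s is all decimal digits. */
{
char *end;
unsigned long value;
if (s[0] < '0' || s[0] > '9')
    abort();
value = strtoul(s, &end, 10);
if (*end != '\0')
    abort();
return (unsigned)value;
}

struct exoFish *exoFishLoadSketch(char **row)
/* Build an exoFish from a five-column row of strings. */
{
struct exoFish *ret = calloc(1, sizeof(*ret));
assert(row != NULL);
if (ret == NULL)
    abort();
ret->chrom = cloneStringSketch(row[0]);
ret->chromStart = sqlUnsignedSketch(row[1]);
ret->chromEnd = sqlUnsignedSketch(row[2]);
ret->name = cloneStringSketch(row[3]);
ret->score = sqlUnsignedSketch(row[4]);
return ret;
}

void exoFishFreeSketch(struct exoFish **pEl)
/* Free one exoFish and set the caller's pointer to NULL. */
{
struct exoFish *el = *pEl;
if (el != NULL)
    {
    free(el->chrom);
    free(el->name);
    free(el);
    *pEl = NULL;
    }
}
```

Every line here follows mechanically from the five-field declaration, which is the argument for generating it rather than typing it.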

Page 38

Coding Conclusion

• It’s always safer on the lagging edge

• Consider redesigning system as COBOL character-based application

Page 39

UCSC Gene Family Browser


Expression and other information on genes in a big sorted, linked table

Page 51

Up in Testes, Down in Brain

Page 64

Conclusions

• Genome browser - good for exploring genome and displaying your custom tracks

• ‘kent’ code base - a good starting point for many programming projects

• Family browser - a fine way to collect data sets.

• Browser staff - helpful but overworked.

