Holland R - Pistoia Alliance Sequence Squeeze

Date post: 23-Jan-2015
Upload: jan-aerts
Presentation at BOSC2012 by Holland R - Pistoia Alliance Sequence Squeeze
Pistoia Alliance Sequence Squeeze Using a competition model to spur development of novel opensource algorithms Richard Holland (Eagle/Pistoia), Nick Lynch (AZ/Pistoia) BOSC July 2012
    

      

Pistoia Alliance Sequence Squeeze

Using a competition model to spur development of novel open-source algorithms

Richard Holland (Eagle/Pistoia), Nick Lynch (AZ/Pistoia)

BOSC July 2012

    

Order of Service

• What/who is the Pistoia Alliance?
• What is/was Sequence Squeeze?
• Who won, how, and why?
• Why did Pistoia do this?
• Why is this good for BOSC delegates?
• Will it happen again?

July  14,  2012   2  Pistoia  Alliance  Sequence  Squeeze  

    

What/who is the Pistoia Alliance?


    

Who is Pistoia?

• The Pistoia Alliance is
  – global
  – not-for-profit
  – precompetitive alliance
  – life science companies, vendors, publishers, and academic groups
  – aims to lower barriers to innovation
  – by improving the interoperability of R&D business processes.
• We differ from standards groups because
  – we bring together the key constituents to identify the root causes that lead to R&D inefficiencies
  – develop best practices and technology pilots to overcome common


    

What is/was Sequence Squeeze?


    

The NGS problem

• Storing millions of NGS reads and their quality scores uncompressed is impractical, yet current compression technologies are becoming inadequate.
• There is a need for a new and novel method of compressing sequence reads and their quality scores in a way that preserves 100% of the information whilst achieving much-improved linear (or, even better, non-linear) compression ratios.


    

What was Sequence Squeeze?

• Contest to find a better FASTQ compression algorithm
  – easiest format for ranking entries in an automated setting.
• Open source, non-restrictive licence required for entries
  – benefit the whole community.
• Entries tested on an extract of the 1000 Genomes data stored in AWS.
• Prize fund of US$15,000 to the best algorithm submitted before the closing date of 15 March 2012.
• Winner was announced at the Pistoia Alliance Conference in Boston MA on 24 April 2012
  – more on that story later.
• Organised and administered by Eagle under contract to Pistoia.


    

Who entered?

• 108 distinct entries.
• But all these from only 12 entrants!
  – some entrants were groups or consortia but most were individuals.
• Public leaderboard encouraged fiercer competition.
• Entrants seemingly driven to outdo their competitors.


    

Who judged?

• Yingrui Li – Duty Operation Officer of Science & Technology Department of the BGI-Shenzhen.
• Nick Lynch – President of the Pistoia Alliance (2009-11).
• Guy Coates – leader of the Informatics Systems Group at the Wellcome Trust Sanger Institute.
• Tim Fennell – Assistant Director for Sequencing Pipeline Informatics at the Broad Institute.


    

Who won, how, and why?


    

What were the results?

• Entrants were judged by
  – compression ratio
  – compression time and memory
  – decompression time and memory
  – accuracy (lossiness – 100% target)
  – manual review for code quality, scalability, and other factors.
• The same three people showed up at the top of every category
  – in a different order
  – with different versions of their entries.
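The automated criteria above can be approximated with a small harness. This is a minimal sketch, not the contest's actual scoring code: the function name `evaluate`, the synthetic FASTQ record, and the use of Python's standard `bz2` module as the codec under test are all assumptions made for illustration. The ratio is taken as compressed size over original size, which is also an assumption. Memory usage is left out because it is hard to measure portably from inside Python.

```python
import bz2
import time


def evaluate(compress, decompress, data: bytes) -> dict:
    """Measure ratio, timings, and losslessness for one codec on one input."""
    t0 = time.perf_counter()
    packed = compress(data)
    t1 = time.perf_counter()
    restored = decompress(packed)
    t2 = time.perf_counter()
    return {
        # Assumed definition: compressed size as a percentage of the original.
        "ratio_percent": 100.0 * len(packed) / len(data),
        "compress_seconds": t1 - t0,
        "decompress_seconds": t2 - t1,
        # The 100% accuracy target: the round trip must be byte-identical.
        "lossless": restored == data,
    }


# Tiny synthetic FASTQ input (hypothetical data, far smaller than the real test set).
record = b"@read1\nACGTACGTACGT\n+\nIIIIIIIIIIII\n"
sample = record * 1000

result = evaluate(bz2.compress, bz2.decompress, sample)
print(result)
```

Any other codec can be scored the same way by passing its own compress and decompress callables, which is roughly what an automated leaderboard needs in order to rank entries on identical input.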


    

Who won, and why?

• James Bonfield won overall
  – majority of top places in each category
  – using various versions of his entry
  – forming a suite of suitable tools.
• 11.41% compression ratio (test data ~6GB)
  – or 109.90 seconds compression time
  – or 100.91 seconds decompression time
  – or 35.76MB compression memory usage
  – or 16.01MB decompression memory usage
  – but not all at once!


    

Implications of winning entry

• The approach is very simple – essentially:
  – convert the FASTQ to BAM alignments against a reference genome, preserving quality scores.
  – compress the BAM files.
• Many other entries followed the same pattern:
  – convert to some other format then compress using standard techniques.
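The "convert to some other format, then compress" pattern can be sketched in a few lines. The example below is not the winning entry, which the slide describes as going via BAM alignments against a reference genome. It is a simplified stand-in that splits FASTQ into separate name, base, and quality streams and compresses each one with the standard `bz2` module. All helper names are hypothetical, the input is randomly generated, and the code assumes strict four-line records with a bare `+` separator line.

```python
import bz2
import random


def make_fastq(n_reads: int, read_len: int, seed: int = 0) -> bytes:
    """Generate synthetic four-line FASTQ records (hypothetical test data)."""
    rng = random.Random(seed)
    lines = []
    for i in range(n_reads):
        seq = "".join(rng.choice("ACGT") for _ in range(read_len))
        qual = "".join(rng.choice("FGHI") for _ in range(read_len))
        lines += [f"@read{i}", seq, "+", qual]
    return ("\n".join(lines) + "\n").encode("ascii")


def split_streams(fastq: bytes):
    """Convert step: separate names, bases, and qualities into three lists."""
    lines = fastq.decode("ascii").splitlines()
    return lines[0::4], lines[1::4], lines[3::4]


def compress_streams(fastq: bytes) -> list:
    """Compress step: apply a standard compressor to each stream on its own."""
    return [bz2.compress("\n".join(s).encode("ascii")) for s in split_streams(fastq)]


def decompress_streams(blobs: list) -> bytes:
    """Rebuild the original FASTQ bytes from the three compressed streams."""
    names, seqs, quals = (bz2.decompress(b).decode("ascii").split("\n") for b in blobs)
    out = []
    for name, seq, qual in zip(names, seqs, quals):
        out += [name, seq, "+", qual]
    return ("\n".join(out) + "\n").encode("ascii")


fastq = make_fastq(500, 100)
blobs = compress_streams(fastq)
print("original bytes:     ", len(fastq))
print("whole-file bz2:     ", len(bz2.compress(fastq)))
print("three-stream bz2:   ", sum(len(b) for b in blobs))
print("round trip lossless:", decompress_streams(blobs) == fastq)
```

Separating the streams lets the compressor see names, bases, and qualities as three more homogeneous inputs. Whether that beats compressing the whole file depends on the data, so the script prints both sizes rather than assuming a winner.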


    

Other interesting results

• Matt Mahoney (Dell) submitted a specialised version of the standard tool paq which performed extremely well.
• Even vanilla paq wasn’t too bad.
• Discarding the quality scores entirely gets a compression ratio of 2.87% vs. the original FASTQ (not FASTA).
• If this contest truly represented the latest and greatest ideas in the field, then NGS storage must therefore either be
  – highly compressed, very slow access,
  – or less compressed, relatively fast access.
• It’s quite hard to beat bzip2.
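The quality-discarding figure is measured against the size of the original FASTQ, not against a FASTA file that never held qualities. A minimal sketch of that calculation follows. It assumes four-line records, keeps the read names alongside the bases (the slides do not say whether names were kept), and uses `bz2` as a stand-in compressor. The function names are hypothetical, and the step is lossy by design, so it would not have met the contest's 100% accuracy target.

```python
import bz2


def drop_qualities(fastq: bytes) -> bytes:
    """Keep only the name and base lines of each four-line FASTQ record."""
    lines = fastq.decode("ascii").splitlines()
    kept = []
    for i in range(0, len(lines) - 3, 4):
        kept += [lines[i], lines[i + 1]]
    return ("\n".join(kept) + "\n").encode("ascii")


def ratio_vs_original(fastq: bytes) -> float:
    """Compressed quality-free size as a percentage of the ORIGINAL FASTQ size."""
    packed = bz2.compress(drop_qualities(fastq))
    return 100.0 * len(packed) / len(fastq)


# Two hypothetical records, repeated to give the compressor something to work on.
sample = b"@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nFFFF\n"
print(drop_qualities(sample))
print(ratio_vs_original(sample * 5000))
```

Because the denominator still includes the quality lines, the percentage reflects both the data that was thrown away and the compression of what remains.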


    

And unexpected benefits

James Bonfield donated his entire prize fund (US$15,000) to charity.

• 50% to the Wellcome Trust Sanger Institute.
• 50% to the British Heart Foundation.



David Flanders (Eagle CEO) and John Wise (Pistoia chairman) present James Bonfield with his prize.


    


• Formal paper being written at the moment by James Bonfield
  – in collaboration with close-second Matt Mahoney
  – and judge Nick Lynch
  – and the authors of other significant entries.
• Source code of ALL entries is available at www.sequencesqueeze.org
  – all under BSD licence
  – all hosted at SourceForge or similar
  – click entry names to be taken to download page.
• Interviews with entrants at the Pistoia blog www.pistoiaalliance.org/blog
  – search for articles with the tag ‘compression algorithms’.


    

Why did Pistoia do this?


    


• Encouraging innovation through prize-backed contests.
• Open innovation model allows industry to state its requirements
  – then let the free market decide how to deliver something that satisfies these.


    


• Typical bioinformatics open-source hackers do things because they enjoy them
  – but sometimes also because of the challenge, the kudos, the satisfaction of solving a real-world problem.
• James’ charity donation is a great example of this
  – he wasn’t in it for the money
  – but the prize fund created a tangible goal to aim at.
• Amazon kindly sponsored vouchers for all participants that should have covered the cost of developing and submitting an entry
  – contest was AWS-based
  – entries had to be submitted as S3 buckets.


    


• Leaderboard encouraged competition
  – one-upmanship
  – innovation.
• Does not discourage collaboration
  – James and Matt both discussed their entries with the data compression community at encode.ru


    


• BSD-licence requirement ensured that the winning entry was not going to be available only to those willing to pay a fee.
• Entire community benefits, not just Pistoia members or those with deep pockets to pay for software licence agreements.


    

Why is this good for BOSC delegates?


    


• If the entries had been closed/commercial then only organisations willing to pay to license/buy the resulting products would benefit.
• But this way the entire community benefits from results, for free, without restriction.
• Beneficiaries include big pharma and other large corporations that commissioned the contest
  – but also all universities
  – all non-profits
  – all small businesses in biotech
  – and everyone else involved in NGS work.
• Pistoia is about pre-competitive alliance
  – there is no reason to make the Alliance’s output exclusive
  – they are there to develop and share ideas, not to build an empire.


    

Will it happen again?


    


• Pleased with outcome and level of interest.
• So, yes.
• Goal is to run two such contests a year.
• But, your community needs you!
  – we need a topic/subject/idea that can be rationally/objectively judged/ranked
  – and that is relevant to the research activities of life science companies and other Pistoia members.
• Ideas can be sent to Pistoia Ops team c/o [email protected]


    


• Pistoia Alliance for the idea and funding.
• Eagle for organising and administering.
• All contestants for entering.
• 1000 Genomes for the test data.
• AWS for sponsoring participants.
• BOSC/OBF for accepting this talk.


    

      

