©Eagle Genomics Ltd
Pistoia Alliance Sequence Squeeze: Using a competition model to spur development of novel open-source algorithms
Richard Holland (Eagle/Pistoia), Nick Lynch (AZ/Pistoia)
BOSC July 2012
Order of Service
• What/who is the Pistoia Alliance?
• What is/was Sequence Squeeze?
• Who won, how, and why?
• Why did Pistoia do this?
• Why is this good for BOSC delegates?
• Will it happen again?
July 14, 2012 2 Pistoia Alliance Sequence Squeeze
What/who is the Pistoia Alliance?
Who is Pistoia?
• The Pistoia Alliance is – global – not-for-profit – precompetitive alliance – life science companies, vendors, publishers, and academic groups – aims to lower barriers to innovation – by improving the interoperability of R&D business processes.
• We differ from standards groups because – we bring together the key constituents to identify the root causes that lead to R&D inefficiencies – develop best practices and technology pilots to overcome common obstacles.
What is/was Sequence Squeeze?
The NGS problem
• Storing millions of NGS reads and their quality scores uncompressed is impractical, yet current compression technologies are becoming inadequate.
• There is a need for a new and novel method of compressing sequence reads and their quality scores in a way that preserves 100% of the information whilst achieving much-improved linear (or, even better, non-linear) compression ratios.
What was Sequence Squeeze?
• Contest to find a better FASTQ compression algorithm – easiest format for ranking entries in an automated setting.
• Open source, non-restrictive licence required for entries – benefit the whole community.
• Entries tested on an extract of the 1000 Genomes data stored in AWS.
• Prize fund of US$15,000 to the best algorithm submitted before the closing date of 15 March 2012.
• Winner was announced at the Pistoia Alliance Conference in Boston MA on 24 April 2012 – more on that story later.
• Organised and administered by Eagle under contract to Pistoia.
Who entered?
• 108 distinct entries.
• But all these from only 12 entrants! – some entrants were groups or consortia but most were individuals.
• Public leaderboard encouraged fiercer competition.
• Entrants seemingly driven to outdo their competitors.
Who judged?
• Yingrui Li – Duty Operation Officer of Science & Technology Department of the BGI-Shenzhen.
• Nick Lynch – President of the Pistoia Alliance (2009-11).
• Guy Coates – leader of the Informatics Systems Group at the Wellcome Trust Sanger Institute.
• Tim Fennell – Assistant Director for Sequencing Pipeline Informatics at the Broad Institute.
Who won, how, and why?
What were the results?
• Entrants were judged by – compression ratio – compression time and memory – decompression time and memory – accuracy (lossiness – 100% target) – manual review for code quality, scalability, and other factors.
• The same three people showed up at the top of every category – in a different order – with different versions of their entries.
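The automated criteria above can be sketched in a few lines. This is a minimal illustration, not the contest's actual scoring harness: `make_fastq`, `timed`, and `score_codec` are hypothetical names, the test data is synthetic, bzip2 stands in for an entry, and the ratio is taken as compressed size as a percentage of the original, which is how the figures in these slides appear to be expressed.

```python
import bz2
import random
import time
import tracemalloc


def make_fastq(n_reads=2000, read_len=50, seed=0):
    """Build a small synthetic FASTQ file (hypothetical data, not contest data)."""
    rng = random.Random(seed)
    lines = []
    for i in range(n_reads):
        seq = "".join(rng.choice("ACGT") for _ in range(read_len))
        qual = "".join(chr(33 + rng.randint(20, 40)) for _ in range(read_len))
        lines += [f"@read{i}", seq, "+", qual]
    return ("\n".join(lines) + "\n").encode("ascii")


def timed(func, data):
    """Run func(data) and return (result, seconds, peak traced bytes).

    tracemalloc only sees allocations made through Python's allocator,
    so the memory figure is a rough proxy, not true process memory.
    """
    tracemalloc.start()
    start = time.perf_counter()
    result = func(data)
    seconds = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, seconds, peak


def score_codec(compress, decompress, data):
    """Collect ratio, time, memory, and losslessness for one codec."""
    packed, c_seconds, c_peak = timed(compress, data)
    restored, d_seconds, d_peak = timed(decompress, packed)
    return {
        "ratio_percent": 100.0 * len(packed) / len(data),
        "compress_seconds": c_seconds,
        "decompress_seconds": d_seconds,
        "compress_peak_bytes": c_peak,
        "decompress_peak_bytes": d_peak,
        "lossless": restored == data,  # the 100% accuracy target
    }


if __name__ == "__main__":
    for key, value in score_codec(bz2.compress, bz2.decompress, make_fastq()).items():
        print(f"{key}: {value}")
```

Ranking each metric separately, rather than combining them into one score, is consistent with different versions of the same entry topping different categories.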
Who won, and why?
• James Bonfield won overall – majority of top places in each category – using various versions of his entry – forming a suite of suitable tools.
• 11.41% compression ratio (test data ~6GB) – or 109.90 seconds compression time – or 100.91 seconds decompression time – or 35.76MB compression memory usage – or 16.01MB decompression memory usage – but not all at once!
Implications of winning entry
• The approach is very simple – essentially: – convert the FASTQ to BAM alignments against a reference genome, preserving quality scores. – compress the BAM files.
• Many other entries followed the same pattern: – convert to some other format then compress using standard techniques.
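The general "convert, then compress with standard techniques" pattern can be illustrated with a much simpler conversion than alignment to a reference. The sketch below is not any entrant's method: it only separates a FASTQ file into name, sequence, and quality streams and compresses each with bzip2. The function names are hypothetical, and it assumes strict four-line records, a bare "+" separator line, and a trailing newline; input that breaks those assumptions would not round-trip.

```python
import bz2


def split_streams(fastq):
    """Split FASTQ bytes into (names, sequences, qualities) lists of strings."""
    lines = fastq.decode("ascii").splitlines()
    if len(lines) % 4 != 0:
        raise ValueError("expected four lines per FASTQ record")
    return lines[0::4], lines[1::4], lines[3::4]


def pack(fastq):
    """Convert to three homogeneous streams, then compress each one."""
    return [
        bz2.compress("\n".join(stream).encode("ascii"))
        for stream in split_streams(fastq)
    ]


def unpack(blobs):
    """Decompress the three streams and interleave them back into FASTQ."""
    names, seqs, quals = (
        bz2.decompress(blob).decode("ascii").split("\n") for blob in blobs
    )
    records = []
    for name, seq, qual in zip(names, seqs, quals):
        records += [name, seq, "+", qual]  # assumes a bare "+" separator
    return ("\n".join(records) + "\n").encode("ascii")


if __name__ == "__main__":
    sample = b"@r1\nACGT\n+\nIIII\n@r2\nGGCA\n+\nFFFF\n"
    blobs = pack(sample)
    print("round trip ok:", unpack(blobs) == sample)
```

Grouping like with like gives a general-purpose compressor more uniform input per stream; whether that beats compressing the whole file depends on the data and is not claimed here.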
Other interesting results
• Matt Mahoney (Dell) submitted a specialised version of the standard tool paq which performed extremely well.
• Even vanilla paq wasn’t too bad.
• Discarding the quality scores entirely gets a compression ratio of 2.87% vs. the original FASTQ (not FASTA).
• If this contest truly represented the latest and greatest ideas in the field, then NGS storage must therefore either be – highly compressed, very slow access, – or less compressed, relatively fast access.
• It’s quite hard to beat bzip2.
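The effect of discarding quality scores can be explored with a small experiment. This is a sketch on synthetic data using bzip2, with hypothetical function names; it measures both ratios against the original FASTQ size, as the bullet above does, but the numbers it prints say nothing about the contest data or the 2.87% figure.

```python
import bz2
import random


def make_fastq(n_reads=1000, read_len=50, seed=1):
    """Build synthetic FASTQ bytes (hypothetical data, random bases and qualities)."""
    rng = random.Random(seed)
    lines = []
    for i in range(n_reads):
        seq = "".join(rng.choice("ACGT") for _ in range(read_len))
        qual = "".join(chr(33 + rng.randint(20, 40)) for _ in range(read_len))
        lines += [f"@read{i}", seq, "+", qual]
    return ("\n".join(lines) + "\n").encode("ascii")


def drop_qualities(fastq):
    """Keep only the name and sequence lines of each four-line record (lossy)."""
    lines = fastq.split(b"\n")
    kept = []
    for i in range(0, len(lines) - 1, 4):
        kept += lines[i:i + 2]
    return b"\n".join(kept) + b"\n"


def ratios(fastq):
    """Return (lossless %, quality-free %), both relative to the original FASTQ size."""
    full = 100.0 * len(bz2.compress(fastq)) / len(fastq)
    stripped = 100.0 * len(bz2.compress(drop_qualities(fastq))) / len(fastq)
    return full, stripped


if __name__ == "__main__":
    full, stripped = ratios(make_fastq())
    print(f"with qualities:    {full:.2f}% of original")
    print(f"without qualities: {stripped:.2f}% of original")
```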
And unexpected benefits
James Bonfield donated his entire prize fund – US$15,000 – to charity.
50% to the Wellcome Trust Sanger Institute.
50% to the British Heart Foundation.
David Flanders (Eagle CEO) and John Wise (Pistoia chairman) present James Bonfield with his prize.
Publication
• Formal paper being written at the moment by James Bonfield – in collaboration with close-second Matt Mahoney – and judge Nick Lynch – and the authors of other significant entries.
• Source code of ALL entries is available at www.sequencesqueeze.org – all under BSD licence – all hosted at SourceForge or similar – click entry names to be taken to download page.
• Interviews with entrants at the Pistoia blog www.pistoiaalliance.org/blog – search for articles with the tag ‘compression algorithms’.
Why did Pistoia do this?
Why did Pistoia do this?
• Encouraging innovation through prize-backed contests.
• Open innovation model allows industry to state its requirements – then let the free market decide how to deliver something that satisfies these.
Why did Pistoia do this?
• Typical bioinformatics open-source hackers do things because they enjoy them – but sometimes also because of the challenge, the kudos, the satisfaction of solving a real-world problem.
• James’ charity donation is a great example of this – he wasn’t in it for the money – but the prize fund created a tangible goal to aim at.
• Amazon kindly sponsored vouchers for all participants that should have covered the cost of developing and submitting an entry – contest was AWS-based – entries had to be submitted as S3 buckets.
Why did Pistoia do this?
• Leaderboard encouraged competition – one-upmanship – innovation.
• Does not discourage collaboration – James and Matt both discussed their entries with the data compression community at encode.ru
Why did Pistoia do this?
• BSD-licence requirement ensured that the winning entry was not going to be available only to those willing to pay a fee.
• Entire community benefits, not just Pistoia members or those with deep pockets to pay for software licence agreements.
Why is this good for BOSC delegates?
Why is this good for BOSC delegates?
• If the entries had been closed/commercial then only organisations willing to pay to license/buy the resulting products would benefit.
• But this way the entire community benefits from results, for free, without restriction.
• Beneficiaries include big pharma and other large corporations that commissioned the contest – but also all universities – all non-profits – all small businesses in biotech – and everyone else involved in NGS work.
• Pistoia is about pre-competitive alliance – there is no reason to make the Alliance’s output exclusive – they are there to develop and share ideas, not to build an empire.
Will it happen again?
Will it happen again?
• Pleased with outcome and level of interest.
• So, yes.
• Goal is to run two such contests a year.
• But, your community needs you! – we need a topic/subject/idea that can be rationally/objectively judged/ranked – and that is relevant to the research activities of life science companies and other Pistoia members.
• Ideas can be sent to Pistoia Ops team c/o [email protected]
Credits
• Pistoia Alliance for the idea and funding.
• Eagle for organising and administering.
• All contestants for entering.
• 1000 Genomes for the test data.
• AWS for sponsoring participants.
• BOSC/OBF for accepting this talk.
[email protected] (ideas to: [email protected] )
+44 (0)1223 654481 x3
www.pistoiaalliance.org www.sequencesqueeze.org www.eaglegenomics.com
facebook.com/eaglegenomics blog.eaglegenomics.com
www.pistoiaalliance.org/blog
@eaglegen @sequencesqueeze
@pistoiaalliance
Eagle® is a registered trademark no. 010418135 of Eagle Genomics Ltd. Postal address: Eagle Genomics Ltd., Babraham Research Campus, Cambridge CB22 3AT, United Kingdom.