Date posted: 16-Aug-2015
Category: Technology
Uploaded by: andreas-olofsson
What I Learned Building a Parallel Processor from Scratch
ISPASS, Philadelphia
March 31, 2015
My 2007 “Epiphany”
It's more efficient to design a pipeline of specialized processors than to have a single massive general-purpose processor run a complete application.
DUH!
My Background
8 years designing DSPs
100 people, $100M R&D
TigerSHARC was a tech success, but a huge financial flop!
Mixed signal SOCs for CCDs
Invented new ISA in 2 weeks
3-4 designers per derivative
Huge success (>$100M)
My 2008 Processor Design Goals
● ANSI-C programmable
● Floating point (industrial + medical markets)
● Scalable
● 10X boost in GFLOPS/Watt (otherwise, who would buy?)
● Simple enough to be built by < 5 engineers
The Epiphany Multicore Architecture
[Diagram: 2D mesh of nodes with corners labeled (0,0) and (3,3). Each node: RISC core (1 GHz), SRAM, NOC, Data Mover]
● ANSI-C programmable
● IEEE 32-bit floating point
● Nothing shared
● Distributed memory model
● Scalable to “infinity”
● 50 GFLOPS/W
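One way to picture the distributed memory model on the mesh is a flat address in which some upper bits select a core by (row, column) and the remaining bits select a byte in that core's local SRAM. The sketch below is a hypothetical illustration only: the 6/6/20 bit split and the function names are assumptions, not details given in these slides.

```python
# Hypothetical sketch: the bit widths below are assumed for illustration,
# not taken from the slides.
ROW_BITS, COL_BITS, OFFSET_BITS = 6, 6, 20


def global_address(row, col, offset):
    """Pack a core coordinate and a local byte offset into one flat address."""
    if not (0 <= row < 2 ** ROW_BITS and 0 <= col < 2 ** COL_BITS):
        raise ValueError("core coordinate out of range")
    if not 0 <= offset < 2 ** OFFSET_BITS:
        raise ValueError("local offset out of range")
    return (row << (COL_BITS + OFFSET_BITS)) | (col << OFFSET_BITS) | offset


def decode(address):
    """Split a flat address back into (row, col, local offset)."""
    offset = address & ((1 << OFFSET_BITS) - 1)
    col = (address >> OFFSET_BITS) & ((1 << COL_BITS) - 1)
    row = (address >> (COL_BITS + OFFSET_BITS)) & ((1 << ROW_BITS) - 1)
    return row, col, offset


if __name__ == "__main__":
    addr = global_address(3, 3, 0x100)  # a byte in the far-corner core
    print(hex(addr), decode(addr))
```

Under this assumed layout, a load or store needs no cache or coherence logic to find its target: the address alone names the owning core, which fits the "nothing shared" bullet.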
20/20 Decision Report Card
✔ NO HW cache
✔ NO HW coherence
✔ Tiny custom RISC ISA
✔ Floating point
✔ 2D Mesh
✔ C-programmability
✔ Distributed memory
✔ 64-bit LDR/STR
✔ 64 registers
✔ 4 bank local memory
✔ 5+3 dual issue pipeline
✔ Single flit NOC
✔ Byte addressability

✔ HW loops
✔ Autoincrement LDR/STR
✔ Slow reads
✔ Fetch from external
✔ NO 64-bit FPU
✔ NO 64-bit addressing

✔ 32KB of memory
✔ NO native integer mult
✔ NO DRAM interface
My Development Process
● One person owns the whole path from arch to layout
● Spreadsheet (triangulating architectures)
● Emacs/VI (does code look clean on paper?)
● Manually “count” gates in code. “Feel complexity”
● Verilog Simulation (checks for logical/mental mistakes)
● Synthesize design (in FPGA or EDA to measure cost!!)
● Use advisors (peer review to look for gotchas)
● Application development (No but I should have!!)
(*Considering the lack of benchmarks, it's a small miracle Epiphany works as well as it does.)
Step 1: Study History - 3 months
My inspiration for what to do (and NOT to do)
● Hennessy and Patterson's classic books. (Re)Read them!
● Blackfin, TigerSHARC, and C64x (DSP features)
● ARM and MIPS (standard features)
● Tilera and Ambric (manycore features)
GOTCHA: Watch out for “clever” features! The really useful features will be common to all architectures.
Step 2: Create a baseline - 3 months
● How much area goes to computation?
● How expensive is SRAM and cache?
● What is the $ cost of IP and design effort?
● How expensive is the network?
● What is the performance gain of instruction X in app Y?
● How often is instruction X used?
● What does a “typical” application look like?
These numbers are your anchor, get them right!
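The two instruction questions in this baseline (how often is instruction X used, and what does it gain in app Y) can be combined into a single whole-application estimate with Amdahl's law. The sketch below is a minimal illustration under stated assumptions: the function name and the example numbers are hypothetical and are not measurements from the talk.

```python
def overall_speedup(fraction, local_speedup):
    """Amdahl's law: whole-app speedup when `fraction` of the runtime
    becomes `local_speedup` times faster and the rest is unchanged."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    if local_speedup <= 0.0:
        raise ValueError("local_speedup must be positive")
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)


if __name__ == "__main__":
    # Hypothetical inputs: share of runtime the new instruction touches,
    # and how much faster that share becomes.
    for fraction, gain in [(0.05, 4.0), (0.50, 4.0)]:
        print(f"{fraction:.0%} of runtime, {gain}x faster there: "
              f"{overall_speedup(fraction, gain):.3f}x overall")
```

A feature that looks impressive in isolation can contribute very little when the code it accelerates is a small share of a "typical" application, which is one reason the baseline numbers matter.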
Step 3: Verify assumptions - 6 months
● Design and simulate (to verify cost assumptions)
● Build prototypes (to verify simulation assumptions)
● Talk to customers (to verify product feature assumptions)
Epiphany (-1)
● May 2008
● Naive but close...
● Could have had 64 cores in 2008!
● Epiphany arch has only changed by “delta” since 2008
● Submitted many grant proposals, papers (all rejected)
Epiphany-0
● Oct 2008 synthesis and layout prototype
● Virtual prototypes (synthesis and layout of cores)
● Showed that 1 GHz operation and 50 GFLOPS/W was feasible
● Gave me the courage to ask friends and family for money ($200K)
Epiphany-I
● Jun 2009 tapeout
● 16 cores, 65nm prototype
● ~1 person, 12 weeks
● Changes 7 days after TO!
● Barely worked but proved it was possible
● ~10mm^2
Raised capital (Nov 2009)
Series-A
● $1.5M from BittWare, a DSP board company (a TigerSHARC customer)
● They invested because their customers were interested
● Complicated deal
● 6 tranches x $250K
Building a Team (Dec, 2009)
Oleg Raikhman: DV/Tools
Roman Trogan: Design/Layout
Andreas Olofsson: Arch/Design
Lesson: Team makeup is the #1 determinant of success. You must have it in place before you set out on the journey.
Epiphany-II
● May 2010 tapeout
● 16 cores, 65nm
● ~3 people, ~6 weeks
● Changes 1 day before TO
● A vendor screwed up!
● Should have been a product!
● ~12mm^2
It Works!
● Sept 2010 bringup
● Ran floating point program!
● Proved efficiency!
● Packaging yield was less than 10% on eLink (vendor!!).
● SPI debug channel was key to debugging vendor packaging issue
Epiphany-III
● Dec 2010 tapeout
● 16 cores, 65nm
● This is a product!
● Kept improving with every iteration
● RTL changes 1 day before TO
● ~25 GFLOPS/W
● ~12mm^2
New Packaging
● 60um staggered pitch is not trivial!
● Vendor failure on E1 and E2 almost broke the company!
● New vendor (Kyocera)
● You can't be on the bleeding edge without the right partners!
● Better results!
First Delivery
● May 2011
● Chip worked perfectly out of the box!
● This silicon became the E16G301
● Felt at peace
● Happy moment, but only the beginning...
Embedded Systems Conference
● May, 2011
● Launched Epiphany-III
● Lots of press!
● Competitors curious
● But where were the customers!!??
● Flew home depressed and pissed off...
First sale!
● June 2011
● A $5,000 pitiful eval kit
● To a big competitor...
● They never even turned it on...
● ...which was probably a good thing.
Epiphany-IV
● Aug 2011 tapeout
● (Jul 2012 samples)
● 64 cores, 28nm
● 50 GFLOPS/W
● RTL changes 2 days before TO
● 3 engineers, 12 weeks!
First Board Product
● May, 2012 ship
● FMC board from BittWare
● 4 chip array
● Expensive and not a solution
● Niche market
● Never shipped in volume
Software
● 2012 focus
● Invested heavily in software tools + demos
● Face recognition, FFT, Matmul
● Beat ARM by 10X on demo for BIG smartphone company!
● Response: “Meh...”
Parallella
● Sept, 2012
● Running out of money...
● We needed a parallel community and a better kit, so the $99 Parallella was born.
● Kickstarter (KS) combines pre-purchase, community, and marketing.
The Parallella Computer
● An open parallel computer
● Launched in 2012 at $99
● Open source SW/HW!
● Dual-core ARM A9 processor
● FPGA logic
● 1GB RAM, USB, HDMI, GigE
● 16/64 Epiphany coprocessors
● 50 Gbit/sec IO, 25/100 GFLOPS
● <5 Watts
● 10,000 shipped (20,000 built)
● 200 Universities
OPTIMISTIC & STUBBORN
[Chart labels: Naive Optimism, Birthday, Bottomed Out, Pure Stubbornness, Relaunch]
Success?
● Nov 2012
● Woke up next morning stressed and behind schedule.
● Only sell what you have!!
Powerup (Gen0)
● May 2013
● My first board and it worked!
● Only minor issues
● Challenge was super aggressive design targets (credit card, features, cost, schedule)
Gen0 Release
● July 2013
● Gen0 board released!
● We built a working cluster with 42 boards!
● Sent out 50 boards to KS backers, only ~1 got significant use!
● Why?
Chips are back!
● Aug 2013
● Full mask tapeout
● New Tier-1 package vendor (ASE).
● ~90% yield!
● Good thermals
● How to test 10,000 chips with $0 NRE?
New Investment...
● Dec 20, 2013
● $3.6M from Ericsson + VC
● Saved the company... but came in too late to save the eng team...
● ...and we were still buried in KS commitments and logistics issues...
● 5,000 customers waiting for us to deliver
2014 Distractions: Manufacturing, logistics, T-shirts, cases, logistics, power supplies, accessories... what is Adapteva about again?
Shipping HW
● June 2014
● Shipped to 200 Universities & 10,000 developers
● Built the “A1”, the world's densest cluster and launched at ISC14
● Making progress...
Now what?
[Diagram labels: Faulty cores; Cooperating Program (messages)]
NODE: 64KB, 1GHz RISC core, 2 GFLOPS/core
Physical constraints (give or take...):
● 1.5ns/hop latency
● 10pJ / FLOP
● 30pJ / off chip read/write
● 10pJ / on chip end2end
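The per-hop latency and per-FLOP energy above translate into two quick back-of-the-envelope numbers: how long a message takes to cross the mesh, and what an energy budget implies in GFLOPS/W. The sketch below is only an illustration. It assumes a message takes the shortest (Manhattan) path on the 2D mesh, which the slides do not state, and all function names are hypothetical.

```python
NS_PER_HOP = 1.5    # from the slide: 1.5ns/hop latency
PJ_PER_FLOP = 10.0  # from the slide: 10pJ / FLOP


def mesh_hops(src, dst):
    """Hop count between two (row, col) nodes, assuming shortest-path
    (Manhattan) routing on a 2D mesh. The routing policy is an assumption."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


def hop_latency_ns(src, dst, ns_per_hop=NS_PER_HOP):
    """Network latency for one message, counting hop delay only."""
    return mesh_hops(src, dst) * ns_per_hop


def gflops_per_watt(pj_per_flop):
    """Convert energy per operation to efficiency.
    1 pJ/FLOP equals 1e12 FLOP/J, i.e. 1000 GFLOPS/W."""
    return 1000.0 / pj_per_flop


if __name__ == "__main__":
    corner_to_corner = hop_latency_ns((0, 0), (3, 3))
    print(f"(0,0) -> (3,3): {mesh_hops((0, 0), (3, 3))} hops, "
          f"{corner_to_corner} ns")
    print(f"{PJ_PER_FLOP} pJ/FLOP -> {gflops_per_watt(PJ_PER_FLOP)} GFLOPS/W")
```

Read the other way, the 50 GFLOPS/W figure quoted elsewhere in the talk corresponds to a budget of 20 pJ per FLOP for everything on the chip, not just the arithmetic.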
Some Key Hardware Lessons
● Designing leading edge chips is fairly easy today for a small team with the right experience (<<$100M)
● Designing boards is easier than designing chips but not trivial
● One bad choice can break a startup (99/100 is a failing grade)
● Never push the tools, process, or capabilities of your vendor
● Architects should be vertically integrated
● Symmetry is a very powerful design principle
● Your product is only as strong as your weakest vendor partner
● Tiled architectures are MAGICAL
Some Software Lessons
● Customers take the path of least resistance
● Developers take the path of least resistance
● Surprisingly hard to find applications that need more performance...
...and those applications are really complex...
...meaning they have tons of legacy stack software...
● SW trumps energy efficiency in 99.99% of applications
● Parallel programming still very much a niche
MY FINAL LESSON: (took me 7 years to learn)
IT'S THE SOFTWARE, STUPID!!!!
..so we keep pushing the ball up hill
The Parallel Architectures Library
● A new “standard library” for parallel
● Compact C library with optimized routines for vector math, synchronization, and multi-processor communication.
● Designed to be portable across multiple ISAs
● Open source (Apache 2.0 permissive license)
● Open invitation to participate!!
● https://github.com/parallella/pal
What I (still) Know to be True...
● The world will continue to be driven by $$
● Moore's law WILL come to an end
● Parallel computing is inevitable
● Architectures like Epiphany are the future
● CPUs, FPGAs, and manycore will coexist
● Never doubt the ingenuity of an engineer
● It's a dirty job but someone has to do it...