©By Roy Messinger
Xilinx (UltraScale) vs. Altera (ARRIA 10)
Test Bench
By Roy Messinger
www.HWDebugger.com
1 GENERAL
In this document I present a thorough comparison I conducted between two FPGAs from
different vendors' families: Altera ARRIA 10 and Xilinx UltraScale Kintex.
The comparison focuses on frequency, utilization, power and compilation time. I carried
out this comparison to find the vendor best suited to my needs. I did not give any
'discounts' to either vendor: all tests were identical, with exactly the same code and
the same software settings.
See the important notes on the last page for further information.
2 WHAT I CHECKED:
• Frequency.
• Utilization.
• Thermal power.
• Compilation time.
3 FPGA COMPONENTS
I chose these FPGAs because they are two similar components in terms of RAM, size and
various other characteristics.
Component | System logic [k] | RAM [Mb] | PCIe Gen 3 | Transceivers | I/O
Altera GX480 (10AX048K1F35E1HG) | 629 | 28 | 2 × 8 lanes | 36 | 396
Xilinx KU035 (XCKU035-1FFVA1156C) | 444 | 25 | 2 × 8 lanes | 16 | 520
4 TEST BENCH METHODOLOGY
How did I carry out the comparison?
• For the comparison I used a VHDL state-machine component (about 20 states). This
FSM implements fairly heavy logic and runs at 400MHz.
• I designed two small projects containing only this component, one in Altera
(Quartus) and one in Xilinx (Vivado).
• After each successful compilation, I checked the timing analysis and replicated
the component to push the FPGA to its limits (area, frequency).
• I used virtual pins on all components, so the component ports did not need to be
connected to FPGA pins (no connection to I/O buffers).
• I did not change anything in either tool: I left the default synthesis/implementation
settings as they were.
[Figure: test flow. Compile in Vivado & Quartus → passes timing req.? Yes: replicate component and compile again. No: compare to second vendor. Inset: the component (Comp.) inside the FPGA, with its ports on virtual pins.]
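The flow above is essentially a loop: compile, check timing, replicate, repeat. The Python sketch below models it under loud assumptions: `compile_fmax` is a hypothetical callback standing in for a full Quartus or Vivado run that returns the achieved frequency in MHz, and no real tool API is called. Because the Test A results are not monotonic (one count can fail and the next one pass), the sketch can optionally keep sweeping after the first failure instead of stopping.

```python
from typing import Callable, List, Tuple


def replication_sweep(
    compile_fmax: Callable[[int], float],
    target_mhz: float,
    counts: List[int],
    stop_on_fail: bool = True,
) -> Tuple[List[Tuple[int, float, bool]], int]:
    """Compile the design for each replication count and record timing results.

    compile_fmax is a hypothetical stand-in for a real Quartus/Vivado run:
    it takes a replication count and returns the achieved frequency in MHz.
    Returns (history, last_pass): history holds (count, fmax, passed) tuples,
    last_pass is the last count that met timing (0 if none did).
    """
    history: List[Tuple[int, float, bool]] = []
    last_pass = 0
    for n in counts:
        fmax = compile_fmax(n)
        passed = fmax >= target_mhz  # "Passes timing req.?" in the flowchart
        history.append((n, fmax, passed))
        if passed:
            last_pass = n
        elif stop_on_fail:
            break  # "No" branch: stop and compare with the second vendor
    return history, last_pass


def toy_fmax(n: int) -> float:
    """Toy model only: pretend fmax drops by 2 MHz per extra copy."""
    return 430 - 2 * n


history, last_pass = replication_sweep(toy_fmax, 400, list(range(1, 50)))
print(last_pass, history[-1])  # 15 (16, 398, False)
```

With `stop_on_fail=False` the same function records the full sweep, which matches how the Test A table lists results beyond the first failing count.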
5 TEST BENCH HARDWARE
• Compilation computers (both running Windows 7):
o Altera:
▪ Quartus version 17.0.0.
▪ Intel Xeon E5-2643 @ 3.4GHz, 32GB RAM.
o Xilinx:
▪ Vivado version 2016.4.
▪ Intel Core i7-6700 @ 3.4GHz, 32GB RAM.
• The chosen components have closely matching specifications (for what I need):
o Altera: 10AX048K1F35E1HG; GX480, fastest speed grade.
o Xilinx: XCKU035-1FFVA1156C; KU035, slowest speed grade (see the notes on
the last page).
o Both components have the same package dimensions (35mm × 35mm).
6 TEST RESULTS
I ran 3 sets of tests, named Test A, Test B and Test C.
• This is NOT a real design, but it can compare the performance of both vendors,
since it uses a real component and simulates the phases of HW FPGA development.
The code is the same for both.
• In my view, Test A and Test B are closer to a real-world implementation, as they
define relations between the different instantiations inside the FPGA.
• Test B is intended to push the FPGA to its limit in terms of frequency: neither
vendor reaches this frequency, but both tools are expected to make their best effort.
• I also implemented Test C to ease the vendors' synthesis, optimization and place
& route phases, and to see what happens when there is no relation between the
different instantiations.
• The frequency comparison is between the WNS in Vivado (Worst Negative Slack,
i.e. the worst path of all) and the maximum frequency result in Quartus, which is
based on the setup timing at 100°C in the timing report (again the worst case).
• Both vendor tools use the default settings (no 'best effort' options, etc.).
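Since Vivado reports slack rather than a maximum frequency, the two numbers have to be related. A common conversion, and presumably the one behind the Xilinx frequency columns (the document does not state it explicitly), is f_max = 1000 / (T - WNS), with the clock period T and WNS in ns and f_max in MHz. A minimal Python sketch of both directions:

```python
def fmax_from_wns(target_mhz: float, wns_ns: float) -> float:
    """Achieved frequency [MHz] implied by a target clock and its worst negative slack [ns].

    Positive WNS means timing is met with margin, so the achievable period
    is the target period minus the slack; negative WNS lengthens it.
    """
    period_ns = 1000.0 / target_mhz
    return 1000.0 / (period_ns - wns_ns)


def wns_from_fmax(target_mhz: float, fmax_mhz: float) -> float:
    """Inverse conversion: the slack [ns] that corresponds to a given fmax."""
    return 1000.0 / target_mhz - 1000.0 / fmax_mhz


print(round(fmax_from_wns(500, -0.5), 1))  # 400.0
print(round(wns_from_fmax(400, 423), 3))   # 0.136
```

For example, the 423MHz that Xilinx reached with 4 replications in Test A corresponds to a WNS of about +0.136 ns against the 2.5 ns period of the 400MHz clock.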
Test A, 400MHz: each input is connected to all instantiations, as shown. The internal outputs, obviously, are separate.
[Figure: Test A topology. Shared inputs drive Inst. 1, Inst. 2, Inst. 3 … Inst. 24.]
Test B, 500MHz: each input is connected to all instantiations, as shown. The outputs, obviously, are separate.
[Figure: Test B topology. Shared inputs drive Inst. 1, Inst. 2, Inst. 3 … Inst. 24.]
Two clocks are created for the design in the SDC (Quartus) and XDC (Vivado) constraint files: 100MHz and 400MHz/500MHz.
Test C, 400MHz: each instantiation has its own inputs, as shown. The outputs, obviously, are separate.
[Figure: Test C topology. Inst. 1, Inst. 2, Inst. 3 … Inst. 24, each with dedicated inputs.]
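Both tools accept the `create_clock` command, so the two clocks can be written identically in the SDC and XDC files. The Python sketch below only generates that constraint text; the port names `clk_100` and `clk_fast` are hypothetical, since the document does not give the real ones.

```python
from typing import Dict, List


def make_clock_constraints(clocks_mhz: Dict[str, float]) -> List[str]:
    """Return one create_clock line per clock port (usable in SDC and XDC).

    Keys are port names (hypothetical here), values are frequencies in MHz.
    The period is written in ns with three decimals.
    """
    lines = []
    for port, mhz in clocks_mhz.items():
        period_ns = 1000.0 / mhz
        lines.append(f"create_clock -name {port} -period {period_ns:.3f} [get_ports {port}]")
    return lines


# Tests A and C use 100MHz + 400MHz; Test B raises the fast clock to 500MHz.
for line in make_clock_constraints({"clk_100": 100.0, "clk_fast": 400.0}):
    print(line)
```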
Test A (at 400MHz):
…
These are the results for 400MHz:
General Notes & conclusions for Test A:
a. The same VHDL component was used with exactly the same parameters; the code
is the same.
b. Vivado (Xilinx) compilation times were 20% shorter than Quartus's.
c. Frequency values above 400MHz show the maximum frequency achieved, even
though it was not required.
d. UltraScale (Xilinx) results are much more stable than ARRIA 10 (Altera): they
stay steadily above the 400MHz target frequency until the device can no longer
hold on.
Continuing from note c., I then compared both projects at 500MHz: even though
neither vendor can reach such a high frequency, both tools try their best to reach
the highest frequency they can.
Max. frequency [MHz]
Desired freq. [MHz] | Replicated components | Altera ARRIA 10 | Xilinx UltraScale
400 | 4 | 430 | 423
400 | 5 | 433 | 413
400 | 7 | 417 | 409
400 | 8 | 395 | 411
400 | 9 | 433 | 414
400 | 10 | 403 | 414
400 | 11 | 419 | 411
400 | 12 | 383 | 411
400 | 13 | 401 | 411
400 | 14 | 389 | 410
400 | 15 | 420 | 409
400 | 16 | 409 | 409
400 | 17 | 402 | 410
400 | 18 | 370 | 412
400 | 19 | 316 | 417
400 | 20 | 383 | 420
400 | 25 | 362 | 411
400 | 30 | 364 | 416
400 | 35 | 315 | 410
400 | 37 | 315 | 411
400 | 40 | 315 | 387
400 | 45 | 330 | 392
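Note d can be made quantitative with the numbers from the Test A table. The short Python script below is a sketch; its data lists are copied from that table. It counts how many replication counts met the 400MHz target, finds the first count that failed, and reports the worst frequency.

```python
from typing import List, Optional, Tuple

# Test A data copied from the table: replication counts and max frequency [MHz].
REPS = [4, 5, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 25, 30, 35, 37, 40, 45]
ALTERA = [430, 433, 417, 395, 433, 403, 419, 383, 401, 389, 420,
          409, 402, 370, 316, 383, 362, 364, 315, 315, 315, 330]
XILINX = [423, 413, 409, 411, 414, 414, 411, 411, 411, 410, 409,
          409, 410, 412, 417, 420, 411, 416, 410, 411, 387, 392]
TARGET_MHZ = 400


def summarize(fmax: List[int]) -> Tuple[int, Optional[int], int]:
    """Return (number of passing counts, first failing count or None, worst fmax)."""
    passes = sum(1 for f in fmax if f >= TARGET_MHZ)
    first_fail = next((r for r, f in zip(REPS, fmax) if f < TARGET_MHZ), None)
    return passes, first_fail, min(fmax)


for name, data in (("Altera", ALTERA), ("Xilinx", XILINX)):
    passes, first_fail, worst = summarize(data)
    print(f"{name}: {passes}/{len(REPS)} passed, first failure at {first_fail} copies, worst {worst} MHz")
```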
Test B (at 500MHz):
…
These are the results for 500MHz:
General Notes & conclusions for Test B:
a. Neither vendor could reach 500MHz; nevertheless, UltraScale was well ahead of ARRIA 10 in terms of frequency, area and
compilation time.
b. Regarding logic element usage, there is a fixed 86% ratio between Xilinx and Altera normalized logic usage (Xilinx usage is
lower than Altera's). I used Xilinx's formulas to compare CLBs (LUTs) to ALMs.
c. The ARRIA 10 (Altera) vs. UltraScale (Xilinx) logic usage ratio stays fixed throughout, showing that neither vendor's replication algorithm
changes: logic usage grows linearly as replications increase, which is a good thing when comparing
'apples to apples'.
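Note c says logic usage grows linearly with the number of replications. A least-squares line through the Test B LUT and ALM counts (copied from the Test B table) checks this. The Python sketch reports the slope, i.e. the resources added per replication, and R², where a value close to 1 means a near-perfect straight line.

```python
from typing import List, Tuple

# Test B data copied from the table: replications, Xilinx LUTs, Altera ALMs.
REPS = list(range(18, 32))
XILINX_LUT = [50056, 52825, 55586, 58373, 61158, 63951, 66708,
              69506, 72288, 75087, 77803, 80616, 83418, 86152]
ALTERA_ALM = [38519, 40712, 42715, 44743, 46858, 48995, 51026,
              53197, 55595, 57685, 59877, 62173, 64382, 66394]


def linear_fit(xs: List[float], ys: List[float]) -> Tuple[float, float, float]:
    """Ordinary least squares y = slope * x + intercept; returns (slope, intercept, r_squared)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1.0 - ss_res / ss_tot


for name, ys in (("Xilinx LUT", XILINX_LUT), ("Altera ALM", ALTERA_ALM)):
    slope, _, r2 = linear_fit(REPS, ys)
    print(f"{name}: ~{slope:.0f} per replication, R^2 = {r2:.4f}")
```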
Desired freq. [MHz] | Replicated components | Xilinx achieved frequency [MHz] | Altera achieved frequency [MHz] | Xilinx utilization [%] | Altera utilization [%] | Xilinx utilization [LUT] | Altera utilization [ALM] | Xilinx normalized utilization | Altera normalized utilization | Xilinx/Altera usage [%]
500 | 18 | 471 | 371 | 24.6 | 21 | 50,056 | 38,519 | 87,598 | 102,075 | 86
500 | 19 | 497 | 381 | 26 | 22.2 | 52,825 | 40,712 | 92,444 | 107,887 | 86
500 | 20 | 480 | 316 | 27.4 | 23.3 | 55,586 | 42,715 | 97,276 | 113,195 | 86
500 | 21 | 488 | 341 | 28.7 | 24.4 | 58,373 | 44,743 | 102,153 | 118,569 | 86
500 | 22 | 450 | 392 | 30.1 | 25.5 | 61,158 | 46,858 | 107,027 | 124,174 | 86
500 | 23 | 492 | 341 | 31.5 | 26.7 | 63,951 | 48,995 | 111,914 | 129,837 | 86
500 | 24 | 461 | 362 | 32.8 | 27.8 | 66,708 | 51,026 | 116,739 | 135,219 | 86
500 | 25 | 413 | 312 | 34.2 | 29 | 69,506 | 53,197 | 121,636 | 140,972 | 86
500 | 26 | 459 | 396 | 35.6 | 30.3 | 72,288 | 55,595 | 126,504 | 147,327 | 86
500 | 27 | 450 | 314 | 37 | 31.4 | 75,087 | 57,685 | 131,402 | 152,865 | 86
500 | 28 | 473 | 388 | 38.3 | 32.6 | 77,803 | 59,877 | 136,155 | 158,674 | 86
500 | 29 | 469 | 332 | 39.7 | 33.9 | 80,616 | 62,173 | 141,078 | 164,758 | 86
500 | 30 | 489 | 334 | 41.1 | 35.1 | 83,418 | 64,382 | 145,982 | 170,612 | 86
500 | 31 | 466 | 384 | 42.4 | 36.2 | 86,152 | 66,394 | 150,766 | 175,944 | 86
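Notes b (Test B) and c (Test C) rely on normalizing LUTs and ALMs to a common unit. The document does not spell out the formulas, but back-calculating from the tables, every listed row is consistent with Xilinx normalized = LUT × 1.75 and Altera normalized = ALM × 2.65, then rounded. The Python sketch below applies those inferred factors (treat them as an assumption, not as official vendor numbers) and reproduces the ratio column.

```python
from typing import Tuple

# Factors back-calculated from this document's tables (assumption, not vendor data):
# e.g. 50,056 LUT -> 87,598 and 38,519 ALM -> 102,075 in the first Test B row.
XILINX_LUT_FACTOR = 1.75
ALTERA_ALM_FACTOR = 2.65


def normalized_ratio(xilinx_lut: int, altera_alm: int) -> Tuple[int, int, int]:
    """Return (Xilinx normalized, Altera normalized, Xilinx/Altera ratio in %)."""
    x_norm = round(xilinx_lut * XILINX_LUT_FACTOR)
    a_norm = round(altera_alm * ALTERA_ALM_FACTOR)
    return x_norm, a_norm, round(100 * x_norm / a_norm)


print(normalized_ratio(50_056, 38_519))  # Test B, 18 copies: (87598, 102075, 86)
print(normalized_ratio(83_448, 85_093))  # Test C, 30 copies: (146034, 225496, 65)
```

The different ratios in the two tests (86% vs. 65%) come from the raw LUT/ALM counts, not from the factors, which stay the same.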
Test C (at 400MHz):
…
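Desired freq. [MHz] | Replicated components | Xilinx achieved frequency [MHz] | Altera achieved frequency [MHz] | Xilinx compilation time [mm:ss or hh:mm:ss] | Altera compilation time [mm:ss or hh:mm:ss] | Xilinx utilization [%] | Altera utilization [%] | Xilinx utilization [LUT] | Altera utilization [ALM] | Xilinx normalized utilization | Altera normalized utilization | Xilinx/Altera utilization ratio [%] | Power dissipation Xilinx [W] | Power dissipation Altera [W]
400 | 8 | 410 | 420 | 08:42 | 15:27 | | | | | | | | |
400 | 9 | 411 | 424 | 09:48 | 18:30 | | | | | | | | |
400 | 10 | 412 | 419 | 10:46 | 20:00 | | | | | | | | |
400 | 11 | 409 | 409 | 11:15 | 21:37 | | | | | | | | |
400 | 12 | 410 | 417 | 12:58 | 20:24 | | | | | | | | |
400 | 13 | 414 | 406 | 13:00 | 25:01 | | | | | | | | |
400 | 14 | 409 | 418 | 13:25 | 28:00 | | | | | | | | |
400 | 15 | 410 | 420 | 13:32 | 28:01 | | | | | | | | |
400 | 16 | 418 | 401 | 14:24 | 31:24 | | | | | | | | |
400 | 17 | 408 | 394 | 14:06 | 32:09 | | | | | | | | |
400 | 18 | 419 | 411 | 15:47 | 33:00 | | | | | | | | |
400 | 19 | 410 | 423 | 15:39 | 36:02 | | | | | | | | |
400 | 20 | 411 | 408 | 16:52 | 37:00 | | | | | | | | |
400 | 21 | 420 | 405 | 28:00 | 40:00 | 29 | 32 | | | | | | 1.66 | 3.27
400 | 22 | 409 | 416 | 30:00 | 38:22 | 30 | 34 | | | | | | 1.7 | 3.38
400 | 23 | 408 | 412 | 32:00 | 39:30 | 31 | 36 | | | | | | 1.78 | 3.48
400 | 24 | 418 | 398 | 32:20 | 41:24 | 33 | 37 | | | | | | 1.83 | 3.6
400 | 25 | 420 | 371 | 33:00 | 43:55 | 34 | 39 | | | | | | 1.89 |
400 | 26 | 411 | 411 | 36:00 | 45:48 | 36 | 40 | | | | | | 1.95 | 3.75
400 | 27 | 409 | 410 | 36:00 | 45:40 | 37 | 42 | | | | | | 2 | 4
400 | 28 | 410 | 409 | 40:00 | 50:40 | 38 | 43 | | | | | | 2 | 4
400 | 29 | 411 | 415 | 41:10 | 52:21 | 40 | 45 | | | | | | |
400 | 30 | 409 | 407 | 26:00 | 54:00 | 41 | 46 | 83,448 | 85,093 | 146,034 | 225,496 | 65 | 2.17 | 4.172
400 | 31 | 416 | 406 | 42:00 | 56:29 | 42 | 48 | | | | | | |
400 | 32 | 408 | 407 | 42:00 | 57:44 | 44 | 49 | | | | | | | 5.3
400 | 33 | 414 | 402 | 48:14 | 58:23 | 45 | 51 | 91,761 | 93,598 | 160,582 | 248,035 | 65 | 2.34 | 4.46
400 | 34 | 412 | 404 | 46:30 | 58:44 | 47 | 53 | | | | | | |
400 | 35 | 409 | 404 | 50:00 | 01:01:52 | 48 | 54 | | | | | | |
400 | 36 | 401 | 380 | 47:37 | 01:05:00 | | | | | | | | |
400 | 37 | 401 | 393 | 52:21 | 59:39 | | | | | | | | |
400 | 38 | 408 | 417 | 50:00 | 01:07:02 | | | | | | | | |
400 | 39 | 407 | 334 | 57:30 | 01:10:00 | 53 | 60 | 108,271 | 110,627 | 189,474 | 293,162 | 65 | 2.577 | 4.9
400 | 40 | 409 | 395 | 53:03 | 01:02:00 | | | | | | | | |
400 | 41 | 409 | 408 | 55:00 | 01:11:00 | 56 | 63 | 113,857 | 116,295 | 199,250 | 308,182 | 65 | 2.685 |
400 | 42 | 404 | 359 | 56:55 | 01:01:05 | | | | | | | | |
400 | 43 | 402 | 395 | 58:52 | 01:13:00 | 59 | 66 | | | | | | | 5.25
400 | 44 | 390 | 393 | 01:03:00 | 01:12:00 | 60 | 68 | 122,357 | 124,801 | 214,125 | 330,723 | 65 | 2.846 |
400 | 45 | 410 | 406 | 01:04:00 | 01:19:00 | 62 | 70 | | | | | | 2.9 |
400 | 46 | 404 | 394 | 01:05:01 | 01:22:00 | 63 | 71 | | | | | | 2.95 | 5.457
400 | 47 | 378 | 397 | 01:09:00 | 01:23:00 | 64 | 73 | | | | | | 3.008 | 5.5
400 | 48 | 409 | 371 | 01:06:00 | 01:29:00 | 66 | | | | | | | 3.06 |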
Although the power dissipation figures are not 'real',
because virtual pins are used, the
comparison between the vendors is still
'legal', since both are evaluated
under the same conditions.
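The compilation-time columns mix mm:ss and hh:mm:ss. A small parser makes the per-row comparison behind Vivado's shorter compile times reproducible. In this Python sketch the example pairs are copied from the Test C table, and the function names are only illustrative.

```python
from typing import List


def to_seconds(stamp: str) -> int:
    """Convert 'mm:ss' or 'h:mm:ss' / 'hh:mm:ss' into seconds."""
    parts: List[int] = [int(p) for p in stamp.split(":")]
    if len(parts) == 2:
        parts.insert(0, 0)  # no hour field
    if len(parts) != 3:
        raise ValueError(f"unexpected time format: {stamp!r}")
    hours, minutes, seconds = parts
    return hours * 3600 + minutes * 60 + seconds


def time_saving(xilinx: str, altera: str) -> float:
    """Fraction of the Altera compile time saved by Vivado (0.2 means 20% shorter)."""
    return 1.0 - to_seconds(xilinx) / to_seconds(altera)


# (replications, Xilinx time, Altera time) copied from the Test C table.
rows = [(8, "08:42", "15:27"), (33, "48:14", "58:23"), (48, "01:06:00", "01:29:00")]
for reps, x, a in rows:
    print(f"{reps} copies: Vivado {time_saving(x, a):.0%} shorter")
```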
General Notes & conclusions for Test C:
a. In this test, though less realistic in my view, both vendors can hold more
replications before they fail the timing requirements. Nevertheless, ARRIA 10
(Altera) keeps failing at much earlier points than UltraScale (Xilinx).
b. Xilinx compilation times are about 20% shorter than Altera's.
c. Regarding logic element usage, there is a fixed 65% ratio between Xilinx and
Altera normalized logic usage (Xilinx usage is lower than Altera's). I used
Xilinx's formulas to compare LUTs to ALMs.
d. In this test I also compared thermal power: UltraScale consumes about 50%
less power than ARRIA 10 (meaning less overall heat and less power-supply
current).
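Note d's "about 50%" can be checked against the Test C rows that list both power figures. The Python sketch below uses only those complete pairs (copied from the table; rows with a missing value are skipped) and prints the mean saving together with its range.

```python
from statistics import mean

# (replications, Xilinx [W], Altera [W]) for Test C rows that report both values.
POWER = [
    (21, 1.66, 3.27), (22, 1.7, 3.38), (23, 1.78, 3.48), (24, 1.83, 3.6),
    (26, 1.95, 3.75), (27, 2.0, 4.0), (28, 2.0, 4.0), (30, 2.17, 4.172),
    (33, 2.34, 4.46), (39, 2.577, 4.9), (46, 2.95, 5.457), (47, 3.008, 5.5),
]

# Saving = fraction of the Altera power that the Xilinx device does not use.
savings = [1.0 - x / a for _, x, a in POWER]
print(f"mean saving {mean(savings):.0%}, range {min(savings):.0%} to {max(savings):.0%}")
```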
7 TEST RESULTS SUMMARY
So, overall:
A. When comparing the Altera ARRIA 10 GX480 (F35) to the Xilinx UltraScale KU035
(A1156):
• Compilation time (Xilinx 20% less).
• Frequency (Xilinx was much more stable and reached higher frequencies).
• Thermal power (Xilinx almost 50% less power).
• Utilization (Xilinx-to-Altera normalized logic ratio of 86% in Test B and 65% in Test C).
B. Even when I compared Altera's GX320 to Xilinx's KU035 (a smaller Altera component against the 'same' Xilinx component), the KU035 had better results in all these characteristics. For example, I compiled Altera's GX320, F35 (the same package as Altera's
GX480), which should be 'equal' to Xilinx's KU035, with 44 replications:
Quartus utilization for GX320 with 44 replications, Test C:
Logic utilization (in ALMs): 139,107 / 119,900 (116%)
The compilation failed: not enough space in the device.
Xilinx utilization for KU035 with 44 replications, Test C:
60%.
C. When I compared the ARRIA 10 GX270 to Xilinx's KU035, I got similar results in all
characteristics (I did not check all replications).
Notes:
Two very important points I discovered after conducting this comparison (they
should tip the scale in favor of Intel/Altera, and nevertheless the Xilinx results are
much better):
• The Xilinx FPGA chosen is smaller than the Altera one. This means the Xilinx P&R
algorithm must work harder to reach the desired frequency (since less space is
available). Nevertheless, the Xilinx results are much better.
• The Xilinx FPGA has the slowest speed grade, while the Altera one has the fastest.
This means the Altera results should have been better. Nevertheless, they are much worse.