Overview of Upgrade DAQ
Niko Neufeld, CERN
1st Upgrade DAQ Mini-workshop, May 27th 2013, CERN
Upgrade DAQ Overview - Niko Neufeld 2
Introduction – the task
Assume there are 11448 GBT links for the DAQ to read out (true in a Universe containing a pixel VeLo, a reduced OT and a SciFi)
Rounding up to 12000 links at 3.2 Gbit/s each, the total amount of data is 12000 x 3.2 = 38400 Gbit/s = 4.8 TB/s
  GBT wide-mode is disregarded for this talk
These data need to be collected and filtered
The proposed system should be reliable, operable for 10 years by a small crew, scalable and as cost-effective as possible
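The arithmetic above can be checked with a short sketch. The link counts and the 3.2 Gbit/s per link are taken from the slide; the function and constant names are illustrative only, and 1 TB is taken as 8000 Gbit.

```python
# Aggregate DAQ input bandwidth from the link count and the per-link rate.
# Link counts and the 3.2 Gbit/s figure come from the slide; names are illustrative.

GBT_PAYLOAD_GBIT_S = 3.2  # usable bandwidth per GBT link (wide mode ignored)


def total_bandwidth(n_links, gbit_per_link=GBT_PAYLOAD_GBIT_S):
    """Return (Gbit/s, TB/s) for n_links links, using 1 TB = 8000 Gbit."""
    gbit_s = n_links * gbit_per_link
    return gbit_s, gbit_s / 8000.0


if __name__ == "__main__":
    # Exact link count from the slide, then the rounded figure used for the estimate.
    for n in (11448, 12000):
        gbit_s, tbyte_s = total_bandwidth(n)
        print(f"{n} links: {gbit_s:.1f} Gbit/s = {tbyte_s:.2f} TB/s")
```

With the exact count of 11448 links the total comes out somewhat below the rounded 4.8 TB/s, so the rounded figure leaves a small margin.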
Some constraints
Input links are bundled in units of 24
  Not a magic number, but a good match for current state-of-the-art FPGA technology
With foreseeable technology the required system (network + filter-farm) does not fit into the existing space, cooling and power available
CERN has agreed to provide a new dedicated modular data-centre. This has the additional advantage that installation can in principle start before LS2 (where it makes economic sense)
Also, the upgraded LHCb events will be small: MEP will be needed more than ever (the packing factor will increase)
The topic of this talk
The technologies and technology trends which we can use
I will limit myself here to the baseline: a farm of x86 servers interconnected with an industry-standard local area network, i.e.:
  I do not consider the potential effect of micro-servers (Atom, ARM), GPUs or other co-processors
  This does not mean that we exclude their use, but studying their implications for the DAQ architecture requires a dedicated effort and first needs to be warranted by demonstrated potential gains (also, we need topics for further workshops)
Which architectures are possible with these technologies, and their relative merits
The good things about our current DAQ – What we want to keep
Simple single-stage data-flow
Smallest possible number of hardware components
  It is hard to imagine taking a component away and replacing it with software and/or more work in other elements of the system (a kind of “Nash” equilibrium)
Scalability
What we probably need to change in the Upgrade DAQ
Currently we buffer mostly in the network
  Buffering in the servers looks significantly cheaper, and large buffers leave more options for the event-building protocols used
Currently the packing factor in MEP is determined by the Ethernet MTU (9000 bytes)
  With hardware assists in network ASICs the MTU becomes less relevant; however, reducing the message rate becomes even more important at 10, 20, 30 MHz
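The link between packing factor and message rate can be sketched as follows. The 9000-byte MTU and the 10, 20, 30 MHz event rates are from the slide. The 100-byte fragment size is a purely hypothetical value chosen for illustration, protocol header overhead is ignored, and the function names are not from any existing code.

```python
# Message rate per source as a function of the MEP packing factor.
# ASSUMPTION: the 100-byte fragment size is hypothetical, not taken from the talk.
# ASSUMPTION: header overhead is ignored to keep the sketch short.

MTU_BYTES = 9000  # Ethernet jumbo-frame MTU quoted on the slide


def packing_factor(fragment_bytes, max_message_bytes=MTU_BYTES):
    """Number of event fragments that fit into one message (at least 1)."""
    return max(1, max_message_bytes // fragment_bytes)


def message_rate_hz(event_rate_hz, factor):
    """Messages per second sent by one source when 'factor' events share a message."""
    return event_rate_hz / factor


if __name__ == "__main__":
    fragment = 100  # bytes, hypothetical
    factor = packing_factor(fragment)
    for rate_mhz in (10, 20, 30):
        rate = message_rate_hz(rate_mhz * 1e6, factor)
        print(f"{rate_mhz} MHz events, packing factor {factor}: {rate / 1e3:.0f} kHz messages")
```

If the message size is no longer tied to the MTU, `max_message_bytes` can be raised, which lowers the message rate proportionally.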
Technology
Links, PCs and networks
The evolution of PCs - 1
PCs used to be relatively modest I/O performers (compared to FPGAs); this has changed radically with PCIe Gen3
The Xeon processor line now has 40 PCIe Gen3 lanes per socket
A dual-socket system has a theoretical throughput of 1.2 Tbit/s(!)
  Tests suggest that we can get quite close to the theoretical limit (using RDMA at least)
This is driven by the need for fast interfaces for co-processors (GPGPUs, Xeon Phi)
The CPU will be the bottleneck in the server, not the LAN interconnect: 10 Gb/s is by far sufficient
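A rough check of the 1.2 Tbit/s figure, as a sketch: PCIe Gen3 signals at 8 GT/s per lane with 128b/130b encoding. The slide does not say how the number was derived; counting both directions of all 80 lanes gives a value close to it, which is the assumption made here. Function and constant names are illustrative.

```python
# Theoretical PCIe Gen3 throughput of a multi-socket server.
# ASSUMPTION: the slide's 1.2 Tbit/s counts both directions; this is not stated there.

GEN3_GT_PER_S = 8.0        # raw signalling rate per lane
ENCODING = 128.0 / 130.0   # 128b/130b line-code efficiency


def pcie_gen3_gbit_s(lanes_per_socket, sockets=1, bidirectional=False):
    """Theoretical payload bandwidth in Gbit/s, before protocol overhead."""
    one_way = GEN3_GT_PER_S * ENCODING * lanes_per_socket * sockets
    return 2 * one_way if bidirectional else one_way


if __name__ == "__main__":
    one_way = pcie_gen3_gbit_s(40, sockets=2)
    both = pcie_gen3_gbit_s(40, sockets=2, bidirectional=True)
    print(f"dual socket, one direction:   {one_way:.0f} Gbit/s")
    print(f"dual socket, both directions: {both:.0f} Gbit/s")
```

Transaction-layer overhead reduces the achievable rate further, which is why measured values can only approach this limit.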
The evolution of PCs - 2
More integration (Intel roadmap)
Integrate all important parts of the server in the CPU:
  memory controller, PCIe controller
  NIC (2 – 3 years?)
  physical LAN interface (4 – 5 years?)
Aim for high density, because of
  short distances (!)
  efficient cooling
  less real-estate needed (rack-space)
Ethernet
The champion of all classes in networking
10 Gbit/s, 40 Gbit/s and 100 Gbit/s are out, and 400 Gbit/s is in preparation
As speed goes up, so does power dissipation: high-density switches come with high integration
The market seems to separate more and more into two camps:
  carrier-class: deep buffers, high density, flexible firmware (FPGA, network processors) (Cisco, Juniper, Brocade, Huawei), $$$/port
  data-centre: shallow buffers, ASIC-based, ultra-high density, focused on layer 2 and simple layer 3 features, very low latency, $/port (these are often also called Top Of the Rack, TOR)
These trends get more pronounced as speed goes up, i.e. from 10 Gb/s to 40 Gb/s
The “two” Ethernets
Speed      Core [USD / port]   TOR [USD / port]
10 Gb/s    400 – 1000          200 – 250
40 Gb/s    1000 – 4000         500 – 900
If possible, use fewer core and more TOR ports
  Buffering then needs to be done elsewhere
Prices exclude optics
InfiniBand
Driven by a relatively small, agile company: Mellanox
Essentially HPC + some DB and storage applications
Only competitor: Intel (ex-QLogic)
Extremely cost-effective in terms of Gbit/s / $
Open standard, but almost single-vendor: unlikely for a small startup to enter
Software stack (OFED including RDMA) also supported by Ethernet (NIC) vendors
Many recent Mellanox products (as of FDR) are compatible with Ethernet
InfiniBand prices
Speed      Core [USD / port]   TOR [USD / port]
52 Gb/s    850                 250
Worth some R&D at least
For comparison, Ethernet:
Speed      Core [USD / port]   TOR [USD / port]
10 Gb/s    400 – 1000          200 – 250
40 Gb/s    1200 – 4000         500 – 900
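To make the cost-effectiveness comparison concrete, the port prices on this slide can be converted to USD per Gbit/s. This is only a sketch: the prices are copied from the tables on this slide, optics are excluded as stated earlier, and the dictionary layout and function name are illustrative.

```python
# Price per Gbit/s of port bandwidth, from the per-port prices on this slide.
# Single prices are stored as (value, value) so every entry is a (low, high) range.

PRICES_USD = {
    ("Ethernet", 10): {"core": (400, 1000), "tor": (200, 250)},
    ("Ethernet", 40): {"core": (1200, 4000), "tor": (500, 900)},
    ("InfiniBand", 52): {"core": (850, 850), "tor": (250, 250)},
}


def usd_per_gbit(price_range, speed_gbit):
    """Convert a (low, high) port price into (low, high) USD per Gbit/s."""
    low, high = price_range
    return low / speed_gbit, high / speed_gbit


if __name__ == "__main__":
    for (tech, speed), classes in PRICES_USD.items():
        for port_class, price_range in classes.items():
            low, high = usd_per_gbit(price_range, speed)
            print(f"{tech} {speed} Gb/s {port_class}: {low:.1f} to {high:.1f} USD per Gbit/s")
```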
The devil’s advocate: why not InfiniBand *?
InfiniBand is (almost) a single-vendor technology
Can the InfiniBand flow-control cope with the very bursty traffic typical for a DAQ system, while preserving high bandwidth utilization?
InfiniBand is mostly used in HPC, where hardware is regularly and radically refreshed (unlike IT systems like ours, which are refreshed and grown over time in an evolutionary way). Is there not a big risk that the HPC guys jump onto a new technology and InfiniBand is gone?
The InfiniBand software stack seems awfully complicated compared to the Berkeley socket API. What about tools, documentation and tutorial material to train engineers, students, etc.?
* Enter the name of your favorite non-favorite here!
The evolution of NICs
[Figure: bandwidth in Gbit/s (axis 0 to 140) versus year (2008 to 2017) for three series: Ethernet (10, 40, 100), InfiniBand x4 (32, 54, 100) and PCIe x8 (32, 64, 128). Annotations: PCIe Gen3 available; Mellanox FDR and Mellanox 40 GbE NIC; Chelsio T5 (40 GbE) and Intel 40 GbE expected; EDR (100 Gb/s) HCA expected; 100 GbE NIC expected; PCIe Gen4 expected]
The evolution of lane-speed
All modern interconnects (> 10 Gb/s) consist of multiple serial lanes
Another aspect of “Moore’s” law is the increase of serialiser speed
Higher speed reduces the number of lanes (fibres)
Cheaper interconnects also require the availability of cheap optics (VCSEL, silicon photonics)
VCSELs currently run better over MMF (OM3, OM4 for 10 Gb/s and above); per meter these fibres are more expensive than SMF
Current lane-speed is 10 Gb/s (same as 8 Gb/s, 14 Gb/s)
Next lane-speed (coming soon and already available on high-end FPGAs) is 25 Gb/s (same as 16 Gb/s)
  should be safely established by 2017 (a hint for GBT v3?)
It is widely believed that volume deployments of 100 Gb/s and beyond will use lanes of at least 25 Gb/s (not 10 Gb/s); example: InfiniBand EDR (this year!). PCIe may soon be the only volume market for 10 Gb/s VCSELs.
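The effect of lane speed on the number of lanes and fibres can be sketched as below. The lane speeds are from the slide. The fibre count assumes parallel optics with one fibre per lane and per direction (as in SR4 and SR10 style links), which is an assumption of this sketch, and the function names are illustrative.

```python
# Lanes and fibres needed for a link of a given speed at a given lane speed.
# ASSUMPTION: parallel optics, one fibre per lane per direction.
import math


def lanes_needed(link_gbit, lane_gbit):
    """Smallest number of serial lanes that carries link_gbit."""
    return math.ceil(link_gbit / lane_gbit)


def fibres_needed(link_gbit, lane_gbit):
    """Fibres for a bi-directional link: one per lane in each direction."""
    return 2 * lanes_needed(link_gbit, lane_gbit)


if __name__ == "__main__":
    for link, lane in ((40, 10), (100, 10), (100, 25)):
        print(f"{link} Gb/s over {lane} Gb/s lanes: "
              f"{lanes_needed(link, lane)} lanes, {fibres_needed(link, lane)} fibres")
```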
The evolution of switches – BBOT (Big Beasts Out There)
Brocade MLX: 768 x 10-GigE
Juniper QFabric: up to 6144 x 10-GigE (not a single-chassis solution)
Mellanox SX6536: 648 x 56 Gb (IB) / 40 GbE ports
Huawei CE12800: 288 x 40 GbE / 1152 x 10 GbE
[Figure axis: date of release]
Inside the beasts
Having many ports does not mean that the interiors are similar
Some are a high-density combination of TOR elements (Mellanox, Juniper QFabric)
  Usually priced just so that you cannot do it cheaper by cabling up the pizza-boxes yourself
Some are classical fat cross-bars (Brocade)
Some are in-between (Huawei: CLOS, but lots of buffering)
Upshot: suitable devices exist for both InfiniBand and Ethernet; cost will decide
The eternal copper
Surprisingly (for some), copper cables remain the cheapest option, iff distances are short
A copper interconnect is even planned for InfiniBand EDR (100 Gbit/s)
For 10 Gigabit Ethernet the verdict is not yet in, but I am convinced that we will have 10 GBaseT on the main-board “for free”
  It is not yet clear if there will be (a need for) truly high-density 10 GBaseT line-cards (100+ ports)
Keep distances short
Multi-lane optics (Ethernet SR4, SR10, InfiniBand QDR) over multi-mode fibres are limited to 100 m (OM3) to 150 m (OM4)
Cable assemblies (“direct-attach” cables) are either
  passive (“copper”, “twinax”): very cheap and rather short (max. 4 to 5 m), or
  active: still cheaper than discrete optics, but as they use the same components internally they have similar range limitations
For comparison: the price ratio of a 40G QSFP+ copper cable assembly : a 40G QSFP+ active cable : 2 x QSFP+ SR4 optics + fibre (30 m) is 1 : 8 : 10
On-chip transceivers (silicon photonics) are expected to be limited to about 70 m (on MMF)
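The distance limits above suggest a simple selection rule, sketched here. The limits (5 m passive copper, 100 m on OM3, 150 m on OM4) are the upper values quoted on the slide. Active cables are left out because the slide gives no numeric range for them, and the function name and option labels are illustrative.

```python
# Which link types from the slide are usable at a given distance.
# Options are listed cheapest first, following the 1 : 8 : 10 price ratio
# (passive copper is cheapest, discrete optics the most expensive).
# ASSUMPTION: active cables are omitted, as the slide gives no numeric limit.

LIMITS_M = [
    ("passive copper direct-attach", 5),
    ("multi-lane optics over OM3", 100),
    ("multi-lane optics over OM4", 150),
]


def cable_options(distance_m):
    """Return the link types whose range limit covers distance_m."""
    return [name for name, limit in LIMITS_M if distance_m <= limit]


if __name__ == "__main__":
    for d in (3, 30, 120, 200):
        print(f"{d} m: {cable_options(d) or 'none of the options on this slide'}")
```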
Architecture
Make the best of available technologies
Principles
Minimize the number of expensive “core” network ports
Use the most efficient technology for a given connection
  different technologies should be able to co-exist (e.g. fast for building, slow for end-node)
Try to be open for interconnect technology
Keep distances short
Exploit the economy of scale: try to do what everybody does (but smarter)
Complete read-out on/to the surface
[Diagram labels: Detector, 100 m rock, Readout Units, DAQ network, Compute Units]
Long distance covered by versatile links (GBT)
Maximum of 303 m from FE on detector to AMC40 input
Measurements with prototype FE and AMC40 show that this works well, with excellent margins
Can use cheapest commercial links
Easiest operation (accessibility)
Split read-out system
[Diagram labels: Detector, Event-building / HLT0, 100 m rock, DAQ network, Compute Units]
In this scenario the event-building is done in UX85A
Part of the HLT can also be run there
Pros:
  fewer links to the surface
Cons:
  need to completely refurbish D1 and D2
  the remaining links to the surface will be very expensive: adding bandwidth (scaling!) becomes expensive
Since the bulk of the cost is in the connectors, break-points and testing of the fibres, and not in the material itself, the cost savings due to the reduced number of links are not as big as might be hoped
Protocols & topologies
Pull, push and barrel-shifting can of course be done on any topology (more in Daniel & Guoming’s talk)
Also, this is not an ideological discussion between “pushing” and “pulling” (after all, the data always flow from the detector to the farm), but a question of which management of the flow is the most efficient
It is more the switch hardware (in particular the amount of buffer space) and the properties of the low-level (layer-2) protocol which will make the difference
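As an illustration of barrel-shifting, here is a minimal sketch of the textbook schedule, in which source i sends to destination (i + t) mod N in time slot t. It is not taken from the talk and says nothing about how the scheme would actually be implemented in the DAQ; the function name is illustrative.

```python
# Textbook barrel-shifter schedule for N nodes that are both sources and destinations.
# In every time slot each destination is addressed by exactly one source,
# so no output port is oversubscribed.


def barrel_shift_schedule(n_nodes):
    """Return schedule[slot][src] = dst for one full rotation of n_nodes slots."""
    return [[(src + slot) % n_nodes for src in range(n_nodes)]
            for slot in range(n_nodes)]


if __name__ == "__main__":
    for slot, destinations in enumerate(barrel_shift_schedule(4)):
        pairs = ", ".join(f"{src}->{dst}" for src, dst in enumerate(destinations))
        print(f"slot {slot}: {pairs}")
```

In slot 0 every node is paired with itself, which in a network where the same servers send and receive corresponds to a local copy rather than a network transfer.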
Classical fat-core event-builder
About 1600 core ports
Uniform event-builder network
10 GBaseT
100 Gb/s bi-directional
These 2 not necessarily the same technology
Uniform Network
Advantages
  Fewer overall network ports
  Technology independence of the event-building network
  Allows a simple push architecture with minimum buffering in the AMC40
  Allows using the most light-weight event-building, based on Remote DMA
Challenges
  Lots of I/O in the event-building servers
  Need to find the most efficient way to couple the AMC40 to a PC: Ethernet, InfiniBand, PCIe
  Need careful task organization on the event-building PCs (task priorities, IRQ handling etc.)
Planning
Under the assumption that we need to be ready to take data in 2019
  If LS2 shifts, so do we
  System for SD-tests ready whenever needed, with minimal investment
2013: DAQPIPE ready; tests of 40 GigE and IB, PC I/O technology, PCIe performance (see talks by Umberto and Beat)
2013 – 16: technology following (PC and network)
2014 – 15: large-scale IB and Ethernet tests
2016: TDR & tender preparations
2017: Acquisition of a minimal system able to read out every GBT (at some speed)
  Acquisition of modular data-centre
2018: Acquisition and commissioning of the full system
  starting with the network
  farm as needed
Conclusions
Technology is on our side
Aim for a simple data-flow (push wherever possible), with as few components as possible
Stay flexible: decide on core network technology as late as possible
Uniform network currently looks like the best architecture to achieve the above goals