PFQ @ 9th Italian Networking Workshop (Courmayeur)

PFQ: a Novel Architecture for Packet Capture on Parallel Commodity Hardware

Nicola Bonelli, Andrea Di Pietro, Stefano Giordano, Gregorio Procissi

CNIT and Dip. di Ingegneria dell’Informazione - Università di Pisa

Outline

• Introduction and motivation
• Multi-core programming guidelines
• PFQ architecture
• Performance evaluation
• Conclusion and future work

Introduction and Motivations
• Developing monitoring applications for fast links on commodity hardware is a very challenging task
– The hardware has evolved: 10 Gbit/s links, multi-core architectures and multi-queue network devices…

• The present software for packet capturing, including some parts of the Linux kernel, is not suitable for the new hardware.
– (+) Kernel support for multi-queue network adapters is now implemented
– (-) PF_PACKET is extremely slow, even when used in memory-map mode (pcap)
    • The Linux networking subsystem is slow and unnecessary for monitoring applications
– (-) PF_RING is designed for single-processor systems
• Traffic monitoring is not limited to packet capturing…
– Exploit the current hardware, scaling possibly linearly with the number of cores
– Decouple the hardware parallelism from the software parallelism
– Use a divide-and-conquer approach to steer packets to applications

Multi-thread on Multi-core (1)
• What’s wrong with the current software?
– Previous multi-threading paradigms used for single-processor systems are still valid, but prevent the software from scaling with the number of cores.
• For software on a multi-core system to be effective…
– Semaphores, mutexes, R/W mutexes and spinlocks are out of the question!
– Atomic operations are required, but must be used in moderation
    • Software design determines the use of atomic operations
– Sharing (writes to shared data) must be used in moderation too
– False sharing must and can always be avoided
• Wait-free algorithms, as well as cache-oblivious algorithms, are our friends

PFQ preamble
• PFQ is a novel capture system natively supporting 64-bit multi-core architectures, written following all the guidelines presented above to provide the best possible performance
• PFQ does not memory-map the packet descriptors of the device driver to user space (as most commercial vendor products do)
• PFQ is not a custom driver (such as NetMap or PF_RING DNA): it is an architecture running on top of standard Ethernet drivers, as well as slightly modified “PFQ-aware” drivers (an idea inherited from PF_RING-aware drivers)
• PFQ enables packet capturing, filtering, aggregation of hardware queues and devices, packet classification, packet steering and so forth…
• PFQ pre-processing is ideal for bidirectional connection balancing, VoIP and different kinds of tunnels: tasks otherwise left to user-space applications.

PFQ architecture
Built on top of the following components…
• DB-MPSC queue: double-buffered, multiple-producer single-consumer queue (for the communication to user space):
– allows concurrent NAPI contexts to enqueue packets
– reduces sharing and eliminates false sharing between user-space and NAPI contexts
– enables user-space copies from the queue to a private buffer in a batch fashion
• De-multiplexing matrix:
– a fully concurrently accessible data structure (benign race conditions)
– no serialization is required to steer/copy packets
• SPSC queue:
– enables batching of sk_buff and increases locality for fast packet handlers
• Driver aware:
– an effective idea inherited from PF_RING

PFQ architecture

Prefetching queue

• Memory allocation in kernels prior to 2.6.39 had a spinlock on the fast path that serialized threads of execution

• Allocation/deallocation of sk_buff was not completely parallelized, even when running on different physical cores

• Batch processing is a well-known and efficient technique:

– Optimizes cache effectiveness through temporal reference locality

– Reduces the probability of contention on the alloc/dealloc structures

Packet steering

• Per-socket filtering is a common paradigm in capture engines
– Linearly scanning the socket list to check which sockets may be interested in each packet is O(n)!

• In a multi-core environment we need a new paradigm: packet steering

• Completely concurrent block (wait-free):
– Shared state is mostly read-only
– Bitmap-based, and can be updated through atomics (supports up to 64 sockets)
– Socket selection is ~ O(1)
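
The bitmap-based lookup described above can be sketched as follows. This is a minimal illustration in Python, not PFQ's kernel code: the `SteeringTable` class, its method names and the (device, queue) key are hypothetical, and a plain Python integer stands in for the 64-bit word that a kernel would update with atomic operations.

```python
MAX_SOCKETS = 64  # the slide states that up to 64 sockets are supported


class SteeringTable:
    """Hypothetical steering table: one 64-bit socket mask per (device, queue)."""

    def __init__(self):
        self._masks = {}  # (device, queue) -> int used as a 64-bit bitmap

    def subscribe(self, device, queue, socket_id):
        if not 0 <= socket_id < MAX_SOCKETS:
            raise ValueError("socket_id must be in [0, 64)")
        key = (device, queue)
        # In a kernel this would be an atomic OR on a shared word.
        self._masks[key] = self._masks.get(key, 0) | (1 << socket_id)

    def unsubscribe(self, device, queue, socket_id):
        key = (device, queue)
        # In a kernel this would be an atomic AND with the inverted bit.
        self._masks[key] = self._masks.get(key, 0) & ~(1 << socket_id)

    def sockets_for(self, device, queue):
        """Return the ids of the sockets that must receive the packet."""
        mask = self._masks.get((device, queue), 0)
        ids = []
        while mask:
            lowest = mask & -mask            # isolate the lowest set bit
            ids.append(lowest.bit_length() - 1)
            mask ^= lowest                   # clear that bit
        return ids
```

The lookup reads a single word and visits only the bits that are set, so its cost is bounded by the 64-bit width instead of growing with the length of a socket list.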

Packet steering
• Given a packet and a set of sockets, which socket needs to receive it?
– Filtering (possibly no socket needs to receive the packet)
– Load balancing (balance across multiple sockets based on a hash function)
• Load balancing groups:
– A socket can subscribe to a load balancing group
– It will receive a fraction of the overall traffic
• Simple subscription:
– A socket can subscribe to all of the traffic coming from one or more hardware queues
• Both modes can be supported concurrently:
– Copy and balancing are handled by PFQ
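
A load balancing group of the kind described above could look roughly like this. The sketch covers only the balancing mode; `flow_hash` and `BalancingGroup` are hypothetical names, and the choice of CRC32 over a sorted pair of endpoints is an assumption made for illustration, not the hash PFQ uses. Sorting the endpoints makes the hash symmetric, so both directions of a connection reach the same socket.

```python
import zlib


def flow_hash(src_ip, dst_ip, src_port, dst_port, proto):
    """Symmetric hash: both directions of a connection hash to the same value."""
    # Sorting the two endpoints removes the dependence on direction.
    a, b = sorted([(src_ip, src_port), (dst_ip, dst_port)])
    key = f"{a[0]}:{a[1]}|{b[0]}:{b[1]}|{proto}".encode()
    return zlib.crc32(key)


class BalancingGroup:
    """Hypothetical load balancing group: each packet goes to exactly one member."""

    def __init__(self):
        self.members = []  # socket ids subscribed to this group

    def subscribe(self, socket_id):
        self.members.append(socket_id)

    def pick(self, src_ip, dst_ip, src_port, dst_port, proto):
        if not self.members:
            return None  # filtering case: no socket needs the packet
        h = flow_hash(src_ip, dst_ip, src_port, dst_port, proto)
        return self.members[h % len(self.members)]
```

With N subscribed sockets and a well-mixed hash, each socket sees roughly 1/N of the flows, which matches the "fraction of the overall traffic" in the slide.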

Socket queue: DB-MPSC
• This is an unavoidable contention point:
– Load balancing shuffles packets across sockets
• How to handle contention without impacting performance?
– Use a wait-free algorithm: DB-MPSC queues (double-buffered multi-producer single-consumer)
– Supports copies/balancing
– Reduces coherence traffic among cores: a single (per-packet) atomic operation, to be amortized in future implementations
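
One plausible shape for a double-buffered multi-producer single-consumer queue is sketched below. It is an illustration of the idea, not PFQ's implementation: `AtomicWord` and `DoubleBufferQueue` are hypothetical, the lock inside `AtomicWord` only emulates what a single atomic CPU instruction would do, and packing the buffer index and the slot counter into one word is an assumption about how a single per-packet atomic operation could be enough.

```python
import threading


class AtomicWord:
    """Emulates an atomic integer; the lock stands in for a hardware atomic."""

    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def fetch_add(self, delta):
        with self._lock:
            old = self._value
            self._value += delta
            return old

    def exchange(self, new):
        with self._lock:
            old = self._value
            self._value = new
            return old


class DoubleBufferQueue:
    """Hypothetical double-buffered multi-producer single-consumer queue."""

    INDEX_BIT = 1 << 31  # top bit selects the active buffer, low bits count slots

    def __init__(self, capacity):
        self.capacity = capacity
        self.buffers = [[None] * capacity, [None] * capacity]
        self.state = AtomicWord(0)
        self._active = 0  # touched only by the single consumer

    def push(self, packet):
        """Producer side: one atomic operation per packet reserves a slot."""
        old = self.state.fetch_add(1)
        active = 1 if old & self.INDEX_BIT else 0
        slot = old & (self.INDEX_BIT - 1)
        if slot >= self.capacity:
            return False  # active buffer is full: the packet is dropped
        self.buffers[active][slot] = packet
        return True

    def drain(self):
        """Consumer side: swap buffers, then read the retired one as a batch."""
        new_active = 1 - self._active
        old = self.state.exchange(self.INDEX_BIT if new_active else 0)
        count = min(old & (self.INDEX_BIT - 1), self.capacity)
        # Simplification: a real queue must also wait for in-flight writes
        # and guard against the counter overflowing into the index bit.
        batch = self.buffers[self._active][:count]
        self._active = new_active
        return batch
```

Producers never wait on each other or on the consumer: each one reserves its slot with a single fetch-and-add, while the consumer retires a whole buffer at once and copies it out as a batch.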

Testbed: Mascara & Monsters
• Mascara and Monsters are connected by a 10 Gb link
• Mascara (traffic generation):
– Dual Xeon 6-core L5640 @ 2.27 GHz, 24 GBytes RAM
– Intel 82599 multi-queue 10G Ethernet adapter
– New socket PF_DIRECT for generation
– By deploying 3-4 cores, it is possible to generate up to 13 Mpps of 64-byte packets
• Monsters (traffic capture):
– Xeon 6-core X5650 @ 2.57 GHz, 12 GBytes RAM
– Intel 82599 multi-queue 10G Ethernet adapter
– PFQ on board for traffic capture
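
To put the 13 Mpps figure in context, the theoretical maximum packet rate on a 10 Gbit/s Ethernet link can be computed from the frame size plus the fixed per-frame overhead on the wire (8 bytes of preamble and start-of-frame delimiter, 12 bytes of inter-frame gap). The sketch assumes that "64 bytes" means the minimum Ethernet frame including the FCS; the function name is illustrative.

```python
def line_rate_pps(link_bps, frame_bytes):
    """Maximum frames per second on Ethernet for a given frame size."""
    PREAMBLE_SFD = 8      # bytes sent before every frame
    INTER_FRAME_GAP = 12  # minimum idle time between frames, in byte times
    wire_bits = (frame_bytes + PREAMBLE_SFD + INTER_FRAME_GAP) * 8
    return link_bps / wire_bits


max_pps = line_rate_pps(10e9, 64)
print(f"{max_pps / 1e6:.2f} Mpps")           # about 14.88 Mpps
print(f"{13e6 / max_pps:.0%} of line rate")  # about 87%
```

Under that assumption, the generator reaches a little under 90% of the line rate for minimum-size frames.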

Single socket layout

Fully parallel layout
Not enough generated traffic!

Load balancing across user space sockets

• Keep the number of capturing NAPI contexts fixed (12 with Intel Hyper-Threading)

• Change the number of user space threads

All of the traffic with just 3 threads!

Packet copy
• Copying the same traffic to a variable number of user space threads

• Still 12 NAPI contexts within the kernel

Future directions

• Work on a new packet steering framework:
– How can we distribute packets according to application-specific semantics?
    • Implement balancing groups
    • Each group is associated with an “application-specific hash function”
    • Bind a set of sockets to each group
• Use case: VoIP analysis
– Steer control traffic to a specific core
– Load balance candidate RTP flows across a variable number of sockets
    • Easy (but inaccurate): stateless heuristic
    • Hard: implement a distributed stateful heuristic, where each core works on a private state that is then periodically synchronized with those of the other cores…
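
As an illustration of the "easy (but inaccurate)" option, a stateless check can look only at the first bytes of each UDP payload. The sketch below is one possible heuristic, not the one planned for PFQ: `looks_like_rtp` and `steer` are hypothetical names, the checks rely on the RTP fixed header layout from RFC 3550 (version 2 in the top two bits, 12-byte minimum header), and balancing on the SSRC field is an assumption made to keep the example short.

```python
def looks_like_rtp(udp_payload):
    """Stateless guess: does this UDP payload start with a plausible RTP header?"""
    if len(udp_payload) < 12:          # the fixed RTP header is 12 bytes
        return False
    version = udp_payload[0] >> 6      # top two bits of the first byte
    payload_type = udp_payload[1] & 0x7F
    if version != 2:
        return False
    # Payload types 72-76 collide with RTCP packet types 200-204.
    return not 72 <= payload_type <= 76


def steer(udp_payload, rtp_sockets, control_socket):
    """Send candidate RTP to one of rtp_sockets, everything else to control."""
    if looks_like_rtp(udp_payload) and rtp_sockets:
        ssrc = int.from_bytes(udp_payload[8:12], "big")  # RTP source identifier
        return rtp_sockets[ssrc % len(rtp_sockets)]
    return control_socket
```

Because no per-flow state is kept, any UDP payload that happens to start with the right bit pattern is misclassified as RTP, which is the inaccuracy the slide refers to.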

Conclusions
• Modern commodity architectures are increasingly parallel

• Huge potential for software based network devices

• Need to strictly fulfill coding and design rules

• PFQ
– A novel packet capturing engine
– Better scalability with respect to competitors
– Flexible packet steering
– Decouples kernel space and user space parallelism

• PFQ webpage and download:
– netgroup.iet.unipi.it/software/pfq