PFQ: a Novel Architecture for Packet Capture on Parallel Commodity Hardware
Nicola Bonelli, Andrea Di Pietro, Stefano Giordano, Gregorio Procissi
CNIT e Dip. di Ingegneria dell’Informazione - Università di Pisa
Outline
• Introduction and motivation
• Multi-core programming guidelines
• PFQ architecture
• Performance evaluation
• Conclusion and future work
Introduction and Motivations
• Building monitoring applications for fast links on commodity hardware is a very challenging task
  – The hardware has evolved: 10 Gbit/s links, multi-core architectures and multi-queue network devices…
• The present software for packet capturing, including some parts of the Linux kernel, is not suitable for the new hardware
  – (+) Kernel support for multi-queue network adapters is now implemented
  – (-) PF_PACKET is extremely slow, even when used in memory-map mode (pcap)
    • The Linux networking subsystem is slow and pointless for monitoring applications
  – (-) PF_RING is designed for single-processor systems
• Traffic monitoring is not limited to packet capturing…
  – Exploit the current hardware, scaling possibly linearly with the number of cores
  – Decouple the hardware parallelism from the software parallelism
  – Use a divide-and-conquer approach to steer packets to applications
Multi-thread on Multi-core (1)
• What’s wrong with the current software?
  – Multi-threading paradigms designed for single-processor systems are still valid, but they prevent the software from scaling with the number of cores
• For software on a multi-core system to be effective…
  – Semaphores, mutexes, R/W mutexes and spinlocks are out of the question!
  – Atomic operations are required, but must be used in moderation
    • Software design determines the use of atomic operations
  – Sharing (writes to shared data) must be used in moderation too
  – False sharing must, and always can, be avoided
• Wait-free algorithms, as well as cache-oblivious algorithms, are our friends
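The guidelines on sharing and false sharing can be illustrated with per-core counters. The following is a minimal sketch, not PFQ code: the 64-byte cache-line size, the core count and all identifiers are assumptions made for illustration.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdatomic.h>
#include <stdint.h>

#define CACHE_LINE 64   /* assumed cache-line size in bytes */
#define MAX_CORES  12   /* hypothetical number of cores */

/* One counter per core, each aligned to its own cache line, so a write
 * by one core never invalidates the line holding another core's counter. */
struct percore_counter {
    alignas(CACHE_LINE) _Atomic uint64_t packets;
};

static_assert(sizeof(struct percore_counter) == CACHE_LINE,
              "each counter must fill exactly one cache line");

static struct percore_counter counters[MAX_CORES];

/* Fast path: each core is the only writer of its slot, so a relaxed
 * load/store pair is enough; no lock and no read-modify-write atomic. */
static void count_packet(unsigned core)
{
    assert(core < MAX_CORES);
    uint64_t v = atomic_load_explicit(&counters[core].packets,
                                      memory_order_relaxed);
    atomic_store_explicit(&counters[core].packets, v + 1,
                          memory_order_relaxed);
}

/* Slow path: a reader sums all slots; the total may be slightly stale. */
static uint64_t total_packets(void)
{
    uint64_t sum = 0;
    for (unsigned i = 0; i < MAX_CORES; i++)
        sum += atomic_load_explicit(&counters[i].packets,
                                    memory_order_relaxed);
    return sum;
}
```

The trade-off is memory for scalability: each counter occupies a full cache line, but the fast path performs no shared writes at all.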
PFQ preamble
• PFQ is a novel capture system that natively supports 64-bit multi-core architectures and is written following all of the guidelines above to provide the best possible performance
• PFQ does not memory-map the packet descriptors of the device driver to user space (as most commercial vendor products do)
• PFQ is not a custom driver (such as NetMap or PF_RING DNA): it is an architecture running on top of standard Ethernet drivers, as well as slightly modified “PFQ-aware” drivers (the driver-aware idea is inherited from PF_RING)
• PFQ enables packet capturing, filtering, aggregation of hardware queues and devices, packet classification, packet steering and so forth…
• PFQ pre-processing is ideal for bidirectional connection balancing, VoIP and different kinds of tunnels: tasks otherwise left to user-space applications
PFQ architecture
Built on top of the following components…
• DB-MPSC queue: multiple-producer, double-buffered queue (for communication with user space):
  – Allows concurrent NAPI contexts to enqueue packets
  – Reduces sharing and eliminates false sharing between user space and NAPI contexts
  – Enables user space to copy packets from the queue to a private buffer in batches
• De-multiplexing matrix:
  – Data structure that can be accessed fully concurrently (benign race conditions)
  – No serialization is required to steer/copy packets
• SPSC queue:
  – Enables batching of sk_buff, increasing locality for fast packet handlers
• Driver aware:
  – An effective idea inherited from PF_RING
PFQ architecture
Prefetching queue
• Memory allocation in kernels prior to 2.6.39 had a spinlock on the fast path that serialized threads of execution
• Allocation/deallocation of sk_buff was not completely parallelized, even when running on different physical cores
• Batch processing is a well-known and efficient technique:
  – Optimizes cache effectiveness through temporal locality of reference
  – Reduces the probability of contention on the alloc/dealloc structures
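The batching idea can be sketched as a small per-context buffer that hands packets to a handler only once a full batch has accumulated. This is an illustrative sketch only: `struct packet` stands in for sk_buff, and the batch length and all function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define BATCH_LEN 16                 /* hypothetical batch size */

struct packet { unsigned len; };     /* stand-in for the kernel's sk_buff */

/* Private to one capturing context, so no synchronization is needed. */
struct batch_queue {
    struct packet *slot[BATCH_LEN];
    size_t         count;
};

typedef void (*batch_handler)(struct packet **pkts, size_t n, void *ctx);

/* Buffer one packet. When the batch is full, pass all packets to the
 * handler in one call and reset the queue. Returns the number of
 * packets handed over (0 if the packet was only buffered). */
static size_t batch_push(struct batch_queue *q, struct packet *p,
                         batch_handler fn, void *ctx)
{
    assert(q->count < BATCH_LEN);
    q->slot[q->count++] = p;
    if (q->count < BATCH_LEN)
        return 0;
    size_t n = q->count;
    fn(q->slot, n, ctx);             /* process (and later free) in a batch */
    q->count = 0;
    return n;
}

/* Example handler: accumulate the total length of the batch in *ctx. */
static void sum_lengths(struct packet **pkts, size_t n, void *ctx)
{
    unsigned long *bytes = ctx;
    for (size_t i = 0; i < n; i++)
        *bytes += pkts[i]->len;
}
```

Processing the packets back to back keeps the handler's code and data hot in cache, and any allocator structure is touched once per batch rather than once per packet.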
Packet steering
• Per-socket filtering is a common paradigm in capture engines
  – Linearly scanning the socket list to check which sockets may be interested in each packet is O(n)!
• In a multi-core environment we need a new paradigm: packet steering
• Completely concurrent block (wait-free):
  – Shared state is mostly read-only
  – Bitmap-based, so it can be updated through atomics (supports up to 64 sockets)
  – Socket selection is ~O(1)
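A bitmap-based steering block of this kind might look as follows. This is a sketch under stated assumptions, not PFQ's actual data structure: one 64-bit mask per hardware queue, a hypothetical queue count, and the GCC/Clang builtin `__builtin_ctzll` to find set bits.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define MAX_SOCKETS 64   /* one bit per socket in a 64-bit word */
#define MAX_QUEUES  16   /* hypothetical number of hardware queues */

/* steer[q] has bit s set when socket s wants traffic from queue q. */
static _Atomic uint64_t steer[MAX_QUEUES];

/* Control path: subscriptions change rarely and use one atomic each. */
static void socket_subscribe(unsigned queue, unsigned sock)
{
    assert(queue < MAX_QUEUES && sock < MAX_SOCKETS);
    atomic_fetch_or_explicit(&steer[queue], UINT64_C(1) << sock,
                             memory_order_release);
}

static void socket_unsubscribe(unsigned queue, unsigned sock)
{
    assert(queue < MAX_QUEUES && sock < MAX_SOCKETS);
    atomic_fetch_and_explicit(&steer[queue], ~(UINT64_C(1) << sock),
                              memory_order_release);
}

/* Fast path: a single atomic load yields every interested socket. */
static uint64_t sockets_for(unsigned queue)
{
    assert(queue < MAX_QUEUES);
    return atomic_load_explicit(&steer[queue], memory_order_acquire);
}

/* Expand a mask into socket ids. The loop runs once per recipient,
 * not once per existing socket. Returns the number of recipients. */
static unsigned mask_to_sockets(uint64_t mask, unsigned out[MAX_SOCKETS])
{
    unsigned n = 0;
    while (mask) {
        out[n++] = (unsigned)__builtin_ctzll(mask); /* lowest set bit */
        mask &= mask - 1;                           /* clear that bit */
    }
    return n;
}
```

Because the fast path only reads the mask, capturing contexts on different cores never write to shared state while steering packets.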
Packet steering
• Given a packet and a set of sockets, which socket needs to receive it?
  – Filtering (possibly no socket needs to receive the packet)
  – Load balancing (balance across multiple sockets based on a hash function)
• Load balancing groups:
  – A socket can subscribe to a load balancing group
  – It will receive a fraction of the overall traffic
• Simple subscription:
  – A socket can subscribe to all of the traffic coming from one or more hardware queues
• Both modes can be supported concurrently:
  – Copy and balancing are handled by PFQ
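Hash-based balancing within a group can be sketched as below. The flow structure, the mixing constants and every function name are assumptions chosen for illustration; the hash is made symmetric so that both directions of a connection reach the same socket.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_GROUP 8   /* hypothetical maximum sockets per group */

struct flow {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
};

/* Symmetric hash: XOR is commutative, so swapping source and
 * destination yields the same value. */
static uint32_t flow_hash(const struct flow *f)
{
    uint32_t h = (f->src_ip ^ f->dst_ip)
               ^ (uint32_t)(f->src_port ^ f->dst_port);
    h ^= h >> 16;          /* simple integer mixing step */
    h *= 0x45d9f3bU;
    h ^= h >> 16;
    return h;
}

struct lb_group {
    unsigned nsock;
    unsigned sock[MAX_GROUP];   /* ids of the subscribed sockets */
};

/* Subscribe a socket to the group; returns 0 on success, -1 if full. */
static int group_subscribe(struct lb_group *g, unsigned sock_id)
{
    if (g->nsock == MAX_GROUP)
        return -1;
    g->sock[g->nsock++] = sock_id;
    return 0;
}

/* Pick the receiving socket for a flow, or -1 if the group is empty. */
static int group_pick(const struct lb_group *g, const struct flow *f)
{
    if (g->nsock == 0)
        return -1;
    return (int)g->sock[flow_hash(f) % g->nsock];
}
```

With n subscribed sockets each one receives roughly 1/n of the flows, which corresponds to the "fraction of the overall traffic" mentioned above.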
Socket queue: DB-MPSC
• This is an unavoidable contention point:
  – Load balancing shuffles packets across sockets
• How to handle contention without impacting performance?
  – Use a wait-free algorithm: DB-MPSC queues (double-buffered, multi-producer, single-consumer)
  – Supports copies/balancing
  – Reduces cache-coherence traffic among cores: a single (per-packet) atomic operation, which will be amortized in future implementations
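The double-buffer idea can be sketched as follows. This is a simplified illustration and not PFQ's implementation: the state layout, the tiny slot count, the per-slot `ready` flag and all names are assumptions. Producers are wait-free (one atomic fetch-add per packet), while the consumer in this sketch may spin briefly on a slot that a producer has reserved but not yet filled.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define SLOTS 4   /* hypothetical per-buffer capacity, tiny for clarity */

struct slot {
    _Atomic bool ready;     /* set once the producer has filled the slot */
    uint32_t     pkt_len;   /* stand-in for the packet data */
};

struct dbmpsc {
    /* bit 31: index of the active buffer; bits 0..30: slots reserved.
     * Simplification: counter overflow after 2^31 drops is ignored. */
    _Atomic uint32_t state;
    struct slot      buf[2][SLOTS];
};

#define ACTIVE(s) ((s) >> 31)
#define COUNT(s)  ((s) & 0x7fffffffu)

/* Producer: one atomic fetch-add reserves a slot in the active buffer.
 * Returns false (packet dropped) when that buffer is full. */
static bool dbmpsc_push(struct dbmpsc *q, uint32_t pkt_len)
{
    uint32_t s = atomic_fetch_add_explicit(&q->state, 1,
                                           memory_order_acq_rel);
    if (COUNT(s) >= SLOTS)
        return false;
    struct slot *sl = &q->buf[ACTIVE(s)][COUNT(s)];
    sl->pkt_len = pkt_len;
    atomic_store_explicit(&sl->ready, true, memory_order_release);
    return true;
}

/* Consumer (single thread): swap the buffers with one atomic exchange,
 * then drain the previously active buffer in a batch. Adds the drained
 * lengths to *bytes and returns the number of packets drained. */
static unsigned dbmpsc_drain(struct dbmpsc *q, uint64_t *bytes)
{
    uint32_t cur  = atomic_load_explicit(&q->state, memory_order_relaxed);
    uint32_t next = (ACTIVE(cur) ^ 1u) << 31;   /* other buffer, count 0 */
    uint32_t old  = atomic_exchange_explicit(&q->state, next,
                                             memory_order_acq_rel);
    unsigned n = COUNT(old) < SLOTS ? COUNT(old) : SLOTS;
    for (unsigned i = 0; i < n; i++) {
        struct slot *sl = &q->buf[ACTIVE(old)][i];
        while (!atomic_load_explicit(&sl->ready, memory_order_acquire))
            ;   /* slot reserved but not yet written by its producer */
        *bytes += sl->pkt_len;
        atomic_store_explicit(&sl->ready, false, memory_order_relaxed);
    }
    return n;
}
```

While the consumer drains one buffer, producers keep filling the other, so the two sides touch different cache lines most of the time.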
Testbed: Mascara & Monsters
• The two machines, Mascara and Monsters, are connected by a 10 Gb link
• Traffic generation:
  – Dual Xeon 6-core L5640 @ 2.27 GHz, 24 GB RAM
  – Intel 82599 multi-queue 10G Ethernet adapter
  – New socket PF_DIRECT for generation
  – By deploying 3-4 cores, it is possible to generate up to 13 Mpps of 64-byte packets
• Traffic capture:
  – Xeon 6-core X5650 @ 2.57 GHz, 12 GB RAM
  – Intel 82599 multi-queue 10G Ethernet adapter
  – PFQ on board for traffic capture
Single socket layout
Fully parallel layout
• Not enough generated traffic!
Load balancing across user space sockets
• Keep the number of capturing NAPI contexts fixed (12 with Intel Hyper-Threading)
• Change the number of user space threads
All of the traffic with just 3 threads!
Packet copy
• Copying the same traffic to a variable number of user space threads
• Still 12 NAPI contexts within the kernel
Future directions
• Work on a new packet steering framework:
  – How can we distribute packets according to application-specific semantics?
    • Implement balancing groups
    • Each group is associated with an “application-specific hash function”
    • Bind a set of sockets to each group
• Use case: VoIP analysis
  – Steer control traffic to a specific core
  – Load balance candidate RTP flows across a variable number of sockets
    • Easy (but inaccurate): stateless heuristic
    • Hard: implement a distributed stateful heuristic, where each core works on a private state that is then periodically synchronized with those of the other cores…
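The "easy (but inaccurate)" stateless heuristic could look like the sketch below. Everything here is an assumption made for illustration: control traffic is recognised by the conventional SIP port 5060, and an RTP candidate is any other UDP packet whose payload is at least 12 bytes (the RTP fixed header size) and carries version 2 in its top two bits. The socket numbering is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SIP_PORT 5060   /* conventional SIP signalling port (assumption) */

enum voip_class { VOIP_CONTROL, VOIP_RTP_CANDIDATE, VOIP_OTHER };

struct udp_pkt {
    uint16_t       src_port, dst_port;
    const uint8_t *payload;
    size_t         payload_len;
};

/* Stateless classification: looks at this packet only, so non-RTP
 * traffic that happens to match the pattern is misclassified. */
static enum voip_class voip_classify(const struct udp_pkt *p)
{
    if (p->src_port == SIP_PORT || p->dst_port == SIP_PORT)
        return VOIP_CONTROL;
    if (p->payload_len >= 12 && (p->payload[0] >> 6) == 2)
        return VOIP_RTP_CANDIDATE;
    return VOIP_OTHER;
}

/* Hypothetical steering policy: control traffic goes to socket 0,
 * RTP candidates are balanced over sockets 1..nrtp with a symmetric
 * port hash, everything else is not delivered (-1). */
static int voip_steer(const struct udp_pkt *p, unsigned nrtp)
{
    switch (voip_classify(p)) {
    case VOIP_CONTROL:
        return 0;
    case VOIP_RTP_CANDIDATE:
        assert(nrtp > 0);
        return 1 + (int)((uint32_t)(p->src_port ^ p->dst_port) % nrtp);
    default:
        return -1;
    }
}
```

A stateful variant would instead learn the RTP ports negotiated in the control traffic, which is where the per-core private state and periodic synchronization described above come in.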
Conclusions
• Modern commodity architectures are increasingly parallel
• Huge potential for software-based network devices
• Software must strictly follow the coding and design rules
• PFQ
  – A novel packet capturing engine
  – Better scalability with respect to competitors
  – Flexible packet steering
  – Decouples kernel-space and user-space parallelism
• PFQ webpage and download:
  – netgroup.iet.unipi.it/software/pfq