Flow Stats Module -- Control Design

John DeHart

2 - Flow Stats Module – John DeHart and James Moscola

SPP V1 LC Egress with 1x10Gb/s Tx

[Block diagram. Labels shown: SWITCH, MSF, RBUF, Rx1, Rx2, Key Extract, Lookup, TCAM, Hdr Format, Flow Stats1, Flow Stats2, Port Splitter, QM0-QM3, 1x10G Tx1, 1x10G Tx2, MSF, TBUF, RTM, Stats (1 ME), SRAM1, SRAM2, SRAM3, Freelist, XScale, NAT Miss Scratch Ring, NAT Pkt return, SRAM Archive Records. Links between blocks are marked NN, SCR, or SRAM.]

3 - Flow Stats Module – John DeHart and James Moscola

SPP V1 LC Egress with 10x1Gb/s Tx

[Block diagram. Labels shown: SWITCH, MSF, RBUF, Rx1, Rx2, Key Extract, Lookup, TCAM, Hdr Format, Flow Stats1, Flow Stats2, Port Splitter, QM0-QM3, 5x1G Tx1 (P0-P4), 5x1G Tx2 (P5-P9), MSF, TBUF, RTM, Stats (1 ME), SRAM1, SRAM2, SRAM3, Freelist, XScale, NAT Miss Scratch Ring, NAT Pkt return, SRAM Archive Records. Links between blocks are marked NN, SCR, or SRAM.]

4 - Flow Stats Module – John DeHart and James Moscola

Overview of Flow Stats

2 MEs in Fastpath to collect flow data for each pkt
» Byte counter per flow
» Pkt counter per flow
» Archive data to XScale via SRAM ring every 5 minutes

XScale control daemon(s) to process data
» Receive flow information from MEs
» Reformat to put into PlanetFlow format
» Maintain databases for PlanetLab archiving and for identifying internal flows (pre-NAT translation) when an external flow (post-NAT) has a complaint lodged against it.

5 - Flow Stats Module – John DeHart and James Moscola

SPP V1 LC Egress with 10x1Gb/s Tx

[Block diagram repeated from slide 3: same blocks and labels, including Flow Stats1, Flow Stats2, SRAM1, SRAM2, and the SRAM Archive Records ring to the XScale.]

6 - Flow Stats Module – John DeHart and James Moscola

Flow Record

Total Record Size = 8 32-bit words (LW0-LW7)

Fields:
    V (1b)
    Reserved (14b)
    Next Record Number (17b)
    Source Address (32b)
    Destination Address (32b)
    SrcPort (16b), DestPort (16b)
    Protocol (8b)
    TCP Flags (6b)
    Slice ID (VLAN) (12b)
    Reserved (6b)
    Packet Counter (32b)
    Byte Counter (32b)
    Start Timestamp (16b)
    End Timestamp (16b)

Legend: marked fields = Member of 6-tuple

» V is valid bit
    Only needed at head of chain
    '1' for valid record
    '0' for invalid record
» Start timestamp (16-bits) is set when record starts counting flow
    Reset to zero when record is archived
» End timestamp (16-bits) is set each time a packet is seen for the given flow
» Packet and Byte counters are incremented for each packet on the given flow
    Reset to zero when record is archived
» For TCP Flows, the TCP Flags are or'ed in from each packet
» Next Record Number is next record in hash chain
    0x1FFFF if record is tail
    Address of next record = (next_record_num * record_size) + collision_table_base_addr
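The record rules above can be sketched in a few lines of Python. This is only an illustration, not the microengine code: the `FlowRecord` class and the function names are hypothetical, `record_size` is assumed to be in bytes (8 words x 4 bytes), and using "packet count is zero" as the signal that a record starts counting is an assumption.

```python
from dataclasses import dataclass

RECORD_SIZE_BYTES = 8 * 4   # 8 32-bit words per record (byte addressing is assumed)
TAIL_MARKER = 0x1FFFF       # 17-bit Next Record Number value that marks the tail


def next_record_addr(next_record_num, collision_table_base_addr):
    """Return the address of the next record in the chain, or None at the tail."""
    if next_record_num == TAIL_MARKER:
        return None
    return next_record_num * RECORD_SIZE_BYTES + collision_table_base_addr


@dataclass
class FlowRecord:
    # Hypothetical in-memory model of the counters and timestamps in one record.
    valid: bool = False
    start_ts: int = 0
    end_ts: int = 0
    pkt_count: int = 0
    byte_count: int = 0
    tcp_flags: int = 0
    next_record_num: int = TAIL_MARKER


def update_on_packet(rec, now_ts, pkt_len, tcp_flags=0):
    """Apply the per-packet rules from the slide to one record."""
    if rec.pkt_count == 0:
        # Assumption: a zero packet count means the record starts counting now.
        rec.start_ts = now_ts & 0xFFFF
    rec.valid = True
    rec.end_ts = now_ts & 0xFFFF                      # set on every packet
    rec.pkt_count = (rec.pkt_count + 1) & 0xFFFFFFFF  # 32-bit counter
    rec.byte_count = (rec.byte_count + pkt_len) & 0xFFFFFFFF
    rec.tcp_flags |= tcp_flags & 0x3F                 # 6-bit field, OR'ed in
```

For example, `next_record_addr(3, 0x1000)` gives `0x1000 + 96`, and a Next Record Number of `0x1FFFF` ends the chain walk.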

7 - Flow Stats Module – John DeHart and James Moscola

Archiving Hash Table Records

Send all valid records in hash table to XScale for archiving every 5 minutes
Set Command field to indicate FLOW_RECORD
For each record in the main table (i.e. start of chain) ...
» For each record in hash chain ...
    If record is valid ...
        If packet count > 0 then
        – Send record to XScale via SRAM ring
        – Set packet count to 0
        – Set byte count to 0
        – Leave record in table
        If packet count == 0 then
        – Flow has already been archived
        – No packet has arrived on flow in 5 minutes
        – Record is no longer valid
        – Delete record from hash table to free memory

Info Sent to XScale for each flow every 5 minutes (LW0-LW9):
    Start Timestamp_high (32b)
    Start Timestamp_low (32b)
    End Timestamp_high (32b)
    End Timestamp_low (32b)
    Packet Counter (32b)
    Byte Counter (32b)
    SrcPort (16b), DestPort (16b)
    Destination Address (32b)
    Source Address (32b)
    Protocol (8b), TCP Flags (6b), Slice ID (VLAN) (12b), Command (6b)
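A minimal sketch of one archive pass, assuming the hash table is modelled as a dict of bucket -> list of valid records (plain dicts). The function name, the dict layout, and the `send` callback standing in for the SRAM ring are all hypothetical; only the keep/delete rules come from the slide.

```python
def archive_pass(table, send):
    """Run one 5-minute archive cycle over every hash chain in `table`."""
    for bucket in list(table):
        kept = []
        for rec in table[bucket]:
            if rec["pkt_count"] > 0:
                # Active since the last cycle: send a copy, then reset and keep.
                send(dict(rec, command="FLOW_RECORD"))
                rec["pkt_count"] = 0
                rec["byte_count"] = 0
                rec["start_ts"] = 0   # start timestamp is reset when archived
                kept.append(rec)
            # pkt_count == 0: already archived and idle for a full interval,
            # so the record is dropped from the chain to free memory.
        table[bucket] = kept
```

With this rule an idle flow costs one extra interval of memory: it is archived in one pass and removed in the next.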

8 - Flow Stats Module – John DeHart and James Moscola

Sending Time Records to XScale

ME precedes a series of Flow Records with a time record.
Set Command field to indicate TIME_RECORD
Time Record must be same size as Flow Record, currently 10 Words

Time Record Sent to XScale Preceding Flow Records (LW0-LW9):
    Timestamp_high (32b)
    Timestamp_low (32b)
    Reserved (26b), Command (6b)
    Reserved (32b) x 7

9 - Flow Stats Module – John DeHart and James Moscola

Overview of Flow Stats Control

Main functions
» Collection of Flow Information for PlanetLab Node
    Used when a complaint is lodged about a misbehaving flow
    Must be able to identify flow and the Slice that produced it.
» Aggregation of Flow Information from:
    Multiple GPEs
    Multiple NPEs
» Correlation with NAT records to identify internal flow and external flow
    External flow will be what complaint will be about.
    Internal flow will be what involved PlanetLab researcher will know about.

10 - Flow Stats Module – John DeHart and James Moscola

Translations needed

NPE Flow Records:
» VLAN to SliceID
    Comes from SRM
» IXP timestamp to wall clock time
    SCD records wall clock time it started IXP
    How do we manage time slip between clocks?

GPE Flow Records:
» NAT Port translations
    Src Port from GPE record becomes SPP Orig Src Port
    Src Port from natd translation record becomes Src Port
    natd provides port translation updates

11 - Flow Stats Module – John DeHart and James Moscola

Merging of DBs

NPE Flows
» No NAT
» Goes directly into Ext PF DB
    SPP Orig Src Port == SrcPort
» Do they need SliceID translation?
    We use the VLAN, but this probably needs to be the PlanetLab version of a Slice ID.
    SRM will provide a VLAN to SliceID translation
    – Where and When?

GPE Configured Flows
» How do we identify configured flow pkt?
    Because they don't match a NAT Record?
» No NAT
» Goes directly into Ext PF DB
    SPP Orig Src Port == SrcPort

GPE NAT Flows
» Find corresponding NAT Record, extract Translated SrcPort
    Insert record into Ext PF DB with original SrcPort moved to SPP Orig Src Port
    Set Src Port to translated SrcPort

CP Traffic?
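The merge rules above can be sketched as a single function. This is a sketch under stated assumptions: the dict field names, the match key (source address, internal port, protocol), and the use of start/stop times to pick the active translation are all hypothetical choices, since the slides do not fix them.

```python
def to_ext_pf_record(flow, nat_records):
    """Build an Ext PF DB record from a flow record and known NAT translations."""
    out = dict(flow)
    # Default for NPE flows and GPE configured flows: no NAT was applied.
    out["spp_orig_src_port"] = flow["src_port"]
    for nat in nat_records:
        active = nat["start"] <= flow["time"] and (
            nat["stop"] is None or flow["time"] <= nat["stop"]
        )
        if (active
                and nat["saddr"] == flow["src_ip"]
                and nat["internal_port"] == flow["src_port"]
                and nat["protocol"] == flow["proto"]):
            # GPE NAT flow: keep the original port, expose the translated one.
            out["src_port"] = nat["external_port"]
            break
    return out
```

A flow that matches no NAT record comes out with `spp_orig_src_port == src_port`, which is the configured-flow case the slide asks about.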

12 - Flow Stats Module – John DeHart and James Moscola

Overview of PlanetFlow

PlanetFlow
» Unprivileged slice
    Flow Collector:
        Ulogd (fprobe-ulog)
        – Netlink socket
        – Uses VSys for privileged operations
        – Every 5 minutes dumps its cache to DB
    DB:
        On PlanetLab Node
        5-minute records
        Flows spanning 5-minute intervals aggregated daily.

Central Archive
» At Princeton?
» Updated periodically by using rsync to retrieve new DB entries from ALL PlanetLab nodes.

13 - Flow Stats Module – John DeHart and James Moscola

PlanetFlow Raw Data

0005 0011 8e10638b 48a40477 00062638
0000371d 0000 0000 80fc99cd 80fc99d3
00000000 0000 0004 0000000b 0000062d
8dae5570 8dae558b cc1f 01bb 00 1f 0600
0000 0000 02000000 80fc99cd 80fc99d3
00000000 0000 0004 0000001a 000008b7
8dae54eb 8dae5533 cc1e 01bb 00 1e 0600
0000 0000 02000000

NetFlow Header (beginning of file and repeats every 30 flow records), fields in order:
    Version, Count, Uptime, Unix Secs, Unix nSecs, Flow Sequence, Engine Type (unused), Engine Id (unused), Pad16 (unused)

NetFlow Flow Record, fields in order:
    SA, DA, IPv4 NextHop (Unused), In SNMP (if_nametoindex), Out SNMP (if_nametoindex), Pkt Count, Byte Count, First Switched (flow creation time), Last Switched (time of last pkt), Src Port, Dst Port, Pad, Tcpflags, Proto, Src Tos, Src As (Unused), Dst As (Unused), XID (SliceID)

Decoded values for the two flow records shown:
    First record: SA 128.252.153.205, DA 128.252.153.211, Src Port 52255, Dst Port 443, Pkt Count 11, Byte Count 1581
    Second record: SA 128.252.153.205, DA 128.252.153.211, Src Port 52254, Dst Port 443, Pkt Count 26, Byte Count 2231
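As a cross-check on the field layout above, the dump can be decoded with Python's `struct` module. This is a sketch, not PlanetFlow code: the function names and dict keys are hypothetical, and the record format assumes the PlanetFlow layout shown on the slide, with the XID occupying the final 4 bytes of the 48-byte record.

```python
import struct
from ipaddress import IPv4Address

HEADER = struct.Struct("!HHIIIIBBH")           # 24-byte NetFlow header
RECORD = struct.Struct("!IIIHHIIIIHHBBBBHHI")  # 48-byte flow record, XID last

# Header plus the first flow record from the raw dump on the slide.
SAMPLE = bytes.fromhex(
    "0005 0011 8e10638b 48a40477 00062638 0000371d 0000 0000 "
    "80fc99cd 80fc99d3 00000000 0000 0004 0000000b 0000062d "
    "8dae5570 8dae558b cc1f 01bb 00 1f 0600 0000 0000 02000000"
)


def parse_header(buf):
    (version, count, uptime, unix_secs, unix_nsecs,
     flow_sequence, engine_type, engine_id, _pad16) = HEADER.unpack_from(buf, 0)
    return {"version": version, "count": count, "uptime": uptime,
            "unix_secs": unix_secs, "unix_nsecs": unix_nsecs,
            "flow_sequence": flow_sequence,
            "engine_type": engine_type, "engine_id": engine_id}


def parse_record(buf, offset):
    (sa, da, _nexthop, in_if, out_if, pkts, octets, first, last,
     sport, dport, _pad, tcp_flags, proto, tos,
     src_as, dst_as, xid) = RECORD.unpack_from(buf, offset)
    return {"src": str(IPv4Address(sa)), "dst": str(IPv4Address(da)),
            "in_if": in_if, "out_if": out_if,
            "pkts": pkts, "bytes": octets,
            "first_switched": first, "last_switched": last,
            "src_port": sport, "dst_port": dport,
            "tcp_flags": tcp_flags, "proto": proto, "tos": tos,
            "src_as": src_as, "dst_as": dst_as, "xid": xid}
```

Calling `parse_record(SAMPLE, HEADER.size)` decodes the first record, and the next record would start 48 bytes later.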

14 - Flow Stats Module – John DeHart and James Moscola

SPP/PlanetFlow Raw Data

0005 0011 8e10638b 48a40477 00062638
0000371d xx yy 0000 80fc99cd 80fc99d3
00000000 0000 0004 0000000b 0000062d
8dae5570 8dae558b cc1f 01bb 00 1f 0600
zzzz 0000 02000000 80fc99cd 80fc99d3
00000000 0000 0004 0000001a 000008b7
8dae54eb 8dae5533 cc1e 01bb 00 1e 0600
zzzz 0000 02000000

NetFlow Header (beginning of file and repeats every 30 flow records), fields in order:
    Version, Count, Uptime (msecs), Unix Secs, Unix nSecs, Flow Sequence, SPP Engine Type (xx), SPP Engine Id (yy), Pad16 (unused)

NetFlow Flow Record, fields in order:
    SA, DA, IPv4 NextHop (Unused), In SNMP (if_nametoindex), Out SNMP (if_nametoindex), Pkt Count, Byte Count, First Switched (msec) (flow creation time), Last Switched (msec) (time of last pkt), Src Port, Dst Port, Pad, Tcpflags, Proto, Src Tos, SPP Orig Src Port (zzzz), Dst As (Unused), XID (SliceID)

Decoded values for the two flow records shown:
    First record: SA 128.252.153.205, DA 128.252.153.211, Src Port 52255, Dst Port 443, Pkt Count 11, Byte Count 1581
    Second record: SA 128.252.153.205, DA 128.252.153.211, Src Port 52254, Dst Port 443, Pkt Count 26, Byte Count 2231

15 - Flow Stats Module – John DeHart and James Moscola

Issues and Notes

Time:
» Keeping time in sync among various machines:
    Flow Stats ME timestamps with IXP clock ticks.
    – Something has to convert this to a Unix time.
    GPE(s) timestamps with Unix gettimeofday().
    CP collects flow records and aggregates based on time.
    Proposal:
    – XScale, GPE(s) and CP will use ntp to keep their Unix times in sync
    – At the beginning of each reporting cycle, the Flow Stats ME should send a timestamp record just to allow the XScale and CP to keep the time in sync.
    – OR: Can XScale read the IXP clock tick and report that to the CP along with the XScale's Unix time?
» What are the times that are recorded in the Header and Flow Records?
    Header
    – Uptime (msecs): msecs since a base start time
    – Time since Unix Epoch: time since January 1, 1970
        Unix secs
        Unix nSecs
    – Uptime and Unix (secs, nSecs) represent the SAME time
        So that the Flow times can be calculated based on them.
    Flow Record
    – First Switched (flow creation time): msecs since a base start time
    – Last Switched (last packet in flow seen time): msecs since base start time
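Because the header's Uptime and Unix (secs, nSecs) name the same instant, a flow's First Switched or Last Switched value can be turned into wall-clock time by subtracting its distance from the header's Uptime. A minimal sketch, with a hypothetical function name and dict keys:

```python
def switched_to_unix(header, switched_ms):
    """Convert a First/Last Switched value (msecs since base) to Unix seconds."""
    header_unix = header["unix_secs"] + header["unix_nsecs"] / 1e9
    # How long before the header instant the flow event happened, in seconds.
    age_seconds = (header["uptime_ms"] - switched_ms) / 1000.0
    return header_unix - age_seconds
```

For a header with Uptime 10000 msecs at Unix time 1000.5, a First Switched value of 4000 msecs lies 6 seconds earlier, at Unix time 994.5.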

16 - Flow Stats Module – John DeHart and James Moscola

Issues and Notes (continued)

NetFlow Header
» Filled in AFTER 30 flow records are filled in OR we get a timeout (10 minutes)
» COUNT field tells how many flow records are valid.
    File or data packet is ALWAYS padded out to a size that would hold 30 flow records
» Flow Sequence: Running total of number of flow records emitted.

Flow Header and Flow Records
» Emitted in chunks of 30 flow records plus a Flow Header
    Emitted either by writing to a file or sending over a socket to a mirror site.
    Padded out to a size that would hold 30 flow records.
» A flow is emitted when it has been inactive for at least a minute or when it has been active for at least 5 minutes.

Fprobe-ulog threads:
» emit_thread
» scan_thread
» cap_thread
» unpending_thread

Flow lists
» flows[]: hashed array of flows, buckets chained off head of list
    These are flows that have been reported over netlink socket
» flows_emit: linked list of flows ready to be emitted.

17 - Flow Stats Module – John DeHart and James Moscola

Issues and Notes (continued)

VLANs and SliceIDs
» NPE and LC use VLANs to differentiate Slices
» Flow records must record slice IDs
    SRM will provide VLAN to SliceID translation
» GPE(s) do not differentiate Slices by VLAN.
    All flows from a GPE will use the same VLAN
    GPE keeps flow records locally using Slice ID
    Flow Stats ME could ignore GPE flow packets if it was told what the default GPE VLAN was.
    Otherwise, one of the fs daemons could drop the flow records for the GPE flows that the Flow Stats ME reports.

Slice ID:
» What exactly is it?
» Is the XID that is recorded by PlanetFlow actually the slice id or is it the VServer id?

18 - Flow Stats Module – John DeHart and James Moscola

Issues and Notes (continued)

NAT Port Translations
» GPE flow records are the ones that need the NAT Port translation data
» GPE flow records will come across from the GPE(s) to the CP via rsync or similar
» natd will report NAT port translations with timestamps to the fs daemon
» fs daemon will have to maintain NAT port translations (with their timestamps) for possible later correlation with GPE flow records

GPE(s) will all use the same default VLAN
» SRM will send this VLAN to scd so it can write it to SRAM for the fs ME to read in
    Fs ME will then filter out GPE flow records.

SRM fsd messaging
» srm will push out VLAN to SliceID translation creation and deletion messages
    srm will wait ~10 minutes before re-using a VLAN
    srm will send the delete VLAN message after waiting the 10 minutes.
    fsd should not have to keep any history of VLAN/SliceID translations
    – It should get the creation before it receives any flow records for it
    – It should get the last flow record before it gets the deletion
» fsd will also be able to query SRM for current translation
    This will facilitate a restart of the fsd while the SRM maintains current state.
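Since srm delays the delete message by about 10 minutes, fsd only needs the current VLAN to SliceID mapping and no history. A sketch of such a table, with a hypothetical class name and a hypothetical `query_srm` callback standing in for the query path used after an fsd restart:

```python
class VlanSliceMap:
    """Current VLAN -> SliceID translations as pushed by srm."""

    def __init__(self, query_srm=None):
        self._map = {}
        self._query_srm = query_srm   # optional fallback, e.g. after a restart

    def start(self, vlan, slice_id):
        # Creation message: arrives before any flow record for this VLAN.
        self._map[vlan] = slice_id

    def stop(self, vlan, slice_id):
        # Deletion message: arrives after the last flow record for this VLAN.
        if self._map.get(vlan) == slice_id:
            del self._map[vlan]

    def lookup(self, vlan):
        if vlan not in self._map and self._query_srm is not None:
            slice_id = self._query_srm(vlan)
            if slice_id is not None:
                self._map[vlan] = slice_id
        return self._map.get(vlan)
```

The check in `stop` guards against a stale delete removing a VLAN that has since been reassigned; whether that can happen depends on the srm ordering guarantees listed above.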

19 - Flow Stats Module – John DeHart and James Moscola

Issues and Notes (continued)

rsync of flow record files from GPE(s) to CP
» A particular run of rsync may get a file that is still being written to by fprobe-ulog on the GPE
    A subsequent rsync may get the file again with additional records in it.
» Sample rsync command:
    rsync --timeout 15 -avzu -e "ssh -i /vservers/plc1/etc/planetlab/root_ssh_key.rsa " root@drn02:/vservers/pl_netflow/pf /root/pf
    This will report the files that have been copied over

20 - Flow Stats Module – John DeHart and James Moscola

Issues and Notes (continued)

Sample fprobe-ulog command:
» /sbin/fprobe-ulog -M -e 3600 -d 3600 -E 60 -T 168 -f pf2 -q 1000 -s 30 -D 250000
» Started from /etc/rc.d/rc[2345].d/S56fprobe-ulog
    All linked to /etc/init.d/fprobe-ulog

GPE Flow record collection daemon: fprobe-ulog
» Scan thread
    Collects flow records into a linked list
» Emit thread
    Periodically writes flow records out to a file
    – Every 600 seconds – ten minutes!
» Daemon can also send flow records to a remote collector!
    So we could have the GPEs emit their flow records directly to the flow stats daemon on the CP.
    Sample command:
    /sbin/fprobe-ulog -M -e 3600 -d 3600 -E 60 -T 168 -f pf2 -q 1000 -s 30 -D 250000 <remote>:<port>[/[<local>][/<type>]] …
    There can be multiple remote host specifications
    Where
    – remote: remote host to send to
    – port: destination port to send to
    – local: local hostname to use
    – type: m for mirror-site, r for rotate-site
    – send to all mirror-sites, rotate through rotate-sites.

21 - Flow Stats Module – John DeHart and James Moscola

SPP PlanetFlow

[Diagram. Labels shown: Ingress XScale, Egress XScale, MEs (HF, LK, FS2), SCR, SRAM, scd, NAT Scratch Rings, Flow Stats SRAM Ring, CP, natd, fsd, srm, Ext PF DB, GPE (x2), fprobe (x2), rsync, Central Archive.]

Central Archive Record = <time, sliceID, Proto, SrcIP, SrcPort, DstIP, DstPort, PktCnt, ByteCnt>
Ext PF DB Record = <Central Archive Record>

22 - Flow Stats Module – John DeHart and James Moscola

Plan/Design

Flow Stats daemon, fsd, runs on CP
» Collects flow records from GPE(s) and NPE(s) and writes them into a series of PlanetFlow2 files with names:
    pf2.#, where # is (0-162)
    Current file is closed after N minutes and # is incremented and new file is opened and started.
    – This mimics what fprobe-ulog does now on the GPE(s)
    These files are then collected periodically by PLC for use and archiving
    – I don't think there is any explicit indication that PLC has picked up the files but the timing must be such that we know it is done before we roll over the file names and overwrite an old file.
» Gets NAT data from natd
    Keep records of this with timestamps so we can correlate with flow records coming from GPE(s)
    Keep NAT records on a per Src IP Address basis.
    – One set of NAT records per external interface
    Check with Mart on how this will work
» Gets VLAN to sliceID data from srm
    srm will send start translation, stop translation msgs with a 10 minute wait period when stopping a translation to make sure we are done with flow records for that slice
    – FS ME archives records every 5 minutes.
    – Slices are long lived (right?) so this should not be a problem
    Fsd can also request a translation from srm
    – This is in case fsd has to be restarted while srm and other daemons continue running.
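The pf2.# naming scheme amounts to a counter that wraps after 162. A small sketch, assuming the range 0-162 is inclusive (163 names) and using a hypothetical helper name:

```python
NUM_PF2_FILES = 163   # pf2.0 through pf2.162, assuming an inclusive range


def next_pf2_name(current_index):
    """Return (index, file name) of the file to open after the current one."""
    nxt = (current_index + 1) % NUM_PF2_FILES
    return nxt, f"pf2.{nxt}"
```

After pf2.162 the next name is pf2.0 again, which is the point where an old file is overwritten and where PLC must already have collected it.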

23 - Flow Stats Module – John DeHart and James Moscola

Plan/Design (continued)

Fsd gathers records from GPE(s) and NPE(s)
» Gathers flow records from GPE(s) via socket(s) from fprobe-ulog on GPE(s)
    Come across as one data packet with up to 30 flow records
    – Packet is padded out to full 30 flow records with Count in Header indicating how many of them are valid
    Update NetFlow header to indicate that this is an SPP and which SPP node it is using Engine Type and Engine ID fields
    Update with NAT data and write immediately out to current pf2 file keeping its NetFlow header.
» Gathers flow records from NPE(s) via socket from scd on XScale
    Come across one flow record at a time
    – No NetFlow Header
    Create NetFlow Header
    – With appropriate Uptime and UnixTime (secs, nsecs)
    – With SPP Engine Type and SPP Engine ID
    – Modify Flow Record times to be msecs correlated with Uptime
    Update NPE flow record with SliceID from srm.
    Collect NPE records for a period of time or until we get 30 and then write them out to current pf2 file with NetFlow header.
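For the NPE path, fsd has to build the NetFlow header itself and pad each chunk to 30 records. A sketch of that step, assuming 24-byte headers and 48-byte records as in the raw data slides; the function name and parameters are hypothetical, and zero-filled padding is an assumption.

```python
import struct

HEADER = struct.Struct("!HHIIIIBBH")   # 24-byte NetFlow header
RECORD_SIZE = 48                        # bytes per flow record
RECORDS_PER_CHUNK = 30


def build_chunk(records, uptime_ms, unix_secs, unix_nsecs,
                flow_sequence, engine_type, engine_id):
    """Return header + records, zero-padded to a full 30-record chunk."""
    if len(records) > RECORDS_PER_CHUNK:
        raise ValueError("at most 30 flow records per chunk")
    if any(len(r) != RECORD_SIZE for r in records):
        raise ValueError("every flow record must be 48 bytes")
    header = HEADER.pack(5, len(records), uptime_ms, unix_secs, unix_nsecs,
                         flow_sequence, engine_type, engine_id, 0)
    padding = bytes((RECORDS_PER_CHUNK - len(records)) * RECORD_SIZE)
    return header + b"".join(records) + padding
```

Every chunk is then 24 + 30 * 48 = 1464 bytes, and the Count field tells a reader how many of the 30 slots hold valid records.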

24 - Flow Stats Module – John DeHart and James Moscola

Plan/Design (continued)

FS ME and scd
» Use a command field in records coming across from FS ME to scd
» Use one command to set current time
    When FS ME is starting an archive cycle, first it sends a timestamp command
    When scd gets this timestamp command it associates it with a gettimeofday() time and sends the FS ME time and the gettimeofday() time to the fsd on the CP so it can associate ME times with Unix times.
» Use another command to indicate flow records
    Flow records can be sent directly on to fsd on CP
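On the fsd side, each timestamp command gives one (ME time, Unix time) pair, and later flow-record times can be converted relative to it. A sketch under stated assumptions: the class name is hypothetical, the ME timestamp is treated as a 64-bit tick count split into high and low words, and `ticks_per_second` is a placeholder because the slides do not give the IXP tick rate.

```python
class MeClockAnchor:
    """Maps FS ME tick timestamps to Unix time using the latest anchor pair."""

    def __init__(self, ticks_per_second):
        self.ticks_per_second = ticks_per_second   # assumed, not from the slides
        self.me_ticks = None
        self.unix_time = None

    @staticmethod
    def _combine(ts_high, ts_low):
        return (ts_high << 32) | ts_low

    def on_timestamp_command(self, ts_high, ts_low, unix_time):
        # scd pairs the ME timestamp with its gettimeofday() reading.
        self.me_ticks = self._combine(ts_high, ts_low)
        self.unix_time = unix_time

    def me_to_unix(self, ts_high, ts_low):
        if self.me_ticks is None:
            raise RuntimeError("no timestamp command received yet")
        delta = self._combine(ts_high, ts_low) - self.me_ticks
        return self.unix_time + delta / self.ticks_per_second
```

Re-anchoring at the start of every archive cycle keeps the drift between the IXP clock and Unix time bounded to what accumulates within one cycle.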

25 - Flow Stats Module – John DeHart and James Moscola

Data to fsd

srm -> fsd
» Start_vlan_to_sliceId_translate(vlan, sliceId)
» Stop_vlan_to_sliceId_translate(vlan, sliceId)

scd -> fsd
» Timestamp command
    ME Timestamp
    Unix time
» flowRecord(Saddr, Daddr, Sport, Dport, tcpFlags, VLAN, protocol, pktCnt, ByteCnt, startTimeStampHigh, startTimestampLow, endTimestampHigh, endTimestampLow)

26 - Flow Stats Module – John DeHart and James Moscola

Data to fsd (continued)

natd -> fsd
» startNatTranslation(Saddr, Daddr, internalPort, externalPort, protocol, srcMAC, timeStampHigh)
» stopNatTranslation(Saddr, Daddr, internalPort, externalPort, protocol, srcMAC, timeStampHigh)

gpe -> fsd
» NetFlow Header
    30 * NetFlow Flow Records

27 - Flow Stats Module – John DeHart and James Moscola

End

