OC Big Data Monthly Meetup #5 - Session 2 - Sumo Logic

Date post: 23-Jun-2015
Upload: big-data-joe-rossi
Description:
Mining Human-scale Insights from Log Data with Machine Learning --- David Andrzejewski - @davidandrzej Data Sciences Engineering, Sumo Logic
David Andrzejewski (@davidandrzej), Data Sciences Engineering, Sumo Logic. OC Big Data Meetup, September 17, 2014. Mining human-scale insights from log data with machine learning.
Transcript
Page 1: OC Big Data Monthly Meetup #5 - Session 2 - Sumo Logic

   

David Andrzejewski - @davidandrzej
Data Sciences Engineering, Sumo Logic
OC Big Data Meetup, September 17, 2014

Mining human-scale insights from log data with machine learning

Page 2: OC Big Data Monthly Meetup #5 - Session 2 - Sumo Logic

   

Page 3: OC Big Data Monthly Meetup #5 - Session 2 - Sumo Logic

   

Page 4: OC Big Data Monthly Meetup #5 - Session 2 - Sumo Logic

Logs

Page 5: OC Big Data Monthly Meetup #5 - Session 2 - Sumo Logic

The Problem We Solve

Human Generated: Orders, Blogs, Social Networks, HR, Inventory, Manufacturing

Machine Generated: Networks, Servers, Hypervisors; Security Devices, Desktops; Web Servers, Email; Applications, Mobile; Clickstream

Machine Data is the largest, fastest growing, most complex segment of Big Data.

[Chart: timeline, 2003 to 2015]

“More Logs Are Created In A Single Day Now Than in All of FY 2003,” Gartner

Page 6: OC Big Data Monthly Meetup #5 - Session 2 - Sumo Logic

Sumo Logic

“Turning Machine Data Into IT and Business Insights”

Learn, classify, predict

Search, monitor, visualize

Page 7: OC Big Data Monthly Meetup #5 - Session 2 - Sumo Logic

Use Cases

Availability & Performance

Customer Insights

Security and Compliance

Page 8: OC Big Data Monthly Meetup #5 - Session 2 - Sumo Logic

Monitoring and reporting

Page 9: OC Big Data Monthly Meetup #5 - Session 2 - Sumo Logic

Troubleshooting and root cause analysis

Custom App Code

Server / OS

Virtualization

Databases

Network

Open Source Software

Middleware

12/20/2011 17:23:44 PST [user=234fsf] failed transaction, sessionid:2F0A232324, [host=pay002.sjc] amount=1725.00

12/20/11 17:23:34 AMQ7163: WebSphere MQ job number 18429 started FOR client_session=2F0A232324.

12202011 17:23:27 /usr/local/build/mysql/libexec/mysqld: Abnormal shutdown [18429]

20-12-2011 17:23:19 database-host login[3866]: DEAD_PROCESS: 18429 ttys000

Dec 20, 2011 17:22:14,,, message=Created virtual machine user-3 on esxi01.office.thedomain.com

<134>Dec 20 2011 17:22:12: %PIX-6-106100: access-list inside_access_out denied tcp inside/68.162.72.163(4326) -> outside/45.200.244.124(3127) hit-cnt 1(first hit)

66.249.67.24 - - [20/Dec/2011:17:23:40 -0700] "POST /APP/Order.php HTTP/1.1" 304 146 "-" SESSION=2F0A232324

Customer ID Session ID

Job number

Process ID

Root cause!

Page 10: OC Big Data Monthly Meetup #5 - Session 2 - Sumo Logic

Anatomy of a log message: Five W’s

Page 11: OC Big Data Monthly Meetup #5 - Session 2 - Sumo Logic

Anatomy of a log message: Five W’s

• When? Timestamp with time zone

Page 12: OC Big Data Monthly Meetup #5 - Session 2 - Sumo Logic

Anatomy of a log message: Five W’s

• When? Timestamp with time zone
• Where? Host, module, code location

Page 13: OC Big Data Monthly Meetup #5 - Session 2 - Sumo Logic

Anatomy of a log message: Five W’s

• When? Timestamp with time zone
• Where? Host, module, code location
• Who? Authentication context

Page 14: OC Big Data Monthly Meetup #5 - Session 2 - Sumo Logic

Anatomy of a log message: Five W’s

• When? Timestamp with time zone
• Where? Host, module, code location
• Who? Authentication context
• What? Log level and key-value pairs
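As a rough illustration of the four fields listed on this slide, the sketch below formats one log line that carries a time-zone-aware timestamp, a location, an authentication context, and a level plus key-value pairs. The function name `format_log`, the field names, and the `module=checkout` value are assumptions made for this example and do not come from the talk; the host, user, and amount are borrowed from the sample log lines on the troubleshooting slide.

```python
from datetime import datetime, timezone


def format_log(when, host, module, user, level, **fields):
    """Build one log line covering When / Where / Who / What.

    `format_log` and its field names are hypothetical, chosen for this sketch.
    """
    # When: insist on an explicit time zone so the timestamp is unambiguous.
    if when.tzinfo is None:
        raise ValueError("timestamp must carry a time zone")
    # What: free-form key-value pairs, kept in the order they were passed.
    kv = " ".join(f"{key}={value}" for key, value in fields.items())
    return (
        f"{when.isoformat()} "             # When
        f"host={host} module={module} "    # Where
        f"user={user} "                    # Who
        f"level={level} {kv}"              # What
    )


if __name__ == "__main__":
    ts = datetime(2011, 12, 20, 17, 23, 44, tzinfo=timezone.utc)
    print(format_log(ts, "pay002.sjc", "checkout", "234fsf", "ERROR",
                     event="failed_transaction", amount="1725.00"))
```

Keeping every field as an explicit `key=value` pair is a design choice for the sketch: it makes the line easy to parse later without format-specific knowledge.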

Page 15: OC Big Data Monthly Meetup #5 - Session 2 - Sumo Logic

Inhuman scale

• Logs: like “computer tweets”
• Twitter 2013*
  • Peak @ ~144k TPS
  • Avg ~6k tweets / second
• Log data
  • Example: 1 TB / day
  • Avg ~25k logs / second

* https://blog.twitter.com/2013/new-tweets-per-second-record-and-how
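The two log-data figures on this slide can be cross-checked with a little arithmetic. The sketch below assumes a decimal terabyte (10^12 bytes) and an even spread over the day; the average message size it derives is implied by the slide's numbers, not stated in the talk.

```python
# Figures taken from the slide.
BYTES_PER_DAY = 1e12            # 1 TB / day (decimal terabyte is an assumption)
LOGS_PER_SECOND = 25_000        # avg ~25k logs / second
TWEETS_PER_SECOND = 6_000       # avg ~6k tweets / second

SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

# Derived quantities (not on the slide).
bytes_per_second = BYTES_PER_DAY / SECONDS_PER_DAY
implied_bytes_per_log = bytes_per_second / LOGS_PER_SECOND
logs_per_day = LOGS_PER_SECOND * SECONDS_PER_DAY
logs_vs_tweets = LOGS_PER_SECOND / TWEETS_PER_SECOND

if __name__ == "__main__":
    print(f"{bytes_per_second / 1e6:.1f} MB/s ingested")
    print(f"~{implied_bytes_per_log:.0f} bytes per log message")
    print(f"{logs_per_day:,} log messages per day")
    print(f"{logs_vs_tweets:.1f}x the average tweet rate")
```

Under these assumptions the example works out to roughly 11.6 MB/s, about 463 bytes per message, and about 2.16 billion messages per day.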

Page 16: OC Big Data Monthly Meetup #5 - Session 2 - Sumo Logic

Inhuman complexity

“A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable.” - Leslie Lamport

[Figure: London Underground map (Transport for London, December 2013)]

Page 17: OC Big Data Monthly Meetup #5 - Session 2 - Sumo Logic

All-too-human messiness and variety

• (wildly) varying formats
  • printf, JSON, XML, Windows, X-delimited, ...
• Specialized knowledge

   

[2008-05-07 09:50:08.450 'App' 3560 verbose] [VpxdHeartbeat] Invalid heartbeat from 10.17.218.46

Page 18: OC Big Data Monthly Meetup #5 - Session 2 - Sumo Logic

Q: how to get human-scale insights from log data?

Page 19: OC Big Data Monthly Meetup #5 - Session 2 - Sumo Logic

Q: how to get human-scale insights from log data?

A: machine learning (and friends)
• Unsupervised pattern discovery
• Anomaly / outlier detection
• Supervised classification
• Time-series data modeling
• Graph analysis
• Probabilistic data structures

Page 20: OC Big Data Monthly Meetup #5 - Session 2 - Sumo Logic

Too many logs! “data disorientation”

   

~60k results: 30 minutes, one component

Page 21: OC Big Data Monthly Meetup #5 - Session 2 - Sumo Logic

Unsupervised clustering
• Given: set of items
• Do: group similar items

Page 22: OC Big Data Monthly Meetup #5 - Session 2 - Sumo Logic

Unsupervised clustering
• Given: set of items
• Do: group similar items

Page 23: OC Big Data Monthly Meetup #5 - Session 2 - Sumo Logic

Distill logs down to underlying structure

Page 24: OC Big Data Monthly Meetup #5 - Session 2 - Sumo Logic

Results “compressed” ~1000x

   

Page 25: OC Big Data Monthly Meetup #5 - Session 2 - Sumo Logic

In the beginning, there was the printf()

printf("Health status check: %s is %s", hostid, hoststatus)

Log generation

Health status check: zim-5 is OK

Health status check: gir-3 is OK

Health status check: gir-2 is TIMED OUT

Health status check: dib-1 is OK

Page 26: OC Big Data Monthly Meetup #5 - Session 2 - Sumo Logic

Reverse engineering printf()

printf("Health status check: %s is %s", hostid, hoststatus)

Log generation

Health status check: zim-5 is OK

Health status check: gir-3 is OK

Health status check: gir-2 is TIMED OUT

Health status check: dib-1 is OK

“magic”

Health status check: *** is ***

Page 27: OC Big Data Monthly Meetup #5 - Session 2 - Sumo Logic

Unsupervised clustering
• Given: log messages
• Do: group by “signature”

1. Define string distance function
2. Do distance-based clustering
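The two steps on this slide can be sketched in a few lines. The talk does not say which distance or clustering algorithm Sumo Logic uses, so everything here is an assumption: the distance is one minus the token-level similarity ratio from Python's `difflib.SequenceMatcher`, the clustering is a greedy single pass, and the `0.5` threshold is arbitrary. Tokens that differ within a cluster are replaced by `***` to form the signature.

```python
from difflib import SequenceMatcher

WILDCARD = "***"


def token_distance(a, b):
    """Step 1: a string distance, here 1 - similarity over token lists."""
    return 1.0 - SequenceMatcher(None, a, b).ratio()


def merge_signature(signature, tokens):
    """Keep tokens both sequences share; wildcard every run that differs."""
    merged = []
    matcher = SequenceMatcher(None, signature, tokens)
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag == "equal":
            merged.extend(signature[i1:i2])
        elif not merged or merged[-1] != WILDCARD:
            merged.append(WILDCARD)  # collapse adjacent differences
    return merged


def cluster(lines, threshold=0.5):
    """Step 2: greedy distance-based clustering (an assumed, simple scheme)."""
    clusters = []  # each entry is [signature_tokens, message_count]
    for line in lines:
        tokens = line.split()
        best = None
        for entry in clusters:
            d = token_distance(entry[0], tokens)
            if d <= threshold and (best is None or d < best[0]):
                best = (d, entry)
        if best is None:
            clusters.append([tokens, 1])       # start a new cluster
        else:
            entry = best[1]
            entry[0] = merge_signature(entry[0], tokens)
            entry[1] += 1
    return [(" ".join(sig), count) for sig, count in clusters]


if __name__ == "__main__":
    logs = [
        "Health status check: zim-5 is OK",
        "Health status check: gir-3 is OK",
        "Health status check: gir-2 is TIMED OUT",
        "Health status check: dib-1 is OK",
        "Request processed",
    ]
    for signature, count in cluster(logs):
        print(count, signature)
```

On the health-check messages from the earlier slides this recovers the signature `Health status check: *** is ***`, while the unrelated message stays in its own cluster.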

Page 28: OC Big Data Monthly Meetup #5 - Session 2 - Sumo Logic

Drill-down into the original raw logs

   

Page 29: OC Big Data Monthly Meetup #5 - Session 2 - Sumo Logic

Partially supervised clustering
• Given: set of items + side info
• Do: group similar items

Page 30: OC Big Data Monthly Meetup #5 - Session 2 - Sumo Logic

Partially supervised clustering
• Given: set of items + side info
• Do: group similar items

Page 31: OC Big Data Monthly Meetup #5 - Session 2 - Sumo Logic

Too many wildcards!

Page 32: OC Big Data Monthly Meetup #5 - Session 2 - Sumo Logic

“Hint” from human user

Page 33: OC Big Data Monthly Meetup #5 - Session 2 - Sumo Logic

Not enough wildcards!

Page 34: OC Big Data Monthly Meetup #5 - Session 2 - Sumo Logic

“Hint” from human user

Page 35: OC Big Data Monthly Meetup #5 - Session 2 - Sumo Logic

unknown unknowns

Page 36: OC Big Data Monthly Meetup #5 - Session 2 - Sumo Logic

Outlier detection
• Given: data points
• Do: identify outliers

Page 37: OC Big Data Monthly Meetup #5 - Session 2 - Sumo Logic

Outlier detection
• Given: data points
• Do: identify outliers

Page 38: OC Big Data Monthly Meetup #5 - Session 2 - Sumo Logic

Anomaly detection
• Given: log data
• Do: flag anomalies

Health check OK

Request processed

Txn timeout, retry

Page 39: OC Big Data Monthly Meetup #5 - Session 2 - Sumo Logic

Anomaly detection
• Given: log data
• Do: flag anomalies

Health check OK

Request processed

Txn timeout, retry

Page 40: OC Big Data Monthly Meetup #5 - Session 2 - Sumo Logic

Investigate and annotate events

logs

signatures

RAW DATA

HUMAN

Page 41: OC Big Data Monthly Meetup #5 - Session 2 - Sumo Logic

Investigate and annotate events

logs

signatures

RAW DATA

HUMAN

Page 42: OC Big Data Monthly Meetup #5 - Session 2 - Sumo Logic

Investigate and annotate events

logs

signatures

event

RAW DATA

HUMAN

Page 43: OC Big Data Monthly Meetup #5 - Session 2 - Sumo Logic

Investigate and annotate events

logs

signatures

event

RAW DATA

HUMAN

timeline / alerts

Page 44: OC Big Data Monthly Meetup #5 - Session 2 - Sumo Logic

Supervised classification
• Given: labeled data points
• Do: predict future labels

Page 45: OC Big Data Monthly Meetup #5 - Session 2 - Sumo Logic

Supervised classification
• Given: labeled data points
• Do: predict future labels

Page 46: OC Big Data Monthly Meetup #5 - Session 2 - Sumo Logic

Supervised classification
• Given: log data, annotated events
• Do: classify new occurrences

event

timeline / alerts

Page 47: OC Big Data Monthly Meetup #5 - Session 2 - Sumo Logic

Connected components
• Given: nodes/edges
• Do: identify component

User action webID=7F92
Initiating requestID=082A for webID=7F92 …
… orderID=34C8 received for requestID=082A …
Retrieving userID=11D2 for requestID=082A …
… accountID=1234 access, userID=11D2 …
ERROR accountID=1234 not found!
PROCESSING FAILED: webID=79F92

Page 48: OC Big Data Monthly Meetup #5 - Session 2 - Sumo Logic

User action webID=7F92

Page 49: OC Big Data Monthly Meetup #5 - Session 2 - Sumo Logic

User action webID=7F92
Initiating requestID=082A for webID=7F92 …

Page 50: OC Big Data Monthly Meetup #5 - Session 2 - Sumo Logic

User action webID=7F92
Initiating requestID=082A for webID=7F92 …
… orderID=34C8 received for requestID=082A …

Page 51: OC Big Data Monthly Meetup #5 - Session 2 - Sumo Logic

User action webID=7F92
Initiating requestID=082A for webID=7F92 …
… orderID=34C8 received for requestID=082A …
Retrieving userID=11D2 for requestID=082A …

Page 52: OC Big Data Monthly Meetup #5 - Session 2 - Sumo Logic

User action webID=7F92
Initiating requestID=082A for webID=7F92 …
… orderID=34C8 received for requestID=082A …
Retrieving userID=11D2 for requestID=082A …
… accountID=1234 access, userID=11D2 …

Page 53: OC Big Data Monthly Meetup #5 - Session 2 - Sumo Logic

User action webID=7F92
Initiating requestID=082A for webID=7F92 …
… orderID=34C8 received for requestID=082A …
Retrieving userID=11D2 for requestID=082A …
… accountID=1234 access, userID=11D2 …
ERROR accountID=1234 not found!
PROCESSING FAILED: webID=79F92
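One way to read this sequence of slides: treat every `key=value` identifier as a node, link identifiers that appear in the same log line, and collect the connected component that contains the failing ID. The sketch below does this with a small union-find. The regular expression, the function names, and the extra unrelated `webID=AAAA` line are assumptions for illustration; only the first six log lines come from the slide (the final `PROCESSING FAILED` line is left out because its ID is printed differently there).

```python
import re
from collections import defaultdict

# Assumed convention: identifiers look like someID=value.
ID_PATTERN = re.compile(r"\b(\w+ID)=(\w+)")


def find(parent, node):
    """Return the representative of node's component (with path halving)."""
    while parent[node] != node:
        parent[node] = parent[parent[node]]
        node = parent[node]
    return node


def union(parent, a, b):
    """Merge the components containing a and b."""
    root_a, root_b = find(parent, a), find(parent, b)
    if root_a != root_b:
        parent[root_b] = root_a


def components(lines):
    """Group log lines whose identifiers are transitively linked."""
    parent = {}
    ids_per_line = []
    for line in lines:
        ids = ID_PATTERN.findall(line)      # e.g. [("requestID", "082A"), ...]
        for node in ids:
            parent.setdefault(node, node)
        for other in ids[1:]:
            union(parent, ids[0], other)    # same line => same component
        ids_per_line.append(ids)
    groups = defaultdict(list)
    for line, ids in zip(lines, ids_per_line):
        if ids:
            groups[find(parent, ids[0])].append(line)
    return list(groups.values())


if __name__ == "__main__":
    logs = [
        "User action webID=7F92",
        "Initiating requestID=082A for webID=7F92",
        "orderID=34C8 received for requestID=082A",
        "Retrieving userID=11D2 for requestID=082A",
        "accountID=1234 access, userID=11D2",
        "ERROR accountID=1234 not found!",
        "User action webID=AAAA",           # unrelated line, added for contrast
    ]
    for group in components(logs):
        print(group)
```

The error line never mentions `webID=7F92` directly, yet it lands in the same component through the chain accountID, userID, requestID, webID.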

Page 54: OC Big Data Monthly Meetup #5 - Session 2 - Sumo Logic

Time-series detection
• Given: time-series metric data
• Do: identify unusual data pts

Page 55: OC Big Data Monthly Meetup #5 - Session 2 - Sumo Logic

Time-series detection
• Given: time-series metric data
• Do: identify unusual data pts

Level change

Page 56: OC Big Data Monthly Meetup #5 - Session 2 - Sumo Logic

Time-series detection
• Given: time-series metric data
• Do: identify unusual data pts

Level change

Spikes

Page 57: OC Big Data Monthly Meetup #5 - Session 2 - Sumo Logic

“Bollinger bands” – rolling window approach

µ ± 3σ
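A minimal sketch of the rolling-window idea: for each new point, compute the mean µ and standard deviation σ of the previous `window` points and flag the point if it falls outside µ ± 3σ. The window length, the use of the population standard deviation, and the function name are assumptions; the slide gives only the band formula.

```python
from collections import deque
from statistics import mean, pstdev


def bollinger_outliers(values, window=20, k=3.0):
    """Flag points outside mean +/- k * stddev of the preceding window.

    Returns a list of (index, value) pairs. Points seen before the window
    has filled are never flagged.
    """
    recent = deque(maxlen=window)   # sliding window of the last `window` points
    flagged = []
    for index, value in enumerate(values):
        if len(recent) == window:
            mu = mean(recent)
            sigma = pstdev(recent)
            if abs(value - mu) > k * sigma:
                flagged.append((index, value))
        recent.append(value)        # the point joins the window either way
    return flagged


if __name__ == "__main__":
    series = [10, 11, 9, 10, 12, 10, 9, 11, 10, 10, 50, 10, 11]
    print(bollinger_outliers(series, window=10))
```

Because a flagged spike still enters the window, it widens the band for the following points. That is adequate for isolated spikes; a sustained level change would stop being flagged once the window has absorbed it.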

Page 58: OC Big Data Monthly Meetup #5 - Session 2 - Sumo Logic

Top-K identification
• Given: stream of observations
• Do: identify k most frequent (WITH FIXED MEMORY!)

Page 59: OC Big Data Monthly Meetup #5 - Session 2 - Sumo Logic

Top-K identification
• Given: stream of observations
• Do: identify k most frequent (WITH FIXED MEMORY!)

...

Page 60: OC Big Data Monthly Meetup #5 - Session 2 - Sumo Logic

Top-K identification
• Given: stream of observations
• Do: identify k most frequent (WITH FIXED MEMORY!)

... 4 3 2 2

Page 61: OC Big Data Monthly Meetup #5 - Session 2 - Sumo Logic

Top-K identification
• Given: stream of observations
• Do: identify k most frequent (WITH FIXED MEMORY!)

... 4 3 2 2

TOP 2

Page 62: OC Big Data Monthly Meetup #5 - Session 2 - Sumo Logic

Top-K identification
• Given: stream of observations
• Do: identify k most frequent (WITH FIXED MEMORY!)

Count-Min Sketch (Cormode & Muthukrishnan, 2003)

Page 63: OC Big Data Monthly Meetup #5 - Session 2 - Sumo Logic

Top-K identification
• Given: stream of observations
• Do: identify k most frequent (WITH FIXED MEMORY!)

Count-Min Sketch (Cormode & Muthukrishnan, 2003)

Page 64: OC Big Data Monthly Meetup #5 - Session 2 - Sumo Logic

Top-K identification
• Given: stream of observations
• Do: identify k most frequent (WITH FIXED MEMORY!)

Count-Min Sketch (Cormode & Muthukrishnan, 2003)
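The sketch below shows how a Count-Min Sketch keeps approximate counts in a fixed-size table: each item increments one counter per row, and the estimate is the minimum over its counters, which can overcount but never undercount. Pairing the sketch with a small candidate table to report the top k is a common approach but is an assumption here, as are the table dimensions and the use of salted BLAKE2b as the hash family; the slides name only the data structure.

```python
import hashlib


class CountMinSketch:
    """Fixed-memory frequency estimator: depth rows of width counters."""

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth          # must stay <= 256 for the one-byte salt
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        # One independent-looking hash per row, via a per-row salt.
        for row in range(self.depth):
            digest = hashlib.blake2b(
                item.encode("utf-8"), digest_size=8, salt=bytes([row])
            ).digest()
            yield row, int.from_bytes(digest, "big") % self.width

    def add(self, item, count=1):
        for row, col in self._cells(item):
            self.table[row][col] += count

    def estimate(self, item):
        # Collisions only inflate counters, so the minimum is the best guess.
        return min(self.table[row][col] for row, col in self._cells(item))


def top_k(stream, k, sketch=None):
    """Track the k items with the largest estimated counts."""
    if sketch is None:
        sketch = CountMinSketch()
    candidates = {}                 # item -> estimated count, at most k entries
    for item in stream:
        sketch.add(item)
        estimate = sketch.estimate(item)
        if item in candidates or len(candidates) < k:
            candidates[item] = estimate
        else:
            weakest = min(candidates, key=candidates.get)
            if estimate > candidates[weakest]:
                del candidates[weakest]
                candidates[item] = estimate
    return sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    # Counts mirror the slide: 4, 3, 2, 2 occurrences of four distinct items.
    print(top_k("abcdabcdaba", k=2))
```

Memory use is `width * depth` counters plus `k` candidates, regardless of how long the stream runs.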

Page 65: OC Big Data Monthly Meetup #5 - Session 2 - Sumo Logic

Cardinality estimation
• Given: stream of observations
• Do: identify number of distinct items (WITH FIXED MEMORY!)

Page 66: OC Big Data Monthly Meetup #5 - Session 2 - Sumo Logic

Cardinality estimation
• Given: stream of observations
• Do: identify number of distinct items (WITH FIXED MEMORY!)

...

Page 67: OC Big Data Monthly Meetup #5 - Session 2 - Sumo Logic

Cardinality estimation
• Given: stream of observations
• Do: identify number of distinct items (WITH FIXED MEMORY!)

...

|{ , , , }| = 4

Page 68: OC Big Data Monthly Meetup #5 - Session 2 - Sumo Logic

Cardinality estimation
• Given: stream of observations
• Do: identify number of distinct items (WITH FIXED MEMORY!)

HyperLogLog (Flajolet et al, 2007)
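A compact sketch of HyperLogLog follows. Each item is hashed to 64 bits; the first `p` bits choose one of `2**p` registers, and the register remembers the longest run of leading zeros seen in the remaining bits. The estimate combines the registers with a harmonic mean, plus the usual small-range correction. The choice of BLAKE2b as the hash, the default `p`, and the class layout are assumptions; this is not the speaker's implementation.

```python
import hashlib
import math


class HyperLogLog:
    """Fixed-memory distinct-count estimator with 2**p registers (p >= 7)."""

    def __init__(self, p=10):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8).digest()
        h = int.from_bytes(digest, "big")          # 64-bit hash value
        index = h >> (64 - self.p)                 # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)      # remaining 64 - p bits
        # Rank = 1-based position of the leftmost 1-bit in `rest`.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def count(self):
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)           # bias constant for m >= 128
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:
            # Small-range correction: fall back to linear counting.
            return m * math.log(m / zeros)
        return raw


if __name__ == "__main__":
    hll = HyperLogLog(p=12)
    for i in range(20000):
        hll.add(f"user-{i}")
        hll.add(f"user-{i}")                       # duplicates change nothing
    print(round(hll.count()))
```

With `p=12` the structure holds 4096 small registers, and the typical relative error is about 1.04 / sqrt(4096), roughly 1.6%. Two such structures built on different streams can be combined by taking the register-wise maximum.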

Page 69: OC Big Data Monthly Meetup #5 - Session 2 - Sumo Logic

Hooray! Monoid homomorphism!

logs

logs

logs

f(s1 + s2) = f(s1) ⊕ f(s2)
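The identity on this slide says that summarizing two batches of logs separately and then combining the summaries gives the same result as summarizing the concatenated batch, which is what lets the work be split across machines. The slide does not say which `f` is meant; as a minimal illustration, the sketch below uses exact structures from the standard library, with `+` as list concatenation and `⊕` as `Counter` addition or set union. The function names and sample lines are assumptions.

```python
from collections import Counter


def count_by_first_token(lines):
    """f: summarize a batch of log lines as counts of their first token."""
    return Counter(line.split()[0] for line in lines if line.strip())


def distinct_lines(lines):
    """Another f: the set of distinct lines in a batch."""
    return set(lines)


if __name__ == "__main__":
    s1 = ["ERROR disk full", "INFO started", "INFO ready"]
    s2 = ["INFO request served", "ERROR timeout"]

    # f(s1 + s2) == f(s1) (+) f(s2), with (+) = Counter addition.
    assert count_by_first_token(s1 + s2) == (
        count_by_first_token(s1) + count_by_first_token(s2)
    )
    # Same shape with (+) = set union.
    assert distinct_lines(s1 + s2) == distinct_lines(s1) | distinct_lines(s2)
    # The empty batch maps to the identity element of each monoid.
    assert count_by_first_token([]) == Counter()
    assert distinct_lines([]) == set()
    print(count_by_first_token(s1 + s2))
```

The fixed-memory structures from the preceding slides combine the same way: Count-Min Sketch tables of equal shape add cell by cell, and HyperLogLog registers merge by taking the maximum.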

Page 70: OC Big Data Monthly Meetup #5 - Session 2 - Sumo Logic

< FINAL OBLIGATORY PLUG >

freesumo.com

