Analyzing Big Data at Twitter Presentation 1

Date post: 06-Apr-2018

Transcript
  • 8/3/2019 Analyzing Big Data at Twitter Presentation 1

    1/75

    Web 2.0 Expo, 2010

    Kevin Weil @kevinweil

    Analyzing Big Data at Twitter

  • 8/3/2019 Analyzing Big Data at Twitter Presentation 1

    2/75

    Three Challenges

    Collecting Data

    Large-Scale Storage and Analysis

    Rapid Learning over Big Data

  • 8/3/2019 Analyzing Big Data at Twitter Presentation 1

    3/75

    My Background

    Studied Mathematics and Physics at Harvard, Physics at Stanford

    Tropos Networks (city-wide wireless): GBs of data

    Cooliris (web media): Hadoop for analytics, TBs of data

    Twitter: Hadoop, Pig, machine learning, visualization, social graph analysis, PBs of data

  • 8/3/2019 Analyzing Big Data at Twitter Presentation 1

    4/75

    Three Challenges

    Collecting Data

    Large-Scale Storage and Analysis

    Rapid Learning over Big Data

  • 8/3/2019 Analyzing Big Data at Twitter Presentation 1

    9/75

    Data, Data Everywhere

    You guys generate a lot of data

    Anybody want to guess?

    12 TB/day (4+ PB/yr)

    20,000 CDs

    10 million floppy disks

    450 GB while I give this talk
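The figures on this slide can be sanity-checked with a few lines of arithmetic. The sketch below assumes decimal units, 700 MB per CD, and 1.44 MB per floppy disk; none of these capacities are stated on the slide, so the slide's "20,000 CDs" and "10 million floppy disks" should be read as round, order-of-magnitude numbers.

```python
# Back-of-the-envelope check of the slide's figures.
# Assumptions (not from the slide): decimal units, 700 MB CDs, 1.44 MB floppies.
TB = 10**12
GB = 10**9
MB = 10**6

daily_bytes = 12 * TB

pb_per_year = daily_bytes * 365 / 10**15
cds_per_day = daily_bytes / (700 * MB)
floppies_per_day = daily_bytes / (1.44 * MB)

# How long it takes for 450 GB to accumulate at 12 TB/day, in minutes.
minutes_for_450_gb = 450 * GB / daily_bytes * 24 * 60

print(f"{pb_per_year:.2f} PB/year")
print(f"{cds_per_day:,.0f} CDs/day")
print(f"{floppies_per_day:,.0f} floppies/day")
print(f"{minutes_for_450_gb:.0f} minutes to accumulate 450 GB")
```

Under these assumptions the yearly volume comes to about 4.4 PB, which matches "4+ PB/yr", and 450 GB corresponds to a little under an hour of traffic.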

  • 8/3/2019 Analyzing Big Data at Twitter Presentation 1

    11/75

    Syslog?

    Started with syslog-ng

    As our volume grew, it didn't scale

    Resources overwhelmed

    Lost data

  • 8/3/2019 Analyzing Big Data at Twitter Presentation 1

    12/75

    Scribe

    Surprise! Facebook had the same problem, built and open-sourced Scribe

    Log collection framework over Thrift

    You scribe log lines, with categories

    It does the rest

  • 8/3/2019 Analyzing Big Data at Twitter Presentation 1

    15/75

    Scribe

    Runs locally; reliable in network outage

    Nodes only know downstream writer; hierarchical, scalable

    Pluggable outputs

    [Diagram: FE, FE, FE -> Agg, Agg -> HDFS, File]
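The properties listed on this slide (local buffering during an outage, nodes that know only their downstream writer, pluggable outputs) can be illustrated with a toy model. This is a minimal sketch and not Scribe's actual API: real Scribe is a Thrift service, and the `Node` and `MemorySink` classes below are hypothetical names invented for the illustration.

```python
from collections import defaultdict, deque


class Node:
    """Toy store-and-forward node that knows only its downstream writer."""

    def __init__(self, downstream):
        # downstream is any callable(category, message) that raises
        # ConnectionError when it cannot accept the entry.
        self.downstream = downstream
        self.buffer = deque()  # entries held locally during an outage

    def log(self, category, message):
        self.buffer.append((category, message))
        self.flush()

    def flush(self):
        while self.buffer:
            category, message = self.buffer[0]
            try:
                self.downstream(category, message)
            except ConnectionError:
                return  # keep the entry buffered and retry on the next flush
            self.buffer.popleft()


class MemorySink:
    """Stand-in for a pluggable output such as a file or HDFS writer."""

    def __init__(self):
        self.by_category = defaultdict(list)
        self.available = True

    def __call__(self, category, message):
        if not self.available:
            raise ConnectionError("sink unreachable")
        self.by_category[category].append(message)


sink = MemorySink()
aggregator = Node(downstream=sink)
frontend = Node(downstream=aggregator.log)  # the front end never sees the sink

frontend.log("web", "GET /home 200")
sink.available = False                   # simulate an outage at the output
frontend.log("web", "GET /search 200")   # held in the aggregator's buffer
pending_during_outage = len(aggregator.buffer)
sink.available = True
aggregator.flush()                       # buffered entry is delivered in order

print(pending_during_outage, sink.by_category["web"])
```

Because each node only holds a reference to the next hop, adding another aggregation tier means wrapping one more `Node` around the existing chain, which is the sense in which the layout is hierarchical.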

  • 8/3/2019 Analyzing Big Data at Twitter Presentation 1

    16/75

    Scribe at Twitter

    Solved our problem, opened new vistas

    Currently 40 different categories logged from JavaScript, Ruby, Scala, Java, etc.

    We improved logging, monitoring, behavior during failure conditions, writes to HDFS, etc.

    Continuing to work with Facebook to make it better

    http://github.com/traviscrawford/scribe

  • 8/3/2019 Analyzing Big Data at Twitter Presentation 1

    17/75

    Three Challenges

    Collecting Data

    Large-Scale Storage and Analysis

    Rapid Learning over Big Data

  • 8/3/2019 Analyzing Big Data at Twitter Presentation 1

    21/75

    How Do You Store 12 TB/day?

    Single machine?

    What's hard drive write speed?

    ~80 MB/s

    42 hours to write 12 TB

    Uh oh.
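The "42 hours" figure follows directly from the two numbers on the slide. The sketch below reproduces it, assuming decimal units and a sustained sequential write at the quoted rate; the drive count is only a lower bound, since it ignores replication, reads, and seek overhead.

```python
import math

data_bytes = 12 * 10**12          # 12 TB arriving per day
write_bytes_per_sec = 80 * 10**6  # ~80 MB/s, the slide's single-drive estimate

hours = data_bytes / write_bytes_per_sec / 3600
# Fewest drives writing in parallel that could absorb a day's data within a day.
min_drives = math.ceil(hours / 24)

print(f"{hours:.1f} hours on one drive, at least {min_drives} drives to keep up")
```

One drive needs roughly 41.7 hours to absorb 24 hours of data, so a single machine falls further behind every day.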

  • 8/3/2019 Analyzing Big Data at Twitter Presentation 1

    22/75

    Where Do I Put 12 TB/day?

    Need a cluster of machines

    ... which adds new layers of complexity

  • 8/3/2019 Analyzing Big Data at Twitter Presentation 1

    24/75

    Hadoop

    Distributed file system

      Automatic replication

      Fault tolerance

      Transparently read/write across multiple machines

    MapReduce-based parallel computation

      Key-value based computation interface allows for wide applicability

      Fault tolerance, again

  • 8/3/2019 Analyzing Big Data at Twitter Presentation 1

    25/75

    Hadoop

    Open source: top-level Apache project

    Scalable: Yahoo! has a 4000 node cluster

    Powerful: sorted 1TB random integers in 62 seconds

    Easy packaging/install: free Cloudera RPMs

  • 8/3/2019 Analyzing Big Data at Twitter Presentation 1

    26/75

    MapReduce Workflow

    Challenge: how many tweets per user, given tweets table?

    Input: key=row, value=tweet info

    Map: output key=user_id, value=1

    Shuffle: sort by user_id

    Reduce: for each user_id, sum

    Output: user_id, tweet count

    With 2x machines, runs 2x faster

    [Diagram: Inputs -> Map (x7) -> Shuffle/Sort -> Reduce (x3) -> Outputs]
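The workflow above can be mimicked in a single process to make the data flow concrete. This is a minimal sketch in plain Python rather than Hadoop code; the function names and the sample `tweets` table are made up for the illustration.

```python
from collections import defaultdict


def map_tweet(row_key, tweet):
    """Map: emit (user_id, 1) for every tweet row."""
    yield tweet["user_id"], 1


def shuffle(pairs):
    """Shuffle: group values by key and sort by user_id."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_counts(user_id, values):
    """Reduce: sum the ones emitted for a single user_id."""
    return user_id, sum(values)


# Hypothetical input: key=row, value=tweet info.
tweets = {
    0: {"user_id": 12, "text": "hello"},
    1: {"user_id": 7, "text": "first"},
    2: {"user_id": 12, "text": "again"},
}

mapped = [pair for key, tweet in tweets.items() for pair in map_tweet(key, tweet)]
counts = [reduce_counts(user_id, values) for user_id, values in shuffle(mapped)]

print(counts)  # [(7, 1), (12, 2)]
```

Each map call looks at one row and each reduce call looks at one key, which is why the work can be spread across machines without the tasks coordinating with each other.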

  • 8/3/2019 Analyzing Big Data at Twitter Presentation 1

    34/75

    Two Analysis Challenges

    Compute mutual followings in Twitter's interest graph

    grep, awk? No way.

    If data is in MySQL... self join on an n-billion row table?

    n,000,000,000 x n,000,000,000 = ?

    I don't know either.
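One common way to avoid the self join is to key every follow edge by its unordered pair of users and keep the pairs that were seen in both directions, which fits the map and reduce shape. The sketch below is an illustration of that idea in plain Python, not the method used in the talk; the function names and the sample edges are hypothetical.

```python
from collections import defaultdict


def map_edge(follower, followee):
    """Map: key each edge by its unordered pair, remembering who follows."""
    pair = (min(follower, followee), max(follower, followee))
    yield pair, follower


def mutual_follows(edges):
    """Reduce: a pair is mutual when both members appear as the follower."""
    followers_by_pair = defaultdict(set)
    for follower, followee in edges:
        for pair, source in map_edge(follower, followee):
            followers_by_pair[pair].add(source)
    return sorted(pair for pair, sources in followers_by_pair.items()
                  if len(sources) == 2)


# Hypothetical (follower, followee) edges.
edges = [(1, 2), (2, 1), (1, 3), (3, 4), (4, 3), (2, 3)]

print(mutual_follows(edges))  # [(1, 2), (3, 4)]
```

Every edge is read once and grouped by key, so the cost grows with the number of edges rather than with the square of the table size.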

  • 8/3/2019 Analyzing Big Data at Twitter Presentation 1

    35/75

    Two Analysis Challenges

    Large-scale grouping and counting

    select count(*) from users? maybe.

    select count(*) from tweets? uh...

    Imagine joining these two.

    And grouping.

    And sorting.

  • 8/3/2019 Analyzing Big Data at Twitter Presentation 1

    37/75

    Back to Hadoop

    Didn't we have a cluster of machines?

    Hadoop makes it easy to distribute the calculation

    Purpose-built for parallel calculation

    Just a slight mindset adjustment

    But a fun one!

  • 8/3/2019 Analyzing Big Data at Twitter Presentation 1

    38/75

    Analysis at Scale

    Now we're rolling

    Count all tweets: 20+ billion, 5 minutes

    Parallel network calls to FlockDB to compute interest graph aggregates

    Run PageRank across users and interest graph

  • 8/3/2019 Analyzing Big Data at Twitter Presentation 1

    39/75

    But...

    Analysis typically in Java

    Single-input, two-stage data flow is rigid

    Projections, filters: custom code

    Joins lengthy, error-prone

    n-stage jobs hard to manage

    Data exploration requires compilation

  • 8/3/2019 Analyzing Big Data at Twitter Presentation 1

    40/75

    Three Challenges

    Collecting Data

    Large-Scale Storage and Analysis

    Rapid Learning over Big Data

  • 8/3/2019 Analyzing Big Data at Twitter Presentation 1

    41/75

    Pig

    High level language

    Transformations on sets of records

    Process data one step at a time

    Easier than SQL?
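The "one step at a time" style can be illustrated in Python, where each statement names an intermediate set of records and transforms the previous one. This is a sketch of the programming style only, not Pig Latin; the comments point to the Pig operators that play the corresponding role, and the sample records are made up.

```python
from collections import Counter

# Hypothetical input records (the role LOAD plays in Pig).
records = [
    {"user_id": 7, "text": "hello", "lang": "en"},
    {"user_id": 12, "text": "hi", "lang": "en"},
    {"user_id": 7, "text": "again", "lang": "en"},
    {"user_id": 3, "text": "konnichiwa", "lang": "ja"},
    {"user_id": 12, "text": "more", "lang": "en"},
    {"user_id": 7, "text": "third", "lang": "en"},
]

# Step 1: keep only English tweets (FILTER).
english = [r for r in records if r["lang"] == "en"]

# Step 2: keep only the field we need (FOREACH ... GENERATE).
user_ids = [r["user_id"] for r in english]

# Step 3: group by user and count (GROUP, then COUNT).
counts = Counter(user_ids)

# Step 4: most active users first, ties broken by user_id (ORDER BY).
ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))

print(ranked)  # [(7, 3), (12, 2)]
```

Each intermediate name can be inspected on its own, which is what makes this style convenient for exploring data compared with a single nested query.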

  • 8/3/2019 Analyzing Big Data at Twitter Presentation 1

    42/75

    Why Pig?

    Because I bet you can read the following script

  • 8/3/2019 Analyzing Big Data at Twitter Presentation 1

    43/75

    A Real Pig Script

    Just for fun... the same calculation in Java next

  • 8/3/2019 Analyzing Big Data at Twitter Presentation 1

    44/75

    No, Seriously.

  • 8/3/2019 Analyzing Big Data at Twitter Presentation 1

    49/75

    Pig Makes it Easy

    5% of the code

    5% of the dev time

    Within 20% of the running time

    Readable, reusable

    As Pig improves, your calculations run faster

  • 8/3/2019 Analyzing Big Data at Twitter Presentation 1

    52/75

    One Thing I've Learned

    It's easy to answer questions

    It's hard to ask the right questions

    Value the system that promotes innovation and iteration

    More minds contributing = more value from your data

  • 8/3/2019 Analyzing Big Data at Twitter Presentation 1

    59/75

    Counting Big Data

    How many requests per day?

    Average latency? 95th-percentile latency?

    Response code distribution per hour?

    Twitter searches per day?

    Unique users searching, unique queries?

    Links tweeted per day? By domain?

    Geographic distribution of all of the above
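Several of these counting questions reduce to a group-and-aggregate over request logs. The sketch below answers three of them on a tiny in-memory sample; the record layout (ISO timestamp, HTTP status, latency in milliseconds) and the values are hypothetical, and the percentile uses the nearest-rank definition, which is one of several in common use.

```python
import math
from collections import Counter, defaultdict

# Hypothetical log records: (ISO timestamp, HTTP status, latency in ms).
requests = [
    ("2010-09-28T10:05:00", 200, 40),
    ("2010-09-28T10:17:00", 200, 55),
    ("2010-09-28T10:42:00", 500, 900),
    ("2010-09-28T11:01:00", 200, 35),
    ("2010-09-28T11:30:00", 404, 20),
    ("2010-09-29T09:00:00", 200, 60),
]

# Requests per day: the first 10 characters of the timestamp are the date.
requests_per_day = Counter(ts[:10] for ts, _, _ in requests)

# Average and nearest-rank 95th-percentile latency.
latencies = sorted(latency for _, _, latency in requests)
average_latency = sum(latencies) / len(latencies)
rank = math.ceil(0.95 * len(latencies))
p95_latency = latencies[rank - 1]

# Response code distribution per hour: the first 13 characters give date and hour.
codes_per_hour = defaultdict(Counter)
for ts, status, _ in requests:
    codes_per_hour[ts[:13]][status] += 1

print(dict(requests_per_day))
print(average_latency, p95_latency)
print({hour: dict(codes) for hour, codes in codes_per_hour.items()})
```

On this sample a single slow request pulls the average to 185 ms while most requests take well under 100 ms, which is one reason to track a high percentile alongside the mean.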

  • 8/3/2019 Analyzing Big Data at Twitter Presentation 1

    65/75

    Correlating Big Data

    Usage difference for mobile users?

    ... for users on desktop clients?

    ... for users of #newtwitter?

    Cohort analyses

    What features get users hooked?

    What features do power Twitter users use often?

  • 8/3/2019 Analyzing Big Data at Twitter Presentation 1

    73/75

    Research on Big Data

    What can we tell from a user's tweets?

    ... from the tweets of their followers?

    ... from the tweets of those they follow?

    What influences retweets? Depth of the retweet tree?

    Duplicate detection (spam)

    Language detection (search)

    Machine learning

    Natural language processing

  • 8/3/2019 Analyzing Big Data at Twitter Presentation 1

    74/75

    Diving Deeper

    HBase and building products from Hadoop

    LZO Compression

    Protocol Buffers and Hadoop

    Our analytics-related open source: hadoop-lzo, elephant-bird

    Moving analytics to realtime

    http://github.com/kevinweil/hadoop-lzo

    http://github.com/kevinweil/elephant-bird

  • 8/3/2019 Analyzing Big Data at Twitter Presentation 1

    75/75

    Questions?

    Follow me at twitter.com/kevinweil
