
Text Analytics Summit 2009 - Roddy Lindsay - "Social Media, Happiness, Petabytes and LOLs"

Presentation at the 2009 Text Analytics Summit

Social Media, Happiness, Petabytes and LOLs

Roddy Lindsay, Data Scientist, Facebook

June 1, 2009


Lots of data is generated on Facebook

▪ 200 million active users

▪ More than 20 million users update their statuses at least once each day

▪ More than 850 million photos uploaded to the site each month

▪ More than 8 million videos uploaded each month

▪ More than 1 billion pieces of content (web links, news stories, blog posts, notes, photos, etc.) shared each week

▪ More than 2.5 million events created each month

▪ More than 25 million active user groups exist on the site


Lots of data is generated on Facebook

▪ Undoubtedly a very rich data set (and large...we’re talking petabytes)

▪ Many different groups clamoring for data:

▪ Internal analysts

▪ FB Engineers

▪ Advertisers

▪ Page owners

▪ Platform/Connect developers

▪ Marketers

▪ Academics


Challenges

▪ How can Facebook satisfy all the different consumers of data?

▪ What are the challenges?

▪ 1. Infrastructure

▪ 2. Infrastructure

▪ 3. Infrastructure


Facebook’s Data Infrastructure

▪ Attempt 1: Oracle Data Warehouse (2005)

▪ Business analysts already familiar with tools, SQL

▪ Fast JOINs for data slicing ideal for dashboards (home-rolled in PHP)

▪ e.g. growth by country and demographic

▪ When growth took off (2007), ETL processes to load and roll-up data started taking a very long time

▪ A single machine (or even several machines) was not going to cut it much longer for data volumes at that scale...


Facebook’s Data Infrastructure

▪ Attempt 2: Hadoop (2007)

▪ Open-source framework for running Map-Reduce on a cluster of commodity machines, as well as a distributed file system for long-term storage

▪ Map-Reduce (invented at Google) provides a way to process large data sets that scales linearly with the number of machines in the cluster... if your data doubles in size, just buy twice as many computers

▪ Hadoop initially developed by Doug Cutting, now an Apache project led by the Grid Computing team at Yahoo!

▪ Much faster ETL when transform and load is distributed across a cluster

▪ Engineers able to write jobs in Java and Python

▪ Not a viable solution for analysts who can write SQL but not code
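
The Map-Reduce model described on this slide can be sketched in a few lines of single-process Python. This is only an illustration of the map, shuffle, and reduce steps, not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the sample records are assumptions made for the example. On a real cluster the framework would run many mappers and reducers in parallel across machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(record):
    """Mapper: emit a (word, 1) pair for every word in one input record."""
    return [(word, 1) for word in record.lower().split()]


def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: collapse all values for one key into a single count."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle, and reduce in sequence on an in-memory list of records."""
    mapped = chain.from_iterable(map_phase(record) for record in records)
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())


# Hypothetical input records, one per line of text.
counts = run_job(["lol that is great", "that is all"])
```

Because each mapper sees only its own record and each reducer sees only one key, both phases can be split across machines, which is where the linear scaling claimed above comes from.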


Facebook’s Data Infrastructure

▪ Attempt 3: Hive (2008)

▪ SQL-like query language, table partitioning schema, and metadata store built on top of Hadoop

▪ Developed at Facebook, now an Apache subproject

▪ Also includes:

▪ Web interface for constructing queries on the fly without using a shell

▪ Live support for query problems from the data team

▪ Easy integration with charts and dashboards

▪ One-click scheduling

▪ CSV/Excel export


Facebook’s Data Infrastructure

▪ Attempt 3: Hive (2008)

▪ Example: “Find the number of status updates mentioning ‘swine flu’ per day last month”

    SELECT a.date, count(1)
    FROM status_updates a
    WHERE a.status LIKE '%swine flu%'
      AND a.date >= '2009-05-01' AND a.date <= '2009-05-31'
    GROUP BY a.date
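
To make clear what the query computes, here is a rough Python equivalent that runs over a handful of in-memory rows. The sample rows and the function name `mentions_per_day` are assumptions for illustration; the real `status_updates` table lives in Hive and the query runs as a distributed job.

```python
from collections import Counter

# Hypothetical sample rows standing in for the status_updates table: (date, status).
status_updates = [
    ("2009-05-01", "worried about swine flu"),
    ("2009-05-01", "swine flu is all over the news"),
    ("2009-05-02", "going to the beach"),
    ("2009-06-01", "swine flu again"),
]


def mentions_per_day(rows, phrase, start, end):
    """Count rows per date whose status contains `phrase`, within [start, end]."""
    counts = Counter()
    for date, status in rows:
        # ISO-formatted dates (YYYY-MM-DD) compare correctly as plain strings.
        if start <= date <= end and phrase in status:
            counts[date] += 1
    return dict(counts)


result = mentions_per_day(status_updates, "swine flu", "2009-05-01", "2009-05-31")
```

The filter corresponds to the WHERE clause and the per-date counter to GROUP BY with count(1).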


Facebook’s Data Infrastructure

▪ Attempt 3: Hive (2008)

▪ Easily extendable to new operators

▪ Hypothetical example: “Find the sentiment of the ‘Terminator’ movie”

    FROM (
      FROM status_updates b
      SELECT SENTIMENT(b.status, 'terminator') AS sentiment
      WHERE b.status LIKE '%terminator%'
        AND b.date >= '2009-05-01' AND b.date <= '2009-05-31'
    ) a
    SELECT a.sentiment, count(1)
    GROUP BY a.sentiment
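
The slide presents SENTIMENT as a hypothetical operator and does not say how it would score text. Purely as an assumption, the sketch below stands in a toy word-list scorer for it and then groups the labels the way the outer query does. The word lists, the function name `sentiment`, and the sample statuses are all invented for the example.

```python
import re
from collections import Counter

# Hypothetical word lists; a real operator would need a far richer model.
POSITIVE = {"awesome", "great", "loved", "amazing"}
NEGATIVE = {"awful", "boring", "hated", "terrible"}


def sentiment(status, keyword):
    """Label a status that mentions `keyword` as positive, negative, or neutral."""
    text = status.lower()
    if keyword.lower() not in text:
        return None  # the status does not mention the keyword at all
    words = re.findall(r"[a-z']+", text)
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


# Hypothetical statuses; the last one is filtered out, as the LIKE clause would do.
statuses = [
    "Terminator was awesome!",
    "terminator was boring and awful",
    "seeing terminator tonight",
    "making dinner",
]
labels = [sentiment(s, "terminator") for s in statuses]
by_sentiment = Counter(label for label in labels if label is not None)
```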


Facebook’s Data Infrastructure

▪ Attempt 3: Hive (2008)

▪ Successfully decentralized the querying and consumption of data across the company

▪ Instead of 10 dedicated data analysts, we trained a few hundred

▪ Everyone is able to answer 95% of his or her data questions with minimal training

▪ Dedicated data scientists, instead of working on an endless queue of ad-hoc requests, can spend their time performing complex analyses and building scalable systems on top of Hadoop/Hive

▪ Machine Learning systems

▪ Rich reporting for clients + Page owners

▪ Text analytics


Facebook text analytics

▪ Lexicon (Spring 2008)

▪ Started as an intern project to test Hadoop

▪ First external deployment of a Hadoop-powered system at Facebook (and one of the first anywhere)

▪ Simple idea: count the number of occurrences of words and bigrams on Facebook Walls per day, plot them on a line graph
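
The "simple idea" can be sketched as follows: split each post into words and bigrams, then keep one counter per day. This is a single-machine illustration under assumed inputs, not the Lexicon implementation; the function names (`ngrams`, `daily_counts`) and the sample Wall posts are hypothetical.

```python
from collections import Counter, defaultdict


def ngrams(text):
    """Return the words and the adjacent-word bigrams of one post."""
    tokens = text.lower().split()
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams


def daily_counts(posts):
    """Map each date to a Counter of word and bigram occurrences on that date."""
    counts = defaultdict(Counter)
    for date, text in posts:
        counts[date].update(ngrams(text))
    return counts


# Hypothetical Wall posts as (date, text) pairs.
posts = [
    ("2008-05-20", "watching american idol tonight"),
    ("2008-05-20", "american idol finale"),
    ("2008-05-21", "idol was great"),
]
counts = daily_counts(posts)

# One (date, count) point per day: the data behind a line graph for one term.
series = [(date, counts[date]["american idol"]) for date in sorted(counts)]
```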


“american idol”


Facebook text analytics

▪ “New” Lexicon (Fall 2008), beta preview

▪ Leveraged Hive’s structured metadata and the raw computational power of a 600-node Hadoop cluster

▪ Slices by age, gender, region

▪ Sentiment analysis

▪ Common user interests

▪ Associations graph of similar keywords, with age and gender axes


Dashboard: “economy”


Demographics: “economy”


Map: “laid off”


Sentiment: “iron man” (blue) vs. “indiana jones” (yellow)


Associations: “marriage”


Associations: “vodka”


Facebook text analytics

▪ Hadoop and Hive make this all possible

▪ Consider “Associations” (similar words and phrases)

▪ Need to compare the co-occurrence of each term with every other word and bigram against the baseline probability of occurrence (TF-IDF)... and keep demographic metadata around for fun

▪ Typical job generates several TB of data along the way

▪ Absolutely need a cluster of machines

▪ Distributed computation opens up the possibilities for text analytics algorithms!

▪ And... the software is free!
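
The "Associations" comparison can be illustrated at toy scale. The slide names TF-IDF but gives no formula, so the sketch below swaps in a simple ratio as an assumption: how often a term appears in posts containing the target, divided by how often it appears in all posts. The function names (`terms`, `associations`) and the sample posts are hypothetical, and demographic metadata is left out.

```python
from collections import Counter


def terms(text):
    """Return the distinct words and bigrams that occur in one post."""
    tokens = text.lower().split()
    return set(tokens) | {" ".join(pair) for pair in zip(tokens, tokens[1:])}


def associations(posts, target):
    """Score each co-occurring term by P(term | target present) / P(term)."""
    baseline = Counter()  # number of posts containing each term
    cooccur = Counter()   # number of posts containing both the term and the target
    n_target = 0
    for text in posts:
        present = terms(text)
        baseline.update(present)
        if target in present:
            n_target += 1
            cooccur.update(present - {target})
    if n_target == 0:
        return {}
    n_posts = len(posts)
    return {
        term: (count / n_target) / (baseline[term] / n_posts)
        for term, count in cooccur.items()
    }


# Hypothetical posts.
posts = ["vodka and tonic", "vodka party tonight", "party tonight", "work tonight"]
scores = associations(posts, "vodka")
```

A score above 1 means the term shows up with the target more often than its baseline rate would suggest. Even this toy version touches every term in every post, which hints at why the full computation over all words and bigrams generates terabytes of intermediate data.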


Text Analytics

▪ Text analytics is clearly useful in the “macro”:

▪ Big data sets

▪ Big compute clusters

▪ Big consumers (corporations)

▪ What about in the micro?

▪ Small data sets

▪ B, not PB

▪ Small consumers

▪ Individual people analyzing their own data


HappyFactor

▪ Facebook Application (personal project, not associated with Facebook)

▪ Idea: ask people privately how happy they are and what they are doing

▪ Uses random text messages to ensure a good sample and to collect data easily

▪ Provide users with trends on their happiness (by day, week, month, etc.)

▪ When are you happiest?

▪ Sift through the unstructured text to find patterns in behavior that correlate with happiness and unhappiness

▪ Which activities make you happiest?

▪ Which people in your life make you happiest?
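
One simple way to look for such patterns, offered here only as a sketch and not as HappyFactor's actual method, is to average the happiness score over every entry that mentions a given word. The function name `average_score_by_word`, the `min_count` threshold, and the sample entries are assumptions for the example.

```python
from collections import defaultdict


def average_score_by_word(entries, min_count=2):
    """Average the happiness score over all entries mentioning each word.

    Words seen in fewer than `min_count` entries are dropped as too noisy.
    """
    totals = defaultdict(lambda: [0, 0])  # word -> [sum of scores, number of entries]
    for score, description in entries:
        for word in set(description.lower().split()):
            totals[word][0] += score
            totals[word][1] += 1
    return {
        word: total / count
        for word, (total, count) in totals.items()
        if count >= min_count
    }


# Hypothetical (score, description) entries from one user.
entries = [
    (9, "dinner with friends"),
    (8, "hiking with friends"),
    (3, "stuck at work"),
    (4, "late at work"),
]
averages = average_score_by_word(entries)
```

Sorting the result by value would surface the words, and so the activities and people, associated with the highest and lowest scores.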


HappyFactor

▪ Just as corporations can learn about (and improve) themselves through text analytics...

▪ Why not humans?


On a scale from 1 to 10, how happy are you right now? Reply with your score and an optional description of what you are doing.
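
A reply to this prompt has to be split into a score and an optional description before it can be analyzed. The sketch below shows one possible parser; the regular expression and the function name `parse_reply` are assumptions, and the deck does not describe how HappyFactor handles replies.

```python
import re

# A whole number from 1 to 10 at the start, then optional separators and free text.
REPLY = re.compile(r"^\s*(10|[1-9])\b[\s,.:-]*(.*)$", re.DOTALL)


def parse_reply(message):
    """Return (score, description) for a valid reply, or None if no score is found."""
    match = REPLY.match(message)
    if match is None:
        return None
    score = int(match.group(1))
    description = match.group(2).strip() or None  # the description is optional
    return score, description
```

For example, `parse_reply("8 dinner with friends")` yields the score 8 with the description "dinner with friends", while a message with no leading score yields `None`.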


In sum...

▪ Analyzing large data sets is a challenging problem that requires significant investment (both human and financial) in infrastructure

▪ We’re only now learning what we can do with Facebook data, now that we have developed the infrastructure to support it

▪ Distributed computation and structured metadata allow for a powerful new class of text analytics algorithms

▪ Text analytics has applications well beyond enterprise data-mining...

▪ ...could it potentially make the world a happier place?


(c) 2009 Facebook, Inc. or its licensors. "Facebook" is a registered trademark of Facebook, Inc. All rights reserved.

