Big Data Usage in LinkedIn

Post on 23-Jan-2015

description

Information Excellence presentation, September 2012, by Hari Shankar, Big Data Engineer at LinkedIn, on big data usage at LinkedIn

transcript

Recruiting Solutions

Harvesting Information Excellence

Information Excellence 2012 Sep Session

Information Excellence informationexcellence.wordpress.com

Big Data Usage and Implementation in LinkedIn

Hari Shankar, Big Data Engineer, LinkedIn

Thank You

for hosting us today

Today’s Speakers

Big data and Hadoop

September 2012

Hari Shankar Menon, Software engineer, LinkedIn

LinkedIn Engineering Data warehouse team

Previously, software engineer @Clickable: worked on building the reporting and analytics platform on Hadoop and HBase.

Hadoop and Open-source enthusiast

About me

• About LinkedIn
• Data Infrastructure overview
• Hadoop@LinkedIn
• Challenges

Agenda

Our mission: Connect the world’s professionals to make them more productive and successful

LinkedIn by numbers

LinkedIn Members (Millions)

Year      2004  2005  2006  2007  2008  2009  2010
Members      2     4     8    17    32    55    90

• 175M+ members
• 85% of Fortune 100 companies use LinkedIn to hire
• >2M company pages**
• ~2/sec new members joining
• ~4.2B professional searches in 2011

* as of Nov 4, 2011
** as of June 30, 2011

* Chart from Philip Russom, Research Director, TDWI

What is big data?

Infrastructure technologies

• Primary data store (front-end)
• Distributed key-value store
• Document-oriented store
• Distributed PubSub messaging
• Search technologies: SenseiDB, Zoie, Bobo
• Database change replication

http://data.linkedin.com/opensource

Open source

• What is Hadoop
• Evolution of Hadoop
• Impact

Recommendation systems
  – Generating recommendations
  – Modeling
  – A/B Testing
  – Grandfathering

Data warehouse/ETL
  – Raw data storage
  – Aggregations
  – Heavy lifting

Data sciences
  – Strategic analyses
  – Experimentation sandbox

Hadoop@LinkedIn

Pandora Search for People

Events You May Be Interested In

Groups browse maps

The Recommendations opportunity

• Relevance/Latency

• Offline computation

• Caching

Improving recommendations

• Mathematical modeling

• A/B Testing

• Grandfathering
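The slide names A/B testing without saying how members are split between variants. One common approach is deterministic bucketing by hashing a member id, sketched below. The function name `assign_variant`, the experiment label, and the 50/50 default split are hypothetical choices for illustration and are not taken from the talk.

```python
import hashlib


def assign_variant(member_id: int, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministically assign a member to 'control' or 'treatment'.

    Hypothetical helper: hashing the experiment name together with the
    member id gives each experiment its own stable split of members.
    """
    key = f"{experiment}:{member_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    # Map the first 32 bits of the hash to a number in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "treatment" if bucket < treatment_share else "control"


if __name__ == "__main__":
    for member in (101, 102, 103):
        print(member, assign_variant(member, "new-ranking-model"))
```

Because the assignment depends only on the experiment name and the member id, a member sees the same variant on every visit without any stored state.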

Hadoop in the Data warehouse

• Source of truth
• Lower retention
• Ad-hoc analysis

• Longer retention
• Complex transformations
• Algorithmic computations
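To make the aggregation role concrete, here is a minimal sketch of the map, shuffle, and reduce phases in plain Python, counting page views per day and page. It only simulates the programming model in one process; the event fields (`date`, `page`) and function names are assumptions for illustration, not LinkedIn's schema or jobs.

```python
from collections import defaultdict


def map_event(event):
    """Map phase: emit one ((date, page), 1) pair per raw event."""
    yield (event["date"], event["page"]), 1


def reduce_counts(key, values):
    """Reduce phase: sum all counts that share the same key."""
    return key, sum(values)


def run_job(events):
    """Simulate map -> shuffle (group by key) -> reduce in one process."""
    groups = defaultdict(list)
    for event in events:
        for key, value in map_event(event):
            groups[key].append(value)  # shuffle: collect values per key
    return dict(reduce_counts(key, values) for key, values in sorted(groups.items()))


if __name__ == "__main__":
    sample = [
        {"date": "2012-09-01", "page": "jobs"},
        {"date": "2012-09-01", "page": "jobs"},
        {"date": "2012-09-01", "page": "groups"},
    ]
    print(run_job(sample))
```

On a real cluster the map and reduce functions run on many machines and the framework performs the grouping, which is what lets the same logic scale to long-retention raw data.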

Hadoop in Data Sciences

• Deep dives

• Sandbox

• Hackday projects

Data Insights - 1

Job migration after financial collapse

Data Insights - 2

Data Insights - 3

1. User adoption of new technologies
2. Real-time processing
3. Graph/Network algorithms
4. Making data accessible

Challenges

User adoption

• Challenges
  • Random reads/writes
  • Warm-up time

• Solutions
  • Parts of the problem that can be moved offline?
  • HBase, Voldemort

Real-time processing
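The idea of moving part of the problem offline can be sketched as a batch step that precomputes results per member, followed by a request-time step that is only a key lookup. In the sketch below a Python dict stands in for a key-value store such as the HBase or Voldemort named on the slide. The shared-connection scoring and all function names are hypothetical and are not LinkedIn's actual algorithm.

```python
from collections import Counter, defaultdict


def build_recommendations(connections, top_n=3):
    """Offline batch step: rank non-connections by shared connections."""
    neighbours = defaultdict(set)
    for a, b in connections:
        neighbours[a].add(b)
        neighbours[b].add(a)
    neighbours = dict(neighbours)  # freeze: no accidental key creation below

    store = {}
    for member, friends in neighbours.items():
        counts = Counter()
        for friend in friends:
            for candidate in neighbours[friend]:
                if candidate != member and candidate not in friends:
                    counts[candidate] += 1
        # Highest shared-connection count first; ties broken by id.
        ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
        store[member] = [candidate for candidate, _ in ranked[:top_n]]
    return store


def serve(store, member):
    """Online step: a single key lookup, no computation at request time."""
    return store.get(member, [])


if __name__ == "__main__":
    edges = [("a", "b"), ("b", "c"), ("a", "d"), ("d", "c")]
    store = build_recommendations(edges)
    print(serve(store, "a"))
```

The trade-off is freshness: results are only as recent as the last batch run, which is why the slide frames this as a question of which parts of the problem can be moved offline.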

• Graph problems
• Traditional joins

Map-reduce-incompatible problems
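One way to see why traditional joins sit awkwardly in MapReduce is to write a reduce-side join by hand: both inputs must be re-keyed on the join column, tagged with their source, shuffled together, and paired up in the reducer. The sketch below simulates this in plain Python. The record fields (`member_id`, `name`, `page`) and function names are assumptions for illustration only.

```python
from collections import defaultdict


def map_members(record):
    """Tag each member record with its source and key it by member_id."""
    yield record["member_id"], ("member", record["name"])


def map_views(record):
    """Tag each page-view record with its source and key it by member_id."""
    yield record["member_id"], ("view", record["page"])


def reduce_join(member_id, tagged_values):
    """Pair every member row with every view row that shares the key."""
    names = [value for tag, value in tagged_values if tag == "member"]
    pages = [value for tag, value in tagged_values if tag == "view"]
    for name in names:
        for page in pages:
            yield member_id, name, page


def run_join(members, views):
    """Simulate the shuffle: group tagged values from both inputs by key."""
    groups = defaultdict(list)
    for record in members:
        for key, value in map_members(record):
            groups[key].append(value)
    for record in views:
        for key, value in map_views(record):
            groups[key].append(value)
    joined = []
    for key in sorted(groups):
        joined.extend(reduce_join(key, groups[key]))
    return joined


if __name__ == "__main__":
    members = [{"member_id": 1, "name": "Ana"}, {"member_id": 2, "name": "Raj"}]
    views = [{"member_id": 1, "page": "jobs"}, {"member_id": 3, "page": "home"}]
    print(run_join(members, views))
```

What a SQL engine expresses in a single `JOIN` clause takes two mappers, a tagging convention, and a full shuffle of both datasets here, and every additional join adds another pass of the same kind.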

• Hadoop Tons of data

Making data accessible

Finally!

No Silver bullet

Hadoop Offline processing

Scalability by design

www.linkedin.com/in/harisreekumar

www.linkedin.com/company/linkedin/careers

Community Focused

Volunteer Driven

Knowledge Share

Accelerated Learning

Collective Excellence

Distilled Knowledge

Shared, Non Conflicting Goals

Validation / Brainstorm platform

Mentor, Guide, Coach

Satisfied, Empowered Professional

Richer Industry and Academia

About Information Excellence Group

Progress Information Excellence

Towards an Enriched Profession, Business and Society