LinkedIn Infrastructure (analytics@webscale, at fb 2013)

Date post: 11-May-2015
Author: jun-rao
This is the presentation given at analytics@webscale in 2013 (http://analyticswebscale.splashthat.com/?em=187&utm_campaign=website&utm_source=sg&utm_medium=em)
1. Data Infrastructure at LinkedIn
  • Jun Rao and Sam Shah
  • LinkedIn Confidential ©2013 All Rights Reserved

2. Outline
  • 1. LinkedIn introduction
  • 2. Online/nearline infrastructure
  • 3. Offline infrastructure
  • 4. Conclusion

3. The World's Largest Professional Network
  • Connecting talent with opportunity, at scale
  • 200M+ members worldwide
  • 2 new members per second
  • 100M+ monthly unique visitors
  • 2M+ company pages

4. Two Product Families
  • For Members (Professionals): People You May Know, Who's Viewed My Profile, Jobs You May Be Interested In, News/Sharing, Today, Search, Subscriptions
  • For Partners (Companies): Hire, Market, Sell
  • Supporting layers: Science and Analytics, Data Infrastructure
  • Data: Actions, Profiles, Connections, Content

5. The Big-Data Feedback Loop
  • Diagram labels: Refinement, Engagement, Value, Member, Product, Insights, Virality, Data, Signals, Science, Analytics, Scale, Infrastructure

6. LinkedIn Data Infrastructure: Three-Phase Abstraction
  • Diagram layers: Users, Application, Infrastructure
  • Infrastructure phases: Online Data Infra, Near-Line Infra, Offline Data Infra
  • Latency & freshness requirements, with example products:
  • Online: activity that should be reflected immediately. Products: Messages, Member Profiles, Endorsements, Company Profiles, Skills, Connections
  • Near-line: activity that should be reflected soon. Products: Activity Streams, Profile Standardization, News, Recommendations, Search, Messages
  • Offline: activity that can be reflected later. Products: People You May Know, Connection Strength, News, Recommendations, Next best idea

7. LinkedIn Data Infrastructure: Sample Stack
  • Infra challenges in the 3-phase ecosystem are diverse, complex and specific
  • Some components are off-the-shelf
  • Significant investment in home-grown, deep and interesting platforms

8. LinkedIn Data Infrastructure Solutions
  • Voldemort: Highly-Available Distributed KV Store
  • Key/value access at scale

9. Voldemort: Architecture
  • Pluggable components
  • Tunable consistency / availability
  • Key/value model, server-side views
  • 10 clusters, 100+ nodes
  • Largest cluster: 10K+ qps
  • Avg latency: 3 ms
  • Hundreds of stores
  • Largest store: 2.8TB+

10. LinkedIn Data Infrastructure Solutions
  • Espresso: Indexed Timeline-Consistent Distributed Data Store
  • Fills the gap between Oracle and a KV store

11. Espresso: System Components
  • Hierarchical data model
  • Timeline consistency
  • Rich functionality: transactions, secondary index, text search
  • Partitioning/replication
  • Change propagation

12. Generic Cluster Manager: Helix
  • Generic distributed state model
  • Config management
  • Automatic load balancing
  • Fault tolerance
  • Cluster expansion and rebalancing
  • Espresso, Databus and Search
  • Open source since Apr 2012: https://github.com/linkedin/helix

13. LinkedIn Data Infrastructure Solutions
  • Databus: Timeline-Consistent Change Data Capture
  • Deliver data store changes to apps

14. Databus at LinkedIn
  • Diagram labels: DB, Capture Changes, Relay, Event Win, On-line Changes, Databus Client Lib, Client, Consumer 1, Consumer n, Bootstrap, DB, Consistent Snapshot at U
  • Transport independent of data source: Oracle, MySQL, …
  • Transactional semantics
  • In order, at least once delivery
  • Tens of relays
  • Hundreds of sources
  • Low latency: milliseconds

15. LinkedIn Data Infrastructure Solutions
  • Kafka: High-Volume Low-Latency Messaging System
  • Log aggregation and queuing

16. Kafka Architecture
  • Diagram labels: Producers; Brokers 1 to 4 holding partitions topic1-part1, topic1-part2, topic2-part1 and topic2-part2 (each partition label appears three times across the brokers); Zookeeper; Consumers
  • Key features: scale-out architecture, automatic load balancing, high throughput/low latency, rewindability, intra-cluster replication
  • Per day stats: writes 10+ billion messages, reads 50+ billion messages

17. LinkedIn Data Infrastructure: A few take-aways
  • 1. Building infrastructure in a hyper-growth environment is challenging.
  • 2. Few vs Many: balance over-specialized (agile) vs generic (leverage-able) platforms (*)
  • 3. Balance open-source products with homegrown platforms (**)
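
Slide 9 lists "tunable consistency / availability" for Voldemort without saying how the trade-off is expressed. One common way to reason about it in Dynamo-style stores is a replica count N with required read and write counts R and W. The sketch below is illustrative only: the class and method names are hypothetical and it does not use Voldemort's actual API or configuration format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuorumConfig:
    """Hypothetical per-store settings: N replicas, R required reads, W required writes."""
    replicas: int
    required_reads: int
    required_writes: int

    def __post_init__(self):
        # R and W must each be between 1 and N for the settings to make sense.
        for name in ("required_reads", "required_writes"):
            value = getattr(self, name)
            if not 1 <= value <= self.replicas:
                raise ValueError(f"{name} must be between 1 and {self.replicas}")

    def reads_see_latest_write(self) -> bool:
        # If R + W > N, every read set overlaps every write set in at least one replica.
        return self.required_reads + self.required_writes > self.replicas

    def write_failures_tolerated(self) -> int:
        # Writes still succeed while at least W replicas are reachable.
        return self.replicas - self.required_writes

    def read_failures_tolerated(self) -> int:
        # Reads still succeed while at least R replicas are reachable.
        return self.replicas - self.required_reads


strict = QuorumConfig(replicas=3, required_reads=2, required_writes=2)
fast = QuorumConfig(replicas=3, required_reads=1, required_writes=1)
```

Here `strict` favors consistency (overlapping quorums, one replica failure tolerated per operation), while `fast` favors availability and latency at the cost of possibly stale reads.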
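
Slide 14 states that Databus delivers changes "in order, at least once". A consequence is that a consumer may see the same change twice after a restart, so it should apply changes idempotently. The following is a minimal in-memory sketch of that idea, assuming each captured change carries a monotonically increasing sequence number; `ToyRelay`, `ChangeConsumer` and `deliver` are hypothetical names, not the Databus client library.

```python
class ToyRelay:
    """Keeps captured changes in commit order, each tagged with a sequence number."""

    def __init__(self):
        self.events = []  # list of (seq, key, value)

    def capture(self, key, value):
        seq = len(self.events) + 1
        self.events.append((seq, key, value))
        return seq

    def events_since(self, seq):
        # Everything strictly newer than the caller's checkpoint, in order.
        return [event for event in self.events if event[0] > seq]


class ChangeConsumer:
    """Applies changes idempotently by remembering the last sequence number applied."""

    def __init__(self):
        self.state = {}
        self.last_applied = 0

    def on_event(self, seq, key, value):
        if seq <= self.last_applied:
            return False  # duplicate from a redelivery, safe to ignore
        self.state[key] = value
        self.last_applied = seq
        return True


def deliver(relay, consumer, checkpoint):
    """Push all events after `checkpoint` to the consumer; return the new checkpoint."""
    for seq, key, value in relay.events_since(checkpoint):
        consumer.on_event(seq, key, value)
        checkpoint = seq
    return checkpoint
```

If the stored checkpoint is stale (for example, the client crashed before persisting it), calling `deliver` again replays some events, and the consumer's sequence check turns the replay into a no-op.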
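
Slide 16 names partitioned topics and "rewindability" among Kafka's key features. The idea can be sketched as an append-only log per partition where the consumer, not the broker, tracks its read position and may move it backwards. This is a toy in-memory model under stated assumptions: the class names are hypothetical, the CRC32 key hashing is an illustrative choice, and none of this is Kafka's real client API.

```python
import zlib
from collections import defaultdict


class PartitionedLog:
    """Toy stand-in for one topic split into append-only partitions."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, key, message):
        # A stable hash keeps all messages for one key in one partition, in order.
        partition = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[partition].append(message)
        return partition, len(self.partitions[partition]) - 1

    def read(self, partition, offset, max_messages=100):
        # Reading never removes messages, which is what makes rewinding possible.
        return self.partitions[partition][offset:offset + max_messages]


class LogConsumer:
    """Tracks its own offset per partition."""

    def __init__(self, log):
        self.log = log
        self.offsets = defaultdict(int)

    def poll(self, partition):
        batch = self.log.read(partition, self.offsets[partition])
        self.offsets[partition] += len(batch)
        return batch

    def rewind(self, partition, offset):
        # Re-consume from an earlier position, e.g. to reprocess after a bug fix.
        self.offsets[partition] = offset
```

Because the log retains messages after they are read, several consumers can read the same partition independently, which is one way the deck's read volume can exceed its write volume.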