Date posted: | 15-Jan-2015 |
Category: | Technology |
Uploaded by: | hien-luu |
HOW LUCENE POWERS LINKEDIN SEGMENTATION & TARGETING PLATFORM
Hien Luu & Raj Rangaswamy
About Us
Hien Luu, Rajasekaran Rangaswamy
Agenda
• A little bit about LinkedIn
• Segmentation & Targeting Platform Overview
• How Lucene powers the Segmentation & Targeting Platform
• Q&A
Our Mission: Connect the world’s professionals to make them more productive and successful.
Our Vision: Create economic opportunity for every professional in the world.
Members First!
©2013 LinkedIn Corporation. All Rights Reserved.
The world’s largest professional network
• Over 65% of members are now international
• Company Pages: >3M
• Languages: 19
• >30M
• >90% of Fortune 100 companies use LinkedIn Talent Solutions to hire
• Professional searches in 2012: >5.7B
Other Company Facts
• Headquartered in Mountain View, Calif., with offices around the world
• LinkedIn has ~4200 full-time employees located around the world
Segmentation & Targeting Platform Overview
1. Create attributes
• Name
• Email
• State
• Occupation
• Etc.
2. Attributes Added to Table
Name | Email | State | Occupation | …
John Smith | [email protected] | California | Engineer
Jane Smith | [email protected] | Nevada | HR Manager
3. Create Target Segment: California, Engineer
Name | Email | State | Occupation
John Smith | [email protected] | California | Engineer
Jane Doe | [email protected] | California | Engineer
4. Export List & Send to Vendor
Jane Doe | [email protected] | California | Engineer
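The four steps above can be sketched in a few lines of Python. This is only an illustration of the idea, not the platform's implementation: the in-memory table, the `build_segment` and `export_list` helpers, and the example.com addresses are all hypothetical.

```python
import csv
import io

# Hypothetical attribute table (steps 1-2); the e-mail addresses are placeholders.
members = [
    {"name": "John Smith", "email": "john.smith@example.com",
     "state": "California", "occupation": "Engineer"},
    {"name": "Jane Smith", "email": "jane.smith@example.com",
     "state": "Nevada", "occupation": "HR Manager"},
    {"name": "Jane Doe", "email": "jane.doe@example.com",
     "state": "California", "occupation": "Engineer"},
]


def build_segment(rows, **criteria):
    """Step 3: keep the rows whose attributes equal every criterion."""
    return [row for row in rows
            if all(row.get(key) == value for key, value in criteria.items())]


def export_list(segment, columns):
    """Step 4: render the segment as CSV text that could be sent to a vendor."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=columns,
                            extrasaction="ignore", lineterminator="\n")
    writer.writeheader()
    writer.writerows(segment)
    return buf.getvalue()


segment = build_segment(members, state="California", occupation="Engineer")
csv_text = export_list(segment, ["name", "email"])
print(csv_text)
```

A linear scan like this is fine for three rows; the rest of the deck is about making the same kind of filter fast over hundreds of millions of members.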
• Business definition
  – Business would like to launch new campaigns often
  – Business would like to specify targeting criteria using an arbitrary set of attributes
  – Attributes need to be computed to fulfill the targeting criteria
  – The attribute data resides on Hadoop or Teradata (TD)
  – Business is most comfortable with a SQL-like language
Attribute Computation Engine
Attribute Serving Engine
Attribute Computation Engine
• Self-service
• Support various data sources
• Attribute consolidation
• Attribute availability
Attribute computation
[Figure: attribute computation scale; surviving labels: ~238M, PB, TB, TB, ~440]
Attribute Serving Engine
• Self-service
• Attribute predicate expression
• Build segments
• Build lists
Attribute Serving Engine
[Figure: supported operations: count, filter, sum, complex expressions; surviving labels: $, Σ, 1234, ~238M, ~440]
Who are the North American recruiters who don’t work for a competitor?
Who are the LinkedIn Talent Solution prospects in Europe?
Who are the job seekers?
How Lucene powers Segmentation & Targeting Platform
• Architecture
  – Indexer Architecture
  – Serving Architecture
• Load Balanced Model
• Next Steps - Distributed Model
• DocValues
• Lessons Learnt
• Why not use an existing solution?
Architecture
[Diagram components: Data Storage Layer; Attribute Creation Engine; Attribute Materialization Engine; Attribute Computation Engine; Attribute Metastore; Attribute Indexing; Attribute Serving Engine]
Indexer Architecture
[Diagram components: Avro data in HDFS; MySQL attribute store (Attribute Definitions); Hadoop Indexer MR; HDFS (shard 1, shard 2, … shard n); Index Merger; Web Servers]
Hadoop Indexer MR
• Mapper: K => AvroKey<GenericRecord>, V => AvroValue<NullWritable>
• Reducer: K => NullWritable, V => LuceneDocumentWrapper
• LuceneOutputFormat, RecordWriter, LuceneDocumentWrapper (Document, Index)
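The shape of the indexing pipeline (partition the records, build one index per shard, merge later in a separate step) can be sketched without Hadoop or Lucene. The following is a minimal sketch under stated assumptions: hash partitioning by member id, a plain dictionary as the inverted index, and the helper names `shard_of`, `build_shard_index`, `index_in_shards` and `merge_indexes` are all hypothetical and not taken from the talk.

```python
import zlib
from collections import defaultdict


def shard_of(member_id, num_shards):
    """Stable hash so a given member always lands in the same shard."""
    return zlib.crc32(str(member_id).encode("utf-8")) % num_shards


def build_shard_index(records):
    """Inverted index for one shard: (field, value) -> sorted member ids."""
    postings = defaultdict(set)
    for rec in records:
        for field, value in rec.items():
            if field != "id":
                postings[(field, value)].add(rec["id"])
    return {term: sorted(ids) for term, ids in postings.items()}


def index_in_shards(records, num_shards):
    """Group records by shard (map-like), then index each group (reduce-like)."""
    buckets = defaultdict(list)
    for rec in records:
        buckets[shard_of(rec["id"], num_shards)].append(rec)
    return {shard: build_shard_index(recs) for shard, recs in buckets.items()}


def merge_indexes(shard_indexes):
    """Merge per-shard indexes into one, as a separate merger step might."""
    merged = defaultdict(set)
    for index in shard_indexes:
        for term, ids in index.items():
            merged[term].update(ids)
    return {term: sorted(ids) for term, ids in merged.items()}


records = [
    {"id": 1, "state": "California", "occupation": "Engineer"},
    {"id": 2, "state": "Nevada", "occupation": "HR Manager"},
    {"id": 3, "state": "California", "occupation": "Engineer"},
]
shards = index_in_shards(records, num_shards=2)
merged = merge_indexes(shards.values())
print(merged[("state", "California")])
```

Because each shard index is built independently, the expensive work parallelizes across reducers and the merge can run in a different process.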
Serving Architecture
[Diagram components: JSON Predicate Expression; JSON Lucene Query Parser; Inverted Index (×3); Segment & List]
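To make the serving path concrete, here is a rough sketch of turning a JSON predicate expression into a set of matching members by combining posting lists from an inverted index. The JSON shape (`term`, `and`, `or`, `not`), the field names and the `evaluate` function are assumptions made for illustration; the talk does not show the actual predicate format. In Lucene, the analogous construct would be a BooleanQuery built from MUST, SHOULD and MUST_NOT clauses.

```python
import json


def evaluate(expr, index, all_ids):
    """Recursively evaluate a predicate tree into a set of matching ids."""
    if "term" in expr:
        term = expr["term"]
        return set(index.get((term["field"], term["value"]), ()))
    if "and" in expr:
        result = set(all_ids)
        for child in expr["and"]:
            result &= evaluate(child, index, all_ids)
        return result
    if "or" in expr:
        result = set()
        for child in expr["or"]:
            result |= evaluate(child, index, all_ids)
        return result
    if "not" in expr:
        return set(all_ids) - evaluate(expr["not"], index, all_ids)
    raise ValueError(f"unknown predicate: {expr!r}")


# Hypothetical inverted index: (field, value) -> member ids.
index = {
    ("region", "North America"): [1, 2, 3],
    ("occupation", "Recruiter"): [1, 3, 4],
    ("employer_type", "Competitor"): [3],
}
all_ids = {1, 2, 3, 4}

# "North American recruiters who don't work for a competitor"
predicate = json.loads("""
{"and": [
  {"term": {"field": "region", "value": "North America"}},
  {"term": {"field": "occupation", "value": "Recruiter"}},
  {"not": {"term": {"field": "employer_type", "value": "Competitor"}}}
]}
""")

segment = sorted(evaluate(predicate, index, all_ids))
count = len(segment)
print(segment, count)
```

The same evaluation yields both the segment (the ids) and the count, which matches the count and filter operations listed for the Attribute Serving Engine.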
Serving – Load Balanced Model
[Diagram: HTTP Request; Load Balancer; Web Server 1, Web Server 2, … Web Server n; Shared Drive holding Shard 1, Shard 2, … Shard n]
Serving – Load Balanced Model
But wait…
• Is load balancing alone good enough?
• What about distribution and failover?
Next Steps – Distributed Model
• Apache Helix: a generic cluster management framework
• Manages partitioned and replicated resources in distributed systems
• Built on top of ZooKeeper; hides the complexity of ZK primitives
• Provides distributed features such as leader election, two-phase commit, etc. via a state machine model
http://helix.incubator.apache.org/
Next Steps – Distributed Model
[Diagram: HTTP Request; Load Balancer; Web Server 1, Web Server 2, Web Server 3; Scatter Gather. Shard placement per server: Shard 1 (active) + Shard 2 (standby); Shard 2 (active) + Shard 3 (standby); Shard 3 (active) + Shard 1 (standby)]
Next Steps – Distributed Model
[Diagram, failure case: HTTP Request; Load Balancer; Web Server 1, Web Server 2, Web Server 3; Scatter Gather. Shard placement per server: Shard 1 (active) + Shard 2 (standby); Shard 2 (active) + Shard 3 (active); Shard 3 (failure) + Shard 1 (failure)]
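The scatter-gather and failover behaviour in the two diagrams can be approximated with a small in-process simulation. This is a sketch under stated assumptions, not the planned design: the `Node` class, the `routing` table and the inline fallback from active to standby are hypothetical, whereas the slides propose delegating replica state transitions to a cluster manager.

```python
from concurrent.futures import ThreadPoolExecutor


class Node:
    """A serving node holding a few shards (shard id -> matching member ids)."""

    def __init__(self, name, shards):
        self.name = name
        self.shards = shards
        self.alive = True

    def search(self, shard_id):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return self.shards[shard_id]


def query_shard(shard_id, replicas):
    """Try the active replica first, then fall back to the standby."""
    for node in replicas:
        try:
            return node.search(shard_id)
        except ConnectionError:
            continue
    raise RuntimeError(f"no live replica for shard {shard_id}")


def scatter_gather(routing):
    """Scatter one request per shard, then gather and merge the hits."""
    with ThreadPoolExecutor(max_workers=len(routing)) as pool:
        futures = [pool.submit(query_shard, shard_id, replicas)
                   for shard_id, replicas in routing.items()]
        hits = []
        for future in futures:
            hits.extend(future.result())
    return sorted(hits)


data = {1: [10, 11], 2: [20], 3: [30, 31]}
node1 = Node("web-server-1", {1: data[1], 2: data[2]})  # Shard 1 active, Shard 2 standby
node2 = Node("web-server-2", {2: data[2], 3: data[3]})  # Shard 2 active, Shard 3 standby
node3 = Node("web-server-3", {3: data[3], 1: data[1]})  # Shard 3 active, Shard 1 standby

# Each shard lists its active replica first, then its standby.
routing = {1: [node1, node3], 2: [node2, node1], 3: [node3, node2]}

before = scatter_gather(routing)
node3.alive = False  # simulate the failure shown in the second diagram
after = scatter_gather(routing)
print(before == after)
```

With one replica of every shard on a second machine, losing a single server leaves every shard reachable, which is the property the load-balanced model with a shared drive does not address on its own.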
DocValues – Use Case
• Once segments are built, users want to forecast, i.e. see a target revenue projection for the campaigns that they want to run.
• Campaigns can be run on various revenue models.
• This involves adding per-member propensity scores and dollar amounts.
DocValues – Why not Stored Fields?
• Stored fields have one indirection per document, resulting in two disk seeks per document
• The performance cost quickly adds up when fetching millions of documents
Lookup path for a document ID:
  .fdx: fetch the file pointer to the field data
  .fdt: scan by ID until the field is found
• Why not use Field Cache?
  – It is memory resident
  – Works fine when there is enough memory
  – But keeping millions of un-inverted values in memory is impossible
  – Additional cost to parse values (from String and to String)
DocValues
• Dense column-based storage
  – (1 value per document and 1 column per field and segment)
• Accepts primitives
• No conversion from/to String needed
• Loads 80x-100x faster than building a FieldCache
• All the work is done during indexing
• DocValues fields can be indexed and stored too
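The difference between row-oriented stored fields and column-oriented DocValues can be illustrated with plain Python containers. This sketch does not use Lucene: the `stored_fields` list, the two `array` columns and both forecast functions are hypothetical, and computing the projection as propensity × dollar amount summed over the segment is an assumption, since the slides do not give the formula.

```python
from array import array

# Row-oriented "stored fields": one record per document, values kept as strings.
stored_fields = [
    {"propensity": "0.10", "amount": "500.0"},
    {"propensity": "0.25", "amount": "200.0"},
    {"propensity": "0.50", "amount": "100.0"},
    {"propensity": "0.05", "amount": "1000.0"},
]

# Column-oriented layout: one dense array of primitives per field, indexed by
# doc id. The string parsing happens once, at "indexing" time.
propensity_col = array("d", (float(doc["propensity"]) for doc in stored_fields))
amount_col = array("d", (float(doc["amount"]) for doc in stored_fields))


def forecast_from_stored(doc_ids):
    """Fetch each document's record and parse its strings at query time."""
    total = 0.0
    for doc_id in doc_ids:
        doc = stored_fields[doc_id]
        total += float(doc["propensity"]) * float(doc["amount"])
    return total


def forecast_from_columns(doc_ids):
    """Read primitives straight out of the per-field columns."""
    return sum(propensity_col[i] * amount_col[i] for i in doc_ids)


segment_doc_ids = [0, 2, 3]  # doc ids matched by some segment query
row_total = forecast_from_stored(segment_doc_ids)
col_total = forecast_from_columns(segment_doc_ids)
print(row_total, col_total)
```

Both functions return the same projection; the columnar version avoids the per-document record lookup and the String conversion, which is the cost the bullets above attribute to stored fields and the Field Cache.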
Lessons Learnt – Indexing
• Reuse index writers, field and document instances
• Create many partitions and merge them in a different process
• Rebuild (bootstrap) entire index if possible
• Use partial updates with caution
• Analyze the index
Lessons Learnt – Serving
• Reuse a single instance of IndexSearcher
• Limit usage of stored fields and term vectors
• Plan for load balancing and failover
• Cache term frequencies
• Use different machines for serving and indexing
Why not use existing solutions?
• Doesn’t allow dynamic schema
• Difficult to bootstrap indexes built in Hadoop
• Indexing elevates query latency

• Doesn’t allow dynamic schema
• Difficult to bootstrap indexes built in Hadoop
• Larger memory overhead
• Comparatively slow
Questions?
More info: data.linkedin.com