
Joshuba Tuberville @ Lucene Revolution 2011

Jazzed About Solr
Page 1: Joshuba Tuberville @ Lucene Revolution 2011

Jazzed About Solr
People as A Search Problem

Thursday, May 26, 2011

Page 2: Joshuba Tuberville @ Lucene Revolution 2011

About Me

• Building websites since 1996, Java since 1997
• Prior web search experience
• Building and scaling eHarmony products since 2002


Page 3: Joshuba Tuberville @ Lucene Revolution 2011

What is Jazzed

• Subscription Based Dating Site

• Incubated by eHarmony


Page 4: Joshuba Tuberville @ Lucene Revolution 2011

What is Jazzed

• Create a profile
• Search for others
• View their photos
• Privately Communicate

Page 8: Joshuba Tuberville @ Lucene Revolution 2011

How is it different?

• Covers a broader range of relationships
• Easy to get started
• Real profiles screened by machines and humans
• Fast, effective, search-oriented tools


Page 9: Joshuba Tuberville @ Lucene Revolution 2011

Jazzed Stats

• Started Fall 2009
• Beta Summer 2010
• Launched October 2010
• 100,000s of Profiles
• 1,000s of Searches Daily


Page 10: Joshuba Tuberville @ Lucene Revolution 2011

Jazzed Architecture

• Event-driven SOA
• REST, JSON, EIP, Not-only-SQL
• Technology incubation


Page 11: Joshuba Tuberville @ Lucene Revolution 2011

Tech Stack

• Java 6, Spring 3, Jersey 1.1, JMS (AMQP)

• RHEL 4, Oracle 11g, Voldemort 0.81, Solr 1.4.1, NFS

Page 14: Joshuba Tuberville @ Lucene Revolution 2011

Not Covered

• Distributed Search
• Caching Strategies
• Data Import
• Analyzers/Tokenizers


Page 15: Joshuba Tuberville @ Lucene Revolution 2011

Why Lucene?

• Proven Solid IR library
• Prefer Open Source Solutions
• Not Only SQL
• Flexible Ranking
• Pluggable


Page 16: Joshuba Tuberville @ Lucene Revolution 2011

Why Solr

• Performant, Extensible, RESTful Service
• Configuration, Schema, Multicores
• Admin Interface
• Replication, Backups, Monitoring


Page 17: Joshuba Tuberville @ Lucene Revolution 2011

Open Source

• Strengthens Engineering Team
• Be a part of a great community
• Not Brochure-ware


Page 18: Joshuba Tuberville @ Lucene Revolution 2011

Not Only SQL

• One solution does not fit all
• Prefer availability over consistency
• Horizontal Scaling over Vertical


Page 19: Joshuba Tuberville @ Lucene Revolution 2011

Flexible Ranking

• Query Strategies
• Boolean Algebra
• Vector Space Analysis
• Hybrids
• Extensive Function Support
• Index and Query Boosting


Page 20: Joshuba Tuberville @ Lucene Revolution 2011

...Oh My!

• Standard Plugins - Geospatial*, Faceting, Spelling, MoreLikeThis

• Full Text with Highlighted Results
• Client agnostic


Page 21: Joshuba Tuberville @ Lucene Revolution 2011

Inevitable Question

• “Does it scale?”
• Solr POC Benchmark
• 10 Million profiles
• >200 queries/sec, under 100 ms at the 90th percentile
• Default tuning until 5 million profiles


Page 22: Joshuba Tuberville @ Lucene Revolution 2011

Profile Service

• RESTful Hybrid Data Service
• Public, Private, Attributes
• Event Producer


Page 23: Joshuba Tuberville @ Lucene Revolution 2011

Profiles

• Mostly structured
• Categories - Eye Color, Desired Ethnicity
• Dates - Birthdate
• Numbers - Coordinates, Age Range
• Text - Name, Headline


Page 24: Joshuba Tuberville @ Lucene Revolution 2011

Inverting People

• Stored as an inverted index

• Index is randomly accessed by term

Term         Document
MALE         1, 3, 5, 7, 9
FEMALE       2, 4, 6, 8, 10
HAIR_RED     8
HAIR_BLOND   1, 2, 5, 6
EYE_BLUE     1, 2, 3, 10
EYE_BROWN    4, 5, 6, 7, 8, 9
fun          1, 3, 7, 9
funny        2, 4, 6, 10
beach        1, 2, 3, 4, 5, 6, 7, 8
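
A minimal Python sketch of the idea on this slide, not Lucene's actual data structures: each term maps to the ids of the profiles that contain it, and a lookup goes straight to a term's list. The function names and the four sample profiles are hypothetical; the profiles are chosen to agree with part of the table above.

```python
from collections import defaultdict


def build_inverted_index(profiles):
    """Map each term to the sorted list of profile ids that contain it."""
    index = defaultdict(set)
    for doc_id, terms in profiles.items():
        for term in terms:
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def search_all(index, *terms):
    """Return ids of profiles containing every given term (a Boolean AND)."""
    postings = [set(index.get(term, ())) for term in terms]
    return sorted(set.intersection(*postings)) if postings else []


# Hypothetical profiles, consistent with a subset of the slide's table.
profiles = {
    1: ["MALE", "HAIR_BLOND", "EYE_BLUE", "fun", "beach"],
    2: ["FEMALE", "HAIR_BLOND", "EYE_BLUE", "funny", "beach"],
    3: ["MALE", "EYE_BLUE", "fun", "beach"],
    8: ["FEMALE", "HAIR_RED", "EYE_BROWN", "beach"],
}

index = build_inverted_index(profiles)
print(index["EYE_BLUE"])                        # [1, 2, 3]
print(search_all(index, "FEMALE", "EYE_BLUE"))  # [2]
```

Intersecting the per-term id lists is what makes a multi-criteria people search cheap: no profile is scanned unless it already appears under at least one of the requested terms.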


Page 25: Joshuba Tuberville @ Lucene Revolution 2011

Schema Design

• Single “Table”
• One-to-many = multi-value fields
• Individual vs Composite Fields
• copyField and have both!


Page 26: Joshuba Tuberville @ Lucene Revolution 2011

Field considerations

• Stored or not
• Indexed or not
• Multivalued - desires fields
• Type


Page 27: Joshuba Tuberville @ Lucene Revolution 2011

Solr Types Used

• tdate, tint, tfloat* - birthdate, loginAt
• text - all text
• string - id, non indexed text
• random - good for random sorts
• enum - for all enumerations

* The ‘t’ is for Trie


Page 28: Joshuba Tuberville @ Lucene Revolution 2011

Data Duplication

• By function - numberPhotos & hasPhotos

• By relationship - hiddenBy & hidden
• By analysis - name & text


Page 29: Joshuba Tuberville @ Lucene Revolution 2011

Saving Profiles

• Updating is an in-memory operation
• No partial updates
• Commit means flush index changes
• Autocommit on maxDocs, maxTime or both


Page 30: Joshuba Tuberville @ Lucene Revolution 2011

Why Also Voldemort

• Private profiles cannot be stale
• Many fields not searchable or viewable by others
• Isolate queries from fetch by id


Page 31: Joshuba Tuberville @ Lucene Revolution 2011

Querying

• Superset of Lucene
• Efficient Range Queries
• Multiple Query Handlers
• Dismax, Boost, Geo


Page 32: Joshuba Tuberville @ Lucene Revolution 2011

Recall vs Precision

• Focus on recall when corpus is small
• Precision once it is at critical mass


Page 33: Joshuba Tuberville @ Lucene Revolution 2011

Boolean Queries

• Default operator set to AND
• +gender:FEMALE +seeking:MALE +eyeColor:EYE_BLUE +hairColor:(HAIR_RED OR HAIR_BLOND)
• Sort order is important
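
A small Python sketch of how a client might assemble an all-required query like the one on this slide. The helper names are hypothetical, and two choices are assumptions about the slide's intent rather than anything it states: alternative values for one field are joined with an explicit OR inside the parentheses, and values are spelled as in the term table (HAIR_BLOND). Values are assumed to be enum constants, so no query escaping is done.

```python
def required_clause(field, values):
    """Build a required (+) clause; several values are OR-ed inside parentheses."""
    if isinstance(values, str):
        return f"+{field}:{values}"
    return f"+{field}:({' OR '.join(values)})"


def boolean_query(criteria):
    """Join one required clause per field, in the order the criteria are given."""
    return " ".join(required_clause(field, values) for field, values in criteria.items())


query = boolean_query({
    "gender": "FEMALE",
    "seeking": "MALE",
    "eyeColor": "EYE_BLUE",
    "hairColor": ["HAIR_RED", "HAIR_BLOND"],  # either hair color is acceptable
})
print(query)
# +gender:FEMALE +seeking:MALE +eyeColor:EYE_BLUE +hairColor:(HAIR_RED OR HAIR_BLOND)
```

Because every clause is required, each matching profile satisfies all criteria equally, which is why the result order has to come from an explicit sort.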


Page 34: Joshuba Tuberville @ Lucene Revolution 2011

Hybrid Queries

• Default operator set to OR
• +gender:FEMALE +seeking:MALE eyeColor:EYE_BLUE hairColor:(HAIR_RED HAIR_BLOND)


Page 35: Joshuba Tuberville @ Lucene Revolution 2011

Why you’re lucky if you like redheads

• Inverse Document Frequency (IDF)

• Rarer is favored over more common

• More fields matched = higher ranking

1. Blue eyed, redheads
2. Blue eyed, blonds
3. Redheads
4. Blonds
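
The ordering above can be reproduced with a deliberately simplified Python sketch. It assumes the classic Lucene-style formula idf = 1 + ln(numDocs / (docFreq + 1)) and scores a profile as the plain sum of the IDFs of the optional terms it matches; real Lucene scoring also involves term frequency, norms, and coordination factors. The document frequencies come from the ten-profile term table shown earlier in the deck.

```python
import math


def idf(doc_freq, num_docs):
    """Classic Lucene-style IDF: rarer terms get a larger weight."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


NUM_DOCS = 10
# Document frequencies taken from the term table: one redhead, four blonds,
# four blue-eyed profiles out of ten.
DOC_FREQ = {"EYE_BLUE": 4, "HAIR_RED": 1, "HAIR_BLOND": 4}


def score(matched_terms):
    """Simplified score: sum of IDF over the optional terms a profile matches."""
    return sum(idf(DOC_FREQ[term], NUM_DOCS) for term in matched_terms)


candidates = {
    "Blonds": ["HAIR_BLOND"],
    "Redheads": ["HAIR_RED"],
    "Blue eyed, blonds": ["EYE_BLUE", "HAIR_BLOND"],
    "Blue eyed, redheads": ["EYE_BLUE", "HAIR_RED"],
}

ranking = sorted(candidates, key=lambda name: score(candidates[name]), reverse=True)
for position, name in enumerate(ranking, start=1):
    print(position, name, round(score(candidates[name]), 3))
```

Matching two optional fields beats matching one, and among equal match counts the rarer term (red hair) wins.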


Page 36: Joshuba Tuberville @ Lucene Revolution 2011

Boosting

• Query time by importance
• eyeColor:EYE_BLUE^2 hairColor:HAIR_BLOND

Page 38: Joshuba Tuberville @ Lucene Revolution 2011

Filter Fields

• Useful for roles and other lists

• -hidden:(2 4 6)• -hiddenBy:1

id hidden

1 2, 4, 6

2 1

id hiddenBy1 22 14 16 1
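
A Python sketch of how the two tables relate, under one assumption about the slide: hiddenBy is the inverse of hidden, so that `-hiddenBy:1` removes every profile that user 1 has hidden. The function names are hypothetical and the filtering is done in memory purely for illustration; in Solr the exclusion would be part of the query.

```python
from collections import defaultdict


def invert_hidden(hidden):
    """Turn {user: [profiles they hid]} into {profile: [users who hid it]}."""
    hidden_by = defaultdict(list)
    for owner, targets in hidden.items():
        for target in targets:
            hidden_by[target].append(owner)
    return {pid: sorted(owners) for pid, owners in sorted(hidden_by.items())}


def visible_to(viewer, hidden_by, all_ids):
    """Ids left after the equivalent of -hiddenBy:<viewer> is applied."""
    return [pid for pid in all_ids if viewer not in hidden_by.get(pid, [])]


hidden = {1: [2, 4, 6], 2: [1]}   # the "hidden" table from the slide
hidden_by = invert_hidden(hidden)

print(hidden_by)                               # {1: [2], 2: [1], 4: [1], 6: [1]}
print(visible_to(1, hidden_by, range(1, 7)))   # [1, 3, 5]
```

Storing the inverse relationship on each profile keeps the exclusion clause a single short term, whatever the length of the viewer's hidden list.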

Page 40: Joshuba Tuberville @ Lucene Revolution 2011

Date Math

• Simplifies query preprocessing
• +birthDate:[NOW/DAY+1DAY-36YEAR TO NOW/DAY-25YEAR]

Between 25 and 35 years old
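
A Python sketch of why the bounds are 36 and 25 for an age range of 25 to 35. The function names and the `birthDate` default are illustrative; `age_on` is only there to check the boundary dates by hand and is not part of Solr.

```python
from datetime import date


def age_range_clause(min_age, max_age, field="birthDate"):
    """Build a Solr date-math range matching ages min_age..max_age inclusive."""
    # Oldest allowed birthdate: the day after "max_age + 1 years ago today",
    # i.e. someone who turns max_age + 1 tomorrow is still included.
    oldest = f"NOW/DAY+1DAY-{max_age + 1}YEAR"
    # Youngest allowed birthdate: exactly min_age years ago today.
    youngest = f"NOW/DAY-{min_age}YEAR"
    return f"+{field}:[{oldest} TO {youngest}]"


def age_on(birth, today):
    """Whole years between birth and today."""
    years = today.year - birth.year
    if (today.month, today.day) < (birth.month, birth.day):
        years -= 1
    return years


print(age_range_clause(25, 35))
# +birthDate:[NOW/DAY+1DAY-36YEAR TO NOW/DAY-25YEAR]

today = date(2011, 5, 26)
print(age_on(date(1975, 5, 27), today))  # 35, the oldest birthdate in range
print(age_on(date(1986, 5, 26), today))  # 25, the youngest birthdate in range
```

The client only supplies two integers; Solr resolves NOW at query time, so no date arithmetic is needed before the query is sent.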


Page 41: Joshuba Tuberville @ Lucene Revolution 2011

Distance Searching

• lat, lon, distance
• LocalSolr by Patrick O’Leary
• Additional overhead ~90ms per query
• Superseded in Solr 3.1


Page 42: Joshuba Tuberville @ Lucene Revolution 2011

Testing Queries

• Log queries and ids returned
• Version your search strategies
• Improve one thing at a time


Page 43: Joshuba Tuberville @ Lucene Revolution 2011

Geo Service

• Read-mostly service
• Fields - Postal Code, Country, State, Cities, Lat, Lon
• Usage - Registration Validation, City Selection


Page 44: Joshuba Tuberville @ Lucene Revolution 2011

Operations

• Servlet container and filesystem
• Jetty 6, 64-bit Java 6 JVM
• 8G Heap -XX:+UseCompressedOops


Page 45: Joshuba Tuberville @ Lucene Revolution 2011

Operations

• Active/Passive
• Layer 7 Load balancing
• Nightly snapshots
• Eventually SolrCloud


Page 46: Joshuba Tuberville @ Lucene Revolution 2011

Multicore

• Run multiple schemas on the same
• Hot swappable for backwards compatible changes
• private / public profiles


Page 47: Joshuba Tuberville @ Lucene Revolution 2011

Security

• No security provided
• At minimum secure your UpdateHandler
• Separate Cores

<delete><query>*:*</query></delete>


Page 48: Joshuba Tuberville @ Lucene Revolution 2011

Future

• Solr 3.1
• Mutual Matching
• Faceting / Guided Search
• Incorporating spelling
• Hierarchies, categories, better ranking models


Page 49: Joshuba Tuberville @ Lucene Revolution 2011

Faceting

• Returns counts with query results

• Efficient
• Guides the user toward precision
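
A Python sketch of what "counts with query results" means. Solr computes facet counts on the server; this in-memory version, with a hypothetical `facet_counts` helper and made-up result documents, only illustrates the shape of the information the user gets back.

```python
from collections import Counter


def facet_counts(docs, field):
    """Count how many result documents carry each value of the given field."""
    return Counter(doc[field] for doc in docs if field in doc)


# Hypothetical search results; profile 10 has no hair color recorded.
results = [
    {"id": 2, "hairColor": "HAIR_BLOND", "eyeColor": "EYE_BLUE"},
    {"id": 6, "hairColor": "HAIR_BLOND", "eyeColor": "EYE_BROWN"},
    {"id": 8, "hairColor": "HAIR_RED", "eyeColor": "EYE_BROWN"},
    {"id": 10, "eyeColor": "EYE_BLUE"},
]

hair = facet_counts(results, "hairColor")
eyes = facet_counts(results, "eyeColor")
print(dict(hair))  # {'HAIR_BLOND': 2, 'HAIR_RED': 1}
print(dict(eyes))  # {'EYE_BLUE': 2, 'EYE_BROWN': 2}
```

Showing these counts next to each filter tells the user how many profiles remain before they narrow the search, which is the guidance toward precision mentioned above.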


Page 50: Joshuba Tuberville @ Lucene Revolution 2011

Thank You

[email protected]

Twitter: @jtuberville


