
Search@airbnb

Date post: 14-Aug-2015
Upload: mousom-gupta
Building Search@Airbnb Mousom Dhar Gupta

Total Guests: 20,000,000+

Countries: 190

Cities: 34,000+

Castles: 600+

Listings Worldwide: 1,200,000+

Search


Technical Stack

____________________________

Dropwizard as a service framework (incl. Jetty, Jersey, Jackson).

ZooKeeper (via SmartStack) for service discovery.

Lucene for index storage and simple retrieval.

In-house built forward index, real-time indexing, ranking, advanced filtering.

[Diagram: the Web App calls search nodes Search1, Search2, … SearchN. Each node is a JVM running 150 search threads over a Lucene index. ~30 replicas hold the same index data.]

Search

Overview

[Diagram: a search request fans out to eight Lucene shards. A Combiner merges their results, followed by Filtering and Ranking.]

Shards

____________________________

Each box has 8 shards of the Lucene index.

Latency is 50% less than with a single-shard index.
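The fan-out and combine step can be sketched roughly as follows. This is a minimal illustration, not the actual Airbnb implementation: `Hit`, `Shard`, and `ShardedSearcher` are hypothetical names, and a real shard would wrap a Lucene searcher rather than a lambda.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical types for illustration; none of these names come from the talk.
final class Hit {
    final long listingId;
    final double score;

    Hit(long listingId, double score) {
        this.listingId = listingId;
        this.score = score;
    }
}

interface Shard {
    // Returns this shard's best matches for the query.
    List<Hit> search(String query, int topK);
}

final class ShardedSearcher {
    private final List<Shard> shards;
    private final ExecutorService pool;

    ShardedSearcher(List<Shard> shards) {
        this.shards = shards;
        this.pool = Executors.newFixedThreadPool(shards.size());
    }

    // Fans the query out to every shard in parallel, then combines the partial results.
    List<Hit> search(String query, int topK) {
        List<Future<List<Hit>>> futures = new ArrayList<>();
        for (Shard shard : shards) {
            futures.add(pool.submit(() -> shard.search(query, topK)));
        }
        List<Hit> combined = new ArrayList<>();
        try {
            for (Future<List<Hit>> future : futures) {
                combined.addAll(future.get());
            }
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("shard search failed", e);
        }
        // Combiner: keep the global top K by score, highest first.
        combined.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return new ArrayList<>(combined.subList(0, Math.min(topK, combined.size())));
    }

    void shutdown() {
        pool.shutdown();
    }
}
```

Because each shard scans only a fraction of the index and the shards run concurrently, the slowest shard bounds the latency instead of the whole index.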

Challenges

____________________________

Bootstrap (creating the index from scratch).

Ensuring consistency of the index with ground truth data in real time.

Indexing

What’s in the Lucene index?

____________________________

Positions of listings indexed using Lucene’s spatial module (RecursivePrefixTreeStrategy)

Categorical and numerical properties like room type and maximum occupancy

Full text (descriptions, reviews, etc.)

~40 fields per listing from a variety of data sources, all updated in real time
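RecursivePrefixTreeStrategy indexes a position as cells of a spatial prefix tree, so a location filter becomes a prefix match on cell tokens. The sketch below illustrates that idea with geohash cells in plain Java. It does not use Lucene's API, and the choice of a geohash grid is an assumption, since the talk does not say which grid is used. `Geohash` and `inCell` are hypothetical names.

```java
// Plain-Java illustration of geohash cells; this is not Lucene's API.
final class Geohash {
    private static final String BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz";

    // Encodes a point as a geohash string; each extra character narrows the cell.
    static String encode(double lat, double lon, int precision) {
        double latLo = -90, latHi = 90, lonLo = -180, lonHi = 180;
        StringBuilder hash = new StringBuilder();
        boolean useLon = true; // bits alternate, starting with longitude
        int bits = 0;
        int value = 0;
        while (hash.length() < precision) {
            if (useLon) {
                double mid = (lonLo + lonHi) / 2;
                if (lon >= mid) { value = (value << 1) | 1; lonLo = mid; }
                else { value = value << 1; lonHi = mid; }
            } else {
                double mid = (latLo + latHi) / 2;
                if (lat >= mid) { value = (value << 1) | 1; latLo = mid; }
                else { value = value << 1; latHi = mid; }
            }
            useLon = !useLon;
            if (++bits == 5) { // five bits make one base-32 character
                hash.append(BASE32.charAt(value));
                bits = 0;
                value = 0;
            }
        }
        return hash.toString();
    }

    // A point lies inside a cell exactly when its hash starts with the cell's hash.
    static boolean inCell(String pointHash, String cellHash) {
        return pointHash.startsWith(cellHash);
    }
}
```

A coarse cell such as a two-character hash covers a large region, and longer hashes refine it, which is what lets a prefix tree answer "listings in this area" without comparing coordinates one by one.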

[Diagram: Realtime Update. MySQL databases (master, calendar, fraud) feed SpinalTap… → Medusa (backed by a DataStore) → Search 1, Search 2, … Search N.]

Tails binary update logs from MySQL servers (5.6+).

Converts changes in any of the tables into actionable objects called “Mutations” (inserts, updates, deletes).

Broadcasts them to Medusa using Kafka.

SpinalTap
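A mutation object of the kind described above might look roughly like this. The talk names the concept but not its fields, so the `Mutation` class, its `before`/`after` row images, and `changedColumns` are all assumptions made for illustration. Binlog parsing and the Kafka broadcast are left out.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical shape of a mutation; field names are not from the talk.
final class Mutation {
    enum Type { INSERT, UPDATE, DELETE }

    final Type type;
    final String table;
    final Map<String, Object> before; // row before the change (empty for inserts)
    final Map<String, Object> after;  // row after the change (empty for deletes)

    private Mutation(Type type, String table, Map<String, Object> before, Map<String, Object> after) {
        this.type = type;
        this.table = table;
        this.before = Collections.unmodifiableMap(new HashMap<>(before));
        this.after = Collections.unmodifiableMap(new HashMap<>(after));
    }

    static Mutation insert(String table, Map<String, Object> row) {
        return new Mutation(Type.INSERT, table, Collections.<String, Object>emptyMap(), row);
    }

    static Mutation update(String table, Map<String, Object> oldRow, Map<String, Object> newRow) {
        return new Mutation(Type.UPDATE, table, oldRow, newRow);
    }

    static Mutation delete(String table, Map<String, Object> row) {
        return new Mutation(Type.DELETE, table, row, Collections.<String, Object>emptyMap());
    }

    // Columns whose value differs between the two row images.
    Set<String> changedColumns() {
        Set<String> columns = new TreeSet<>(before.keySet());
        columns.addAll(after.keySet());
        Set<String> changed = new TreeSet<>();
        for (String column : columns) {
            if (!Objects.equals(before.get(column), after.get(column))) {
                changed.add(column);
            }
        }
        return changed;
    }
}
```

Carrying both row images lets a consumer decide cheaply whether a change touches any column it cares about before doing further work.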


Source of truth for search index data.

Listens to updates from SpinalTap and builds new IndexData by querying ~15 MySQL tables from three different databases.

Persists everything in a DataStore and broadcasts latest version to all search nodes.

Uses ZooKeeper for leader election.

Medusa
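The rebuild, persist, and broadcast loop described above could be sketched as follows. Everything here is hypothetical: `IndexDataPublisher`, `SearchNode`, the per-listing version counter, and the in-memory map standing in for the DataStore are assumptions, and ZooKeeper leader election is omitted.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.LongFunction;

// Hypothetical value object; the talk only names "IndexData".
final class IndexData {
    final long listingId;
    final long version;
    final Map<String, Object> fields;

    IndexData(long listingId, long version, Map<String, Object> fields) {
        this.listingId = listingId;
        this.version = version;
        this.fields = fields;
    }
}

interface SearchNode {
    void apply(IndexData data);
}

final class IndexDataPublisher {
    private final LongFunction<Map<String, Object>> loader;         // stands in for querying the MySQL tables
    private final Map<Long, IndexData> dataStore = new HashMap<>(); // stands in for the DataStore
    private final List<SearchNode> nodes;

    IndexDataPublisher(LongFunction<Map<String, Object>> loader, List<SearchNode> nodes) {
        this.loader = loader;
        this.nodes = nodes;
    }

    // Called when a mutation touches a listing: rebuild, persist, then broadcast.
    IndexData onListingChanged(long listingId) {
        IndexData previous = dataStore.get(listingId);
        long version = previous == null ? 1 : previous.version + 1;
        IndexData latest = new IndexData(listingId, version, loader.apply(listingId));
        dataStore.put(listingId, latest);
        for (SearchNode node : nodes) {
            node.apply(latest);
        }
        return latest;
    }
}
```

Rebuilding the full record from the source tables on every change, rather than patching individual fields, keeps the persisted copy usable as the source of truth for search nodes.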


What’s in the forward index?

____________________________

Holds all the metadata about a listing required by scoring and filtering.

We also have complicated business rules to calculate Price, Availability, Instant Book, etc., which need a ton of metadata.

~50 fields built from multiple data sources and updated in real time.

public final class ForwardIndexData {
    private final CalendarData calendarData;
    private final PricingData pricingData;
    private final HostInfo hostInfo;
    . . . . . . . .
}

public final class CalendarData {
    private final DateRanges reservationDates;
    private final SeasonalValues startDayOfWeeks;
    . . . .
}

private final class SeasonalValues<T> {
    private final DateRange startDate;
    private final T value;
    . . . .
}

Forward Index

Availability

____________________________

Depends on the profile of the guest.

The check-in date must be one of the valid start days of the week.

The stay must satisfy the seasonal minimum nights.

There must be enough preparation time for the host.

Busy dates are imported from external calendars to avoid booking conflicts.
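The rules above can be sketched as a single check. This is a simplified illustration under stated assumptions: `AvailabilityRules` and its fields are hypothetical, a single minimum-nights value stands in for the seasonal values, preparation time is interpreted as free days immediately before check-in, and the guest-profile rule is left out.

```java
import java.time.DayOfWeek;
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;
import java.util.Set;

// Hypothetical rule holder; field names are not from the talk.
final class AvailabilityRules {
    final Set<DayOfWeek> validStartDays;
    final int minNights;            // a seasonal lookup by date would replace this constant
    final int preparationDays;      // days the host needs free before check-in (assumed meaning)
    final Set<LocalDate> busyDates; // reservations plus dates imported from external calendars

    AvailabilityRules(Set<DayOfWeek> validStartDays, int minNights,
                      int preparationDays, Set<LocalDate> busyDates) {
        this.validStartDays = validStartDays;
        this.minNights = minNights;
        this.preparationDays = preparationDays;
        this.busyDates = busyDates;
    }

    boolean isAvailable(LocalDate checkIn, LocalDate checkOut) {
        long nights = ChronoUnit.DAYS.between(checkIn, checkOut);
        if (nights < Math.max(1, minNights)) {
            return false;
        }
        if (!validStartDays.contains(checkIn.getDayOfWeek())) {
            return false;
        }
        // Every night of the stay, plus the preparation days before it, must be free.
        for (LocalDate day = checkIn.minusDays(preparationDays);
             day.isBefore(checkOut);
             day = day.plusDays(1)) {
            if (busyDates.contains(day)) {
                return false;
            }
        }
        return true;
    }
}
```

Since a check like this runs for every candidate listing on every search, it only works at scale if the inputs are already in memory, which is the motivation for the forward index.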

Pricing

____________________________

Depends on the number of guests and the number of nights.

Depends on how near or far away the check-in date is.

Depends on the length of the trip and whether the host has weekly and monthly pricing.

Depends on whether there is a special price override for these nights.
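A rough sketch of how these inputs might combine into a trip price follows. `PricingRules`, its fields, the 7-night and 28-night thresholds, and the percentage-discount model are assumptions for illustration only, and the dependence on how far away the check-in date is has been left out.

```java
import java.time.LocalDate;
import java.util.Map;

// Hypothetical pricing inputs; names and thresholds are not from the talk.
final class PricingRules {
    final long nightlyCents;
    final int guestsIncluded;
    final long extraGuestCents;             // per extra guest, per night
    final int weeklyDiscountPercent;        // assumed to apply to stays of 7+ nights
    final int monthlyDiscountPercent;       // assumed to apply to stays of 28+ nights
    final Map<LocalDate, Long> overrideCents; // special prices for specific nights

    PricingRules(long nightlyCents, int guestsIncluded, long extraGuestCents,
                 int weeklyDiscountPercent, int monthlyDiscountPercent,
                 Map<LocalDate, Long> overrideCents) {
        this.nightlyCents = nightlyCents;
        this.guestsIncluded = guestsIncluded;
        this.extraGuestCents = extraGuestCents;
        this.weeklyDiscountPercent = weeklyDiscountPercent;
        this.monthlyDiscountPercent = monthlyDiscountPercent;
        this.overrideCents = overrideCents;
    }

    long totalCents(LocalDate checkIn, int nights, int guests) {
        long extraGuests = Math.max(0, guests - guestsIncluded);
        long total = 0;
        for (int i = 0; i < nights; i++) {
            LocalDate night = checkIn.plusDays(i);
            // A per-night override replaces the base nightly price.
            long base = overrideCents.getOrDefault(night, nightlyCents);
            total += base + extraGuests * extraGuestCents;
        }
        int discount = nights >= 28 ? monthlyDiscountPercent
                     : nights >= 7 ? weeklyDiscountPercent
                     : 0;
        return total * (100 - discount) / 100;
    }
}
```

Amounts are kept in integer cents so that discounts do not introduce floating-point rounding drift.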

Instant Book

____________________________

Depends on the number of guests and the number of nights.

Depends on the profile of the guest, such as positive reviews and whether they have a profile photo.

Depends on how much preparation time the host has, etc.

Needs to store objects with 50-100 fields as values keyed by listing id.

Should avoid the cost of serialization/deserialization during every fetch.

Data must be available in memory for fast lookup, but also persisted on disk.

Highly concurrent: the writer shouldn’t block the readers (one writer but >100 reader threads).

Requirements

Why did we need our custom Forward Index?

// Forward Index
public interface ForwardIndex<V> {
    Map<Long, V> asMap();
    void put(long id, V value);
    void putAll(Map<Long, V> values);
    void remove(long id);
    void commit();
}

Forward Index Interface

// Writer
forwardIndex.put(listingId, listingData);
. . .
// write to disk and also make it visible to readers.
forwardIndex.commit();

// Reader
// Fetch forward index data from in-memory map
Map<Long, ListingData> fwdIndex = forwardIndex.asMap();
ListingData data = fwdIndex.get(listingId);

// Use it to evaluate business rules
checkAvailability(data, searchRequest);
calculatePrice(data, searchRequest);

[Diagram: the forward index is backed by a NonBlocking In-Memory HashMap and a DiskStore.]

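// Forward Index
public class ForwardIndexStore<V> implements ForwardIndex<V> {
    private final DB<V> diskStore;
    private final Cache<V> cache;

    . . . .

    // Readers get a read-only view of the in-memory data.
    @Override
    public Map<Long, V> asMap() {
        return Collections.unmodifiableMap(cache);
    }

    // Writes go to both the disk store and the in-memory cache.
    @Override
    public void put(long id, V value) {
        diskStore.put(id, value);
        cache.put(id, value);
    }

    . . . .

    // Persists pending writes and makes them visible to readers.
    @Override
    public void commit() {
        diskStore.commit();
        cache.commit();
    }
}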

Forward Index Implementation
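One way to get the "writer never blocks readers" property with commit semantics is a copy-on-write snapshot published through a volatile reference. The sketch below shows that technique. It is a swapped-in illustration, not the non-blocking hash map from the talk, and `SnapshotForwardIndex` is a hypothetical name. Disk persistence is omitted, and copying the whole map on every commit would be too expensive for a large index without batching.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Single-writer, many-reader map with explicit commit; illustration only.
final class SnapshotForwardIndex<V> {
    // Readers only ever see a fully built, immutable snapshot.
    private volatile Map<Long, V> snapshot = Collections.emptyMap();
    // Staged changes, touched only by the single writer thread.
    // A null value marks a pending removal.
    private final Map<Long, V> pending = new HashMap<>();

    // Lock-free read path: a single volatile read.
    Map<Long, V> asMap() {
        return snapshot;
    }

    void put(long id, V value) {
        pending.put(id, Objects.requireNonNull(value));
    }

    void remove(long id) {
        pending.put(id, null);
    }

    // Builds the next snapshot off to the side, then publishes it atomically.
    void commit() {
        Map<Long, V> next = new HashMap<>(snapshot);
        for (Map.Entry<Long, V> entry : pending.entrySet()) {
            if (entry.getValue() == null) {
                next.remove(entry.getKey());
            } else {
                next.put(entry.getKey(), entry.getValue());
            }
        }
        pending.clear();
        snapshot = Collections.unmodifiableMap(next);
    }
}
```

A reader that grabbed a snapshot before a commit keeps a consistent view for the rest of its query, which is useful when several business rules read the same listing in sequence.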

Ranking Problem

____________________________

Not a text search problem.

Users are almost never searching for a specific item; rather, they’re looking to “Discover”.

The most common component of a query is location.

Highly personalized: the user is a part of the query.

Optimizing for conversion (Search -> Inquiry -> Booking).

Evolution through continuous experimentation.

Ranking

Ranking Components

____________________________

Relevance

Quality

Bookability

Personalization

Desirability of location, etc.

Ranking

Several hundred signals used to build machine learning models:


Properties of the listing (reviews, location, etc.)

Behavioral signals (mined from request logs)

Image quality and clickability (computer vision)

Host behavior (response time/rate, cancellations, etc.)

Host preferences model

DB snapshots Logs

Life of a Query

[Diagram: query pipeline. Stages shown: Query Understanding (geocoding, configuring retrieval options, choosing ranking models), Retrieval, Populator, First Pass Scorer (quality, bookability, relevance), Second Pass Ranking, Result Generation, and AirEvents, plus Filtering by Price and Availability. Result counts shown on the diagram: 25, 2000, 25.]

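Second Pass Ranking

____________________________

Traditional ranking works like this: [formula not captured in the transcript]

then sort by [formula not captured in the transcript]

In contrast, second pass operates on the entire list at once: [formula not captured in the transcript]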

Makes it possible to implement features like result diversity, etc.
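As an illustration of a list-level step, the sketch below greedily re-ranks candidates while penalizing repeats of an attribute already present in the result. The greedy penalty scheme, the `Candidate` fields, and the choice of neighborhood as the diversity attribute are all assumptions. The talk does not describe how its second pass is implemented.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical candidate; the neighborhood attribute is only an example.
final class Candidate {
    final long listingId;
    final double score;        // first-pass score
    final String neighborhood; // attribute to diversify on

    Candidate(long listingId, double score, String neighborhood) {
        this.listingId = listingId;
        this.score = score;
        this.neighborhood = neighborhood;
    }
}

final class SecondPassRanker {
    // Picks results one at a time; each pick depends on what is already in the list.
    static List<Candidate> rerank(List<Candidate> candidates, int limit, double penalty) {
        List<Candidate> remaining = new ArrayList<>(candidates);
        List<Candidate> result = new ArrayList<>();
        Map<String, Integer> seen = new HashMap<>();
        while (!remaining.isEmpty() && result.size() < limit) {
            Candidate best = null;
            double bestAdjusted = Double.NEGATIVE_INFINITY;
            for (Candidate candidate : remaining) {
                int repeats = seen.getOrDefault(candidate.neighborhood, 0);
                double adjusted = candidate.score - penalty * repeats;
                if (best == null || adjusted > bestAdjusted) {
                    best = candidate;
                    bestAdjusted = adjusted;
                }
            }
            remaining.remove(best);
            result.add(best);
            seen.merge(best.neighborhood, 1, Integer::sum);
        }
        return result;
    }
}
```

With a penalty of zero this reduces to a plain sort by score, which makes the diversity effect easy to compare in an experiment.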
