
Secondary Indexing

Date post: 24-Feb-2016
Upload: jerome
Description:
Jesse Yates, Salesforce.com. Secondary Indexing: the discussion so far. 9/11/12 HBase Pow-wow. HBase rows are multi-dimensional but only sorted on the row key.
Transcript
Page 1: Secondary Indexing

Secondary Indexing

the discussion so far….

9/11/12 HBase Pow-wow

Jesse Yates
Salesforce.com

Page 2: Secondary Indexing

What is it?

Page 3: Secondary Indexing

Problem

• HBase rows are multi-dimensional
  – Only sorted on the row key
• How do you efficiently look up values beyond the row key?

Page 4: Secondary Indexing

Example

Row  Family  Qualifier  Timestamp  Value
1    Name    First      0          Babe
1    Name    Last       0          Ruth

How do we find all people with the last name ‘Ruth’?

Full table scan!
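A minimal sketch of why this query degenerates into a full scan, written in Python rather than against the HBase client API. The table is modelled as a list of cells sorted by (row, family, qualifier, timestamp); the names `CELLS` and `find_rows_by_value`, and the second person in the data, are made up for illustration.

```python
# Toy model of a table: cells sorted by (row, family, qualifier, timestamp).
# Illustration only; this is not the HBase client API.
CELLS = sorted([
    (("1", "Name", "First", 0), "Babe"),
    (("1", "Name", "Last", 0), "Ruth"),
    (("2", "Name", "First", 0), "Hank"),   # extra row, invented for the example
    (("2", "Name", "Last", 0), "Aaron"),
])


def find_rows_by_value(cells, family, qualifier, wanted):
    """Return (matching row keys, number of cells visited).

    The sort order only helps when the row key is known, so a search by
    cell value has to visit every cell.
    """
    matches = []
    visited = 0
    for (row, fam, qual, _ts), value in cells:
        visited += 1
        if fam == family and qual == qualifier and value == wanted:
            matches.append(row)
    return matches, visited


rows, visited = find_rows_by_value(CELLS, "Name", "Last", "Ruth")
print(rows, visited)  # ['1'] 4
```

The number of cells visited equals the size of the table, regardless of how many rows match.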

Page 5: Secondary Indexing

Indexing!

Row   Family  Qualifier  Timestamp  Value
Ruth  Name    Last       0          1

Store the property we need to search for as the primary key
• pointer back to the primary row
• fast lookup - O(lg(n))
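The index layout above can be sketched as a sorted list of (indexed value, primary row key) pairs searched with binary search. This is a Python toy under stated assumptions, not HBase code: `INDEX` and `lookup` are hypothetical names, and the extra entries are invented so the lookup has something to skip.

```python
import bisect

# Toy index table: sorted (indexed value, primary row key) pairs.
# The indexed value plays the role of the index row key; the second
# element is the pointer back to the primary row.
INDEX = sorted([
    ("Aaron", "2"),
    ("Ruth", "1"),
    ("Ruth", "7"),
])


def lookup(index, value):
    """Return the primary row keys stored under `value`.

    Binary search finds the first candidate in O(lg n); the loop then
    reads only the entries that share the same indexed value.
    """
    # (value,) sorts before every (value, row) pair, so this is the
    # position of the first entry for `value`, if there is one.
    start = bisect.bisect_left(index, (value,))
    rows = []
    for key, primary_row in index[start:]:
        if key != value:
            break
        rows.append(primary_row)
    return rows


print(lookup(INDEX, "Ruth"))  # ['1', '7']
```

Each returned key still needs a second read against the primary table to fetch the full row.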

Page 6: Secondary Indexing

Use Cases

• Point lookups
  – Volume of data influences usefulness of index
    • Let user decide if they need to use an index
• Scan lookup
  – WHERE age > 16
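A scan lookup such as `WHERE age > 16` can be sketched as a range read over a sorted index: find the first entry past the threshold, then read forward. The data and the names `AGE_INDEX` and `rows_with_age_over` are hypothetical.

```python
import bisect

# Toy index on age: sorted (age, primary row key) pairs.
AGE_INDEX = sorted([(12, "row-a"), (16, "row-b"), (17, "row-c"), (30, "row-d")])
AGES = [age for age, _row in AGE_INDEX]  # the index keys alone, for bisect


def rows_with_age_over(threshold):
    """Return primary row keys whose indexed age is strictly greater than `threshold`."""
    # bisect_right skips every entry equal to the threshold, giving '>' semantics.
    start = bisect.bisect_right(AGES, threshold)
    return [row for _age, row in AGE_INDEX[start:]]


print(rows_with_age_over(16))  # ['row-c', 'row-d']
```

Python compares the integers numerically here. HBase row keys are byte arrays compared lexicographically, so a real index would need an encoding whose byte order matches numeric order, for example fixed-width big-endian.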

Page 7: Secondary Indexing

Implementations

Page 8: Secondary Indexing

Omid

Full transactional support
Centralized oracle

Page 9: Secondary Indexing

Lily

WAL implementation on top of HBase
100-500 writes/sec

Page 10: Secondary Indexing

Percolator

Full transactions
Distributed, optimistic locking

~10 sec latencies possible

Page 11: Secondary Indexing

Culvert

Async
Dead project, incomplete

Page 12: Secondary Indexing

http://jyates.github.com/2012/07/09/consistent-enough-secondary-indexes.html

Client-side coordinated index
Use timestamps to coordinate

Not yet implemented

Page 13: Secondary Indexing

Trend Micro Implementation

Still just POC???

Page 14: Secondary Indexing

Solr/Lucene

Standard Lucene library bolted on HBase
Not commonly used

Lots of formats/codecs already written

Page 15: Secondary Indexing

Considerations for HBase

What do we need to do?

Page 16: Secondary Indexing

Built-in vs. external library vs. semi-supported (e.g. security)

Page 17: Secondary Indexing

Which should I use??

• HBase experts write a single ‘right’ impl
• Officially endorse a ‘correct’ version
• What changes do we need to make
• How close to the core is the project
  – Written in everywhere
  – hbase-index module
  – External library

Page 18: Secondary Indexing

Async vs. Synchronous vs. Transactional

Page 19: Secondary Indexing

Key Observation

“Secondary indexing is inherently an easier problem than full transactions… secondary index updates are idempotent.”

- Lars Hofhansl
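The idempotence claim can be illustrated with a toy index whose cells are fully determined by their key, so replaying a write leaves the index unchanged. The key layout (indexed value, primary row, timestamp) and the name `put_index` are assumptions made for this sketch, not an HBase API.

```python
# Toy index: a dict keyed by (indexed value, primary row key, timestamp).
# Hypothetical layout for illustration only.

def put_index(index, value, primary_row, timestamp):
    """Write one index cell.

    The key fully determines the cell, so a retry overwrites the same
    cell instead of adding a new one.
    """
    index[(value, primary_row, timestamp)] = b""
    return index


# The write applied exactly once.
once = put_index({}, "Ruth", "1", 0)

# A client that retries after a timeout applies the same write three times.
retried = {}
for _ in range(3):
    put_index(retried, "Ruth", "1", 0)

print(once == retried)  # True
```

Because a replayed update cannot corrupt the index, a failed write can simply be retried, which is part of what makes indexing easier than a general transaction.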

Page 20: Secondary Indexing

Async vs. Synchronous vs. Transactional

• We don’t need full transactions
  – Transactions are slow
  – Transactions fail with increasing probability as number of servers increases
• Optionally async or sync
  – Async
    • Inherently ‘dirty’ index
• How does index cleanup work?
  – Inherently different for each type
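One possible answer to the cleanup question, sketched as a toy: treat the index as a hint, verify each hit against the primary row at read time, and drop entries that no longer match. This is only one strategy among several and is not taken from the slides; `primary`, `index`, and `read_via_index` are hypothetical names.

```python
# Toy model: `primary` maps row key -> last name; `index` is a set of
# (last name, row key) entries.
primary = {"1": "Ruth", "2": "Aaron"}
index = {("Ruth", "1"), ("Aaron", "2")}

# An async writer updates the primary row and adds the new index entry
# but has not removed the old one yet, so the index is now 'dirty'.
primary["2"] = "Ruth"
index.add(("Ruth", "2"))  # ("Aaron", "2") is stale


def read_via_index(primary, index, value):
    """Return row keys for `value`, removing index entries the primary table no longer backs."""
    live = []
    # Iterate over a sorted copy so entries can be discarded from `index` safely.
    for entry in sorted(e for e in index if e[0] == value):
        _value, row = entry
        if primary.get(row) == value:
            live.append(row)
        else:
            index.discard(entry)  # lazy cleanup of a stale entry
    return live


ruth_rows = read_via_index(primary, index, "Ruth")
aaron_rows = read_via_index(primary, index, "Aaron")
print(ruth_rows, aaron_rows)  # ['1', '2'] []
```

The cost is an extra read of the primary row per index hit; a synchronous index would instead pay that cost at write time.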

Page 21: Secondary Indexing

Locality

Page 22: Secondary Indexing

Where’s my data?

• Extra columns vs. index table
• HBase Region-pinning
  – Has to be best-effort or will decrease availability
  – Helps minimize RPC overhead
  – Cross-table region-pinning
  – Needs a coprocessor hook to be useful
• HDFS block allocation
  – Keep index and data blocks on same HDFS node

Page 23: Secondary Indexing

Index Cardinality

Page 24: Secondary Indexing

How much data are we talking?

“Seems like there are 3 categories of sparseness:
1. sparse indexes (like ipAddress) where a per-table approach is more efficient for reads
2. dense indexes (like eventType) where there are likely values of every index key on each region
3. very dense indexes (like male/female) where you should just be doing a table scan anyway”

- Matt Corgan (9/10/12)
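The three categories can be turned into a rough rule of thumb. The sketch below is an assumption-heavy illustration, not something proposed in the discussion: it assumes values are uniformly distributed, and the function name `classify_index`, the `scan_fraction` cutoff of 10%, and all row counts are invented.

```python
def classify_index(total_rows, distinct_values, num_regions, scan_fraction=0.1):
    """Classify an index as 'sparse', 'dense', or 'very dense'.

    Assumes every indexed value occurs equally often. `scan_fraction` is a
    hypothetical cutoff: if one value matches at least that share of the
    table, a plain table scan is assumed to be the better plan.
    """
    rows_per_value = total_rows / distinct_values
    if rows_per_value / total_rows >= scan_fraction:
        return "very dense"
    # With at least one matching row per region on average, a value is
    # likely to be present on every region.
    if rows_per_value >= num_regions:
        return "dense"
    return "sparse"


# Invented figures: one million rows spread over 100 regions.
print(classify_index(1_000_000, 500_000, 100))  # ipAddress-like: sparse
print(classify_index(1_000_000, 50, 100))       # eventType-like: dense
print(classify_index(1_000_000, 2, 100))        # male/female-like: very dense
```

Real data is rarely uniform, so a heuristic like this is at best a starting point for the user's own judgment.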

Page 25: Secondary Indexing

Impact on implementation

• Need a lot of knowledge of data to pick the right kind of index
  – User knows their data, let them do the hard work of picking indexes

Page 26: Secondary Indexing

Pluggability

Page 27: Secondary Indexing

Everyone’s got an impl already

• We need to make HBase flexible enough to support (most) current indexing formats with minimal overhead for switching
  – Lucene style Codec/CodecProvider?
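A codec/provider split in the spirit of that question might look like the following Python sketch. Everything here is hypothetical: `IndexCodec`, `DelimitedCodec`, `CODECS`, and `get_codec` are illustrative names, not Lucene or HBase classes, and the delimiter format is only one possible index key layout.

```python
from abc import ABC, abstractmethod
from typing import Tuple


class IndexCodec(ABC):
    """Hypothetical interface: how an index entry is laid out as a key."""

    @abstractmethod
    def encode(self, value: str, primary_row: str) -> bytes:
        ...

    @abstractmethod
    def decode(self, index_key: bytes) -> Tuple[str, str]:
        ...


class DelimitedCodec(IndexCodec):
    """Stores 'value NUL primary_row'; assumes the value contains no NUL byte."""

    SEP = b"\x00"

    def encode(self, value, primary_row):
        return value.encode("utf-8") + self.SEP + primary_row.encode("utf-8")

    def decode(self, index_key):
        value, _sep, primary_row = index_key.partition(self.SEP)
        return value.decode("utf-8"), primary_row.decode("utf-8")


# The 'provider': maps a configured name to a codec class, so switching
# formats becomes a configuration change rather than a code change.
CODECS = {"delimited": DelimitedCodec}


def get_codec(name: str) -> IndexCodec:
    return CODECS[name]()


codec = get_codec("delimited")
key = codec.encode("Ruth", "1")
print(codec.decode(key))  # ('Ruth', '1')
```

An existing implementation would plug in by registering its own `IndexCodec` subclass under a new name.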

Page 28: Secondary Indexing

Client-interface

Page 29: Secondary Indexing

What should it look like?

• Minimal changes to the top-level interfaces
  – Add a single new flag?
  – Configuration based?
• Enough that the user gets to be smart about what should be used
  – We can’t get all cases right – just provide building blocks
• Automatically use an index?
• Scanner/Filter style use?

Page 30: Secondary Indexing

Properties for the client

• Should the user even see the index lookups?
• ACID?
• Ordering of results?
  – Support the current sorted order?
  – Batch lookup?
• Implications on current features
  – Replication
  – Splitting

Page 31: Secondary Indexing

Schema(less)

• Schema enforced?
  – Rigid usage of index matching an expected schema?
  – Schema table? Reserved schema columns? .META.?
• Schema-less
  – Let the user apply whatever they think and use only what actually works
• Best-effort
  – Use client-hinted schema and try to apply all the known indexes

Page 32: Secondary Indexing

My random thoughts….

• Client-side managed indexes are efficient
  – Minimal RPC overhead
    • Cleanup is async to client and rarely misses
  – Solves the cross-region/server problem
    • Region-pinning is a nice-to-have optimization
  – Scales without concern for locality
  – Flexible enough to support custom codecs
  – Can be built to provide server-side optimizations
    • Locality aware indexes to minimize RPCs

Page 33: Secondary Indexing

Discussion!

