Old and New Tricks with GIN
Heikki Linnakangas / VMware
March 20, 2014
What is GIN?
Generalized Inverted iNdex
Used to index things like
- full-text search
- arrays
- key/value pairs (hstore)
- json, xml (with expression indexes)
GIN example: Arrays
create table int_arrays (intarr integer[]);
create index intarr_gin on int_arrays using GIN (intarr);
insert into int_arrays
select array[g, random() * 1000, random() * 1000]
from generate_series(1,100000) g;
GIN example: Arrays
select * from int_arrays where intarr @> array[29, 95];
intarr
---------------
{4399,95,29}
{34355,29,95}
{59742,29,95}
{94927,95,29}
(4 rows)
GIN example: Array operators
At index creation / insertion:
1. Extract elements from array
2. Index the elements
GIN example: Array operators
At search:
1. Extract elements from query
2. Search the index for the elements
3. Return rows that contain all of them
@> - “contains”, must contain all elements
&& - “overlap”, must contain at least one element
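The insert and search steps above can be sketched with a toy in-memory inverted index. This is only an illustration of the idea, not how GIN is implemented: the function names `build_index`, `contains` and `overlaps` are hypothetical, row ids stand in for tuple pointers, and the query is assumed to be non-empty.

```python
from collections import defaultdict

def build_index(rows):
    """Map each array element to the set of row ids that contain it."""
    index = defaultdict(set)
    for row_id, arr in enumerate(rows):
        for element in set(arr):          # 1. extract elements from the array
            index[element].add(row_id)    # 2. index the elements
    return index

def contains(index, query):
    """Like @>: rows that contain every element of a non-empty query."""
    postings = [index.get(e, set()) for e in set(query)]
    return set.intersection(*postings)

def overlaps(index, query):
    """Like &&: rows that contain at least one element of the query."""
    result = set()
    for e in query:
        result |= index.get(e, set())
    return result

# Made-up sample rows, in the shape of the int_arrays table.
rows = [[4399, 95, 29], [7, 29, 500], [12, 95, 3], [1, 2, 3]]
index = build_index(rows)
print(sorted(contains(index, [29, 95])))   # [0]
print(sorted(overlaps(index, [29, 95])))   # [0, 1, 2]
```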
Operator classes
PostgreSQL is extendable.
The operations to extract elements, search, and combine results are defined by an operator class
Built-in operator classes for arrays, full-text search, etc.
Three fundamental GIN operations
1. Extract keys from a value to insert or query
System calls the opclass’ extractQuery / extractValue function
2. Index them
System stores the extracted keys in a B-tree, using the opclass’ compare function.
3. Combine matches of several keys efficiently
System calls the opclass’ consistent function to determine if the item with a combination of keys matches the overall query.
GIN examples: Full-text search (1/2)
At insert:
1. Extract words from text:
‘PostgreSQL - The world’s most advanced open source database’ -> “postgresql”, “world”, “advanc”, “open”, “sourc”, “databas”
2. Index the words in the b-tree within GIN index.
GIN examples: Full-text search (2/2)
At search:
1. Extract words from query
2. Fetch all items containing any of the words
3. Determine which items match the overall query
Full-text search has a mini parser and syntax of its own:
select plainto_tsquery('an advanced open source database');
plainto_tsquery
-----------------------------------------
'advanc' & 'open' & 'sourc' & 'databas'
(1 row)
GIN examples: Trigrams (1/2)
At insert:
1. Extract trigrams from text:
foobar -> ‘  f’, ‘ fo’, ‘foo’, ‘oob’, ‘oba’, ‘bar’, ‘ar ’
(the word is padded with two leading spaces and one trailing space)
2. Index them
GIN examples: Trigrams (2/2)
At search:
1. Extract trigrams from query
2. Fetch all items containing any of the trigrams.
3. Determine which items match the overall query: an item must have at least N trigrams in common with the query.
- Can speed up LIKE searches!
- Also regular expressions!
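Trigram extraction and the "at least N common trigrams" test can be sketched as follows. The padding with two leading spaces and one trailing space follows what pg_trgm does for a single word; the function names and the threshold parameter `n` are hypothetical, and multi-word text is not handled.

```python
def trigrams(word):
    """Return the set of trigrams of a single word, padded with spaces."""
    padded = "  " + word.lower() + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def common_trigrams(a, b):
    """Count the trigrams shared by two words."""
    return len(trigrams(a) & trigrams(b))

def matches(indexed_word, query_word, n):
    """Step 3: the item matches if it shares at least n trigrams with the query."""
    return common_trigrams(indexed_word, query_word) >= n

print(sorted(trigrams("foobar")))
# ['  f', ' fo', 'ar ', 'bar', 'foo', 'oba', 'oob']
print(common_trigrams("foobar", "foobaz"))   # 5
print(matches("foobar", "foobaz", 4))        # True
```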
Three fundamental GIN operations
1. Extract keys from a value to insert or query
2. Index them
System stores the extracted keys in a B-tree, using the opclass’ compare function.
3. Determine which rows match, based on the keys present
Refresher: Regular B-tree
advanc: (0,8)
advanc: (0,14)
advanc: (0,22)
advanc: (0,17)
advanc: (0,26)
...
databas: (0,3)
databas: (2,10)
open: (0,11)
postgresql: (0,8)
postgresql: (0,41)
...
GIN on-disk format
Posting list
- A posting list contains pointers to the physical tuples in the table
- Each pointer consists of the page number and the offset within the page
(0,8) (0,14) (0,17) (0,22) (0,26) (0,33) (0,34) (0,35) (0,45) (0,47) (0,48) (1,3) (1,4) (1,6) (1,8)
Can be stored in-line in the entry tree, or as a whole separate B-tree (posting tree)
Posting tree page format
9.3 format
(0,8) (0,14) (0,17) (0,22) (0,26) (0,33) (0,34) (0,35) (0,45) (0,47) (0,48) (1,3) (1,4) (1,6) (1,8)
Each pointer takes 6 bytes (4 bytes for the block number and 2 for the offset): 15 pointers, 90 bytes in total.
Posting tree page format
9.4 format
(0,8) +6 +3 +5 +4 +7 +1 +1 +10 +2 +1 +2051 +1 +2 +2
Stores the pointers in compressed format, as a difference from the previous item: 21 bytes in total!
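The size arithmetic can be checked with a small sketch of delta encoding. Several details here are assumptions rather than something the slides state: each (block, offset) pointer is packed into one integer with 11 bits reserved for the offset, the first pointer is stored uncompressed in 6 bytes, and each delta is stored as a variable-length integer with 7 payload bits per byte. Because the packing is an assumption, the delta computed across a page boundary may differ from the value shown in the illustration above.

```python
OFFSET_BITS = 11  # assumption: bits reserved for the offset within a page

def to_int(block, offset):
    """Pack a (block, offset) pointer into a single integer."""
    return (block << OFFSET_BITS) | offset

def varbyte(n):
    """Encode a non-negative integer in 7-bit groups, low bits first.
    The high bit of each byte says whether another byte follows."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def compressed_size(pointers):
    """Size in bytes of a sorted posting list stored as first item + deltas."""
    values = [to_int(b, o) for b, o in pointers]
    size = 6  # first pointer kept uncompressed
    for prev, cur in zip(values, values[1:]):
        size += len(varbyte(cur - prev))
    return size

pointers = [(0, 8), (0, 14), (0, 17), (0, 22), (0, 26), (0, 33), (0, 34),
            (0, 35), (0, 45), (0, 47), (0, 48), (1, 3), (1, 4), (1, 6), (1, 8)]
print(len(pointers) * 6)           # 90 (uncompressed, 6 bytes per pointer)
print(compressed_size(pointers))   # 21
```

Small deltas fit in a single byte, which is why densely packed posting lists shrink so much.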
9.4 Posting tree format - btree_gin example
(the btree_gin extension is a “dummy” opclass implementation to emulate a normal B-tree)
create extension btree_gin;
create table numbers (n int4);
insert into numbers
select g % 10 from generate_series(1, 10000000) g;
create index numbers_btree on numbers (n);
create index numbers_gin on numbers using gin (n);
9.4 Posting tree format - btree_gin example
9.4
postgres=# \di+
              List of relations
 Schema | Name          | ... | Size   | ...
--------+---------------+-----+--------+-----
 public | numbers_btree |     | 214 MB |
 public | numbers_gin   |     | 11 MB  |
(2 rows)
9.3
 Schema | Name          | ... | Size   | ...
--------+---------------+-----+--------+-----
 public | numbers_btree |     | 214 MB |
 public | numbers_gin   |     | 58 MB  |
(2 rows)
Wow!
Table 346 MB
B-tree index 214 MB
GIN (9.3) 58 MB
GIN (9.4) 11 MB
New posting list format in 9.4
- Much more compact
- The new code can still read old-format pages
  - pg_upgrade works
  - but you won’t get the benefit until you REINDEX.
- More expensive to do random updates
  - GIN isn’t very fast with random updates anyway...
Recap: Three fundamental GIN operations
1. Extract keys from a value to insert or query
2. Index them
3. Combine matches of several keys efficiently, and determine which items match the overall query
Consistent function
select plainto_tsquery(
'an advanced PostgreSQL open source database');
plainto_tsquery
--------------------------------------------------------
'advanc' & 'postgresql' & 'open' & 'sourc' & 'databas'
(1 row)
select * from foo where col @@ plainto_tsquery(
'an advanced PostgreSQL open source database'
);
3. Combine matches efficiently (0/4)
The query returns the following matches from the index:
advanc | databas | open   | postgresql | sourc
-------+---------+--------+------------+-------
(0,8)  | (0,3)   | (0,2)  | (0,8)      | (0,1)
(0,14) | (0,8)   | (0,8)  | (0,41)     | (0,2)
(0,17) | (0,43)  | (0,30) |            | (0,8)
(0,22) | (0,47)  | (0,33) |            | (0,12)
(0,26) | (1,32)  | (0,36) |            | (0,13)
(0,33) |         | (0,44) |            | (0,18)
(0,34) |         | (0,46) |            | (0,19)
(0,35) |         | (0,56) |            | (0,20)
(0,45) |         | (1,4)  |            | (0,26)
(0,47) |         | (1,22) |            | (0,34)
(0,48) |         | (1,24) |            | (0,35)
(1,3)  |         | (1,32) |            | (0,50)
(1,4)  |         | (1,39) |            | (1,1)
(1,6)  |         |        |            | (1,5)
(1,8)  |         |        |            | (1,6)
3. Combine matches efficiently (1/4)
(0,1) contains only the word “sourc” -> no match
3. Combine matches efficiently (2/4)
(0,2) contains only the words “open” and “sourc” -> no match
3. Combine matches efficiently (3/4)
(0,3) contains only the word “databas” -> no match
3. Combine matches efficiently (4/4)
(0,8) contains all the words -> match
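The walk-through above can be sketched as a merge over the per-key posting lists, asking a consistent function about each item in turn. This is a simplified illustration under stated assumptions: the `consistent` function here only handles AND queries, the names `scan` and `posting_lists` are hypothetical, and the real opclass interface takes more arguments than shown.

```python
import heapq

# Posting lists as returned by the index in the example above.
posting_lists = {
    "advanc": [(0, 8), (0, 14), (0, 17), (0, 22), (0, 26), (0, 33), (0, 34),
               (0, 35), (0, 45), (0, 47), (0, 48), (1, 3), (1, 4), (1, 6), (1, 8)],
    "databas": [(0, 3), (0, 8), (0, 43), (0, 47), (1, 32)],
    "open": [(0, 2), (0, 8), (0, 30), (0, 33), (0, 36), (0, 44), (0, 46),
             (0, 56), (1, 4), (1, 22), (1, 24), (1, 32), (1, 39)],
    "postgresql": [(0, 8), (0, 41)],
    "sourc": [(0, 1), (0, 2), (0, 8), (0, 12), (0, 13), (0, 18), (0, 19),
              (0, 20), (0, 26), (0, 34), (0, 35), (0, 50), (1, 1), (1, 5), (1, 6)],
}

def consistent(present, query_keys):
    """AND semantics: the item matches only if every query key is present."""
    return all(key in present for key in query_keys)

def scan(posting_lists, query_keys):
    """Visit every item in pointer order and collect the keys present for it."""
    streams = [[(item, key) for item in posting_lists[key]] for key in query_keys]
    matches = []
    current, present = None, set()
    for item, key in heapq.merge(*streams):
        if item != current:
            # Finished the previous item: ask the consistent function about it.
            if current is not None and consistent(present, query_keys):
                matches.append(current)
            current, present = item, set()
        present.add(key)
    if current is not None and consistent(present, query_keys):
        matches.append(current)
    return matches

print(scan(posting_lists, list(posting_lists)))   # [(0, 8)]
```

Note that this approach reads every entry of every posting list, even though only one item matches.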
Fast Scan
Instead of scanning through the posting lists of all the keywords, only scan through the list with the fewest items, and skip the other lists to the next possible match.
- Big improvement for “frequent-term AND rare-term” style queries
Fast scan example
(0,8) contains all the words -> match
postgresql | databas | open   | advanc | sourc
-----------+---------+--------+--------+-------
(0,8)      | (0,3)   | (0,2)  | (0,8)  | (0,1)
(0,41)     | (0,8)   | (0,8)  | (0,14) | (0,2)
           | (0,43)  | (0,30) | (0,17) | (0,8)
           | (0,47)  | (0,33) | (0,22) | (0,12)
           | (1,32)  | (0,36) | (0,26) | (0,13)
           |         | (0,44) | (0,33) | (0,18)
           |         | (0,46) | (0,34) | (0,19)
           |         | (0,56) | (0,35) | (0,20)
           |         | (1,4)  | (0,45) | (0,26)
           |         | (1,22) | (0,47) | (0,34)
           |         | (1,24) | (0,48) | (0,35)
           |         | (1,32) | (1,3)  | (0,50)
           |         | (1,39) | (1,4)  | (1,1)
           |         |        | (1,6)  | (1,5)
           |         |        | (1,8)  | (1,6)
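The skipping idea can be sketched for a pure AND query as follows. This is a minimal illustration, not the actual GIN code: it drives the scan from the shortest list and uses binary search on in-memory lists to jump ahead in the others, whereas the real index skips within on-disk posting lists and trees. The function name `fast_scan` is hypothetical.

```python
from bisect import bisect_left

def fast_scan(posting_lists):
    """Intersect sorted posting lists, driven by the shortest one."""
    lists = sorted(posting_lists, key=len)      # rarest key first
    shortest, others = lists[0], lists[1:]
    positions = [0] * len(others)               # current position in each other list
    matches = []
    for item in shortest:
        found = True
        for i, plist in enumerate(others):
            # Skip ahead to the first entry >= item, ignoring everything before it.
            positions[i] = bisect_left(plist, item, positions[i])
            if positions[i] == len(plist) or plist[positions[i]] != item:
                found = False
                break
        if found:
            matches.append(item)
    return matches

postgresql = [(0, 8), (0, 41)]
databas = [(0, 3), (0, 8), (0, 43), (0, 47), (1, 32)]
open_ = [(0, 2), (0, 8), (0, 30), (0, 33), (0, 36), (0, 44), (0, 46),
         (0, 56), (1, 4), (1, 22), (1, 24), (1, 32), (1, 39)]
advanc = [(0, 8), (0, 14), (0, 17), (0, 22), (0, 26), (0, 33), (0, 34),
          (0, 35), (0, 45), (0, 47), (0, 48), (1, 3), (1, 4), (1, 6), (1, 8)]
sourc = [(0, 1), (0, 2), (0, 8), (0, 12), (0, 13), (0, 18), (0, 19),
         (0, 20), (0, 26), (0, 34), (0, 35), (0, 50), (1, 1), (1, 5), (1, 6)]

print(fast_scan([advanc, databas, open_, postgresql, sourc]))   # [(0, 8)]
```

With these lists only the two “postgresql” entries are ever tested, and the second one, (0,41), is rejected as soon as the “databas” list skips to (0,43).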
Summary: Improvements in 9.4
More compact posting list format
- 2x-10x smaller indexes, yay!
Fast scan
- Big speedup for queries with some frequent and some rare items
Thanks to Alexander Korotkov for these improvements!
Final GIN tip
GIN indexes are efficient at storing duplicates
- Use a GIN index with the btree_gin extension for status fields etc.
postgres=# \di+
List of relations
Schema | Name | ... | Size | ...
--------+---------------+-----+--------+-----
public | numbers_btree | | 214 MB |
public | numbers_gin | | 11 MB |
(2 rows)
Questions?