Old and New Tricks with GIN

Heikki Linnakangas / VMware

March 20, 2014

What is GIN?

Generalized Inverted iNdex

Used to index things like

- full-text search
- arrays
- key/value pairs (hstore)
- json, xml (with expression indexes)

GIN example: Arrays

create table int_arrays (intarr integer[]);

create index intarr_gin on int_arrays using GIN (intarr);

insert into int_arrays

select array[g, random() * 1000, random() * 1000]

from generate_series(1,10000) g;

GIN example: Arrays

select * from int_arrays where intarr @> array[29, 95];

intarr

---------------

{4399,95,29}

{34355,29,95}

{59742,29,95}

{94927,95,29}

(4 rows)

GIN example: Array operators

At index creation / insertion:

1. Extract elements from array

2. Index the elements

GIN example: Array operators

At search:

1. Extract elements from query

2. Search the index for the elements

3. Return rows that contain all of them

@> - “contains”, must contain all elements

&& - “overlap”, must contain at least one element
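
The insert and search steps above can be sketched in a few lines of Python. This is only a toy model, assuming an in-memory dictionary in place of the on-disk index; the names `build_index`, `contains` and `overlaps` are made up for illustration and are not PostgreSQL functions.

```python
from collections import defaultdict


def build_index(rows):
    """Map each array element (the key) to the sorted row ids containing it."""
    index = defaultdict(set)
    for row_id, arr in enumerate(rows):
        for element in arr:              # step 1: extract elements from the array
            index[element].add(row_id)   # step 2: index the elements
    return {key: sorted(ids) for key, ids in index.items()}


def contains(index, query):
    """Toy version of @>: rows that contain every element of the query."""
    postings = [set(index.get(key, ())) for key in query]
    if not postings:
        # Simplification: PostgreSQL treats an empty query array as matching every row.
        return []
    return sorted(set.intersection(*postings))


def overlaps(index, query):
    """Toy version of &&: rows that contain at least one element of the query."""
    result = set()
    for key in query:
        result.update(index.get(key, ()))
    return sorted(result)


rows = [[4399, 95, 29], [7, 29, 500], [12, 95, 3], [59742, 29, 95]]
index = build_index(rows)
print(contains(index, [29, 95]))   # row ids whose array holds both 29 and 95
print(overlaps(index, [29, 95]))   # row ids whose array holds 29 or 95
```

Both operators look up the same keys; they differ only in how the per-key row lists are combined.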

Operator classes

PostgreSQL is extensible.

The operations to extract elements, search, and combine results are defined by an operator class

Built-in operator classes for arrays, full-text search, etc.

Three fundamental GIN operations

1. Extract keys from a value to insert or query

System calls the opclass’ extractQuery / extractValue function

2. Index them

System stores the extracted keys in a B-tree, using the opclass’ compare function.

3. Combine matches of several keys efficiently

System calls the opclass’ consistent function to determine if the item with a combination of keys matches the overall query.

GIN examples: Full-text search (1/2)

At insert:

1. Extract words from text:
   ‘PostgreSQL - The world’s most advanced open source database’
   -> “postgresql”, “world”, “advanc”, “open”, “sourc”, “databas”

2. Index the words in the B-tree within the GIN index.

GIN examples: Full-text search (2/2)

At search:

1. Extract words from query

2. Fetch all items containing any of the words

3. Determine which items match the overall query

Full-text search has a mini parser and syntax of its own:

select plainto_tsquery('an advanced open source database');

plainto_tsquery

-----------------------------------------

'advanc' & 'open' & 'sourc' & 'databas'

(1 row)

GIN examples: Trigrams (1/2)

At insert:

1. Extract trigrams from text:

foobar -> ‘f’, ‘fo’, ‘foo’, ‘oob’, ‘oba’, ‘bar’, ‘ar’

2. Index them
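
A minimal sketch of the extraction step, assuming the padding rule given in the pg_trgm documentation (each word is treated as having two spaces prefixed and one space suffixed), which is where the short trigrams at the edges of the word come from. The function name `trigrams` is illustrative, not part of pg_trgm.

```python
def trigrams(word):
    """Return the trigrams of a single word, in order of appearance."""
    # Assumed padding: two leading spaces and one trailing space.
    padded = "  " + word.lower() + " "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


print(trigrams("foobar"))
```

Each returned string is one key that would be stored in the index for the row.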

GIN examples: Trigrams (2/2)

At search:

1. Extract trigrams from query

2. Fetch all items containing any of the trigrams.

3. Determine which items match the overall query

must have at least N common trigrams.

- Can speed up LIKE searches!
- Also regular expressions!
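
The search steps above could look roughly like this in Python. It is a sketch under assumptions: the threshold `min_common` stands in for the slide's N, the index is a plain dictionary, and real pg_trgm decides matches with a similarity threshold and rechecks the rows rather than using a bare count.

```python
def trigrams(text):
    """Set of trigrams of one word, padded with two leading and one trailing space."""
    padded = "  " + text.lower() + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def build_index(words):
    """Map each trigram to the set of row ids whose word contains it."""
    index = {}
    for row_id, word in enumerate(words):
        for tri in trigrams(word):
            index.setdefault(tri, set()).add(row_id)
    return index


def search(index, query, min_common):
    """Row ids sharing at least min_common trigrams with the query."""
    counts = {}
    for tri in trigrams(query):               # 1. extract trigrams from the query
        for row_id in index.get(tri, ()):     # 2. fetch rows containing any of them
            counts[row_id] = counts.get(row_id, 0) + 1
    # 3. keep rows with enough trigrams in common
    return sorted(row_id for row_id, n in counts.items() if n >= min_common)


words = ["foobar", "foobaz", "barfoo", "quux"]
index = build_index(words)
print(search(index, "foobar", 4))   # close matches only
print(search(index, "foobar", 2))   # looser threshold
```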

Three fundamental GIN operations

1. Extract keys from a value to insert or query

2. Index them

System stores the extracted keys in a B-tree, using the opclass’ compare function.

3. Determine which rows match, based on the keys present

Refresher: Regular B-tree

advanc: (0,8)
advanc: (0,14)
advanc: (0,22)
advanc: (0,17)
advanc: (0,26)
...
databas: (0,3)
databas: (2,10)
open: (0,11)
postgresql: (0,8)
postgresql: (0,41)
...

GIN on-disk format

Posting list

- A posting list contains pointers to the physical tuples in the table

- Each pointer consists of the page number and the offset within the page

(0,8) (0,14) (0,17) (0,22) (0,26) (0,33) (0,34) (0,35) (0,45) (0,47) (0,48) (1,3) (1,4) (1,6) (1,8)

Can be stored in-line in the entry tree, or as a whole separate B-tree (posting tree)

Posting tree page format

9.3 format

(0,8) (0,14) (0,17) (0,22) (0,26) (0,33) (0,34) (0,35) (0,45) (0,47) (0,48) (1,3) (1,4) (1,6) (1,8)

Each pointer takes 6 bytes (4 bytes for block number and 2 for offset): 90 bytes in total.

Posting tree page format

9.4 format

(0,8) +6 +3 +5 +4 +7 +1 +1 +10 +2 +1 +2003 +1 +2 +2

Stores the pointers in compressed format, as a difference from the previous item: 21 bytes in total!
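
The byte counts on these two slides can be reproduced with a short sketch. It assumes that each pointer is packed into one integer as block number shifted left by 11 bits plus the offset, and that each difference is written with a 7-bits-per-byte variable-length encoding; both are my reading of the 9.4 format, not something the slides spell out. Under that packing the step from (0,48) to (1,3) comes out as 2003. The names `to_int`, `varbyte` and `compress` are illustrative.

```python
OFFSET_BITS = 11  # assumption: bits reserved for the offset within a page


def to_int(block, offset):
    """Pack (block, offset) into one integer so pointers can be subtracted."""
    return (block << OFFSET_BITS) | offset


def varbyte(n):
    """Encode n with 7 payload bits per byte; the high bit means 'more bytes follow'."""
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)


def compress(items):
    """Store the first pointer in 6 bytes, the rest as varbyte-encoded deltas."""
    first_block, first_offset = items[0]
    # Byte order is arbitrary here; only the sizes matter for this sketch.
    out = first_block.to_bytes(4, "little") + first_offset.to_bytes(2, "little")
    prev = to_int(first_block, first_offset)
    deltas = []
    for block, offset in items[1:]:
        cur = to_int(block, offset)
        deltas.append(cur - prev)
        out += varbyte(cur - prev)
        prev = cur
    return out, deltas


items = [(0, 8), (0, 14), (0, 17), (0, 22), (0, 26), (0, 33), (0, 34), (0, 35),
         (0, 45), (0, 47), (0, 48), (1, 3), (1, 4), (1, 6), (1, 8)]
encoded, deltas = compress(items)
print(len(items) * 6, "bytes uncompressed")
print(len(encoded), "bytes compressed")
print(deltas)
```

Small deltas fit in a single byte, which is why densely packed posting lists shrink the most.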

9.4 Posting tree format - btree_gin example

(btree_gin extension is a “dummy” opclass implementation to emulate a normal B-tree)

create extension btree_gin;

create table numbers (n int4);

insert into numbers

select g % 10 from generate_series(1, 10000000) g;

create index numbers_btree on numbers (n);

create index numbers_gin on numbers using gin (n);

9.4 Posting tree format - btree_gin example

9.4

postgres=# \di+
                List of relations
 Schema |     Name      | ... |  Size  | ...
--------+---------------+-----+--------+-----
 public | numbers_btree |     | 214 MB |
 public | numbers_gin   |     | 11 MB  |
(2 rows)

9.3

 Schema |     Name      | ... |  Size  | ...
--------+---------------+-----+--------+-----
 public | numbers_btree |     | 214 MB |
 public | numbers_gin   |     | 58 MB  |
(2 rows)

Wow!

Table 346 MB

B-tree index 214 MB

GIN (9.3) 58 MB

GIN (9.4) 11 MB

New posting list format in 9.4

- Much more compact
- The new code can still read old-format pages
  - pg_upgrade works
  - but you won’t get the benefit until you REINDEX.
- More expensive to do random updates
  - GIN isn’t very fast with random updates anyway...

Recap: Three fundamental GIN operations

1. Extract keys from a value to insert or query

2. Index them

3. Combine matches of several keys efficiently, and determine which items match the overall query

Consistent function

select plainto_tsquery(

'an advanced PostgreSQL open source database');

plainto_tsquery

--------------------------------------------------------

'postgresql' & 'advanc' & 'open' & 'sourc' & 'databas'

(1 row)

select * from foo where col @@ plainto_tsquery(

'an advanced PostgreSQL open source database'

);

3. Combine matches efficiently (0/4)

The query returns the following matches from the index:

advanc   databas  open     postgresql  sourc
(0,8)    (0,3)    (0,2)    (0,8)       (0,1)
(0,14)   (0,8)    (0,8)    (0,41)      (0,2)
(0,17)   (0,43)   (0,30)               (0,8)
(0,22)   (0,47)   (0,33)               (0,12)
(0,26)   (1,32)   (0,36)               (0,13)
(0,33)            (0,44)               (0,18)
(0,34)            (0,46)               (0,19)
(0,35)            (0,56)               (0,20)
(0,45)            (1,4)                (0,26)
(0,47)            (1,22)               (0,34)
(0,48)            (1,24)               (0,35)
(1,3)             (1,32)               (0,50)
(1,4)             (1,39)               (1,1)
(1,6)                                  (1,5)
(1,8)                                  (1,6)
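
One way these five lists might be combined is sketched below, assuming a plain AND query: walk all lists together in item-pointer order, collect which keys are present for each item, and ask a consistent-style function whether that combination satisfies the query. The names `consistent` and `scan_all` are illustrative and do not mirror PostgreSQL's C interface.

```python
import heapq

posting_lists = {
    "advanc": [(0, 8), (0, 14), (0, 17), (0, 22), (0, 26), (0, 33), (0, 34), (0, 35),
               (0, 45), (0, 47), (0, 48), (1, 3), (1, 4), (1, 6), (1, 8)],
    "databas": [(0, 3), (0, 8), (0, 43), (0, 47), (1, 32)],
    "open": [(0, 2), (0, 8), (0, 30), (0, 33), (0, 36), (0, 44), (0, 46), (0, 56),
             (1, 4), (1, 22), (1, 24), (1, 32), (1, 39)],
    "postgresql": [(0, 8), (0, 41)],
    "sourc": [(0, 1), (0, 2), (0, 8), (0, 12), (0, 13), (0, 18), (0, 19), (0, 20),
              (0, 26), (0, 34), (0, 35), (0, 50), (1, 1), (1, 5), (1, 6)],
}


def consistent(present_keys, query_keys):
    """AND semantics: the item matches only if every query key is present."""
    return all(key in present_keys for key in query_keys)


def scan_all(lists, query_keys):
    """Visit every item of every list in order and test each one."""
    tagged = [[(item, key) for item in lists[key]] for key in query_keys]
    matches = []
    current, present = None, set()
    for item, key in heapq.merge(*tagged):   # merged stream, sorted by item pointer
        if item != current:
            if current is not None and consistent(present, query_keys):
                matches.append(current)
            current, present = item, set()
        present.add(key)
    if current is not None and consistent(present, query_keys):
        matches.append(current)
    return matches


query = ["postgresql", "advanc", "open", "sourc", "databas"]
print(scan_all(posting_lists, query))
```

This version touches all 50 pointers even though only the items in the two-entry “postgresql” list can possibly match.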

3. Combine matches efficiently (1/4)

(0,1) contains only word “sourc” -> no match

3. Combine matches efficiently (2/4)

(0,2) contains words “open” and “sourc” -> no match

3. Combine matches efficiently (3/4)

(0,3) contains word “databas” -> no match

3. Combine matches efficiently (4/4)

(0,8) contains all the words -> match

Fast Scan

Instead of scanning through the posting lists of all the keywords, only scan through the list with fewest items, and skip the other lists to the next possible match.

- Big improvement for “frequent-term AND rare-term” style queries

Fast scan example

(0,8) contains all the words -> match

postgresql  databas  open     advanc   sourc
(0,8)       (0,3)    (0,2)    (0,8)    (0,1)
(0,41)      (0,8)    (0,8)    (0,14)   (0,2)
            (0,43)   (0,30)   (0,17)   (0,8)
            (0,47)   (0,33)   (0,22)   (0,12)
            (1,32)   (0,36)   (0,26)   (0,13)
                     (0,44)   (0,33)   (0,18)
                     (0,46)   (0,34)   (0,19)
                     (0,56)   (0,35)   (0,20)
                     (1,4)    (0,45)   (0,26)
                     (1,22)   (0,47)   (0,34)
                     (1,24)   (0,48)   (0,35)
                     (1,32)   (1,3)    (0,50)
                     (1,39)   (1,4)    (1,1)
                              (1,6)    (1,5)
                              (1,8)    (1,6)
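
A rough sketch of the idea for the lists above, again assuming a plain AND query. The shortest list drives the scan, and each of the other lists is skipped forward to the first pointer at or after the current candidate. Binary search stands in for whatever skipping the real index does, and the name `fast_scan` is made up for illustration.

```python
from bisect import bisect_left

posting_lists = {
    "postgresql": [(0, 8), (0, 41)],
    "databas": [(0, 3), (0, 8), (0, 43), (0, 47), (1, 32)],
    "open": [(0, 2), (0, 8), (0, 30), (0, 33), (0, 36), (0, 44), (0, 46), (0, 56),
             (1, 4), (1, 22), (1, 24), (1, 32), (1, 39)],
    "advanc": [(0, 8), (0, 14), (0, 17), (0, 22), (0, 26), (0, 33), (0, 34), (0, 35),
               (0, 45), (0, 47), (0, 48), (1, 3), (1, 4), (1, 6), (1, 8)],
    "sourc": [(0, 1), (0, 2), (0, 8), (0, 12), (0, 13), (0, 18), (0, 19), (0, 20),
              (0, 26), (0, 34), (0, 35), (0, 50), (1, 1), (1, 5), (1, 6)],
}


def fast_scan(lists, query_keys):
    """AND query: iterate the shortest list, skip the other lists forward."""
    ordered = sorted((lists[key] for key in query_keys), key=len)
    shortest, others = ordered[0], ordered[1:]
    positions = [0] * len(others)    # how far each longer list has been consumed
    matches = []
    for candidate in shortest:
        found = True
        for i, lst in enumerate(others):
            # Jump to the first pointer >= candidate, never moving backwards.
            positions[i] = bisect_left(lst, candidate, positions[i])
            if positions[i] == len(lst) or lst[positions[i]] != candidate:
                found = False        # this list cannot contain the candidate
                break
        if found:
            matches.append(candidate)
    return matches


query = ["postgresql", "databas", "open", "advanc", "sourc"]
print(fast_scan(posting_lists, query))
```

Only the two candidates from the “postgresql” list are ever tested, so the cost is governed by the rarest term rather than by the most frequent one.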

Summary: Improvements in 9.4

More compact posting list format

- 2x-10x smaller indexes, yay!

Fast scan

- Big speedup for queries with some frequent and some rare items

Thanks to Alexander Korotkov for these improvements!

Final GIN tip

GIN indexes are efficient at storing duplicates

- Use a GIN index with the btree_gin extension for status fields etc.

postgres=# \di+
                List of relations
 Schema |     Name      | ... |  Size  | ...
--------+---------------+-----+--------+-----
 public | numbers_btree |     | 214 MB |
 public | numbers_gin   |     | 11 MB  |
(2 rows)

Questions?