Full text search

Rahila Syed Beena Emerson

Full text Search

• Full text search and its types

• Full text search in PostgreSQL

• PostgreSQL extension

• Similarity Search

Full Text Search

• Searching for a group of keywords in a pile of texts

– Document

– Query

– Similarity

• Full text search in database

– Searching for a set of keywords in a text field of a database table

– The data used for full text search can be huge

– Indexing words and associating indexed words with documents

What is full text search?

Full Text Search in PostgreSQL

• Creating Tokens

– Parsing a document into a set of tokens such as numbers, words, complex words, and email addresses.

• Creating Lexemes

– Normalization: Dictionary controls this.

• Removal of suffixes – converts variants into a single form (worry, worries, worried, etc.)

• Conversion to lower case

• Remove stop words – common words useless for searching (the, at etc.)

• Storing preprocessed documents

– Storing documents and creating indexes over them for faster search

• Relevance ranking
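The preprocessing steps above (tokens, lexemes, stop words, positions) can be sketched in a few lines of Python. This is only an illustration: the stop-word list and suffix rules below are made up for the example, while PostgreSQL relies on real dictionaries (such as snowball) for normalization.

```python
import re

# Hypothetical, tiny stop-word list and suffix rules, for illustration only.
STOP_WORDS = {"the", "at", "a", "is", "to", "of", "this", "be"}
SUFFIXES = ("ies", "ied", "es", "ed", "s", "y")


def normalize(token):
    """Lower-case a token and strip one known suffix (a crude stemmer)."""
    word = token.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(document):
    """Return {lexeme: [positions]} for the non-stop-words of a document."""
    lexemes = {}
    tokens = re.findall(r"[A-Za-z0-9]+", document)
    for position, token in enumerate(tokens, start=1):
        if token.lower() in STOP_WORDS:
            continue  # stop words still consume a position but are not stored
        lexemes.setdefault(normalize(token), []).append(position)
    return lexemes


print(preprocess("Glad to be part of this meetup"))
print(preprocess("worry worries worried"))
```

All three variants of "worry" collapse into a single lexeme here, which is the point of the normalization step. The stem this toy rule produces is not the one a real dictionary would produce.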

Full text search in PostgreSQL

• Full integration

• 27 built-in configurations for 10 languages

• Support of user-defined FTS configurations

• Pluggable dictionaries (ispell, snowball, thesaurus) and parsers

• Relevance ranking

• GIN and GiST index

Full text search in PostgreSQL

Morphological Search

• Indexed tokens are words of a language

• Eg. Tree, book, rain

• Small index size

• Handles orthographic variants well

• Search results depend on how words are divided

• Used for large documents such as theses

• Ex. tsvector

N-gram search

• Indexed tokens are characters.

• Eg. tree: _t, tr, re, ee, e_ (2-grams)

• Big index size

• Cannot match orthographical variants

• Results closer to indexed LIKE

• Better suited for a limited set of words

• Ex. pg_bigm, pg_trgm
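A minimal sketch of how character n-grams can be produced from a word. The underscore padding follows the notation on this slide; the function name and defaults are assumptions for illustration and are not the API of any extension.

```python
def ngrams(word, n=2, pad="_"):
    """Return the n-grams of a word padded with one marker on each side."""
    padded = pad + word.lower() + pad
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


print(ngrams("tree"))       # 2-grams
print(ngrams("tree", n=3))  # 3-grams
```

Every window of characters becomes an index key, which is why an n-gram index grows much larger than an index over whole words.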

• Search similar words(No linguistic support)

• Ranking of search results

• Searches substrings

– Ordinary indexes do not support substring search

– The LIKE operator cannot use an ordinary index when the pattern starts with %.

– Low performance when compared to full text search using GIN and GiST

• Accuracy issue

Eg. LIKE '%one%' matches prone, money, lonely
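The accuracy issue can be reproduced outside the database. The sketch below compares a plain substring test (what LIKE with % on both sides does) with a whole-word match, which is closer in spirit to matching lexemes. The sample strings and function names are made up for illustration.

```python
import re

samples = ["one", "prone", "money", "lonely", "no one here"]


def like_contains(text, key):
    """Same idea as text LIKE '%key%': a plain substring test."""
    return key in text


def word_match(text, key):
    """Match key only when it appears as a whole word."""
    return re.search(r"\b" + re.escape(key) + r"\b", text) is not None


print([s for s in samples if like_contains(s, "one")])
print([s for s in samples if word_match(s, "one")])
```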

Why full text search?

• POSIX Expression

=# EXPLAIN ANALYZE SELECT * FROM fulltext_search WHERE doc ~ 'postgresql';
QUERY PLAN
--------------------------------------------------------------------------
Seq Scan on fulltext_search (cost=10000000000.00..10000000473.77 rows=40 width=152) (actual time=10.871..390.019 rows=250 loops=1)
  Filter: (doc ~ 'postgresql'::text)
  Rows Removed by Filter: 11397
Total runtime: 390.060 ms

Measurement results

• LIKE Query

=# EXPLAIN ANALYZE SELECT * FROM fulltext_search WHERE doc LIKE '%postgresql%';
QUERY PLAN
------------------------------------------------------------------------
Seq Scan on fulltext_search (cost=10000000000.00..10000000473.77 rows=40 width=152) (actual time=1.342..110.107 rows=250 loops=1)
  Filter: (doc ~~ '%postgresql%'::text)
  Rows Removed by Filter: 11397

Measurement results

• Full Text Search

Nested Loop (cost=352.83..508.22 rows=107 width=64) (actual time=1.397..1.575 rows=250 loops=1)
  -> Function Scan on to_tsquery query (cost=0.00..0.01 rows=1 width=32) (actual time=0.023..0.023 rows=1 loops=1)
  -> Bitmap Heap Scan on full_text_search (cost=352.83..507.14 rows=107 width=32) (actual time=1.371..1.516 rows=250 loops=1)
       Recheck Cond: (query.query @@ to_tsvector('english'::regconfig,
       -> Bitmap Index Scan on full_search_idx (cost=0.00..352.80 rows=107 width=0) (actual time=1.354..1.354 rows=348 loops=1)
            Index Cond: (query.query @@ to_tsvector('english'::regconfig, doc))

Normal Search:

SELECT * FROM tbl WHERE col1 LIKE 'The tiger is the largest cat species';
--------------------------------------
The tiger is the largest cat species
(1 row)

Ranking Example

Full Text Search:

SELECT col1, similarity(col1, 'The tiger is the largest cat species') AS sml
FROM tbl_t WHERE col1 % 'The tiger is the largest cat species'
ORDER BY sml DESC, col1;
 col1                                    | sml
-----------------------------------------+----------
 The tiger is the largest cat species    | 1
 The peacock is the largest bird species | 0.511111
 The cheetah is the fastest cat species  | 0.466667
(3 rows)

• GIN (Generalized Inverted Index)

• Custom strategies for particular data types

• Inverted indexes

• Interface for custom data types

• Slower to update

• Deterministic

• Appropriate for fixed data sets.

Indexes Used in Full Text Search

 Key    | TID
--------+----------
 Meetup | 100, 140
 Pune   | 100, 150
 Here   | 100
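The key-to-TID mapping above is the core idea of an inverted index. A minimal Python sketch follows. The row texts are hypothetical and chosen only so that the resulting index matches the keys and TIDs shown on the slide; a real GIN index stores lexemes and is far more elaborate.

```python
from collections import defaultdict

# Hypothetical rows keyed by TID (assumption for the example).
rows = {100: "Here Pune Meetup", 140: "Meetup", 150: "Pune"}


def build_inverted_index(rows):
    """Map each lower-cased key to the set of TIDs whose text contains it."""
    index = defaultdict(set)
    for tid, text in rows.items():
        for key in text.lower().split():
            index[key].add(tid)
    return index


def search_all(index, keys):
    """Return the sorted TIDs of rows that contain every key (AND search)."""
    postings = [index.get(key.lower(), set()) for key in keys]
    return sorted(set.intersection(*postings)) if postings else []


index = build_inverted_index(rows)
print(sorted(index["meetup"]))
print(search_all(index, ["Meetup", "Pune"]))
```

A lookup only touches the posting lists of the searched keys, which keeps searches fast. Each inserted row has to update one list per key, which fits the "slower to update" bullet.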

• GiST (Generalized Search Tree)

• Interface for data types and access methods

• Document is represented in the index by a fixed-length signature

• Based on hash tables

• False matches are possible

• Table row must be retrieved to see if the match is correct

• Inappropriate for large data sets

• Filtering data at the end of index search to remove false matches
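The signature idea can be sketched as follows: every word is hashed to one bit of a fixed-length bit string, and a document's signature is the OR of its word bits. A query can only match documents whose signature contains all of the query's bits, but two different words may share a bit, so each candidate row still has to be checked. The signature width, the hash function (CRC32) and the sample rows are assumptions made for this sketch and do not reflect PostgreSQL's actual implementation.

```python
import zlib

SIGNATURE_BITS = 16  # deliberately small, so collisions are likely

# Hypothetical rows keyed by TID.
rows = {1: "the mountain is high", 2: "a river runs", 3: "mountain river"}


def word_bit(word):
    """Hash a word to a single bit of the signature."""
    return 1 << (zlib.crc32(word.encode("utf-8")) % SIGNATURE_BITS)


def signature(words):
    """OR together the bits of all words."""
    sig = 0
    for word in words:
        sig |= word_bit(word)
    return sig


def search(rows, query_words):
    """Return (candidates from signatures, rows confirmed by the recheck)."""
    wanted = [w.lower() for w in query_words]
    query_sig = signature(wanted)
    candidates = [
        tid for tid, text in rows.items()
        if signature(text.lower().split()) & query_sig == query_sig
    ]
    # The recheck reads the actual row to drop false matches.
    confirmed = [
        tid for tid in candidates
        if set(wanted) <= set(rows[tid].lower().split())
    ]
    return candidates, confirmed


candidates, confirmed = search(rows, ["mountain"])
print(candidates, confirmed)
```

The candidate list may contain extra rows, but it never misses a true match. This is why the final filter is enough to produce correct results.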

EXPLAIN SELECT * FROM tab WHERE textsearch @@ to_tsquery('Mountain');
QUERY PLAN
--------------------------------------------------------------------------
Index Scan using text_search_idx on tab (cost=0.00..12.29 rows=2 width=1469)
  Index Cond: (textsearch @@ '''Mountain'''::tsquery)
  Filter: (textsearch @@ '''Mountain'''::tsquery)

Indexes Used in Full Text Search

• Representation of document best suited for full text search

• Normalized lexemes formed by pre-processing of the documents

• Functions to convert normal text to tsvector:

• to_tsvector

to_tsvector([ config regconfig, ] document text) returns tsvector

=# SELECT to_tsvector('english', 'Glad to be part of this meetup');
 to_tsvector
------------------------------
 'glad':1 'meetup':7 'part':4
(1 row)

• The query above specifies 'english' as the configuration to be used to parse and normalize the strings. If the configuration parameter is omitted, the default_text_search_config value is used.

tsvector

• Representation of search query best suited for full text search

• Normalized lexemes formed by processing the query

• Lexemes may be combined using the AND, OR, and NOT operators.

• All keywords used for search

tsquery

• Functions to convert normal text to tsquery:

• to_tsquery

to_tsquery([ config regconfig, ] querytext text) returns tsquery

=# SELECT to_tsquery('meetups & in & ! Pune');
 to_tsquery
--------------------
 'meetup' & !'pune'
(1 row)

• plainto_tsquery

plainto_tsquery([ config regconfig, ] querytext text) returns tsquery

=# SELECT plainto_tsquery('english', 'meetups in Pune');
 plainto_tsquery
-------------------
 'meetup' & 'pune'
(1 row)

tsquery

• Checks a tsvector (document) against a tsquery (search terms)

• Returns true if the tsvector satisfies the tsquery; for an AND query, every tsquery element must be present in the tsvector of the document

=# SELECT to_tsvector('Welcome to this postgresql meetup') @@ plainto_tsquery('PostgreSQL Meetups');
 ?column?
----------
 t
(1 row)

=# SELECT to_tsvector('Welcome to this postgresql meetup') @@ plainto_tsquery('Pune meetup');
 ?column?
----------
 f
(1 row)

Match operator @@

SELECT * FROM <table> WHERE to_tsvector('<config>', <colname>) @@ to_tsquery('<config>', '<search word>');

The configuration parameter of the functions to_tsvector and to_tsquery should be the same.

Example:

=# SELECT * FROM tbl WHERE to_tsvector('english', col) @@ to_tsquery('english', 'enjoy');
--------------------------------
He enjoyed the party
He enjoys the classical music.
(2 rows)

Full text search without index

• Creating the index

CREATE INDEX <index_name> ON <table> USING gin(to_tsvector('<config>', <col>));

• Performing search using the index:

SELECT * FROM <table> WHERE to_tsvector('<config>', <col>) @@ plainto_tsquery('<config>', '<search word>');

Example:

=# CREATE INDEX idx ON tbl USING gin(to_tsvector('english', col));
=# SELECT * FROM tbl WHERE to_tsvector('english', col) @@ plainto_tsquery('english', 'enjoy');
--------------------------------
(2 rows)

Full text search using index

• Procedure

– Create a column of tsvector type

– Define a trigger which will automatically update the tsvector column

– Perform Search on the tsvector column

• Advantages:

– No need to specify the text search configuration in every query in order to make use of the index

– Faster searches as the to_tsvector function will not be called for each search query.

Full text search using separate column

Example:

=# CREATE TABLE tbl (col text, tsv_col tsvector);
=# CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE
   ON tbl FOR EACH ROW EXECUTE PROCEDURE
   tsvector_update_trigger(tsv_col, 'pg_catalog.english', col);
=# INSERT INTO tbl VALUES ('He enjoyed the party'), ('He enjoys the classical music.'), ('The moon winked at him');
=# SELECT * FROM tbl;
 col                            | tsv_col
--------------------------------+---------------------------------
 He enjoyed the party           | 'enjoy':2 'parti':4
 He enjoys the classical music. | 'classic':4 'enjoy':2 'music':5
 The moon winked at him         | 'moon':2 'wink':3
(3 rows)

Example:

=# SELECT col FROM tbl WHERE tsv_col @@ to_tsquery('enjoys');

--------------------------------

(2 rows)

Ranking

• ts_rank

– Lexical ranking

ts_rank([ weights float4[], ] vector tsvector, query tsquery [, normalization integer ]) returns float4

=# select ts_rank(to_tsvector('Free text seaRCh is a wonderful Thing'), to_tsquery('wonderful | thing'));
 ts_rank
-----------
 0.0607927

• ts_rank_cd

– Proximity ranking

=# select ts_rank_cd(to_tsvector('Free text seaRCh is a wonderful Thing'), to_tsquery('wonderful & thing'));
 ts_rank_cd
------------
 0.1

Ranking

• Structural ranking

– Query

select ts_rank(array[0.1,0.1,0.9,0.1],
  setweight(to_tsvector('All about search'), 'B') ||
  setweight(to_tsvector('Free text seaRCh is a wonderful Thing'), 'A'),
  to_tsquery('wonderful & search'));

– Result

 ts_rank
----------
 0.328337

PostgreSQL Extension

• Uses an index made from trigrams: 3 consecutive characters from a string.

• Finds string similarity by comparing the trigrams.

• Provides GiST and GIN index operator classes to create the index.

CREATE INDEX <idx> ON <tbl> USING gist (<col> gist_trgm_ops);
CREATE INDEX <idx> ON <tbl> USING gin (<col> gin_trgm_ops);

• Problem:

– No partial match algorithm

– Slow when search key is < 3 characters (GIN_SEARCH_MODE_ALL is used)

pg_trgm

• PostgreSQL module which provides full text search capability using 2-gram index.

• Based on pg_trgm

• First released in April 2013. Version 1.1 to be released soon.

• Developed by NTT Data

• Site: http://sourceforge.jp/projects/pgbigm/

pg_bigm

Difference

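 Feature                      | pg_trgm               | pg_bigm
------------------------------+-----------------------+------------------------
 Method of full text search   | 3-gram                | 2-gram
                              | " a", " ab", abc, bcd | " a", ab, bc, cd, "d "
 Available index              | GIN and GiST          | GIN only
 1-2 character keyword search | Slow                  | Fast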

• Download tar.gz file from the site

• Install pg_bigm

$ make USE_PGXS=1
# make USE_PGXS=1 install

• Register: set the postgresql.conf variables

– shared_preload_libraries = 'pg_bigm'

– custom_variable_classes = 'pg_bigm' (only in 9.1)

• Load into the required database

=# CREATE EXTENSION pg_bigm;

Install pg_bigm

Argument: Search String

Return Value: Array of all possible 2-gram character strings

Procedure:

• For each word perform the following:

• Add a space character before and after the text

• Moving from left to right extract strings in the unit of 2 characters.

=# SELECT show_bigm('ab');

show_bigm

----------------

{" a",ab,"b "}

(1 row)
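The procedure above can be mirrored in a few lines of Python. This is only a sketch of the idea (pad with a space on both sides, slide a 2-character window, keep each bigram once). The function name is hypothetical, and the real show_bigm may differ in details such as ordering or multibyte handling.

```python
def show_bigm_sketch(text):
    """Return the sorted, de-duplicated 2-grams of a space-padded string."""
    padded = " " + text + " "
    return sorted({padded[i:i + 2] for i in range(len(padded) - 1)})


print(show_bigm_sketch("ab"))
print(show_bigm_sketch("cat"))
```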

Function – show_bigm

Argument: Search string

Return Value: String in a pattern to be used in LIKE for full-text search

Procedure:

• Add % to the beginning and the end of the search string.

• Add a backslash (\) before every underscore (_), percent (%) and backslash (\) present in the search string.

=# SELECT likequery ('pg_bigm ppt');

likequery

----------------

%pg\_bigm ppt%

(1 row)
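The escaping rule is easy to get wrong, since the backslash itself must be escaped before the other two characters. A minimal Python sketch of the procedure described above, with a hypothetical function name:

```python
def likequery_sketch(key):
    """Escape LIKE metacharacters in key and wrap the result in % ... %."""
    escaped = (
        key.replace("\\", "\\\\")  # escape backslashes first
           .replace("_", "\\_")
           .replace("%", "\\%")
    )
    return "%" + escaped + "%"


print(likequery_sketch("pg_bigm ppt"))
```

Escaping the backslash first avoids doubling the backslashes that the later replacements insert.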

Function - likequery

• Only GIN support

• Create Index on the text column of a table

CREATE INDEX <index_name> ON <table> USING gin (<column> gin_bigm_ops);

Creation of Index

 TID | Data
-----+------
 1   | cat
 5   | mat

Generate bigrams:
cat - " c", at, ca, "t "
mat - " m", at, ma, "t "

 Key  | TID
------+------
 " c" | 1
 " m" | 5
 at   | 1, 5
 "t " | 1, 5

SELECT * FROM <tbl> WHERE <col> LIKE likequery('<word>');

=# EXPLAIN ANALYZE SELECT * FROM tbl WHERE col LIKE likequery('cat');
QUERY PLAN
-------------------------------------------------------------------
Bitmap Heap Scan on tbl (cost=12.00..16.01 rows=1 width=4) (actual time=0.038..0.039 rows=1 loops=1)
  Recheck Cond: (col ~~ '%cat%'::text)
  -> Bitmap Index Scan on idx (cost=0.00..12.00 rows=1 width=0)
       Index Cond: (col ~~ '%cat%'::text)
(5 rows)

Full text search Query

[Diagram: Search key -> Generate bigrams -> Index lookup -> Result Candidates -> Perform Recheck -> Final Result]

• Removes wrong results from result candidates of index scan.

=# EXPLAIN ANALYZE SELECT * FROM tbl WHERE col LIKE likequery('trial');
QUERY PLAN
-------------------------------------------------------------------
Bitmap Heap Scan on tbl (cost=24.00..28.01 rows=1 width=5)
  Recheck Cond: (col ~~ '%trial%'::text)
  Rows Removed by Index Recheck: 1
  -> Bitmap Index Scan on idx (cost=0.00..24.00 rows=1 width=0)
       Index Cond: (col ~~ '%trial%'::text)
(6 rows)

Why Recheck?

 Key  | TID
------+------
 " t" | 1, 2
 al   | 1, 2
 ia   | 1, 2
 "l " | 1, 2
 ri   | 1, 2
 tr   | 1, 2

 TID | Data
-----+---------
 1   | trial
 2   | trivial

Bigrams:
trial - " t", al, ia, "l ", ri, tr
trivial - " t", al, ia, iv, "l ", ri, tr, vi

Search 'trial':
Index scan -> candidates: 1 trial, 2 trivial
Recheck -> result: 1 trial
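The need for the recheck can be reproduced with a toy bigram index. In this sketch the index lookup only asks "which rows contain every bigram of the key?", and "trivial" contains every bigram of "trial" without containing the string itself. The function names are hypothetical, and looking up only the unpadded bigrams of the key is a simplifying assumption, made because a '%key%' pattern may match in the middle of a row.

```python
from collections import defaultdict

rows = {1: "trial", 2: "trivial"}


def bigrams(text, pad=True):
    """Return the set of 2-grams of text, optionally space-padded."""
    s = " " + text + " " if pad else text
    return {s[i:i + 2] for i in range(len(s) - 1)}


def build_index(rows):
    """Map every bigram to the set of TIDs whose data contains it."""
    index = defaultdict(set)
    for tid, data in rows.items():
        for gram in bigrams(data):
            index[gram].add(tid)
    return index


def like_search(rows, index, key, recheck=True):
    """Find rows matching '%key%' via the index, then optionally recheck."""
    grams = bigrams(key, pad=False)
    if grams:
        candidates = set.intersection(*(index.get(g, set()) for g in grams))
    else:
        candidates = set(rows)  # key too short to produce a bigram
    if recheck:
        candidates = {tid for tid in candidates if key in rows[tid]}
    return sorted(candidates)


index = build_index(rows)
print(like_search(rows, index, "trial", recheck=False))
print(like_search(rows, index, "trial", recheck=True))
```

Without the recheck both rows come back as candidates. With it, only the row that really contains the key remains.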

Parameter - enable_recheck

• Controls whether Recheck is performed; set it to off to get all the results retrieved by the index scan

• Values on/off

=# SET pg_bigm.enable_recheck = on;

=# SELECT * FROM tbl WHERE doc LIKE likequery('trial');

----------------------

He is awaiting trial

(1 row)

=# SET pg_bigm.enable_recheck = off;

=# SELECT * FROM tbl WHERE doc LIKE likequery('trial');

--------------------------

He is awaiting trial

It was a trivial mistake

(2 rows)

Disabling Recheck

=# CREATE TABLE tbl (col text);

=# CREATE INDEX tbl_idx ON tbl USING gin (col gin_bigm_ops);

=# INSERT INTO tbl VALUES

('He is awaiting trial'),

('Those orchids are very special to her '),

('pg_bigm performs full text search using 2 gram index'),

('pg_trgm performs full text search using 3 gram index');

=# SELECT * FROM tbl WHERE col LIKE likequery('full text search');

------------------------------------------------------

pg_bigm performs full text search using 2 gram index

pg_trgm performs full text search using 3 gram index

(2 rows)

pg_bigm Full Text Search Sample

Similarity Search

Argument: The two strings whose similarity is to be checked

Return Value: The similarity value of the two arguments (0 - 1)

• measures the similarity of two strings by counting the number of 2-grams they share.

=# SELECT bigm_similarity ('test','text');

bigm_similarity

-----------------

(1 row)
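A rough Python sketch of a bigram-based similarity measure. The formula used here (shared bigrams divided by the larger of the two bigram counts) is an assumption, chosen because it reproduces the values shown in the similarity search sample of this deck (0.6 and 0.333333); the function names are hypothetical and the extension's exact computation may differ.

```python
def bigram_set(text):
    """Return the set of 2-grams of a space-padded string."""
    padded = " " + text + " "
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def bigm_similarity_sketch(a, b):
    """Shared bigrams divided by the size of the larger bigram set."""
    grams_a, grams_b = bigram_set(a), bigram_set(b)
    return len(grams_a & grams_b) / max(len(grams_a), len(grams_b))


print(bigm_similarity_sketch("test", "text"))
print(round(bigm_similarity_sketch("treat", "test"), 6))
```

Rows would then be kept when this value is at least the configured threshold.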

Function – bigm_similarity

• specifies threshold used for the similarity search

• Search returns rows with similarity value >= similarity_limit

• Default: 0.3

• SET command can be used to modify the value.

=# SHOW pg_bigm.similarity_limit;
 pg_bigm.similarity_limit
--------------------------
 0.3
(1 row)

=# SET pg_bigm.similarity_limit = 0.5;

Parameter - similarity_limit

• Used to perform similarity search

• Uses full text search index.

• Returns rows whose similarity is higher than or equal to the value of pg_bigm.similarity_limit

SELECT * FROM <tbl> WHERE <col> =% '<key>';

Similarity Operator - =%

With the default similarity_limit (0.3):

=# SELECT *, bigm_similarity(col, 'test') FROM tbl WHERE col =% 'test';

col | bigm_similarity

-------+-----------------

test | 1

text | 0.6

treat | 0.333333

(3 rows)

With a higher similarity_limit (for example 0.5), the row below the threshold is dropped:

=# SELECT *, bigm_similarity(col, 'test') FROM tbl WHERE col =% 'test';

col | bigm_similarity

------+-----------------

test | 1

text | 0.6

(2 rows)

Similarity Search Sample

• PostgreSQL documents

• wiki.postgresql.org

• Understanding Full Text Search

• http://linuxgazette.net/164/sephton.html

• http://www.slideshare.net/billkarwin/full-text-search-in-postgresql

• Understanding pg_bigm

• pgbigm.sourceforge.jp

• www.slideshare.net/masahikosawada98/pg-bigm

References
