Post on 19-Dec-2014
© 2013 NTT DATA, Inc.
Rahila Syed, Beena Emerson
Full text Search
Index
• Full text search and its types
• Full text search in PostgreSQL
• PostgreSQL extension
• Similarity Search
Full Text Search
What is full text search?
• Searching for a group of keywords in a pile of texts
– Document
– Query
– Similarity
• Full text search in database
– Searching for a set of keywords in a text field of a database table
– The data used for full text search can be huge
– Indexing words and associating indexed words with documents
Full Text Search in PostgreSQL
Steps
• Creating Tokens
– Parsing the document into a set of tokens such as numbers, words, complex words, and email addresses.
• Creating Lexemes
– Normalization, controlled by dictionaries:
• Removal of suffixes: converts variants into a single form (worry, worries, worried, etc.)
• Conversion to lower case
• Removal of stop words: common words that are useless for searching (the, at, etc.)
• Storing preprocessed documents
– Storing documents and creating indexes over them for faster search
• Relevance ranking
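The tokenizing and normalization steps can be sketched in a few lines of Python. This is only an illustration: the stop-word list, the suffix rules, and the function names are assumptions made for this example, and PostgreSQL's parser and Snowball dictionaries are far more thorough.

```python
import re

# Toy stop-word list and suffix rules: assumptions for illustration only.
STOP_WORDS = {"the", "at", "a", "is", "to", "of", "this", "be"}
SUFFIXES = ("ies", "ied", "ing", "ed", "es", "s", "y")


def tokenize(document):
    """Split a document into word tokens (a real parser also finds numbers, emails, etc.)."""
    return re.findall(r"[A-Za-z]+", document)


def normalize(token):
    """Lower-case a token and strip the first matching suffix, if any."""
    word = token.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def to_lexemes(document):
    """Return {lexeme: [positions]} with stop words dropped; positions are 1-based."""
    lexemes = {}
    for position, token in enumerate(tokenize(document), start=1):
        if token.lower() in STOP_WORDS:
            continue
        lexemes.setdefault(normalize(token), []).append(position)
    return lexemes


print(to_lexemes("Glad to be part of this meetup"))
```

With these toy rules, worry, worries and worried all reduce to the same form, and stop words still consume a position even though they are not stored.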
Full text search in PostgreSQL
• Full integration
• 27 built-in configurations for 10 languages
• Support of user-defined FTS configurations
• Pluggable dictionaries (ispell, snowball, thesaurus) and parsers
• Relevance ranking
• GIN and GiST indexes
Full text search in PostgreSQL
Morphological Search
• Indexed tokens are words of a language
• E.g. tree, book, rain
• Small index size
• Handles orthographical variants well
• Search results depend on how the text is divided into words
• Used for large documents such as theses
• Ex. tsvector
N-gram search
• Indexed tokens are sequences of N characters
• E.g. _t, tr, re, e_ (2-grams)
• Big index size
• Cannot match orthographical variants
• Results are closer to an indexed LIKE
• Better suited to a limited set of words
• Ex. pg_bigm, pg_trgm
Why full text search?
• Searching for similar words (plain pattern matching has no linguistic support)
• Ranking of search results
• Searching substrings
– Ordinary indexes do not support substring search
– The LIKE operator does not use an index when the pattern starts with %
– LIKE performs poorly compared to full text search using GIN and GiST
• Accuracy issue
– E.g. LIKE '%one%' matches prone, money, lonely
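The accuracy issue can be reproduced outside the database. The sketch below compares substring semantics, which is what LIKE '%one%' asks for, with whole-word matching. The sample sentences and function names are made up for this illustration.

```python
import re

# Hypothetical sample rows for the illustration.
docs = ["one apple", "prone to errors", "money matters", "a lonely road"]


def like_contains(text, key):
    """Substring semantics, comparable to LIKE '%key%'."""
    return key in text


def word_match(text, key):
    """Whole-word semantics: the key must appear as a separate word."""
    return key in re.findall(r"\w+", text.lower())


print([d for d in docs if like_contains(d, "one")])  # all four rows contain "one"
print([d for d in docs if word_match(d, "one")])     # only the row with the word "one"
```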
Measurement results
• POSIX Expression
=# EXPLAIN ANALYZE SELECT * FROM fulltext_search WHERE doc ~ 'postgresql';
QUERY PLAN
--------------------------------------------------------------------------
Seq Scan on fulltext_search (cost=10000000000.00..10000000473.77 rows=40 width=152) (actual time=10.871..390.019 rows=250 loops=1)
  Filter: (doc ~ 'postgresql'::text)
  Rows Removed by Filter: 11397
Total runtime: 390.060 ms
• LIKE Query
=# EXPLAIN ANALYZE SELECT * FROM fulltext_search WHERE doc LIKE '%postgresql%';
QUERY PLAN
--------------------------------------------------------------------------
Seq Scan on fulltext_search (cost=10000000000.00..10000000473.77 rows=40 width=152) (actual time=1.342..110.107 rows=250 loops=1)
  Filter: (doc ~~ '%postgresql%'::text)
  Rows Removed by Filter: 11397
Total runtime: 110.134 ms
Measurement results
• Full Text Search
Nested Loop (cost=352.83..508.22 rows=107 width=64) (actual time=1.397..1.575 rows=250 loops=1)
  -> Function Scan on to_tsquery query (cost=0.00..0.01 rows=1 width=32) (actual time=0.023..0.023 rows=1 loops=1)
  -> Bitmap Heap Scan on full_text_search (cost=352.83..507.14 rows=107 width=32) (actual time=1.371..1.516 rows=250 loops=1)
       Recheck Cond: (query.query @@ to_tsvector('english'::regconfig, doc))
       -> Bitmap Index Scan on full_search_idx (cost=0.00..352.80 rows=107 width=0) (actual time=1.354..1.354 rows=348 loops=1)
            Index Cond: (query.query @@ to_tsvector('english'::regconfig, doc))
Total runtime: 1.619 ms
Ranking Example
Normal Search:
SELECT * FROM tbl WHERE col1 LIKE 'The tiger is the largest cat species';
col1
--------------------------------------
The tiger is the largest cat species
(1 row)
Full Text Search:
SELECT col1, similarity(col1, 'The tiger is the largest cat species') AS sml
FROM tbl_t WHERE col1 % 'The tiger is the largest cat species'
ORDER BY sml DESC, col1;
col1                                    | sml
----------------------------------------+----------
The tiger is the largest cat species    | 1
The peacock is the largest bird species | 0.511111
The cheetah is the fastest cat species  | 0.466667
(3 rows)
Indexes Used in Full Text Search
• GIN (Generalized Inverted Index)
• Custom strategies for particular data types
• Inverted indexes
• Interface for custom data types
• Slower to update
• Deterministic
• Appropriate for fixed data sets
KEY    | TID
-------+----------
Meetup | 100, 140
Pune   | 100, 150
Here   | 100
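The KEY/TID table above is an inverted index: each key points to the rows that contain it. A minimal Python sketch of the idea follows. The row contents are hypothetical, chosen only so that the resulting index reproduces the table, and a real GIN index is considerably more elaborate.

```python
from collections import defaultdict

# Hypothetical rows keyed by tuple ID (TID), chosen to reproduce the slide's table.
rows = {
    100: "Meetup Pune Here",
    140: "Meetup",
    150: "Pune",
}


def build_inverted_index(rows):
    """Map each key (word) to the sorted list of TIDs that contain it."""
    index = defaultdict(set)
    for tid, text in rows.items():
        for word in text.split():
            index[word].add(tid)
    return {key: sorted(tids) for key, tids in index.items()}


def lookup_all(index, words):
    """Return the TIDs that contain every word by intersecting the posting lists."""
    postings = [set(index.get(word, ())) for word in words]
    if not postings:
        return []
    return sorted(set.intersection(*postings))


index = build_inverted_index(rows)
print(index)
print(lookup_all(index, ["Meetup", "Pune"]))
```

Because the index stores exact keys, a lookup needs no false-match filtering, which is the sense in which the slide calls GIN deterministic.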
Indexes Used in Full Text Search
• GiST (Generalized Search Tree)
• Interface for data types and access methods
• Document is represented in the index by a fixed-length signature
• Based on hashing: the words of the document are hashed into the signature
• Probability of false match
• Table row must be retrieved to see if the match is correct
• Inappropriate for large data sets
• Filtering data at the end of index search to remove false matches
EXPLAIN SELECT * FROM tab WHERE text_search @@ to_tsquery('Mountain');
QUERY PLAN
--------------------------------------------------------------------------
Index Scan using text_search_idx on tab (cost=0.00..12.29 rows=2 width=1469)
  Index Cond: (text_search @@ '''Mountain'''::tsquery)
  Filter: (text_search @@ '''Mountain'''::tsquery)
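The signature idea and the need for a final filter can be illustrated with a small Python sketch. The signature width, the use of crc32 as the hash, and the function names are assumptions for this example and do not reflect how PostgreSQL builds its GiST signatures internally.

```python
import zlib

SIGNATURE_BITS = 16  # deliberately small so collisions are likely; an assumption


def word_bit(word):
    """Hash a word to a single bit position (crc32 is an arbitrary, stable choice)."""
    return 1 << (zlib.crc32(word.lower().encode("utf-8")) % SIGNATURE_BITS)


def signature(words):
    """OR together the bits of all words: a fixed-length, lossy summary."""
    sig = 0
    for word in words:
        sig |= word_bit(word)
    return sig


def search(documents, query_words):
    """Return (candidates, confirmed): signature matches, then rechecked rows."""
    query_sig = signature(query_words)
    # A document is a candidate if its signature has every bit of the query set.
    candidates = [
        doc for doc in documents
        if (signature(doc.split()) & query_sig) == query_sig
    ]
    # Different words can share a bit, so each candidate is checked against the real text.
    wanted = {w.lower() for w in query_words}
    confirmed = [
        doc for doc in candidates
        if wanted <= {w.lower() for w in doc.split()}
    ]
    return candidates, confirmed
```

A true match always passes the signature test, but a non-matching row may pass it too. That is why the plan above carries a Filter step in addition to the Index Cond.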
tsvector
• Representation of document best suited for full text search
• Normalized lexemes formed by pre-processing of the documents
• Functions to convert normal text to tsvector:
• to_tsvector
to_tsvector([ config regconfig, ] document text) returns tsvector
=# SELECT to_tsvector('english', 'Glad to be part of this meetup');
to_tsvector
------------------------------
'glad':1 'meetup':7 'part':4
(1 row)
• The query above specifies 'english' as the configuration to be used to parse and normalize the strings. The default_text_search_config value will be used if the configuration parameter is omitted.
tsquery
• Representation of a search query best suited for full text search
• Normalized lexemes formed by processing the query
• Lexemes may be combined using the AND (&), OR (|), and NOT (!) operators
• Contains all the keywords used for the search
tsquery
• Functions to convert normal text to tsquery:
• to_tsquery
to_tsquery([ config regconfig, ] querytext text) returns tsquery
=# SELECT to_tsquery('meetups & in & ! Pune');
to_tsquery
--------------------
'meetup' & !'pune'
(1 row)
• plainto_tsquery
plainto_tsquery([ config regconfig, ] querytext text) returns tsquery
=# SELECT plainto_tsquery('english', 'meetups in Pune');
plainto_tsquery
-------------------
'meetup' & 'pune'
(1 row)
Match operator @@
• Checks a tsvector (document) against a tsquery (search words)
• Returns true if the tsvector satisfies the tsquery; for an AND query, all tsquery elements must be present in the tsvector of the document
=# SELECT to_tsvector('Welcome to this postgresql meetup') @@ plainto_tsquery('PostgreSQL Meetups');
?column?
----------
t
(1 row)
=# SELECT to_tsvector('Welcome to this postgresql meetup') @@ plainto_tsquery('Pune meetup');
?column?
----------
f
(1 row)
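How a query built from AND, OR and NOT is checked against a document's lexemes can be sketched as follows. The nested-tuple query representation and the function name are invented for this illustration, and the lexeme set is written by hand rather than produced by a real dictionary.

```python
def evaluate(query, lexemes):
    """Evaluate a nested query tuple against a set of document lexemes."""
    if isinstance(query, str):
        # A bare string is a single lexeme: true if the document contains it.
        return query in lexemes
    operator, *operands = query
    if operator == "and":
        return all(evaluate(q, lexemes) for q in operands)
    if operator == "or":
        return any(evaluate(q, lexemes) for q in operands)
    if operator == "not":
        return not evaluate(operands[0], lexemes)
    raise ValueError(f"unknown operator: {operator}")


# Hand-written, approximate lexemes for 'Welcome to this postgresql meetup'.
document = {"welcom", "postgresql", "meetup"}

print(evaluate(("and", "postgresql", "meetup"), document))     # both lexemes present
print(evaluate(("and", "pune", "meetup"), document))           # 'pune' is missing
print(evaluate(("and", "meetup", ("not", "pune")), document))  # NOT of a missing lexeme
```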
Full text search without index
SELECT * FROM <table> WHERE to_tsvector('<config>', <colname>) @@ to_tsquery('<config>', '<search word>');
The configuration parameter passed to to_tsvector and to_tsquery should be the same.
Example:
=# SELECT * FROM tbl WHERE to_tsvector('english', col) @@ to_tsquery('english', 'enjoy');
col
--------------------------------
He enjoyed the party
He enjoys the classical music.
(2 rows)
Full text search using index
• Creating the index
CREATE INDEX <index_name> ON <table> USING gin(to_tsvector('<config>', <col>));
• Performing search using the index:
SELECT * FROM <table> WHERE to_tsvector('<config>', <col>) @@ plainto_tsquery('<config>', '<search word>');
Example:
=# CREATE INDEX idx ON tbl USING gin(to_tsvector('english', col));
=# SELECT * FROM tbl WHERE to_tsvector('english', col) @@ plainto_tsquery('english', 'enjoy');
col
--------------------------------
He enjoyed the party
He enjoys the classical music.
(2 rows)
Full text search using separate column
• Procedure
– Create a column of tsvector type
– Define a trigger which will automatically update the tsvector column
– Perform the search on the tsvector column
• Advantages:
– No need to specify the text search configuration in every query in order to make use of the index
– Faster searches, as the to_tsvector function is not called for each search query
Full text search using separate column
Example:
=# CREATE TABLE tbl (col text, tsv_col tsvector);
=# CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE
   ON tbl FOR EACH ROW EXECUTE PROCEDURE
   tsvector_update_trigger(tsv_col, 'pg_catalog.english', col);
=# INSERT INTO tbl VALUES ('He enjoyed the party'),
   ('He enjoys the classical music.'), ('The moon winked at him');
=# SELECT * FROM tbl;
col                            | tsv_col
--------------------------------+---------------------------------
He enjoyed the party           | 'enjoy':2 'parti':4
He enjoys the classical music. | 'classic':4 'enjoy':2 'music':5
The moon winked at him         | 'moon':2 'wink':3
(3 rows)
Full text search using separate column
Example:
=# CREATE TABLE tbl (col text, tsv_col tsvector);
=# CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE
   ON tbl FOR EACH ROW EXECUTE PROCEDURE
   tsvector_update_trigger(tsv_col, 'pg_catalog.english', col);
=# INSERT INTO tbl VALUES ('He enjoyed the party'),
   ('He enjoys the classical music.'), ('The moon winked at him');
=# SELECT col FROM tbl WHERE tsv_col @@ to_tsquery('enjoys');
col
--------------------------------
He enjoyed the party
He enjoys the classical music.
(2 rows)
Ranking
• ts_rank
– Lexical ranking
ts_rank([ weights float4[], ] vector tsvector, query tsquery [, normalization integer ]) returns float4
=# select ts_rank(to_tsvector('Free text seaRCh is a wonderful Thing'), to_tsquery('wonderful | thing'));
ts_rank
-----------
0.0607927
• ts_rank_cd
– Proximity ranking
=# select ts_rank_cd(to_tsvector('Free text seaRCh is a wonderful Thing'), to_tsquery('wonderful & thing'));
ts_rank_cd
------------
0.1
Ranking
• Structural ranking
– Query
select ts_rank(array[0.1,0.1,0.9,0.1],
  setweight(to_tsvector('All about search'), 'B') ||
  setweight(to_tsvector('Free text seaRCh is a wonderful Thing'), 'A'),
  to_tsquery('wonderful & search'));
– Result
ts_rank
----------
0.328337
PostgreSQL Extension
pg_trgm
• Uses an index made from trigrams: 3 consecutive characters from a string.
• Finds string similarity by comparing the trigrams.
• Provides GiST and GIN index operator classes to create an index.
CREATE INDEX <idx> ON <tbl> USING gist (<col> gist_trgm_ops);
CREATE INDEX <idx> ON <tbl> USING gin (<col> gin_trgm_ops);
• Problem:
− No partial match algorithm
− Slow when the search key is shorter than 3 characters (GIN_SEARCH_MODE_ALL is used)
pg_bigm
• PostgreSQL module which provides full text search capability using a 2-gram index.
• Based on pg_trgm
• First released in April 2013. Version 1.1 to be released soon.
• Developed by NTT DATA
• Site: http://sourceforge.jp/projects/pgbigm/
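Difference
Feature                      | pg_trgm                       | pg_bigm
-----------------------------+-------------------------------+--------------------------------
Method of full text search   | 3-gram: " a", " ab", abc, bcd | 2-gram: " a", ab, bc, cd, "d "
Available index              | GIN and GiST                  | GIN only
1-2 character keyword search | Slow                          | Fast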
Install pg_bigm
• Download the tar.gz file from the site
• Install pg_bigm
$ make USE_PGXS=1
$ su
# make USE_PGXS=1 install
• Register: set the postgresql.conf variables:
– shared_preload_libraries = 'pg_bigm'
– custom_variable_classes = 'pg_bigm' (only in 9.1)
• Load into the required database
=# CREATE EXTENSION pg_bigm;
Function – show_bigm
Argument: Search string
Return Value: Array of all possible 2-gram character strings
Procedure:
• For each word perform the following:
• Add a space character before and after the text
• Moving from left to right, extract strings in units of 2 characters.
=# SELECT show_bigm('ab');
show_bigm
----------------
{" a",ab,"b "}
(1 row)
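The procedure is short enough to mirror in Python. The sketch below pads the whole string with one space on each side and returns the distinct 2-grams in sorted order. It reuses the name show_bigm only for readability; it is an approximation written for this illustration, not pg_bigm's code.

```python
def show_bigm(text):
    """Return the sorted, de-duplicated 2-grams of text padded with one space per side."""
    padded = f" {text} "
    # Slide a two-character window over the padded string from left to right.
    bigrams = {padded[i:i + 2] for i in range(len(padded) - 1)}
    return sorted(bigrams)


print(show_bigm("ab"))   # [' a', 'ab', 'b ']
print(show_bigm("cat"))  # [' c', 'at', 'ca', 't ']
```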
Function - likequery
Argument: Search string
Return Value: String in a pattern to be used in LIKE for full-text search
Procedure:
• Add % to the beginning and the end of the search string.
• Add a backslash (\) before every underscore (_), percent (%) and backslash (\) present in the search string.
=# SELECT likequery('pg_bigm ppt');
likequery
----------------
%pg\_bigm ppt%
(1 row)
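The same two steps can be written as a small Python helper. This is a sketch of the described procedure, reusing the name likequery for readability. Backslashes are escaped first so that the backslashes added for _ and % are not escaped a second time.

```python
def likequery(text):
    """Escape LIKE metacharacters in text and wrap the result in % wildcards."""
    escaped = (
        text.replace("\\", "\\\\")  # escape existing backslashes first
        .replace("_", "\\_")        # underscore matches any single character in LIKE
        .replace("%", "\\%")        # percent matches any sequence in LIKE
    )
    return f"%{escaped}%"


print(likequery("pg_bigm ppt"))  # %pg\_bigm ppt%
```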
Creation of Index
• Only GIN support
• Create an index on the text column of a table
CREATE INDEX <index_name> ON <table> USING gin (<column> gin_bigm_ops);
Table
TID | Data
----+------
1   | cat
5   | mat
Generate bigrams
cat - " c", at, ca, "t "
mat - " m", at, ma, "t "
Index
Key  | TID
-----+------
" c" | 1
" m" | 5
at   | 1, 5
ca   | 1
ma   | 5
"t " | 1, 5
Full text search Query
SELECT * FROM <tbl> WHERE <col> LIKE likequery('<word>');
=# EXPLAIN ANALYZE SELECT * FROM tbl WHERE col LIKE likequery('cat');
QUERY PLAN
-------------------------------------------------------------------
Bitmap Heap Scan on tbl (cost=12.00..16.01 rows=1 width=4) (actual time=0.038..0.039 rows=1 loops=1)
  Recheck Cond: (col ~~ '%cat%'::text)
  -> Bitmap Index Scan on idx (cost=0.00..12.00 rows=1 width=0) (actual time=0.025..0.025 rows=1 loops=1)
       Index Cond: (col ~~ '%cat%'::text)
Total runtime: 0.093 ms
(5 rows)
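Full text search Query
Search key -> Generate bigrams -> Index lookup -> Result Candidates -> Perform Recheck -> Final Result
Index
Key  | TID
-----+------
" c" | 1
" m" | 5
at   | 1, 5
ca   | 1
ma   | 5
"t " | 1, 5
Result Candidates
TID | Data
----+------
1   | cat
Final Result
TID | Data
----+------
1   | cat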
Why Recheck?
• Removes wrong results from the result candidates of the index scan.
=# EXPLAIN ANALYZE SELECT * FROM tbl WHERE col LIKE likequery('trial');
QUERY PLAN
-------------------------------------------------------------------
Bitmap Heap Scan on tbl (cost=24.00..28.01 rows=1 width=5) (actual time=0.060..0.060 rows=1 loops=1)
  Recheck Cond: (col ~~ '%trial%'::text)
  Rows Removed by Index Recheck: 1
  -> Bitmap Index Scan on idx (cost=0.00..24.00 rows=1 width=0) (actual time=0.043..0.043 rows=2 loops=1)
       Index Cond: (col ~~ '%trial%'::text)
Total runtime: 0.117 ms
(6 rows)
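Why Recheck?
Table
TID | Data
----+---------
1   | trial
2   | trivial
Bigrams
trial - " t", al, ia, "l ", ri, tr
trivial - " t", al, ia, iv, "l ", ri, tr, vi
Index
Key  | TID
-----+------
" t" | 1, 2
al   | 1, 2
ia   | 1, 2
iv   | 2
"l " | 1, 2
ri   | 1, 2
tr   | 1, 2
vi   | 2
Search 'trial'
Index scan
TID | Data
----+---------
1   | trial
2   | trivial
Recheck
TID | Data
----+---------
1   | trial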
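The trial/trivial case can be replayed with a toy bigram index in Python. The sketch assumes that a substring search looks up the contiguous 2-grams of the keyword (tr, ri, ia, al) and keeps the rows that contain all of them. The function names and this decomposition are assumptions for illustration, not a description of pg_bigm internals.

```python
def bigrams(text):
    """2-grams of text padded with one space on each side, as stored in the index."""
    padded = f" {text} "
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def build_index(rows):
    """Map each bigram key to the set of TIDs whose data contains it."""
    index = {}
    for tid, data in rows.items():
        for key in bigrams(data):
            index.setdefault(key, set()).add(tid)
    return index


def search(rows, index, keyword, enable_recheck=True):
    """Index lookup on the keyword's bigrams, optionally followed by a recheck."""
    keys = [keyword[i:i + 2] for i in range(len(keyword) - 1)]
    if keys:
        candidates = set.intersection(*(index.get(k, set()) for k in keys))
    else:
        # A one-character keyword has no bigram, so every row is a candidate.
        candidates = set(rows)
    if enable_recheck:
        # Containing all the bigrams does not guarantee containing the keyword itself.
        candidates = {tid for tid in candidates if keyword in rows[tid]}
    return sorted(candidates)


rows = {1: "trial", 2: "trivial"}
index = build_index(rows)
print(search(rows, index, "trial", enable_recheck=False))  # [1, 2]
print(search(rows, index, "trial"))                        # [1]
```

Both rows contain every bigram of the keyword, so only the comparison against the actual data removes trivial.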
Disabling Recheck
Parameter - enable_recheck
• Set to off to disable the recheck and return all the results retrieved by the index scan
• Values: on/off
=# SET pg_bigm.enable_recheck = on;
=# SELECT * FROM tbl WHERE doc LIKE likequery('trial');
doc
----------------------
He is awaiting trial
(1 row)
=# SET pg_bigm.enable_recheck = off;
=# SELECT * FROM tbl WHERE doc LIKE likequery('trial');
doc
--------------------------
He is awaiting trial
It was a trivial mistake
(2 rows)
pg_bigm Full Text Search Sample
=# CREATE TABLE tbl (col text);
=# CREATE INDEX tbl_idx ON tbl USING gin (col gin_bigm_ops);
=# INSERT INTO tbl VALUES
('He is awaiting trial'),
('Those orchids are very special to her '),
('pg_bigm performs full text search using 2 gram index'),
('pg_trgm performs full text search using 3 gram index');
=# SELECT * FROM tbl WHERE col LIKE likequery('full text search');
col
------------------------------------------------------
pg_bigm performs full text search using 2 gram index
pg_trgm performs full text search using 3 gram index
(2 rows)
Similarity Search
Function – bigm_similarity
Argument: The 2 strings whose similarity is to be checked
Return value: The similarity value of the two arguments (0 - 1)
• Measures the similarity of two strings by counting the number of 2-grams they share.
=# SELECT bigm_similarity('test', 'text');
bigm_similarity
-----------------
0.6
(1 row)
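One formula that reproduces the values shown on these slides divides the number of shared 2-grams by the size of the larger 2-gram set. The Python sketch below uses that formula. It is inferred from the slide outputs (0.6 for test/text, 0.333333 for treat/test) and should be read as an assumption, not as the extension's definitive implementation.

```python
def bigrams(text):
    """Distinct 2-grams of text padded with one space on each side."""
    padded = f" {text} "
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def bigm_similarity(a, b):
    """Shared 2-grams divided by the size of the larger 2-gram set (assumed formula)."""
    grams_a, grams_b = bigrams(a), bigrams(b)
    shared = len(grams_a & grams_b)
    return shared / max(len(grams_a), len(grams_b))


# 'test' and 'text' each have 5 bigrams and share 3 of them: ' t', 'te', 't '.
print(bigm_similarity("test", "text"))  # 0.6
```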
Parameter - similarity_limit
• Specifies the threshold used for the similarity search
• Search returns rows with similarity value >= similarity_limit
• Default: 0.3
• The SET command can be used to modify the value.
=# SHOW pg_bigm.similarity_limit;
pg_bigm.similarity_limit
--------------------------
0.3
(1 row)
=# SET pg_bigm.similarity_limit = 0.5;
Similarity Operator - =%
• Used to perform similarity search
• Uses the full text search index.
• Returns rows whose similarity is higher than or equal to the value of pg_bigm.similarity_limit
SELECT * FROM <tbl> WHERE <col> =% '<key>';
Similarity Search Sample
=# SET pg_bigm.similarity_limit = 0.2;
=# SELECT *, bigm_similarity(col, 'test') FROM tbl WHERE col =% 'test';
col   | bigm_similarity
-------+-----------------
test  | 1
text  | 0.6
treat | 0.333333
(3 rows)
=# SET pg_bigm.similarity_limit = 0.5;
=# SELECT *, bigm_similarity(col, 'test') FROM tbl WHERE col =% 'test';
col  | bigm_similarity
------+-----------------
test | 1
text | 0.6
(2 rows)
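The effect of the threshold in this sample can be imitated in Python. The sketch restates a similarity helper so that it runs on its own, using the same assumed formula of shared 2-grams divided by the size of the larger set. The extra row 'apple' and the function names are hypothetical.

```python
def bigrams(text):
    """Distinct 2-grams of text padded with one space on each side."""
    padded = f" {text} "
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def similarity(a, b):
    """Assumed formula: shared 2-grams over the size of the larger 2-gram set."""
    grams_a, grams_b = bigrams(a), bigrams(b)
    return len(grams_a & grams_b) / max(len(grams_a), len(grams_b))


def similarity_search(rows, key, similarity_limit=0.3):
    """Keep the rows whose similarity to key is at least similarity_limit."""
    scored = [(row, similarity(row, key)) for row in rows]
    return [(row, score) for row, score in scored if score >= similarity_limit]


rows = ["test", "text", "treat", "apple"]  # 'apple' is an added, hypothetical row
print(similarity_search(rows, "test", similarity_limit=0.2))
print(similarity_search(rows, "test", similarity_limit=0.5))
```

Raising the limit from 0.2 to 0.5 drops treat, just as in the sample output, while apple shares no 2-gram with test and is excluded at either setting.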