
Full Text Search Throwdown

Date post: 27-Jan-2015
Upload: karwin-software-solutions-llc
Description:
Odds are if you develop database applications you have been asked to make a large table of textual data searchable. How can we do this most reliably and efficiently? Bill compares a range of techniques to search text data with MySQL, with a focus on performance.
- Ad hoc searches with the LIKE predicate or regular expressions
- MySQL FULLTEXT index
- InnoDB FULLTEXT index in MySQL 5.6
- Apache Solr search engine
- Sphinx search engine
- Trigraphs
Full Text Search Throwdown Bill Karwin, Percona Inc.
Transcript
Page 1: Full Text Search Throwdown

Full Text Search Throwdown

Bill Karwin, Percona Inc.

Page 2: Full Text Search Throwdown

In a full text search, the search engine examines all of the words in every stored document as it tries to match search words supplied by the user.

http://www.flickr.com/photos/tryingyouth/


Page 3: Full Text Search Throwdown

www.percona.com

StackOverflow Test Data

• Data dump, exported December 2011
• 7.4 million Posts = 8.18 GB

Page 4: Full Text Search Throwdown


StackOverflow ER diagram

searchable text

Page 5: Full Text Search Throwdown

The Baseline: Naive Search Predicates

Page 6: Full Text Search Throwdown


Some people, when confronted with a problem, think

“I know, I’ll use regular expressions.”

Now they have two problems.

- Jamie Zawinski

Page 7: Full Text Search Throwdown

Accuracy issue

• Irrelevant or false matching words ‘one’, ‘money’, ‘prone’, etc.:
    SELECT * FROM Posts
    WHERE Body LIKE '%one%'

• Regular expressions in MySQL support escapes for word boundaries:
    SELECT * FROM Posts
    WHERE Body RLIKE '[[:<:]]one[[:>:]]'
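
The difference between the two predicates can be sketched outside the database. This is a minimal illustration in Python, not MySQL itself: the sample sentences are made up, and `\b` is used as a rough stand-in for MySQL's `[[:<:]]` and `[[:>:]]` word-boundary markers.

```python
import re

# Hypothetical sample rows; not taken from the StackOverflow data.
posts = [
    "Only one answer was accepted.",
    "This costs too much money.",
    "The parser is prone to errors.",
]

def substring_match(word, docs):
    """Rough analogue of Body LIKE '%word%': any occurrence counts."""
    return [d for d in docs if word in d.lower()]

def whole_word_match(word, docs):
    """Rough analogue of Body RLIKE '[[:<:]]word[[:>:]]'."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    return [d for d in docs if pattern.search(d)]

print(substring_match("one", posts))   # all three rows match
print(whole_word_match("one", posts))  # only the first row matches
```

The substring test accepts 'money' and 'prone' as matches for 'one', while the word-boundary test rejects them.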

Page 8: Full Text Search Throwdown

Performance issue

• LIKE with wildcards:
    SELECT * FROM Posts
    WHERE title LIKE '%performance%'
       OR body LIKE '%performance%'
       OR tags LIKE '%performance%';
  time: 49 sec

• POSIX regular expressions:
    SELECT * FROM Posts
    WHERE title RLIKE '[[:<:]]performance[[:>:]]'
       OR body RLIKE '[[:<:]]performance[[:>:]]'
       OR tags RLIKE '[[:<:]]performance[[:>:]]';
  time: 7 min 57 sec

Page 9: Full Text Search Throwdown

Why so slow?

  CREATE TABLE TelephoneBook (
    FullName VARCHAR(50)
  );

  CREATE INDEX name_idx ON TelephoneBook (FullName);

  INSERT INTO TelephoneBook VALUES
    ('Riddle, Thomas'),
    ('Thomas, Dean');

Page 10: Full Text Search Throwdown

Why so slow?

• Search for all with last name “Thomas” (uses index):
    SELECT * FROM TelephoneBook
    WHERE FullName LIKE 'Thomas%'

• Search for all with first name “Thomas” (can’t use index):
    SELECT * FROM TelephoneBook
    WHERE FullName LIKE '%Thomas'

Page 11: Full Text Search Throwdown

Because:

B-Tree indexes can’t search for substrings
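
The reason can be sketched with a sorted list standing in for the ordered keys of a B-tree index. This is only an analogy in Python, not how MySQL is implemented; the two extra names are hypothetical filler rows added to make the ordering visible.

```python
from bisect import bisect_left

# Sorted keys play the role of the B-tree's key order.
# 'Granger, Hermione' and 'Weasley, Ron' are made-up filler rows.
index = sorted([
    "Riddle, Thomas",
    "Thomas, Dean",
    "Granger, Hermione",
    "Weasley, Ron",
])

def prefix_search(keys, prefix):
    """Like LIKE 'prefix%': jump to the first candidate, then read in order."""
    start = bisect_left(keys, prefix)
    matches = []
    for key in keys[start:]:
        if not key.startswith(prefix):
            break  # sorted order guarantees no later key can match
        matches.append(key)
    return matches

def suffix_search(keys, suffix):
    """Like LIKE '%suffix': sort order does not help, so every key is examined."""
    return [key for key in keys if key.endswith(suffix)]

print(prefix_search(index, "Thomas"))  # ['Thomas, Dean']
print(suffix_search(index, "Thomas"))  # ['Riddle, Thomas']
```

The prefix search touches only the keys around the matching range, while the suffix search has to visit all of them, which is the list-based equivalent of a full table scan.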

Page 12: Full Text Search Throwdown

• FULLTEXT in MyISAM
• FULLTEXT in InnoDB
• Apache Solr
• Sphinx Search
• Trigraphs

Page 13: Full Text Search Throwdown

FULLTEXT in MyISAM

Page 14: Full Text Search Throwdown

FULLTEXT Index with MyISAM

• Special index type for MyISAM
• Integrated with SQL queries
• Indexes always in sync with data
• Balances features vs. speed vs. space

Page 15: Full Text Search Throwdown

Insert Data into Index (MyISAM)

  mysql> INSERT INTO Posts
         SELECT * FROM PostsSource;

time: 33 min, 34 sec

Page 16: Full Text Search Throwdown

Build Index on Data (MyISAM)

  mysql> CREATE FULLTEXT INDEX PostText
         ON Posts(title, body, tags);

time: 31 min, 18 sec

Page 17: Full Text Search Throwdown


Querying

SELECT * FROM Posts WHERE MATCH(column(s)) AGAINST('query pattern');

must include all columns of your index, in the order you defined

Page 18: Full Text Search Throwdown

Natural Language Mode (MyISAM)

• Searches concepts with free text queries:
    SELECT * FROM Posts
    WHERE MATCH(title, body, tags)
      AGAINST('mysql performance' IN NATURAL LANGUAGE MODE)
    LIMIT 100;

time with index: 200 milliseconds

Page 19: Full Text Search Throwdown

Query Profile: Natural Language Mode (MyISAM)

+-------------------------+----------+
| Status                  | Duration |
+-------------------------+----------+
| starting                | 0.000068 |
| checking permissions    | 0.000006 |
| Opening tables          | 0.000017 |
| init                    | 0.000032 |
| System lock             | 0.000007 |
| optimizing              | 0.000007 |
| statistics              | 0.000018 |
| preparing               | 0.000006 |
| FULLTEXT initialization | 0.198358 |
| executing               | 0.000012 |
| Sending data            | 0.001921 |
| end                     | 0.000005 |
| query end               | 0.000003 |
| closing tables          | 0.000018 |
| freeing items           | 0.000341 |
| cleaning up             | 0.000012 |
+-------------------------+----------+

Page 20: Full Text Search Throwdown

Boolean Mode (MyISAM)

• Searches words using mini-language:
    SELECT * FROM Posts
    WHERE MATCH(title, body, tags)
      AGAINST('+mysql +performance' IN BOOLEAN MODE)
    LIMIT 100;

time with index: 16 milliseconds
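
An application usually has to build the boolean-mode expression from user input. The following Python sketch shows one cautious way to do that. The helper name `boolean_mode_query` is hypothetical, and the `%s` placeholder assumes a MySQL driver that uses that DB-API parameter style; neither comes from the slides.

```python
import re

def boolean_mode_query(words):
    """Require every word by prefixing it with '+'.

    Non-word characters are stripped first so that user input cannot
    inject boolean operators such as + - * " ( ).
    """
    cleaned = [re.sub(r"\W+", "", w) for w in words]
    return " ".join("+" + w for w in cleaned if w)

# Parameterized form of the query shown above; the search expression is
# passed as a bound value rather than concatenated into the SQL text.
SQL = (
    "SELECT * FROM Posts "
    "WHERE MATCH(title, body, tags) AGAINST(%s IN BOOLEAN MODE) "
    "LIMIT 100"
)

params = (boolean_mode_query(["mysql", "performance"]),)
print(params[0])  # +mysql +performance
```

With a connected driver, the pair would typically be run as `cursor.execute(SQL, params)`.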

Page 21: Full Text Search Throwdown

Query Profile: Boolean Mode (MyISAM)

+-------------------------+----------+
| Status                  | Duration |
+-------------------------+----------+
| starting                | 0.000031 |
| checking permissions    | 0.000003 |
| Opening tables          | 0.000008 |
| init                    | 0.000017 |
| System lock             | 0.000004 |
| optimizing              | 0.000004 |
| statistics              | 0.000008 |
| preparing               | 0.000003 |
| FULLTEXT initialization | 0.000008 |
| executing               | 0.000001 |
| Sending data            | 0.015703 |
| end                     | 0.000004 |
| query end               | 0.000002 |
| closing tables          | 0.000007 |
| freeing items           | 0.000381 |
| cleaning up             | 0.000007 |
+-------------------------+----------+

Page 22: Full Text Search Throwdown

FULLTEXT in InnoDB

Page 23: Full Text Search Throwdown

FULLTEXT Index with InnoDB

• Under development in MySQL 5.6
• I’m testing 5.6.6 m1
• Usage very similar to FULLTEXT in MyISAM
• Integrated with SQL queries
• Indexes always* in sync with data
• Read the blogs for more details:

• http://blogs.innodb.com/wp/2011/07/overview-and-getting-started-with-innodb-fts/

• http://blogs.innodb.com/wp/2011/07/innodb-full-text-search-tutorial/

• http://blogs.innodb.com/wp/2011/07/innodb-fts-performance/

• http://blogs.innodb.com/wp/2011/07/difference-between-innodb-fts-and-myisam-fts/

Page 24: Full Text Search Throwdown

Insert Data into Index (InnoDB)

  mysql> INSERT INTO Posts
         SELECT * FROM PostsSource;

time: 55 min 46 sec

Page 25: Full Text Search Throwdown

Build Index on Data (InnoDB)

• Still under development; you might see problems:

  mysql> CREATE FULLTEXT INDEX PostText
         ON Posts(title, body, tags);

  ERROR 2013 (HY000): Lost connection to MySQL server during query

Page 26: Full Text Search Throwdown

Build Index on Data (InnoDB)

• Solution: make sure you define a primary key column `FTS_DOC_ID` explicitly:

  mysql> ALTER TABLE Posts
         CHANGE COLUMN PostId `FTS_DOC_ID` BIGINT UNSIGNED;

  mysql> CREATE FULLTEXT INDEX PostText
         ON Posts(title, body, tags);

time: 25 min 27 sec

Page 27: Full Text Search Throwdown

Natural Language Mode (InnoDB)

• Searches concepts with free text queries:
    SELECT * FROM Posts
    WHERE MATCH(title, body, tags)
      AGAINST('mysql performance' IN NATURAL LANGUAGE MODE)
    LIMIT 100;

time with index: 740 milliseconds

Page 28: Full Text Search Throwdown

Query Profile: Natural Language Mode (InnoDB)

+-------------------------+----------+
| Status                  | Duration |
+-------------------------+----------+
| starting                | 0.000074 |
| checking permissions    | 0.000007 |
| Opening tables          | 0.000020 |
| init                    | 0.000034 |
| System lock             | 0.000007 |
| optimizing              | 0.000009 |
| statistics              | 0.000020 |
| preparing               | 0.000008 |
| FULLTEXT initialization | 0.577257 |
| executing               | 0.000013 |
| Sending data            | 0.106279 |
| end                     | 0.000018 |
| query end               | 0.000012 |
| closing tables          | 0.000018 |
| freeing items           | 0.055584 |
| cleaning up             | 0.000039 |
+-------------------------+----------+

Page 29: Full Text Search Throwdown

Boolean Mode (InnoDB)

• Searches words using mini-language:
    SELECT * FROM Posts
    WHERE MATCH(title, body, tags)
      AGAINST('+mysql +performance' IN BOOLEAN MODE)
    LIMIT 100;

time with index: 350 milliseconds

Page 30: Full Text Search Throwdown

Query Profile: Boolean Mode (InnoDB)

+-------------------------+----------+
| Status                  | Duration |
+-------------------------+----------+
| starting                | 0.000064 |
| checking permissions    | 0.000005 |
| Opening tables          | 0.000017 |
| init                    | 0.000047 |
| System lock             | 0.000007 |
| optimizing              | 0.000009 |
| statistics              | 0.000019 |
| preparing               | 0.000008 |
| FULLTEXT initialization | 0.347172 |
| executing               | 0.000014 |
| Sending data            | 0.008089 |
| end                     | 0.000011 |
| query end               | 0.000012 |
| closing tables          | 0.000015 |
| freeing items           | 0.001570 |
| cleaning up             | 0.000023 |
+-------------------------+----------+

Page 31: Full Text Search Throwdown

Apache Solr

Page 32: Full Text Search Throwdown

Apache Solr

• http://lucene.apache.org/solr/
• Built on Apache Lucene; started 2001
• Apache License
• Java implementation
• Web service architecture
• Many sophisticated search features

Page 33: Full Text Search Throwdown

DataImportHandler

• conf/solrconfig.xml:
  . . .
  <requestHandler name="/dataimport"
      class="org.apache.solr.handler.dataimport.DataImportHandler">
    <lst name="defaults">
      <str name="config">data-config.xml</str>
    </lst>
  </requestHandler>
  . . .

Page 34: Full Text Search Throwdown

DataImportHandler

• conf/data-config.xml:
  <dataConfig>
    <dataSource type="JdbcDataSource"
        driver="com.mysql.jdbc.Driver"
        url="jdbc:mysql://localhost/testpattern?useUnicode=true"
        batchSize="-1"
        user="xxxx"
        password="xxxx"/>
    <document>
      <entity name="id"
          query="SELECT PostId, ParentId, Title, Body, Tags FROM Posts">
      </entity>
    </document>
  </dataConfig>

batchSize="-1" is extremely important to avoid buffering the whole query result!

Page 35: Full Text Search Throwdown

DataImportHandler

• conf/schema.xml:
  . . .
  <fields>
    <field name="PostId" type="string" indexed="true" stored="true" required="true" />
    <field name="ParentId" type="string" indexed="true" stored="true" required="false" />
    <field name="Title" type="text_general" indexed="false" stored="false" required="false" />
    <field name="Body" type="text_general" indexed="false" stored="false" required="false" />
    <field name="Tags" type="text_general" indexed="false" stored="false" required="false" />
    <field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>
  </fields>

  <uniqueKey>PostId</uniqueKey>
  <defaultSearchField>text</defaultSearchField>

  <copyField source="Title" dest="text"/>
  <copyField source="Body" dest="text"/>
  <copyField source="Tags" dest="text"/>
  . . .

Page 37: Full Text Search Throwdown

Searching Solr

• http://localhost:8983/solr/select/?q=mysql+AND+performance

time: 79ms

Query results are cached (like MySQL Query Cache), so they return much faster on subsequent execution
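
The search URL above can be assembled programmatically. This is a small Python sketch using only the standard library: the helper name `solr_query_url` is hypothetical, and the host, port, and path are simply the ones shown on the slide. It builds the URL only and does not contact a server.

```python
from urllib.parse import urlencode

# Endpoint as shown on the slide; adjust for a real installation.
SOLR_SELECT = "http://localhost:8983/solr/select/"

def solr_query_url(terms, operator="AND"):
    """Join the terms with a boolean operator and URL-encode the q parameter."""
    query = f" {operator} ".join(terms)
    return SOLR_SELECT + "?" + urlencode({"q": query})

print(solr_query_url(["mysql", "performance"]))
```

`urlencode` encodes spaces as `+`, which is why the query appears as `mysql+AND+performance` in the URL.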

Page 38: Full Text Search Throwdown

Sphinx Search

Page 39: Full Text Search Throwdown

Sphinx Search

• http://sphinxsearch.com/
• Started in 2001
• GPLv2 license
• C++ implementation
• SphinxSE storage engine for MySQL
• Supports MySQL protocol, SQL-like queries
• Many sophisticated search features

Page 40: Full Text Search Throwdown

sphinx.conf

  source src1
  {
    type = mysql
    sql_host = localhost
    sql_user = xxxx
    sql_pass = xxxx
    sql_db = testpattern
    sql_query = SELECT PostId, ParentId, Title, Body, Tags FROM Posts
    sql_query_info = SELECT * FROM Posts \
      WHERE PostId=$id
  }

Page 41: Full Text Search Throwdown

sphinx.conf

  index test1
  {
    source = src1
    path = C:\Sphinx\data
  }

Page 42: Full Text Search Throwdown

Insert Data into Index (Sphinx)

  C:\Sphinx> indexer.exe -c sphinx.conf.in --verbose test1

  Sphinx 2.0.5-release (r3309)

  using config file 'sphinx.conf'...
  indexing index 'test1'...
  collected 7397507 docs, 5731.8 MB
  sorted 920.3 Mhits, 100.0% done
  total 7397507 docs, 5731776959 bytes
  total 500.149 sec, 11460138 bytes/sec, 14790.60 docs/sec
  total 11 reads, 15.898 sec, 314584.8 kb/call avg, 1445.3 msec/call avg
  total 542 writes, 3.129 sec, 12723.3 kb/call avg, 5.7 msec/call avg

  Execution time: 500.196 s

time: 8 min 20 sec

Page 43: Full Text Search Throwdown

Querying index

  $ mysql --port 9306

  Server version: 2.0.5-release (r3309)

  mysql> SELECT * FROM test1 WHERE MATCH('mysql performance');

  +---------+--------+
  | id      | weight |
  +---------+--------+
  | 6016856 |   6600 |
  | 4207641 |   6595 |
  | 2656325 |   6593 |
  | 7192928 |   5605 |
  | 8118235 |   5598 |
  . . .
  20 rows in set (0.02 sec)

Page 44: Full Text Search Throwdown

Querying index

  mysql> SHOW META;

  +---------------+-------------+
  | Variable_name | Value       |
  +---------------+-------------+
  | total         | 1000        |
  | total_found   | 7672        |
  | time          | 0.013       |
  | keyword[0]    | mysql       |
  | docs[0]       | 162287      |
  | hits[0]       | 363694      |
  | keyword[1]    | performance |
  | docs[1]       | 147249      |
  | hits[1]       | 210895      |
  +---------------+-------------+

time: 13ms

Page 45: Full Text Search Throwdown

Trigraphs

Page 46: Full Text Search Throwdown

Trigraphs Overview

• Not very fast, but still better than LIKE / RLIKE
• Generic, portable SQL solution
• No dependency on version, storage engine, third-party technology

Page 47: Full Text Search Throwdown

Three-Letter Sequences

  CREATE TABLE AtoZ (
    c CHAR(1),
    PRIMARY KEY (c)
  );

  INSERT INTO AtoZ (c) VALUES ('a'), ('b'), ('c'), ...

  CREATE TABLE Trigraphs (
    Tri CHAR(3),
    PRIMARY KEY (Tri)
  );

  INSERT INTO Trigraphs (Tri)
  SELECT CONCAT(t1.c, t2.c, t3.c)
  FROM AtoZ t1 JOIN AtoZ t2 JOIN AtoZ t3;

Page 48: Full Text Search Throwdown

Insert Data Into Index

  # Read every post through one connection, write trigraphs through another.
  my $sth = $dbh1->prepare("SELECT * FROM Posts") or die $dbh1->errstr;
  $sth->execute() or die $dbh1->errstr;
  $dbh2->begin_work;
  my $i = 0;
  while (my $row = $sth->fetchrow_hashref ) {
    # Combine the searchable columns into one lowercase string.
    my $text = lc(join('|', ($row->{title}, $row->{body}, $row->{tags})));
    # Collect the distinct three-letter sequences found in the text.
    my %tri;
    map($tri{$_}=1, ( $text =~ m/[[:alpha:]]{3}/g ));
    next unless %tri;
    # One multi-row INSERT per post; IGNORE skips duplicate rows.
    my $tuple_list = join(",", map("('$_',$row->{postid})", keys %tri));
    my $sql = "INSERT IGNORE INTO PostsTrigraph (tri, PostId) VALUES $tuple_list";
    $dbh2->do($sql) or die "SQL = $sql, ".$dbh2->errstr;
    # Commit in batches of 1000 posts and print a progress dot.
    if (++$i % 1000 == 0) {
      print ".";
      $dbh2->commit;
      $dbh2->begin_work;
    }
  }
  print ".\n";
  $dbh2->commit;

time: 116 min 50 sec
space: 16.2GiB
rows: 519 million
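
The extraction step of the script above can be tried in isolation. This Python sketch mirrors the Perl pattern under stated assumptions: the function names `trigraphs` and `index_rows` are hypothetical, and `[^\W\d_]` is used as an approximation of Perl's `[[:alpha:]]`. It produces the (tri, PostId) pairs but does not write to a database.

```python
import re

# Three consecutive letters; approximates Perl's [[:alpha:]]{3}.
TRIGRAPH = re.compile(r"[^\W\d_]{3}")

def trigraphs(*fields):
    """Return the distinct three-letter sequences found in the given columns."""
    text = "|".join(f or "" for f in fields).lower()
    # findall, like Perl's /g in list context, returns non-overlapping
    # matches, so 'mysql' contributes only 'mys'.
    return set(TRIGRAPH.findall(text))

def index_rows(post_id, *fields):
    """Build the (tri, PostId) pairs that would be inserted for one post."""
    return sorted((tri, post_id) for tri in trigraphs(*fields))

print(sorted(trigraphs("MySQL performance")))  # ['for', 'man', 'mys', 'per']
```

Because the matches do not overlap, each run of letters is cut into chunks starting at its first letter, which is why the lookups on the following slides use 'mys', 'per', 'for', and 'man'.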

Page 49: Full Text Search Throwdown

Indexed Lookups

  SELECT p.*
  FROM Posts p
  JOIN PostsTrigraph t1 ON
    t1.PostId = p.PostId AND t1.Tri = 'mys'

time: 46 sec

Page 50: Full Text Search Throwdown

Search Among Fewer Matches

  SELECT p.*
  FROM Posts p
  JOIN PostsTrigraph t1 ON
    t1.PostId = p.PostId AND t1.Tri = 'mys'
  JOIN PostsTrigraph t2 ON
    t2.PostId = p.PostId AND t2.Tri = 'per'

time: 19 sec

Page 51: Full Text Search Throwdown

Search Among Fewer Matches

  SELECT p.*
  FROM Posts p
  JOIN PostsTrigraph t1 ON
    t1.PostId = p.PostId AND t1.Tri = 'mys'
  JOIN PostsTrigraph t2 ON
    t2.PostId = p.PostId AND t2.Tri = 'per'
  JOIN PostsTrigraph t3 ON
    t3.PostId = p.PostId AND t3.Tri = 'for'

time: 22 sec

Page 52: Full Text Search Throwdown

Search Among Fewer Matches

  SELECT p.*
  FROM Posts p
  JOIN PostsTrigraph t1 ON
    t1.PostId = p.PostId AND t1.Tri = 'mys'
  JOIN PostsTrigraph t2 ON
    t2.PostId = p.PostId AND t2.Tri = 'per'
  JOIN PostsTrigraph t3 ON
    t3.PostId = p.PostId AND t3.Tri = 'for'
  JOIN PostsTrigraph t4 ON
    t4.PostId = p.PostId AND t4.Tri = 'man'

time: 13.6 sec

Page 53: Full Text Search Throwdown

Narrow Down Further

  SELECT p.*
  FROM Posts p
  JOIN PostsTrigraph t1 ON
    t1.PostId = p.PostId AND t1.Tri = 'mys'
  JOIN PostsTrigraph t2 ON
    t2.PostId = p.PostId AND t2.Tri = 'per'
  JOIN PostsTrigraph t3 ON
    t3.PostId = p.PostId AND t3.Tri = 'for'
  JOIN PostsTrigraph t4 ON
    t4.PostId = p.PostId AND t4.Tri = 'man'
  WHERE CONCAT(p.title,p.body,p.tags) LIKE '%mysql%'
    AND CONCAT(p.title,p.body,p.tags) LIKE '%performance%';

time: 13.8 sec
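
The two-phase idea behind these queries (narrow down candidates with trigraph joins, then confirm with a substring test) can be sketched in memory. This Python illustration uses sets in place of the PostsTrigraph table; the three sample posts and the function names are hypothetical and chosen so that one post passes the trigraph filter but fails the final check.

```python
import re
from collections import defaultdict

TRIGRAPH = re.compile(r"[^\W\d_]{3}")  # three consecutive letters

# Hypothetical posts keyed by PostId.
posts = {
    1: "MySQL performance tuning",
    2: "mystery performer for mankind",
    3: "PostgreSQL performance",
}

def build_index(docs):
    """Map each trigraph to the set of PostIds containing it."""
    index = defaultdict(set)
    for post_id, text in docs.items():
        for tri in TRIGRAPH.findall(text.lower()):
            index[tri].add(post_id)
    return index

def candidates(index, words):
    """Intersect the posting sets, like the chain of JOINs on PostsTrigraph."""
    result = None
    for word in words:
        for tri in TRIGRAPH.findall(word.lower()):
            ids = set(index.get(tri, set()))
            result = ids if result is None else result & ids
    return result if result is not None else set()

def search(docs, index, words):
    """Confirm candidates with a substring test, like the final WHERE ... LIKE."""
    return sorted(
        pid for pid in candidates(index, words)
        if all(w.lower() in docs[pid].lower() for w in words)
    )

index = build_index(posts)
print(sorted(candidates(index, ["mysql", "performance"])))  # [1, 2]
print(search(posts, index, ["mysql", "performance"]))        # [1]
```

Post 2 contains all four trigraphs without containing either search word, which is why the final substring check is still needed after the joins.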

Page 55: Full Text Search Throwdown


Time to Insert Data into Index

LIKE expression n/a

FULLTEXT MyISAM 33 min, 34 sec

FULLTEXT InnoDB 55 min, 46 sec

Apache Solr 14 min, 28 sec

Sphinx Search 8 min, 20 sec

Trigraphs 116 min, 50 sec

Page 56: Full Text Search Throwdown

[Bar chart: Insert Data into Index (sec), axis 0 to 8000; categories: LIKE, MyISAM, InnoDB, Solr, Sphinx, Trigraph]

Page 57: Full Text Search Throwdown


Time to Build Index on Data

LIKE expression n/a

FULLTEXT MyISAM 31 min, 18 sec

FULLTEXT InnoDB 25 min, 27 sec

Apache Solr n/a

Sphinx Search n/a

Trigraphs n/a

Page 58: Full Text Search Throwdown

[Bar chart: Build Index on Data (sec), axis 0 to 4000; categories: LIKE, MyISAM, InnoDB, Solr, Sphinx, Trigraph; three categories marked n/a]

Page 59: Full Text Search Throwdown


Index Storage

LIKE expression n/a

FULLTEXT MyISAM 2382 MiB

FULLTEXT InnoDB ? MiB

Apache Solr 2766 MiB

Sphinx Search 3355 MiB

Trigraphs 16589 MiB

Page 60: Full Text Search Throwdown

[Bar chart: Index Storage (MiB), axis 0 to 20000; categories: LIKE, MyISAM, InnoDB, Solr, Sphinx, Trigraph]

Page 61: Full Text Search Throwdown


Query Speed

LIKE expression 49,000ms - 399,000ms

FULLTEXT MyISAM 16-200ms

FULLTEXT InnoDB 350-740ms

Apache Solr 79ms

Sphinx Search 13ms

Trigraphs 13800ms
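
The gaps in this table are easier to judge as ratios. The following Python sketch is plain arithmetic on the figures listed above: it compares the fastest LIKE time (49,000 ms) against the slowest time reported for each indexed solution, which is a deliberately conservative comparison chosen here for illustration.

```python
# Figures copied from the Query Speed table, in milliseconds (best, worst).
like_best_ms = 49_000
query_ms = {
    "FULLTEXT MyISAM": (16, 200),
    "FULLTEXT InnoDB": (350, 740),
    "Apache Solr": (79, 79),
    "Sphinx Search": (13, 13),
    "Trigraphs": (13_800, 13_800),
}

def speedup_vs_like(worst_ms):
    """How many times faster than the best LIKE result."""
    return like_best_ms / worst_ms

speedups = {
    name: round(speedup_vs_like(worst), 1)
    for name, (_best, worst) in query_ms.items()
}

for name, factor in speedups.items():
    print(f"{name}: {factor}x faster than LIKE")
```

On these numbers the dedicated full-text indexes are tens to thousands of times faster than the best LIKE result, while the trigraph approach is only a few times faster.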

Page 62: Full Text Search Throwdown

[Bar chart: Query Speed (ms), axis 0 to 400000; categories: LIKE, MyISAM, InnoDB, Solr, Sphinx, Trigraph]

Page 63: Full Text Search Throwdown

[Bar chart: Query Speed (ms), axis 0 to 1000; categories: LIKE, MyISAM, InnoDB, Solr, Sphinx, Trigraph]

Page 64: Full Text Search Throwdown

Bottom Line

                  build    insert   storage    query          solution
LIKE expression   0        0        0          49k-399k ms    SQL
FULLTEXT MyISAM   31:18    33:28    2382MiB    16-200ms       MySQL
FULLTEXT InnoDB   25:27    55:46    ?          350-740ms      MySQL 5.6
Apache Solr       n/a      14:28    2766MiB    79ms           Java
Sphinx Search     n/a      8:20     3487MiB    13ms           C++
Trigraphs         n/a      116:50   16.2 GiB   13,800ms       SQL

Page 65: Full Text Search Throwdown

Final Thoughts

• Third-party search engines are complex to keep in sync with data, and adding another type of server adds more operations work for you.

• Built-in FULLTEXT indexes are therefore useful even if they are not absolutely the fastest.

• Different search implementations may return different results, so you should evaluate what works best for your project.

• Any indexed search solution is orders of magnitude better than LIKE!

Page 66: Full Text Search Throwdown

www.percona.com/live

New York, October 1-2, 2012
London, December 3-4, 2012
Santa Clara, April 22-25, 2013

Page 67: Full Text Search Throwdown

Expert instructors
In-person training
Custom onsite training
Live virtual training

http://www.percona.com/training

Page 69: Full Text Search Throwdown

Copyright 2012 Bill Karwin
www.slideshare.net/billkarwin

Released under a Creative Commons 3.0 License: http://creativecommons.org/licenses/by-nc-nd/3.0/

You are free to share - to copy, distribute and transmit this work, under the following conditions:

Attribution. You must attribute this work to Bill Karwin.

Noncommercial. You may not use this work for commercial purposes.

No Derivative Works. You may not alter, transform, or build upon this work.

