Yappo Groonga - with japanese search software history @ osdc.tw 2011

Groonga

OSDC.tw 2011 Yappo(大沢和宏)

with japanese search software history

yappo {aT} shibuya {dOt} pl

http://blog.yappo.jp/

http://github.com/yappo/

http://search.cpan.org/~yappo/

http://twitter.com/yappo

Monday, March 28, 2011




Profile

• Yappo
• from Tokyo (東京)


employer


our services are

• Ficia
• Pikubo




my latest topics

• iPhone web site development

• jQuery Mobile hack

• tiny Perl hack



agenda

• yappo with search software
• Japanese search software topics
• Groonga


yappo with search software


Since 1997

I started the search engine service 'Yappo' in 1997. 'Yappo' is the service name.

It was very cheap to build, using grep.

I ran it on a rental server and got banned from the server because the service caused a high load average.


Since 1998

I wrote search engine software for an ISP as part of my job.

It was more modern than grep.

I wrote the indexer and searcher in C.


Since 1999

I started the search engine service 'iYappo' in 1999. iYappo is for Japanese mobile devices (i-mode).

I developed the crawler, indexer, and searcher myself.

But I later switched to other software because maintenance was very hard.


I have been using search software for a long time.


history of search software in japan


use grep

One of the easiest ways to search is grep. It's sometimes called "idiot search" in Japan.

It's not good for searching lots of documents... and it's slow.

However, it does have merit; it's easy to implement.
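The grep approach amounts to scanning every document for the keyword on every query. A minimal sketch of that idea in Python follows; the function name `grep_search` and the sample documents are made up for illustration and are not from the talk.

```python
def grep_search(documents, keyword):
    """Return ids of documents whose text contains keyword.

    Every query scans every document, which is why this gets slow
    as the collection grows.
    """
    return [doc_id for doc_id, text in documents.items() if keyword in text]


# Hypothetical sample documents.
documents = {
    1: "today is rainy.",
    2: "今日は雨です。",
    3: "tomorrow is sunny.",
}

print(grep_search(documents, "is"))  # [1, 3]
print(grep_search(documents, "雨"))  # [2]
```

Because it is a plain substring test, it works for Japanese text without any tokenizing, which is part of why it is so easy to implement.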


using index

You need to know which document contains which word to search things quickly. It's easy for English. But it's very difficult for Japanese.


word separation

English is easy because English sentences are basically separated by white space. (You still need to handle declension and conjugation, though.)

You can't easily tell which character belongs to which word in Japanese.


in english

• "today is rainy."
• "today", "is", "rainy"

So you can write a simple tokenizer for English by splitting sentences on white space.
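As a rough sketch of how white-space tokenizing feeds an index of "which document contains which word", the Python below builds a tiny inverted index. The names `tokenize_english` and `build_index` and the sample documents are illustrative assumptions; declension and conjugation are deliberately not handled.

```python
from collections import defaultdict


def tokenize_english(sentence):
    """Lowercase, split on white space, and strip surrounding punctuation."""
    tokens = []
    for raw in sentence.lower().split():
        word = raw.strip(".,!?\"'")
        if word:
            tokens.append(word)
    return tokens


def build_index(documents):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize_english(text):
            index[token].add(doc_id)
    return index


# Hypothetical sample documents.
documents = {1: "today is rainy.", 2: "tomorrow is sunny."}
index = build_index(documents)

print(tokenize_english("today is rainy."))  # ['today', 'is', 'rainy']
print(sorted(index["is"]))                  # [1, 2]
print(sorted(index["rainy"]))               # [1]
```

A query then becomes a dictionary lookup instead of a scan over every document.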


in japanese

• "今日は雨です。"

• "今日", "は", "雨", "です", "。"

Japanese sentences are not separated by white space. So you need to know the meaning and context of the words in the sentence.

You can't split based on whether a character is Kanji or Kana.


in japanese 2

• "私ははずかしいです。"

• "私", "は", "はずかしい", "です", "。" (in English: "I am ashamed")

In Japanese you can write a word in Kana instead of Kanji. This makes tokenizing more difficult.
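To see why splitting on character type is not enough, here is a small Python sketch that groups consecutive characters of the same script. The function names and the code-point ranges used (basic Hiragana, Katakana, and CJK Unified Ideographs blocks) are simplifying assumptions for illustration.

```python
from itertools import groupby


def script_of(ch):
    """Classify a character by Unicode block (simplified)."""
    code = ord(ch)
    if 0x3040 <= code <= 0x309F:
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:
        return "kanji"
    return "other"


def split_by_script(sentence):
    """Cut the sentence wherever the script changes."""
    return ["".join(group) for _, group in groupby(sentence, key=script_of)]


print(split_by_script("今日は雨です。"))
# ['今日', 'は', '雨', 'です', '。']
print(split_by_script("私ははずかしいです。"))
# ['私', 'ははずかしいです', '。']
```

The first sentence happens to split correctly, but in the second the particle "は" and the Kana-only word "はずかしい" are glued together with "です", because they are all Hiragana.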


solutions



morphological analysis (形態素解析 / 詞素解析)

KAKASI often tokenizes incorrectly because it uses a longest-match-first algorithm.

Morphological analysis achieves high precision because it uses a grammar that has been learned.
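The longest-first failure mode can be illustrated with a toy greedy tokenizer in Python. This is only a sketch of the general longest-match idea, not KAKASI's actual implementation; the function name and the five-word dictionary (including "はは", "mother") are made up for the example.

```python
def longest_match_tokenize(text, dictionary):
    """At each position take the longest dictionary word; otherwise one character."""
    max_len = max(len(word) for word in dictionary)
    tokens = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first and shrink until something matches.
        for length in range(min(max_len, len(text) - pos), 0, -1):
            candidate = text[pos:pos + length]
            if length == 1 or candidate in dictionary:
                tokens.append(candidate)
                pos += length
                break
    return tokens


# Hypothetical dictionary for illustration only.
dictionary = {"私", "は", "はは", "はずかしい", "です"}

print(longest_match_tokenize("私ははずかしいです。", dictionary))
# ['私', 'はは', 'ず', 'か', 'し', 'い', 'です', '。']
```

The greedy choice of "はは" consumes the first character of "はずかしい", so the rest of the word falls apart into single characters. A morphological analyzer avoids this by weighing whole-sentence alternatives against a grammar.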




n-gram

However, MeCab has its limits in tokenizing. Even a sophisticated algorithm can only go so far.

MeCab relies on a dictionary, so it does not know new words and names.

- ex. けいおん, K-ON

Japanese people love creating new words, so dictionaries cannot keep up.

- ex. Twitter -> ヒウィッヒヒー

N-gram is sometimes used to work around these drawbacks.


3-gram example

• けいおん -> "けいお", "いおん"
• K-ON -> "K-O", "-ON"
• Twitter -> "Twi", "wit", "itt", "tte", "ter"
• ヒウィッヒヒー -> "ヒウィ", "ウィッ", "ィッヒ", "ッヒヒ", "ヒヒー"
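The 3-gram splits above can be reproduced with a few lines of Python. The helper name `ngrams` is an illustrative choice; it simply slides a window of `n` characters over the text and needs no dictionary.

```python
def ngrams(text, n=3):
    """Return every run of n consecutive characters in text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


print(ngrams("けいおん"))        # ['けいお', 'いおん']
print(ngrams("K-ON"))            # ['K-O', '-ON']
print(ngrams("Twitter"))         # ['Twi', 'wit', 'itt', 'tte', 'ter']
print(ngrams("ヒウィッヒヒー"))  # ['ヒウィ', 'ウィッ', 'ィッヒ', 'ッヒヒ', 'ヒヒー']
```

Since the same function is applied to both documents and queries, a brand-new word can be found as long as its character sequences line up.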


summary of word separation

• Japanese is very hard to split into words
• but we have solutions
  - morphological analysis
  - n-gram


other search software in japan


Namazu

http://www.namazu.org/

Namazu means catfish (鮎) in English.

Namazu has been developed in Japan since the early days and provides an indexer and a searcher. It is not suitable for embedding in other systems.



HyperEstraier

http://fallabs.com/hyperestraier/

Released in 2004, with a crawler, indexer, and searcher.

Suitable for embedding in other systems.




Rast

http://projects.netlab.jp/rast/

Released in 2005, developed by NaCl, suitable for embedding.

I wrote a Perl binding for it. It was used in iYappo, but I stopped using it because of trouble with indexing.

Rast is now deprecated.




Senna

http://qwik.jp/senna/

Released in 2005, suitable for embedding. A Perl binding was written by @lestrrat.

Senna can be integrated into MySQL's full-text search system.




tritonn

http://qwik.jp/tritonn/

tritonn is a project that maintains a MySQL patch for integrating Senna.

> SELECT * FROM tbl WHERE MATCH(col) AGAINST("検索キーワード");
> INSERT INTO t1 VALUES (3, "東京特許許可局");

Used in iYappo; very useful. But Senna is now deprecated.




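next-generation search systems

HyperEstraier, Senna, and Rast were released around the same time, and a boom in next-generation search systems arrived in Japan as well. Because all of them are provided as easy-to-use libraries, hackers became able to add search features casually. For example, Plagger::Plugin::Search::Estraier, Plagger::Plugin::Search::Rast, and Plagger::Plugin::Search::Senna were written for Plagger.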


good book

A good book on search problems, published in Japan.

This book was written by a Senna/Groonga developer.


You are probably wondering, "What about Lucene and others?"


Lucene is not a Japanese product, so I will not cover it in this talk.



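Groonga

http://groonga.org/

Groonga is new search software made by the developers of Senna. I am not involved in its development; sorry, I am not a developer. It addresses Senna's shortcomings while adding more features. It is the familiar story of "it would be better to rewrite it from scratch." It implements the features used in their company's products.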





Groonga spec

• bundled groonga daemon (HTTP, memcached protocol, groonga protocol)

• suitable for embedding

• geolocation search

• fast aggregation queries

• Groonga has no English documentation ;( (I was surprised)

• hacking on it is very hard because the documentation is thin

install

$ wget http://groonga.org/files/groonga/groonga-1.1.0.tar.gz
$ tar zxvf groonga-1.1.0.tar.gz && cd groonga-1.1.0
$ ./configure --prefix=/usr --localstatedir=/var
$ make && sudo make install



tiny demos for the groonga CLI


MySQL Storage

https://github.com/mroonga

Groonga has a MySQL storage engine plugin, similar to Tritonn.



example 1/2

mysql> CREATE TABLE t1 (
     >   c1 INT PRIMARY KEY,
     >   c2 TEXT,
     >   _score FLOAT,
     >   FULLTEXT INDEX (c2)
     > ) ENGINE = groonga DEFAULT CHARSET utf8;
Query OK, 0 rows affected (0.22 sec)


example 2/2

mysql> insert into t1 values(1, "aa ii uu ee oo", null);
Query OK, 1 row affected (0.00 sec)
mysql> insert into t1 values(2, "aa ii ii ii oo", null);
Query OK, 1 row affected (0.00 sec)
mysql> insert into t1 values(3, "dummy", null);
Query OK, 1 row affected (0.00 sec)

mysql> select * from t1 where match(c2) against("ii") order by _score desc;
+----+----------------+--------+
| c1 | c2             | _score |
+----+----------------+--------+
|  2 | aa ii ii ii oo |      3 |
|  1 | aa ii uu ee oo |      1 |
+----+----------------+--------+
2 rows in set (0.00 sec)
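The `_score` values in this result are consistent with simply counting how often the query term occurs in each row. The Python toy model below reproduces the ordering under that assumption. Whether the groonga storage engine really computes its score this way is not stated in the talk, and the function names here are made up.

```python
def term_count_score(text, term):
    """Toy score: number of white-space separated tokens equal to term."""
    return text.split().count(term)


def search(rows, term):
    """Return (id, text, score) for matching rows, highest score first."""
    hits = [(row_id, text, term_count_score(text, term)) for row_id, text in rows.items()]
    hits = [hit for hit in hits if hit[2] > 0]
    return sorted(hits, key=lambda hit: hit[2], reverse=True)


# Same data as the INSERT statements above.
rows = {1: "aa ii uu ee oo", 2: "aa ii ii ii oo", 3: "dummy"}

for row_id, text, score in search(rows, "ii"):
    print(row_id, text, score)
# 2 aa ii ii ii oo 3
# 1 aa ii uu ee oo 1
```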


bindings

• Python
• PHP
• Ruby (rroonga / Ranguba)

http://groonga.rubyforge.org/



perl binding

https://github.com/yappo/p5-Groonga

I wrote a Perl binding for Groonga. I'm still working on it; it is not complete yet.



package main {
    no utf8;

    my $path = 'tag_keys.db';
    my $pat  = Groonga::PatriciaTrie->new;
    if (! $pat->open($path)) {
        $pat->create($path, 1024, 1024, GRN_OBJ_KEY_VAR_SIZE | GRN_OBJ_KEY_NORMALIZE)
            or die 'Groonga::PatriciaTrie create error';
    }

    # register the keys to look for
    $pat->add('ガッ', '');
    $pat->add('muteki', '');
    $pat->add('yappo', '');

    my $text = 'muTEki マッチしない Yappo <> ガッ';

    # judging from the output below, $record is the text as it appears in
    # $text and $word is the registered key that it matched
    my $replace = $pat->tag_keys($text, sub {
        my($record, $word, $record_id) = @_;
        sprintf '%s(%s)', $record, $word;
    });

    say $replace;
}

__END__
muTEki(muteki) マッチしない Yappo(yappo) <> ガッ(ガッ)
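For readers without the binding installed, the behaviour shown in that output can be imitated in plain Python. This is only a rough stand-in: it uses a case-insensitive regular expression in place of Groonga's patricia trie and key normalization, and the name `tag_keys_sketch` and its callback signature are assumptions made for the example.

```python
import re


def tag_keys_sketch(text, keys, callback):
    """Replace every occurrence of a key, ignoring case, with callback(matched, key)."""
    # Longest keys first so a long key wins over a shorter key that is its prefix.
    ordered = sorted(keys, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(key) for key in ordered), re.IGNORECASE)
    lookup = {key.lower(): key for key in keys}

    def replace(match):
        matched = match.group(0)
        return callback(matched, lookup[matched.lower()])

    return pattern.sub(replace, text)


keys = ["ガッ", "muteki", "yappo"]
text = "muTEki マッチしない Yappo <> ガッ"

print(tag_keys_sketch(text, keys, lambda matched, key: "%s(%s)" % (matched, key)))
# muTEki(muteki) マッチしない Yappo(yappo) <> ガッ(ガッ)
```

Lowercasing covers this example, but real normalization in Groonga handles more than letter case, so the sketch should not be read as equivalent.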


Summary of this talk

• Making search engines in Japanese.

• One of the hot topics is Groonga.
• There's no English documentation ;( We'll write it in the near future.

• I'd be happy if this interests you.


謝謝 (thank you)

Date post:	20-Jan-2015
Category:	Documents
Upload:	kazuhiro-osawa