
IR: Open source state

Date post: 20-Mar-2017
Upload: dmitry-kan
IR: open source state Dmitry Kan, AlphaSense, Insider Solutions University of Helsinki, Information Retrieval and Search Engines course, Feb 21, 2017
Transcript
Page 1: IR: Open source state

IR: open source state

Dmitry Kan, AlphaSense, Insider Solutions

University of Helsinki, Information Retrieval and Search Engines course, Feb 21, 2017

Page 2: IR: Open source state

About me

● PhD in CS (Saint Petersburg State University), 2011

● Running a Search Engine team at AlphaSense since 2014

● Founded Insider Solutions in 2009: text analytics solutions + consulting

● Co-committer on the Luke project (a toolbox for Lucene indexes) since 2013

Page 3: IR: Open source state

What is AlphaSense

● Google for financial analysts

● Semantic research engine

● Edit, tag, annotate, share your data in a team

● Oracle, JP Morgan, Credit Suisse

● Engineering is 98% in Helsinki + 1% NYC + 1% India

● #1 fastest growing IT startup in Finland by Deloitte (2015)

www.alpha-sense.com

Page 4: IR: Open source state

Insider Solutions

● Founded 2009

● BigText Analytics APIs and on-premise solutions

○ Sentiment analysis: Russian, Chinese, English

○ Searchable trend extraction

● Consulting: startups and corporates

https://semanticanalyzer.info

Page 5: IR: Open source state

Outline

● Search engine architecture

● Open source search ecosystem

● Research directions for applied IR

Page 6: IR: Open source state

Search engine: building blocks

● Web crawler: Apache Nutch (based on Hadoop)

● Data ingestion pipeline: receiving, cleaning, data extraction

● SolrCloud OR Elasticsearch (both based on Lucene)

● Shards: storing index on disk and / or memory
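
To make the idea of shards concrete, here is a minimal sketch of hash-based routing of documents to shards. The names `shard_for` and `route` and the shard count are hypothetical illustrations; SolrCloud and Elasticsearch each have their own routing logic, which this does not reproduce.

```python
import hashlib


def shard_for(doc_id: str, num_shards: int) -> int:
    """Map a document ID to a shard number using a stable hash (illustrative)."""
    digest = hashlib.md5(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_shards


def route(doc_ids, num_shards):
    """Group document IDs by the shard they would be stored on."""
    shards = {i: [] for i in range(num_shards)}
    for doc_id in doc_ids:
        shards[shard_for(doc_id, num_shards)].append(doc_id)
    return shards
```

Because the hash is stable, the same document ID always lands on the same shard, so an update can be sent to the shard that already holds the document.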

Page 7: IR: Open source state

Lucene / Solr history timeline

Page 9: IR: Open source state

Inject URLs

Create segments

New URLs

Page 10: IR: Open source state

Search Engine Software Components

● Schema

● Query parser

● Scoring algorithm

● Snippet highlighter

● Index (on-disk or in-memory)
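
The components listed above can be sketched in one toy class. This is an illustration under simplifying assumptions, not how Lucene works internally: the "schema" is a single text field, the "query parser" requires every term (AND), the score is a plain sum of term frequencies, and the class name `TinyIndex` is hypothetical.

```python
import re
from collections import defaultdict

TOKEN = re.compile(r"\w+")


def analyze(text):
    """Lowercase and split text into word tokens."""
    return [t.lower() for t in TOKEN.findall(text)]


class TinyIndex:
    def __init__(self):
        self.postings = defaultdict(dict)  # term -> {doc_id: term frequency}
        self.docs = {}                     # doc_id -> original text

    def add(self, doc_id, text):
        self.docs[doc_id] = text
        for term in analyze(text):
            self.postings[term][doc_id] = self.postings[term].get(doc_id, 0) + 1

    def search(self, query):
        """Return ids of documents containing every query term, best score first."""
        terms = analyze(query)
        if not terms:
            return []
        hits = set(self.postings.get(terms[0], {}))
        for term in terms[1:]:
            hits &= set(self.postings.get(term, {}))
        scored = [(sum(self.postings[t][d] for t in terms), d) for d in hits]
        return [d for _, d in sorted(scored, key=lambda p: (-p[0], p[1]))]

    def highlight(self, doc_id, query):
        """Wrap query terms found in the stored text in <em> tags."""
        terms = set(analyze(query))

        def mark(match):
            word = match.group(0)
            return f"<em>{word}</em>" if word.lower() in terms else word

        return TOKEN.sub(mark, self.docs[doc_id])
```

A short usage example:

```python
idx = TinyIndex()
idx.add(1, "Lucene index on disk")
idx.add(2, "Solr uses Lucene, and Lucene uses an index")
print(idx.search("lucene index"))
print(idx.highlight(1, "lucene"))
```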

Page 11: IR: Open source state

Query analysis and suggestions

Page 12: IR: Open source state

British vs US English handling
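
The slide's screenshot is not part of the transcript. One common approach to this problem is to map spelling variants to a single form at both index time and query time. The sketch below assumes a small hand-made mapping; the table `BRITISH_TO_US` and the function `normalize` are hypothetical names, not something shown on the slide.

```python
# Hypothetical, hand-made mapping; a real deployment would use a much larger list.
BRITISH_TO_US = {
    "colour": "color",
    "analyse": "analyze",
    "centre": "center",
}


def normalize(tokens):
    """Lowercase tokens and replace British spellings with their US form."""
    return [BRITISH_TO_US.get(t.lower(), t.lower()) for t in tokens]
```

Applying the same function to indexed text and to the query means "colour" and "color" match each other.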

Page 13: IR: Open source state

One shard of the index

Page 14: IR: Open source state

Content extraction

Apache Tika for parsing formats:

● HTML, XML

● PDF

● Microsoft Office & iWork document formats

● Audio, image, video

● Mail

● Source code

Page 15: IR: Open source state

Inspecting Lucene index with Luke

Implemented by Andrzej Bialecki. Since 2013 → by Dmitry Kan (Finland) and Tomoko Uchida (Japan)

● Perform index maintenance

● Prototype similarity functions

● Search for documents, reconstruct field values from the index

● Read index from HDFS (Hadoop’s distributed file system)

● Supports Apache Solr and Elasticsearch

Page 17: IR: Open source state

Learning to rank: Solr

Contributed by Bloomberg

Machine-learned model for reranking documents based on user feedback

Trained on features such as: views, popularity, whether the query hit the title, document length, whether the document can be viewed on a mobile device

LambdaMART, RankSVM
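
To illustrate the reranking idea, here is a minimal sketch of a pairwise linear ranker in the spirit of RankSVM (hinge loss on pairs of documents). It is not the Solr plugin's code and not LambdaMART, which uses boosted trees. The function names, the learning rate and the two-feature layout are assumptions made for the example.

```python
def train_pairwise(pairs, num_features, epochs=50, lr=0.1, reg=0.01):
    """Learn weights from (better, worse) feature-vector pairs.

    Each pair says: according to user feedback, the first document
    should rank above the second.
    """
    w = [0.0] * num_features
    for _ in range(epochs):
        for better, worse in pairs:
            diff = [b - c for b, c in zip(better, worse)]
            margin = sum(wi * di for wi, di in zip(w, diff))
            # L2 shrinkage on every step
            w = [wi * (1 - lr * reg) for wi in w]
            # hinge-loss update only when the pair is not separated by margin 1
            if margin < 1.0:
                w = [wi + lr * di for wi, di in zip(w, diff)]
    return w


def rerank(docs, w):
    """docs: list of (doc_id, features). Return doc ids, highest model score first."""
    def score(features):
        return sum(wi * fi for wi, fi in zip(w, features))

    return [doc_id for doc_id, feats in sorted(docs, key=lambda p: -score(p[1]))]
```

Usage, assuming feature 0 is "query hit the title" and feature 1 is a normalized document length:

```python
pairs = [([1.0, 0.2], [0.0, 0.9]), ([1.0, 0.5], [0.0, 0.5])]
w = train_pairwise(pairs, num_features=2)
print(rerank([("a", [0.0, 0.9]), ("b", [1.0, 0.2])], w))
```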

Page 18: IR: Open source state

Lucene scoring formula
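
The formula image is not part of the transcript. For orientation, the practical scoring function documented for `TFIDFSimilarity` in Lucene 4.0 (the page linked under "Lucene scoring formula" in the references) has this form:

```latex
\mathrm{score}(q,d) \;=\; \mathrm{coord}(q,d)\cdot \mathrm{queryNorm}(q)\cdot
\sum_{t \in q} \Big( \mathrm{tf}(t \in d)\cdot \mathrm{idf}(t)^{2}\cdot
\mathrm{boost}(t)\cdot \mathrm{norm}(t,d) \Big)
```

With the default similarity of that release, the two core factors are computed as:

```latex
\mathrm{tf}(t \in d) = \sqrt{\mathrm{freq}(t,d)}, \qquad
\mathrm{idf}(t) = 1 + \ln\!\left(\frac{\mathrm{numDocs}}{\mathrm{docFreq}(t) + 1}\right)
```

Here boost(t) stands for the query-time boost of term t, and norm(t,d) combines index-time boosts with the field length normalization.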

Page 22: IR: Open source state

Feature: is person and executive?

Page 23: IR: Open source state

Feature: recency of the document
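
The slide's content is not part of the transcript. One common way to turn document age into a score feature is a reciprocal decay, the shape of Solr's `recip(x, m, a, b) = a / (m*x + b)` function. The sketch below assumes a one-year scale, so a one-year-old document scores about 0.5; the constants and the helper name `recency` are illustrative choices, not values from the talk.

```python
from datetime import datetime, timezone

MS_PER_YEAR = 365.25 * 24 * 3600 * 1000


def recip(x, m, a, b):
    """Reciprocal function a / (m*x + b)."""
    return a / (m * x + b)


def recency(published, now, m=1.0 / MS_PER_YEAR):
    """Score in (0, 1]: 1.0 for a brand-new document, about 0.5 at one year old."""
    age_ms = max((now - published).total_seconds() * 1000.0, 0.0)
    return recip(age_ms, m, 1.0, 1.0)
```

Usage:

```python
now = datetime(2017, 2, 21, tzinfo=timezone.utc)
print(recency(datetime(2016, 2, 21, tzinfo=timezone.utc), now))
```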

Page 24: IR: Open source state

Features as signal of result importance

Page 25: IR: Open source state

Learnt model

Page 26: IR: Open source state

Word vectors with Lucene

Word2vec was released as open source by Google

Possible to train word2vec on Lucene index: https://github.com/kojisekig/word2vec-lucene

● NO need to provide a text file besides Lucene index

● NO need to normalize text. Normalization is already done in the index, or the Analyzer does it for you when processing

● Use part of the index by specifying a filter query
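
Once word vectors have been trained, related terms are typically found by cosine similarity. The sketch below uses tiny made-up vectors to show the lookup; the vectors and function names are hypothetical and are not output of word2vec-lucene.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest(word, vectors, k=2):
    """Return the k words whose vectors are most similar to the given word's."""
    target = vectors[word]
    others = [(cosine(target, vec), w) for w, vec in vectors.items() if w != word]
    return [w for _, w in sorted(others, reverse=True)[:k]]
```

Usage with made-up three-dimensional vectors:

```python
vectors = {
    "solr": [0.9, 0.1, 0.0],
    "lucene": [0.8, 0.2, 0.0],
    "banana": [0.0, 0.1, 0.9],
}
print(nearest("solr", vectors, k=1))
```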

Page 33: IR: Open source state

Questions?

Reach me at:

[email protected]

Twitter: @dmitrykan

Quora: https://www.quora.com/profile/Dmitry-Kan

Page 34: IR: Open source state

References

1. Luke: https://github.com/DmitryKey/luke
2. My blog: http://dmitrykan.blogspot.fi/
3. Solr vs Elasticsearch (overview): https://sematext.com/blog/2015/01/30/solr-elasticsearch-comparison/
4. Solr vs Elasticsearch (in-depth): https://sematext.com/blog/2012/08/23/solr-vs-elasticsearch-part-1-overview/
5. Introduction to Apache Solr: http://www.slideshare.net/ChristosManios/introduction-to-apache-solr-54076189
6. Word2vec-lucene: https://github.com/kojisekig/word2vec-lucene
7. Apache Tika: https://tika.apache.org/
8. Apache Solr: http://lucene.apache.org/solr/
9. Elasticsearch: https://github.com/elastic/elasticsearch
10. Learning to rank in Solr (video): https://www.youtube.com/watch?v=M7BKwJoh96s
11. Learning to rank in Solr (slides): https://lucidworks.com/2016/08/17/learning-to-rank-solr/
12. Word2vec: https://en.wikipedia.org/wiki/Word2vec#Analysis
13. Lucene scoring formula: https://lucene.apache.org/core/4_0_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html

