Overview of Recommind Decisiv Search

Petr Vasilev
Consultant, Integration and Search Department
Bouvet ASA


Search engines in Norway

• Long tradition for working with enterprise and web search engines

• Most popular solutions:
  – Microsoft/Fast
  – Apache Solr

• But the world is bigger than that!

Recommind

• Founded in 2000

• Headquartered in San Francisco; offices in London, Boston, Sydney and Bonn

• #157 on 2010 Deloitte Fast 500
  – Fastest growing eDiscovery company
  – Fastest growing Information Access company
  – 230+ employees

Product Families

CORE

• Fully powered search engine

• Analytics:
  – Concept extraction (probabilistic latent semantic indexing, pLSI)
  – Conceptual search
  – Categorization engine (pSVM with highest empirical performance)
  – Workflow rules and smart tagging
  – Duplicate detection
  – Near-duplicate computation
  – Thread extraction
  – Email footer detection

Architecture overview

Enrichment and categorization

API and Hooks

• CORE API
  – Crawler extensions
  – Connector SDK
  – Post-processor SDK

• Planned
  – Parser API
  – Custom MIME type detection

• CORE extensions
  – JAAS security SDK
  – Call-back API
  – Storage SDK

• Custom reports

Entity extraction

• Based on generated lists

• Custom names can be bound to entities
  – http://psi.hafslund.no/sesam/ifs/Customer is named “Kunde”

• Takes simple XML as input for names and entity lists

• Many-to-one matching is supported
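As a rough illustration of list-based extraction with many-to-one matching, the sketch below loads entity lists from XML and maps several spellings onto one entity. The XML layout, the element and attribute names, the item ids and the company names are all made up for this example; the slides do not show the actual input format.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical XML layout: the real input format is not shown in the slides.
ENTITY_XML = """
<entities>
  <entity type="http://psi.hafslund.no/sesam/ifs/Customer" name="Kunde">
    <item id="customer-1">
      <variant>Acme AS</variant>
      <variant>Acme Norge</variant>
    </item>
    <item id="customer-2">
      <variant>Nordlys Kraft</variant>
    </item>
  </entity>
</entities>
"""


def load_entity_lists(xml_text):
    """Return {lowercased variant: (display name, item id)}."""
    lookup = {}
    root = ET.fromstring(xml_text)
    for entity in root.findall("entity"):
        display_name = entity.get("name")
        for item in entity.findall("item"):
            for variant in item.findall("variant"):
                key = variant.text.strip().lower()
                lookup[key] = (display_name, item.get("id"))
    return lookup


def extract_entities(text, lookup):
    """Find every listed variant in the text (case-insensitive, whole words)."""
    found = []
    for variant, (display_name, item_id) in lookup.items():
        pattern = r"\b" + re.escape(variant) + r"\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            found.append((match.start(), display_name, item_id))
    return sorted(found)


LOOKUP = load_entity_lists(ENTITY_XML)
HITS = extract_entities("Invoice for Acme Norge, copy to ACME AS.", LOOKUP)
```

Both spellings in the sample sentence resolve to the same item, which is the many-to-one behaviour, and each hit carries the custom display name “Kunde” rather than the full entity URI.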

Rule-Based Classification

• Smart tagging
  – Tag incoming documents based on complex search expressions

• Workflow rules
  – Category triggered
  – Target review state, batch, coding pattern

• Categorization result rules
  – Action based on score range

• Policy
  – Publish, Remediate, Move, ...
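A categorization result rule of the "action based on score range" kind can be pictured roughly as follows. This is only a sketch: the `ScoreRule` class, the category names, the thresholds and the action names are invented for illustration and do not come from the product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoreRule:
    """Hypothetical rule: run `action` when the score for `category`
    falls in the half-open range [min_score, max_score)."""
    category: str
    min_score: float
    max_score: float
    action: str


# Illustrative rules only; thresholds and action names are assumptions.
RULES = [
    ScoreRule("contract", 0.80, 1.01, "publish"),
    ScoreRule("contract", 0.50, 0.80, "send_to_review"),
    ScoreRule("personal_data", 0.70, 1.01, "remediate"),
]


def actions_for(scores, rules=RULES):
    """Return the actions triggered by a document's category scores."""
    actions = []
    for rule in rules:
        score = scores.get(rule.category)
        if score is not None and rule.min_score <= score < rule.max_score:
            actions.append(rule.action)
    return actions
```

Calling `actions_for({"contract": 0.6})` selects the review action, while a score below every range triggers nothing.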

PLSA overview

• “Unsupervised”
  – Document collection only (“inherent statistics”)
  – Used for concept search (“auto-extraction”)
  – No human required

• “Supervised”
  – By example, seed sets
  – For categorization, tagging, annotation, etc.
  – Human required

Search as Statistical Inference

• Document in bag-of-words representation

[Figure: a document shown as a bag of words: US, Disney, economic, intellectual property, relations, Beijing, human rights, free, negotiations, imports. Label: China US trade relations]

• How probable is it that terms like “China” or “trade” might occur?

• Additional index terms can be added automatically via statistical inference!
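In the standard PLSA aspect model this question has a compact form. The slides do not give the formula, so the notation here (d for a document, w for a term, z for a latent concept) is an assumption taken from the general PLSA literature rather than from Recommind's documentation:

```latex
P(w \mid d) = \sum_{z} P(w \mid z)\, P(z \mid d)
```

Under this model a term such as “China” can receive a non-zero probability for a document in which it never occurs, as long as the document leans towards a concept z under which that term is likely. Terms with a high enough P(w | d) are the candidates for automatically added index terms.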

Estimation via PLSA

[Figure: documents are linked to terms through latent concepts. Example concepts and their terms: TRADE (economic, imports, trade); CHINA (china, beijing)]

Concept expression probabilities are estimated based on all documents that deal with a concept.

“Unmixing” of superimposed concepts is achieved by a statistical learning algorithm.

Conclusion: no prior knowledge about concepts is required; context and term co-occurrences are exploited.
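To make the estimation step concrete, here is a minimal pure-Python sketch of PLSA fitted with the usual EM updates. It is not Recommind's implementation: the function name `plsa`, the toy vocabulary, the count matrix and the choice of two concepts are all assumptions made for illustration.

```python
import random


def plsa(counts, n_topics, n_iter=50, seed=0):
    """Fit PLSA by EM. counts[d][w] is the count of term w in document d.
    Returns (p_w_given_z, p_z_given_d) as lists of probability rows."""
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])

    def normalized(row):
        total = sum(row)
        return [x / total for x in row]

    # Random positive starting points, each row normalized to sum to 1.
    p_w_z = [normalized([rng.random() + 0.01 for _ in range(n_words)])
             for _ in range(n_topics)]
    p_z_d = [normalized([rng.random() + 0.01 for _ in range(n_topics)])
             for _ in range(n_docs)]

    for _ in range(n_iter):
        # Tiny constants keep every probability strictly positive.
        new_w_z = [[1e-12] * n_words for _ in range(n_topics)]
        new_z_d = [[1e-12] * n_topics for _ in range(n_docs)]
        for d in range(n_docs):
            for w in range(n_words):
                if counts[d][w] == 0:
                    continue
                # E-step: posterior P(z | d, w), up to normalization.
                joint = [p_w_z[z][w] * p_z_d[d][z] for z in range(n_topics)]
                total = sum(joint)
                # M-step accumulation: expected counts per concept.
                for z in range(n_topics):
                    expected = counts[d][w] * joint[z] / total
                    new_w_z[z][w] += expected
                    new_z_d[d][z] += expected
        p_w_z = [normalized(row) for row in new_w_z]
        p_z_d = [normalized(row) for row in new_z_d]
    return p_w_z, p_z_d


# Toy data (made up): rows are documents, columns follow VOCAB.
VOCAB = ["china", "beijing", "trade", "economic", "imports"]
COUNTS = [
    [3, 2, 0, 0, 0],
    [2, 3, 0, 0, 0],
    [0, 0, 3, 2, 2],
    [0, 0, 2, 3, 1],
    [2, 1, 2, 1, 1],
]
P_W_Z, P_Z_D = plsa(COUNTS, n_topics=2)


def p_word_given_doc(w, d):
    """P(w | d) as the mixture over latent concepts."""
    return sum(P_W_Z[z][w] * P_Z_D[d][z] for z in range(len(P_W_Z)))
```

No concept labels are supplied anywhere: the only input is the term counts, which matches the point that context and term co-occurrences are what the model exploits. Which concept index ends up capturing which group of terms depends on the random initialization.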

Automatically generated concept groups

Ship                     Securities               India
ship        109.41212    securities   94.96324    india        91.74842
coast        93.70902    firm         88.74591    singh        50.34063
guard        82.11109    drexel       78.33697    militants    49.21986
sea          77.45868    investment   75.51504    gandhi       48.86809
boat         75.97172    bonds        64.23486    sikh         47.12099
fishing      65.41328    sec          61.89292    indian       44.29306
vessel       64.25243    bond         61.39895    peru         43.00298
tanker       62.55056    junk         61.14784    hindu        42.79652
spill        60.21822    milken       58.72266    lima         41.87559
exxon        58.35260    firms        51.26381    kashmir      40.01138
boats        54.92072    investors    48.80564    tamilnadu    39.54702
waters       53.55938    lynch        44.91865    killed       39.47202
valdez       51.53405    insider      44.88536    india's      39.25983
alaska       48.63269    shearson     43.82692    punjab       39.22486
ships        46.95736    boesky       43.74837    delhi        38.70990
port         46.56804    lambert      40.77679    temple       38.38197
hazelwood    44.81608    merrill      40.14225    shining      37.62768
vessels      43.80310    brokerage    39.66526    menem        35.42235
ferry        42.79100    corporate    37.94985    hindus       34.88001
fishermen    41.65175    burnham      36.86570    violence     33.87917

(Sample aspect lists from AP data, 100-Aspect Model)

Hafslund SESAM search

Demo

What is nice from the technical side?

• Connectors, and a framework for developing them

• SSO and integration with Kerberos, AD, OpenSSO

• Custom and OOTB authentication integration

• Query API and OOTB XSLT framework

• Extensible set of parsers, including OCR

Nice!

• Extremely flexible index structure
  – We take in everything
  – We can decide later what is to be shown

• Publishing data in external systems
  – Can be used as a data mining tool

• PLSA: Categorization and concept extraction

• Transparency for Java developers
  – Everything is exposed as RMI calls over port 1099

• Small company, quick response

Sad parts

• Sessions in Query API
  – No sessions, no security

• There is no direct content push

• Relatively heavy taxonomies

• Licenses
  – Basic license does not include entity extraction and rule-based classification

• Notable hardware requirements

Overall impression

• Nice and capable enterprise search solution

• Not dependent on a vendor
  – Easy to integrate in a heterogeneous environment

• Can be used for search and data mining

• PLSA and categorization opportunities are extremely promising

Petr Vasilev

Questions? Answers!


Date post:	15-Jan-2015