Date post: | 15-Jan-2015 |
Upload: | yonyonson |
Petr Vasilev
Consultant, Integration and Search Department, Bouvet ASA
Overview of Recommind Decisiv Search
Search engines in Norway
• Long tradition of working with enterprise and web search engines
• Most popular solutions:
  – Microsoft/FAST
  – Apache Solr
• But the world is bigger than that!
Recommind
• Founded in 2000
• Headquartered in San Francisco; offices in London, Boston, Sydney and Bonn
• #157 on 2010 Deloitte Fast 500
  – Fastest growing eDiscovery company
  – Fastest growing information access company
  – 230+ employees
Product Families
CORE
• Fully powered search engine
• Analytics:
  – Concept extraction (probabilistic latent semantic indexing, pLSI)
  – Conceptual search
  – Categorization engine (pSVM with highest empirical performance)
  – Workflow rules and smart tagging
  – Duplicate detection
  – Near-duplicate computation
  – Thread extraction
  – Email footer detection
Architecture overview
Enrichment and categorization
API and Hooks
• CORE API
  – Crawler extensions
  – Connector SDK
  – Post-processor SDK
• Planned
  – Parser API
  – Custom MIME type detection
• CORE extensions
  – JAAS security SDK
  – Callback API
  – Storage SDK
• Custom reports
Entity extraction
• Based on generated lists
• Custom names can be bound to entities
  – http://psi.hafslund.no/sesam/ifs/Customer is named “Kunde”
• Takes in simple XML for names and entity lists
• Many-to-one matching is possible
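A minimal sketch of list-based entity extraction with many-to-one matching, to make the bullets above concrete. The slides do not show the XML format Recommind accepts, so the element and attribute names below (`entities`, `entity`, `term`, `id`, `name`) and the term list are assumptions made for illustration only.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical XML layout; the real schema is not shown in the slides.
ENTITY_XML = """
<entities>
  <entity id="http://psi.hafslund.no/sesam/ifs/Customer" name="Kunde">
    <term>customer</term>
    <term>client</term>
    <term>kunde</term>
  </entity>
</entities>
"""


def load_entities(xml_text):
    """Return (term -> entity id, entity id -> custom display name)."""
    root = ET.fromstring(xml_text.strip())
    term_to_entity = {}
    display_names = {}
    for entity in root.findall("entity"):
        entity_id = entity.get("id")
        display_names[entity_id] = entity.get("name")
        # Many-to-one: several surface terms point at the same entity.
        for term in entity.findall("term"):
            term_to_entity[term.text.strip().lower()] = entity_id
    return term_to_entity, display_names


def extract_entities(text, term_to_entity, display_names):
    """Return the sorted display names of all entities mentioned in text."""
    tokens = re.findall(r"\w+", text.lower())
    found = {term_to_entity[t] for t in tokens if t in term_to_entity}
    return sorted(display_names[e] for e in found)


TERMS, NAMES = load_entities(ENTITY_XML)
print(extract_entities("The client called; the customer was happy.", TERMS, NAMES))
# prints ['Kunde']
```

Both "client" and "customer" resolve to the same entity, which is reported once under its custom name.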
Rule-Based Classification
• Smart tagging
  – Tag incoming documents based on complex search expressions
• Workflow rules
  – Category triggered
  – Target review state, batch, coding pattern
• Categorization result rules
  – Action based on score range
• Policy
  – Publish, Remediate, Move, ...
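A categorization result rule of the "action based on score range" kind could look roughly like the sketch below. The thresholds, the assumption that scores lie between 0 and 1, and the mapping of ranges to the policy names from the slide are all hypothetical; the slides do not describe how the product configures such rules.

```python
# Hypothetical rule table: (minimum score, action), highest threshold first.
RULES = [
    (0.80, "publish"),
    (0.50, "move"),
    (0.00, "remediate"),
]


def action_for(score, rules=RULES):
    """Return the action of the first rule whose minimum score is reached."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score out of range: {score}")
    for min_score, action in rules:
        if score >= min_score:
            return action
    raise ValueError("no rule matched")  # unreachable while a 0.0 rule exists


for score in (0.93, 0.61, 0.12):
    print(score, "->", action_for(score))
```

Keeping the rules as an ordered table means a new range can be added without touching the lookup logic.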
PLSA overview
• “Unsupervised”
  – Document collection only (“inherent statistics”)
  – Used for concept search (“auto-extraction”)
  – No human required
• “Supervised”
  – By example, seed sets
  – For categorization, tagging, annotation, etc.
  – Human required
Search as Statistical Inference
• Document in bag-of-words representation
[Figure: a document shown as a bag of words (US, Disney, economic, intellectual property, relations, Beijing, human rights, free, negotiations, imports), labeled “China US trade relations”]
How probable is it that terms like “China” or “trade” might occur?
Additional index terms can be added automatically via statistical inference!
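The inference step can be sketched with the standard PLSA mixture, P(w|d) = Σ_z P(w|z) P(z|d): a term that never occurs in a document still gets a probability through the concepts the document expresses. The probability tables, the threshold and the function names below are toy assumptions, not values or APIs from the product.

```python
# Hypothetical, hand-picked probabilities for two latent concepts.
P_WORD_GIVEN_CONCEPT = {
    "TRADE": {"trade": 0.4, "imports": 0.3, "economic": 0.3},
    "CHINA": {"china": 0.6, "beijing": 0.4},
}
# Hypothetical concept mixture of one document.
P_CONCEPT_GIVEN_DOC = {"TRADE": 0.5, "CHINA": 0.5}


def term_probabilities(p_concept_given_doc, p_word_given_concept):
    """P(w|d) = sum over concepts z of P(w|z) * P(z|d)."""
    probs = {}
    for concept, p_z in p_concept_given_doc.items():
        for word, p_w in p_word_given_concept[concept].items():
            probs[word] = probs.get(word, 0.0) + p_z * p_w
    return probs


def inferred_terms(doc_terms, probs, threshold):
    """Terms absent from the document but probable enough to index."""
    return sorted(w for w, p in probs.items()
                  if p >= threshold and w not in doc_terms)


doc_terms = {"imports", "economic", "beijing"}
probs = term_probabilities(P_CONCEPT_GIVEN_DOC, P_WORD_GIVEN_CONCEPT)
print(inferred_terms(doc_terms, probs, threshold=0.18))
# prints ['china', 'trade']
```

The document never mentions "china" or "trade", yet both pass the threshold because the document's concepts make them likely.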
Estimation via PLSA
[Figure: documents linked to terms through latent concepts, e.g. TRADE (economic, imports, trade) and CHINA (china, beijing)]
Concept expression probabilities are estimated from all documents that deal with a concept.
“Unmixing” of superimposed concepts is achieved by a statistical learning algorithm.
Conclusion: no prior knowledge about concepts is required; context and term co-occurrences are exploited.
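For readers who want to see the estimation step, here is a small textbook-style EM loop for PLSA in plain Python. It is a sketch of the general algorithm, not Recommind's implementation; the function name, parameters and the four toy documents are assumptions made for this example.

```python
import random


def plsa(counts, n_concepts, n_iter=50, seed=0):
    """Fit PLSA with EM.

    counts: one dict per document mapping word -> occurrence count.
    Returns (p_w_z, p_z_d) with p_w_z[z][w] = P(w|z), p_z_d[d][z] = P(z|d).
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in counts for w in doc})

    def normalized(values):
        total = sum(values)
        return [v / total for v in values]

    # Random starting point; EM only finds a local optimum.
    p_w_z = [dict(zip(vocab, normalized([rng.random() + 0.1 for _ in vocab])))
             for _ in range(n_concepts)]
    p_z_d = [normalized([rng.random() + 0.1 for _ in range(n_concepts)])
             for _ in counts]

    for _ in range(n_iter):
        new_w_z = [dict.fromkeys(vocab, 0.0) for _ in range(n_concepts)]
        new_z_d = [[0.0] * n_concepts for _ in counts]
        for d, doc in enumerate(counts):
            for w, n in doc.items():
                # E-step: posterior P(z|d,w), proportional to P(z|d) * P(w|z).
                joint = [p_z_d[d][z] * p_w_z[z][w] for z in range(n_concepts)]
                total = sum(joint)
                if total == 0.0:
                    continue
                for z in range(n_concepts):
                    weight = n * joint[z] / total
                    new_w_z[z][w] += weight
                    new_z_d[d][z] += weight
        # M-step: renormalize the expected counts into probabilities.
        for z in range(n_concepts):
            total = sum(new_w_z[z].values())
            if total > 0.0:
                p_w_z[z] = {w: v / total for w, v in new_w_z[z].items()}
        for d in range(len(counts)):
            total = sum(new_z_d[d])
            if total > 0.0:
                p_z_d[d] = [v / total for v in new_z_d[d]]
    return p_w_z, p_z_d


# Toy corpus (hypothetical): two trade documents, two shipping documents.
DOCS = [
    {"trade": 3, "imports": 2, "economic": 1},
    {"trade": 2, "economic": 2, "china": 1},
    {"ship": 3, "coast": 2, "vessel": 1},
    {"ship": 2, "vessel": 2, "port": 1},
]
P_W_Z, P_Z_D = plsa(DOCS, n_concepts=2)
for z, dist in enumerate(P_W_Z):
    top = sorted(dist, key=dist.get, reverse=True)[:3]
    print("concept", z, "top terms:", top)
```

No labels are supplied anywhere: the concept groups emerge only from which terms co-occur in which documents, which is the "unmixing" idea from the slide.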
Automatically generated concept groups
Ship
ship 109.41212
coast 93.70902
guard 82.11109
sea 77.45868
boat 75.97172
fishing 65.41328
vessel 64.25243
tanker 62.55056
spill 60.21822
exxon 58.35260
boats 54.92072
waters 53.55938
valdez 51.53405
alaska 48.63269
ships 46.95736
port 46.56804
hazelwood 44.81608
vessels 43.80310
ferry 42.79100
fishermen 41.65175
Securities
securities 94.96324
firm 88.74591
drexel 78.33697
investment 75.51504
bonds 64.23486
sec 61.89292
bond 61.39895
junk 61.14784
milken 58.72266
firms 51.26381
investors 48.80564
lynch 44.91865
insider 44.88536
shearson 43.82692
boesky 43.74837
lambert 40.77679
merrill 40.14225
brokerage 39.66526
corporate 37.94985
burnham 36.86570
India
india 91.74842
singh 50.34063
militants 49.21986
gandhi 48.86809
sikh 47.12099
indian 44.29306
peru 43.00298
hindu 42.79652
lima 41.87559
kashmir 40.01138
tamilnadu 39.54702
killed 39.47202
india's 39.25983
punjab 39.22486
delhi 38.70990
temple 38.38197
shining 37.62768
menem 35.42235
hindus 34.88001
violence 33.87917
(Sample aspect lists from AP data, 100-Aspect Model)
Hafslund SESAM search
Demo
What is nice from the technical side?
• Connectors! And a framework for developing them
• SSO and integration with Kerberos, AD, OpenSSO
• Custom and OOTB authentication integration
• Query API and OOTB XSLT framework
• Extensible set of parsers, including OCR
Nice!
• Extremely flexible index structure
  – We take in everything
  – We can decide later what is to be shown
• Publishing data to external systems
  – You can use it as a data mining tool
• PLSA: categorization and concept extraction
• Transparency for Java developers
  – Everything is exposed as RMI calls over port 1099
• Small company, quick response
Sad parts
• Sessions in Query API
  – No sessions, no security
• There is no direct content push
• Relatively heavy taxonomies
• Licenses
  – Basic license does not include entity extraction and rule-based classification
• Notable hardware requirements
Overall impression
• Nice and capable enterprise search solution
• Not dependent on a vendor
  – Easy to integrate in a heterogeneous environment
• Can be used for search and data mining
• PLSA and categorization opportunities are extremely promising