Large Scale Search, Discovery and Analytics in Action


Grant Ingersoll, Chief Scientist

September 18, 2012

Confidential and Proprietary © 2012 LucidWorks

Search is Dead, Long Live Search

• Good keyword search is a commodity and easy to get up and running

• The Bar is Raised

  • Relevance is (always will be?) hard

• Holistic view of the data AND the users is critical

• Search, Discovery and Analytics are the key to unlocking this view of users and data

[Diagram: Documents, User Interaction, Access, Content Relationships]


Topics

• Background and needs

• Architecture

• Road Ahead

• SDA In Action

  • Components

  • Challenges and Lessons Learned

• Wrap Up


Why Search, Discovery and Analytics (SDA)?

• User Needs

  • Real-time, ad hoc access to content

  • Aggressive Prioritization based on Importance

  • Serendipity

  • Feedback/Learning from past

• Business Needs

  • Deeper insight into users

  • Leverage existing internal knowledge

  • Cost effective

[Diagram: Search, Discovery, Analytics]


Sample Use Cases

• Claims processing and analysis, including fraud analysis

• Large scale content acquisition and access for:

  • Defense, intelligence and pharmaceutical applications

• Views of data surrounding natural disasters and other tragedies for research, archiving and therapeutic purposes

• Analysis of Website and social media interactions

• Access and processing of genetic information for improved medical treatments

• Log processing and fraud detection in telecommunications



In Focus: Personalized Medicine

[Diagram: Genetic Variations, Patient DNA, Alignment and other analysis, Search and Faceting, Standard Therapies, Alternative Therapies]


In Focus: Log Processing in Telecommunications

• Each year, large sums of money are lost due to fraudulent calls and poor service

• Logs are usually semi-structured and contain vital information about errors and fraud

• Deeper batch analytics can provide insight into patterns across vast amounts of data

• Search of call and network information (via logs) is critical to providing deeper analysis and understanding of these errors and fraudulent activities
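As a rough illustration of the log-processing idea above, the following Python sketch parses semi-structured call records and counts failed calls per caller. The key=value log format, the field names (`caller`, `status`) and the threshold are hypothetical, invented for this example; real telecom logs and fraud rules will differ, and at scale this kind of counting would run as a batch job rather than in a single process.

```python
import re
from collections import Counter

# Hypothetical key=value log format; real call records will look different.
SAMPLE_LOGS = [
    "2012-09-18T10:00:01 caller=555-0100 callee=555-0199 status=OK duration=62",
    "2012-09-18T10:00:05 caller=555-0101 callee=555-0142 status=FAILED duration=0",
    "2012-09-18T10:00:09 caller=555-0101 callee=555-0143 status=FAILED duration=0",
    "malformed line without fields",
    "2012-09-18T10:00:12 caller=555-0101 callee=555-0144 status=FAILED duration=0",
]

FIELD = re.compile(r"(\w+)=(\S+)")


def parse_line(line):
    """Return the key=value fields of one log line as a dict (empty if none)."""
    return dict(FIELD.findall(line))


def failed_calls_by_caller(lines):
    """Count FAILED records per caller, skipping lines that lack the fields."""
    counts = Counter()
    for line in lines:
        fields = parse_line(line)
        if fields.get("status") == "FAILED" and "caller" in fields:
            counts[fields["caller"]] += 1
    return counts


def suspicious_callers(lines, threshold=3):
    """Callers with at least `threshold` failed calls (threshold is arbitrary)."""
    counts = failed_calls_by_caller(lines)
    return sorted(caller for caller, n in counts.items() if n >= threshold)


if __name__ == "__main__":
    print(suspicious_callers(SAMPLE_LOGS))
```

The same parsed fields could also be indexed for search, so that an analyst can drill from an aggregate count down to the individual call records.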



What Does an SDA Platform Need?

• Fast, efficient, scalable search

  • Bulk and Near Real Time Indexing

  • Handle billions of records with sub-second search and faceting

• Large scale, cost effective storage and processing capabilities

  • Need whole data consumption and analysis

  • Experimentation/Sampling tools

• NLP and machine learning tools that scale to enhance discovery and analysis


Architecture


Under the Hood

• Lucene/Solr 4.0-dev

• Sharded with SolrCloud

  • 1 second (default) soft commits for NRT updates

  • 1 minute (default) hard commits (no searcher reopen)

  • Transaction logs for recovery

  • Solr takes care of leader election, etc. so no more master/slave

• RESTful services built on Restlet 2.1

• Service Discovery, load balancing, failover enabled via ZooKeeper + Netflix Curator

• Authentication and authorization over SSL (optional)

• Proxies for LucidWorks and WebHDFS API

• Workflow engine coordinates data flow

LucidWorks 2.1 SDA Engine
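
The commit timings listed on this slide normally live in Solr's `solrconfig.xml`. The slides do not show the actual configuration, so the Python sketch below only generates an illustrative `<updateHandler>` fragment with a 1 second soft commit, a 1 minute hard commit that does not open a new searcher, and a transaction log. The element names follow the stock Solr 4 example configuration; the function name and its parameters are assumptions made for this example.

```python
import xml.etree.ElementTree as ET


def update_handler_config(soft_commit_ms=1000, hard_commit_ms=60000):
    """Build an illustrative solrconfig.xml <updateHandler> fragment."""
    handler = ET.Element("updateHandler", {"class": "solr.DirectUpdateHandler2"})

    # Transaction log, used for recovery.
    update_log = ET.SubElement(handler, "updateLog")
    log_dir = ET.SubElement(update_log, "str", {"name": "dir"})
    log_dir.text = "${solr.ulog.dir:}"

    # Hard commit: flush to stable storage without opening a new searcher.
    auto_commit = ET.SubElement(handler, "autoCommit")
    ET.SubElement(auto_commit, "maxTime").text = str(hard_commit_ms)
    ET.SubElement(auto_commit, "openSearcher").text = "false"

    # Soft commit: make recent updates visible to searches (near real time).
    auto_soft = ET.SubElement(handler, "autoSoftCommit")
    ET.SubElement(auto_soft, "maxTime").text = str(soft_commit_ms)

    return ET.tostring(handler, encoding="unicode")


if __name__ == "__main__":
    print(update_handler_config())
```

Keeping the hard commit from opening a searcher is what lets the frequent soft commits carry visibility while the less frequent hard commits carry durability.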


Under the Hood, cont.

• Apache Hadoop

  • Map-Reduce (MR) jobs for ETL and bulk indexing into SolrCloud sharded system

  • Leverage Pig and custom MR jobs for log processing and metric calculation

  • WebHDFS

• Apache Mahout

  • K-Means Clustering

  • Statistically Interesting Phrases

  • Similar Docs

  • More to come

• Apache HBase

  • Key-value and time series of all calculated metrics

  • Document storage

• Apache Pig

  • ETL

  • Log analysis -> HBase

• Apache ZooKeeper

  • Netflix Curator for service discovery and higher level ZK client

• Apache Kafka

  • Pub-sub for collecting logs from LucidWorks into HDFS


The Road Ahead

• Our approach is from search and discovery outwards to analytics

  • Analytics in beta are focused around analysis of search logs

• Apache Hive support

• Analytics Themes

  • Relevance

  • Data quality

  • Discovery

  • Experiment Management

• Machine Learning

  • Classification

  • Recommendations

• Natural Language Processing

• Incorporate latest LucidWorks/Solr


Computation and Storage

LucidWorks Search/Solr

• Document Index

• Faceting

• SolrCloud makes sharding easy

Hadoop

• Stores Logs, Raw files, intermediate files, etc.

• WebHDFS

• Small files are an unnatural act

HBase

• Metric Storage

• User Histories/Profile

• Document Storage

Challenges

• Who is the authoritative store?

• Real time vs. Batch

• Where should analysis be done?


Search In Practice

• Three primary concerns

  • Performance/Scaling

  • Relevance

  • Operations: monitoring, failover, etc.

• Business typically cares more about relevance

• Devs care more about performance at first…


Search: Relevance

• Always Be Testing

• Experiment management is critical

• Top X + sampling

• Click Logs

• Track Everything!

  • Queries

  • Clicks

  • Displayed Documents

  • Mouse/Scroll tracking?

• Phrases are your friends
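
To make "Track Everything" concrete, here is a minimal Python sketch of a tracked search event and one metric derived from it: the share of searches per query that received at least one click. The `SearchEvent` record and its field names are assumptions made for illustration, not a schema from the talk.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class SearchEvent:
    """One logged search: the query, what was displayed, and what was clicked."""
    query: str
    displayed: list
    clicked: list = field(default_factory=list)


def click_through_rate(events):
    """Fraction of searches per query that got at least one click."""
    searches = defaultdict(int)
    with_click = defaultdict(int)
    for event in events:
        searches[event.query] += 1
        if event.clicked:
            with_click[event.query] += 1
    return {query: with_click[query] / searches[query] for query in searches}


if __name__ == "__main__":
    log = [
        SearchEvent("solr", ["d1", "d2"], ["d2"]),
        SearchEvent("solr", ["d1", "d2"]),
        SearchEvent("hbase", ["d3"]),
    ]
    print(click_through_rate(log))
```

Queries with many searches and few clicks are natural candidates for the "Top X + sampling" relevance reviews mentioned above.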


Discovery Components

Serendipity

• Trends

• Topics

• Recommendations

• Related Items

• More Like This

• Did you mean?

• Statistically Interesting Phrases

Organization

• Importance

• Clustering

• Classification

• Named Entities

• Time Factors

• Faceting

Data Quality

• Document factor distributions

  • Length

  • Boosts

• Duplicates

Challenges

• Many of these are computationally intensive or iterative

• Many are subjective and require a lot of experimentation
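
Two of the data quality checks above, duplicates and length distributions, are simple enough to sketch in a few lines of Python. This is only an illustration under stated assumptions: duplicates are detected by hashing whitespace-normalized, lowercased text (exact duplicates only, not near-duplicates), and `docs` is a hypothetical mapping of document id to text.

```python
import hashlib
import statistics
from collections import defaultdict


def normalize(text):
    """Lowercase and collapse whitespace so trivial variants hash the same."""
    return " ".join(text.lower().split())


def find_duplicates(docs):
    """Group document ids whose normalized text is identical."""
    groups = defaultdict(list)
    for doc_id, text in docs.items():
        digest = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
        groups[digest].append(doc_id)
    return sorted(sorted(ids) for ids in groups.values() if len(ids) > 1)


def length_summary(docs):
    """Mean and median document length in whitespace-separated tokens."""
    lengths = [len(text.split()) for text in docs.values()]
    return {"mean": statistics.mean(lengths), "median": statistics.median(lengths)}


if __name__ == "__main__":
    docs = {
        "a": "Hello   World",
        "b": "hello world",
        "c": "something else entirely",
    }
    print(find_duplicates(docs))
    print(length_summary(docs))
```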


Discovery with Mahout

• Mahout’s 3 “C”s provide tools for helping across many aspects of discovery

  • Collaborative Filtering

  • Classification

  • Clustering

• Also:

  • Collocations (Statistically Interesting Phrases)

  • Singular Value Decomposition (SVD)

  • Others

• Challenges:

  • High cost of iterative machine learning algorithms

  • Mahout is very command-line oriented

  • Some areas less mature
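
Collocations are commonly scored with a log-likelihood ratio (LLR) test over a 2x2 table of bigram counts, which is the approach Mahout's collocation tooling is documented as using. The Python sketch below shows that calculation on its own; it is not Mahout code, and the function names are chosen for this example.

```python
import math


def x_log_x(x):
    """x * ln(x), with the convention that 0 * ln(0) is 0."""
    return x * math.log(x) if x > 0 else 0.0


def entropy(*counts):
    """Unnormalized Shannon entropy of a set of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def log_likelihood_ratio(k11, k12, k21, k22):
    """
    LLR score for a candidate two-word phrase "A B".

    k11: bigrams that are exactly (A, B)
    k12: bigrams starting with A but not followed by B
    k21: bigrams ending with B but not starting with A
    k22: all other bigrams
    """
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    matrix = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating point rounding.
    return max(0.0, 2.0 * (row + col - matrix))


if __name__ == "__main__":
    # Words that co-occur no more often than chance score near zero.
    print(log_likelihood_ratio(1, 1, 1, 1))
    # Words that always appear together score high.
    print(log_likelihood_ratio(10, 0, 0, 10))
```

Higher scores indicate that the two words occur together more often than their individual frequencies would predict, which is what makes a phrase "statistically interesting".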


Aside: Experiment Management

• Plan for running experiments from the beginning across Search and Discovery components

  • Your engine should help!

• Types of Experiments to consider

  • Indexing/Analysis

  • Query parsing

  • Scoring formulas

  • Machine Learning Models

  • Recommendations, many more

• Make it easy to do A/B testing across all experiments and compare and contrast the results
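
One common way to make A/B testing easy is to assign each user to a variant deterministically, so the same user always sees the same treatment and logs can be joined back to the variant later. The Python sketch below hashes a user id together with an experiment name; the function name, the variant labels and the experiment names are hypothetical, and the talk does not say how LucidWorks assigns variants.

```python
import hashlib


def assign_variant(user_id, experiment, variants=("A", "B")):
    """Deterministically map a user to one variant of one experiment."""
    key = f"{experiment}:{user_id}".encode("utf-8")
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % len(variants)
    return variants[bucket]


if __name__ == "__main__":
    for user in ("u1", "u2", "u3"):
        print(user, assign_variant(user, "scoring-formula-test"))
```

Including the experiment name in the hash keeps assignments independent across experiments, so a user who lands in one variant of a scoring test is not automatically in the matching variant of a query parsing test.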


Analytics in Practice

• Many of the components discussed provide analytical features

  • Leverage existing tools: R, etc.

• Simple Counts:

  • Facets

  • Term and Document frequencies

  • Clicks

• Search and Discovery example metrics

  • Relevance measures like Mean Reciprocal Rank

  • Histograms/Drilldowns around Number of Results

  • Log and navigation analysis

• Data cleanliness analysis is helpful for finding potential issues in content
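
Mean Reciprocal Rank (MRR), mentioned above, averages 1/rank of the first relevant result over a set of queries, counting a query with no relevant result as 0. A minimal Python sketch follows; the input shape (ranked document ids paired with a set of relevant ids, for example clicked documents taken from logs) is an assumption for this example.

```python
def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant result, or 0.0 if none is relevant."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(queries):
    """Average reciprocal rank over (ranked_ids, relevant_ids) pairs."""
    if not queries:
        return 0.0
    total = sum(reciprocal_rank(ranked, relevant) for ranked, relevant in queries)
    return total / len(queries)


if __name__ == "__main__":
    queries = [
        (["d1", "d2", "d3"], {"d1"}),  # first result relevant: 1.0
        (["d4", "d5"], {"d5"}),        # second result relevant: 0.5
        (["d6"], set()),               # nothing relevant: 0.0
    ]
    print(mean_reciprocal_rank(queries))
```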


Wrap

• Search, Discovery and Analytics, when combined into a single, coherent system, provide powerful insight into both your content and your users

• LucidWorks has combined many of these things into LucidWorks Big Data

  • http://www.lucidworks.com/products/lucidworks-big-data

• Design for the big picture when building search-based applications



Discussion and Resources

• Questions?

• http://www.lucidworks.com

• [email protected]

• @gsingers


Date post:	27-Jan-2015
Category:	Technology
Upload:	grant-ingersoll