Date post: | 27-Jan-2015 |
Category: | Technology |
Upload: | grant-ingersoll |
Confidential © Copyright 2012
Large Scale Search, Discovery and Analysis in Action
Grant Ingersoll, Chief Scientist
September 18, 2012
Confidential and Proprietary © 2012 LucidWorks
Search is Dead, Long Live Search
• Good keyword search is a commodity and easy to get up and running
• The Bar is Raised
  • Relevance is (always will be?) hard
• Holistic view of the data AND the users is critical
• Search, Discovery and Analytics are the key to unlocking this view of users and data
[Diagram: Documents, User Interaction, Access, Content Relationships]
Topics
• Background and needs
• Architecture
• Road Ahead
• SDA In Action
  • Components
  • Challenges and Lessons Learned
• Wrap Up
Why Search, Discovery and Analytics (SDA)?
• User Needs
  • Real-time, ad hoc access to content
  • Aggressive Prioritization based on Importance
  • Serendipity
  • Feedback/Learning from past
• Business Needs
  • Deeper insight into users
  • Leverage existing internal knowledge
  • Cost effective
[Diagram: Search, Discovery, Analytics]
Sample Use Cases
• Claims processing and analysis, including fraud analysis
• Large scale content acquisition and access for defense, intelligence and pharmaceutical applications
• Views of data surrounding natural disasters and other tragedies for research, archiving and therapeutic purposes
• Analysis of Website and social media interactions
• Access and processing of genetic information for improved medical treatments
• Log processing and fraud detection in telecommunications
In Focus: Personalized Medicine
[Diagram: Genetic Variations, Patient DNA, Alignment and other analysis, Search and Faceting, Standard Therapies, Alternative Therapies]
In Focus: Log Processing in Telecommunications
• Each year, large sums of money are lost due to fraudulent calls and poor service
• Logs are usually semi-structured and contain vital information about errors and fraud
• Deeper batch analytics can provide insight into patterns across vast amounts of data
• Search of call and network information (via logs) is critical to providing deeper analysis and understanding of these errors and fraudulent activities
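As a rough illustration of pulling structure out of semi-structured logs, the sketch below parses key=value call records and counts failed calls per caller. The log layout, the field names (caller, callee, duration, status) and the FAILED marker are hypothetical and chosen only for the example; real telecom logs will differ.

```python
import re
from collections import Counter

# Matches key=value pairs such as "caller=555-0100" (hypothetical layout).
KEY_VALUE = re.compile(r"(\w+)=(\S+)")


def parse_line(line):
    """Split a log line into a timestamp plus a dict of key=value fields."""
    timestamp, _, rest = line.partition(" ")
    record = dict(KEY_VALUE.findall(rest))
    record["timestamp"] = timestamp
    return record


def failures_by_caller(lines):
    """Count records whose status field is FAILED, grouped by caller."""
    counts = Counter()
    for line in lines:
        record = parse_line(line.strip())
        if record.get("status") == "FAILED":
            counts[record.get("caller", "unknown")] += 1
    return counts


sample_log = [
    "2012-09-18T10:15:00 caller=555-0100 callee=555-0199 duration=42 status=OK",
    "2012-09-18T10:16:10 caller=555-0100 callee=555-0142 duration=0 status=FAILED",
    "2012-09-18T10:17:30 caller=555-0177 callee=555-0199 duration=0 status=FAILED",
    "2012-09-18T10:18:05 caller=555-0100 callee=555-0150 duration=0 status=FAILED",
]
failed = failures_by_caller(sample_log)
```

At scale the same aggregation would run as a batch job, with the parsed records also indexed so that individual calls remain searchable.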
What Does an SDA Platform Need?
• Fast, efficient, scalable search
  • Bulk and Near Real Time Indexing
  • Handle billions of records with sub-second search and faceting
• Large scale, cost effective storage and processing capabilities
  • Need whole data consumption and analysis
  • Experimentation/Sampling tools
• NLP and machine learning tools that scale to enhance discovery and analysis
Architecture
Under the Hood
• Lucene/Solr 4.0-dev
• Sharded with SolrCloud
  • 1 second (default) soft commits for NRT updates
  • 1 minute (default) hard commits (no searcher reopen)
  • Transaction logs for recovery
  • Solr takes care of leader election, etc. so no more master/slave
• RESTful services built on Restlet 2.1
• Service Discovery, load balancing, failover enabled via ZooKeeper + Netflix Curator
• Authentication and authorization over SSL (optional)
• Proxies for LucidWorks and WebHDFS API
• Workflow engine coordinates data flow
LucidWorks 2.1 SDA Engine
Under the Hood, cont.
• Apache Hadoop
  • Map-Reduce (MR) jobs for ETL and bulk indexing into SolrCloud sharded system
  • Leverage Pig and custom MR jobs for log processing and metric calculation
  • WebHDFS
• Apache Mahout
  • K-Means Clustering
  • Statistically Interesting Phrases
  • Similar Docs
  • More to come
• Apache HBase
  • Key-value and time series of all calculated metrics
  • Document storage
• Apache Pig
  • ETL
  • Log analysis -> HBase
• Apache ZooKeeper
  • Netflix Curator for service discovery and higher level ZK client
• Apache Kafka
  • Pub-sub for collecting logs from LucidWorks into HDFS
The Road Ahead
• Our approach is from search and discovery outwards to analytics
  • Analytics in beta are focused around analysis of search logs
  • Apache Hive support
• Analytics Themes
  • Relevance
  • Data quality
  • Discovery
  • Experiment Management
• Machine Learning
  • Classification
  • Recommendations
• Natural Language Processing
• Incorporate latest LucidWorks/Solr
Computation and Storage
LucidWorks Search/Solr
• Document Index
• Faceting
• SolrCloud makes sharding easy
Hadoop
• Stores Logs, Raw files, intermediate files, etc.
• WebHDFS
• Small files are an unnatural act
HBase
• Metric Storage
• User Histories/Profile
• Document Storage
Challenges
• Who is the authoritative store?
• Real time vs. Batch
• Where should analysis be done?
Search In Practice
• Three primary concerns
  • Performance/Scaling
  • Relevance
  • Operations: monitoring, failover, etc.
• Business typically cares more about relevance
• Devs care more about performance at first…
Search: Relevance
• Always Be Testing
• Experiment management is critical
• Top X + sampling
• Click Logs
• Track Everything!
  • Queries
  • Clicks
  • Displayed Documents
  • Mouse/Scroll tracking?
• Phrases are your friends
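One way to make "Track Everything!" concrete is to log each search as a record of the query, the documents displayed and the documents clicked, then derive simple signals from those records. The sketch below computes a per-query click-through rate; the SearchEvent record and its field names are assumptions made for illustration, not a LucidWorks log format.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SearchEvent:
    """One search request: what was asked, shown and clicked (hypothetical schema)."""
    query: str
    displayed: List[str]
    clicked: List[str] = field(default_factory=list)


def click_through_rate(events: List[SearchEvent]) -> Dict[str, float]:
    """Fraction of searches per query that received at least one click."""
    searches = defaultdict(int)
    with_click = defaultdict(int)
    for event in events:
        searches[event.query] += 1
        if event.clicked:
            with_click[event.query] += 1
    return {query: with_click[query] / searches[query] for query in searches}


events = [
    SearchEvent("solr", ["d1", "d2", "d3"], clicked=["d2"]),
    SearchEvent("solr", ["d1", "d2", "d3"]),
    SearchEvent("hadoop", ["d7", "d8"]),
]
ctr = click_through_rate(events)
```

Queries with many searches and a low click-through rate are natural candidates for the "Top X + sampling" relevance reviews.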
Discovery Components
Serendipity
• Trends
• Topics
• Recommendations
• Related Items
• More Like This
• Did you mean?
• Stat. Interesting Phrases
Organization
• Importance
• Clustering
• Classification
• Named Entities
• Time Factors
• Faceting
Data Quality
• Document factor Distributions
  • Length
  • Boosts
• Duplicates
Challenges
• Many of these are intense calculations or iterative
• Many are subjective and require a lot of experimentation
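The Data Quality items can be approximated with very little code on a small sample before committing to a large batch job. The sketch below finds exact duplicates after whitespace and case normalization and builds a histogram of document lengths; the normalization rule, the bucket size and the function names are illustrative assumptions, and near-duplicate detection would need a different technique such as shingling.

```python
import hashlib
import re
from collections import Counter, defaultdict


def normalize(text):
    """Collapse whitespace and lowercase so trivial variants hash the same."""
    return re.sub(r"\s+", " ", text).strip().lower()


def find_duplicates(docs):
    """Group document IDs whose normalized text is identical."""
    groups = defaultdict(list)
    for doc_id, text in docs.items():
        key = hashlib.sha1(normalize(text).encode("utf-8")).hexdigest()
        groups[key].append(doc_id)
    return [sorted(ids) for ids in groups.values() if len(ids) > 1]


def length_histogram(docs, bucket_size=100):
    """Count documents per token-length bucket (bucket label = lower bound)."""
    histogram = Counter()
    for text in docs.values():
        tokens = len(text.split())
        histogram[(tokens // bucket_size) * bucket_size] += 1
    return dict(sorted(histogram.items()))


docs = {"a": "Hello   World", "b": "hello world", "c": "one"}
duplicates = find_duplicates(docs)
lengths = length_histogram(docs, bucket_size=2)
```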
Discovery with Mahout
• Mahout’s 3 “C”s provide tools for helping across many aspects of discovery
  • Collaborative Filtering
  • Classification
  • Clustering
• Also:
  • Collocations (Statistically Interesting Phrases)
  • Singular Value Decomposition (SVD)
  • Others
• Challenges:
  • High cost to iterative machine learning algorithms
  • Mahout is very command line oriented
  • Some areas less mature
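Collocations are commonly scored with Dunning's log-likelihood ratio (LLR) over a 2x2 contingency table of bigram counts. The sketch below is a small in-memory version of that idea in plain Python, assuming a pre-tokenized input; it is not Mahout's code and the function names are illustrative.

```python
import math
from collections import Counter


def x_log_x(x):
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    """Unnormalized entropy term used in the LLR computation."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def llr(k11, k12, k21, k22):
    """k11: A followed by B, k12: A without B, k21: B without A, k22: neither."""
    row = entropy(k11 + k12, k21 + k22)
    column = entropy(k11 + k21, k12 + k22)
    matrix = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating point rounding.
    return max(0.0, 2.0 * (row + column - matrix))


def score_bigrams(tokens):
    """Score every adjacent token pair by LLR."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    first, second = Counter(), Counter()
    for (a, b), n in bigrams.items():
        first[a] += n
        second[b] += n
    total = sum(bigrams.values())
    scores = {}
    for (a, b), k11 in bigrams.items():
        k12 = first[a] - k11
        k21 = second[b] - k11
        k22 = total - k11 - k12 - k21
        scores[(a, b)] = llr(k11, k12, k21, k22)
    return scores


tokens = "new york is big new york is old big apple".split()
scores = score_bigrams(tokens)
```

Pairs that co-occur more often than their individual frequencies predict score higher; in this toy input ("new", "york") outranks ("is", "big").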
Aside: Experiment Management
• Plan for running experiments from the beginning across Search and Discovery components
  • Your engine should help!
• Types of Experiments to consider
  • Indexing/Analysis
  • Query parsing
  • Scoring formulas
  • Machine Learning Models
  • Recommendations, many more
• Make it easy to do A/B testing across all experiments and compare and contrast the results
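A common building block for A/B testing is a stable assignment of each user to a variant, so the same user sees the same behavior for the life of an experiment. The sketch below does this by hashing the experiment name together with the user ID; the function name, the experiment label and the variant labels are hypothetical and not taken from the talk.

```python
import hashlib


def assign_variant(user_id, experiment, variants=("control", "treatment")):
    """Deterministically map a user to one variant of an experiment."""
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    # Including the experiment name decorrelates buckets across experiments.
    return variants[int(digest, 16) % len(variants)]


# Example: assign a batch of users for a hypothetical scoring-formula experiment.
assignments = {
    user: assign_variant(user, "scoring-formula-v2")
    for user in (f"user{i}" for i in range(1000))
}
```

Logging the assigned variant next to each query and click makes it possible to compare results per variant afterwards.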
Analytics in Practice
• Many of the components discussed provide analytical features
  • Leverage existing tools: R, etc.
• Simple Counts:
  • Facets
  • Term and Document frequencies
  • Clicks
• Search and Discovery example metrics
  • Relevance measures like Mean Reciprocal Rank
  • Histograms/Drilldowns around Number of Results
  • Log and navigation analysis
• Data cleanliness analysis is helpful for finding potential issues in content
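Mean Reciprocal Rank averages, over a set of queries, the reciprocal of the rank at which the first relevant result appears; a query with no relevant result contributes 0. A minimal sketch, assuming relevance judgments (or clicks used as a proxy) are available as a set of document IDs per query; the names below are illustrative.

```python
def mean_reciprocal_rank(ranked_results, relevant):
    """ranked_results: query -> ordered list of doc IDs.
    relevant: query -> set of doc IDs judged relevant."""
    if not ranked_results:
        return 0.0
    total = 0.0
    for query, results in ranked_results.items():
        judged = relevant.get(query, set())
        for rank, doc_id in enumerate(results, start=1):
            if doc_id in judged:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(ranked_results)


ranked = {
    "q1": ["d3", "d1", "d2"],  # first relevant hit at rank 2
    "q2": ["d5", "d6"],        # first relevant hit at rank 1
    "q3": ["d7", "d8"],        # no relevant hit
}
judgments = {"q1": {"d1"}, "q2": {"d5"}, "q3": {"d9"}}
mrr = mean_reciprocal_rank(ranked, judgments)  # (1/2 + 1 + 0) / 3 = 0.5
```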
Wrap
• Search, Discovery and Analytics, when combined into a single, coherent system, provide powerful insight into both your content and your users
• LucidWorks has combined many of these things into LucidWorks Big Data
  • http://www.lucidworks.com/products/lucidworks-big-data
• Design for the big picture when building search-based applications
Discussion and Resources
• Questions?
• http://www.lucidworks.com
• [email protected]
• @gsingers