Date posted: 03-Jul-2015
Category: Technology
Uploaded by: lucenerevolution
Kitenga reinventing information
Mark Davis Founder/CTO
Enabling Big Data Search via the Lucid ReST API
Big Data
• Enormous transactional data
• Enormous unstructured information
• Too big for databases
• New tools are needed
Decimal (SI)      Value    Binary (IEC)      Value
kilobyte (kB)     10^3     kibibyte (KiB)    2^10
megabyte (MB)     10^6     mebibyte (MiB)    2^20
gigabyte (GB)     10^9     gibibyte (GiB)    2^30
terabyte (TB)     10^12    tebibyte (TiB)    2^40
petabyte (PB)     10^15    pebibyte (PiB)    2^50
exabyte (EB)      10^18    exbibyte (EiB)    2^60
zettabyte (ZB)    10^21    zebibyte (ZiB)    2^70
yottabyte (YB)    10^24    yobibyte (YiB)    2^80
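The gap between the decimal and binary units in the table grows with each prefix. A minimal sketch of that arithmetic follows; the function name `unit_sizes` is illustrative and not part of the original slides.

```python
# Decimal (SI) prefixes are powers of 1000; binary (IEC) prefixes are powers of 1024.
PREFIXES = [
    ("kB", "KiB"), ("MB", "MiB"), ("GB", "GiB"), ("TB", "TiB"),
    ("PB", "PiB"), ("EB", "EiB"), ("ZB", "ZiB"), ("YB", "YiB"),
]


def unit_sizes():
    """Return (si_name, si_bytes, iec_name, iec_bytes) for each prefix step."""
    rows = []
    for step, (si, iec) in enumerate(PREFIXES, start=1):
        rows.append((si, 10 ** (3 * step), iec, 2 ** (10 * step)))
    return rows


if __name__ == "__main__":
    for si, si_bytes, iec, iec_bytes in unit_sizes():
        # Percentage by which the binary unit exceeds the decimal one.
        gap = (iec_bytes / si_bytes - 1) * 100
        print(f"1 {iec} = {iec_bytes} bytes, {gap:.1f}% larger than 1 {si}")
```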
Volume Velocity Variety
Gather Resources
• Crawl • Crack formats
Extract Metadata
• Named entities
• Categories • Machine learning
• Semantic analysis
Index
• Schema definition
• Collection management
Indexing Challenges
• Complex, varied data
• Compute-intensive metadata generation
• Schema and collection management
Initial Query
• Keyword guesses
• Category guidance
Refine Query
• Analytic tools
• Facetted guidance
Evaluate Relevance
• Read KWIC • Read metadata
• Read document
Search Experience Challenges
• Complex, varied data
• Resource discovery
• Facetted search experience management
The Solution
Enable fast metadata generation:
Hadoop Mahout GPUs
Manage and control collections and schema:
LucidWorks Enterprise API
SQL RDBMS
Transactional Data BI Tools
Search Documents Text Classification Taxonomies Ontologies
Parts-of-Speech Tagging
Tokenization
Lemmatization
Finite State Transducer Finite State Transducer
Finite State Transducer
Machine-Learning
Query Language
Metadata Extraction
Indexing
Facet Browsing Facet Charting
Resource Integration
Autosuggest Spellcheck
• Start to POC in a week
• Open source intelligence problems
GOAL: Be more competitive
SOURCES: Patents, PR announcements, legal documents, whitepapers, crawled websites
ANALYSIS: Extract named entities and relationships, classify and label; visually understand relationships and trends
ACTION: Change R&D priorities and improve marketing approaches
[Diagram: Sources; ZettaVox (metadata, relationships, data entities); ZettaSearch (Facetted Search and Analytics)]
• Understand IP among competitors
• Assist legal team with litigation
• Custom search experience
• Custom extractors:
  - Electronic parts
  - Memory types
  - Flash memory
5/15/12
Company      Documents   Size
Dell           102,508   9 GB
EMC            303,678   14 GB
Huawei          11,912   890 MB
Kingston         2,534   134 MB
Lenovo           8,305   542 MB
NEC              3,900   252 MB
Nokia          174,681   22 GB
Panasonic        5,804   473 MB
RIM                181   8 MB
Sharp USA       31,918   4.9 GB
Total          645,421   60.2 GB
GOAL: Discover new drugs, detect side-effects, speed R&D
SOURCES: Published research reports, patents, adverse effects databases, genomics and proteomics databases
ANALYSIS: Extract named entities and relationships, classify and label; visually discover trends and relationships
ACTION: Change R&D priorities
[Diagram: Sources; ZettaVox (relationships, data entities, pathways, sequences); ZettaSearch (Facetted Search and Analytics)]
• Lousy search (Google Search Appliance)
• Internal regulators can’t find by accession number
• Custom extractors:
  - Accession number
  - Ontology of active ingredients
  - Drug names
© 2012 Kitenga Proprietary
GOAL: Build “second screen experiences”
SOURCES: Wikipedia, IMDb, blogs
ANALYSIS: Extract named entities and relationships, preserve existing structural metadata
ACTION: Enable new media experiences
[Diagram: Sources; ZettaVox (metadata, relationships, data entities); ZettaSearch (Facetted Search and Analytics)]
• Crawlers on Hadoop
• Document format crackers on Hadoop
• Extractors on Hadoop
• Filters on Hadoop
• Documents sent over HTTP to a sharded Solr cluster
• Intermediary files remain on HDFS for reprocessing
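The slides do not say how documents are distributed across the Solr shards. As a rough sketch only, one common approach is to route each document by a stable hash of its id and batch the documents per shard before posting. The shard URLs, the `id` field name, and both function names below are assumptions made for illustration, not details from the talk.

```python
import hashlib
import json
from collections import defaultdict

# Hypothetical shard URLs; the deck does not list the real cluster layout.
SHARDS = [
    "http://solr-shard0.example.com:8983/solr",
    "http://solr-shard1.example.com:8983/solr",
]


def shard_for(doc_id, shard_count):
    """Pick a shard index from a stable hash of the document id."""
    digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % shard_count


def batch_by_shard(docs, shards=SHARDS):
    """Group documents by target shard and serialize each group as a JSON array."""
    groups = defaultdict(list)
    for doc in docs:
        target = shards[shard_for(doc["id"], len(shards))]
        groups[target].append(doc)
    # Each value is a request body that a worker could post to that shard.
    return {url: json.dumps(batch) for url, batch in groups.items()}
```

Because the hash is deterministic, reprocessing the intermediary files would send a given document to the same shard again, so an updated copy replaces the old one instead of creating a duplicate elsewhere.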
• Missing piece of the puzzle
• Addresses the impedance mismatch between Big Data technologies and Solr search
• Manage collections
• Manage schema
• Create collections
• Delete collections
• Update collection properties
• Create schema
• Modify schema
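The collection operations listed above map naturally onto ReST verbs. The sketch below only builds the HTTP requests and does not send them. The base URL, the `/collections` paths, and the JSON body shape are assumptions for illustration; the slides do not show the actual endpoints, so check them against the LucidWorks Enterprise API documentation before use.

```python
import json
from urllib.request import Request

# Assumed base URL; adjust to wherever the search API is actually served.
BASE_URL = "http://localhost:8888/api"


def collection_request(action, name, properties=None, base_url=BASE_URL):
    """Build (but do not send) an HTTP request for one collection operation."""
    if action == "create":
        method, url, body = "POST", f"{base_url}/collections", {"name": name}
    elif action == "delete":
        method, url, body = "DELETE", f"{base_url}/collections/{name}", None
    elif action == "update":
        method, url, body = "PUT", f"{base_url}/collections/{name}", properties or {}
    else:
        raise ValueError(f"unknown action: {action}")
    # Only requests that carry a body get a JSON payload and content type.
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {"Content-Type": "application/json"} if data is not None else {}
    return Request(url, data=data, headers=headers, method=method)
```

A caller would pass the returned object to `urllib.request.urlopen` once the endpoint details are confirmed.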
• Schema interrogation
• Schema binding to user experience
• Facetted search
• Embedded analytics
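One way to read "schema binding to user experience" is that the facets offered in the interface are derived from the schema instead of being hard-coded. The sketch below assumes a hypothetical schema listing (dictionaries with `name` and `facet` keys, which the slides do not specify) and builds a Solr select URL using the standard `facet`, `facet.field`, `facet.mincount`, and `fq` request parameters.

```python
from urllib.parse import urlencode


def facet_fields_from_schema(fields):
    """Return the names of fields flagged for faceting in a (hypothetical) schema listing."""
    return [f["name"] for f in fields if f.get("facet")]


def facet_query_url(select_url, keywords, facet_fields, filters=None, rows=10):
    """Build a Solr select URL with one facet.field per schema-derived facet."""
    params = [
        ("q", keywords),
        ("rows", rows),
        ("wt", "json"),
        ("facet", "true"),
        ("facet.mincount", 1),
    ]
    params += [("facet.field", name) for name in facet_fields]
    # Each facet value the user has clicked becomes a filter query.
    for field, value in (filters or {}).items():
        params.append(("fq", f'{field}:"{value}"'))
    return f"{select_url}?{urlencode(params)}"
```

For example, a schema with facetable `company` and `entity` fields would produce a URL that requests counts for both, and selecting a company in the interface adds an `fq` filter on the next request.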
• Big Data search and analytics have many challenges:
  - Volume of data
  - Variety of data
  - Velocity of data
  - Extracting structure from unstructured information
• Hadoop processing addresses each of these challenges
• Controlling indexing and search is enabled by the Lucid Imagination search API
• We can enable complex user interactions with Big Data on a self-serve basis
[Architecture diagram. Components shown: ZettaVox Author RIA; Analyst Browser; Tomcat App Server; Tomcat Web Services; ZettaVox Services Manager (XML + JSON); GPU Services Manager; Hadoop Services Manager; GPU MR Service Manager; GPUs; Amazon S3; Hadoop Server (Job Tracker); Hadoop Server (Name node); Hadoop Task Managers; Search; Indexing; Mahout; Entity Extraction; Crawling; Quantum4D; RDBMS; ReST; JSON. Groupings: Enterprise servers, Cloud services, Enterprise Cloud.]
[Architecture diagram, reduced view. Components shown: ZettaVox Author RIA; Analyst Browser; Enterprise servers; Hadoop Server (Job Tracker); Hadoop Server (Name node); Hadoop Task Managers; Search; Indexing; Mahout; Entity Extraction; Crawling; ReST; JSON.]
• Get collection information
• Create new collection
• Create fields
• Delete fields
• Edit fields
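Before issuing create, delete, and edit calls for fields, it helps to work out which calls are needed. The following is a minimal sketch that compares the fields a collection currently has with the fields a pipeline wants. The function name and the field attribute names (`type`, `stored`) are hypothetical and not taken from the API.

```python
def plan_field_changes(current, desired):
    """Compare two {field_name: attributes} mappings and list the needed operations."""
    ops = []
    # Fields that exist only in the desired schema must be created.
    for name in sorted(desired.keys() - current.keys()):
        ops.append(("create", name, desired[name]))
    # Fields that exist only in the current schema must be deleted.
    for name in sorted(current.keys() - desired.keys()):
        ops.append(("delete", name, None))
    # Fields present in both need an edit only when their attributes differ.
    for name in sorted(current.keys() & desired.keys()):
        if current[name] != desired[name]:
            ops.append(("edit", name, desired[name]))
    return ops


if __name__ == "__main__":
    current = {"title": {"type": "text", "stored": True}, "old_tag": {"type": "string"}}
    desired = {"title": {"type": "text", "stored": False}, "drug_name": {"type": "string"}}
    for op in plan_field_changes(current, desired):
        print(op)
```

Each resulting tuple could then be turned into the matching ReST call, which keeps schema changes explicit and easy to review when extractors add new metadata fields.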
Indexing
Questions?