reach1to1 - 1 / 25
an introduction
what we do
big data solutions for capturing, storing, searching and analyzing structured and unstructured data from multiple sources
big data technology benefits
distributed computing
● cluster of low cost commodity servers
● capable of handling unlimited growth in data size
● distributed parallel processing models
● no loss in performance with increasing data size
● no licensing costs - primarily open source
big data technologies
open source technologies that are developed and used by companies like Google, Facebook, Twitter and LinkedIn
case studies
● patent document repository: a large international chemical manufacturer requires a high performance document repository capable of handling a large volume of patent documents with advanced search capabilities
● log file analysis: a large telecom provider needs to analyze log files generated from automated customer support calls and call center logs without manual data collation
● customer activity analysis: a fast growing low-cost airline needs to analyze customer activity to enable promotional fares that increase market share
why reach1to1?
● combined experience of over 20 years in NoSQL database technologies
● expertise in the entire product development life cycle
● handled a range of enterprise applications using NoSQL databases, including
  ● sales monitoring and analytics
  ● customer order tracking
  ● accounts receivable tracking
  ● customer support tracking
patent document repository - a case study outline
data requirements
[diagram labels: folders (f1, f2, f3), documents (d1), document families (pf1, pf2, pf3), comments (c1, c2, c3)]
● documents: organized into multiple folders that determine access rights and represent logical collections
● document families: documents are also grouped into patent families, which capture relationships based on the priority codes assigned to each document
● comments: users review documents and add comments that represent their views on the researched topic
functional requirements
[diagram: a repository of documents (d1) served by batch operations and by user operations (search, crud, comment)]
batch operations
● documents are added or replaced in the repository in batches consisting of up to thousands of documents
● the critical performance metrics for batch operations are throughput and access delay
● batch throughput is the rate of processing of documents
● access delay is the time it takes from the start of the batch till documents are available for user operations
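The two batch metrics above can be written down as a minimal Python sketch. The `BatchRun` record and its field names are hypothetical and only serve to make the definitions concrete; timestamps are assumed to be in seconds.

```python
from dataclasses import dataclass


@dataclass
class BatchRun:
    """Hypothetical record of one batch load; all times are in seconds."""
    started_at: float    # when the batch started
    finished_at: float   # when the last document was processed
    available_at: float  # when the documents became available for user operations
    documents: int       # number of documents in the batch


def throughput(run: BatchRun) -> float:
    """Rate of processing, in documents per second."""
    elapsed = run.finished_at - run.started_at
    if elapsed <= 0:
        raise ValueError("batch must have a positive duration")
    return run.documents / elapsed


def access_delay(run: BatchRun) -> float:
    """Time from the start of the batch until documents are usable."""
    return run.available_at - run.started_at


run = BatchRun(started_at=0.0, finished_at=50.0, available_at=60.0, documents=5000)
print(throughput(run), access_delay(run))
```

Keeping the two numbers separate matters because a batch can finish processing quickly and still leave users waiting, for example while indexes catch up.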
user operations
● create / retrieve / update / delete: documents, based on access rights
● comment: crud operations on comments; comments can be private or public
● search
  ● facility for advanced full text search features
  ● facility for faceted search for drilling down into search results
  ● search results need to contain highlights for matching terms
  ● search based on concordance
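To illustrate what faceted search and hit highlighting mean for the user, here is a small in-memory Python sketch. It is not how the repository implements search; the document fields (`text`, `country`) and the `<em>` markers are assumptions chosen for the example.

```python
import re
from collections import Counter


def search(docs, term):
    """Return the documents whose text contains the term (case-insensitive)."""
    pattern = re.compile(re.escape(term), re.IGNORECASE)
    return [d for d in docs if pattern.search(d["text"])]


def facet_counts(results, field):
    """Count the results per value of one field, for drilling down."""
    return Counter(d[field] for d in results)


def highlight(text, term):
    """Wrap every match of the term in <em> tags."""
    pattern = re.compile(re.escape(term), re.IGNORECASE)
    return pattern.sub(lambda m: f"<em>{m.group(0)}</em>", text)


docs = [
    {"text": "Polymer coating process", "country": "US"},
    {"text": "polymer blend", "country": "DE"},
    {"text": "Catalyst design", "country": "US"},
]
hits = search(docs, "polymer")
print(facet_counts(hits, "country"))
print([highlight(d["text"], "polymer") for d in hits])
```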
architecture
[diagram: a client application calls the client API of the repository application server, which handles synchronization across three components]
● object oriented database: storage & retrieval
● advanced full text search: indexing
● document families: relationships
persistence
object oriented database
● HBase used for persistence
● provides random, real-time read/write access
● capable of hosting very large data
● can use clusters of servers
● multi-value and hierarchical parameters mapped to column families and columns
● links between documents and related objects stored as linked object ids
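The mapping of multi-value and hierarchical parameters to column families and columns can be pictured with a plain Python sketch. It does not call any HBase client; the family names (`meta`, `inventor`, `links`) and the `family:qualifier` key layout are assumptions made for illustration.

```python
def to_cells(obj):
    """Flatten one object into 'family:qualifier' -> value cells.

    Nested dicts become one column per key, lists become one column per
    position, and scalars get a single 'value' column.
    """
    cells = {}
    for family, value in obj.items():
        if isinstance(value, dict):
            for qualifier, v in value.items():
                cells[f"{family}:{qualifier}"] = str(v)
        elif isinstance(value, (list, tuple)):
            for position, v in enumerate(value):
                cells[f"{family}:{position}"] = str(v)
        else:
            cells[f"{family}:value"] = str(value)
    return cells


document = {
    "meta": {"title": "Coating process", "year": 2011},  # hierarchical parameter
    "inventor": ["A. Rao", "B. Lee"],                     # multi-value parameter
    "links": {"folder": "f1", "comment.0": "c1"},         # linked object ids
}
print(to_cells(document))
```

Because every row only stores the cells it actually has, sparse objects cost nothing for the columns they lack.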
data model
[diagram labels: folders (f1, f2, f3), documents (d1), patents (p1, p2, p3), comments (c1, c2, c3)]
indexing
advanced full text
● Solr provides powerful full-text search, hit highlighting, faceted search and dynamic clustering
● highly scalable, with distributed search and index replication
● documents, comments and patents are indexed in a 1+n+m denormalized index structure
● field collapsing is used to group multiple search results
● pivoted faceting is used to keep facet results accurate despite the duplicate entries
indexing model
1 folder, 1 document, 2 patents (n), 3 comments (m) => 1+n+m = 6 index entries
● 1 entry: folder + document properties
● n entries: folder + document + patent properties
● m entries: folder + document + comment properties
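The 1+n+m structure can be sketched in a few lines of Python. The property names and the `type` field are hypothetical; the point is only that each entry repeats the folder and document properties, so one document with 2 patents and 3 comments yields 6 entries.

```python
def prefixed(prefix, props):
    """Namespace property names, e.g. 'title' -> 'document_title'."""
    return {f"{prefix}_{key}": value for key, value in props.items()}


def index_entries(folder, document, patents, comments):
    """Build the denormalized 1+n+m index entries for one document."""
    base = {**prefixed("folder", folder), **prefixed("document", document)}
    entries = [{"type": "document", **base}]
    entries += [{"type": "patent", **base, **prefixed("patent", p)} for p in patents]
    entries += [{"type": "comment", **base, **prefixed("comment", c)} for c in comments]
    return entries


entries = index_entries(
    folder={"id": "f1"},
    document={"id": "d1", "title": "Coating process"},
    patents=[{"id": "p1"}, {"id": "p2"}],
    comments=[{"id": "c1"}, {"id": "c2"}, {"id": "c3"}],
)
print(len(entries))
```

Since the same document now appears in several entries, a query can match it more than once, which is why grouping of results and duplicate-aware facet counts are needed on the search side.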
relationships
graph traversal
● Neo4j is used for mapping a graph of documents based on their tags
● a high performance graph database with transaction support
● documents, tags and families are created as vertices
● edges are created between document and tag vertices
● a family is a fully connected sub-graph
grouping into families
[diagram: document vertices linked to tag vertices and grouped into three families]
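One way to picture the grouping is a plain Python sketch that walks the document and tag vertices. It does not use Neo4j, and it rests on an assumption the slides do not state: two documents are treated as members of the same family when they are connected through shared tags (for example, shared priority codes).

```python
from itertools import combinations


def group_into_families(doc_tags):
    """Group documents that are connected through shared tags.

    doc_tags maps a document id to the set of its tags.
    """
    tag_docs = {}
    for doc, tags in doc_tags.items():
        for tag in tags:
            tag_docs.setdefault(tag, set()).add(doc)

    seen, families = set(), []
    for start in doc_tags:
        if start in seen:
            continue
        family, stack = set(), [start]
        while stack:  # depth-first walk: document -> tag -> documents
            doc = stack.pop()
            if doc in family:
                continue
            family.add(doc)
            for tag in doc_tags[doc]:
                stack.extend(tag_docs[tag] - family)
        seen |= family
        families.append(family)
    return families


def family_edges(family):
    """Edges that make a family a fully connected sub-graph."""
    return list(combinations(sorted(family), 2))


doc_tags = {"d1": {"P1"}, "d2": {"P1", "P2"}, "d3": {"P2"}, "d4": {"P9"}}
for family in group_into_families(doc_tags):
    print(sorted(family), family_edges(family))
```

A family of k documents needs k(k-1)/2 edges to be fully connected, so very large families make the graph grow quickly.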
client API
repository application server
● oodebe is a synchronization engine that is based on node.js
● provides a consistent client API that encapsulates combined synchronous operations across multiple big data repository components
● includes a scripting engine
● includes advanced sequencing patterns: serial, parallel, waterfall, concurrent queues etc.
● provides for multiple concurrent operations with provision for logical object-level locks
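The sequencing patterns and object-level locks named above can be sketched with Python's `asyncio`. oodebe itself is based on node.js, so this is only an analogy and not its API; the helper names, the `ObjectLocks` class and the three store names are hypothetical.

```python
import asyncio


async def serial(tasks):
    """Run tasks one after another and collect their results."""
    return [await task() for task in tasks]


async def parallel(tasks):
    """Run tasks concurrently; results keep the order of the tasks."""
    return await asyncio.gather(*(task() for task in tasks))


async def waterfall(steps, initial=None):
    """Run steps in order, passing each result to the next step."""
    value = initial
    for step in steps:
        value = await step(value)
    return value


class ObjectLocks:
    """One logical lock per object id."""

    def __init__(self):
        self._locks = {}

    def lock(self, object_id):
        return self._locks.setdefault(object_id, asyncio.Lock())


async def write(store, doc_id, log):
    await asyncio.sleep(0)  # stand-in for a call to one repository component
    log.append((store, doc_id))
    return store


async def update_document(doc_id, locks, log):
    # Hold the document's lock while all components are updated together.
    async with locks.lock(doc_id):
        return await parallel([
            lambda: write("hbase", doc_id, log),
            lambda: write("solr", doc_id, log),
            lambda: write("neo4j", doc_id, log),
        ])


print(asyncio.run(update_document("d1", ObjectLocks(), [])))
```

The lock is per object rather than global, so updates to different documents can still proceed concurrently.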
client API
[diagram: batch operations and user operations call the synchronization server]
● operations: add/update document, delete document, add/update comment, delete comment, add/update folder, delete folder, start batch, batch status, retrieve document, retrieve comment, search query
● exposed through: scripts, web services
● backing components: object oriented database, graph index, full text search index
performance benchmarks
● search query: 1.3 secs
● retrieve document, retrieve comment: 0.3 secs, 0.3 secs
● add/update document, delete document, add/update comment, delete comment, add/update folder, delete folder: 0.3 secs, 0.25 secs, 0.24 secs, 0.25 secs, not implemented, not measured
note: timings are averages across a pre-defined set of operations
scalability - data size
● hadoop scales to thousands of commodity computers using all cores and spindles simultaneously
● proven data size scalability, e.g. Facebook has 21 PB of data in a single hadoop cluster
● solr has built-in capabilities for replication that allow it to scale up for very high query volumes without loss of performance, e.g. solr has production instances of over 200 mn items
● neo4j enterprise version includes high availability clustering and can traverse up to 1-2 mn hops per second
scalability - data complexity
● hbase column families and columns provide a flexible way to manage sparse data structures
● using object links allows additional objects to be linked to documents
● neo4j can be used to handle more hierarchical data structures that require traversals
● solr schema can be extended easily, though re-indexing is required after a change
● additional index servers can be added to manage new types of queries and synchronized by oodebe synchronization scripts
scalability - processing speed
● node.js allows clusters of worker processes with a facility to monitor and automatically manage them
● batch throughput can be optimized by using concurrent queues and multiple worker processes
● custom client applications can be developed that manage complex processes faster, and invoked through synchronization scripts
● solr batch updates and caching can be used to speed up updates and queries respectively
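The idea of raising batch throughput with several workers can be sketched in Python. The deck describes node.js worker processes; a thread pool is used here only as a simple stand-in, and `process_batch` and its `handler` argument are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def process_batch(documents, handler, workers=4):
    """Process the documents of a batch with a pool of workers.

    Results are returned in the same order as the input documents.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(handler, documents))


print(process_batch(["d1", "d2", "d3"], str.upper, workers=2))
```

The gain depends on where the time goes: workers help most when each document spends its time waiting on storage or index calls rather than on computation.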