an introduction... · big data solutions for capturing, storing, searching and analyzing...

reach1to1 - 1 / 25

an introduction


what we do

big data solutions for capturing, storing, searching and analyzing structured and unstructured data from multiple sources


big data technology benefits

distributed computing

● cluster of low cost commodity servers

● capable of handling unlimited growth in data size

● distributed parallel processing models

● no loss in performance with increasing data size

● no licensing costs - primarily open source


big data technologies

open source technologies developed and used by companies like Google, Facebook, Twitter and LinkedIn


case studies

● patent document repository: a large international chemical manufacturer requires a high performance document repository capable of handling a large volume of patent documents, with advanced search capabilities

● log file analysis: a large telecom provider needs to analyze log files generated from automated customer support calls and call center logs, without manual data collation

● customer activity analysis: a fast growing low-cost airline needs to analyze customer activity to enable promotional fares that increase market share


why reach1to1?

● combined experience of over 20 years in NoSQL database technologies

● expertise in the entire product development life cycle

● handled a range of enterprise applications using NoSQL databases, including
– sales monitoring and analytics
– customer order tracking
– accounts receivable tracking
– customer support tracking


patent document repository - a case study outline


data requirements

[figure: folders (f1, f2, f3), documents (d1), document families (pf1, pf2, pf3) and comments (c1, c2, c3)]

● documents are organized into multiple folders, which determine access rights and represent logical collections

● documents are also grouped into patent families, which capture relationships based on the priority codes assigned to each document

● users review documents and add comments that record their views on the researched topic
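
The data requirements above can be sketched as a small in-memory model. This is an illustration only, not the repository's actual schema: the class names, the `readers` set, the idea that a document may sit in several folders, and the `can_read` helper are all assumptions made for the example.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Folder:
    folder_id: str
    # Assumption: access rights are modelled as the set of users who may read.
    readers: set[str] = field(default_factory=set)


@dataclass
class Document:
    doc_id: str
    folder_ids: set[str]      # assumption: a document may belong to several folders
    priority_codes: set[str]  # the codes that patent families are derived from


@dataclass
class Comment:
    comment_id: str
    doc_id: str
    author: str
    text: str


def can_read(user: str, doc: Document, folders: dict[str, Folder]) -> bool:
    """A user may read a document if at least one of its folders grants access."""
    return any(
        user in folders[fid].readers
        for fid in doc.folder_ids
        if fid in folders
    )
```

For example, with a folder `f1` readable by `alice`, `can_read("alice", doc, folders)` is true for any document placed in `f1`, while a user not listed in any of the document's folders is refused.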


functional requirements

[figure: the repository (documents d1) serves batch operations and user operations (search, crud, comment)]


batch operations

● documents are added or replaced in the repository in batches of up to thousands of documents

● the critical performance metrics for batch operations are throughput and access delay

● batch throughput is the rate at which documents are processed

● access delay is the time from the start of the batch until documents are available for user operations
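
The two batch metrics can be made concrete with a little arithmetic. The sketch below is a hypothetical helper, not part of the system described here; the field names and the choice of seconds as the unit are assumptions.

```python
from dataclasses import dataclass


@dataclass
class BatchRun:
    documents: int       # number of documents in the batch
    started_at: float    # seconds: batch start
    finished_at: float   # seconds: all documents processed
    available_at: float  # seconds: documents usable by user operations

    @property
    def throughput(self) -> float:
        """Documents processed per second over the whole batch."""
        return self.documents / (self.finished_at - self.started_at)

    @property
    def access_delay(self) -> float:
        """Seconds from batch start until documents are available to users."""
        return self.available_at - self.started_at
```

A batch of 2000 documents that starts at t=0, finishes processing at t=100 and becomes available at t=130 has a throughput of 20 documents per second and an access delay of 130 seconds.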


user operations

● create / retrieve / update / delete
– documents based on access rights

● comment
– crud operations on comments
– comments can be private or public

● search
– facility for advanced full text search features
– facility for faceted search for drilling down into search results
– search results need to contain highlights for matching terms
– search based on concordance


architecture

[figure: a client application talks to the client API of the repository application server, whose synchronization layer coordinates three components: object oriented database (storage & retrieval), advanced full text search (indexing) and document families (relationships)]



persistence

● HBase is used for persistence

● provides random, real-time read/write access

● capable of hosting very large data

● can use clusters of servers

● multi-value and hierarchical parameters are mapped to column families and columns

● links between documents and related objects are stored as linked object ids
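
To illustrate how multi-value and hierarchical parameters might be mapped to column families and columns, here is a minimal sketch in plain Python. It does not use any HBase client, and the naming scheme (dotted paths for nested values, numeric suffixes for multi-value lists, a `links` family holding object ids) is an assumption made for the example rather than the scheme used in the case study.

```python
from __future__ import annotations


def to_cells(record: dict) -> dict[str, str]:
    """Flatten {family: {param: value}} into {'family:qualifier': value} cells."""
    cells: dict[str, str] = {}

    def walk(family: str, path: str, value) -> None:
        if isinstance(value, dict):              # hierarchical parameter
            for key, child in value.items():
                walk(family, f"{path}.{key}" if path else key, child)
        elif isinstance(value, (list, tuple)):   # multi-value parameter
            for i, child in enumerate(value):
                walk(family, f"{path}.{i}", child)
        else:                                    # leaf value becomes one cell
            cells[f"{family}:{path}"] = str(value)

    # Assumption: each top-level key is a column family holding a dict of params.
    for family, params in record.items():
        walk(family, "", params)
    return cells
```

A record such as `{"meta": {"applicant": {"name": "ACME"}, "ipc": ["C07C", "B01J"]}, "links": {"comments": ["c1", "c2"]}}` flattens to cells like `meta:applicant.name`, `meta:ipc.0` and `links:comments.1`, so sparse records only store the cells they actually have.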


data model

[figure: folders (f1, f2, f3), documents (d1), patents (p1, p2, p3) and comments (c1, c2, c3)]


indexing - advanced full text search

● Solr provides powerful full-text search, hit highlighting, faceted search and dynamic clustering

● highly scalable, distributed search and index replication

● documents, comments and patents are indexed in a 1+n+m denormalized index structure

● field collapsing is used to group multiple search results that belong to the same document

● pivoted faceting is used to keep facet results accurate despite the duplicate entries
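
Why a denormalized index needs collapsing and duplicate-aware facet counts can be shown without Solr at all. The sketch below is plain Python over a list of hits, not Solr's API or its internal algorithm; the `doc_id` key and both function names are assumptions for the example.

```python
from __future__ import annotations

from collections import Counter


def collapse(hits: list[dict], key: str = "doc_id") -> list[dict]:
    """Keep only the first (best-ranked) hit per document, preserving rank order."""
    seen: set = set()
    collapsed = []
    for hit in hits:
        if hit[key] not in seen:
            seen.add(hit[key])
            collapsed.append(hit)
    return collapsed


def facet_counts(hits: list[dict], facet: str, key: str = "doc_id") -> Counter:
    """Count each facet value once per document instead of once per index entry."""
    pairs = {(hit[key], hit[facet]) for hit in hits if facet in hit}
    return Counter(value for _, value in pairs)
```

If document `d1` matches through its document entry, a patent entry and a comment entry, a naive count would report three hits in its folder; counting distinct (document, facet value) pairs reports one.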


indexing model

● example: 1 folder, 1 document, 2 patents (n) and 3 comments (m) => 1+n+m = 6 index entries

● 1 entry with folder + document properties

● n entries with folder + document + patent properties

● m entries with folder + document + comment properties
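
The 1+n+m structure can be sketched as a function that expands one document into its index entries. The field names (`entry_type`, `folder_id` and so on) are hypothetical; only the entry count and the combination of properties per entry come from the slide.

```python
from __future__ import annotations


def build_index_entries(
    folder: dict,
    document: dict,
    patents: list[dict],
    comments: list[dict],
) -> list[dict]:
    """Expand one document into 1 + n + m denormalized index entries."""
    base = {**folder, **document}  # folder + document properties, repeated in every entry
    entries = [{**base, "entry_type": "document"}]
    entries += [{**base, **patent, "entry_type": "patent"} for patent in patents]
    entries += [{**base, **comment, "entry_type": "comment"} for comment in comments]
    return entries
```

With 2 patents and 3 comments the function returns 6 entries, each carrying the folder and document properties so that any entry can be matched and filtered on its own.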


relationships - graph traversal

● Neo4j is used for mapping a graph of documents based on their tags

● a high performance graph database with transaction support

● documents, tags and families are created as vertices

● edges are created between document and tag vertices

● a family is a fully connected sub-graph
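
The graph described above can be illustrated in plain Python without a graph database. This sketch assumes that documents reachable from one another through shared tags form one family; that rule, the function names and the input shape are assumptions, and none of this is Neo4j's API.

```python
from __future__ import annotations

from collections import defaultdict
from itertools import combinations


def group_into_families(doc_tags: dict[str, set[str]]) -> list[set[str]]:
    """Group documents that are connected through shared tag vertices."""
    docs_by_tag: dict[str, set[str]] = defaultdict(set)
    for doc, tags in doc_tags.items():   # document vertex <-> tag vertex edges
        for tag in tags:
            docs_by_tag[tag].add(doc)

    families: list[set[str]] = []
    seen: set[str] = set()
    for start in doc_tags:
        if start in seen:
            continue
        family: set[str] = set()
        stack = [start]
        while stack:                      # depth-first traversal via tags
            doc = stack.pop()
            if doc in family:
                continue
            family.add(doc)
            for tag in doc_tags[doc]:
                stack.extend(docs_by_tag[tag] - family)
        seen |= family
        families.append(family)
    return families


def family_edges(family: set[str]) -> set[tuple[str, str]]:
    """Link every pair of documents in a family (a fully connected sub-graph)."""
    return set(combinations(sorted(family), 2))
```

Given `d1` and `d2` sharing tag `t1`, and `d2` and `d3` sharing `t2`, the three documents end up in one family with three pairwise edges, while an unrelated `d4` forms a family of its own.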


grouping into families

[figure: document vertices linked to tag vertices and grouped into families]


client API

repository application server

● oodebe is a synchronization engine based on node.js

● provides a consistent client api that encapsulates combined synchronous operations across multiple big data repository components

● includes a scripting engine

● includes advanced sequencing patterns - serial, parallel, waterfall, concurrent queues etc.

● provides for multiple concurrent operations, with provision for logical object-level locks
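
The sequencing patterns named above can be sketched with Python's `asyncio`. This is an analogue for illustration only: oodebe runs on node.js and its actual API is not shown in this deck, so the function names and signatures below are assumptions.

```python
import asyncio


async def serial(tasks):
    """Run tasks one after another and collect each result."""
    return [await task() for task in tasks]


async def parallel(tasks):
    """Start all tasks at once; results come back in input order."""
    return await asyncio.gather(*(task() for task in tasks))


async def waterfall(steps, initial=None):
    """Run steps in order, passing each result to the next step."""
    value = initial
    for step in steps:
        value = await step(value)
    return value


async def concurrent_queue(items, worker, concurrency=2):
    """Process items with at most `concurrency` workers active at a time."""
    limit = asyncio.Semaphore(concurrency)

    async def run(item):
        async with limit:
            return await worker(item)

    return await asyncio.gather(*(run(item) for item in items))
```

Each task or step here is a zero-argument or one-argument coroutine function. A synchronization script could, for instance, use `waterfall` to store a document and then index the stored result, and `concurrent_queue` to bound how many documents of a batch are in flight.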


client API

[figure: batch operations and user operations (search, crud, comment) reach the synchronization server through layers labelled scripts and web services; the server maintains the graph index and the full text search index]

● operations: add/update document, delete document, add/update comment, delete comment, add/update folder, delete folder, start batch, batch status, retrieve document, retrieve comment, search query
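
One way to picture a single client API call that fans out to several repository components under an object-level lock is the `asyncio` sketch below. The classes `SyncServer` and `MemoryStore` and their methods are hypothetical stand-ins; they do not reproduce the synchronization server's real interface.

```python
import asyncio
from collections import defaultdict


class MemoryStore:
    """Stand-in for one repository component (persistence, search index or graph)."""

    def __init__(self):
        self.data = {}

    async def put(self, key, value):
        self.data[key] = value


class SyncServer:
    def __init__(self, stores):
        self.stores = stores
        # One logical lock per object id, created on first use.
        self.locks = defaultdict(asyncio.Lock)

    async def add_or_update_document(self, doc_id, doc):
        # Writes to the same document are serialized; other documents proceed freely.
        async with self.locks[doc_id]:
            await asyncio.gather(*(store.put(doc_id, doc) for store in self.stores))
```

Two calls for different document ids can run concurrently, while two calls for the same id wait for each other, which keeps all components consistent for that object.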


performance benchmarks

● search query: 1.3 secs

● retrieve document: 0.3 secs

● retrieve comment: 0.3 secs

● add/update document, delete document, add/update comment, delete comment, add/update folder, delete folder: 0.3 secs, 0.25 secs, 0.24 secs, 0.25 secs, not implemented, not measured

note: timings are averages across a pre-defined set of operations
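
Averaged timings of this kind could be collected with a small harness such as the one below. It is a generic sketch, not the tool used to produce the figures above; the function name, the `repeats` parameter and the operation callables are assumptions.

```python
import time
from statistics import mean


def average_timings(operations, repeats=5):
    """Return {name: average seconds} over `repeats` runs of each operation."""
    results = {}
    for name, fn in operations.items():
        samples = []
        for _ in range(repeats):
            start = time.perf_counter()
            fn()                                        # the operation under test
            samples.append(time.perf_counter() - start)
        results[name] = mean(samples)
    return results
```

Calling `average_timings({"search query": run_search})`, where `run_search` is whatever callable issues the pre-defined query, yields one averaged figure per operation name.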


scalability - data size

● hadoop scales to thousands of commodity computers, using all cores and spindles simultaneously

● proven data size scalability – e.g. Facebook has 21 PB of data in a single hadoop cluster

● solr has built-in replication capabilities that allow it to scale up for very high query volumes without loss of performance – e.g. solr has production instances of over 200 mn items

● neo4j enterprise version includes high availability clustering and can traverse up to 1-2 mn hops per second



scalability - data complexity

● hbase column families and columns provide a flexible way to manage sparse data structures

● using object links allows additional objects to be linked to documents

● neo4j can be used to handle more hierarchical data structures that require traversals

● solr schema can be extended easily for adding new fields, though re-indexing is required after a change

● additional index servers can be added to manage new types of queries, and synchronized by oodebe synchronization scripts



scalability - processing speed

● node.js allows clusters of worker processes, with a facility to monitor and automatically manage them

● batch throughput can be optimized by using concurrent queues and multiple worker processes

● custom client applications can be developed that manage complex processes faster, and invoked through synchronization scripts

● solr batch updates and caching can be used to speed up updates and queries respectively


thank you

[email protected]

+91-98201-94408

Date post:	04-Aug-2020
