10 April 2023 1
BIG DATA Defined: Data Stack 3.0
Anand Deshpande, Persistent Systems, December 2012
Congratulations to the Pune Chapter
Best Chapter Award at CSI 2012 Kolkata
COMAD 2012: 14-16 December, Pune
Coming to India: Delhi 2016
The Data Revolution is Happening Now
The growing need for large-volume, multi-structured "Big Data" analytics, as well as … "Fast Data", has positioned the industry at the cusp of the most radical revolution in database architectures in 20 years.
We believe that the economics of data will increasingly drive competitive advantage.
Source: Credit Suisse Research, Sept 2011
Organizational leaders want analytics to exploit their growing data and computational power to get smart, and get innovative, in ways they never could before.
Source: MIT Sloan Management Review, The New Intelligent Enterprise: "Big Data, Analytics and the Path From Insights to Value" by Steve LaValle, Eric Lesser, Rebecca Shockley, Michael S. Hopkins and Nina Kruschwitz, December 21, 2010
What Data Can Do For You
Source: New York Times, September 2, 2009. "Tesco, British Grocer, Uses Weather to Predict Sales" by Julia Werdigier. http://www.nytimes.com/2009/09/02/business/global/02weather.html
Britain often conjures images of unpredictable weather, with downpours sometimes followed by sunshine within the same hour — several times a day.
Such randomness has prompted Tesco, the country’s largest grocery chain, to create…its own software that calculates how shopping patterns change “for every degree of temperature and every hour of sunshine.”
Determining Shopping Patterns: British Grocer Tesco Uses Big Data, Applying Weather Results to Predict Demand and Increase Sales
GlaxoSmithKline is aiming to build direct relationships with 1 million consumers in a year using social media as a base for research and multichannel marketing. Targeted offers and promotions will drive people to particular brand websites where external data is integrated with information already held by the marketing teams.
Source: Big data: Embracing the elephant in the room By Steve Hemsley http://www.marketingweek.co.uk/big-data-embracing-the-elephant-in-the-room/3030939.article
Tracking Customers in Social Media
GlaxoSmithKline Uses Big Data to Efficiently Target Customers
What does India Think?
Persistent enabled Aamir Khan Productions and Star Plus to use Big Data to learn how people react to some of the most pressing social issues. http://www.satyamevjayate.in/
Satyamev Jayate - Aamir Khan's pioneering, interactive socio-cultural TV show - has caught the interest of the entire nation. It has already generated ~7.5M responses in 4 weeks over SMS, Facebook, Twitter, phone calls and discussion forums, from viewers across the world. This data is being analyzed and delivered in real time to allow the producers to understand the pulse of the viewers, to gauge the appreciation for the show and, most importantly, to spread the message. Harnessing the truth from all this data is a key component of the show's success.
WE ALREADY HAVE DATABASES. WHY DO WE NEED TO DO ANYTHING DIFFERENT?
● Transaction processing capabilities ideally suited for transaction-oriented operational stores.
● Data types – numbers, text, etc.
● SQL as the query language.
● De-facto standard as the operational store for ERP and mission-critical systems.
● Interface through application programs and query tools.

Relational Database Systems for the Operational Store
Data Stack 1.0
Data Stack 1.0: Online Transaction Processing (OLTP)
● High throughput for transactions (writes).
● Focus on reliability – ACID properties.
● Highly normalized schema.
● Interface through application programs and query tools.
● Operational data stores hold online transactions – many writes, some reads.
● Large fact table, multiple dimension tables.
● Schema has a specific pattern – star schema.
● Joins are also very standard and create cubes.
● Queries focus on aggregates.
● Users access data through tools such as Cognos, Business Objects, Hyperion, etc.

Data Stack 2.0: Enterprise Data Warehouse for Decision Support
Data Stack 2.0: Enterprise Data Warehouse
Data flow: ETL → Data Staging → Data Warehouse → OLAP Data Store → User
User-facing tools:
● Reports & ad hoc analysis
● Alerts & dashboards
● What-if analysis / EPM
● Predictive analytics
● Data visualization
Data Stack 2.0: Enterprise Data Warehouse Systems
Standard Enterprise Data Architecture:
● Sources: relational databases, legacy data, purchased data, ERP systems.
● Extraction, cleansing and optimized loading (ETL) into the Data Warehouse Engine, backed by a Metadata Repository.
● Analyze / query tools on top.
Data Stack 1.0 (Operational Data Systems) sits beside it: relational databases, application logic, presentation layer.
Who are the players? (columns: Oracle | Microsoft | IBM | SAP | Open Source | Pure Play)

ETL: Oracle Data Integrator | SQL Server Integration Services (SSIS) | IBM InfoSphere DataStage | Business Objects Data Integrator | Kettle (Enterprise Data Integration Server) | Informatica PowerCenter
DWH: Oracle 11g / Exadata | Parallel Data Warehouse (PDW) | Netezza (PureData) | Sybase IQ | Postgres / MySQL | Teradata; Greenplum (EMC)
OLAP: Hyperion / Essbase | SQL Server Analysis Services (SSAS) | Cognos PowerPlay | SAP HANA | Mondrian OLAP Viewer | –
Reporting: Oracle BI (OBIEE) & Exalytics | SQL Server Reporting Services (SSRS) | Cognos BI | Business Objects; BO Dashboard Builder | BIRT; Pentaho; Jasper | SAS Enterprise Guide / Web Report Studio; MicroStrategy; QlikTech; Tableau
Predictive Analytics: Oracle Data Mining (ODM) | SQL Server Data Mining (SSDM) | SPSS | SAP HANA + R | R / Weka | SAS Enterprise Miner
One in two business executives believes they do not have sufficient information across their organization to do their job.
Source: IBM Institute for Business Value
Despite the two data stacks …
Data has variety: it doesn't fit.
Less than 40% of enterprise data makes its way into Data Stack 1.0 or Data Stack 2.0.
Beyond the operational systems, data required for decision making is scattered within and beyond the enterprise:
● Structured data sources: ERP systems, CRM systems, enterprise data warehouse.
● Unstructured data sources: email systems, collaboration/wiki sites, document repositories, project artifacts, employee surveys, customer call center records, organizational workflow, sensor data.
● Cloud data sources: CRM systems, expense management systems, vendor collaboration systems, supply chain systems, location and presence data.
● Public data sources: weather forecasts, demographic data, maps, economic data, social networking data, Twitter feeds.
Data Volumes are Growing
"5 exabytes of information was created between the dawn of civilization through 2003, but that much information is now created every 2 days, and the pace is increasing."
– Eric Schmidt at the Techonomy Conference, August 4, 2010
(1 exabyte = 10^18 bytes)
The Continued Explosion of Data in the Enterprise and Beyond
● 80% of new information growth is unstructured content.
● 90% of that is currently unmanaged.
● 2009: 800,000 petabytes → 2020: 35 zettabytes – 44x as much data and content over the coming decade.
Source: IDC, "The Digital Universe Decade – Are You Ready?", May 2010
What comes first – structure or data?
Schema/Structure → Data
Structure First is Constraining
Time to create a new data stack for unstructured data.
Data Stack 3.0.
Time-out!
Internet companies have already addressed the same problems.
● Twitter has 140 million active users and more than 400 million tweets per day.
● Facebook has over 900 million active users and an average of 3.2 billion Likes and Comments are generated by Facebook users per day.
● There were 3.1 billion email accounts in 2011, expected to rise to over 4 billion by 2015.
● There were 2.3 billion internet users (2,279,709,629) worldwide in the first quarter of 2012, according to Internet World Stats data updated 31st March 2012.
Internet Companies have to deal with large volumes of unstructured real-time data.
● Hosted service.
● Large cluster (1000s of nodes) of low-cost commodity servers.
● Very large amounts of data – indexing billions of documents, videos, images, etc.
● Batch updates.
● Fault tolerance.
● Hundreds of millions of users.
● Billions of queries every day.

Their data loads and pricing requirements do not fit traditional relational systems.
● It is the platform that distinguishes them from everyone else.
● They required:
– high reliability across data centers
– scalability to thousands of network nodes
– huge read/write bandwidth
– support for large blocks of data, gigabytes in size
– efficient distribution of operations across nodes to reduce bottlenecks
● Relational databases were not suitable and would have been cost prohibitive.

They built their own systems.
Internet companies have open-sourced the source code they created for their own use.
Companies have created business models to support and enhance this software.
What did the Internet Companies build? And how did they get there?
They started with a clean slate!
What features from the relational database can be compromised?
Do we need …
● transaction support?
● rigid schemas?
● joins?
● SQL?
● online, live updates?
Must have:
● scale
● ability to handle unstructured data
● ability to process large volumes of data without having to start with structure first
● leverage distributed computing
Rethinking ACID Properties
ACID: Atomicity, Consistency, Isolation, Durability
For the internet workload, with distributed computing, ACID properties are too strong.
Rather than requiring consistency after every transaction, it is enough for the database to eventually be in a consistent state – BASE: Basic Availability, Soft-state, Eventual consistency.
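The eventual-consistency idea can be made concrete with a toy example: two replicas accept writes independently and converge once an anti-entropy pass merges them, here with a simple last-writer-wins rule. This is an illustrative sketch only – the class and method names are invented, and real stores typically use vector clocks rather than a single counter.

```java
import java.util.HashMap;
import java.util.Map;

// Toy illustration of eventual consistency: two replicas accept writes
// independently; a sync step (anti-entropy) brings them to the same
// state, using a per-key version to pick the newer write.
public class EventualConsistencySketch {
    static class Versioned {
        final String value;
        final long version;
        Versioned(String value, long version) { this.value = value; this.version = version; }
    }

    final Map<String, Versioned> replicaA = new HashMap<>();
    final Map<String, Versioned> replicaB = new HashMap<>();
    long clock = 0;

    void writeToA(String k, String v) { replicaA.put(k, new Versioned(v, ++clock)); }
    void writeToB(String k, String v) { replicaB.put(k, new Versioned(v, ++clock)); }

    // Anti-entropy: for each key, keep the higher-versioned value on both sides.
    void sync() {
        Map<String, Versioned> merged = new HashMap<>(replicaA);
        replicaB.forEach((k, v) ->
                merged.merge(k, v, (x, y) -> x.version >= y.version ? x : y));
        replicaA.clear(); replicaA.putAll(merged);
        replicaB.clear(); replicaB.putAll(merged);
    }
}
```

After `sync()`, both replicas return the same value for every key – consistent eventually, not after every write, which is exactly the BASE trade-off.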
Brewer's CAP Theorem for Distributed Systems
● Consistent – reads always pick up the latest write.
● Available – can always read and write.
● Partition tolerant – the system can be split across multiple machines and datacenters.
You can have at most two of these three: CA, CP, or AP.
Essential Building Blocks for Internet Data Systems
● Hadoop Distributed File System (HDFS).
● Hadoop MapReduce layer – Map/Reduce jobs (written by developers), coordinated by a Job Tracker, running on a cluster.
"For the last several years, every company involved in building large web-scale systems has faced some of the same fundamental challenges. While nearly everyone agrees that the 'divide-and-conquer using lots of cheap hardware' approach to breaking down large problems is the only way to scale, doing so is not easy."
– Jeremy Zawodny, Yahoo!
Challenges with Distributed Computing
● Cheap nodes fail, especially if you have many.
– Mean time between failures for 1 node: 3 years; for 1000 nodes: 1 day.
– Solution: build fault tolerance into the system.
● Commodity network = low bandwidth.
– Solution: push computation to the data.
● Programming distributed systems is hard.
– Solution: a data-parallel programming model – users write "map" and "reduce" functions; the system distributes the work and handles faults.
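The "users write map and reduce functions, the system distributes the work" model can be imitated on a single machine with Java parallel streams. This is only an analogy for the programming model, not Hadoop itself:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Toy data-parallel word count: the "map" step splits lines into words,
// the "shuffle + reduce" step groups equal words and sums their counts.
// The stream runtime spreads the work across threads, the way a
// MapReduce framework spreads it across cluster nodes.
public class DataParallelWordCount {
    public static Map<String, Long> count(List<String> lines) {
        return lines.parallelStream()
                .flatMap(line -> Arrays.stream(line.split("\\s+"))) // map: line -> words
                .collect(Collectors.groupingByConcurrent(           // shuffle: group by word
                        w -> w, Collectors.counting()));            // reduce: count per word
    }

    public static void main(String[] args) {
        List<String> lines = List.of("the quick brown fox", "the fox ate the mouse");
        System.out.println(count(lines)); // per-word counts, e.g. the -> 3, fox -> 2
    }
}
```

The user code is just the two pure functions; distribution, scheduling and (in a real framework) fault handling live in the runtime.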
The Hadoop Ecosystem
● HDFS – distributed, fault-tolerant file system.
● MapReduce – framework for writing/executing distributed, fault-tolerant algorithms.
● Hive & Pig – SQL-like declarative languages.
● Sqoop – package for moving data between HDFS and relational DB systems.
● + Others: ZooKeeper, Avro (serialization), HBase.
The stack connects outward to ETL tools, BI reporting and RDBMS systems.
Reliable Storage is Essential
● Google GFS; Hadoop HDFS; Kosmix KFS – large, distributed, log-structured file systems that store all types of data.
● Provides a global file namespace.
● Typical usage pattern:
– huge files (100s of GB to TB)
– data is rarely updated in place
– reads and appends are common
● A new application coming online can use an existing GFS cluster or create its own.
● The file system can be tuned to fit individual application needs.
http://highscalability.com/google-architecture
Distributed File System
● Chunk servers
– a file is split into contiguous chunks
– typically each chunk is 16-64 MB
– each chunk is replicated (usually 2x or 3x)
– replicas are kept in different racks where possible
● Master node
– a.k.a. Name Node in HDFS
– stores metadata
– may be replicated
Now that you have storage, how would you manipulate this data? MapReduce.
● Why use MapReduce?
– A nice way to partition tasks across lots of machines.
– Handles machine failure.
– Works across different application types, like search and ads.
– You can pre-compute useful data, find word counts, sort TBs of data, etc.
– Computation can automatically move closer to the IO source.
Hadoop is the Apache implementation of MapReduce
● The Apache™ Hadoop™ project develops open-source software for reliable, scalable, distributed computing.
● The Apache Hadoop software library is a framework that allows distributed processing of large data sets across clusters of computers using a simple programming model.
● It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.
● Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, delivering a highly available service on top of a cluster of computers, each of which may be prone to failure.
Hadoop MapReduce Flow
Word Count – Distributed Solution

Input (three splits):
"the quick brown fox" | "the fox ate the mouse" | "how now the brown cow"

Map output (one pair per word):
(the,1) (quick,1) (brown,1) (fox,1) | (the,1) (fox,1) (ate,1) (the,1) (mouse,1) | (how,1) (now,1) (the,1) (brown,1) (cow,1)

Shuffle & sort (group by key):
brown,[1,1]  fox,[1,1]  how,[1]  now,[1]  the,[1,1,1,1] | ate,[1]  cow,[1]  mouse,[1]  quick,[1]

Reduce output:
brown,2  fox,2  how,1  now,1  the,4 | ate,1  cow,1  mouse,1  quick,1
Word Count in MapReduce

map:
public void map(Object key, Text value, Context context)
    throws IOException, InterruptedException {
  StringTokenizer itr = new StringTokenizer(value.toString());
  while (itr.hasMoreTokens()) {
    word.set(itr.nextToken());
    context.write(word, one);      // emit (word, 1)
  }
}

reduce:
public void reduce(Text key, Iterable<IntWritable> values, Context context)
    throws IOException, InterruptedException {
  int sum = 0;
  for (IntWritable val : values) {
    sum += val.get();              // add up the 1s for this word
  }
  result.set(sum);
  context.write(key, result);      // emit (word, total count)
}
Pig and Hive
● Pig and Hive provide a wrapper to make it easier to write MapReduce jobs.
● The raw data is stored in Hadoop's HDFS.
● These scripting languages provide:
– ease of programming
– optimization opportunities
– extensibility
● Pig is a data-flow scripting language; Hive is a SQL-like language.
Other Hadoop-related projects at Apache include:
● Avro™: a data serialization system.
● Cassandra™: a scalable multi-master database with no single points of failure.
● Chukwa™: a data collection system for managing large distributed systems.
● HBase™: a scalable, distributed database that supports structured data storage for large tables.
● Hive™: a data warehouse infrastructure that provides data summarization and ad hoc querying.
● Mahout™: a scalable machine learning and data mining library.
● Pig™: a high-level data-flow language and execution framework for parallel computation.
● ZooKeeper™: a high-performance coordination service for distributed applications.
http://hadoop.apache.org/
Powered by Hadoop – http://wiki.apache.org/hadoop/PoweredBy (more than 100 companies are listed)
● Facebook
– 1100-machine cluster with 8800 cores
– stores copies of internal log and dimension data sources and uses them as a source for reporting/analytics and machine learning
● Yahoo
– biggest cluster: 4000 nodes
– Search Marketing, People You May Know, Search Assist, and many more
● eBay
– 532-node cluster (8 × 532 cores, 5.3 PB)
– used for search optimization and research
Hadoop is not a relational database
● Hadoop is best suited for batch processing of large volumes of unstructured data:
– no schemas
– no indexes
– updates are pretty much absent
– not designed for joins
– no support for integrity constraints
– limited support for data analysis tools
But what are your data analysis needs?
Hadoop is not a relational database. If these are important, stick to an RDBMS:
● OLTP
● data integrity
● data independence
● SQL
● ad-hoc queries
● complex relationships
● maturity and stability
Do you need SQL and full relational systems? If not, consider NoSQL databases for your needs.
NoSQL: http://nosql-database.org/
Four families: key-value, tabular, document, graph.
The Key-Value In-Memory DBs
● In-memory DBs are simpler and faster than their on-disk counterparts.
● Key-value stores offer a simple interface with no schema – really a giant, distributed hash table.
● Often used as caches for on-disk DB systems.
● Advantages:
– relatively simple
– practically no server-to-server talk
– linear scalability
● Disadvantages:
– doesn't understand the data – no server-side operations; the key and value are always strings
– really meant to be only a cache – no more, no less
– no recovery, limited elasticity
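The "giant hash table" description boils down to very little code on a single node. A minimal sketch, with invented class and method names – note the value is an opaque string the server never interprets, exactly as the slide says:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Minimal sketch of a key-value cache: a schema-free hash table of
// opaque strings, with no server-side understanding of the data.
public class KeyValueCache {
    private final Map<String, String> store = new ConcurrentHashMap<>();

    public void put(String key, String value) {
        store.put(key, value);
    }

    // Returns null on a miss; a real cache would then fall through to
    // the on-disk database and populate the cache with the result.
    public String get(String key) {
        return store.get(key);
    }

    public static void main(String[] args) {
        KeyValueCache cache = new KeyValueCache();
        cache.put("user:42", "{\"name\": \"Ada\"}"); // the value is just an opaque string
        System.out.println(cache.get("user:42"));
    }
}
```

Everything a distributed deployment adds – partitioning keys across servers, replication – sits around this same get/put interface.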
Voldemort is a distributed key-value storage system
● Data is automatically:
– replicated over multiple servers
– partitioned so each server contains only a subset of the total data
● Data items are versioned.
● Server failure is handled transparently.
● Each node is independent of other nodes, with no central point of failure or coordination.
● Support for pluggable data placement strategies, e.g. distribution across data centers that are geographically far apart.
● Good single-node performance: expect 10-20k operations per second, depending on the machines, the network, the disk system, and the data replication factor.
● Voldemort is not a relational database – it does not attempt to satisfy arbitrary relations while satisfying ACID properties. Nor is it an object database that attempts to transparently map object reference graphs. Nor does it introduce a new abstraction such as document orientation.
● It is basically just a big, distributed, persistent, fault-tolerant hash table.
http://project-voldemort.com/
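How a store like this can partition and replicate keys without central coordination is commonly explained via consistent hashing: nodes are placed on a hash ring, and a key's replicas are the next N distinct nodes clockwise from the key's position. The sketch below is a simplified illustration, not Voldemort's actual implementation (real systems add virtual nodes and pluggable placement); all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

// Toy consistent-hash ring: every node owns a point on the ring, and a
// key is stored on the first N distinct nodes found walking clockwise
// from the key's own hash position. Any node can compute this locally,
// so no central coordinator is needed.
public class HashRing {
    private final SortedMap<Integer, String> ring = new TreeMap<>();

    public void addNode(String node) {
        ring.put(node.hashCode() & 0x7fffffff, node); // non-negative position
    }

    // The replica set for a key: walk clockwise, wrapping around the ring.
    public List<String> replicasFor(String key, int n) {
        List<String> out = new ArrayList<>();
        int h = key.hashCode() & 0x7fffffff;
        SortedMap<Integer, String> tail = ring.tailMap(h);
        for (String node : tail.values()) {
            if (out.size() < n) out.add(node);
        }
        for (String node : ring.values()) { // wrap around
            if (out.size() < n && !out.contains(node)) out.add(node);
        }
        return out;
    }

    public static void main(String[] args) {
        HashRing ring = new HashRing();
        ring.addNode("server-a");
        ring.addNode("server-b");
        ring.addNode("server-c");
        System.out.println(ring.replicasFor("user:42", 2)); // two distinct replicas
    }
}
```

A nice property of this layout is that adding or removing one node only moves the keys adjacent to it on the ring, rather than reshuffling everything as a plain hash-mod-N scheme would.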
Tabular Stores
● The original: Google's BigTable – proprietary, not open source.
● The open-source elephant alternative – Hadoop with HBase:
– a top-level Apache project
– large number of users
– contains a distributed file system, MapReduce, a database server (HBase), and more
– rack aware
What is Google's BigTable?
● BigTable is a large-scale, fault-tolerant, self-managing system that includes terabytes of memory and petabytes of storage. It can handle millions of reads/writes per second.
● BigTable is a distributed hash mechanism built on top of GFS. It is not a relational database; it doesn't support joins or SQL-type queries.
● It provides a lookup mechanism to access structured data by key. GFS stores opaque data, and many applications need data with structure.
● Commercial databases simply don't scale to this level, and they don't work across thousands of machines.
Document Stores
● As the name implies, these databases store documents.
● Usually schema-free: the same database can store many kinds of documents.
● Allow indexing based on document content.
● Prominent examples: CouchDB, MongoDB.
Why MongoDB?
● Document-oriented
– documents (objects) map nicely to programming-language data types
– embedded documents and arrays reduce the need for joins
– dynamically typed (schemaless) for easy schema evolution
– no joins and no multi-document transactions, for high performance and easy scalability
● High availability
– replicated servers with automatic master failover
● Rich query language
● Easy scalability
– automatic sharding (auto-partitioning of data across servers)
– eventually consistent reads can be distributed over replicated servers
● High performance
– no joins, and embedding makes reads and writes fast
– indexes, including indexing of keys from embedded documents and arrays
– optional streaming writes (no acknowledgements)
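The point that embedded documents and arrays reduce the need for joins can be illustrated with a document modeled as nested maps. This is a language-level sketch of the document data model, not MongoDB's API; the field names are invented:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the document model: a schema-free order document with its
// line items embedded as an array, where a relational design would use
// a separate child table joined on an order id.
public class DocumentStoreSketch {
    public static Map<String, Object> order() {
        Map<String, Object> doc = new HashMap<>();
        doc.put("_id", "order-1001");
        doc.put("customer", "Ada");
        // Embedded line items: one lookup fetches the whole order, no join.
        doc.put("items", List.of(
                Map.of("sku", "A1", "qty", 2),
                Map.of("sku", "B7", "qty", 1)));
        return doc;
    }

    public static void main(String[] args) {
        System.out.println(order().get("customer"));
    }
}
```

Because the items travel inside the document, the whole order can live on one shard and be read or written in a single operation – the same locality argument behind MongoDB's "no joins" performance claim.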
Mapping Systems to the CAP Theorem
● Consistency: all clients have the same view of the data.
● Availability: each client can always read and write.
● Partition tolerance: the system works well despite physical network partitions.
CA: RDBMS (MySQL, Postgres, etc.), Aster Data, Greenplum, Vertica
CP: BigTable, MongoDB, BerkeleyDB, Hypertable, Terrastore, MemcacheDB, HBase, Scalaris, Redis
AP: Dynamo, Cassandra, Voldemort, SimpleDB, Tokyo Cabinet, CouchDB, KAI, Riak
NoSQL use cases: important to align the data model to the requirements
● Bigness
● Massive write performance
● Fast key-value access
● Flexible schema and flexible data types
● Schema migration
● Write availability
● No single point of failure
● Generally available
● Ease of programming
Mapping new Internet data management technologies to the enterprise
● Enterprise data strategy is getting inclusive: from "No SQL" to NOSQL – "Not Only SQL".
● Open source rules!
● Hadoop infrastructure.
● What about support?
The Path to Data Stack 3.0: Must Support Variety, Volume and Velocity

Data Stack 1.0 – Relational Database Systems: recording business events; highly normalized data; GBs of data; end-user access through enterprise apps; structured.
Data Stack 2.0 – Enterprise Data Warehouse: support for decision making; un-normalized dimensional model; TBs of data; end-user access through reports; structured.
Data Stack 3.0 – Dynamic Data Platform: uncovering key insights; schema-less approach; PBs of data; end-user direct access; structured + semi-structured.
Can Data Stack 3.0 Address Real Problems?
● Large data volume at low price.
● Diverse data beyond structured data.
● Queries that are difficult to answer.
● Answers to queries that no one dared ask.
How does one go about the Big Data Expedition?
PERSISTENT SYSTEMS AND BIG DATA
Persistent Systems has an experienced team of Big Data experts that has created the technology building blocks to help you implement a Big Data solution offering a direct path to unlock the value in your data.
Big Data Expertise at Persistent
● 10+ projects executed with leading ISVs and enterprise customers.
● Dedicated group for MapReduce, Hadoop and the Big Data ecosystem (formed 3 years ago).
● Engaged with the Big Data ecosystem, including leading ISVs and experts.
● Preferred Big Data services partner of IBM and Microsoft.
Big Data Leadership and Contributions
● Code contributions to Big Data open-source projects, including Hadoop, Hive, and SciDB.
● Dedicated Hadoop cluster in Persistent.
● Created PeBAL – Persistent Big Data Analytics Library.
● Created a Visual Programming Environment for Hadoop.
● Created data connectors for moving data.
● Pre-built solutions to accelerate Big Data projects.
Persistent's Big Data Offerings
1. Setting up and maintaining a Big Data platform.
2. Data analytics on a Big Data platform.
3. Building applications on Big Data.

Technology assets:
● Foundational infrastructure and platform (built upon selected 3rd-party Big Data platforms and technologies; cluster of commodity hardware).
● Persistent platform-enhancement IP (PeBAL analytics library, data connectors, visual programming tools).
● Persistent pre-built horizontal solutions (email, text, IT analytics, …).
● Persistent pre-built industry solutions: retail, banking, telco.

People assets and methodology – Big Data custom services:
● Extension of your team.
● Discovery workshop; training for your team.
● Team formation process; cluster sizing/configuration.
Persistent Next Generation Data Architecture
(Legend: commercial/open-source products, Persistent IP, external data sources)
● Sources: email servers, IBM Tivoli, BBCA and web proxies through the connector framework; Twitter and Facebook through the social media connector; DW, NoSQL and RDBMS systems and the data warehouse.
● Platform: MapReduce and HDFS, with cluster monitoring, an admin app, workflow and integration; PIG/JAQL, text analytics (GATE/SystemT) and Hive on top.
● Persistent Analytics Library (PeBAL): graph functions, set functions, text analytics functions, and more.
● Outputs: solutions, BI tools, reports & alerts.
Persistent Big Data Analytics Library (PeBAL)
WHY PEBAL
● Lots of common problems – not all of them are solved in MapReduce.
● PigLatin, Hive and JAQL are languages, not libraries – something is needed to run on top that is not tied to SQL-like interfaces.
BENEFITS OF A READY-MADE SOLUTION
● Proven – well written and tested.
● Reuse across multiple applications.
● Quicker implementation of MapReduce applications.
● High performance.
FEATURES
● Organized as JAQL functions, PeBAL implements several graph, set, text extraction, indexing and correlation algorithms.
● PeBAL functions are schema-agnostic.
● All PeBAL functions are tried and tested against well-defined use cases.
PeBAL function areas: graph, set, text analytics, inverted lists, web analytics, statistics.
Visual Programming Environment
ADOPTION BARRIERS
● Steep learning curve.
● Difficult to code.
● Ad-hoc reporting can't always be done by writing programs.
● Limited tooling available.
VISUAL PROGRAMMING ENVIRONMENT
● Use a standard ETL tool as the UI environment for generating PIG scripts.
BENEFITS
● ETL tools are widely used in enterprises.
● Can leverage a large pool of skilled people who are experts in ETL and BI tools.
● The UI helps in iterative and rapid data analysis.
● More people will start using it.
Visual Programming Environment for Hadoop
● An ETL tool provides the data-flow UI; a PIG converter (Persistent IP) turns the flow's metadata into PIG code, backed by a PIG UDF library.
● Data moves from the data sources into HDFS/Hive on the Big Data platform, where the generated PIG scripts run.
Persistent Connector Framework
WHY A CONNECTOR FRAMEWORK
● Pluggable architecture.
OUT OF THE BOX
● Database, data warehouse
● Microsoft Exchange
● Web proxy
● IBM Tivoli
● BBCA
● Generic push connector for *any* content
FEATURES
● Bi-directional connector (as applicable).
● Supports push/pull mechanisms.
● Stores data on HDFS in an optimized format.
● Supports masking of data.
Persistent Data Connectors
Persistent's Breadth of Big Data Capabilities
Layers: horizontal and vertical pre-built solutions; Big Data platform (PeBAL analytics libraries and connectors); Big Data application programming; distributed file systems; cluster layer; IT management; tooling.

Tooling
● RDBMS/DWH connectors to import/export data.
● Text analytics libraries.
● Data visualization using Web 2.0 and reporting tools – Cognos, MicroStrategy.
● Ecosystem tools like Nutch, Katta, Lucene.
IT Management
● Job configuration, management and monitoring with BigInsights' job scheduler (MetaTracker).
● Job failure and recovery management.
Big Data Application Programming
● Deep JAQL expertise – JAQL programming, extending JAQL using UDFs, integration of third-party tools/libraries, performance tuning, ETL using JAQL.
● Expertise in MR programming – PIG, Hive, Java MR.
● Deep expertise in analytics – text analytics with IBM's text extraction solution (AQL + SystemT).
● Statistical analytics – R, SPSS, BigInsights integration with R.
Distributed File Systems
● HDFS
● IBM GPFS
Cluster Layer
● Platform setup on multi-node clusters, monitoring, VM-based setup.
● Product deployment.
Persistent Roadmap to Big Data
1. Learn – discover and define use cases.
2. Initiate – validate with a POC.
3. Scale – upgrade to production if successful.
4. Measure – measure effectiveness and business value.
5. Manage – improve the knowledge base and the shared Big Data platform.
Customer Analytics: Identifying Your Most Influential Customers
● Build a social graph of all customers.
● Overlay sales data on the graph.
● Identify influential customers using network analysis.
● Target these customers for promotions.
Targeting influential customers is the best way to improve campaign ROI!
Scale: 70 million customers; > 1 billion transactions over twenty years; a few thousand influential customers.
Overview of Email Analytics
● Key business needs
– Ensure compliance with a variety of business and IT communications and information-sharing guidelines.
– Provide an ongoing analysis of customer sentiment through email communications.
● Use cases
– Quickly identify whether there has been an information breach, or whether information is being shared in ways that are not in compliance with organizational guidelines.
– Identify whether a particular customer is not being appropriately managed.
● Benefits
– Ability to proactively manage email analytics and communications across the organization in a cost-effective way.
– Reduced response time to manage a breach, and proactive handling of issues that emerge through ongoing analysis of email.
Using Email to Analyze Customer Sentiment
● Sense the mood of your customers through their emails.
● Carry out detailed analysis of customer-team interactions and response times.
Analyzing Prescription Data
1.5 million patients are harmed by medication errors every year.
Identifying erroneous prescriptions can save lives! (Source: Center for Medication Safety & Clinical Improvement)
Overview of IT Analytics
● Key business needs
– Troubleshooting issues in the world of advanced and cloud-based systems is highly complex, requiring analysis of data from various systems.
– Information may be in different formats, locations, granularities and data stores.
– System outages have a negative impact on short-term revenue, as well as long-term credibility and reliability.
– The ability to quickly identify that a particular system is unstable and take corrective action is imperative.
● Use cases
– Identify security threats and isolate the corresponding external factors quickly.
– Identify that an email server is unstable, determine the priority, and take preventive action before a complete failure occurs.
● Benefits
– Reduced maintenance cost.
– Higher reliability and SLA compliance.
Consumer Insight from Social Media
Find out what customers are saying about your organization or products on social media.
Insights for Satyamev Jayate – Variety of Sources
1. Structured analysis: responses to the pledge and multiple-choice questions, via web, emails and IVR/calls.
2. Unstructured analysis: responses to open questions – share your story, ask Aamir a question, send a message of hope, share your solution – from individual blogs, social widgets, videos and more, processed by the Content Filtering Rating Tagging System (CFRTS) with L0, L1, L2 phased analytics.
3. Impact analysis: crawling the general internet to measure the before & after scenario on a particular topic.
Channels: IVR, SMS, and web/social media (structured and unstructured), from web/TV viewers.
Rigorous weekly operation cycle producing instant analytics – a killer combo of human + software to analyze the data efficiently:
● Topic opens on Sunday.
● Data is captured from SMS, phone calls, social media and the website.
● The system runs L0 analysis; analysts continue with L1 and L2.
● JSONs are created for the external and internal dashboards.
● A live analytics report is sent during the show.
● Featured content is delivered three times a day throughout the week.
● Episode tags are refined and messages are re-ingested for another pass.
Thank you
Anand Deshpande ([email protected])
http://in.linkedin.com/in/ananddeshpande
Persistent Systems Limited
www.persistentsys.com
Enterprise Value is Shifting to Data
Mainframe → Operating Systems → Database → ERP → Apps → Data
(1975 … 1985 … 1995 … 2006 … 2013, along the line of diminishing value)