©2011 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.
Present and Future of Enterprise BI
January 17, 2013
Prepared for DAMA
Agenda
1. DB engines for BI, contrasted with OLTP DB engines
   • Row DB, column DB, indexed/non-indexed, in-memory
   • Contrasting the BI workload with the OLTP/ERP workload
   • Our experience
2. MPP systems for BI: shared-nothing and shared-disk
   • IQ, HANA, Exadata, Netezza, MySQL
   • Our experience
3. Including unstructured data ("Big Data") in BI
   • Our experience
4. Designing the "dream" BI system
5. Comparing various "real" and "dream" BI systems
   • Facts and our experience
6. Brief EDMT BI overview
7. Open discussion
Background
BMMsoft offers consulting and BI products/solutions for:
• DW/DM and ETL on Sybase IQ, HANA, Oracle, Exadata, Netezza, MySQL
• Extending enterprise BI with Big Data
• HA, SLA ("how many nines?"), DR, and B/R solutions for BI systems
• Scale and speed: "World's Largest DW" (2002, 2004, 2007) and "World's Fastest Data Loader" (2011, 330 TB/day on HP DL980)
Paul Krneta, CTO of BMMsoft:
• 20 years of industry experience in computer and database technology and architecture
• CTO of Sybase IQ, 2000-2007
  • Architected the MPP option for Sybase IQ ("IQ Multiplex")
  • Designed NonStopIQ (HA, DR, and B/R for the VLDB version of Sybase IQ)
  • Optimized IQ for VLDB; certified IQ three times as the "World's Largest Data Warehouse":
    • 2002: 48 TB (200 B rows)
    • 2004: 150 TB (1 T rows)
    • 2007: 1,030 TB (1 PB in 6 T rows) of structured and (optionally) unstructured data
• Technical Director for DB Technology at Digital Equipment (DEC), 1994-2000
  • Designed the first in-memory DB: Oracle VLM Option ("Very Large Memory"), 1995
  • 1 TB/hour live backup of Oracle, Sybase, Informix, SQL Server, and Adabas, 1995-1996
DB LANDSCAPE: BI VS. OLTP
Categories of DBs for BI
Different DB architectures:
1. R = row-oriented DB
2. C = columnar DB
3. H-RC = hybrid row + columnar DB
4. Compression
5. NI = non-indexed DB
6. I = indexed DB
7. MPP-SN = shared-nothing DB
8. MPP-SD = shared-disk DB
9. In-memory DB
10. SQL, NoSQL, object, and KV-pair DBs
11. ACID and non-ACID ("unreliable") DBs
12. HA, DR, B/R, Test/Dev
13. BLOB storage: in-row/column, separate store, or external BLOB
14. Text search: in-DB or external
15. UDFs
16. Storage efficiency ("green")
Types of queries:
1. Pin-point query
   • Interested in a small number of rows selected from billions or trillions of rows (e.g., call center, ATM)
2. Analytic query
   • Analyzes millions, billions, or trillions of rows (1%-100% of the entire DB)
3. Mixed search of structured + unstructured data
   • A single query cross-searches SQL rows and text
   • Best: both parts run in a single engine
   • Worst: two engines (one SQL, one text), a relic of the "divided SQL/text world"
4. Text search/analytics
5. OLTP (heavy updates)
1-5 day sessions: “BI for Today and Tomorrow”, “NonStopIQ bootcamp”, “BI Assessment” http://download.sybase.com/presentation/TW2005/AM21.pdf
Quick overview of “original” row DB
1. A record ("row") has multiple fields (e.g., date, name, amount)
   1) Fields of a row are placed next to each other (on disk and in RAM)
   2) Each field (typically) has a single value
   3) The order of fields in the DDL is (mostly) irrelevant
   4) To get to the Nth field in a row, the DB "scans" each of the previous N-1 fields (see the sketch below)
2. A DB page contains multiple (unrelated) records
   1) The DB page is the unit of storage management, I/O, and caching in RAM
   2) A row (typically) can't span multiple pages, which limits the number of fields and the row length
3. ACID is applied all the time
4. Locking is at the record level
5. A small number of fields are indexed
[Figure: row DB page layout. A DB page ("block") of 2-32 KB holds rows 1, 2, 3, 4, ..., 100; example schema: CREATE TABLE ABC (yellow, blue, red, magenta); example query: SELECT SUM(red) FROM ABC]
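To make point 1.4 concrete, here is a minimal Python sketch (illustrative only; real engines use binary page formats and offset tables): with variable-length fields packed side by side, reaching the Nth field means walking past every field before it.

```python
# Minimal sketch of a row-store layout: each row is one byte string of
# length-prefixed, variable-width fields packed next to each other.
import struct

def pack_row(*fields):
    """Pack string fields into one contiguous row: [len][bytes][len][bytes]..."""
    out = b""
    for f in fields:
        data = f.encode()
        out += struct.pack("<I", len(data)) + data
    return out

def read_field(row, n):
    """To reach field n, the engine must skip over every field before it."""
    offset = 0
    for _ in range(n):
        (length,) = struct.unpack_from("<I", row, offset)
        offset += 4 + length                  # skip length header + payload
    (length,) = struct.unpack_from("<I", row, offset)
    return row[offset + 4 : offset + 4 + length].decode()

page = [pack_row("2013-01-17", "Alice", "42.50"),   # one "DB page" holding
        pack_row("2013-01-18", "Bob", "17.00")]     # multiple unrelated rows

print(read_field(page[0], 2))   # "42.50" -- walked past date and name first
```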
Row-DB vs. Columnar DB
[Figure: the same table ABC (yellow, blue, red, magenta) stored in a row DB vs. a columnar DB; query: SELECT SUM(red) FROM ABC]
1. Both use ANSI SQL and ODBC/JDBC
2. Column structure (invisible to apps and admins)
   • Reduces I/O by 90-99% (eliminates full-table scans; sketched below)
   • Flexible schema: add/remove columns on the fly
   • Wide tables allow simple, rich schemas (e.g., 42,000+ columns)
   • Large I/O can use large (400+ GB), low-cost disks
   • Great match for BLOB data (images, video, email, documents...)
3. All row DBs have indices (they are almost unusable without them)
4. Column DBs with indices: bitmap, bit-wise, text, and more
   • Column + index queries run 2x-1,000x faster than in a "classic" DBMS
   • Fast to load, small size, full data statistics
5. Data compression: ~90% cost reduction
   • A row DB is 4x-10x larger than a column DB
   • Disks for a row DB cost 8x-20x more
   • Fast, no fragmentation, always on, no LVM or FS needed
6. Multi-node
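A hedged sketch of why SELECT SUM(red) favors the columnar layout (a pure-Python model of the two layouts above, not any vendor's format): the row store must touch every field of every row, while the column store reads only the one column the query needs.

```python
# The same four-column table ABC in the two layouts.
rows = [("y1", "b1", 10, "m1"),   # row store: fields of a row are adjacent
        ("y2", "b2", 20, "m2"),
        ("y3", "b3", 30, "m3")]

columns = {                        # column store: values of a column are adjacent
    "yellow":  ["y1", "y2", "y3"],
    "blue":    ["b1", "b2", "b3"],
    "red":     [10, 20, 30],
    "magenta": ["m1", "m2", "m3"],
}

# SELECT SUM(red) FROM ABC
row_answer = sum(r[2] for r in rows)   # touches all 4 fields of all 3 rows
col_answer = sum(columns["red"])       # touches 1 of 4 columns: ~75% less I/O
assert row_answer == col_answer == 60
```

With hundreds of columns (the slides mention tables with 42,000+), the fraction of the table a single-column aggregate must read shrinks accordingly.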
BI/DW vs. OLTP
[Chart: speed and scalability (# of users & data size), log scale 1-10,000; in-memory DB (HANA) at the top; VLDB at the high end of data size]
OLTP = simple queries
• "Touch/update" tens of rows per query
• A query takes seconds and few resources
• Simple SQL statements

DSS = complex queries
• "Touch" thousands to millions (even billions to trillions) of rows
• A query takes seconds to hours to finish
• Complex (10-page) SQL statements
• The DB is (typically) 10x larger than an OLTP DBMS
To Index or Not to Index?
1. Row DB: an index is critical to avoid slow, costly full-table scans
   • Reduces I/O by 90-99% (eliminates full-table scans)
2. Column DB without indices
   • Every query scans column(s): slow, with heavy I/O and CPU load
   • Complex queries scan many columns (= much of the DB)
   • May be faster to load (but not by much)
   • Uses less space (but needs faster disks for the scans)
3. Column DB with indices: bitmap, bit-wise, text, and more (see the bitmap sketch below)
   • Many queries use the index only (= fast, low I/O and CPU use)
   • Indices carry statistics about the data = better query execution plans (QEP)
   • No scans = reduced I/O
   • Large I/O = can use large (4 TB), low-cost disks ($400/TB)
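As an illustration of point 3 above, here is a toy bitmap index in Python (an assumption-laden sketch: low-cardinality columns and no compression; production bitmap indices add run-length encoding and more). A conjunctive predicate is answered with one bitwise AND and a popcount, with no table scan at all.

```python
# Toy bitmap index: one bit-vector per distinct value of a column.
from collections import defaultdict

def build_bitmap_index(column):
    index = defaultdict(int)
    for row_id, value in enumerate(column):
        index[value] |= 1 << row_id    # set bit row_id in that value's bitmap
    return index

status = ["open", "closed", "open", "open", "closed"]
region = ["EU",   "EU",     "US",   "EU",   "US"]
idx_status = build_bitmap_index(status)
idx_region = build_bitmap_index(region)

# SELECT COUNT(*) WHERE status = 'open' AND region = 'EU'
hits = idx_status["open"] & idx_region["EU"]   # one bitwise AND, no scan
print(bin(hits).count("1"))                    # 2  (rows 0 and 3 qualify)
```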
BI: Reporting vs. Advanced (Ad-hoc)
[Chart: speed and scalability (# of users & data size), log scale 1-10,000; column DB with index and in-memory DB (SAP HANA) at the top; VLDB at the high end]
Reporting
• "Interested" in many rows per query
• Predictable queries

Advanced (ad-hoc) queries
• "Touch" thousands to millions (even billions to trillions) of rows
• A query takes seconds to hours to finish
• Unpredictable, complex queries
BI: Data Scalability
[Chart: speed and scalability (# of users & data size) vs. DB size (TB, PB) and # of columns, log scale 1-10,000; column DB with index and in-memory DB (HANA) scale furthest, ahead of "row" DBs and "row" DBs with HW "column" filters]
BI: Resource Consumption
[Chart: resource usage (CPU, RAM, IOPS, and bandwidth) vs. DB size (TB, PB) and # of columns, log scale 1-10,000]
BI: Speed and Efficiency
[Chart: performance and resource efficiency (CPU, RAM, I/O), log scale 1-10,000, for predictable/static vs. unpredictable data and queries; column DB with index and in-memory DB (HANA) at the top]
In-memory DB (OLTP & BI): SAP HANA
1. HANA: ANSI SQL and ODBC/JDBC
2. HANA: compression is always on, 5:1-20:1 (dictionary-encoding sketch below)
   • A single HANA server (4 TB RAM) can hold 15-60 TB of data
   • No transactional I/O to disk (except the log file and start/stop)
   • Row or column store, chosen at the table level
3. HANA is much more than a DB cache in RAM
   • Data access is optimized for RAM
   • Supports multi-node configurations
   • 100s to 1,000s of times faster than a "standard" on-disk row DB
4. HANA: an in-RAM DB for 0.1-50 TB of data (even more)
   • Good fit for complex, real-time BI/OLTP/ERP workloads
   • Benefits from cheap/big RAM and fast CPUs
   • Pricey ("too fast"?) for huge "warm/cold" data (100+ TB?)
   • HANA + IQ = a good mix of in-memory and on-disk DB
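The 5:1-20:1 compression claim rests largely on dictionary encoding; a minimal sketch of the idea (illustrative only; HANA's actual encodings are more elaborate): each distinct value is stored once, and the in-RAM column becomes an array of small integer codes that queries can operate on directly.

```python
# Dictionary-encode a repetitive column: store each distinct value once,
# keep only a small integer code per row.
column = ["DE", "US", "DE", "DE", "FR", "US"] * 100_000

dictionary = sorted(set(column))                   # ["DE", "FR", "US"]
code_of = {v: i for i, v in enumerate(dictionary)}
codes = bytes(code_of[v] for v in column)          # 1 byte per row here

raw_bytes = sum(len(v.encode()) for v in column)
print(raw_bytes / len(codes))   # 2.0x here; long, repetitive values do far better

# Predicates run directly on the codes, e.g. COUNT(*) WHERE country = 'DE':
print(codes.count(code_of["DE"]))                  # 300000
```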
MASSIVELY PARALLEL PROCESSING (MPP, "DIVIDE AND CONQUER"): SHARED-NOTHING VS. SHARED-DISK
3 ways to add more CPU and storage
There are 3 ways to add more CPU power and storage:
1. Use a larger server (more CPUs, RAM, I/O channels)
   1. Limited by the size of the largest SMP server (128 cores, maybe 512 cores)
   2. Can be expensive
   3. HA and DR can be expensive
2. Divide the data into many small partitions (MPP Shared-Nothing, or MPP S-N); see the scatter-gather sketch after this list
   1. Add a server ("node") to "own" and process each data partition
   2. Node = server + data "slice": adding a server requires adding storage, and vice versa
   3. A query has to be spread to every node
   4. Results have to be collected and merged
   5. Simple to implement, but has some drawbacks
3. Many servers access shared data and process it (MPP Shared-Disk, or MPP S-D)
   1. Optimal for an indexed column DB because of its low I/O
   2. Difficult to implement, but smart and flexible to use
   3. Suboptimal for row DBs, scanning DBs, or storage HW filters: all need heavy I/O
   4. Servers can be added without affecting storage
   5. Storage can be added without affecting servers
   6. Architectural HA: a server crash does not affect data access
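A toy Python sketch of the shared-nothing "divide and conquer" flow in option 2 (a single process standing in for a cluster; the partitioning scheme and names are illustrative): rows are hash-partitioned across nodes, the query is shipped to every node, and the partial results are gathered and merged by a coordinator.

```python
# Toy MPP shared-nothing: each "node" owns one hash partition of the rows.
N_NODES = 4
nodes = [[] for _ in range(N_NODES)]

for key, amount in [(1, 10.0), (2, 5.0), (3, 7.5), (4, 2.5), (5, 20.0)]:
    nodes[hash(key) % N_NODES].append((key, amount))   # partition by key

def node_query(partition):
    """Runs locally on each node, against only its own slice of the data."""
    return sum(amount for _, amount in partition)

# Coordinator: scatter the query to every node, then gather and merge.
partials = [node_query(p) for p in nodes]
print(sum(partials))   # 45.0 -- SELECT SUM(amount), computed in pieces
```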
MPP S-N (“shared nothing”)
[Diagram: five shared-nothing nodes; each server (A-E) exclusively owns its own set of five 36 TB arrays]
Add/remove node: takes significant time; a new node starts "empty" and data must be redistributed from the other nodes. Add storage: takes significant time; must take data from other nodes. Remove storage: hours/days; data must be redistributed to the other nodes. (A toy illustration of this redistribution cost follows below.)
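A toy illustration of why that redistribution is expensive (naive modulo placement is assumed here; consistent hashing reduces, but does not eliminate, the movement): growing the cluster changes the home node of most keys, so their rows must physically move.

```python
# Adding a node to a hash-partitioned shared-nothing cluster: the placement
# of almost every key changes, so its data must move between nodes.
keys = range(10_000)

before = {k: hash(k) % 4 for k in keys}   # 4-node cluster
after  = {k: hash(k) % 5 for k in keys}   # grown to 5 nodes

moved = sum(before[k] != after[k] for k in keys)
print(f"{moved / len(before):.0%} of rows must be redistributed")   # 80%
```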
MPP S-D (“shared disk”)
Scalable performance and data, flexible configuration
[Diagram: five DL980 servers (A-E), all connected through an FC switch to the same shared pool of 36 TB arrays]
Add/remove server: < 1 min. Add storage: < 1 min. Remove storage: < 1 min (*).
MPP: S-N vs. S-D
Current BI and Big Data: servers + storage are "sold together"
[Diagram: MPP S-D shown covering the small-data/many-CPUs corner]

MPP S-D (indexed) vs. MPP S-N: flexibly combining storage and servers
[Diagram: MPP S-N sits at one fixed server-to-storage ratio, while MPP S-D (indexed) can be configured high-CPU/low-data, high-CPU/high-data, low-CPU/high-data, or low-CPU/low-data]
MPP S-N, S-D, C-non-indexed and C-indexed: Sybase IQ/EDMT 4XL (Full Rack)
• MPP Shared-Disk
• 160 Intel E7-4870 cores (2.4 GHz); no need for a HW filter
• 100+ TB/sec (indexed, no scans)
• 30+ TB/hr load rate
• 432 TB of storage, expandable to 1,000+ TB
• Scales to 96 racks (500+ custom) and 15,360 cores (700,000+ custom)
• On-line addition or removal of nodes (S-N, by contrast, requires reorganization/repartitioning of data when nodes are added or removed)
http://www.zdnet.com/blog/btl/emcs-launches-greenplum-appliance/40281
MPP S-N: HANA (in-memory DB)
[Diagram: HANA as an MPP shared-nothing cluster of servers A-E, each holding its own slice of the data in RAM]
ADDING UNSTRUCTURED DATA TO BI, STORING TBS AND PBS OF DATA, TEXT SEARCH
Adding unstructured data to BI: Load/Storage and Cross-Analysis
Problem 1: Load and Store
Load + store = too much for IT
1. Volume = too big: 100s of TB, multi-PB
2. Volume = too many items: billions and trillions
3. Variety = too many different data types
4. Velocity = slow loading and indexing of data
5. The cost of data storage is high

Problem 2: Cross-Analysis
No cross-analysis of SQL and text
1. BI = SQL analysis only (no text)
2. Text analysis = text only, no SQL
3. No cross-analysis of SQL and text data (at large scale)
Storing 1 PB in Hadoop (default config)
• Hadoop node: 8 TB of data per node (24 TB raw, with 3x copies)
• Node = 8-core Xeon, 16 GB RAM, 12x 2 TB disks, 2 RU = $4K
• HW = 125 nodes (6 racks), 3 PB raw, 1,000 disks = $500,000
• Power = 125 kW (incl. A/C) = $109,500/year (at $0.10/kWh)
• ~600 tons of CO2 per year (≈ 120 cars)
(The sizing arithmetic is re-derived below.)
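The node count and the power bill follow from simple arithmetic; a short re-derivation using the slide's own assumptions (8 TB usable per node after 3x replication, a 125 kW draw including cooling, $0.10/kWh):

```python
usable_tb = 1_000            # 1 PB of user data to store
tb_per_node = 8              # usable TB per node (24 TB raw / 3x replication)
print(usable_tb / tb_per_node)        # 125.0 nodes

kw = 125                     # cluster draw incl. A/C, per the slide
hours_per_year = 24 * 365
print(kw * hours_per_year * 0.10)     # 109500.0 -> $109,500/year at $0.10/kWh
```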
Storing 10 PB in Hadoop (default config)
• 1,200 servers, 12,000 disks, 60 racks, $5M ($4K/node)
• 1,200 kW = $1.1M/year in electricity (at $0.10/kWh)
• ~6,000 tons of CO2 per year (≈ 1,200 cars)
OPERATIONS: HA, B/R, DR, UPGRADES, LIFECYCLE, AND MORE
HA, DR and Backup/Restore?
1. HA and DR are tricky for MPP S-N
2. MPP S-D handles HA, failures, and change more easily, but still needs a plan
3. Text engines: HA, DR and B/R for BI engines is an afterthought
4. Tapes? Not a good medium: very slow
[Table (lost in extraction): uptime ("nines") vs. downtime per year; see the sketch below]
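The "how many nines?" mapping is pure arithmetic, reproduced here as a short sketch in place of the table lost in extraction:

```python
# Allowed downtime per year for each availability level ("nines").
HOURS_PER_YEAR = 24 * 365

for uptime in (0.99, 0.999, 0.9999, 0.99999):
    downtime_min = HOURS_PER_YEAR * (1 - uptime) * 60
    print(f"{uptime:.3%} uptime -> {downtime_min:10.1f} min/year of downtime")
# 99% -> ~87.6 h; 99.9% -> ~8.8 h; 99.99% -> ~53 min; 99.999% -> ~5.3 min
```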
Specialists in HA/DR for MPP
1. Some of the world's largest DWs use NonStopIQ
2. Zero-downtime backup
3. Near-zero-downtime restore
4. Full DR and HA
5. A storage cost of $400/TB (HP P2000 MSA) opens new possibilities
6. Tapes? Should you even bother when storage costs $400/TB?
Building Large BI since 2002
DESIGNING THE “DREAM” BI
Dream BI
1. A fast, scalable, and flexible BI engine
   1. Speed: query and data-loading speed
   2. Scales well with data volume, query complexity, and # of users
   3. Flexible configuration: add/remove storage and servers as needed
   4. Compatible with 3rd-party enterprise reporting and analytic tools
2. Integrates rich text search into BI queries
   1. Easy, cost-free inclusion of text search in BI analytics
   2. Fast loading of text data, without jeopardizing existing SQL data
3. Able to store large volumes of structured and unstructured data
   1. A "deep history" of SQL data and unstructured data
4. HA, DR, B/R, ACID, flexibility, etc.
5. Price: affordable and comparable with open source
BI/DW Analytics + Text Search & Analytics + Big Data Store ("Archive") = Dream BI Solution
EDMT SOLUTION
Terminology
EDMT stands for:
• Emails (any type of communication: email, SMS, Skype...)
• Documents (100s of file and document formats)
• Multimedia (images, audio, video, and more)
• Transactions ("standard" DB records)
EDMT Solution: A Pragmatic Approach to Data

Store data: the EDMT Solution stores emails, SMS, documents, multimedia, and DB transactions in an RDBMS (e.g., IQ) for data retention and mixed BI + text analysis.

SQL + text analysis of all data: EDMT cross-analyzes all data using SQL + text analysis to run fraud detection, e-Discovery, CRM, audit, GRC, BI, etc., 10x, 100x, or 1,000x faster than before. (A toy sketch of such a cross-query follows below.)
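Conceptually, such a cross-query is a join between structured rows and a full-text hit list, executed inside one engine. A toy Python sketch of the idea (all names and data here are hypothetical, not EDMT's API):

```python
# Toy single-engine cross-analysis: one "query" touching DB rows AND text.
transactions = [                      # structured side (hypothetical rows)
    {"txn": 1, "account": "A-17", "amount": 9_900.0},
    {"txn": 2, "account": "B-02", "amount": 120.0},
]
emails = [                            # unstructured side (hypothetical docs)
    {"account": "A-17", "body": "please split the wire below the reporting limit"},
    {"account": "B-02", "body": "thanks for lunch"},
]

# Text predicate: which accounts have a matching phrase anywhere?
suspicious = {e["account"] for e in emails if "split the wire" in e["body"]}

# SQL-style predicate AND text predicate, evaluated together:
for t in transactions:
    if t["amount"] > 5_000 and t["account"] in suspicious:
        print("flag for fraud review:", t)    # only txn 1 qualifies
```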
BI/DW Analytics + Text Search & Analytics + Big Data Store (Archive) = Dream BI Solution? EDMT solves what others cannot.
EDMT: Big Data 2.0
Innovating data technology
• Enables BI systems to store and analyze unstructured data
• Broad DB support, covering all DB architectures:
  • R = row-oriented
  • C = columnar
  • I = indexed
  • NI = non-indexed
  • SD = MPP shared-disk
  • SN = MPP shared-nothing
• OS:
  • Certified: Linux, HP-UX (incl. Poulson)
  • Verified: AIX, Solaris, Windows
• Supported DBs: SAP Sybase IQ (C-I-SD), SAP Sybase ASE (R-I-SD), Oracle RAC (R-I-SD), Netezza (R-F-NI-SN), Oracle Exadata (R-F-I-SD), MySQL (R-I), SAP HANA (RC-I-SN, Q1 '13)
2007: 1 Petabyte EDMT Solution (EDMT Big Data 2.0)
• 1 PB of data (= 6 trillion rows) loaded and indexed
• Loading speed: 285 B rows per day (= 35 TB/day)
• Load latency: < 2 sec
• Pin-point search of 6 T rows = 0.5 sec
• DB = Sybase IQ
2012: 1 PB + new HW = a PB for the masses
[Photo: 40-core DL980; half of the rack is empty]
• Same data capacity and speed as the 2007 "1 PB" system
  1. 1/15th the physical size, cost, electricity, and weight
  2. Deploys in 1 week
  3. 288 TB of raw storage (~$115,000 at $400/TB)
  4. 40-core Xeon Linux server
• Price: SW + HW = ~$500,000; amount of data stored = 1,030 TB; $/TB of data = ~$480
EDMT Architecture
Innovating data technology
[Diagram: EDMT architecture]
• ETL and application servers (Linux, HP-UX): real-time ETL parser (ingest), metadata manager, parallel loader; ETL storage
• Database servers (Linux x86): IQ, HANA, Oracle, Exadata, Netezza, MySQL; DB storage; EDMT server and EDMT HW
• EDMT modules: data management, access control, alerts, auto-classification, collaboration, taxonomy, data retention, connectivity, search API; EDMT API & connectors
• EDMT data access & analysis layer: EDMT GUI, web services, data export, mobile GUI proxy, eDiscovery/audit/fraud modules, social network analysis
2012: 1 Petabyte EDMT for the masses
• Out-of-the-box features of EDMT:
  1. Enterprise BI engines (IQ/HANA: SQL, ACID)
  2. Connectors for Business Objects, Cognos, etc.
  3. Complex data reporting and visualization
  4. eDiscovery, litigation hold, audit, compliance
  5. Full-text, proximity, and dictionary search
  6. FINRA post-review and random-sampling workflow
  7. Cross-analysis of structured + unstructured data
  8. Email + file archive, indexing & auto-categorization
  9. Multimedia archiving, indexing, and auto-categorization
  10. DB record analytics and archiving
  11. Retention, WORM, and records management
• Price: SW + HW = ~$500,000, or ~$480/TB of data
BI/DW Analytics + Text Search & Analytics + Big Data Store (Archive) = Dream BI Solution
[Photo: 40-core DL980; half of the rack is empty]
EDMT systems
EDMT Big Data Appliance: Certified and Pre-Configured (and beyond...)

EDMT® Solution Models and Specifications

#  Model [size]       Cores   Disk (TB)   Emails & files,          Emails & files,        DB rows
                                           store+index (100 KB)     index only (100 KB)    (150-byte)
7  16K [96 racks]     15,360   41,472      180 B                    1,800 B                640 trillion
6  4K  [24 racks]      3,840   10,368       48 B                      480 B                160 trillion
5  1K  [6 racks]         960    2,592       12 B                      120 B                 42 trillion
4  4XL [full rack]       160      432        2 B                       20 B                  7 trillion
3  PB  [1/2 rack]         80      288      1.6 B                       16 B                  6 trillion
2  XL  [1/3 rack]         40      144      600 M                        6 B                  2 trillion
1  L   [1/4 rack]         24       72      300 M                        3 B                  1 trillion
-  M   [2 RU]             12       36      150 M                      1.5 B                500 B
-  S   [2 RU]              6       36      150 M                      1.5 B                500 B
-  XS  [2 RU]              4       36      150 M                      1.5 B                500 B

(Models group into Entry, Mid, and High tiers; see the following slides.)

Configuration Rule 1: two or more EDMT® Solutions can be combined into one larger EDMT Solution.
Configuration Rule 2: storage can grow in 36 TB increments (one "Array" = $14,000, or $400/TB).
EDMT Big Data Appliance: Certified and Pre-Configured
• Entry level: hardware valued at $125,000 to $250,000 (US list price)
  • Larger config: 1.6 B emails & files (store + index, 100 KB each), 16 B (index only), 6 trillion DB rows (150-byte)
  • Smaller config: 600 M emails & files (store + index), 6 B (index only), 2 trillion DB rows
EDMT Big Data Appliance: Certified and Pre-Configured
• Mid level: hardware valued at $350K to $2,000,000 (US list price)
  • Larger config: 12 B emails & files (store + index, 100 KB each), 120 B (index only), 42 trillion DB rows (150-byte)
  • Smaller config: 2 B emails & files (store + index), 20 B (index only), 7 trillion DB rows
EDMT Big Data Appliance: Certified and Pre-Configured
• High level: hardware valued at $8M (US); the 4x-larger 16K system at $30M (US)
  • 4K config: 48 B emails & files (store + index, 100 KB each), 480 B (index only), 160 trillion DB rows (150-byte)
  • 16K config: 180 B emails & files (store + index), 1,800 B (index only), 640 trillion DB rows
EDMT Big Data Appliance: Certified and Pre-Configured
• Highest level: EDMT supports up to 12,000 nodes
Federated EDMT using IQ and HANA
[Diagram: EDMT 1 PB on IQ (1 server / 80 cores / 1 TB RAM; 1/2 rack, 288 TB of disks; IQ on HP-UX or a DL980; ~$500K HW+SW) federated through a switch with a 1 PB HANA cluster (12+ racks, price TBD) and its disks]
• 1 PB of raw data, 6 trillion rows, star schema
• Load: 285 B rows/day
• Search of 6 T rows = 0.5 sec
• 50 concurrent streams
More info about the 1 PB HANA system: http://www.saphana.com/community/blogs/blog/2012/11/12/the-sap-hana-one-petabyte-test
EDMT: Federated IQ/HANA vs. Size/Speed
[Chart: speed (log scale 1-10,000; low/med/high) vs. data size (small < 100 TB, medium 100 TB to 1 PB, large 1+ PB) for EDMT on IQ, EDMT on HANA, and EDMT on HANA+IQ]
Multi-site DR with NonStopEDMT (2010)
[Diagram: primary site and remote DR site]
• Server 1: 8-core Xeon, Linux; internal 10.26.51.61 [hqetl01], external 216.207.70.33; ports: SMTP & HTTP
• Server 2: 8-core Xeon, Linux; internal 10.26.51.65 [hqetl02], external 216.207.70.32; ports: SMTP & HTTP
• Server 3: PowerExpress 520, AIX; internal 10.26.51.62 [hqiq01]; hosts IQ 1 (IQ 2 and IQ 3 at the remote site)
• EDMT nodes: Node 1 10.26.51.35 [hqatg05], Node 2 10.26.51.36 [hqatg06], Node 3 10.26.51.37 [hqatg07]; EDMT 1 and EDMT 2 local, EDMT 3 remote
• SAN with staging areas Staging_1 through Staging_4
Storing 1 PB in EDMT & Hadoop
[Diagram: EDMT 1 PB (1/2 rack): ~$450K, 10 kW ($9K/year). Commodity building blocks: 96 TB of storage at $20,090 ($209/TB); 8-core Xeon server at $1,172; 16-core Xeon server at $4,860]
CUSTOMER SUCCESS STORIES
EDMT Success Story 1: Global Telecom & ISP (US)
• Challenge: SQL DW of structured data (CDR, SMS, billing)
• Solution, step 1: store 16 B CDRs + SMS per day with EDMT (= 85% of the world's SMS data)
• Solution, step 2: enable cross-correlation of CDR data with fully indexed text content
• Benefit: create new services for 900+ telco carriers
EDMT Success Story 2: University Research Clinic and Hospital
• Challenge: email and file archive with text search
• Solution, step 1: store, search, and apply retention to all emails/SMS/IM, with collaboration
• Solution, step 2: add patient insurance payment data and cross-analyze
• Benefit: a full 360-degree view of patients, carriers, and physicians
EDMT Success Story 3: Taxation Office of a European Country
• Challenge: SQL DW database-consolidation project
• Solution, step 1: consolidate 30 years of SQL records for 10 M taxpayers
• Solution, step 2: capture audit data (emails, voicemails, faxes, etc.) for audited taxpayers, for audit, litigation, and compliance purposes
• Benefit: at ZERO extra cost, the tax office gets a 360-degree customer view
EDMT Success Story 4: EU Country Intelligence Agency
• Challenge: email/SMS/IM archive and text search
• Solution, step 1: load and cross-correlate huge volumes of email/SMS to prevent cybercrime, online attacks, web fraud, and digital threats; loads 20+ TB of data per day with real-time, sub-second searches
• Solution, step 2: store financial and travel data (= SQL) and cross-correlate it with emails and SMS in real time; 1,000+ TB (1 PB) in size
• Benefit: previously impossible real-time monitoring and actionable intelligence
COMPARISONS AND SIZING RULES
Price per TB of User Data (Compressed)
EDMT Solution: three-year cost per COMPRESSED TB of user data < $3,000
Download the entire document from:
ftp://public.dhe.ibm.com/software/data/sw-library/infosphere/analyst-reports/ITG-ISAS-Exadata-Teradata.pdf
2012: 1 PB + new HW = a PB for the masses (recap)
[Photo: 40-core Linux server; half of the rack is empty]
• Same data capacity and speed as the 2007 "1 PB" system: 1/15th the physical size, cost, electricity, and weight; deploys in 1 week; 288 TB of raw storage (~$115,000 at $400/TB); 40-core Xeon Linux server (HP DL980)
• Price: SW + HW = ~$500,000; data stored = 1,030 TB; ~$480/TB of data
EDMT 1 PB vs. Hadoop 1 PB (default 3-copy config)
• Hadoop 1 PB (default config): 8 TB of data per node (24 TB raw, with 3x copies); node = 8-core Xeon, 16 GB RAM, 12x 2 TB disks, 2 RU = $4K; HW = 125 nodes (6 racks), 3 PB raw, 1,000 disks = $500,000; power = 125 kW (incl. A/C) = $109,500/year (at $0.10/kWh); ~600 tons of CO2/year (≈ 120 cars)
• EDMT 1 PB (1/2 rack): ~$500K; 10 kW ($9K/year); 50 tons CO2/year (≈ 10 cars)
AMAZON Cloud: 288 TB of storage ("PB")
1. Four monthly payments for cloud storage may pay for 288 TB of EDMT storage; the other 44 months (of a typical 48-month HW cycle) are free
2. The savings could be significant
(An illustrative break-even calculation follows below.)
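An illustrative break-even calculation behind the "4 monthly payments" claim, under assumptions consistent with the slide ($400/TB for EDMT storage, and an assumed ~$100 per TB-month for cloud storage at list prices of the era):

```python
tb = 288
edmt_storage = tb * 400        # $115,200 one-time, at $400/TB per the slide
cloud_monthly = tb * 100.0     # assumed ~$100/TB-month cloud list price

print(edmt_storage / cloud_monthly)   # 4.0 months to break even
print((48 - 4) * cloud_monthly)       # ~$1.27M avoided over a 48-month HW cycle
```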
EDMT 10 PB vs. Hadoop 10 PB
• Hadoop 10 PB: 1,200 servers, 12,000 disks, 60 racks, $5M ($4K/node); 1,200 kW = $1.1M/year in electricity (at $0.10/kWh); ~6,000 tons of CO2/year (≈ 1,200 cars)
• EDMT 10 PB (6 racks): ~$2M-$5M; 100 kW ($90K/year); 500 tons CO2/year (≈ 100 cars)

AMAZON Cloud storage for 10 PB
• ~$300K/month for 3 PB of cloud storage; we sell 3 PB for $1.2M
1. Four monthly payments for cloud storage may pay for 3 PB of EDMT storage; the other 44 months (of a typical 48-month HW cycle) are free
2. The savings could be significant
EDMT Million-Channel Real-Time Ingestor
EDMT: store Hadoop data in EDMT for speed and SQL/ACID
[Diagram: multiple Hadoop v1 clusters feeding the EDMT ingestor]
EDMT® vs. Google Search Appliance (GSA)
1. The EDMT Solution (model "L") can handle more data than the Google GSA
2. GSA is more expensive "per document" than EDMT®
GSA pricing via Dell.com: http://search.dell.com/results.aspx?s=gen&c=us&l=en&cs=&k=gb-7007&cat=all&x=7&y=6
BMMsoft: Services and Products
• Assessment of your current BI and Big Data situation
• Design of a "Dream BI" to meet your future BI and Big Data needs
• EDMT Solution (on any supported DB platform)
• 2-hour consultation blocks