Hadoop and Measurement Data
What is Hadoop?
• A framework for distributed processing of large datasets
• Hadoop supports only one paradigm for distributed processing: Map-Reduce
• Hadoop originated in the area of web search engines
• "Moves computations to data", the opposite of "data to computations"
• HDFS (Hadoop Distributed File System)
– Optimal for write-once, read-many semantics
– Large files (or even streams) chunked into blocks, distributed over several network nodes
– Offers replication of data (configurable)
09.02.2017 2
What is Hadoop?
• Hadoop is based on the Map-Reduce paradigm
– Programming paradigm for distributed computation
– Mapper transforms input key/value pairs to output key/value pairs
– Reducer forms the result per key from all Mapper outputs
– Row-based processing
– Operates on slices of the original dataset that can be processed completely independently of each other
Map-Reduce Sample
Calculate the maximum temperature per city based on a huge set of measurement data.

Big file or stream:
City    Temp
City 1  22
City 2  18
City 1  25
City 1  28
City 1  12
City 2  15

Slicing (once) into File 1 … File N:
File 1:
City    Temp
City 1  22
City 2  18
City 1  25
File N:
City    Temp
City 1  28
City 1  12
City 2  15

Mapping (many; calc on CPU 1 … CPU N):
CPU 1:
City    Temp (max)
City 1  25
City 2  18
CPU N:
City    Temp (max)
City 1  28
City 2  15

Reducing:
City    Temp (max)
City 1  28
City 2  18
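The sample above can be sketched in a few lines of plain Python. This is only an illustration of the slice / map / reduce steps with the numbers from the slide; it does not use the Hadoop API, and the function names map_slice and reduce_results are made up for this sketch.

```python
from collections import defaultdict

# The two slices from the sample (File 1 and File N).
slices = [
    [("City 1", 22), ("City 2", 18), ("City 1", 25)],
    [("City 1", 28), ("City 1", 12), ("City 2", 15)],
]


def map_slice(rows):
    """Map step: maximum temperature per city within one slice."""
    local_max = {}
    for city, temp in rows:
        if city not in local_max or temp > local_max[city]:
            local_max[city] = temp
    return local_max


def reduce_results(partials):
    """Reduce step: maximum per city over all partial results."""
    grouped = defaultdict(list)
    for partial in partials:
        for city, temp in partial.items():
            grouped[city].append(temp)
    return {city: max(temps) for city, temps in grouped.items()}


# Each slice is independent, so the map calls could run on different CPUs.
partials = [map_slice(rows) for rows in slices]
result = reduce_results(partials)
print(partials)
print(result)
```

The sketch works because the maximum of partial maxima equals the overall maximum, so each slice can be processed without knowing the others.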
Measurement Data Files
• Only small files (Excel, ASCII) have a line-oriented structure -> no Hadoop needed
• Typical test stand or road test data files (like MDF): medium size per file, but a high number of files, with a non-tabular (= non-sliceable) structure:
– Today's file structure cannot be changed
– Channels with different measuring rates
– Channels with their own time channel, event-based data
– Resampling to the max. rate of all channels needs a factor of 10 or more in file space
• Testing of assistance systems up to self-driving is a different world, with typical streams for Hadoop operation
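The storage cost of resampling every channel to the maximum rate can be estimated with a short calculation. The channel counts and rates below are assumptions chosen for illustration, not values from a real MDF file, and separate time channels are ignored.

```python
# Hypothetical channel mix: sampling rate in Hz -> number of channels.
channels = {1000: 1, 100: 4, 10: 15}

n_channels = sum(channels.values())
max_rate = max(channels)

# Samples stored per second of recording at the native rates.
native_per_second = sum(rate * count for rate, count in channels.items())

# Samples per second if every channel is resampled to the fastest rate.
resampled_per_second = max_rate * n_channels

factor = resampled_per_second / native_per_second
print(f"storage factor after resampling: {factor:.1f}")
```

The recording duration cancels out of the ratio. The factor grows with the number of slow channels relative to fast ones.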
Measurement Data Files
• Only small files (Excel, ASCII) have a line-oriented structure -> no Hadoop needed
• Typical Big-Data files (like MDF) have a non-tabular (= non-sliceable) structure:
– Channels with different measuring rates
– Channels with their own time channel, event-based data
– Resampling to the max. rate of all channels needs a factor of 10 or more in file space
• Hadoop's map technology with key/value pairs would have to use time values as keys: not applicable!
Measurement Data Analysis
• Only very few calculations allow independent row-by-row operation:
– 1D statistical frequency (histogram): Yes
– Signal filtering needs at least an overlap: No
– FFT analysis needs the whole set of values: No
– Slicing is critical!
• Because slicing and mapping is a "once" process, it must fit all possible analysis operations
• Complex calculations cannot be processed as independent blocks
Where is Hadoop helpful?
• Tabular data structures are very frequently used in commercial / office applications -> Hadoop helps
• Measurement data are typically not tabular -> Hadoop does not help for faster analysis
• Typical measurement data analysis is not tabular -> Hadoop does not help for faster analysis
• Measurement data are recorded once but used frequently for analysis -> Hadoop's HDFS is a good alternative to a regular file system for storing all the data files
AMS Solutions for Distributed Analysis with Big Amounts of Data
AMS - Big Data Technologies
• Parallel operation
1. Single calculation (on a per-calculation basis)
2a. Analysis tree (on a per-file basis)
2b. Analysis tree with aggregation
3. Analysis job (per MaDaM/storage center)
• Statistics over meta data extracted during import
• Catalog files
• Channel-name and -unit mapping
• Resampling on a per-calculation basis
Parallel Operation – 1. Single Calculation
• Independent rows (sample: histogram)
– Input data can be sliced and analyzed independently
• Overlapping value areas (sample: signal filter)
– Data must be sliced with overlap, depending on the parameters of the calculation
• Non-sliceable calculations (sample: FFT)
– Calculation of one result value needs information from all input values
Strategy depends on the individual calculation
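The overlap case can be illustrated with a simple moving-average filter. This is a minimal sketch in plain Python, not jBEAM code; the filter, the window length of 3 and the helper names are assumptions for illustration. Each slice carries window - 1 extra samples from its successor so that no result at a slice boundary is lost.

```python
def moving_average(values, window):
    """Mean over each run of `window` consecutive samples (no padding)."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


def slices_with_overlap(values, slice_len, overlap):
    """Cut `values` into slices that each carry `overlap` extra samples."""
    return [values[start:start + slice_len + overlap]
            for start in range(0, len(values), slice_len)]


temps = [22, 18, 24, 28, 12, 14]
window = 3

# Reference: filter applied to the unsliced data.
full = moving_average(temps, window)

# Overlap of window - 1 samples: the concatenated slice results match.
overlapped = [y for s in slices_with_overlap(temps, 3, window - 1)
              for y in moving_average(s, window)]

# No overlap: the results that span the slice boundary are lost.
disjoint = [y for s in slices_with_overlap(temps, 3, 0)
            for y in moving_average(s, window)]

print(full == overlapped, len(full), len(disjoint))
```

The required overlap depends on the filter parameter (here the window length), which is why the slicing cannot be fixed once for all calculations.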
Parallel Operation – 1. Single Calculation
Input data:
X [s]   Temp [°]
0.0     22
0.15    18
0.3     24
0.45    28
0.6     12
0.75    14

Slicing into Task 1 … Task N:
Task 1:
X [s]   Temp [°]
0.0     22
0.15    18
0.3     24
Task N:
X [s]   Temp [°]
0.45    28
0.6     12
0.75    14

Parallel processing (calculation: histogram):
Task 1:
X [°]   N []
10-15   0
15-20   1
20-25   2
25-30   0
Task N:
X [°]   N []
10-15   2
15-20   0
20-25   0
25-30   1

Aggregation:
X [°]   N []
10-15   2
15-20   1
20-25   2
25-30   1

(Diagram label: DoubleChannel)
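The histogram sample above can be reproduced with a short sketch. It uses plain Python rather than jBEAM, and the function names are made up for illustration. Bins are treated as half-open intervals [low, high), which is an assumption; the slide does not state how bin edges are handled.

```python
# Bin edges from the sample, treated as half-open intervals [low, high).
BINS = [(10, 15), (15, 20), (20, 25), (25, 30)]


def histogram(values):
    """Count how many values fall into each bin."""
    counts = [0] * len(BINS)
    for value in values:
        for index, (low, high) in enumerate(BINS):
            if low <= value < high:
                counts[index] += 1
                break
    return counts


def aggregate(partials):
    """Add up the per-task counts bin by bin."""
    return [sum(column) for column in zip(*partials)]


task_1 = [22, 18, 24]   # slice for Task 1
task_n = [28, 12, 14]   # slice for Task N

# The two histogram calls are independent and could run in parallel.
partials = [histogram(task_1), histogram(task_n)]
total = aggregate(partials)
print(partials, total)
```

Because counting is additive, the aggregated result is the same as a histogram over the unsliced data.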
Parallel Operation 2a. Analysis Tree - Without Aggregation
• Processing on a per-file basis: files are natural slices
• Independent analysis and report
• Synchronization problems solved in jBEAM
• jBEAM project templates define the analysis
• MultiFile module controls the jBEAM pool
• MF module is responsible for load balancing
• Each jBEAM needs access to the files
Several jBEAM instances, each on its own server
Parallel Operation 2a. Analysis Tree - Without Aggregation
Calculate a whole report from one or several files. Process individual file sets individually and in parallel.
[Diagram elements: Plenty of files (many); MultiFile-Controller (distribution & load control); Server 1: jBEAM Instance 1 … Server N: jBEAM Instance N, each with import, analysis and report]
Parallel Operation 2b. Analysis Tree - With Aggregation
• Processing on a per-file basis: files are natural slices
• Independent analysis and report
• Synchronization problems solved in jBEAM
• jBEAM project templates define the analysis
• MultiFile module controls the jBEAM pool
• MF module is responsible for load balancing
• Each jBEAM needs access to the files
• MF module is responsible for aggregation
Several jBEAM instances, each on its own server
Parallel Operation 2b. Analysis Tree - With Aggregation
Analyze file sets individually and aggregate results for a common report.
[Diagram elements: Plenty of files (many); MultiFile-Controller (distribution & load control); Server 1: jBEAM Instance 1 … Server N: jBEAM Instance N, each with import and analysis; aggregation & reporting (Rep.)]
Multiple-jBEAM Solution
[Diagram elements: Analysis Job; MaDaM Job Centre; File System; four jBEAM instances processing File 1, 5, 11, …; File 2, 6, 12, …; File 3, 7, 13, …; File 4, 8, 14, …; jBEAM Aggregation; Final Report]
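The idea of a controller that hands files to a pool of workers and aggregates their results can be sketched as follows. This is not the jBEAM or MaDaM API: a Python thread pool stands in for the jBEAM instances, and FILES, analyze_file and aggregate are hypothetical names with made-up data.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory stand-in for measurement files: name -> samples.
FILES = {
    "file_1": [22, 18, 24],
    "file_2": [28, 12, 14],
    "file_3": [19, 21],
    "file_4": [30, 11, 16, 17],
    "file_5": [25],
}


def analyze_file(name):
    """Hypothetical per-file analysis: sample count and maximum."""
    values = FILES[name]
    return {"file": name, "count": len(values), "max": max(values)}


def aggregate(partials):
    """Combine the per-file results into one global result."""
    return {
        "files": len(partials),
        "count": sum(p["count"] for p in partials),
        "max": max(p["max"] for p in partials),
    }


def run_pool(file_names, workers=4):
    """Distribute files over a worker pool, then aggregate."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # The pool hands the next file to whichever worker is free.
        partials = list(pool.map(analyze_file, file_names))
    return aggregate(partials)


report = run_pool(sorted(FILES))
print(report)
```

Files act as natural slices here: each worker needs access to the files it is given, and only the small partial results travel back for aggregation.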
Parallel Operation 3. Global Solution: Multiple-MaDaM
• Data (files) are stored where they are most frequently used
• Analysis / computation is done near the data
• Analysis is distributed using communication between different MaDaM instances
• Individual results are aggregated to a global result
Multiple-MaDaM Solution
• Search for tests & preview: modern interactive web interface accessible from any browser
• Standardized reports: server-jBEAM-generated PDF files can be viewed with a PDF reader
• Interactive analysis: jBEAM with Java Web Start running on the client desktop
• Import of tests: MaDaM Importer with Java Web Start running on the client desktop

IP traffic:
1e-6 = 0.0001%  Only meta data exchange
1e-3 = 0.1%     EnCom-minimized traffic
1e0  = 100%     Complete file upload
*) optimized traffic with EnCom

[Diagram elements: three sites (USA, Germany, China), each marked *); per site a server (jBEAM Server, HTML-5, MaDaM™, Lucene Database, File System) and clients (Web Browser, jBEAM Client, MaDaM Importer); long-distance links labeled 1e-6 and 1e-3; MaDaM Importer upload labeled 1e0 = 100%]
Big Data Feature: Fast processing features inside jBEAM
For Big-Data processing, the analysis tool must support special time- and storage-reducing technologies:
• "StandBy" technology of data file importers
– Read and import only channels really in use
– Create and use "overviews" for fast drawing
• Multithreading on a per-calculation basis
• Resampling on a per-calculation basis
• Quantity-based calculation (1.5 cm + 7 mm = 2.2 cm)
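The quantity-based calculation from the list above (1.5 cm + 7 mm = 2.2 cm) can be illustrated with a small sketch. The Length class and the conversion table are assumptions made for this example and say nothing about how jBEAM implements units internally.

```python
from dataclasses import dataclass
from decimal import Decimal

# Conversion factors to the base unit meter (illustrative subset).
TO_METER = {"mm": Decimal("0.001"), "cm": Decimal("0.01"), "m": Decimal("1")}


@dataclass(frozen=True)
class Length:
    value: Decimal
    unit: str

    def to(self, unit):
        """Convert to another unit via the base unit."""
        meters = self.value * TO_METER[self.unit]
        return Length(meters / TO_METER[unit], unit)

    def __add__(self, other):
        # The result is expressed in the unit of the left operand.
        return Length(self.value + other.to(self.unit).value, self.unit)


total = Length(Decimal("1.5"), "cm") + Length(Decimal("7"), "mm")
print(total.value, total.unit)
```

Decimal is used instead of float so that the decimal unit factors are represented exactly.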
Why MaDaM does not use Hadoop
• For Hadoop, the data structure and all related analysis logic must support up-front data slicing
• This is not possible in measurement data analysis
– Measurement data structures are not tabular or line-oriented
– Data can be sliced, if at all, only for specific use cases
– Complex calculations cannot be processed as independent blocks
AMS has its own Approach
AMS uses suitable technologies for distributed analysis of multiple files that cannot be found in the Hadoop framework.
• Global distributed analysis by Multi-MaDaM
• Local distributed analysis by Multi-jBEAM
• Internal jBEAM technologies:
– Multithreading in components
– Sequential processing of multiple measurement files
– Multiple x-grids (time grids) on a per-calculation basis
– Multiple unit spaces
– Channel-name and -unit mapping
Germany: Bahnhofstraße 6, 09111 Chemnitz, Germany
Tel.: +49 (371) 918 668-0, Fax: +49 (371) 918 668-99, E-Mail: [email protected], Web: www.AMSonline.de

USA: 1760 Opdyke Court, Auburn Hills, MI 48326, USA
Tel.: +1 (248) 270-7779, Fax: +1 (248) 393-0340, E-Mail: [email protected], Web: www.AMSonline.eu

China: German Centre, Unit 719A, 88 Keyuan Road, Pudong, Shanghai 201203 / PR China
Tel.: +86 (21) 289 866 19, Fax: +86 (21) 289 865 11, E-Mail: [email protected], Web: www.AMSonline.cn

Gesellschaft für angewandte Mess- und Systemtechnik mbH