Date post: | 19-Jul-2015 |
Category: |
Technology |
Upload: | paolo-latella |
View: | 259 times |
Download: | 2 times |
A big data problem
• Log data analysis in cloud computing technology is a big data problem!
• Volume: large amount of data generated
• Velocity: data growing at large rates and needs to be analyzed quickly
• Variety: Different type of structured and
unstructured data
Log analysis tasks
Generate Collect Store ETL Analyze
Server LogApplication Log
ScribeFlume
FluentdLogstash
DynamodbS3
Elastic MapReduceData Pipeline
DynamodbS3
CloudSearchRedshift
Collect events and log
Batch processing (rsync)• burst of traffic• High latency• Hard to analyze
Remote server (rsyslog)• Hard to configure• Not scalable• Hard to analyze
Log collector• simple to configure• Scalable• NoSQL integration
In real-time: log collector (1/2)
• Apache Flume: distributed, reliable, and available service (java) for collecting, aggregating, and moving large amounts of log data• source: a source of data from which Flume receives data
• sink: is the counterpart to the source in that it is a destination for data
• Fluent: fully free and open-source (ruby) log collector that instantly enables you to have a “Log Everything” architecture with +125 plugins• Input plugin: retrieve data from most varied sources (file, mysql, scribe,
flume and others)
• Output plugin: send data to different destination (file, S3, Dynamodb, SQSand others)
In real-time: log collector (2/2)
• Logstash: a tool (java) for managing events and logs• input section: configure source of data and relative plugin
• filter section: configure regular expression or plugin for filter data input
• output section: configure destination section and relative plugin
• Scribe: is a service (C++ and Python) for aggregating log data streamed in real-time from a large number of servers• Scribe was developed at Facebook using Apache Thrift and released in
2008 as open source
• Scribe servers are arranged in a directed graph, with each server knowing only about the next server in the graph
NoSQL database: Amazon DynamoDB
• A fast, fully managed NoSQL database service in the AWS cloud
• You simply create a table and specify how much request (read/write) capacity you require
• Tables do not have fixed schemas, and each item may have a different number of attributes and multiple data types
• Performance, reliability and security are built-in• SSD-storage and automatic 3-way replication
• Integrate with Amazon Elastic MapReduce by a customized version of Hive
Amazon DynamoDB: throughput
1 million evenly spread writes per day is equivalent to
1,000,000 (writes) / 24 (hours) / 60 (minutes) / 60
(seconds) = 11.6 writes per second.
A DynamoDB Write Capacity Unit can handle 1 write per
second, so you need 12 Write Capacity Units. Similarly, to
handle 1 million reads per second, you need 12 Read
Capacity Units.
Or Storage: Amazon S3
• Provides a simple web services interface that can be used to store and retrieve any amount of data, at any time, from anywhere on the web
• Write, read, and delete objects containing from 1 byte to 5 terabytes of data each. The number of objects you can store is unlimited.
• Designed for 99.999999999% durability and 99.99%availability of objects over a given year
• Import data from Dynamodb with Apache hive
With Amazon Web Services (1/2)
• Amazon provide Elastic MapReduce web service for ETL operations in AWS cloud
• Elastic MapReduce Extract your data from Dynamodb,S3 or HDFS
• Elastic MapReduce Transform your data across a cluster of EC2 instances
• Elastic MapReduce Load your data to S3 or Dynamodbthen to Redshift or Cloudsearch
With Amazon Web Services (2/2)
Amazon EC2
Amazon EC2
Auto scaling
Group
S3
Bucket
DynamoDB
Extract Transform Load
Amazon Redshift
S3
Bucket
Amazon CloudSearch
DynamoDB
EMR
Amazon Elastic MapReduce
• With Amazon Elastic MapReduce (Amazon EMR) you
can analyze and process vast amounts of data.
• It does this by distributing the computational work
across a cluster of EC2 virtual servers running in the
Amazon cloud
• The cluster is managed using an open-source framework
called Hadoop and use the map-reduce logic
Amazon EMR: cluster (1/2)
• Offers several different types of clusters, each
configured for a particular type of data processing
• Hive clusters: use Apache Hive on top of Hadoop.
• You can load data onto cluster and then query it using HiveQL
• Custom JAR Cluster: runs a Java map-reduce application
that you have previously compiled into a JAR file and
uploaded to Amazon S3
Amazon EMR: cluster (2/2)
• Streaming Cluster: runs mapper and reducer scripts that
you have previously uploaded to Amazon S3.
• The scripts can be written using any of the following supported
languages: Ruby, Perl, Python, PHP, Bash, C++.
• Pig Cluster: use Apache Pig, an open-source Apache
library that runs on top of Hadoop.
• you can load data onto your cluster and then query it using Pig
Latin, a SQL-like query language.
ETL workflow: Amazon Data pipeline
Helps you easily create complex data processing workloads that are fault tolerant, repeatable, and highly available
You define a pipeline composed of the “data sources”, the “activities” or business logic and the “schedule” on which your business logic executes.
Analysis: Amazon Redshift (beta)
• Amazon Redshift is a fast and powerful, fully managed, petabyte-scale data warehouse service in the cloud
• Load data from your RDS, S3 or DynamoDB
• Use your preferred business intelligence software package for the fast-growing Big Data analysis
• After you create an Amazon Redshift cluster, you can use any SQL client tools to connect to the cluster with JDBC or ODBC drivers from postgresql.org .
Demo resume
1. Configure fluentd for log collection
2. Create a Dynamodb table and setting throughput
3. Generate logs and items to Dynamodb table
4. Create Hive cluster on Elastic MapReduce
5. Import data from DynamoDB with Hive and join tables
6. Export join tables from Hive to Amazon S3
7. Load data from S3 to Redshift and query data
8. Load data from S3 to Cloudsearch