Big Data Products and Practices
Venkatesh Vinayakarao (Vv) · [email protected] · http://vvtesh.co.in
Page 1:

Venkatesh Vinayakarao (Vv)

[email protected]

http://vvtesh.co.in

Big Data Products and Practices

Page 2:

Cloud Platforms


Cloud Services:
• SaaS
• PaaS
• IaaS

Page 3:

Cloud Platforms

Ref: https://maelfabien.github.io/bigdata/gcps_1/#what-is-gcp

Page 4:

Storage Service (Amazon S3 Example)

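As a hedged illustration of the storage service, the same bucket operations can be driven from the AWS CLI (the bucket and file names below are made up):

> aws s3 mb s3://my-demo-bucket
> aws s3 cp data.csv s3://my-demo-bucket/data.csv
> aws s3 ls s3://my-demo-bucket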

Page 5:

Compute Services (Google Cloud Example)

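As a hedged illustration of a compute service, a VM can be created and torn down from the gcloud CLI (the instance name and zone below are made up):

> gcloud compute instances create my-vm --zone=us-central1-a
> gcloud compute instances delete my-vm --zone=us-central1-a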

Page 6:

Network Services (Azure Example)

• Azure Traffic Manager is a DNS-based traffic load balancer that distributes traffic optimally to services across global Azure regions, while providing high availability.

• Traffic Manager directs client requests to the most appropriate service endpoint.

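As a sketch, such a profile can be created from the Azure CLI (the resource names below are made up; Performance routing sends clients to the lowest-latency region):

> az network traffic-manager profile create --name MyProfile --resource-group MyRG --routing-method Performance --unique-dns-name my-app-tm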

Page 7:

Building Great Apps/Services

• We need products that make certain features easy to implement:

• Visualization

• Crawling/Search

• Log Aggregation

• Graph DB

• Synchronization


Page 8:

Tableau


Page 9:

Crawling with Nutch

Solr Integration. Image src: https://suyashaoc.wordpress.com/2016/12/04/nutch-2-3-1-hbase-0-98-8-hadoop-2-5-2-solr-4-1-web-crawling-and-indexing/

Page 10:

Log Files are an Important Source of Big Data


Page 11:

Log4j

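Log4j gives applications leveled, configurable logging. A minimal sketch against the Log4j 2 API (the class and messages are invented for illustration):

import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

public class OrderService {
    // One logger per class is the usual pattern.
    private static final Logger log = LogManager.getLogger(OrderService.class);

    public static void main(String[] args) {
        log.info("Order service started");        // routine events
        log.warn("Inventory running low");        // suspicious conditions
        log.error("Payment gateway unreachable"); // failures worth alerting on
    }
}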

Page 12:

Flume


Flume Config Files
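A minimal agent config sketch in the style of the Flume user guide's netcat example (the agent/component names a1, r1, c1, k1 are conventional placeholders):

# Name the components of agent a1
a1.sources = r1
a1.channels = c1
a1.sinks = k1
# Source: listen for lines of text on a TCP port
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
# Channel: buffer events in memory
a1.channels.c1.type = memory
# Sink: write events to the log
a1.sinks.k1.type = logger
# Wire source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1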

Page 13:

Sqoop


Designed for efficiently transferring bulk data between Hadoop and RDBMS.

[Diagram: Sqoop 2 sits between structured stores (RDBMS) and unstructured storage (Hadoop).]
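A typical import invocation looks like the following sketch (connection string, credentials, and paths are made up):

> sqoop import --connect jdbc:mysql://localhost/corp --username dbuser --table employees --target-dir /data/employees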

Page 14:

GraphDB – Neo4j


ACID compliant graph database management system

Page 15:

Neo4j

• A leading graph database, with native graph storage and processing.

• Open Source

• NoSQL

• ACID compliant


Neo4j Sandbox

https://sandbox.neo4j.com/

Neo4j Desktop

https://neo4j.com/download
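Beyond the sandbox and desktop, applications usually talk to Neo4j through a language driver. A minimal sketch with the official Java driver (the bolt URL and credentials below are placeholders):

import org.neo4j.driver.AuthTokens;
import org.neo4j.driver.Driver;
import org.neo4j.driver.GraphDatabase;
import org.neo4j.driver.Session;

public class Neo4jHello {
    public static void main(String[] args) {
        // Connect to a local Neo4j instance (URL/credentials are placeholders).
        try (Driver driver = GraphDatabase.driver("bolt://localhost:7687",
                AuthTokens.basic("neo4j", "password"));
             Session session = driver.session()) {
            // Run the same Cypher shown on the next slides.
            session.run("CREATE (p:Person {name:'Venkatesh'})-[:Teaches]->" +
                        "(c:Course {name:'BigData'})");
        }
    }
}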

Page 16:

Data Model

• create (p:Person {name:'Venkatesh'})-[:Teaches]->(c:Course {name:'BigData'})


Page 17:

Query Language

• Cypher Query Language
• Similar to SQL

• Optimized for graphs

• Used by Neo4j, SAP HANA Graph, Redis Graph, etc.


Page 18:

CQL

• create (p:Person {name:'Venkatesh'})-[:Teaches]->(c:Course {name:'BigData'})

• Don’t forget the single quotes.


Page 19:

CQL

• Match (n) return n


Page 20:

CQL

• match(p:Person {name:'Venkatesh'}) set p.surname='Vinayakarao' return p


Page 21:

CQL

• Create (p:Person {name:'Raj'})-[:StudentOf]->(o:Org {name:'CMI'})

• Match (n) return n


Page 22:

CQL

• create (p:Person {name:'Venkatesh'})-[:FacultyAt]->(o:Org {name:'CMI'})

• Match (n) return n


Page 23:

CQL

• MATCH (p:Person {name:'Venkatesh'})-[r:FacultyAt]->() DELETE r
• MATCH (p:Person) WHERE ID(p)=4 DELETE p
• MATCH (o:Org) WHERE ID(o)=5 DELETE o

• MATCH (a:Person),(b:Org) WHERE a.name = 'Venkatesh' AND b.name = 'CMI' CREATE (a)-[:FacultyAt]->(b)


Page 24:

CQL

create (p:Person {name:'Isha'})

MATCH (a:Person),(b:Course)

WHERE a.name = 'Isha' and b.name = 'BigData'

CREATE (a)-[:StudentOf]->(b)

MATCH (a:Person)-[o:StudentOf]->(b:Course) where a.name = 'Isha' DELETE o

MATCH (a:Person),(b:Org)

WHERE a.name = 'Isha' and b.name = 'CMI'

CREATE (a)-[:StudentOf]->(b)

MATCH (a:Person),(b:Course)

WHERE a.name = 'Isha' and b.name = 'BigData'

CREATE (a)-[:EnrolledIn]->(b)


Page 25:

Apache ZooKeeper


A Zookeeper Ensemble Serving Clients

A centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services

It is simple to store data using the ZooKeeper CLI:
$ create /zk_test my_data
$ set /zk_test junk
$ get /zk_test
junk
$ delete /zk_test

Data is stored hierarchically
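The same znode operations can be issued programmatically. A minimal sketch with the ZooKeeper Java client (the connection string is assumed; the null watcher is a simplification):

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkDemo {
    public static void main(String[] args) throws Exception {
        // Connect to a local ensemble member (no watcher, for brevity).
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, null);
        // create /zk_test my_data
        zk.create("/zk_test", "my_data".getBytes(),
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        // set /zk_test junk
        zk.setData("/zk_test", "junk".getBytes(), -1);
        // get /zk_test  -> prints "junk"
        System.out.println(new String(zk.getData("/zk_test", false, null)));
        // delete /zk_test
        zk.delete("/zk_test", -1);
        zk.close();
    }
}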

Page 26:

Stream Processing

• Process data as they arrive.


Page 27:

Stream Processing with Storm

• One of these cluster nodes is a master node: "Nimbus" is the "job tracker"!
• In MR parlance, the "Supervisor" process is our "task tracker".

Page 28:

Apache Kafka

• Uses Publish-Subscribe Mechanism


Page 29:

Kafka – Tutorial (Single Node)

• Create a topic
> bin/kafka-topics.sh --create --topic test --bootstrap-server localhost:9092

• List all topics
> bin/kafka-topics.sh --list --bootstrap-server localhost:9092
test

• Send messages
> bin/kafka-console-producer.sh --topic test --bootstrap-server localhost:9092
This is a message
This is another message

• Receive messages (subscribed to a topic)
> bin/kafka-console-consumer.sh --topic test --from-beginning --bootstrap-server localhost:9092
This is a message
This is another message


Page 30:

Kafka – Multi-node


• A topic is a stream of records.
• For each topic, the Kafka cluster maintains a partitioned log.
• Records in the partitions are each assigned a sequential id number called the offset.

Page 31:

Kafka Brokers

• For Kafka, a single broker is just a cluster of size one.

• We can set up multiple brokers. The broker.id property is the unique and permanent name of each node in the cluster.
> bin/kafka-server-start.sh config/server-1.properties &
> bin/kafka-server-start.sh config/server-2.properties &

• Now we can create topics with a replication factor:
> bin/kafka-topics.sh --create --replication-factor 3 --partitions 1 --topic my-replicated-topic --bootstrap-server localhost:9092

> bin/kafka-topics.sh --describe --bootstrap-server localhost:9092 --topic my-replicated-topic
Topic: my-replicated-topic PartitionCount:1 ReplicationFactor:3
Partition: 0 Leader: 2 Replicas: 1,2,0


Page 32:

Streams API

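The Streams API lets an application consume records from topics, transform them, and write results back to Kafka. A minimal sketch (the topic names and the uppercase transform are invented for illustration):

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class UppercaseApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "uppercase-demo");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        // Read from "test", uppercase each value, write to "test-upper".
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> source = builder.stream("test");
        source.mapValues(v -> v.toUpperCase()).to("test-upper");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
    }
}

Unlike the console tools above, this runs as a long-lived application; Kafka takes care of partitioning the work and restoring state on failure.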

Page 33:

Amazon Kinesis

• Amazon Kinesis Data Streams is a managed service that scales elastically for real-time processing of streaming big data.


“Netflix uses Amazon Kinesis to monitor the communications between all of its applications so it can detect and fix issues quickly, ensuring high service uptime and availability to its customers.” – Amazon (https://aws.amazon.com/kinesis/).

Page 34:

Amazon Kinesis capabilities

• Video Streams

• Data Streams

• Firehose

• Analytics

https://aws.amazon.com/kinesis/
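A hedged sketch of writing a single record to a Kinesis Data Stream with the AWS SDK for Java v1 (the stream name, partition key, and payload are made up):

import java.nio.ByteBuffer;
import com.amazonaws.services.kinesis.AmazonKinesis;
import com.amazonaws.services.kinesis.AmazonKinesisClientBuilder;
import com.amazonaws.services.kinesis.model.PutRecordRequest;

public class KinesisPut {
    public static void main(String[] args) {
        AmazonKinesis kinesis = AmazonKinesisClientBuilder.defaultClient();
        // One record: the partition key determines which shard receives it.
        PutRecordRequest req = new PutRecordRequest()
                .withStreamName("app-events")
                .withPartitionKey("user-42")
                .withData(ByteBuffer.wrap("clicked:checkout".getBytes()));
        kinesis.putRecord(req);
    }
}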

Page 35:

Apache Spark (A Unified Library)


https://spark.apache.org/

In Spark, use data frames as tables.

Page 36:

Resilient Distributed Datasets (RDDs)


[Diagram: InputData.txt is loaded into an RDD; transformations (map, filter, …) turn RDDs into new RDDs; actions (reduce, count, …) produce results.]

Page 37:

Spark Examples


A distributed dataset can be used in parallel by passing functions through Spark:

val distFile = sc.textFile("data.txt")
distFile.map(s => s.length).reduce((a, b) => a + b)

Map/reduce: map is a lazy transformation; reduce is an action that triggers the computation, here summing the lengths of all lines.

Page 38:

Thank You


