SNS Analysis using Cloud Computing Services

Post on 11-May-2015


description

SNS Cloud AWS S3 Hadoop MapReduce

transcript

SNS Analysis using Cloud Computing Services
DHT-based Key-Value Storage and MapReduce-based Analysis

DongWoo Lee (oiko.cloud@gmail.com)

SocialFlowOikoLabDSOiko

Laboratory 2CloudKR

PlatformDay2009

1

Agenda

‣ Introduction
• Social Network Service
• Motivation : Visualization, Social Network Analysis
• SocialFlow
• Scale Out Technologies : Cloud Computing

‣ SNS Analysis Architecture based on Cloud
• Overall Process
• Crawling
• DHT Storage (CouchDB)
• MapReduce
• Pair-Wise Similarity

‣ Cloud Computing Service
• Amazon Web Service
• EC2 / S3 / Elastic MapReduce
• Tips

‣ References

2

Introduction

Mobile DeviceCloud ComputingSocial Network


3

Social Network Service

“Social Applications = Social Networks”

“A social network is a collection of people bound together through a specific set of social relations.”

“A collection of people is a social network if and only if it is possible for something to spread virally through that collection.”

4

Social Network Services : Twitter, Facebook

5

Social Applications

6

Social Networks

http://www.vincos.it/world-map-of-social-networks/

7

Social Network Analysis

‣ Social Graph Analysis

‣ Visualization

‣ Person-to-Person Relationship

‣ Temporal Mind Mining (Content Clustering)

‣ Post-Mortem Log Processing


8

Social Network Analysis : Visualization

‣ Social Graph (50 People)

9

Social Network Analysis : Visualization

‣ Social Graph (100 People)

10

Social Network Analysis : Visualization

‣ Social Graph (200 People)

‣ Limitations
• Visualization
• Computational Complexity

11

Social Network Analysis : Visualization

‣ Social 3D Graph

12

SocialFlow

‣ Thoughts, Feelings, Interests, Relationship and Information of SNS

‣ Real-time Massive Social Data Streams

‣ Difficult to follow the Social Streams

‣ Need a way to get a summary or clustered information based on Common Interests


13

SocialFlow

‣ Getting Common Flows of people through Content Similarities

‣ Reflecting Short-Term Interests of People

‣ Extracting Hot Issues

‣ Revealing Relationships among In/Out Resources

‣ Implementing Scale-Out Technologies

‣ Evolving toward Recommendation System based on Collective Intelligence


14

Scale Out Technologies : Cloud Computing

15

Why Cloud Computing?

‣ SPOF (Single Point of Failure)

‣ Cluster Administration (Who does this?)

‣ Initial Infrastructure Investment (Risk Management)

‣ Focus on the Main Thing (Intelligence)

‣ Enable Highly Scalable Services

New resource provision paradigms for Grid Infrastructures: Virtualization and Cloud / ISGC 2009

http://tinyurl.com/nacgu7

16

Cloud Computing: e.g. Storage Failure

Failure Trends in a Large Disk Drive Population, by Eduardo Pinheiro, Wolf-Dietrich Weber and Luiz André Barroso, Google Inc.

17

SNS Analysis Architecture based on Cloud

18

Experimental Project

‣ Python / Django / Boto

‣ ML / Data Mining

‣ DHT / CouchDB

‣ Cloud / AWS S3, EC2, Hadoop MapReduce

19

Workflow

[Diagram: SNS, Crawler, MapReduce, Post-Processing, CDN, User; the stages are divided between the In-house Cluster (Local DataCenter) and the Cloud Service]

20

Technologies : Before

Key-Value Storage: CouchDB

Consistent DHT: Hash_ring

MapReduce: CouchJS

Machine Learning: HomeMade

Crawler (×3)

21

Technologies : After

Key-Value Storage: CouchDB

Consistent DHT: Hash_ring

MapReduce: EC2 Hadoop

Machine Learning: HomeMade

Storage: S3

Crawler (×3)

22

Crawling

[Diagram: Crawler (×4) --> DHT Replication --> DB (×4) --> Indexer (Mapper) --> Index File of [ term, doc ] records]

‣ Fetching recent postings from SNS

‣ Storing fetched postings in CouchDB storage through the DHT layer (which selects a server)

‣ Pushing raw data into the Cloud to process it with MapReduce

23

Consistent DHT (Distributed Hash Table)

‣ Uniform key distribution and load balancing with a good hash function

‣ Minimizing the effects of a storage crash or temporary downtime

‣ High availability with a replication scheme

[Diagram: hash ring with positions 0, 1, 2, ..., k-1, k, k+1, ..., N-1, held by Node k-1, Node k, Node k+1, ..., Node N-1]

Replicas: Replicate(k, k-1, k+1)

‣ Notice: A real node has non-linear portions of the total key space.
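
The ring lookup behind these points can be sketched in a few lines of Python. This is a minimal illustration and not the Hash_ring library used in the project: the class name `HashRing`, the `vnodes` parameter, the choice of MD5, and the `couch-*` server names are all assumptions made for the example. Virtual nodes are one common way for a real node to end up owning several scattered portions of the key space, which may be what the notice refers to.

```python
import bisect
import hashlib


class HashRing:
    """Minimal consistent-hash ring with virtual nodes (illustrative only)."""

    def __init__(self, nodes, vnodes=64):
        # Each real node is placed on the ring at `vnodes` pseudo-random positions.
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._positions = [pos for pos, _ in self._ring]

    @staticmethod
    def _hash(key):
        # MD5 is used only as a well-mixed hash here, not for security.
        return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

    def get_node(self, key):
        # The first ring position clockwise from the key's hash owns the key.
        idx = bisect.bisect_right(self._positions, self._hash(key))
        return self._ring[idx % len(self._ring)][1]


# Hypothetical server names; any string identifiers would do.
ring = HashRing(["couch-a", "couch-b", "couch-c", "couch-d"])
server = ring.get_node("posting:12345")
```

When a server is removed, only the keys it owned move to other servers; every other key keeps its owner, which is the property that limits the impact of a storage crash.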

24

Consistent DHT (Distributed Hash Table)

[Diagram labels: hash ring (0, 1, 2, ..., Node k-1, Node k, Node k+1, ..., Node N-1); DHT; DHT Front End; Memory Cache; AWS S3 (html, image); SNS Crawler; SNS Analysis; Generated Contents; View, Admin View, User View; Admin Traffic; Anonymous User Traffic]

25

Consistent DHT : Replication

[Diagram: nodes A, B, C, D on the ring, drawn in groups (A, B, D), (B, C, A), (C, D, B), (D, A, C); B is shown together with its two replicas]

* Replica = 2
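
The neighbour replication named earlier as Replicate(k, k-1, k+1) can be written down as a small sketch. The function names `replica_set` and `placement` are hypothetical, and the sketch assumes at least three nodes so that the two neighbours are distinct; it is one reading of the diagram, not the project's actual code.

```python
def replica_set(k, n):
    """Primary node index k followed by its ring neighbours k-1 and k+1 (mod n)."""
    return [k % n, (k - 1) % n, (k + 1) % n]


def placement(nodes):
    """Map each node name to the two neighbours that hold its replicas."""
    n = len(nodes)
    return {
        nodes[k]: [nodes[i] for i in replica_set(k, n)[1:]]
        for k in range(n)
    }


def first_alive(k, n, alive):
    """Return the first reachable node index for data owned by node k, or None."""
    for idx in replica_set(k, n):
        if idx in alive:
            return idx
    return None


copies = placement(["A", "B", "C", "D"])
# copies["B"] lists the two neighbours of B that keep its replicas.
```

With two replicas per node, data owned by node k stays readable as long as any one of k-1, k, k+1 is up.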

26

CouchDB (Key-Value Storage)

‣ Erlang-based Key-Value Storage

‣ Storage Engine (MVCC, B-tree)

‣ RESTful API

‣ Server-side JavaScript Engine (MapReduce)

‣ View Engine

‣ Futon Web UI

27

CouchDB: Server-side JavaScript

‣ Purpose

‣ Local Computations on Local Data Sets

‣ Features

‣ Mozilla’s SpiderMonkey

‣ MapReduce Framework with JavaScript

‣ Fork External Process (couchjs)

‣ Performance Enhancements Expected

‣ Google’s V8 (Chrome’s JavaScript Engine / JIT)

http://tinyurl.com/m76sx3

28

CouchDB: MapReduce

doc = (d1, d2, fq)

dx: { di }

29

Map & Reduce : Pair-Wise Similarity

[Diagram: MapReduce pipeline reading from DB (×4); a Doc File also appears in the diagram]

1. Indexer (Mapper) --> Index File: [ term, doc ]

2. DocGrouper (Reducer) --> Group File: [ term, { docs } ]

3. DocCombinator (Mapper) --> Candidate File: [ term, { docs } ] => [ doc1, doc2 ]

4. DocPairCounter (Reducer) --> Result File: [ freq, doc1, doc2 ]

‣ Indexer and Grouper for Processing Korean.

‣ No NLP and No Structural Analysis.

‣ Produce a pairwise similarity between two postings.
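
The four stages can be imitated in memory to show how the records change shape. This is only a sketch under stated assumptions: the function names mirror the stage names on the slide, but the whitespace tokenizer is a stand-in (the slide does not describe how Korean text is indexed), the sample postings are invented, and a real run would execute each stage as a separate map or reduce step.

```python
from collections import Counter, defaultdict
from itertools import combinations


def indexer(postings):
    """Map: emit [term, doc] for each distinct term of each posting."""
    for doc_id, text in postings.items():
        for term in set(text.split()):  # placeholder tokenizer, no NLP
            yield term, doc_id


def doc_grouper(term_doc_pairs):
    """Reduce: [term, doc] --> [term, {docs}]."""
    groups = defaultdict(set)
    for term, doc_id in term_doc_pairs:
        groups[term].add(doc_id)
    return groups


def doc_combinator(groups):
    """Map: [term, {docs}] --> [doc1, doc2] for every pair sharing the term."""
    for docs in groups.values():
        yield from combinations(sorted(docs), 2)


def doc_pair_counter(pairs):
    """Reduce: [doc1, doc2] --> [freq, doc1, doc2], highest frequency first."""
    counts = Counter(pairs)
    return sorted(((freq, d1, d2) for (d1, d2), freq in counts.items()), reverse=True)


# Invented sample data for illustration.
postings = {
    "p1": "cloud hadoop s3",
    "p2": "hadoop s3 couchdb",
    "p3": "couchdb erlang",
}
result = doc_pair_counter(doc_combinator(doc_grouper(indexer(postings))))
# result == [(2, 'p1', 'p2'), (1, 'p2', 'p3')]
```

The frequency is simply the number of terms two postings share, which is why no structural analysis is needed.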

30

Map & Reduce : Optimization

‣ Concerns
• Consider Key Group Size Distribution
• Data Load Balancing
• Barrier Point
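
One way to see why the key group size distribution matters: if the pair-generation step emits every pair of postings that share a term (an assumption about the implementation), a group of n postings yields n(n-1)/2 pairs, so a single very common term can dominate the work. The helper names below are hypothetical.

```python
def pair_count(group_size):
    """Number of [doc1, doc2] pairs produced by one term group of this size."""
    return group_size * (group_size - 1) // 2


def largest_group_share(group_sizes):
    """Fraction of all candidate pairs that come from the single largest group."""
    counts = [pair_count(n) for n in group_sizes]
    total = sum(counts)
    return max(counts) / total if total else 0.0


# Three rare terms and one very common term (invented sizes).
share = largest_group_share([2, 2, 3, 1000])
# pair_count(1000) == 499500, so the common term accounts for nearly all pairs.
```

Under that assumption, capping or splitting oversized groups is one possible way to keep reducers evenly loaded before the barrier.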

‣ Sample Data
• Two months of postings from my friends
• Reachable graph: 4,060 People
• Total Postings: 206,115

31

Pair-Wise Similarity and its TreeMap

Postings: 110,008
Users: 2,691

Score >= 6

32

Pair-Wise Similarity and its Cluster

➡One issue and different opinions among people

33

Pair-Wise Similarity and its Cluster

➡ Common Interest / Hot Issue

34

Pair-Wise Similarity and its Cluster

➡One person and the similar contents pattern (specialty)

35

Pair-Wise Similarity and its Cluster

➡ Similar Structure of Sentences (trendy, parody)

36

Deployment

www

Flickr

S3/CloudFront

EC2


37

Cloud Computing Service

38

Before the Cloud Age

‣ Smart Shell Guru’s Daily Work : Parallel Sort

On the master:
$ wc -l data
$ split -l 1000k data

(scp / NFS)

On each worker:
$ nohup ./work.sh data1 > data1.processed
$ nohup sort -r data1.processed > data1.sorted

(scp / NFS)

On the master:
$ sort -rm data*.sorted > data.sorted

➡ Need to prepare/maintain physical machines and resources
➡ Need to monitor job progress (wait and see each job’s status)
➡ Need to cope with machine failure (slave nodes / storage / networks)
➡ Need to schedule multiple jobs
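
The split, sort, merge pattern behind these shell commands can be condensed into a short Python sketch. It runs in one process purely to show the data flow; the function names are hypothetical, and in the setup on the slide each chunk would be sorted on a different machine.

```python
import heapq


def split_lines(lines, chunk_size):
    """Like `split -l`: cut the input into chunks of at most chunk_size lines."""
    return [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]


def sort_chunk(chunk):
    """Like `sort -r` on one worker: sort a single chunk in descending order."""
    return sorted(chunk, reverse=True)


def merge_sorted(chunks):
    """Like `sort -rm`: merge already-sorted chunks without re-sorting them."""
    return list(heapq.merge(*chunks, reverse=True))


data = ["b", "d", "a", "c", "e"]
merged = merge_sorted([sort_chunk(c) for c in split_lines(data, 2)])
# merged == ['e', 'd', 'c', 'b', 'a']
```

Everything the sketch leaves out (copying files, watching jobs, retrying on failure) is the operational burden listed above.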

Complexity

39

Amazon Web Service : Overview

[Diagram labels, grouped by service]

‣ EC2 (Elastic Compute Cloud): AMI (Machine Image); EC2 instances; Admin via SSH; Clients via HTTP
‣ EBS (Elastic Block Store): 1 GB to 1 TB, mounted on EC2
‣ S3 (Simple Storage Service): Buckets, Objects, Permissions, Header; Clients via HTTP
‣ SimpleDB: key-value; API
‣ CloudFront: Edges
‣ SQS (Simple Queue Service): Messages
‣ Elastic MapReduce: Instant EC2 Hadoop Cluster
‣ CloudWatch, Auto Scaling, Elastic Load Balancing: Monitoring
‣ Import/Export: Offline, eSATA/USB
‣ Credentials and tools: Access Key ID, Secret Access Key, Key Pair; Mgmt Console; EC2 CLI

40

Amazon Web Service

‣ AWS Management Console

41

AWS : AMI

AMI: Amazon Machine Image

42

AWS : Paid AMI / The Cloud Market

AMI: Amazon Machine Image

Paid AMI

43

AWS : How to Make an AMI (1)

Loopback File
# dd if=/dev/zero of=new_image.fs bs=1M count=1024

Make ext3 file system
# mke2fs -F -j new_image.fs
# mkdir /mnt/ec2-fs
# mount -o loop new_image.fs /mnt/ec2-fs
# mkdir /mnt/ec2-fs/dev
# /sbin/MAKEDEV -d /mnt/ec2-fs/dev -x console
# /sbin/MAKEDEV -d /mnt/ec2-fs/dev -x null
# /sbin/MAKEDEV -d /mnt/ec2-fs/dev -x zero
# mkdir /mnt/ec2-fs/etc

Create /mnt/ec2-fs/etc/fstab (Add /dev/sda1 --> /, /dev/pts, /dev/shm, /proc, /sys)
Create yum-xen.conf

# mkdir /mnt/ec2-fs/proc
# mount -t proc none /mnt/ec2-fs/proc
# yum -c yum-xen.conf --installroot=/mnt/ec2-fs -y groupinstall Base

Edit /mnt/ec2-fs/etc/sysconfig/network-scripts/ifcfg-eth0
Edit /mnt/ec2-fs/etc/sysconfig/network
Edit /mnt/ec2-fs/etc/fstab (Add /dev/sda2 --> /mnt, /dev/sda3 --> swap)

# chroot /mnt/ec2-fs /bin/sh
Edit services

44

AWS : How to Make an AMI (2)

Building an AMI
# yum install ruby
# rpm -i ec2-ami-tools-noarch.rpm (Download from public s3 bucket)
# ec2-bundle-image -i new_image.fs -k my-private-key.key -u aws-user-id

Local Machine Root File System
# ec2-bundle-vol -k my-private-key.key -s 1000 -u aws-user-id

Upload to S3
# ec2-upload-bundle -b my-bucket -m image.manifest -a my-aws-access-key-id -s my-secret-key-id

Register AMI
# ec2-register my-bucket/image.manifest
IMAGE ami-xxxx

Testing
# ec2-describe-images ami-xxxx

Deregister AMI
# ec2-deregister ami-xxxx

Running AMI
# ec2-run-instances ami-xxxx -n 1

http://docs.amazonwebservices.com/AWSEC2/2006-06-26/DeveloperGuide/

45

AWS : EC2 Running Instance

‣ AWS Management Console

46

AWS : EC2 Running Instance

47

Amazon Web Service: Access Methods

‣ Access Key ID / Secret Access Key / Key Pairs

‣ AWS Management Console
‣ EC2 API (WSDL) / EC2 CLI (Command Line Interface)
‣ SSH

‣ Firefox Extensions
• S3 Firefox Organizer
• Elasticfox

‣ S3
• DNS: s3 CNAME s3.amazonaws.com.
• e.g., Bucket Name: /s3.xyz.com; http://s3.xyz.com ---> S3’s s3.xyz.com

‣ s3cmd (python)
‣ s3cmd.rb / s3sync.rb (ruby)
‣ S3Hub (Mac)

48

Amazon Web Service: Elasticfox

‣ Firefox’s Extension: Elasticfox

49

Amazon Web Service: Elasticfox

‣ Key Pairs
‣ Private Key
‣ SSH

50

Amazon Web Service: Elasticfox

‣ Security Groups
‣ Open Network Ports

51

AWS: Elastic MapReduce

‣ EC2 + Hadoop

‣ Tools
• Management Console
• elastic-mapreduce CLI

‣ Preparation
• Code --> S3
• Data --> S3
• Log Folder
• Output Folder

‣ Job Flow
• Streaming
• Custom Jar
• Sample Applications

52

AWS: Elastic MapReduce

53

AWS: Elastic MapReduce : Web UI

54

AWS: Elastic MapReduce : CLI for Workflow

[Diagram: jobflow #id]

input/* --> Step1 --> output1/part-000** --> Step2 --> output2/part-000** --> Step3 --> output3/part-000**
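
The chaining in this workflow, where each step reads the folder the previous step wrote, can be sketched as plain data before handing it to any tool. Nothing here calls AWS: `chain_steps`, the bucket name, and the step names are hypothetical, and the dictionaries would still have to be translated into whatever the elastic-mapreduce CLI or a client library expects.

```python
def chain_steps(bucket, step_names):
    """Build step descriptions where step i reads the output of step i-1."""
    steps = []
    source = f"s3://{bucket}/input/"
    for number, name in enumerate(step_names, start=1):
        target = f"s3://{bucket}/output{number}/"
        steps.append({"name": name, "input": source, "output": target})
        source = target  # the next step consumes this step's part files
    return steps


# Hypothetical bucket and step names.
for step in chain_steps("my-bucket", ["Step1", "Step2", "Step3"]):
    print(step["name"], step["input"], "-->", step["output"])
```

Keeping each step's output in its own folder also leaves the intermediate part files available for debugging a failed step.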

55

AWS: Elastic MapReduce

‣ Failed tasks will be rescheduled on other Hadoop slaves.
‣ When a task finishes, any duplicate attempts of the same task that are still running are killed by the tracker.

56

AWS: Elastic MapReduce

57

AWS: SocialFlow Automation

[Diagram labels: Home IDC (Local): DHT, Admin (Read/Write); Amazon (Global): S3, boto python launching EC2 pool, Results, Renderer; Wild World: Users (Read Only)]

58

AWS: EC2, EMR Price Model

Service            Type                  Per Instance Hour     1 Week (7 Days)
EC2                On-Demand             $0.10 (S)             $16.8     KRW 20,865
                                         $0.40 (L)             $67.2     KRW 83,462
                                         $0.80 (E)             $134.4    KRW 166,924
EC2                Reserved              $0.03 (S)             $5.04     KRW 6,259
                   (1yr $325, 3yr $500)  $0.12 (L)             $20.16    KRW 25,038
                                         $0.24 (E)             $40.32    KRW 50,077
Elastic MapReduce  On-Demand             $0.10 + $0.015 (S)    $19.32    KRW 23,995
                                         $0.40 + $0.06 (L)     $77.28    KRW 95,981
                                         $0.80 + $0.12 (E)     $154.56   KRW 191,963

1 USD = 1242 KRW
(S) = Small, (L) = Large, (E) = Extra Large
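
The weekly figures follow from the hourly prices: one week is 24 × 7 = 168 instance hours, the Elastic MapReduce rows add the EMR charge to the EC2 price, and the KRW amounts appear to be truncated rather than rounded. The Reserved rows count only the hourly rate and leave out the one-time fee. A short check, with the hypothetical helper `weekly_cost`:

```python
HOURS_PER_WEEK = 24 * 7   # 168 instance hours
KRW_PER_USD = 1242        # exchange rate quoted on the slide


def weekly_cost(hourly_usd, emr_usd=0.0):
    """Return (USD, KRW) for one instance running a full week."""
    usd = round((hourly_usd + emr_usd) * HOURS_PER_WEEK, 2)
    return usd, int(usd * KRW_PER_USD)  # int() truncates, matching the slide


print(weekly_cost(0.10))          # EC2 On-Demand, Small
print(weekly_cost(0.03))          # EC2 Reserved, Small (hourly part only)
print(weekly_cost(0.10, 0.015))   # Elastic MapReduce, Small
```

For example, the Small on-demand row is 0.10 × 168 = 16.8 USD, and 16.8 × 1242 = 20,865.6, shown as KRW 20,865.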

59

AWS: Performance

http://tinyurl.com/qj6ao7


60

AWS: Performance

61

AWS: Performance

http://tinyurl.com/p9jsyz


62

AWS: Performance

http://tinyurl.com/cqqxgl


63

10 Cent Tips

‣ AWS EC2

‣ Minimizing set-up time with prepared shell scripts

‣ Use Boto for automating deployments

‣ Use S3 (Free of Charge between S3 and EC2 in the same region)

‣ $0.030 per GB through June 30, 2000 ($0.1 per GB normal price)

‣ AWS Elastic MapReduce

‣ Enabling the SSH port (22) and Hadoop-related ports (9100, 9101)

‣ Access to Master Node: ssh -i keypair hadoop@public_dns_name

‣ Double Check (PATH, etc)

‣ Debug, Debug, Debug

‣ Use EC2 for Hadoop (e.g. Cloudera’s Hadoop AMI) (No extra cost for Hadoop!)

64

10 Cent Tips

‣ AWS S3

‣ Setting HTTP header for images and static resources.

‣ Cache-Control: max-age=31536000

‣ Block Search Bots

‣ robots.txt at the root of a Bucket
• User-agent: *
• Disallow: /

‣ Using BitTorrent for large files

‣ http://s3.xyz.com/xfile.zip?torrent

‣ Compress Rendered HTML with gzip

‣ Content-Encoding: gzip

$ s3cmd put index.html s3://s3.xyz.com/www \
    --mime-type "text/html" \
    --add-header "Content-Encoding: gzip" \
    --acl-public
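
S3 serves objects exactly as stored, so a page uploaded with Content-Encoding: gzip has to be compressed before the upload. A minimal sketch of that preparation step using only the standard library; the helper name `gzip_html` is hypothetical, and the actual upload (with s3cmd, Boto, or another client) is left out.

```python
import gzip


def gzip_html(html):
    """Compress rendered HTML and return (body, headers) ready for upload."""
    body = gzip.compress(html.encode("utf-8"))
    headers = {
        "Content-Type": "text/html",
        "Content-Encoding": "gzip",  # tells browsers to decompress the body
    }
    return body, headers


page = "<html><body>" + "SocialFlow " * 200 + "</body></html>"
body, headers = gzip_html(page)
# len(body) is much smaller than len(page) for repetitive markup like this.
```

Clients that do not accept gzip cannot read such an object, so this suits pages served to ordinary browsers.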

65

Amazon Web Service : Limitations

66

References

‣ 10 MapReduce Tips, Cloudera, http://tinyurl.com/pxuqup
‣ Christian Charras, Thierry Lecroq, Handbook of Exact String-Matching Algorithms
‣ Dan Pritchett (eBay), BASE: An Acid Alternative, p.48-55, ACM Queue May/June 2008
‣ Edward Chang (Google Research), Mining Large Scale Social Networks, MMDS ’08
‣ Edward Walker, Benchmarking Amazon EC2 for high-performance scientific computing
‣ Matei Zaharia et al., Improving MapReduce Performance in Heterogeneous Environments, OSDI ’08

‣ Following Twitter
• http://twitter.com/AmazonEC2
• http://twitter.com/AmazonS3S3

67