
SNS Analysis using Cloud Computing Services

Date post: 11-May-2015
Upload: dongwoo-lee
Description:
SNS Cloud AWS S3 Hadoop MapReduce
Page 1: SNS Analysis using Cloud Computing Services

SNS Analysis using Cloud Computing Services
DHT-based Key-Value Storage and MapReduce-based Analysis

DongWoo Lee [email protected]

SocialFlow / OikoLab (Oiko Laboratory)

2CloudKR, PlatformDay2009

Page 2: SNS Analysis using Cloud Computing Services

Agenda

‣ Introduction
  • Social Network Service
  • Motivation : Visualization, Social Network Analysis
  • SocialFlow
  • Scale Out Technologies : Cloud Computing

‣ SNS Analysis Architecture based on Cloud
  • Overall Process
  • Crawling
  • DHT Storage (CouchDB)
  • MapReduce
  • Pair-Wise Similarity

‣ Cloud Computing Service
  • Amazon Web Service
  • EC2 / S3 / Elastic MapReduce
  • Tips

‣ References

Page 3: SNS Analysis using Cloud Computing Services

Introduction

Mobile Device, Cloud Computing, Social Network

Page 4: SNS Analysis using Cloud Computing Services

Social Network Service

“Social Applications = Social Networks”

“A social network is a collection of people bound together through a specific set of social relations.”

“A collection of people is a social network if and only if it is possible for something to spread virally through that collection.”

Page 5: SNS Analysis using Cloud Computing Services

Social Network Services : Twitter, Facebook

Page 6: SNS Analysis using Cloud Computing Services

Social Applications


Page 7: SNS Analysis using Cloud Computing Services

Social Networks

http://www.vincos.it/world-map-of-social-networks/


Page 8: SNS Analysis using Cloud Computing Services

Social Network Analysis

‣ Social Graph Analysis

‣ Visualization

‣ Person-to-Person Relationship

‣ Temporal Mind Mining (Content Clustering)

‣ Post-Mortem Log Processing


Page 9: SNS Analysis using Cloud Computing Services

Social Network Analysis : Visualization

‣ Social Graph (50 People)

Page 10: SNS Analysis using Cloud Computing Services

Social Network Analysis : Visualization

‣ Social Graph (100 People)

Page 11: SNS Analysis using Cloud Computing Services

Social Network Analysis : Visualization

‣ Social Graph (200 People)

‣ Limitations
  ‣ Visualization
  ‣ Computational Complexity

Page 12: SNS Analysis using Cloud Computing Services

Social Network Analysis : Visualization

‣ Social 3D Graph

Page 13: SNS Analysis using Cloud Computing Services

SocialFlow

‣ Thoughts, Feelings, Interests, Relationship and Information of SNS

‣ Real-time Massive Social Data Streams

‣ Difficult to follow the Social Streams

‣ Need a way to get a summary or clustered information based on Common Interests


Page 14: SNS Analysis using Cloud Computing Services

SocialFlow

‣ Getting Common Flows of people through Content Similarities

‣ Reflecting Short-Term Interests of People

‣ Extracting Hot Issues

‣ Revealing Relationships among In/Out Resources

‣ Implementing Scale-Out Technologies

‣ Evolving toward Recommendation System based on Collective Intelligence


Page 15: SNS Analysis using Cloud Computing Services

Scale Out Technologies : Cloud Computing

Page 16: SNS Analysis using Cloud Computing Services

Why Cloud Computing?

‣ SPOF (Single Point of Failure)

‣ Cluster Administration (Who does this?)

‣ Initial Infrastructure Investment (Risk Management)

‣ Focus on Main Thing (Intelligence)

‣ Enable Highly Scalable Services


New resource provision paradigms for Grid Infrastructures: Virtualization and Cloud / ISGC 2009

http://tinyurl.com/nacgu7


Page 17: SNS Analysis using Cloud Computing Services

Cloud Computing: e.g. Storage Failure

Source: Failure Trends in a Large Disk Drive Population, by Eduardo Pinheiro, Wolf-Dietrich Weber and Luiz André Barroso, Google Inc.

Page 18: SNS Analysis using Cloud Computing Services

SNS Analysis Architecture based on Cloud

Page 19: SNS Analysis using Cloud Computing Services

Experimental Project

‣ Python / Django / Boto

‣ ML / Data Mining

‣ DHT / CouchDB

‣ Cloud / AWS S3, EC2, Hadoop MapReduce

Page 20: SNS Analysis using Cloud Computing Services

Workflow

[Diagram: components: SNS, Crawler, MapReduce, Post-Processing, CDN, User. Zones: In-house Cluster (Local DataCenter), Cloud Service]

Page 21: SNS Analysis using Cloud Computing Services

Technologies : Before

[Diagram: technology stack before moving to the cloud]

‣ Key-Value Storage: CouchDB

‣ Consistent DHT: Hash_ring

‣ MapReduce: CouchJS

‣ Machine Learning: HomeMade

‣ Crawlers

Page 22: SNS Analysis using Cloud Computing Services

Technologies : After

[Diagram: technology stack after moving to the cloud]

‣ Key-Value Storage: CouchDB

‣ Consistent DHT: Hash_ring

‣ MapReduce: EC2 Hadoop

‣ Machine Learning: HomeMade

‣ Storage: S3

‣ Crawlers

Page 23: SNS Analysis using Cloud Computing Services

Crawling

[Diagram: Crawlers --> DHT Replication --> DBs --> Indexer (Mapper) --> Index File [ term, doc ]]

‣ Fetching recent postings of SNS

‣ Storing fetched postings to CouchDB Storage through DHT Layer (which selects a server)

‣ Pushing raw data into the Cloud to process them with MapReduce

Page 24: SNS Analysis using Cloud Computing Services

Consistent DHT (Distributed Hash Table)

‣ Uniform key distribution and load balancing with a good hash function

‣ Minimizing the effects of a storage crash or temporary downtime

‣ High availability with replication scheme

[Diagram: hash ring with positions 0, 1, 2, ..., k-1, k, k+1, ..., N-1 mapped to Node k-1, Node k, Node k+1, ..., Node N-1. Replicas: Replicate(k, k-1, k+1)]

‣ Notice: A real node has non-linear portions of the total key space.
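
The ring idea above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Hash_ring library the deck uses: the class name `HashRing`, the `vnodes` and `replicas` parameters, and the choice of MD5 are all hypothetical. The slide replicates a key to both neighbors (k-1 and k+1); this sketch instead walks clockwise to the next distinct nodes, which is a common simplification.

```python
import bisect
import hashlib


class HashRing:
    """Consistent hash ring with virtual nodes and successor replication."""

    def __init__(self, nodes, vnodes=64, replicas=2):
        self.replicas = replicas
        self._node_count = len(set(nodes))
        # Each real node owns several scattered points on the ring,
        # so it holds multiple separate portions of the key space.
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in set(nodes)
            for i in range(vnodes)
        )
        self._positions = [pos for pos, _ in self._ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

    def nodes_for(self, key):
        """Return the primary node followed by `replicas` distinct successors."""
        want = min(self.replicas + 1, self._node_count)
        i = bisect.bisect(self._positions, self._hash(key))
        found = []
        while len(found) < want:
            node = self._ring[i % len(self._ring)][1]
            if node not in found:
                found.append(node)
            i += 1
        return found
```

With four storage servers and `replicas=2`, `HashRing(["a", "b", "c", "d"]).nodes_for("post:1")` returns three distinct servers. If one server is removed, only the keys whose primary was that server move to a different primary.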


Page 25: SNS Analysis using Cloud Computing Services

Consistent DHT (Distributed Hash Table)

[Diagram: components: SNS Crawler, DHT Front End, Memory Cache, DHT ring (Node 0, 1, 2, ..., k-1, k, k+1, ..., N-1), SNS Analysis, AWS S3 (html, image), Views (Admin View, User View). Traffic: Admin Traffic, Anonymous User Traffic, Generated Contents]

Page 26: SNS Analysis using Cloud Computing Services

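Consistent DHT : Replication

[Diagram: ring of nodes A, B, C, D; each node also holds replicas of its two neighbors (A holds B and D, B holds C and A, C holds D and B, D holds A and C)]

* Replica = 2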

Page 27: SNS Analysis using Cloud Computing Services

CouchDB (Key-Value Storage)

‣ Erlang-based Key-Value Storage

‣ Storage Engine (MVCC, B-tree)

‣ RESTful API

‣ Server-side JavaScript Engine (MapReduce)

‣ View Engine

‣ Futon Web UI

Page 28: SNS Analysis using Cloud Computing Services

CouchDB: Server-side Javascript

‣ Purpose
  ‣ Local Computations on Local Data Sets

‣ Features
  ‣ Mozilla’s SpiderMonkey
  ‣ MapReduce Framework with JavaScript
  ‣ Fork External Process (couchjs)

‣ Performance Enhancements Expected
  ‣ Google’s V8 (Chrome’s JavaScript Engine / JIT)

http://tinyurl.com/m76sx3

Page 29: SNS Analysis using Cloud Computing Services

CouchDB: MapReduce

doc = (d1, d2, fq)

dx: { di }

Page 30: SNS Analysis using Cloud Computing Services

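Map & Reduce : Pair-Wise Similarity

[Diagram: two chained MapReduce passes over the DBs]

1. Indexer (Mapper): DBs --> Index File [ term, doc ]
2. DocGrouper (Reducer): Index File --> Group File [ term, { docs } ]
3. DocCombinator (Mapper): [ term, { docs } ] => Candidate File [ doc1, doc2 ]
4. DocPairCounter (Reducer): Candidate File --> Result File [ freq, doc1, doc2 ]

Files shown: Index File, Group File, Doc File, Candidate File, Result File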

‣ Indexer and Grouper for Processing Korean.

‣ No NLP and No Structural Analysis.

‣ Produce a pairwise similarity between two postings.
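
The four stages named on this slide can be mimicked in plain Python to show what each one emits. This is a sketch under assumptions, not the deck's implementation: the function names are hypothetical, everything runs in one process instead of on Hadoop, and whitespace tokenization stands in for the Korean-aware indexer.

```python
from collections import defaultdict
from itertools import combinations


def index_postings(postings):
    """Indexer (map): emit one (term, doc_id) pair per distinct term."""
    for doc_id, text in postings.items():
        for term in set(text.lower().split()):
            yield term, doc_id


def group_by_term(pairs):
    """DocGrouper (reduce): collect term -> set of doc ids containing it."""
    groups = defaultdict(set)
    for term, doc_id in pairs:
        groups[term].add(doc_id)
    return groups


def candidate_pairs(groups):
    """DocCombinator (map): emit every pair of docs that share a term."""
    for docs in groups.values():
        for doc1, doc2 in combinations(sorted(docs), 2):
            yield doc1, doc2


def count_pairs(pairs):
    """DocPairCounter (reduce): (doc1, doc2) -> number of shared terms."""
    counts = defaultdict(int)
    for pair in pairs:
        counts[pair] += 1
    return dict(counts)


def pairwise_similarity(postings):
    """Chain the four stages; the count is the similarity score."""
    return count_pairs(candidate_pairs(group_by_term(index_postings(postings))))
```

For example, two postings that share the terms "hadoop" and "s3" get a score of 2, while a posting with no shared terms never appears in the result. A very common term produces a large group in the DocCombinator stage, since the number of pairs grows quadratically with group size.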


Page 31: SNS Analysis using Cloud Computing Services

Map & Reduce : Optimization

‣ Concerns
  ‣ Consider Key Group Size Distribution
  ‣ Data Load Balancing
  ‣ Barrier Point

‣ Sample Data
  ‣ Two months of postings from my friends
  ‣ Reachable graph: 4,060 People
  ‣ Total Postings: 206,115

Page 32: SNS Analysis using Cloud Computing Services

Pair-Wise Similarity and its TreeMap

Postings: 110,008
Users: 2,691

Score >= 6

Page 33: SNS Analysis using Cloud Computing Services

Pair-Wise Similarity and its Cluster

➡ One issue and different opinions among people

Page 34: SNS Analysis using Cloud Computing Services

Pair-Wise Similarity and its Cluster

➡ Common Interest / Hot Issue

Page 35: SNS Analysis using Cloud Computing Services

Pair-Wise Similarity and its Cluster

➡ One person and the similar contents pattern (specialty)

Page 36: SNS Analysis using Cloud Computing Services

Pair-Wise Similarity and its Cluster

➡ Similar Structure of Sentences (trendy, parody)

Page 37: SNS Analysis using Cloud Computing Services

Deployment

www

Flickr

S3/CloudFront

EC2


Page 38: SNS Analysis using Cloud Computing Services

Cloud Computing Service

Page 39: SNS Analysis using Cloud Computing Services

Before the Cloud Age

‣ Smart Shell Guru’s Daily Work : Parallel Sort

Split (master)
$ wc -l data
$ split -l 1000k data

Distribute chunks (scp / NFS)

Process and sort each chunk (worker)
$ nohup ./work.sh data1 > data1.processed
$ nohup sort -r data1.processed > data1.sorted

Collect sorted chunks (scp / NFS)

Merge (master)
$ sort -rm data*.sorted > data.sorted

➡ Need to prepare/maintain physical machines and resources
➡ Need to monitor job progress (wait and see job’s status)
➡ Need to cope with machine failure (slave nodes / storages / networks)
➡ Need to schedule multiple jobs

Complexity
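
The split / sort / merge pattern behind the shell commands on this slide can be sketched in Python as follows. It is a single-process illustration only: the function names and the `chunk_size` parameter are hypothetical, and the per-chunk sorts that would run on separate machines are simply run one after another here.

```python
import heapq


def split_chunks(lines, chunk_size):
    """Like `split -l`: cut the input into fixed-size chunks."""
    for start in range(0, len(lines), chunk_size):
        yield lines[start:start + chunk_size]


def parallel_sort_desc(lines, chunk_size):
    """Sort each chunk in reverse order, then merge the sorted chunks."""
    # Each chunk sort corresponds to `sort -r` on one worker.
    sorted_chunks = [
        sorted(chunk, reverse=True) for chunk in split_chunks(lines, chunk_size)
    ]
    # heapq.merge plays the role of `sort -rm`: it only merges
    # inputs that are already sorted, without re-sorting them.
    return list(heapq.merge(*sorted_chunks, reverse=True))
```

The merge step is cheap because it streams through pre-sorted inputs. The operational burden listed on the slide (machines, monitoring, failures, scheduling) sits around this logic rather than in it.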


Page 40: SNS Analysis using Cloud Computing Services

Amazon Web Service : Overview

[Diagram: AWS components and how clients reach them]

‣ EC2 (Elastic Compute Cloud): instances launched from an AMI (Machine Image); Admin over SSH, Clients over HTTP
‣ EBS (Elastic Block Store): 1 GB to 1 TB, mounted on EC2
‣ S3 (Simple Storage Service): Buckets, Objects, Permissions; API / HTTP
‣ SimpleDB: key-value
‣ SQS (Simple Queue Service): Messages
‣ CloudFront: Edges
‣ Elastic MapReduce: Instant EC2 Hadoop Cluster
‣ Monitoring: CloudWatch, Auto Scaling, Elastic Load Balancing
‣ Import/Export: Offline, eSATA/USB
‣ Tools: Mgmt Console, EC2 CLI
‣ Credentials: Access Key ID, Secret Access Key, Key Pair

Page 41: SNS Analysis using Cloud Computing Services

Amazon Web Service

‣ Amazon Management Console

Page 42: SNS Analysis using Cloud Computing Services

AWS : AMI

AMI: Amazon Machine Image

Page 43: SNS Analysis using Cloud Computing Services

AWS : Paid AMI / The Cloud Market

AMI: Amazon Machine Image

Paid AMI

Page 44: SNS Analysis using Cloud Computing Services

AWS : How to make an AMI (1)

Loopback File
# dd if=/dev/zero of=new_image.fs bs=1M count=1024

Make ext3 file system
# mke2fs -F -j new_image.fs
# mkdir /mnt/ec2-fs
# mount -o loop new_image.fs /mnt/ec2-fs
# mkdir /mnt/ec2-fs/dev
# /sbin/MAKEDEV -d /mnt/ec2-fs/dev -x console
# /sbin/MAKEDEV -d /mnt/ec2-fs/dev -x null
# /sbin/MAKEDEV -d /mnt/ec2-fs/dev -x zero
# mkdir /mnt/ec2-fs/etc

Create /mnt/ec2-fs/etc/fstab (Add /dev/sda1 --> /, /dev/pts, shm, /proc, /sys)
Create yum-xen.conf

# mkdir /mnt/ec2-fs/proc
# mount -t proc none /mnt/ec2-fs/proc
# yum -c yum-xen.conf --installroot=/mnt/ec2-fs -y groupinstall Base

Edit /mnt/ec2-fs/etc/sysconfig/network-scripts/ifcfg-eth0
Edit /mnt/ec2-fs/etc/sysconfig/network
Edit /mnt/ec2-fs/etc/fstab (Add /dev/sda2 --> /mnt, /dev/sda3 --> swap)

# chroot /mnt/ec2-fs /bin/sh
Edit services

Page 45: SNS Analysis using Cloud Computing Services

AWS : How to make an AMI (2)

Building an AMI
# yum install ruby
# rpm -i ec2-ami-tools-noarch.rpm (Download from public s3 bucket)
# ec2-bundle-image -i new_image.fs -k my-private-key.key -u aws-user-id

Local Machine Root File System
# ec2-bundle-vol -k my-private-key.key -s 1000 -u aws-user-id

Upload to S3
# ec2-upload-bundle -b my-bucket -m image.manifest -a my-aws-access-key-id -s my-secret-key-id

Register AMI
# ec2-register my-bucket/image.manifest
IMAGE ami-xxxx

Testing
# ec2-describe-images ami-xxxx

Deregister AMI
# ec2-deregister ami-xxxx

Running AMI
# ec2-run-instances ami-xxxx -n 1

http://docs.amazonwebservices.com/AWSEC2/2006-06-26/DeveloperGuide/

Page 46: SNS Analysis using Cloud Computing Services

AWS : EC2 Running Instance

‣ AWS Management Console

Page 47: SNS Analysis using Cloud Computing Services

AWS : EC2 Running Instance

Page 48: SNS Analysis using Cloud Computing Services

Amazon Web Service: Access Methods

‣ Access Key ID / Secret Access Key / Key Pairs

‣ Amazon Management Console
‣ EC2 API (WSDL) / EC2 CLI (Command Line Interface)
‣ SSH

‣ Firefox Extensions
  • S3 Firefox Organizer
  • Elasticfox

‣ S3
  • DNS: s3 CNAME s3.amazonaws.com.
    e.g.) Bucket Name: /s3.xyz.com
    http://s3.xyz.com ---> S3's s3.xyz.com

‣ s3cmd (python)
‣ s3cmd.rb / s3sync.rb (ruby)
‣ S3Hub (Mac)

Page 49: SNS Analysis using Cloud Computing Services

Amazon Web Service: Elasticfox

‣ Firefox’s Extension: Elasticfox

Page 50: SNS Analysis using Cloud Computing Services

Amazon Web Service: Elasticfox

‣ Key Pairs
‣ Private Key
‣ SSH

Page 51: SNS Analysis using Cloud Computing Services

Amazon Web Service: Elasticfox

‣ Security Groups
‣ Open Network Ports

Page 52: SNS Analysis using Cloud Computing Services

AWS: Elastic MapReduce

‣ EC2 + Hadoop

‣ Tools
  ‣ Management Console
  ‣ elastic-mapreduce CLI

‣ Preparation
  ‣ Code --> S3
  ‣ Data --> S3
  ‣ Log Folder
  ‣ Output Folder

‣ Job Flow
  ‣ Streaming
  ‣ Custom Jar
  ‣ Sample Applications
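
A Streaming job flow runs ordinary scripts as mapper and reducer: each reads lines and writes tab-separated key/value lines, and Hadoop sorts the mapper output by key before the reducer sees it. The sketch below shows that contract with a simple term count. The function names are hypothetical, and the functions take any iterable of lines so they can be tried without a cluster; in a real streaming script they would be fed from standard input.

```python
from itertools import groupby


def map_lines(lines):
    """Mapper: emit 'term<TAB>1' for every whitespace-separated term."""
    for line in lines:
        for term in line.split():
            yield f"{term}\t1"


def reduce_lines(sorted_lines):
    """Reducer: sum the counts of consecutive lines that share a key.

    Relies on the input being sorted by key, as Hadoop guarantees
    between the map and reduce phases.
    """
    parsed = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for term, group in groupby(parsed, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{term}\t{total}"
```

Locally, `sorted(map_lines(...))` stands in for the shuffle-and-sort step, so `reduce_lines(sorted(map_lines(lines)))` reproduces what a one-step job flow would write to its output folder.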


Page 53: SNS Analysis using Cloud Computing Services

AWS: Elastic MapReduce

Page 54: SNS Analysis using Cloud Computing Services

AWS: Elastic MapReduce : Web UI

Page 55: SNS Analysis using Cloud Computing Services

AWS: Elastic MapReduce : CLI for Workflow

jobflow #id

input/* --> Step1 --> output1/part-000** --> Step2 --> output2/part-000** --> Step3 --> output3/part-000**

Page 56: SNS Analysis using Cloud Computing Services

AWS: Elastic MapReduce

‣ Failed tasks will be rescheduled on other Hadoop slaves.
‣ If a task is finished, the same instance will be killed by a tracker.

Page 57: SNS Analysis using Cloud Computing Services

AWS: Elastic MapReduce

Page 58: SNS Analysis using Cloud Computing Services

AWS: SocialFlow Automation

[Diagram: three zones. Home IDC (Local): DHT, Admin, Read/Write. Amazon (Global): S3, EC2 pool launched with boto (python), Renderer, Results. Wild World: Users, Read Only]

Page 59: SNS Analysis using Cloud Computing Services

AWS: EC2, EMR Price Model

Service | Type | Size | Per Instance Hour | 1 Week (7 Days), USD | 1 Week (7 Days), KRW
EC2 | On-Demand | S | $0.10 | $16.80 | KRW 20,865
EC2 | On-Demand | L | $0.40 | $67.20 | KRW 83,462
EC2 | On-Demand | E | $0.80 | $134.40 | KRW 166,924
EC2 | Reserved (1 yr $325, 3 yr $500) | S | $0.03 | $5.04 | KRW 6,259
EC2 | Reserved (1 yr $325, 3 yr $500) | L | $0.12 | $20.16 | KRW 25,038
EC2 | Reserved (1 yr $325, 3 yr $500) | E | $0.24 | $40.32 | KRW 50,077
Elastic MapReduce | On-Demand | S | $0.10 + $0.015 | $19.32 | KRW 23,995
Elastic MapReduce | On-Demand | L | $0.40 + $0.06 | $77.28 | KRW 95,981
Elastic MapReduce | On-Demand | E | $0.80 + $0.12 | $154.56 | KRW 191,963

1 USD = 1242 KRW
(S) = Small, (L) = Large, (E) = Extra Large
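
The weekly figures on this slide follow from multiplying the hourly rate by 168 hours and converting at the stated exchange rate. A small sketch of that arithmetic, with a hypothetical function name and the assumption (consistent with the slide's numbers) that the KRW amount is truncated rather than rounded:

```python
HOURS_PER_WEEK = 7 * 24  # 168 instance hours
KRW_PER_USD = 1242       # exchange rate stated on the slide


def weekly_cost(hourly_usd, surcharge_usd=0.0):
    """Return (USD, KRW) for one instance running a full week.

    `surcharge_usd` is the per-hour Elastic MapReduce fee charged
    on top of the EC2 instance price.
    """
    usd = round((hourly_usd + surcharge_usd) * HOURS_PER_WEEK, 2)
    return usd, int(usd * KRW_PER_USD)
```

For a Small on-demand instance this gives $16.80 and KRW 20,865. Adding the $0.015 Elastic MapReduce fee gives $19.32 and KRW 23,995, matching the slide.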


Page 60: SNS Analysis using Cloud Computing Services

AWS: Performance

http://tinyurl.com/qj6ao7


Page 61: SNS Analysis using Cloud Computing Services

AWS: Performance

Page 62: SNS Analysis using Cloud Computing Services

AWS: Performance

http://tinyurl.com/p9jsyz


Page 63: SNS Analysis using Cloud Computing Services

AWS: Performance

http://tinyurl.com/cqqxgl


Page 64: SNS Analysis using Cloud Computing Services

10 Cent Tips

‣ AWS EC2

‣ Minimizing set-up time with prepared shell scripts

‣ Use Boto for automating deployments

‣ Use S3 (Free of Charge between S3 and EC2 in the same region)

‣ $0.030 per GB through June 30, 2000 ($0.1 per GB normal price)

‣ AWS Elastic MapReduce

‣ Enabling the SSH port (22) and Hadoop related ports (9100, 9101)

‣ Access to Master Node: ssh -i keypair hadoop@public_dns_name

‣ Double Check (PATH, etc)

‣ Debug, Debug, Debug

‣ Use EC2 for Hadoop (e.g. Cloudera’s Hadoop AMI) (No extra cost for Hadoop!)

Page 65: SNS Analysis using Cloud Computing Services

10 Cent Tips

‣ AWS S3

‣ Setting HTTP header for images and static resources.
  ‣ Cache-Control: max-age=31536000

‣ Block Search Bots
  ‣ robots.txt at the root of a Bucket
  ‣ User-agent: *
  ‣ Disallow: /

‣ Using BitTorrent for large files
  ‣ http://s3.xyz.com/xfile.zip?torrent

‣ Compress Rendered HTML with gzip
  ‣ Content-Encoding: gzip

$ s3cmd put index.html s3://s3.xyz.com/www \
    --mime-type "text/html" \
    --add-header "Content-Encoding: gzip" \
    --acl-public
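
Note that S3 stores whatever bytes it is given, so the file has to be gzip-compressed before upload for the `Content-Encoding: gzip` header to be truthful. A minimal sketch of that preparation step, assuming a hypothetical helper name and combining the gzip and Cache-Control tips into one header set (the slide lists them as separate tips):

```python
import gzip


def prepare_static_html(html, max_age=31536000):
    """Gzip the rendered HTML and build the matching HTTP headers.

    Returns (body_bytes, headers). The upload itself is left to
    whatever tool is in use, such as s3cmd or Boto.
    """
    body = gzip.compress(html.encode("utf-8"))
    headers = {
        "Content-Type": "text/html",
        "Content-Encoding": "gzip",
        "Cache-Control": f"max-age={max_age}",  # one year by default
    }
    return body, headers
```

Browsers that receive these headers decompress the body transparently. A long `max-age` suits files whose names change when their content changes, and is less suitable for pages that are regenerated under the same key.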


Page 66: SNS Analysis using Cloud Computing Services

Amazon Web Service : Limitations

Page 67: SNS Analysis using Cloud Computing Services

References

‣ 10 MapReduce Tips, Cloudera, http://tinyurl.com/pxuqup
‣ Christian Charras, Thierry Lecroq, Handbook of Exact String-Matching Algorithms
‣ Dan Pritchett (eBay), BASE: An ACID Alternative, p.48-55, ACM Queue May/June 2008
‣ Edward Chang (Google Research), Mining Large Scale Social Networks, MMDS ’08
‣ Edward Walker, Benchmarking Amazon EC2 for high-performance scientific computing
‣ Matei Zaharia et al., Improving MapReduce Performance in Heterogeneous Environments, OSDI ’08

‣ Following Twitter
  ‣ http://twitter.com/AmazonEC2
  ‣ http://twitter.com/AmazonS3
