Sam zhang demo

Date post: 22-Jan-2017
Upload: chentao-zhang

Improvement for StackOverflow.com

Chentao Zhang Insight Data Engineering SV

Motivations

[Figure: example tags "java" and "hadoop"]

Input Data

Question: {"post_id":67172, "post_date":"6-10-2015-00-01-02", "type":0, "parent_id":0, "title":"Java Exception", "body":"....", "tags":"java;algorithm", "user_id":782, …}

Answer: {"post_id":67172, "post_date":"6-10-2015-00-01-23", "type":1, "parent_id":67172, "title":"", "body":"You should....", "tags":"", "user_id":1982, …}
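As a rough illustration, the two sample records can be parsed and joined on parent_id to get the time it took to receive an answer. This is a minimal sketch, not the code behind the deck: the timestamp layout (month-day-year-hour-minute-second) is an assumption because the slides do not state the field order, the "…" parts of the records are left out, and the helper names parse_post_date and answer_time_seconds are hypothetical.

```python
import json
from datetime import datetime

# Assumed layout of post_date: month-day-year-hour-minute-second.
POST_DATE_FORMAT = "%m-%d-%Y-%H-%M-%S"

# The two sample records from the slide, limited to the fields used here.
question_json = (
    '{"post_id": 67172, "post_date": "6-10-2015-00-01-02", "type": 0, '
    '"parent_id": 0, "tags": "java;algorithm", "user_id": 782}'
)
answer_json = (
    '{"post_id": 67172, "post_date": "6-10-2015-00-01-23", "type": 1, '
    '"parent_id": 67172, "tags": "", "user_id": 1982}'
)


def parse_post_date(raw):
    """Turn a post_date string into a datetime (hypothetical helper)."""
    return datetime.strptime(raw, POST_DATE_FORMAT)


def answer_time_seconds(question, answer):
    """Seconds between a question and one of its answers."""
    # type 1 marks an answer; parent_id points at the question's post_id.
    if answer["type"] != 1 or answer["parent_id"] != question["post_id"]:
        raise ValueError("answer does not belong to this question")
    delta = parse_post_date(answer["post_date"]) - parse_post_date(question["post_date"])
    return int(delta.total_seconds())


question = json.loads(question_json)
answer = json.loads(answer_json)
tags = question["tags"].split(";")  # tags are separated by semicolons
elapsed = answer_time_seconds(question, answer)
```

For the sample pair the two timestamps differ only in the seconds field, so the result does not depend on whether the first two fields are month-day or day-month.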

Data Modeling and Queries: Elasticsearch

1. Index
• A collection of documents that have somewhat similar characteristics
• Corresponds to a 'database' in a relational database

2. Type
• A logical category/partition of your index whose semantics are completely up to you
• Corresponds to a 'table' in a relational database

3. Document
• A basic unit of information that can be indexed
• Corresponds to a 'row' in a relational database
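To make the index/type/document mapping concrete, one question row could be addressed by index, type, and id when preparing a bulk request body. This is a minimal sketch using only the standard library. The index name stackover and type name questions are taken from the later slides, the helper bulk_index_lines is hypothetical, and mapping types exist only in older Elasticsearch versions, so the _type field is an assumption about the version in use.

```python
import json


def bulk_index_lines(index, doc_type, docs, id_field):
    """Build a newline-delimited bulk body: one action line, then one source line per document."""
    lines = []
    for doc in docs:
        # The action line says where the document goes (index / type / id).
        action = {"index": {"_index": index, "_type": doc_type, "_id": doc[id_field]}}
        lines.append(json.dumps(action))
        # The source line is the document itself (the 'row').
        lines.append(json.dumps(doc))
    # Bulk bodies are terminated by a final newline.
    return "\n".join(lines) + "\n"


docs = [{"question_id": 231, "tags": "Java", "answer_time": 3010}]
body = bulk_index_lines("stackover", "questions", docs, "question_id")
```

The string in body is what would be sent to the bulk endpoint; the sketch stops short of making any network call.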

Data Modeling and Queries

stackover/questions (index: stackover, type: questions; each row is one document):

question_id  tags   answer_time(sec)  posted_at            Random_sq
231          Java   3010              2016_01_02_21_20_01  11_10
290          spark  7381              2016_01_02_22_09_01  11_28
341          Java   5611              2016_01_10_01_02_05  11_31

• Probability that a question labeled with a specific tag (such as ‘java’) is answered within 10 minutes = (number of questions tagged ‘java’ and answered within 10 minutes) / (total number of questions tagged ‘java’)
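The ratio above can be sketched directly over the three sample rows of the questions table. This is only an illustration under assumptions: each row carries a single tag as in the table, tag matching ignores case, and the function name prob_answered_within is hypothetical.

```python
# The three sample rows from the questions table (answer_time in seconds).
ROWS = [
    {"question_id": 231, "tags": "Java", "answer_time": 3010},
    {"question_id": 290, "tags": "spark", "answer_time": 7381},
    {"question_id": 341, "tags": "Java", "answer_time": 5611},
]


def prob_answered_within(rows, tag, limit_seconds):
    """Fraction of questions with the given tag that were answered within limit_seconds."""
    tagged = [r for r in rows if r["tags"].lower() == tag.lower()]
    if not tagged:
        return None  # no questions with this tag, so the ratio is undefined
    fast = sum(1 for r in tagged if r["answer_time"] <= limit_seconds)
    return fast / len(tagged)


p_10min = prob_answered_within(ROWS, "java", 10 * 60)
p_1hour = prob_answered_within(ROWS, "java", 60 * 60)
```

With only these three rows, neither Java question is answered within 10 minutes, so p_10min is 0.0; widening the window to one hour includes question 231 and gives 0.5.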

• Stratified sampling, with strata defined by tags and posted_at (month)
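One way to picture stratified sampling by tag and month is to group rows by those two keys and draw the same fraction from every group. This is a sketch under assumptions, not the deck's method: the slides do not explain how the Random_sq column is used, so this example draws with Python's random module instead, the rows are synthetic, and stratified_sample is a hypothetical name.

```python
import random
from collections import defaultdict


def stratified_sample(rows, fraction, seed=0):
    """Draw roughly `fraction` of the rows from every (tag, month) stratum."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    strata = defaultdict(list)
    for row in rows:
        month = row["posted_at"][:7]  # "YYYY_MM" prefix of posted_at
        strata[(row["tags"].lower(), month)].append(row)
    sample = []
    for key in sorted(strata):
        members = strata[key]
        # Keep at least one row per stratum so small groups are not lost.
        k = max(1, round(fraction * len(members)))
        sample.extend(rng.sample(members, k))
    return sample


# Synthetic rows: 10 Java and 4 spark questions in January, 6 Java in February.
rows = (
    [{"question_id": i, "tags": "Java", "posted_at": "2016_01_02_21_20_01"} for i in range(10)]
    + [{"question_id": 100 + i, "tags": "spark", "posted_at": "2016_01_02_22_09_01"} for i in range(4)]
    + [{"question_id": 200 + i, "tags": "Java", "posted_at": "2016_02_10_01_02_05"} for i in range(6)]
)
sample = stratified_sample(rows, fraction=0.5)
```

Sampling per stratum keeps rare tag and month combinations represented, which a single uniform sample over all rows would not guarantee.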

Data Modeling and Queries

stackovergraph/userstags:

userid  tags
231     ["Java", "hadoop"]
290     ["hadoop", "Spark"]
341     ["java", "sql", "hadoop"]

stackovergraph/userstags (example used to build the tag graph):

userid  tags
231     ["Java", "JVM"]
290     ["JVM", "Spark"]
341     ["Java", "sql", "JVM"]

Tag graph built from the table above, one user at a time (num = number of users with that tag):

Tag    num
Java   2
JVM    3
spark  1
sql    1

[Figure: tag graph with nodes Java, JVM, Spark, and Sql; edge weights: Java-JVM 2, JVM-Spark 1, Java-Sql 1, JVM-Sql 1]
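The step-by-step build of the tag graph can be sketched as counting, per user, each tag once and each pair of tags once. This is a minimal sketch under assumptions: tags are compared case-insensitively (the table mixes "Java" and "java"), edges are undirected, and build_tag_graph is a hypothetical name.

```python
from collections import Counter
from itertools import combinations

# The userstags example: user id -> tags of the questions that user has answered.
USER_TAGS = {
    231: ["Java", "JVM"],
    290: ["JVM", "Spark"],
    341: ["Java", "sql", "JVM"],
}


def build_tag_graph(user_tags):
    """Return (tag_counts, edge_weights) for the tag co-occurrence graph."""
    tag_counts = Counter()    # node value: number of users who have the tag
    edge_weights = Counter()  # edge value: number of users who have both tags
    for tags in user_tags.values():
        # Lowercase and deduplicate so each user counts once per tag.
        unique = sorted({t.lower() for t in tags})
        tag_counts.update(unique)
        # Every unordered pair of a user's tags adds 1 to that edge.
        edge_weights.update(combinations(unique, 2))
    return tag_counts, edge_weights


tag_counts, edge_weights = build_tag_graph(USER_TAGS)
```

Sorting the tags before pairing gives each undirected edge a single canonical key, for example ("java", "jvm") rather than both orderings.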

Data Modeling and Queries

Recommend tags for users:

Proportion of people who can answer "B" questions among people who can answer "A" questions = (weight of edge AB) / (number of people who have answered "A" questions) = similarity of "A" to "B"
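The similarity ratio can be sketched over the example graph's tag counts and edge weights. This is an illustration under assumptions: the numbers are the final state of the example graph with tag names lowercased, and the names similarity and recommend_tags are hypothetical.

```python
# Final state of the example graph: users per tag and users per tag pair.
TAG_COUNTS = {"java": 2, "jvm": 3, "spark": 1, "sql": 1}
EDGE_WEIGHTS = {
    ("java", "jvm"): 2,
    ("jvm", "spark"): 1,
    ("java", "sql"): 1,
    ("jvm", "sql"): 1,
}


def similarity(a, b):
    """Similarity of tag a to tag b: weight of edge ab / number of users with tag a."""
    # Edges are undirected, so look the pair up in either order.
    weight = EDGE_WEIGHTS.get((a, b)) or EDGE_WEIGHTS.get((b, a)) or 0
    return weight / TAG_COUNTS[a]


def recommend_tags(tag, top_n=3):
    """Neighbouring tags of `tag`, most similar first, dropping unconnected tags."""
    scores = {other: similarity(tag, other) for other in TAG_COUNTS if other != tag}
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(t, s) for t, s in ranked[:top_n] if s > 0]


java_recs = recommend_tags("java")
spark_recs = recommend_tags("spark")
```

The measure is asymmetric because the denominator is the count of the first tag: similarity("java", "jvm") is 2/2, while similarity("jvm", "java") is 2/3.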

Data Pipeline

Inputs: historical data (60 GB) and streaming data

Processing steps:
1. Compute how long each question takes to get an answer
2. Generate a random number based on the sampling fraction
3. Compute which types of questions each user has answered (constructing the graph)

Query steps:
1. Sample the data
2. Compute the probability
3. Search for neighbors

About Me

• Chentao (Sam) Zhang

• MS in Electrical & Computer Engineering from the University of Delaware

• Passionate about learning and trying new things

