Date posted: 22-Jan-2017
Category: Data & Analytics
Uploaded by: chentao-zhang
Improvement for StackOverflow.com
Chentao Zhang Insight Data Engineering SV
Input Data
Question: {"post_id":67172, "post_date":"6-10-2015-00-01-02", "type":0, "parent_id":0, "title":"Java Exception", "body":"....", "tags":"java;algorithm", "user_id":782, …}
Answer: {"post_id":67172, "post_date":"6-10-2015-00-01-23", "type":1, "parent_id":67172, "title":"", "body":"You should....", "tags":"", "user_id":1982, …}
Data Modeling and Queries ~ Elasticsearch
1. Index
• A collection of documents that have somewhat similar characteristics
• Corresponds to a 'database' in a relational database
2. Type
• A logical category/partition of your index whose semantics are completely up to you
• Corresponds to a 'table' in a relational database
3. Document
• A basic unit of information that can be indexed
• Corresponds to a 'row' in a relational database
Data Modeling and Queries
stackover/questions (index: stackover, type: questions, each row: one document):
question_id  tags   answer_time(sec)  posted_at            Random_sq
231          Java   3010              2016_01_02_21_20_01  11_10
290          spark  7381              2016_01_02_22_09_01  11_28
341          Java   5611              2016_01_10_01_02_05  11_31
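To make the index/type/document mapping concrete, here is a minimal sketch that turns the three example rows into a newline-delimited payload in the shape used by the Elasticsearch bulk API. The index and type names come from the slide. The field name `answer_time`, the use of `question_id` as the document `_id`, and the helper `to_bulk_lines` are assumptions made for illustration. Note that mapping types only exist in older Elasticsearch versions, so the `_type` key would need to be dropped on recent clusters.

```python
import json

# Index and type names as written on the slide ("stackover/questions").
INDEX_NAME = "stackover"
TYPE_NAME = "questions"

# The three example rows from the table; "answer_time" is an assumed field name.
rows = [
    {"question_id": 231, "tags": "Java", "answer_time": 3010,
     "posted_at": "2016_01_02_21_20_01", "Random_sq": "11_10"},
    {"question_id": 290, "tags": "spark", "answer_time": 7381,
     "posted_at": "2016_01_02_22_09_01", "Random_sq": "11_28"},
    {"question_id": 341, "tags": "Java", "answer_time": 5611,
     "posted_at": "2016_01_10_01_02_05", "Random_sq": "11_31"},
]


def to_bulk_lines(docs):
    """Build newline-delimited JSON: one action line plus one source line per document."""
    lines = []
    for doc in docs:
        # Assumption: the question id doubles as the document id.
        action = {"index": {"_index": INDEX_NAME, "_type": TYPE_NAME,
                            "_id": doc["question_id"]}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"


bulk_body = to_bulk_lines(rows)
print(bulk_body)
```

The sketch only builds the request body and does not contact a cluster, so it runs without an Elasticsearch installation.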
• Probability that a question with a specific tag (such as 'java') is answered within 10 minutes = (number of questions tagged 'java' and answered within 10 minutes) / (total number of questions tagged 'java')
• Stratified sampling, with strata defined by tag and by the month of posted_at
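The probability ratio for a tag can be sketched in a few lines of Python. The rows are the three example documents from the slides. The function name, the field names `tag` and `answer_time`, and the case-insensitive tag match are assumptions made for illustration.

```python
# The three example documents from the slides (field names are assumed).
questions = [
    {"question_id": 231, "tag": "Java", "answer_time": 3010},
    {"question_id": 290, "tag": "spark", "answer_time": 7381},
    {"question_id": 341, "tag": "Java", "answer_time": 5611},
]


def answered_within_probability(docs, tag, limit_sec=600):
    """Share of questions with `tag` that were answered within `limit_sec` seconds."""
    tagged = [d for d in docs if d["tag"].lower() == tag.lower()]
    if not tagged:
        return None  # no questions carry this tag, so the ratio is undefined
    fast = sum(1 for d in tagged if d["answer_time"] <= limit_sec)
    return fast / len(tagged)


# Neither 'java' example was answered within 10 minutes (600 s).
print(answered_within_probability(questions, "java"))        # 0.0
# With a one-hour limit, one of the two 'java' questions qualifies.
print(answered_within_probability(questions, "java", 3600))  # 0.5
```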
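Stratified sampling by tag and posting month can be sketched as follows. The slides appear to store a precomputed random value per document (the Random_sq column), but its encoding is not described, so this sketch swaps in a plain `random.sample` per stratum instead. The function name, the field names, and the choice to keep at least one document per stratum are assumptions.

```python
import random
from collections import defaultdict

# The three example documents from the slides (field names are assumed).
questions = [
    {"question_id": 231, "tag": "Java", "posted_at": "2016_01_02_21_20_01"},
    {"question_id": 290, "tag": "spark", "posted_at": "2016_01_02_22_09_01"},
    {"question_id": 341, "tag": "Java", "posted_at": "2016_01_10_01_02_05"},
]


def stratified_sample(docs, fraction, seed=0):
    """Sample roughly `fraction` of the documents from every (tag, month) stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for d in docs:
        month = d["posted_at"][:7]  # "2016_01_02_..." -> "2016_01"
        strata[(d["tag"].lower(), month)].append(d)

    sample = []
    for key in sorted(strata):
        members = strata[key]
        # Assumption: keep at least one document so no stratum disappears.
        k = max(1, round(fraction * len(members)))
        sample.extend(rng.sample(members, k))
    return sample


sample = stratified_sample(questions, fraction=0.5)
print([d["question_id"] for d in sample])
```

Because each stratum is sampled separately, rare tags stay represented in the sample even when the overall sampling fraction is small.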
Data Modeling and Queries
stackovergraph/userstags:
userid  tags
231     ["Java","hadoop"]
290     ["hadoop","Spark"]
341     ["java","sql","hadoop"]
Data Modeling and Queries
stackovergraph/userstags (building the tag graph):
userid  tags
231     ["Java","JVM"]
290     ["JVM","Spark"]
341     ["Java","sql","JVM"]

Tag graph after all three users have been processed (node = tag, num = number of users with that tag, edge weight = number of users with both tags):

Tag    num
Java   2
JVM    3
spark  1
sql    1

Edge         weight
Java - JVM   2
JVM - Spark  1
Java - Sql   1
Sql - JVM    1
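The graph construction walked through on these slides can be sketched in Python. The input is the three example users. The function name `build_tag_graph` and the lower-casing of tags (so that "Java" and "java" count as one node, as the slide's totals suggest) are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations

# The three example users from the slides.
user_tags = {
    231: ["Java", "JVM"],
    290: ["JVM", "Spark"],
    341: ["java", "sql", "JVM"],
}


def build_tag_graph(tags_by_user):
    """Return (node_counts, edge_weights) for the tag co-occurrence graph."""
    node_counts = Counter()   # tag -> number of users with that tag
    edge_weights = Counter()  # (tag_a, tag_b) -> number of users with both tags
    for tags in tags_by_user.values():
        # Assumption: tags are case-insensitive; duplicates per user count once.
        unique = sorted({t.lower() for t in tags})
        node_counts.update(unique)
        for a, b in combinations(unique, 2):
            edge_weights[(a, b)] += 1
    return node_counts, edge_weights


node_counts, edge_weights = build_tag_graph(user_tags)
print(dict(node_counts))   # {'java': 2, 'jvm': 3, 'spark': 1, 'sql': 1}
print(dict(edge_weights))
```

Edge keys are stored as alphabetically sorted pairs, so each undirected edge has exactly one entry.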
Data Modeling and Queries
Recommend tags for users:
Similarity of "A" to "B" = proportion of people who can answer "B" questions among the people who can answer "A" questions = weight of edge AB / number of people who have answered "A" questions
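The similarity ratio can be sketched directly from the node counts and edge weights of the example graph (Java 2, JVM 3, spark 1, sql 1). The function names, the lower-cased tag keys, and the ranking rule in `recommend` (score each candidate tag by its highest similarity from any tag the user already has) are assumptions, since the slides do not spell out how recommendations are ranked.

```python
# Node counts and edge weights of the example graph (tags lower-cased).
node_counts = {"java": 2, "jvm": 3, "spark": 1, "sql": 1}
edge_weights = {
    ("java", "jvm"): 2,
    ("jvm", "spark"): 1,
    ("java", "sql"): 1,
    ("jvm", "sql"): 1,
}


def similarity(a, b):
    """Similarity of tag a to tag b: weight of edge ab / number of users with tag a."""
    if node_counts.get(a, 0) == 0:
        return 0.0
    key = tuple(sorted((a, b)))  # edges are stored as sorted pairs
    return edge_weights.get(key, 0) / node_counts[a]


def recommend(known_tags, top_n=3):
    """Rank tags the user does not have yet by their best similarity score."""
    scores = {}
    for a in known_tags:
        for b in node_counts:
            if b in known_tags:
                continue
            s = similarity(a, b)
            if s > 0:
                scores[b] = max(scores.get(b, 0.0), s)
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


print(similarity("java", "jvm"))  # 2 / 2 = 1.0
print(similarity("jvm", "java"))  # 2 / 3
print(recommend({"java"}))        # [('jvm', 1.0), ('sql', 0.5)]
```

The measure is asymmetric: everyone who answers Java questions also answers JVM questions in the example, but only two of the three JVM answerers also answer Java questions.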
Data Pipeline
Historical data (60 GB)
Streaming data

1. Computing how long it takes each question to get an answer
2. Generating a random number based on the sampling fraction
3. Computing which types of questions each user has answered (constructing the graph)

1. Sampling data
2. Computing probabilities
3. Searching neighbors
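The first processing step, computing how long a question waits for an answer, can be sketched using the record layout from the Input Data slide. The date format string (month-day-year is a guess; the example works the same either way), the choice to measure the time to the earliest answer, and the function names are assumptions. The record values are copied from the slide, with the elided fields left out.

```python
from datetime import datetime

# Assumption: post_date is month-day-year-hour-minute-second.
DATE_FORMAT = "%m-%d-%Y-%H-%M-%S"

# Example records from the Input Data slide (type 0 = question, 1 = answer).
posts = [
    {"post_id": 67172, "post_date": "6-10-2015-00-01-02", "type": 0,
     "parent_id": 0, "tags": "java;algorithm", "user_id": 782},
    {"post_id": 67172, "post_date": "6-10-2015-00-01-23", "type": 1,
     "parent_id": 67172, "tags": "", "user_id": 1982},
]


def parse_post_date(text):
    return datetime.strptime(text, DATE_FORMAT)


def answer_times(records):
    """Map question id -> seconds between the question and its earliest answer."""
    asked = {r["post_id"]: parse_post_date(r["post_date"])
             for r in records if r["type"] == 0}
    first_answer = {}
    for r in records:
        if r["type"] != 1 or r["parent_id"] not in asked:
            continue
        answered = parse_post_date(r["post_date"])
        qid = r["parent_id"]
        if qid not in first_answer or answered < first_answer[qid]:
            first_answer[qid] = answered
    return {qid: (first_answer[qid] - asked[qid]).total_seconds()
            for qid in first_answer}


print(answer_times(posts))  # {67172: 21.0}
```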