On Statistical Characteristics of Real-life Knowledge Graphs Wenliang Cheng, Chengyu Wang, Bing Xiao, Weining Qian, Aoying Zhou Institute for Data Science and Engineering East China Normal University Shanghai, China 4/9/2015
Page 1: On Statistical Characteristics of Real-life Knowledge Graphsprof.ict.ac.cn/bpoe_6_vldb/wp-content/uploads/2015/... · –A multi-language knowledge base extracted from Wikipedia info-boxes

On Statistical Characteristics of Real-life Knowledge Graphs

Wenliang Cheng, Chengyu Wang, Bing Xiao, Weining Qian, Aoying Zhou

Institute for Data Science and Engineering

East China Normal University Shanghai, China

4/9/2015

Outline

• Introduction & Motivation
• Statistical Characteristics
• Data Description
• Empirical Studies
• Conclusion

What is a knowledge graph?

• Essential elements of a knowledge graph
  – Entities, and the relationships among them
• Entity
  – A person, location, organization, concept, etc.
• Relationship
  – A semantic relation between two entities
• Fact
  – A triple consisting of an entity, a relation, and an entity
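The entity/relation/fact vocabulary above can be illustrated with a minimal sketch. The `Fact` type, the helper functions, and the example facts are all hypothetical and invented for illustration; they are not taken from any of the datasets in the talk.

```python
from collections import namedtuple

# A fact is a (subject entity, relation, object entity) triple.
Fact = namedtuple("Fact", ["subject", "relation", "object"])

# Hypothetical facts, invented for illustration only.
facts = [
    Fact("Shanghai", "locatedIn", "China"),
    Fact("East China Normal University", "locatedIn", "Shanghai"),
    Fact("East China Normal University", "isA", "University"),
]


def entities(facts):
    """Return the set of entities that appear as a subject or an object."""
    return {f.subject for f in facts} | {f.object for f in facts}


def relations(facts):
    """Return the set of relation labels used by the facts."""
    return {f.relation for f in facts}
```

Viewed as a graph, each entity is a node and each fact is a directed, labeled edge from its subject to its object.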

Famous knowledge graphs

• Academic:
  – WordNet, YAGO, DBpedia, Probase, Freebase, Linked Open Data, etc.
• Industry:
  – Google Knowledge Graph/Vault
  – Microsoft Satori
  – Facebook Graph Search
  – Baidu Zhixin
  – IBM Watson
  – …

A knowledge graph can serve as the backbone of Web-scale applications such as search engines, question answering, and text understanding.

Knowledge graph management

• How can a large-scale knowledge graph be managed effectively and efficiently?
  – MySQL, Oracle, Neo4j, triple stores, etc.
• A knowledge graph differs from a social network
  – More semantic labels on both entities and relations
  – Topic- or domain-sensitive
  – Contains various kinds of knowledge
  – Hard to define a unified schema

Benchmarking a knowledge graph

• A benchmark for knowledge graph management is required
• Understanding real-life knowledge graph data is the first step, and it informs the design of a synthetic data generator
• For comparison, we also need to analyze the data distributions of social networks

Evaluate the graphs

• We evaluate 4 kinds of real-life knowledge graphs and 2 synthetic social networks via 13 statistical metrics and 4 distributions
• We conducted a series of in-depth analyses of the evaluation results

Large-scale graph properties

• Previous research on analyzing the structural properties of large-scale graphs
  – [Broder et al. Comput. Netw. 2000] studied the structure of the web as a graph via a series of metrics, e.g. degree, diameter, and components.
  – [Kumar et al. KDD, 2006] studied the structural properties of dynamic social networks, e.g. degree and hops.
  – [Boccaletti et al. Phys. Rep. 2006] surveyed studies of the structure and dynamics of complex networks.

Thirteen statistical metrics

Four kinds of distributions

• Distribution of degrees
  – In-degree and out-degree
  – Power-law distribution
• Distribution of hops
  – Reflects the connectivity cost inside a graph
• Distribution of connected components
  – Strongly and weakly connected components
  – Reflects the connectivity of a graph
• Distribution of clustering coefficients
  – Measures the nodes’ tendency to cluster together
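As a concrete reading of the first item, the in-degree and out-degree distributions of a directed graph can be tabulated as in the sketch below. The edge-list input format, the function name, and the toy graph are assumptions made for illustration; the slides do not say how the authors computed these distributions.

```python
from collections import Counter


def degree_distributions(edges):
    """Return (in_dist, out_dist) for a directed graph.

    `edges` is an iterable of (source, target) pairs. Each returned
    Counter maps a degree value to the number of nodes with that degree.
    """
    edges = list(edges)
    nodes = {u for u, _ in edges} | {v for _, v in edges}
    out_deg = Counter(u for u, _ in edges)
    in_deg = Counter(v for _, v in edges)
    # A Counter returns 0 for missing keys, so nodes without incoming
    # (or outgoing) edges are counted under degree 0.
    in_dist = Counter(in_deg[n] for n in nodes)
    out_dist = Counter(out_deg[n] for n in nodes)
    return in_dist, out_dist


# Hypothetical toy graph, for illustration only.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("d", "c")]
in_dist, out_dist = degree_distributions(edges)
```

Plotting the degree value against the node count on log-log axes is the usual way to check by eye for the power-law shape mentioned above.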

Four knowledge graphs

• WordNet
  – A lexical network for the English language
  – Synonym sets as nodes and semantic relations as edges
  – 98,000 entities, 154,000 relationships

Four knowledge graphs

• YAGO2
  – A huge semantic knowledge graph based on WordNet, Wikipedia, and GeoNames
  – 10+ million entities, 120+ million facts
• YAGO2 is separated into three sub-graphs
  – YagoTax: the taxonomy tree of YAGO2
  – YagoFact: the facts in YAGO2
  – YagoWiki: the hyperlink relations in YAGO2, based on Wikipedia

Four knowledge graphs

• DBpedia
  – A multi-language knowledge base extracted from Wikipedia info-boxes
  – English version of DBpedia
  – 4.58 million things and 2,795 different kinds of properties
• Enterprise Knowledge Graph (EKG)
  – Describes an enterprise ontology in Chinese
  – Domain-specific knowledge graph
  – 9,450 entities and 12,100 relations

Two social networks

• SNRand
  – 0.2 million randomly selected users
  – 5 million follow relations between users
• SNRank
  – The 0.2 million most active users
  – 36+ million follow relations between users
• The raw data was collected from Sina Weibo, a well-known social media platform in China

Empirical studies

• Analysis of statistical metrics
  – Comparison between different parts within a knowledge graph, taking YAGO2 as a case study
  – Comparison between different knowledge graphs
  – Comparison between knowledge graphs and social networks
• Analysis of distributions
  – Six distributions
• Analysis of semantic labels’ relatedness

Analysis for statistical metrics

Analysis for distributions

• Node degree distributions

All the in-degree distributions follow a power law, except for some initial segments that deviate from it.
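One rough way to quantify such a power-law tail is a least-squares line fit on log-log axes, sketched below. This is not necessarily the procedure the authors used: the function name, the `min_degree` cut-off for skipping the deviating initial segment, and the synthetic test distribution are all assumptions. Least squares on log-log data is also known to be a crude estimator.

```python
import math


def fit_power_law(dist, min_degree=1):
    """Fit count ~ C * degree^(-alpha) by least squares in log-log space.

    `dist` maps a degree value to the number of nodes with that degree.
    Degrees below `min_degree` are skipped (an assumed cut-off).
    Returns (alpha, C).
    """
    pts = [(math.log(k), math.log(n)) for k, n in dist.items()
           if k >= min_degree and k > 0 and n > 0]
    if len(pts) < 2:
        raise ValueError("need at least two points to fit a line")
    mean_x = sum(x for x, _ in pts) / len(pts)
    mean_y = sum(y for _, y in pts) / len(pts)
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return -slope, math.exp(intercept)


# Synthetic distribution n(k) = 1000 * k^(-2), for illustration only.
dist = {k: 1000.0 * k ** -2 for k in range(1, 50)}
alpha, c = fit_power_law(dist)
```

On real data, raising `min_degree` past the initial segment changes the estimated exponent, so the cut-off should be reported alongside the fit.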

Analysis for distributions

• Hop distributions

They are all S-shaped and can be fitted by a sigmoid-like function:

f(x) = a / (1 + e^(b - c*x))
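A minimal sketch of how a cumulative hop distribution might be computed, together with the sigmoid-like function above. It assumes the distribution counts ordered node pairs reachable within h hops along edge direction and runs a breadth-first search from every node; the slides do not spell out the exact definition, and large graphs would usually need sampling or an approximation instead. The function names and the toy path graph are hypothetical.

```python
import math
from collections import defaultdict, deque


def hop_plot(edges):
    """Cumulative hop distribution of a directed graph.

    result[h] is the number of ordered pairs (u, v), u != v, such that
    v is reachable from u in at most h hops.
    """
    adj = defaultdict(list)
    nodes = set()
    for u, v in edges:
        adj[u].append(v)
        nodes.update((u, v))
    exact = defaultdict(int)  # exact[h] = pairs at distance exactly h
    for src in nodes:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj.get(u, ()):
                if v not in dist:
                    dist[v] = dist[u] + 1
                    exact[dist[v]] += 1
                    queue.append(v)
    cumulative, total = {}, 0
    for h in range(1, max(exact, default=0) + 1):
        total += exact[h]
        cumulative[h] = total
    return cumulative


def sigmoid(x, a, b, c):
    """The sigmoid-like function from the slide: a / (1 + e^(b - c*x))."""
    return a / (1.0 + math.exp(b - c * x))


# Hypothetical toy graph: a directed path a -> b -> c -> d.
plot = hop_plot([("a", "b"), ("b", "c"), ("c", "d")])
```

In this form, a is the plateau the curve approaches and x = b/c is the midpoint where f reaches a/2. Estimating a, b, and c from a measured hop plot would need a nonlinear least-squares routine, which is not shown here.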

Analysis for distributions

• Hop distributions

The maximum number of hops differs between the different parts of a knowledge graph.

Analysis for distributions

• Hop distributions

The maximum number of hops also differs between knowledge graphs and social networks.

Analysis for distributions

• Connected component distributions

Both the strongly and weakly connected component distributions of knowledge graphs generally follow a power law, whereas each social network consists almost entirely of a single strongly connected component.
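The two kinds of component sizes can be computed as in the sketch below. Union-find for weakly connected components and Kosaraju's algorithm for strongly connected components are textbook choices picked here for illustration; the slides do not say which algorithms the authors used. The function names and the toy graph are hypothetical.

```python
from collections import Counter, defaultdict


def weak_component_sizes(edges):
    """Sizes of weakly connected components (edge direction ignored)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v in edges:
        parent[find(u)] = find(v)
    roots = Counter(find(n) for n in list(parent))
    return sorted(roots.values(), reverse=True)


def strong_component_sizes(edges):
    """Sizes of strongly connected components (iterative Kosaraju)."""
    fwd, rev, nodes = defaultdict(list), defaultdict(list), set()
    for u, v in edges:
        fwd[u].append(v)
        rev[v].append(u)
        nodes.update((u, v))

    # Pass 1: record nodes in order of DFS finishing time.
    visited, order = set(), []
    for start in nodes:
        if start in visited:
            continue
        visited.add(start)
        stack = [(start, iter(fwd.get(start, ())))]
        while stack:
            node, neighbours = stack[-1]
            for nxt in neighbours:
                if nxt not in visited:
                    visited.add(nxt)
                    stack.append((nxt, iter(fwd.get(nxt, ()))))
                    break
            else:  # all neighbours explored: the node is finished
                order.append(node)
                stack.pop()

    # Pass 2: DFS on the reversed graph, latest finisher first.
    assigned, sizes = set(), []
    for start in reversed(order):
        if start in assigned:
            continue
        assigned.add(start)
        stack, size = [start], 0
        while stack:
            node = stack.pop()
            size += 1
            for nxt in rev.get(node, ()):
                if nxt not in assigned:
                    assigned.add(nxt)
                    stack.append(nxt)
        sizes.append(size)
    return sorted(sizes, reverse=True)


# Hypothetical toy graph: a 3-cycle with a tail, plus a separate edge.
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d"), ("e", "f")]
wcc = weak_component_sizes(edges)
scc = strong_component_sizes(edges)
```

Applying `Counter` to either list gives the size distribution, that is, how many components exist of each size.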

Analysis for distributions

• Clustering coefficient distributions

Node degree in this experiment is the sum of in-degree and out-degree.
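A sketch of how clustering coefficients might be grouped by node degree under this definition. It assumes the local clustering coefficient is computed on the undirected projection of the graph, which the slides do not state, while degree is counted as in-degree plus out-degree as described above. The function name and the toy graph are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def clustering_by_degree(edges):
    """Return (by_degree, acc) for a directed edge list.

    by_degree maps a degree (in-degree + out-degree) to the mean local
    clustering coefficient of the nodes with that degree. acc is the
    average clustering coefficient over all nodes.
    """
    neighbours = defaultdict(set)  # undirected projection (an assumption)
    degree = defaultdict(int)      # in-degree + out-degree
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        neighbours[u].add(v)
        neighbours[v].add(u)
        degree[u] += 1
        degree[v] += 1

    local = {}
    for node, nbrs in neighbours.items():
        k = len(nbrs)
        if k < 2:
            local[node] = 0.0
            continue
        # Count the pairs of neighbours that are themselves connected.
        links = sum(1 for a, b in combinations(nbrs, 2)
                    if b in neighbours[a])
        local[node] = 2.0 * links / (k * (k - 1))

    grouped = defaultdict(list)
    for node, coeff in local.items():
        grouped[degree[node]].append(coeff)
    by_degree = {d: sum(cs) / len(cs) for d, cs in grouped.items()}
    acc = sum(local.values()) / len(local) if local else 0.0
    return by_degree, acc


# Hypothetical toy graph: a triangle a, b, c with a pendant node d.
by_degree, acc = clustering_by_degree(
    [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")]
)
```

A scatter of `by_degree` on log-log axes gives the kind of degree-versus-coefficient view this slide describes.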

Analysis for distributions

• Clustering coefficient distributions

Although the points in the scatter plot are dispersed, their overall trend follows a power law.

Analysis for distributions

• Clustering coefficient distributions

The average clustering coefficients (ACCs) of social networks are generally higher than those of knowledge graphs.

Analysis for labels’ relatedness

The semantic labels are indeed topic-related.

Conclusions

• Different parts of a knowledge graph differ in certain statistical characteristics
• Different knowledge graphs differ in several statistical characteristics, and their data distributions also differ
• Knowledge graphs differ from social networks in many ways

Discussions on benchmarking

• The data generator should generate synthetic knowledge graph data covering different aspects
• The generator should take the semantic labels in knowledge graphs into consideration and preserve the statistical characteristics of the real-life data
• The generator should generate not only static synthetic data for a knowledge graph, but also the different stages of the knowledge graph’s development

Thanks!
