On Statistical Characteristics of Real-life Knowledge Graphs
Wenliang Cheng, Chengyu Wang, Bing Xiao, Weining Qian, Aoying Zhou
Institute for Data Science and Engineering
East China Normal University, Shanghai, China
4/9/2015
Outline
• Introduction & Motivation
• Statistical Characteristics
• Data Description
• Empirical Studies
• Conclusion
What is a knowledge graph?
• Essential elements of a knowledge graph
  – Entities, and the relationships among them
• Entity
  – Person, location, organization, concept, etc.
• Relationship
  – Semantic relation between two entities
• Fact
  – A triple of an entity, a relation, and an entity
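The entity/relationship/fact vocabulary above can be made concrete with a small sketch. The `Fact` type, the helper functions, and the toy facts are hypothetical illustrations and are not taken from any of the knowledge graphs discussed in the talk.

```python
from collections import namedtuple

# A fact is a (subject entity, relation, object entity) triple.
Fact = namedtuple("Fact", ["subject", "relation", "object"])

# Hypothetical toy facts, for illustration only.
facts = [
    Fact("Shanghai", "locatedIn", "China"),
    Fact("East China Normal University", "locatedIn", "Shanghai"),
    Fact("East China Normal University", "isA", "University"),
]

def entities(facts):
    """Return the set of entities that appear as subject or object."""
    return {f.subject for f in facts} | {f.object for f in facts}

def relations(facts):
    """Return the set of distinct relation labels."""
    return {f.relation for f in facts}
```

Viewed as a graph, each entity is a node and each fact is a directed edge labeled with its relation.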
Famous knowledge graphs
• Academic:
  – WordNet, YAGO, DBpedia, Probase, Freebase, Linked Open Data, etc.
• Industry:
  – Google Knowledge Graph/Vault
  – Microsoft Satori
  – Facebook Graph Search
  – Baidu Zhixin
  – IBM Watson
  – …
Knowledge graph management
A knowledge graph can serve as the backbone of Web-scale applications, such as search engines, question answering, and text understanding.
• How can a large-scale knowledge graph be managed effectively and efficiently?
  – MySQL, Oracle, Neo4j, triple stores, etc.
• A knowledge graph differs from a social network
  – More semantic labels on both entities and relations
  – Topic- or domain-sensitive
  – Contains various kinds of knowledge
  – Hard to define a unified schema
Benchmarking a knowledge graph
• A benchmark for knowledge graph management is required
• Understanding real-life knowledge graph data is the first step, and it informs the design of a synthetic data generator
• For comparison, we also need to analyze the data distributions of social networks
Evaluate the graphs
• We evaluate 4 kinds of real-life knowledge graphs and 2 social networks via 13 statistical metrics and 4 kinds of distributions
• We conduct a series of in-depth analyses of the evaluation results
Large-scale graph properties
• Previous research on the structural properties of large-scale graphs
  – [Broder et al. Comput. Netw. 2000] studied the Web structure as a graph via a series of metrics, e.g. degree, diameter, and components.
  – [Kumar et al. KDD 2006] studied the structural properties of dynamic social networks, e.g. degree and hops.
  – [Boccaletti et al. Phys. Rep. 2006] surveyed studies of the structure and dynamics of complex networks.
Thirteen statistical metrics
Four kinds of distributions
• Distribution of degrees
  – In-degree and out-degree
  – Power-law distribution
• Distribution of hops
  – Reflects the connectivity cost inside a graph
• Distribution of connected components
  – Strongly and weakly connected components
  – Reflects the connectivity of a graph
• Distribution of clustering coefficients
  – Measures the nodes’ tendency to cluster together
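As an illustration of the hop distribution listed above, the following sketch counts, for each hop count h, how many ordered node pairs are separated by a shortest directed path of exactly h hops. It assumes the graph is given as a list of directed `(source, target)` pairs; the function names are hypothetical, and the talk does not state how its hop distributions were computed. Running a breadth-first search from every node is exact but only practical for small graphs, so graphs at the scale of YAGO2 or DBpedia would need sampling or approximation.

```python
from collections import deque, defaultdict

def hop_distribution(edges):
    """Return {h: number of ordered pairs (u, v) whose shortest path has h hops}."""
    adj = defaultdict(list)
    nodes = set()
    for u, v in edges:
        adj[u].append(v)
        nodes.update((u, v))
    counts = defaultdict(int)
    for src in nodes:
        # Plain breadth-first search from src along edge direction.
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    counts[dist[v]] += 1
                    queue.append(v)
    return dict(counts)

def cumulative(counts):
    """Return [(h, number of pairs reachable within h hops)] for h = 1..max."""
    total, out = 0, []
    for h in range(1, max(counts) + 1):
        total += counts.get(h, 0)
        out.append((h, total))
    return out
```

The cumulative form is the curve that typically rises and then flattens once most reachable pairs have been counted.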
Four knowledge graphs
• WordNet
  – A lexical network for the English language
  – Synonym set as node and semantic relation as edge
  – 98,000 entities, 154,000 relationships
Four knowledge graphs
• YAGO2
  – A huge semantic knowledge graph based on WordNet, Wikipedia and GeoNames
  – 10+ million entities, 120+ million facts
• YAGO2 is separated into three sub-graphs
  – YagoTax: taxonomy tree of YAGO2
  – YagoFact: facts in YAGO2
  – YagoWiki: hyperlink relations in YAGO2 based on Wikipedia
Four knowledge graphs
• DBpedia
  – A multi-language knowledge base extracted from Wikipedia infoboxes
  – English version of DBpedia
  – 4.58 million things and 2,795 different kinds of properties
• Enterprise Knowledge Graph (EKG)
  – Describes an enterprise ontology in Chinese
  – Domain-specific knowledge graph
  – 9,450 entities and 12,100 relations
Two social networks
• SNRand
  – 0.2 million randomly selected users
  – 5 million follow relations between users
• SNRank
  – 0.2 million most active users
  – 36+ million follow relations between users
• The raw data was collected from Sina Weibo, a well-known social media platform in China
Empirical studies
• Analysis of statistical metrics
  – Comparison between different parts within a knowledge graph, taking YAGO2 as a case study
  – Comparison between different knowledge graphs
  – Comparison between knowledge graphs and social networks
• Analysis of distributions
  – Six distributions
• Analysis of the relatedness of semantic labels
Analysis for statistical metrics
Analysis for distributions
• Node degree distributions
All the in-degree distributions follow a power law, except for some initial segments that deviate from it.
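One rough way to check for power-law behavior is to plot the in-degree distribution on log-log axes and look at the slope of a straight-line fit. The sketch below assumes a directed edge list of `(source, target)` pairs and uses an ordinary least-squares fit in log-log space; both function names are hypothetical, and this is not claimed to be the fitting procedure behind the results in the talk. A least-squares slope is only a quick diagnostic, and it is sensitive to exactly the kind of deviating initial segment noted above.

```python
import math
from collections import Counter

def in_degree_distribution(edges):
    """Return {k: number of nodes with in-degree k} for k >= 1."""
    indeg = Counter(v for _, v in edges)
    return dict(Counter(indeg.values()))

def loglog_slope(dist):
    """Least-squares slope of log(count) against log(degree)."""
    xs = [math.log(k) for k in dist]
    ys = [math.log(c) for c in dist.values()]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx
```

For a distribution of the form count ∝ k^(-γ), the slope returned by `loglog_slope` approximates -γ.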
Analysis for distributions
• Hop distributions
They are all “S”-shaped and can be fitted by a sigmoid-like function:
f(x) = a / (1 + e^(b - c*x))
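To make the roles of a, b and c concrete: a is the plateau that f(x) approaches for large x, and f reaches a/2 at x = b/c. The sketch below evaluates the function and, assuming the plateau a is already known (for example, the total number of reachable pairs), estimates b and c from the linearized form ln(a/f − 1) = b − c·x. The function names and this fitting shortcut are illustrative assumptions; the talk does not say how its fits were obtained.

```python
import math

def sigmoid(x, a, b, c):
    """f(x) = a / (1 + e^(b - c*x))."""
    return a / (1.0 + math.exp(b - c * x))

def fit_b_c(points, a):
    """Estimate (b, c) for a fixed plateau a from (x, f) points.

    Uses ln(a/f - 1) = b - c*x, so only points with 0 < f < a are usable.
    """
    xs, ys = [], []
    for x, f in points:
        if 0 < f < a:
            xs.append(x)
            ys.append(math.log(a / f - 1.0))
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    return intercept, -slope  # (b, c)
```

A general nonlinear least-squares routine would fit a, b and c jointly; the linearized version is shown only because it needs nothing beyond the standard library.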
Analysis for distributions
• Hop distributions
The maximum number of hops differs between different parts of a knowledge graph.
Analysis for distributions
• Hop distributions
The maximum number of hops also differs between knowledge graphs and social networks.
Analysis for distributions
• Connected component distributions
In general, both the strongly and weakly connected component distributions of knowledge graphs follow a power law, whereas each social network is almost entirely a single strongly connected component.
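A component size distribution can be computed as sketched below for the weakly connected case, where edge direction is ignored. The sketch assumes a list of `(source, target)` pairs and uses a union-find structure; the function names are hypothetical. Strongly connected components require a different algorithm (for example Tarjan's or Kosaraju's) and are not covered here.

```python
from collections import Counter

def weak_component_sizes(edges):
    """Return the sizes of weakly connected components, largest first."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv  # merge the two components
    sizes = Counter(find(x) for x in list(parent))
    return sorted(sizes.values(), reverse=True)

def size_distribution(sizes):
    """Return {component size: number of components of that size}."""
    return dict(Counter(sizes))
```

A graph that is "nearly one component" shows up as a single very large entry in `weak_component_sizes` followed by a few small ones.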
Analysis for distributions
• Clustering coefficient distributions
Node degree in this experiment is the sum of in-degree and out-degree.
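For reference, a local clustering coefficient can be computed as in the following sketch. It makes an assumption not stated in the talk: edge direction is ignored and the coefficient is based on each node's distinct neighbors, which can differ from the in-degree plus out-degree count when two nodes link to each other in both directions. The function names are hypothetical.

```python
from collections import defaultdict

def clustering_coefficients(edges):
    """Local clustering coefficient per node, ignoring edge direction."""
    nbrs = defaultdict(set)
    for u, v in edges:
        if u != v:  # skip self-loops
            nbrs[u].add(v)
            nbrs[v].add(u)
    coeff = {}
    for node, ns in nbrs.items():
        k = len(ns)
        if k < 2:
            coeff[node] = 0.0
            continue
        # Each edge among the neighbors is seen twice, once from each endpoint.
        links = sum(1 for a in ns for b in nbrs[a] if b in ns) // 2
        coeff[node] = 2.0 * links / (k * (k - 1))
    return coeff

def average_clustering(coeff):
    """Mean of the local clustering coefficients over all nodes."""
    return sum(coeff.values()) / len(coeff)
```

Grouping the coefficients by node degree gives the kind of degree-versus-coefficient scatter plot that a clustering coefficient distribution is usually drawn from.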
Analysis for distributions
• Clustering coefficient distributions
Although the points in the scatter diagram are dispersed, their overall trend follows a power law.
Analysis for distributions
• Clustering coefficient distributions
The average clustering coefficients (ACCs) of social networks are generally higher than those of knowledge graphs.
Analysis for labels’ relatedness
The semantic labels are indeed topic-related.
Conclusions
• Different parts of a knowledge graph differ in certain statistical characteristics
• Different knowledge graphs differ in several statistical characteristics, and their data distributions also differ
• Knowledge graphs differ from social networks in many ways
Discussions on benchmarking
• The data generator should generate synthetic data covering different aspects of a knowledge graph
• The generator should take the semantic labels in knowledge graphs into consideration and preserve the statistical characteristics of the real-life data
• The generator should generate not only static synthetic data of a knowledge graph, but also the different stages of a knowledge graph’s development
Thanks!