Chapter 2 Graphs
First develop some of the basic ideas behind graph theory, then look at some fundamental applications
2.1 Basic Definitions
Graphs: Nodes and Edges
A graph is a way to specify relationships among a collection of items
A graph consists of a set of objects, called nodes
Certain pairs of these objects are connected by links called edges
E.g., the graph in Fig. 2.1(a) consists of 4 nodes, labeled A, B, C, D
B is connected to each of the other 3 nodes by edges
C and D are connected by an edge too
Two nodes are neighbors if they’re connected by an edge
In Fig. 2.1(a), we think of the relationship between the 2 ends of an edge as being symmetric
The edge simply connects them to each other
But often we want to express asymmetric relationships
Define a directed graph to consist of a set of nodes together with a set of directed edges, each a link from one node to another
The direction being important—see Fig. 2.1(b)
To emphasize that a graph isn’t directed, call it an undirected graph
But generally a graph is assumed undirected unless otherwise noted
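A minimal plain-Python sketch of the two definitions, not tied to any library: the undirected graph of Fig. 2.1(a) stored as a dictionary of neighbor sets, next to a directed variant. The helper names and the directed edge list are illustrative assumptions (the text does not list the edges of Fig. 2.1(b)).

```python
from collections import defaultdict

def build_undirected(edges):
    """Adjacency sets for an undirected graph: each edge is stored in both directions."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return dict(adj)

def build_directed(edges):
    """Adjacency sets for a directed graph: an edge (u, v) is stored only as u -> v."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v]  # touch v so that nodes with no outgoing edges still appear
    return dict(adj)

# Fig. 2.1(a) as described in the text: B is joined to A, C, D; C and D are joined too.
fig_a = build_undirected([("A", "B"), ("B", "C"), ("B", "D"), ("C", "D")])

# Hypothetical directed edges, for illustration only.
digraph = build_directed([("A", "B"), ("B", "C"), ("C", "A")])

print(sorted(fig_a["B"]))    # neighbors of B
print("A" in digraph["B"])   # is B -> A an edge? (A -> B is)
```

In the undirected version every edge appears in both neighbor sets, which is exactly the symmetry described above; the directed version records only the direction given.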
Graphs as Models of Networks
Graphs serve as mathematical models of network structures
For a real example, Fig. 2.2 depicts the network structure of the Internet (then called the Arpanet) in Dec. 1970, when it had only 13 sites
Nodes represent computing hosts
There’s an edge joining 2 nodes if there’s a direct communication link between them
Ignoring the superimposed US map and the blow-up circles in MA and CA, the rest depicts this 13-node graph using the dots-and-lines style of Fig. 2.1
For showing the pattern of connections, the actual placement of the nodes is immaterial; all that matters is which nodes link to which
Fig. 2.3 is a different drawing of the same 13-node Arpanet graph
Graphs are useful whenever we want to represent how things are either physically or logically linked to one another in a network structure
The 13-node Arpanet is an example of a communication network
In Chapter 1, we saw examples from 2 other broad classes of graph structures
Social networks
Nodes are people or groups of people
Edges represent some kind of social interaction
Information networks
Nodes are info resources (e.g., Web pages or documents)
Edges represent logical connections such as hyperlinks, citations, or cross-references
Fig. 2.4: a few further examples
The depictions of airline and subway systems in (a) and (b) are examples of transportation networks
Nodes are destinations and edges represent direct connections
The prerequisites among college courses in (c) are an example of a dependency network
Nodes are tasks and directed edges indicate that one task must be performed before another
The Tank Street Bridge from Brisbane, Australia in (d) is an example of a structural network
Joints are nodes and physical linkages are edges
2.2 Paths and Connectivity
Paths
A path is a sequence of nodes with the property that each consecutive pair in the sequence is connected by an edge
E.g., the sequence of nodes MIT, BBN, RAND, UCLA is a path in the Internet graph from Figs. 2.2 and 2.3
Another path is the sequence CASE, LINCOLN, MIT, UTAH, SRI, UCSB
A path can repeat nodes, e.g., SRI, STAN, UCLA, SRI, UTAH, MIT
But most paths we consider won’t do this
A path without repeat nodes is a simple path
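The path definitions can be checked mechanically. A small sketch, assuming a partial edge list that contains only the edges implied by the three example paths above (the full 13-node figure has more); the function names are illustrative.

```python
def is_path(adj, seq):
    """True if every consecutive pair of nodes in seq is joined by an edge."""
    return all(v in adj.get(u, ()) for u, v in zip(seq, seq[1:]))

def is_simple_path(adj, seq):
    """True if seq is a path that repeats no node."""
    return is_path(adj, seq) and len(set(seq)) == len(seq)

# Partial edge list: only the edges implied by the three example paths in the text.
edges = [("MIT", "BBN"), ("BBN", "RAND"), ("RAND", "UCLA"),
         ("CASE", "LINCOLN"), ("LINCOLN", "MIT"), ("MIT", "UTAH"),
         ("UTAH", "SRI"), ("SRI", "UCSB"),
         ("SRI", "STAN"), ("STAN", "UCLA"), ("UCLA", "SRI")]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

p1 = ["MIT", "BBN", "RAND", "UCLA"]
p2 = ["SRI", "STAN", "UCLA", "SRI", "UTAH", "MIT"]
print(is_path(adj, p1), is_simple_path(adj, p1))
print(is_path(adj, p2), is_simple_path(adj, p2))
```

The second sequence is a valid path but repeats SRI, so it is not simple.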
Cycles
A particularly important kind of non-simple path is a cycle,
a path with at least 3 edges, in which the 1st and last nodes are the same, but otherwise all nodes are distinct
Many cycles in Fig. 2.3, e.g.,
SRI, STAN, UCLA, SRI (as short as possible: it has 3 edges)
SRI, STAN, UCLA, RAND, BBN, MIT, UTAH, SRI
Every edge in the 1970 Arpanet belongs to a cycle
If any edge were to fail, there’d still be a way to get from any node to any other node
More generally, cycles in communication and transportation networks allow for redundancy—provide for alternate routings
In the social network of friendships, cycles are common
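The redundancy argument can be phrased as a test: an edge lies on a cycle exactly when its two ends stay connected after the edge is removed. A sketch under stated assumptions: the edge list holds only the 8 edges of the two example cycles above (not the full Arpanet graph), and node "X" is invented to show a link with no alternate routing.

```python
def reachable(adj, start, skip_edge=None):
    """Nodes reachable from start, optionally pretending one undirected edge is missing."""
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if skip_edge is not None and {u, v} == set(skip_edge):
                continue
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def edge_on_cycle(adj, u, v):
    """An edge (u, v) lies on a cycle iff v is still reachable from u without it."""
    return v in reachable(adj, u, skip_edge=(u, v))

def build(edges):
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj

# Only the edges of the two cycles listed above, not the full 13-node graph.
ring_edges = [("SRI", "STAN"), ("STAN", "UCLA"), ("UCLA", "SRI"),
              ("UCLA", "RAND"), ("RAND", "BBN"), ("BBN", "MIT"),
              ("MIT", "UTAH"), ("UTAH", "SRI")]
adj = build(ring_edges)
print(all(edge_on_cycle(adj, a, b) for a, b in ring_edges))

# A hypothetical dead-end link (node "X" is invented) has no alternate route.
adj_x = build(ring_edges + [("MIT", "X")])
print(edge_on_cycle(adj_x, "MIT", "X"))
```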
NetworkX: Cycles
cycle_basis(G) returns a list of cycles that form a basis for cycles of G
Each cycle list is a list of nodes forming a cycle; cyclic permutations are not included
>>> import networkx as nx
>>> G = nx.barbell_graph(4,2)
>>> nx.cycle_basis(G)
[[1, 3, 0], [2, 3, 0], [8, 7, 6], [9, 7, 6], [8, 9, 6], [1, 2, 0]]
G must be a Graph—can’t be a DiGraph
A basis for cycles of a network is a minimal collection of cycles s.t. any cycle in the network can be written as a sum of cycles in the basis
Here summation of cycles is defined as “exclusive or” of the edges
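To make the "exclusive or" sum concrete, here is a plain-Python sketch (not NetworkX code) that adds two of the basis cycles from the output above by taking the symmetric difference of their edge sets. The helper name cycle_edges is an assumption for illustration.

```python
def cycle_edges(nodes):
    """Edge set of the cycle visiting nodes in order and returning to the first node."""
    closed = list(nodes) + [nodes[0]]
    return {frozenset(pair) for pair in zip(closed, closed[1:])}

# Two of the basis cycles from the output above (node lists without the repeated endpoint).
c1 = cycle_edges([1, 3, 0])
c2 = cycle_edges([2, 3, 0])

# "Sum" of cycles = symmetric difference: edges shared by both cycles cancel out.
combined = c1 ^ c2

print(sorted(tuple(sorted(e)) for e in combined))
```

The shared edge between nodes 0 and 3 cancels, leaving the 4-edge cycle 0, 1, 3, 2, 0, which is itself a cycle of the graph but need not appear in the basis.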
simple_cycles(DG) returns the simple cycles (elementary circuits) of a directed graph
DG must be a DiGraph
A simple cycle is a closed path where no node appears twice, except the 1st and last are the same
Two elementary circuits are distinct if they aren’t cyclic permutations of each other
>>> DG1 = nx.DiGraph([(0,1), (1,3), (3,0), (3,2), (2,0)])
>>> nx.simple_cycles(DG1)
[[0, 1, 3, 0], [0, 1, 3, 2, 0]]
Connectivity
A graph is connected if there’s a path between every pair of nodes
E.g., the 13-node Arpanet graph is connected
We expect most communication and transportation networks to (try to) be connected
Their goal is to move traffic from one node to another
But there’s no a priori reason to expect graphs in other settings to be connected
Figs. 2.5 and 2.6 show disconnected graphs
Fig. 2.6 is the collaboration graph of the biological research center Structural Genomics of Pathogenic Protozoa
Nodes represent researchers
There’s an edge between 2 nodes if the researchers co-authored a publication
Components
In Fig. 2.5, the graph consists of 3 “pieces”:
one consisting of nodes A and B,
one consisting of nodes C, D, and E, and
one consisting of the rest of the nodes
The network in Fig. 2.6 also consists of 3 pieces: one on 3 nodes, one on 4 nodes, and one that’s much larger
A connected component (or just component) of a graph is a subset of the nodes s.t.
(i) every node in the subset has a path to every other, and
(ii) the subset isn't part of some larger set with the property that every node can reach every other
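Conditions (i) and (ii) suggest a direct way to compute components: start at any undiscovered node, collect everything reachable from it, and repeat. A plain-Python sketch; the graph below is hypothetical (three pieces, with invented edges inside each piece) and is not a transcription of Fig. 2.5.

```python
from collections import deque

def connected_components(adj):
    """Return the components of an undirected graph as a list of node sets."""
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        # Grow one component outward from an undiscovered node.
        component, queue = {start}, deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in component:
                    component.add(v)
                    queue.append(v)
        seen |= component
        components.append(component)
    return components

# Hypothetical graph with three pieces; the edges inside each piece are invented.
edges = [("A", "B"), ("C", "D"), ("D", "E"), ("F", "G"), ("G", "H"), ("H", "F")]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

for comp in connected_components(adj):
    print(sorted(comp))
```

Each returned set satisfies (i) because the search only follows edges, and (ii) because the search does not stop until no further node can be reached.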
Dividing a graph into its components is just a first, global way of describing its structure
Within a given component, there may be richer internal structure that’s important to our interpretation of the network
E.g., in the largest component in Fig. 2.6, there’s a prominent node at the center, and tightly-knit groups linked to this node but not to each other
This component would break into 3 distinct components if this node were removed
Analyzing a graph this way (its densely-connected regions and the boundaries between them) is a powerful way of thinking about network structure—cf. Chap. 3
Giant Components
Consider the social network of the entire world, with a link between 2 people if they’re friends
This global friendship network probably isn’t connected—consider, e.g., un-contacted tribes
But the component you inhabit probably contains a significant fraction of the world’s population.
This is true for a range of network datasets—large, complex networks often have a giant component
This is a deliberately informal term for a connected component containing a significant fraction of all the nodes
When a network contains a giant component, it almost always contains only one
E.g., if the global friendship network had 2 giant components, all it would take is a meeting between one representative of each to combine them
This in fact happened with the discovery of America—with dramatic consequences
The notion of giant components is useful for reasoning about networks on much smaller scales as well
See the collaboration network in Fig. 2.6
Another example is Fig. 2.7: the romantic relationships in an American high school over an 18-month period
Not all edges were present at once
The fact that this graph contains such a large component is significant regarding the spread of STDs
The researchers noted that, “like social facts, [these structures] are invisible yet consequential macrostructures that arise as the product of individual agency.”
NetworkX: Subgraphs
A subgraph of a graph G is a graph
whose vertex set is a subset of that of G, and
whose adjacency relation is a subset of that of G restricted to this subset
Graph.subgraph(nbunch) returns the subgraph induced on the nodes in nbunch
I.e., the nodes in nbunch and the edges between them
The graph, edge, or node attributes just point to the original graph
So changes to the node or edge structure won’t be reflected in the original graph
But changes to the attributes will
To create a subgraph with its own copy of the edge/node attributes use
nx.Graph(G.subgraph(nbunch))
If edge attributes are containers, get a deep copy using
G.subgraph(nbunch).copy()
For an in-place reduction of a graph to a subgraph, remove nodes
G.remove_nodes_from([n for n in G if n not in set(nbunch)])
The following all have the same description as Graph.subgraph()
DiGraph.subgraph(nbunch)
MultiGraph.subgraph(nbunch)
MultiDiGraph.subgraph(nbunch)
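What "induced on the nodes in nbunch" means can be spelled out without NetworkX. A sketch, assuming a hand-built stand-in for a barbell graph (two 4-cliques on nodes 0-3 and 6-9 joined by the path 3-4-5-6); the function names are illustrative and this is not how NetworkX implements subgraph().

```python
def induced_subgraph(adj, nbunch):
    """Keep the nodes in nbunch and only those edges with both ends in nbunch."""
    keep = set(nbunch)
    return {u: adj[u] & keep for u in adj if u in keep}

def complete_edges(nodes):
    """All pairs of distinct nodes, i.e. the edges of a complete graph."""
    nodes = list(nodes)
    return [(u, v) for i, u in enumerate(nodes) for v in nodes[i + 1:]]

# Assumed layout: two 4-cliques joined by the path 3-4-5-6.
edges = complete_edges(range(0, 4)) + [(3, 4), (4, 5), (5, 6)] + complete_edges(range(6, 10))
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

sub = induced_subgraph(adj, [0, 1, 2, 3, 6, 7, 8, 9])
num_edges = sum(len(nbrs) for nbrs in sub.values()) // 2
print(sorted(sub), num_edges)
```

Dropping nodes 4 and 5 also drops the three path edges, since each of them has at least one end outside the kept set.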
Make a barbell without the handle
>>> G = nx.barbell_graph(4,2)
>>> G1 = G.subgraph([0,1,2,3,6,7,8,9])
NetworkX: Connected Components
In an undirected graph G, vertices u and v are connected if G contains a path from u to v
A graph is connected if every pair of vertices in it is connected
A connected component is a maximal connected subgraph of G
Each vertex belongs to exactly 1 connected component, as does each edge
A directed graph is weakly connected if replacing all of its directed edges with undirected edges produces a connected (undirected) graph
It’s unilaterally connected (also called semiconnected) if it contains a directed path from u to v or a directed path from v to u for every pair of vertices u, v
It’s strongly connected if it contains a directed path from u to v and a directed path from v to u for every pair of vertices u, v
The weakly connected components are the maximal weakly connected subgraphs
The strongly connected components are the maximal strongly connected subgraphs
The condensation of a directed graph is the graph with each of the strongly connected components contracted into a single node
For a Graph G
is_connected(G) tests G’s connectivity
number_connected_components(G) returns the number of connected components in G
connected_components(G) returns a list of the connected components of G, each a list of nodes
connected_component_subgraphs(G) returns a list of the connected components of G as subgraphs
node_connected_component(G, n) returns a list of the nodes in the connected component of G containing node n
>>> G1 = nx.complete_graph(3)
>>> G2 = nx.complete_graph(2)
>>> GG = nx.disjoint_union(G1, G2)
>>> GG.edges()
[(0, 1), (0, 2), (1, 2), (3, 4)]
>>> nx.connected_components(GG)
[[0, 1, 2], [3, 4]]
>>> H1, H2 = nx.connected_component_subgraphs(GG)
>>> H1.edges()
[(0, 1), (0, 2), (1, 2)]
>>> H2.edges()
[(3, 4)]
>>> nx.node_connected_component(GG, 2)
[0, 1, 2]
For a DiGraph G
is_strongly_connected(G) tests G for strong connectivity
number_strongly_connected_components(G) returns the number of strongly connected components in G
strongly_connected_components(G) returns a list of the strongly connected components of G, each a list of nodes
strongly_connected_component_subgraphs(G) returns a list of the strongly connected components of G as subgraphs
condensation(G, scc) returns the condensation of G
scc is a list of strongly connected components—cf. strongly_connected_components()
The resulting graph is a directed acyclic graph
Node labels are the indices of the components in the list of strongly connected components
is_weakly_connected(G) tests G for weak connectivity
number_weakly_connected_components(G) returns the number of weakly connected components in G
weakly_connected_components(G) returns a list of the weakly connected components of G, each a list of nodes
weakly_connected_component_subgraphs(G) returns a list of the weakly connected components of G as subgraphs
>>> DG = nx.DiGraph([(0,1),(1,2),(2,0),(3,4)])
>>> nx.weakly_connected_components(DG)
[[0, 1, 2], [3, 4]]
>>> scc = nx.strongly_connected_components(DG)
>>> scc
[[0, 1, 2], [4], [3]]
>>> DGS = nx.strongly_connected_component_subgraphs(DG)[0]
>>> DGS.edges()
[(0, 1), (1, 2), (2, 0)]
>>> DGcon = nx.condensation(DG, scc)
>>> DGcon.edges()
[(2, 1)]
>>> DGcon.nodes()
[0, 1, 2]
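The strongly connected components above can be cross-checked from the definition: two nodes share a component when each can reach the other. A plain-Python sketch on the same edge list; it is a simple quadratic-time illustration, not the algorithm NetworkX uses.

```python
def reach(adj, start):
    """Set of nodes reachable from start by following directed edges."""
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def strongly_connected_components(nodes, edges):
    """Group nodes that can reach each other in both directions."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
    reach_from = {n: reach(adj, n) for n in nodes}
    components, assigned = [], set()
    for n in nodes:
        if n in assigned:
            continue
        # m belongs with n if n reaches m and m reaches n.
        comp = {m for m in reach_from[n] if n in reach_from[m]}
        components.append(comp)
        assigned |= comp
    return components

edges = [(0, 1), (1, 2), (2, 0), (3, 4)]
sccs = strongly_connected_components(range(5), edges)
print(sccs)
```

Nodes 3 and 4 end up in separate components because 3 reaches 4 but not the other way round, which is also why the condensation has a single edge between them.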
Cliques
A clique in an undirected graph G = (V, E) is a subset of the vertex set C ⊆ V s.t. every 2 vertices in C are connected by an edge
Equivalently, the subgraph induced by C is complete
Sometimes the term “clique” also refers to the subgraph
A maximal clique is a clique that can’t be extended by including 1 more adjacent vertex
i.e., a clique that doesn’t exist exclusively within the vertex set of a larger clique
A maximum clique is a clique of the largest possible size in G
The clique number of G is the number of nodes in a maximum clique of G
Finding a maximum clique is NP-hard (the decision version, asking whether a clique of a given size exists, is NP-complete)
In the following, G may be a Graph, DiGraph, MultiGraph, or MultiDiGraph
find_cliques(G) returns a generator of maximal cliques in G as node lists
graph_clique_number(G) returns the clique number for G
graph_number_of_cliques(G) returns the number of maximal cliques in G
>>> F = nx.barbell_graph(4,1)
>>> for cl in nx.find_cliques(F):
...     print(cl)
...
[8, 5, 6, 7]
[3, 0, 1, 2]
[3, 4]
[5, 4]
>>> nx.graph_number_of_cliques(F)
4
>>> nx.graph_clique_number(F)
4
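For a graph this small the clique counts can be verified by brute force, which also shows the difference between maximal and maximum. A sketch, assuming a hand-built stand-in for barbell_graph(4, 1) (4-cliques on nodes 0-3 and 5-8, bridged through node 4); the exhaustive search is for illustration only and is exponential in the number of nodes.

```python
from itertools import combinations

def is_clique(adj, nodes):
    """True if every pair of the given nodes is joined by an edge."""
    return all(v in adj[u] for u, v in combinations(nodes, 2))

def maximal_cliques(adj):
    """Brute force over all node subsets; only sensible for tiny graphs."""
    nodes = sorted(adj)
    cliques = [set(c) for r in range(1, len(nodes) + 1)
               for c in combinations(nodes, r) if is_clique(adj, c)]
    # A clique is maximal if no other clique strictly contains it.
    return [c for c in cliques if not any(c < other for other in cliques)]

def complete_edges(nodes):
    """All pairs of distinct nodes, i.e. the edges of a complete graph."""
    nodes = list(nodes)
    return [(u, v) for i, u in enumerate(nodes) for v in nodes[i + 1:]]

# Assumed layout: 4-cliques on 0-3 and 5-8, bridged via node 4.
edges = complete_edges(range(0, 4)) + [(3, 4), (4, 5)] + complete_edges(range(5, 9))
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

found = maximal_cliques(adj)
print(sorted(sorted(c) for c in found))
print(max(len(c) for c in found))   # clique number
```

The two bridge edges are maximal cliques of size 2 (neither can be extended) but not maximum ones.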
2.3 Distance and Breadth-First Search
Beyond asking whether 2 nodes are connected by a path, ask how long such a path is
The length of a path is the number of edges in the sequence that comprises it
E.g., the path MIT, BBN, RAND, UCLA in Fig. 2.3 has length 3
The path MIT, UTAH has length 1
The distance between 2 nodes is the length of the shortest path between them
E.g., the distance between LINC and SRI is 3
Check that there’s no length-1 or length-2 path between them
Breadth-First Search
First declare all of your actual friends to be at distance 1
Then find all of their friends (not counting people already friends of yours), declare these to be at distance 2
Then find all of their friends (not counting people already found at distances 1 and 2), declare these to be at distance 3
Continuing in this way, search in successive layers, each representing the next distance out
Each new layer is built from all those nodes that have not already been discovered in earlier layers, and have an edge to some node in the previous layer
Figure 2.8
Figure 2.9. How to discover all distances from the node MIT in the 13-node Arpanet graph from Figure 2.3
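The layer-by-layer procedure translates almost line for line into code. A plain-Python sketch; the friendship graph is hypothetical (names and edges are invented) and is not the graph of Figure 2.8 or 2.9.

```python
def bfs_layers(adj, source):
    """Return a list of layers; layer k holds the nodes at distance k from source."""
    layers, seen = [[source]], {source}
    while True:
        next_layer = []
        for u in layers[-1]:
            for v in sorted(adj[u]):
                if v not in seen:          # not discovered in an earlier layer
                    seen.add(v)
                    next_layer.append(v)
        if not next_layer:
            return layers
        layers.append(next_layer)

# Hypothetical friendship graph; names and edges are invented for illustration.
edges = [("you", "ann"), ("you", "bob"), ("ann", "cat"), ("bob", "cat"),
         ("bob", "dan"), ("cat", "eve"), ("dan", "eve"), ("eve", "fay")]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

layers = bfs_layers(adj, "you")
for distance, layer in enumerate(layers):
    print(distance, layer)
```

The index of the layer in which a node first appears is its distance from the source, so one search yields all distances from that node.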
NetworkX: Depth First Search
Various algorithms giving the result of a depth-first search (DFS) on a graph
The source argument (where the traversal begins) is optional (defaulting to node 0 or whichever is listed first) but generally included
dfs_edges(G, source) returns a generator that produces edges in a DFS
dfs_tree(G, source) returns a directed tree (a DiGraph) of a DFS
dfs_predecessors(G, source) returns a dictionary of predecessors in a DFS
dfs_successors(G, source) returns a dictionary of successors in a DFS
dfs_preorder_nodes(G, source) returns a generator producing nodes in a DFS pre-ordering
dfs_postorder_nodes(G, source) returns a generator producing nodes in a DFS post-ordering
dfs_labeled_edges(G, source) returns a generator that produces edges in a DFS labeled by direction (‘dir’) type (‘forward’, ‘reverse’, ‘nontree’)
>>> G = nx.krackhardt_kite_graph()
>>> list(nx.dfs_edges(G,0))
[(0, 1), (1, 3), (3, 2), (2, 5), (5, 6), (6, 4), (6, 7), (7, 8), (8, 9)]
>>> list(nx.dfs_edges(G,9))
[(9, 8), (8, 7), (7, 5), (5, 0), (0, 1), (1, 3), (3, 2), (3, 4), (4, 6)]
>>> list(nx.dfs_edges(G))
[(0, 1), (1, 3), (3, 2), (2, 5), (5, 6), (6, 4), (6, 7), (7, 8), (8, 9)]
>>> tree = nx.dfs_tree(G, 9)
>>> tree
<networkx.classes.digraph.DiGraph object at 0x0217CF10>
>>> tree.succ
{0: {1: {}}, 1: {3: {}}, 2: {}, 3: {2: {}, 4: {}}, 4: {6: {}}, 5: {0: {}}, 6: {}, 7: {5: {}}, 8: {7: {}}, 9: {8: {}}}
>>> nx.dfs_successors(G, 9)
{0: [1], 1: [3], 3: [2, 4], 4: [6], 5: [0], 7: [5], 8: [7], 9: [8]}
>>> nx.dfs_predecessors(G, 9)
{0: 5, 1: 0, 2: 3, 3: 1, 4: 3, 5: 7, 6: 4, 7: 8, 8: 9}
>>> list(nx.dfs_preorder_nodes(G, 9))
[9, 8, 7, 5, 0, 1, 3, 2, 4, 6]
>>> list(nx.dfs_postorder_nodes(G, 9))
[2, 6, 4, 3, 1, 0, 5, 7, 8, 9]
NetworkX: Breadth First Search
Various algorithms that give the result of a breadth-first search (BFS) on a graph
The source argument (where the traversal begins) again is optional (defaulting to the node listed first) but generally included
bfs_edges(G, source) returns a generator that produces edges in a BFS
bfs_tree(G, source) returns a directed tree of a BFS
bfs_predecessors(G, source) returns a dictionary of predecessors in a BFS
bfs_successors(G, source) returns a dictionary of successors in a BFS
>>> list(nx.bfs_edges(G, 0))
[(0, 1), (0, 2), (0, 3), (0, 5), (1, 4), (1, 6), (5, 7), (7, 8), (8, 9)]
>>> list(nx.bfs_edges(G, 9))
[(9, 8), (8, 7), (7, 5), (7, 6), (5, 0), (5, 2), (5, 3), (6, 1), (6, 4)]
>>> tree = nx.bfs_tree(G, 9)
>>> tree.succ
{0: {}, 1: {}, 2: {}, 3: {}, 4: {}, 5: {0: {}, 2: {}, 3: {}},
6: {1: {}, 4: {}}, 7: {5: {}, 6: {}}, 8: {7: {}}, 9: {8: {}}}
>>> nx.bfs_predecessors(G, 9)
{0: 5, 1: 6, 2: 5, 3: 5, 4: 6, 5: 7, 6: 7, 7: 8, 8: 9}
>>> nx.bfs_successors(G, 9)
{8: [7], 9: [8], 5: [0, 2, 3], 6: [1, 4], 7: [5, 6]}
NetworkX: Shortest Path
These algorithms work for undirected and directed graphs
Return an arbitrary shortest path when there is more than 1 shortest path between 2 nodes
Those that search for a path between 2 particular nodes raise a NetworkXNoPath exception if there is no such path
has_path(G, source, target) returns True if G has a path from source to target; otherwise, False is returned
shortest_path(G, source=None, target=None, weight=None) computes shortest paths in G
If the source and target are both specified, return a single list of nodes in a shortest path
If only the source is specified, return a dictionary keyed by targets with a list of nodes in a shortest path
If neither the source nor the target is specified, return path, a dictionary of dictionaries where
path[source][target] is the list of nodes in the source-to-target path
weight, if None (default), causes every edge to have weight/distance 1
If weight is a string, it’s the edge attribute to use as the edge weight
Any edge attribute not present defaults to 1
shortest_path_length(G, source=None, target=None, weight=None) computes shortest path lengths in G
source, target, weight are as with shortest_path()
If the source and target are both specified, return a single number for the shortest path
If only the source is specified, return a dictionary keyed by targets with the shortest path lengths as values
If neither the source nor the target is specified, return length, a dictionary of dictionaries where length[source][target] is the length of a shortest path from source to target
average_shortest_path_length(G, weight=None) returns the average shortest path length over all pairs of distinct nodes of G
weight is as before
The average shortest path length a is
$a = \sum_{s, t \in V} \frac{d(s, t)}{n(n - 1)}$
where
V is the set of nodes in G,
d(s, t) is the length of a shortest path from s to t, and
n is the number of nodes in G
>>> import networkx as nx
>>> DG = nx.DiGraph()
>>> DG.add_weighted_edges_from([(0,1,1.0), (1,0,1.0),
(1,3,1.0), (3,4,1.0), (4,2,1.0), (2,1,1.0),
(1,4,3.0), (4,1,3.0), (5,4,1.0)])
>>> nx.has_path(DG,5,0)
True
>>> nx.has_path(DG,0,5)
False
>>> d = nx.shortest_path(DG)
>>> d
{0: {0: [0], 1: [0, 1], 2: [0, 1, 4, 2], 3: [0, 1, 3], 4: [0, 1, 4]},
1: {0: [1, 0], 1: [1], 2: [1, 4, 2], 3: [1, 3], 4: [1, 4]},
2: {0: [2, 1, 0], 1: [2, 1], 2: [2], 3: [2, 1, 3], 4: [2, 1, 4]},
3: {0: [3, 4, 1, 0], 1: [3, 4, 1], 2: [3, 4, 2], 3: [3], 4: [3, 4]},
4: {0: [4, 1, 0], 1: [4, 1], 2: [4, 2], 3: [4, 1, 3], 4: [4]},
5: {0: [5, 4, 1, 0], 1: [5, 4, 1], 2: [5, 4, 2], 3: [5, 4, 1, 3],
4: [5, 4], 5: [5]}}
>>> d[0]
{0: [0], 1: [0, 1], 2: [0, 1, 4, 2], 3: [0, 1, 3], 4: [0, 1, 4]}
>>> dw = nx.shortest_path(DG, weight='weight')
>>> dw[0]
{0: [0], 1: [0, 1], 2: [0, 1, 3, 4, 2], 3: [0, 1, 3], 4: [0, 1, 3, 4]}
>>> nx.shortest_path(DG, source=0, weight='weight')
{0: [0], 1: [0, 1], 2: [0, 1, 3, 4, 2], 3: [0, 1, 3], 4: [0, 1, 3, 4]}
>>> nx.shortest_path(DG, source=0, target=4, weight='weight')
[0, 1, 3, 4]
>>> nx.shortest_path(DG, source=0, target=5)
Traceback (most recent call last):
…
networkx.exception.NetworkXNoPath: No path between 0 and 5.
>>> dw_len = nx.shortest_path_length(DG, weight='weight')
>>> dw_len[0]
{0: 0, 1: 1.0, 2: 4.0, 3: 2.0, 4: 3.0}
>>> dw_len[0][4]
3.0
>>> nx.shortest_path_length(DG, source=0, weight='weight')
{0: 0, 1: 1.0, 2: 4.0, 3: 2.0, 4: 3.0}
>>> nx.shortest_path_length(DG, source=0, target=4, weight='weight')
3.0
>>> nx.average_shortest_path_length(DG, weight='weight')
1.9333333333333333
>>> dw_len = nx.shortest_path_length(DG, weight='weight')
>>> count = len_sum = 0
>>> for lens in dw_len.values():
...     count += len(lens)
...     len_sum += sum(lens.values())
...
>>> print(count, len_sum)
31 58.0
>>> len_sum / count
1.8709677419354838
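The two averages above share the same numerator (58.0) and differ only in the denominator: 58/30 uses n(n - 1) = 30 ordered pairs of distinct nodes, while the loop divides by the 31 dictionary entries, which include the six zero-length entries from each node to itself and omit the five pairs ending at node 5, which no other node can reach. A plain-Python check using a standalone Dijkstra sketch (not NetworkX's implementation; the function name is an assumption):

```python
import heapq

def dijkstra_lengths(adj, source):
    """Weighted shortest-path lengths from source to every node it can reach."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue                      # stale queue entry
        for v, w in adj.get(u, {}).items():
            if d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist

# Same weighted directed edges as the DG example above.
weighted_edges = [(0, 1, 1.0), (1, 0, 1.0), (1, 3, 1.0), (3, 4, 1.0), (4, 2, 1.0),
                  (2, 1, 1.0), (1, 4, 3.0), (4, 1, 3.0), (5, 4, 1.0)]
adj = {}
for u, v, w in weighted_edges:
    adj.setdefault(u, {})[v] = w
n = 6  # nodes 0..5

total = sum(sum(dijkstra_lengths(adj, s).values()) for s in range(n))
entries = sum(len(dijkstra_lengths(adj, s)) for s in range(n))

print(total)                  # sum of all finite shortest-path lengths
print(total / (n * (n - 1)))  # divide by the number of ordered pairs of distinct nodes
print(total / entries)        # divide by the number of dictionary entries instead
```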
Advanced Shortest Path
These are often more specific than the functions listed above and often provide the implementations for those functions
Generally return results in the now familiar nested dictionary format
A function without ‘dijkstra’ in its name ignores weights and other edge data
A function with ‘dijkstra’ in its name by default considers the values of edge ‘weight’ attributes
To consider different edge data, set the ‘weight’ keyword argument to that attribute
Most of these functions have a keyword argument cutoff
Can be set to stop the search at the given depth
Paths of length greater than the cutoff are ignored
single_source_shortest_path(G, source, cutoff=None) computes shortest path from source to all nodes reachable from it
single_source_shortest_path_length(G, source, cutoff=None) computes shortest path lengths from source to all reachable nodes
all_pairs_shortest_path(G, cutoff=None) computes shortest paths between all nodes
all_pairs_shortest_path_length(G, cutoff=None) computes the shortest path lengths between all nodes
>>> nx.single_source_shortest_path(DG, 0)
{0: [0], 1: [0, 1], 2: [0, 1, 4, 2], 3: [0, 1, 3],
4: [0, 1, 4]}
>>> nx.single_source_shortest_path(DG, 0, cutoff=2)
{0: [0], 1: [0, 1], 3: [0, 1, 3], 4: [0, 1, 4]}
>>> nx.single_source_shortest_path_length(DG, 0)
{0: 0, 1: 1, 2: 3, 3: 2, 4: 2}
>>> nx.all_pairs_shortest_path_length(DG)
{0: {0: 0, 1: 1, 2: 3, 3: 2, 4: 2},
1: {0: 1, 1: 0, 2: 2, 3: 1, 4: 1},
2: {0: 2, 1: 1, 2: 0, 3: 2, 4: 2},
3: {0: 3, 1: 2, 2: 2, 3: 0, 4: 1},
4: {0: 2, 1: 1, 2: 1, 3: 2, 4: 0},
5: {0: 3, 1: 2, 2: 2, 3: 3, 4: 1, 5: 0}}
dijkstra_path(G, source, target, weight='weight') returns the shortest path from source to target in a weighted graph
dijkstra_path_length(G, source, target, weight='weight') returns the shortest path length from source to target in a weighted graph
single_source_dijkstra_path(G, source, cutoff=None, weight='weight') computes the shortest paths between source and all other reachable nodes for a weighted graph
single_source_dijkstra_path_length(G, source, cutoff=None, weight='weight') computes the lengths of the shortest paths between source and all other reachable nodes for a weighted graph
all_pairs_dijkstra_path(G, cutoff=None, weight='weight') computes shortest paths between all nodes in a weighted graph
all_pairs_dijkstra_path_length(G, cutoff=None, weight='weight') computes the lengths of the shortest paths between all nodes in a weighted graph
single_source_dijkstra(G, source, target=None, cutoff=None, weight='weight') computes the shortest paths and their lengths in a weighted graph
Returns a tuple of 2 dictionaries keyed by node,
1st for distances from the source
2nd for the paths from the source to that node
>>> nx.dijkstra_path(DG,0,4)
[0, 1, 3, 4]
>>> ls, ps = nx.single_source_dijkstra(DG, 0)
>>> ls
{0: 0, 1: 1.0, 2: 4.0, 3: 2.0, 4: 3.0}
>>> ps
{0: [0], 1: [0, 1], 2: [0, 1, 3, 4, 2], 3: [0, 1, 3], 4: [0, 1, 3, 4]}
floyd_warshall(G, weight='weight') finds all-pairs shortest path lengths using Floyd’s algorithm
Floyd’s algorithm is appropriate for finding shortest paths in dense graphs or graphs with negative weights when Dijkstra’s algorithm fails
This algorithm can still fail if there are negative cycles
It has running time O(n^3) and uses O(n^2) space
>>> fw = nx.floyd_warshall(DG)
>>> fw[0]
defaultdict(<function <lambda> at 0x01E0E5B0>, {0: 0, 1: 1.0, 2: 4.0, 3: 2.0, 4: 3.0, 5: inf})
>>> fw[5]
defaultdict(<function <lambda> at 0x01FC1230>, {0: 4.0, 1: 3.0, 2: 2.0, 3: 4.0, 4: 1.0, 5: 0})
>>> fw[5][0]
4.0
Note: defaultdict is a dict subclass that calls a factory function to supply missing values
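Where the O(n^3) time and O(n^2) space come from is easiest to see in code: an n-by-n table of distances updated inside three nested loops. A plain-Python sketch on the same weighted edges as the DG example; it illustrates the idea and is not NetworkX's implementation.

```python
def floyd_warshall(nodes, weighted_edges):
    """All-pairs shortest path lengths; unreachable pairs keep the value infinity."""
    inf = float("inf")
    # n-by-n table: 0 on the diagonal, infinity elsewhere, then direct edge weights.
    dist = {u: {v: (0 if u == v else inf) for v in nodes} for u in nodes}
    for u, v, w in weighted_edges:
        dist[u][v] = min(dist[u][v], w)
    # Allow each node k in turn as an intermediate stop: three nested loops.
    for k in nodes:
        for i in nodes:
            for j in nodes:
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist

weighted_edges = [(0, 1, 1.0), (1, 0, 1.0), (1, 3, 1.0), (3, 4, 1.0), (4, 2, 1.0),
                  (2, 1, 1.0), (1, 4, 3.0), (4, 1, 3.0), (5, 4, 1.0)]
fw = floyd_warshall(range(6), weighted_edges)
print(fw[5][0], fw[0][5])
```

Node 5 has no incoming edges, so every distance to it from another node stays at infinity, matching the inf entry in fw[0] above.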
astar_path(G, source, target, heuristic=None, weight='weight') returns a list of nodes in a shortest path between source and target using the A* (“A-star”) algorithm
heuristic is a function to estimate the distance from a node to target
To guarantee a shortest path, this function should never overestimate this distance
The function takes 2 node arguments and must return a number
>>> G=nx.grid_graph(dim=[3,3])
>>> def dist(a, b):
...     (x1, y1) = a
...     (x2, y2) = b
...     return ((x1 - x2) ** 2 + (y1 - y2) ** 2) ** 0.5
...
>>> nx.astar_path(G,(0,0),(2,2),dist)
[(0, 0), (0, 1), (1, 1), (1, 2), (2, 2)]
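How the heuristic steers the search can be sketched in plain Python. The version below assumes every edge costs 1 and a heuristic that never overestimates (straight-line distance on a grid qualifies); the grid helper and function names are illustrative, and this is a simplified sketch rather than NetworkX's astar_path.

```python
import heapq
from itertools import count

def astar_path(neighbors, source, target, heuristic):
    """A* search on an unweighted graph (every edge costs 1)."""
    tie = count()                      # tie-breaker so the heap never compares nodes
    heap = [(heuristic(source, target), next(tie), source, 0, None)]
    parent = {}                        # settled node -> predecessor on the path
    best_g = {source: 0}
    while heap:
        _, _, node, g, prev = heapq.heappop(heap)
        if node in parent:
            continue                   # node was already settled earlier
        parent[node] = prev
        if node == target:
            path = [node]
            while parent[path[-1]] is not None:
                path.append(parent[path[-1]])
            return path[::-1]
        for nxt in neighbors(node):
            new_g = g + 1
            if nxt not in parent and new_g < best_g.get(nxt, float("inf")):
                best_g[nxt] = new_g
                # Priority = cost so far + estimated cost still to go.
                heapq.heappush(heap, (new_g + heuristic(nxt, target), next(tie), nxt, new_g, node))
    return None

def grid_neighbors(size):
    """Neighbor function for a size-by-size grid with 4-way moves."""
    def neighbors(node):
        x, y = node
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            if 0 <= x + dx < size and 0 <= y + dy < size:
                yield (x + dx, y + dy)
    return neighbors

def dist(a, b):
    (x1, y1), (x2, y2) = a, b
    return ((x1 - x2) ** 2 + (y1 - y2) ** 2) ** 0.5

path = astar_path(grid_neighbors(3), (0, 0), (2, 2), dist)
print(len(path) - 1)   # number of edges on the path found
```

Several corner-to-corner paths have the same length, so which one is returned depends on tie-breaking; only the length is guaranteed.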
The Small-World Phenomenon
Go back to our thought experiments on the global friendship network
The argument explaining why you belong to a giant component asserts something stronger:
Not only do you have paths of friends connecting you to a large fraction of the world’s population,
but these paths are surprisingly short
E.g., consider a friend from another country (thence to his friends and relatives, etc.)
This is the small-world phenomenon:
the idea that the world looks “small” when you think of how short a path of friends it takes to get from you to almost anyone else
Also known as the 6 degrees of separation
Title of a play by John Guare
One line is:
“I read somewhere that everybody on this planet is separated by only six other people. Six degrees of separation between us and everyone else on this planet.”
The first experimental study of this notion (and the origin of “6”) was by Stanley Milgram and his colleagues in the 1960s
With a small budget, he tested the idea that people are really connected in the global friendship network by short chains of friends
Asked 296 randomly chosen “starters” to try forwarding a letter to a “target” person, a stockbroker living in a suburb of Boston
The starters were each given some personal info about the target (including address and occupation)
Asked to forward the letter to someone they knew on a first-name basis, with the same instructions, so as to reach the target as quickly as possible
Formed chains of people that closed in on the stockbroker
Figure 2.10: the distribution of path lengths, among the 64 chains that reached the target
The median length was 6
It’s striking that so many letters reached their destination and by such short paths
Some caveats about the experiment
It clearly doesn't establish a statement as bold as “six degrees of separation between us and everyone else on this planet”
The paths were just to a single, fairly affluent target
Many letters never got there
Attempts to recreate the experiment have been problematic due to lack of participation
We can ask how useful these short paths really are
Milgram himself in his original paper:
If we think of each person as the center of their own social “world,” then “six short steps” becomes “six worlds apart”
Makes 6 sound like a much larger number
Still, the overall conclusion has been accepted in a broad sense:
Social networks tend to have very short paths between essentially arbitrary pairs of people
The existence of all these short paths has substantial consequences for the potential speed with which info, diseases, and other kinds of contagion can spread
Cf. also the potential access that the social network provides to opportunities and to people with very different characteristics from our own
See Chapter 20, a more detailed study of the small-world phenomenon and its consequences
Instant Messaging, Paul Erdös, and Kevin Bacon
That social networks generally are “small worlds” has been increasingly confirmed in settings where we do have full data on the network structure
Milgram resorted to an experiment where letters served as “tracers” through a global friendship network
He had no hope of fully mapping the network on his own
But where the full graph structure is known,
we load it into a computer and perform breadth-first search to determine what typical distances look like
One of the largest such computational studies was by Leskovec and Horvitz
Analyzed the 240 million active user accounts on Microsoft Instant Messenger
They were employed by Microsoft at the time and had access to a complete snapshot of the system for the month under study
Built a graph where each node corresponds to a user
There’s an edge between two users if they engaged in a two-way conversation at any point during a month-long observation period
The graph had a giant component containing almost all of the nodes
The distances within this giant component were very small
An estimated average distance of 6.6
An estimated median of 7
Figure 2.11: The distribution of distances averaged over a random sample of 1000 users:
Breadth-first search was done separately from each of these 1000 users
The results from these 1000 nodes were combined to produce the plot in Figure 2.11
The graph was so large that doing breadth-first search from every single node would have taken an astronomical amount of time
Producing plots like this efficiently for massive graphs is an interesting research topic in itself
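The sampling idea is simple to sketch: run breadth-first search from a random subset of sources only and average over the distances found. This is a toy illustration under stated assumptions (a 10-node ring invented as a stand-in for a huge network, and a hypothetical function name), not the method used in the study.

```python
import random
from collections import deque

def bfs_distances(adj, source):
    """Hop distances from source to every reachable node."""
    dist, queue = {source: 0}, deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def sampled_average_distance(adj, num_sources, seed=0):
    """Estimate the average distance by running BFS from a random sample of sources only."""
    rng = random.Random(seed)
    sources = rng.sample(sorted(adj), num_sources)
    total = pairs = 0
    for s in sources:
        for t, d in bfs_distances(adj, s).items():
            if t != s:
                total += d
                pairs += 1
    return total / pairs

# Toy stand-in for a huge network: a ring of 10 nodes (invented for illustration).
n = 10
adj = {i: {(i - 1) % n, (i + 1) % n} for i in range(n)}
estimate = sampled_average_distance(adj, num_sources=3)
print(estimate)
```

On a ring every node sees the same distances, so the sample average equals the exact average; on a real network the estimate carries sampling error that shrinks as more sources are used.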
Figure 2.11 approximates what Milgram was after: the distribution of how far apart we all are in the full global friendship network
Reconciling the structure of such massive datasets with the underlying networks we’re trying to measure is an issue arising here and many times later
Here we’re still some distance from Milgram's goal
We only track people who are technologically-endowed enough to have access to instant messaging
Rather than basing the graph on who is truly friends with whom, we observe only who talks to whom during an observation period
Turning to a smaller scale (magnitude 10^5 rather than 10^8 people), researchers have also discovered very short paths in the collaboration networks within professional communities
E.g., in mathematics, Erdös (published c. 1500 papers) is a central figure in the collaborative structure of the field
Define a collaboration graph (as in Figure 2.6) with
nodes for mathematicians and
edges connecting pairs who have jointly authored a paper
Figure 2.12: A small hand-drawn piece of the collaboration graph, with paths leading to Paul Erdös
A mathematician's Erdös number is the distance from them to Erdös
Most mathematicians have Erdös numbers of at most 4 or 5
Extending the collaboration graph to co-authorship across all the sciences, most scientists in other fields have Erdös numbers only slightly (if at all) larger
Einstein (2), Fermi (3), Chomsky (4), Pauling (4), Crick (5), Watson (6)
Three students at Albright College in PA around 1994 adapted the idea of Erdös numbers to the collaboration graph of movie actors
Nodes are performers
An edge connects 2 performers if they've appeared together in a movie
A performer's Bacon number is their distance in this graph to Kevin Bacon
Using cast lists from the Internet Movie Database (IMDB), compute Bacon numbers for all performers via breadth-first search
The ave. Bacon number, over all performers in the IMDB, is c. 2.9
Hard to find one that's larger than 5
One movie enthusiast tried to come up with the largest Bacon number
Found an obscure 1928 Soviet pirate film, Plenniki Morya, starring P. Savin, who has a Bacon number of 7
Its supporting cast of 8 appeared nowhere else
NetworkX: Minimum Spanning Tree
Given a connected, undirected graph G, a spanning tree of G is a subgraph that
is a tree and
connects all the vertices
A single graph may have several spanning trees
The weight of a spanning tree is the sum of the weights of the edges in that spanning tree
A minimum spanning tree (MST) is a spanning tree whose weight is ≤ the weight of every other spanning tree
More generally, any undirected graph (not necessarily connected) has a minimum spanning forest:
the union of MSTs for its connected components
minimum_spanning_tree(G, weight='weight') returns an MST or, if the graph isn’t connected, a min. spanning forest of G
G must be a Graph (not a DiGraph, MultiGraph, …)
A Graph is returned
weight is the edge-data key to use for the weight (default = 'weight')
If the edges don’t have a weight attribute, a default weight of 1 is used
Uses Kruskal’s algorithm
minimum_spanning_edges(G, weight='weight', data=True) returns a generator that produces edges in the MST
Edges are 3-tuples (u,v,w), where w is the edge-data dictionary
If keyword argument data is False, edges are just (u,v)
>>> G = nx.watts_strogatz_graph(30, 10, 0.2)
>>> nx.draw_graphviz(G, prog='sfdp', node_color='w')
>>> plt.show()
>>> mst = nx.minimum_spanning_tree(G)
>>> nx.draw_graphviz(mst, prog='sfdp', node_color='w')
>>> plt.show()
>>> ee = nx.minimum_spanning_edges(G)
>>> for (u,v,w) in ee:
...     if u == 0 or v == 0:
...         print(u, v, w)
...
0 3 {}
0 4 {}
0 5 {}
0 7 {}
0 16 {}
0 26 {}
0 27 {}
0 28 {}
0 29 {}
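The notes say that minimum_spanning_tree uses Kruskal’s algorithm. A minimal plain-Python sketch of that algorithm follows, as an illustration of the idea and not NetworkX’s actual code. The function name `kruskal` and the toy weighted graph are assumptions made for this example.

```python
def kruskal(nodes, edges):
    """Return the edges of a minimum spanning forest.

    edges is a list of (u, v, weight) tuples for an undirected graph.
    """
    parent = {n: n for n in nodes}

    def find(n):
        # Follow parent pointers to the component's representative,
        # shortening the path along the way.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    chosen = []
    # Consider edges from lightest to heaviest.
    for u, v, w in sorted(edges, key=lambda e: e[2]):
        ru, rv = find(u), find(v)
        if ru != rv:  # the edge joins 2 different components: keep it
            parent[ru] = rv
            chosen.append((u, v, w))
    return chosen

# Toy graph; node "e" is isolated, so the result is a forest, not a single tree.
nodes = ["a", "b", "c", "d", "e"]
edges = [("a", "b", 1), ("b", "c", 2), ("a", "c", 3), ("c", "d", 4), ("b", "d", 5)]
forest = kruskal(nodes, edges)
total = sum(w for _, _, w in forest)
print(forest, total)
```

An edge is rejected exactly when both endpoints are already in the same component, i.e. when adding it would close a cycle. On a disconnected input the same loop yields a minimum spanning forest.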
NetworkX: Distance Measures
First define the graph-theoretic distance-related concepts, then give the relevant NetworkX functions
The distance between 2 vertices (nodes) in a graph is the number of edges in a shortest path connecting them
Also called the geodesic distance: it’s the length of the graph geodesic between those 2 vertices
A graph geodesic is a shortest path between 2 nodes (there may be several for a given pair of nodes)
If there is no path connecting the 2 vertices, the distance is defined as infinite
The eccentricity of a vertex v is the greatest geodesic distance between v and any other vertex
How far a node is from the node most distant from it in the graph
The radius of a graph is the min. eccentricity of any vertex in the graph
The diameter of a graph is the max. eccentricity of any vertex in the graph
I.e., the greatest distance between any pair of vertices.
A central vertex in a graph of radius r is one whose eccentricity is r —i.e., a vertex that achieves the radius
A peripheral vertex in a graph of diameter d is one that is distance d from some other vertex—i.e., a vertex that achieves the diameter
Python Functions for Distance Measures
eccentricity(G) returns a dictionary mapping each node of G to its eccentricity
radius(G) returns the radius of G
diameter(G) returns the diameter of G
center(G) returns the set of central vertices (nodes) of G
periphery(G) returns the set of peripheral nodes of G
>>> G = nx.barbell_graph(4,2)
>>> nx.eccentricity(G)
{0: 5, 1: 5, 2: 5, 3: 4, 4: 3, 5: 3, 6: 4, 7: 5, 8: 5, 9: 5}
>>> nx.diameter(G)
5
>>> nx.periphery(G)
[0, 1, 2, 7, 8, 9]
>>> nx.radius(G)
3
>>> nx.center(G)
[4, 5]
>>> DG1 = nx.DiGraph([(0,1), (1,3), (3,0), (3,2), (2,0)])
>>> nx.eccentricity(DG1)
{0: 3, 1: 2, 2: 3, 3: 2}
>>> nx.diameter(DG1)
3
>>> nx.periphery(DG1)
[0, 2]
>>> nx.radius(DG1)
2
>>> nx.center(DG1)
[1, 3]
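The distance definitions can also be checked by hand with breadth-first search. The sketch below rebuilds the same shape as barbell_graph(4,2) (two 4-node cliques joined by the path 3-4-5-6) as a plain adjacency dictionary. The helper names are assumptions for this example and are not NetworkX functions.

```python
from collections import deque

def distances_from(adj, source):
    """BFS distances from source to every node reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def eccentricities(adj):
    """Eccentricity of each node; requires every node to reach every other."""
    ecc = {}
    for v in adj:
        dist = distances_from(adj, v)
        if len(dist) < len(adj):
            raise ValueError("some distances are infinite")
        ecc[v] = max(dist.values())
    return ecc

# Two 4-node cliques (0-3 and 6-9) joined by the path 3-4-5-6.
adj = {n: set() for n in range(10)}

def add_edge(u, v):
    adj[u].add(v)
    adj[v].add(u)

for clique in (range(0, 4), range(6, 10)):
    for u in clique:
        for v in clique:
            if u < v:
                add_edge(u, v)
for u, v in [(3, 4), (4, 5), (5, 6)]:
    add_edge(u, v)

ecc = eccentricities(adj)
radius = min(ecc.values())      # smallest eccentricity
diameter = max(ecc.values())    # largest eccentricity
center = sorted(v for v in ecc if ecc[v] == radius)
periphery = sorted(v for v in ecc if ecc[v] == diameter)
print(radius, diameter, center, periphery)
```

The values agree with the barbell session above: radius 3, diameter 5, center [4, 5], periphery [0, 1, 2, 7, 8, 9].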
NetworkX: Directed Acyclic Graphs
The following functions work only for a DiGraph
A directed acyclic graph (DAG) is a directed graph with no cycles
A topological sort is a non-unique permutation of the nodes of a DAG s.t. an edge from u to v implies that u appears before v
is_directed_acyclic_graph(DG) returns True if DG is a DAG or False if not
topological_sort(DG, nbunch=None) returns a list of nodes in topological sort order
nbunch is an optional container of nodes; only those nodes are sorted
If DG isn’t a DAG, no topological sort exists, and a NetworkXUnfeasible exception is raised
>>> DG = nx.DiGraph([(0,2), (0,3), (1,2), (1,4), (2,3), (2,4)])
>>> nx.is_directed_acyclic_graph(DG)
True
>>> nx.topological_sort(DG)
[1, 0, 2, 4, 3]
>>> nx.topological_sort(DG, [4,2,3])
[2, 3, 4]
>>> DG.add_edge(3,1)
>>> nx.is_directed_acyclic_graph(DG)
False
>>> nx.topological_sort(DG)
Traceback (most recent call last):
…
networkx.exception.NetworkXUnfeasible: Graph contains a cycle.
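One standard way to compute a topological sort is Kahn’s algorithm: repeatedly output a node with no remaining incoming edges. The sketch below is plain Python and is not claimed to be how NetworkX implements topological_sort. The name `topological_order` and the use of ValueError are assumptions for this example.

```python
from collections import deque

def topological_order(nodes, edges):
    """Return one topological order of a DAG, or raise ValueError on a cycle."""
    successors = {n: [] for n in nodes}
    indegree = {n: 0 for n in nodes}
    for u, v in edges:
        successors[u].append(v)
        indegree[v] += 1

    # Nodes with no incoming edges can be output right away.
    ready = deque(n for n in nodes if indegree[n] == 0)
    order = []
    while ready:
        u = ready.popleft()
        order.append(u)
        for v in successors[u]:
            indegree[v] -= 1
            if indegree[v] == 0:
                ready.append(v)

    # Any node never output sits on, or behind, a cycle.
    if len(order) < len(nodes):
        raise ValueError("graph contains a cycle")
    return order

# Same edges as the DiGraph in the session above.
nodes = [0, 1, 2, 3, 4]
edges = [(0, 2), (0, 3), (1, 2), (1, 4), (2, 3), (2, 4)]
order = topological_order(nodes, edges)
position = {n: i for i, n in enumerate(order)}
print(order)
```

Because a topological sort is not unique, the order printed here may differ from the one NetworkX returns. What matters is that every edge (u, v) has u before v.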
NetworkX: Reversing a DiGraph
DiGraph.reverse(copy=True) returns the reverse of the graph:
a graph with the same nodes and edges but with the edge directions reversed
copy, if True, results in a new DiGraph returned that holds the reversed edges
If copy is False, the reverse graph is created using the original graph (changing it in place)
MultiDiGraph.reverse(copy=True) has the same description
>>> DG = nx.DiGraph([(0,1), (1,2)])
>>> DG1 = DG.reverse()
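What reversal means can be seen on a plain adjacency dictionary. This sketch mirrors the copy=True behavior (the original is left untouched). The function `reverse` below is a stand-alone illustration, not the NetworkX method.

```python
def reverse(successors):
    """Return a new adjacency dict with every edge u -> v turned into v -> u."""
    reversed_graph = {n: set() for n in successors}  # keep every node
    for u, targets in successors.items():
        for v in targets:
            reversed_graph.setdefault(v, set()).add(u)
    return reversed_graph

# Same edges as DG above: 0 -> 1 and 1 -> 2.
dg = {0: {1}, 1: {2}, 2: set()}
dg1 = reverse(dg)
print(dg1)
```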
2.4 Network Datasets: An Overview
The increasing availability of large, detailed network datasets has led to an explosion of research on large-scale networks in recent years
Now think more systematically about where people get the data for such research
There are several reasons we might study a particular network dataset
We may care about the actual domain it comes from
So fine-grained details of the data are potentially as interesting as the broad picture
Or we’re using the dataset as a proxy for a related network that may be impossible to measure
E.g., the Microsoft IM graph from Figure 2.11 gave us info about distances in a social network of a scale and character that begins to approximate the global friendship network
Or we’re looking for network properties common across many different domains
So finding a similar effect in unrelated settings can suggest that it has a certain universal nature
All 3 motivations are often at work simultaneously, to varying degrees
E.g., consider the analysis of the Microsoft IM graph
It gave insight into the global friendship network
At a more specific level, the researchers were also interested in the dynamics of instant messaging in particular
At a more general level, the result of the IM graph analysis fit into the broader framework of small-world phenomena that spans many domains
To study a social network on 20 people, we can interview them all and ask them who their friends are
But to study the interactions among 20,000 people, we need to be more opportunistic in where we look for data
Can't just go collect everything by hand
Must think about settings where the data has in some essential way already been measured for us
Now consider some of the main sources of large-scale network data used for research
The resulting list is not exhaustive
The categories aren’t truly distinct—a single dataset can exhibit characteristics from several
Collaboration Graphs
Collaboration graphs record who works with whom in a specific setting
E.g., co-authorships among scientists, co-appearances by actors
An example extensively studied by sociologists is the graph on highly-placed people in the corporate world
An edge joins 2 if they’ve served together on the board of directors of the same Fortune 500 company
The on-line world provides new instances
The Wikipedia collaboration graph (connecting 2 Wikipedia editors if they've ever edited the same article)
The World-of-Warcraft collaboration graph (connecting 2 W-o-W users if they've ever taken part together in the same raid or other activity)
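All of these collaboration graphs are built the same way: start from records of who took part in which shared activity, then join every pair of participants. A minimal sketch follows, with invented edit histories and a hypothetical helper name `collaboration_edges`.

```python
from itertools import combinations

def collaboration_edges(memberships):
    """Return the set of collaborator pairs.

    memberships maps a shared activity (paper, article, board, raid)
    to the people who took part in it.
    """
    edges = set()
    for people in memberships.values():
        # Every pair within one activity has collaborated.
        # Sorting gives each pair a single canonical (a, b) form.
        for a, b in combinations(sorted(set(people)), 2):
            edges.add((a, b))
    return edges

# Hypothetical edit histories, invented for illustration.
edits = {
    "Article X": ["ann", "bob", "cy"],
    "Article Y": ["bob", "dee"],
    "Article Z": ["eve"],  # a lone editor contributes no edge
}
edges = collaboration_edges(edits)
print(sorted(edges))
```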
Sometimes a collaboration graph is studied to learn about the specific domain it comes from
E.g., sociologists who study the business world are interested in the relationships among companies at the director level, as expressed via co-membership on boards
In contrast, e.g., people other than research scientists are interested in scientific co-authorship networks
because they form detailed, pre-digested snapshots of a rich form of social interaction that unfolds over a long period of time
With on-line bibliographic records, can often track the patterns of collaboration within a field across a century or more
Thereby extrapolate how the social structure of collaboration may work across a range of harder-to-measure settings as well
Who-Talks-to-Whom Graphs
The Microsoft IM graph is a snapshot of a large community engaged in several billion conversations during a month
Captures the “who-talks-to-whom” structure of the community
Similar datasets have been constructed
from the e-mail logs within a company or a university
from records of phone calls: study the structure of call graphs, where each node is a phone number and there’s an edge between 2 numbers if they engaged in a phone call over a given observation period
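A who-talks-to-whom graph of this kind could be assembled from a log as follows. The record layout (caller, callee, day number), the phone numbers, and the function name `call_graph` are all assumptions made for this sketch.

```python
def call_graph(records, start, end):
    """Undirected who-talks-to-whom edges from calls placed in [start, end)."""
    edges = set()
    for caller, callee, when in records:
        if start <= when < end and caller != callee:
            # frozenset ignores direction: A calling B equals B calling A.
            edges.add(frozenset((caller, callee)))
    return edges

# Hypothetical call log: (caller, callee, day number).
records = [
    ("555-0001", "555-0002", 3),
    ("555-0002", "555-0001", 9),   # same pair, opposite direction
    ("555-0003", "555-0001", 12),
    ("555-0004", "555-0005", 40),  # outside the observation period
]
edges = call_graph(records, start=0, end=30)
print(len(edges))
```

Changing the observation period changes the graph, so any result derived from it should state the period used.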
Can also use the fact that mobile phones with short-range wireless technology can detect other similar devices nearby
Equip subjects with such devices
Study the traces they record
Thereby build “face-to-face” graphs that record physical proximity
A node is a person carrying one of the mobile devices
There’s an edge joining 2 people if they were detected to be in close physical proximity over the observation period
The nodes generally represent customers, employees, or students of the organization that maintains the data, and these people have strong expectations of privacy
The research is generally restricted in specific ways to protect privacy
Such privacy considerations have also become an issue where
companies try to use this type of data for marketing
governments try to use it for intelligence-gathering purposes
Economic network measurements recording the “who-transacts-with-whom” structure of a market or financial community have been used to study the ways in which
different levels of access to market participants lead to
different levels of market power and different prices for goods
This motivates more mathematical investigations of how a network structure limiting access between buyers and sellers affects outcomes (cf. Chaps. 10-12)
Information Linkage Graphs
Snapshots of the Web are central examples of network datasets
Nodes are Web pages
Directed edges represent links from one page to another
Of particular interest (beyond the info in the documents) are the social and economic structures that stand behind the info
hundreds of millions of personal pages on social-networking and blogging sites
hundreds of millions more representing companies and governmental organizations engineering their external images
Because of the scale of the full Web, just manipulating the data effectively is a research challenge in itself
So a lot of network research is done on interesting, well-defined subsets of the Web, including the linkages among
bloggers
pages on Wikipedia
pages on social-networking sites such as Facebook or MySpace
discussions and product reviews on shopping sites
Since the early 20th century (well before the Web), citation analysis has studied the network structure of citations among scientific papers or patents
Lets us track the evolution of science
Citation networks remain popular in social research for the same reason that scientific co-authorship graphs are
They’re very clean datasets that span decades
Technological Networks
Don’t think of the Web as primarily a technological network
It’s really a projection onto a technological backdrop of ideas, info, and social and economic structure created by humans
But there’s been a convergence of social and technological networks
A lot of interesting network data comes from the more overtly technological end of the spectrum
Nodes represent physical devices
Edges represent physical connections between them
Examples include the interconnections among
computers on the Internet
generating stations in a power grid
Even such physical networks are ultimately also economic networks
Represent the interactions among the competing organizations, companies, regulatory bodies, and other economic entities that shape them
On the Internet, we have a two-level view of the network
At the lowest level
Nodes are individual routers and computers
An edge means that 2 devices are physically connected
At a higher level, these nodes are grouped into little “nation-states” termed autonomous systems (ASs)
Each is controlled by a different Internet service provider (ISP)
The who-transacts-with-whom graph on the ASs is the AS graph
Represents the data transfer agreements these ISPs make with each other
Networks in the Natural World
Network research has special interest in several different types of biological networks
Look at 3 examples at 3 different scales, from population level down to molecular level
Food webs represent the who-eats-whom relationships among species in an ecosystem
There’s a node for each species
A directed edge from node A to node B indicates that members of A consume members of B
Seeing the structure of a food web as a graph helps us reason about issues such as cascading extinctions
If certain species become extinct, species relying on them for food also risk extinction
These extinctions can propagate through the food web
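The propagation of extinctions can be sketched as a simple fixed-point computation on the food-web graph. The rule used here is a deliberately crude assumption (a consumer dies out only once all of its prey are extinct). The species and the function name `cascade` are invented for illustration.

```python
def cascade(prey_of, initially_extinct):
    """Return every species lost once extinctions stop propagating.

    prey_of maps each species to the set of species it eats.
    Assumption: a consumer dies out only when ALL of its prey are extinct.
    """
    extinct = set(initially_extinct)
    changed = True
    while changed:
        changed = False
        for species, prey in prey_of.items():
            # Species with no prey listed (e.g. plants) never starve here.
            if species not in extinct and prey and prey <= extinct:
                extinct.add(species)
                changed = True
    return extinct

# Hypothetical food web: an edge A -> B means A eats B.
prey_of = {
    "grass": set(),
    "shrub": set(),
    "rabbit": {"grass"},
    "deer": {"grass", "shrub"},
    "fox": {"rabbit"},
    "wolf": {"deer", "rabbit"},
}
lost = cascade(prey_of, {"grass"})
print(sorted(lost))
```

Under this rule, losing grass removes the rabbit and then the fox, while the deer and wolf survive because each still has another food source.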
In the structure of neural connections within an organism's brain:
Nodes are neurons
An edge represents a connection between 2 neurons
The global brain architecture for the simple organism C. elegans (a 1 mm roundworm), with 302 nodes and c. 7000 edges, has been completely mapped
But detailed network pictures for brains of higher organisms are far beyond the state of the art
Still, significant insight has been gained by studying the structure of specific modules within a complex brain and understanding how they interrelate
There are many ways to define the set of networks that make up a cell’s metabolism, but roughly:
Nodes are compounds that play a role in a metabolic process
Edges represent chemical interactions among them
Hope that analysis of these networks can shed light on the complex reaction pathways and regulatory feedback loops that take place inside a cell, and suggest network-centric attacks on pathogens that disrupt a cell’s metabolism