Finding dense components in weighted graphs Paul Horn 12-2-02.

Overview

Addressing the problem
  What is the problem
  How it differs from other, already solved problems
Building a solution
  Already existing research
  Preliminary work
  Final solution

Overview: The Sequel

Analysis
  Testing
  Effectiveness
  Time complexity
Future work
  Trimming the data set more
  Linking it with real data

The problem

To find dense subgraphs of a graph.
  Not just the densest
  Not necessarily all, but as many as possible of the subgraphs that are ‘dense enough’
The idea is to identify communities based on a communications network.
  The more dense the communication is within a subgraph, the more likely it is a community

Why is it hard

The fastest flow-based methods for finding the single densest subgraph are cubic or worse.
We want more than one dense subgraph.
The greedy approximation algorithm is destructive and thus returns only one graph.
The problem becomes harder when we allow subgraphs to overlap.

Weighty Ideas

Input graphs to the algorithm are weighted.
  Weights of a graph represent the intensity of a communication
  Intensity represents the duration and frequency of a communication
Requires a new definition of density.

How dense can it get?

Recall our old definition of density:

$$f(S) = \frac{|E(S)|}{|S|}$$

We modify it to give a notion of density of a weighted graph:

$$f(S) = \frac{\sum_{v \in E(S)} w(v)}{|S|}$$

Here E(S) is the set of edges with both endpoints in S, and w(v) is the weight of edge v.

Note that if the weight of all edges is one, the two definitions coincide.
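As a rough illustration of the unweighted and weighted density definitions, the Python sketch below computes both for a vertex set S. The edge-list representation as (u, v, weight) triples and the function names are assumptions made for this example, not part of the original work.

```python
def induced_edges(S, edges):
    """Edges of the graph with both endpoints in S, i.e. E(S)."""
    S = set(S)
    return [(u, v, w) for u, v, w in edges if u in S and v in S]


def density(S, edges):
    """Unweighted density: |E(S)| / |S|. Assumes S is non-empty."""
    return len(induced_edges(S, edges)) / len(S)


def weighted_density(S, edges):
    """Weighted density: total weight of E(S) divided by |S|."""
    return sum(w for _, _, w in induced_edges(S, edges)) / len(S)


# A triangle on a, b, c plus one edge leaving the set.
unit = [("a", "b", 1), ("b", "c", 1), ("a", "c", 1), ("c", "d", 1)]
heavy = [("a", "b", 2), ("b", "c", 3), ("a", "c", 4), ("c", "d", 1)]

print(density("abc", unit), weighted_density("abc", unit))
print(weighted_density("abc", heavy))
```

With unit weights the two functions return the same value, which is the remark made on the slide; with other weights, only the weighted version reflects how intense the communication inside S is.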

Done before?

Discussed in the Charikar paper presentation.
Goldberg, A.V., Finding a Maximum Density Subgraph: a flow-based maximum density subgraph algorithm.
Charikar, Greedy Approximation Algorithms for Finding Dense Components in a Graph: presented a linear-time approximation algorithm.

Preliminary Work

An implementation of Goldberg's and Charikar's algorithms.
  In test data (generated in a dual-probability Erdős-Rényi model), Charikar's algorithm identified close to the actual dense subgraph.
  These graphs, however, were unweighted and thus ignored the weighted requirement, and each had only one dense subgraph.

A First Attempt

A modification of Charikar's algorithm for weighted graphs:
  At each step, remove a random edge of lowest weight.
  Then find all connected components.
  Recurse down on each component, and return the maximal density subgraph.
By repeated executions of the algorithm, the hope is that different dense components, which can overlap, will be revealed.
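The steps of this first attempt can be sketched in Python roughly as follows. This is only one reading of the slide, not the author's implementation: the recursion is written as an explicit work list, the graph is assumed to be a non-empty vertex collection plus (u, v, weight) triples, and all names are made up for the example.

```python
import random
from collections import defaultdict


def weighted_density(vertices, edges):
    """Total edge weight divided by the number of vertices."""
    return sum(w for _, _, w in edges) / len(vertices)


def components(vertices, edges):
    """Split (vertices, edges) into connected components."""
    adj = defaultdict(list)
    for u, v, _ in edges:
        adj[u].append(v)
        adj[v].append(u)
    label = {}
    for s in vertices:
        if s in label:
            continue
        label[s] = s
        stack = [s]
        while stack:  # depth-first search from s
            x = stack.pop()
            for y in adj[x]:
                if y not in label:
                    label[y] = s
                    stack.append(y)
    parts = defaultdict(lambda: (set(), []))
    for v in vertices:
        parts[label[v]][0].add(v)
    for u, v, w in edges:
        parts[label[u]][1].append((u, v, w))
    return list(parts.values())


def peel_lowest_edges(vertices, edges, rng):
    """Return (density, vertex set) of the densest component seen."""
    best = (0.0, frozenset())
    work = [(set(vertices), list(edges))]
    while work:
        vs, es = work.pop()
        d = weighted_density(vs, es)
        if d > best[0]:
            best = (d, frozenset(vs))
        if not es:
            continue
        # Remove one edge of lowest weight, chosen at random among ties.
        lowest = min(w for _, _, w in es)
        es.remove(rng.choice([e for e in es if e[2] == lowest]))
        # "Recurse" on every connected component that remains.
        work.extend(components(vs, es))
    return best


edges = [("a", "b", 5), ("b", "c", 5), ("a", "c", 5), ("c", "d", 1)]
best_density, best_set = peel_lowest_edges("abcd", edges, random.Random(0))
print(best_density, sorted(best_set))
```

Calling the function again with a differently seeded `random.Random` changes how ties between equal-weight edges are broken, which is the mechanism the slide relies on for revealing different dense components.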

Seems Promising, but…

In test cases generated similarly to those used in testing Charikar's and Goldberg's algorithms, it successfully identified close to, if not all of, the dense portions.
In simulated communication network data, the graph was dense enough that large areas of the graph were denser than the smaller portions, so the smaller portions were not found.

Partitioning?

By partitioning optimally, by finding a cut of minimum size, we can increase the density of the graph (to some extent).
  Since we cut edges of low weight, the edges of high weight remain on each of the partitions.
(Obviously) doesn't work forever.
  However, knowing approximately what size we want, we can find ideal candidates.

Rethinking our algorithm

Partitioning-based algorithm idea:
  Uses Kernighan-Lin to find close-to-optimal partitions.
  Recurses down on the partitions until they are of the desired size.
  The densest of the partitions left are our output.
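A minimal Python sketch of this recursive idea is given below. Note the substitution: instead of the full Kernighan-Lin procedure it uses a much simpler greedy pair-swap refinement as a stand-in, and it is far slower than a real implementation. The weight dictionary keyed by `frozenset((u, v))`, the `target` size parameter (assumed to be at least 1) and all function names are assumptions of the example.

```python
import random


def cut_weight(A, B, weight):
    """Total weight of edges with one endpoint in A and the other in B."""
    return sum(weight.get(frozenset((a, b)), 0) for a in A for b in B)


def refine(A, B, weight):
    """Greedy pair swaps that lower the cut (simplified stand-in for KL)."""
    A, B = set(A), set(B)
    improved = True
    while improved:
        improved = False
        base = cut_weight(A, B, weight)
        for a in list(A):
            for b in list(B):
                A2, B2 = (A - {a}) | {b}, (B - {b}) | {a}
                if cut_weight(A2, B2, weight) < base:
                    A, B, improved = A2, B2, True
                    break
            if improved:
                break
    return A, B


def density(S, weight):
    """Weighted density of the subgraph induced by S."""
    return sum(w for e, w in weight.items() if e <= S) / len(S)


def bisect_down(vertices, weight, target, rng):
    """Recursively bisect until every part has at most `target` vertices."""
    vertices = sorted(vertices)
    if len(vertices) <= target:
        return [frozenset(vertices)]
    rng.shuffle(vertices)  # random starting bisection
    half = len(vertices) // 2
    A, B = refine(vertices[:half], vertices[half:], weight)
    return bisect_down(A, weight, target, rng) + bisect_down(B, weight, target, rng)


# Two heavy triangles joined by one light edge.
pairs = [("a", "b", 5), ("b", "c", 5), ("a", "c", 5),
         ("d", "e", 5), ("e", "f", 5), ("d", "f", 5), ("c", "d", 1)]
weight = {frozenset((u, v)): w for u, v, w in pairs}

parts = bisect_down("abcdef", weight, 3, random.Random(1))
ranked = sorted(parts, key=lambda S: density(S, weight), reverse=True)
print([sorted(S) for S in ranked])
```

On this toy graph the refinement always ends with the light edge as the only cut edge, so the two triangles come out as the parts; the ranked list at the end corresponds to "the densest of the partitions left".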

Finalizing our thought

Run the algorithm on more than one partition.
Random partitions are likely to be close to orthogonal.
Generate k partitions, and take the best l partitions (after KL is applied) at the top level.
On each other level, generate k partitions, and take the top one.
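The selection scheme described here (k candidates per level, best l kept at the top level, best one kept below) might look roughly like the following Python sketch. The `refine` argument is a hypothetical hook where a Kernighan-Lin pass would go; by default it does nothing, so the example only demonstrates the generate-and-select logic. Ranking candidates by cut weight is an assumption, as are the data layout and function names.

```python
import random


def cut_weight(A, B, weight):
    """Total weight of edges crossing between A and B."""
    return sum(weight.get(frozenset((a, b)), 0) for a in A for b in B)


def random_bisection(vertices, rng):
    """Split the vertices into two random halves."""
    vs = sorted(vertices)
    rng.shuffle(vs)
    half = len(vs) // 2
    return frozenset(vs[:half]), frozenset(vs[half:])


def best_bisections(vertices, weight, k, keep, rng,
                    refine=lambda A, B, w: (A, B)):
    """Generate k random bisections, refine each, keep the `keep` smallest cuts."""
    candidates = [refine(*random_bisection(vertices, rng), weight)
                  for _ in range(k)]
    candidates.sort(key=lambda ab: cut_weight(ab[0], ab[1], weight))
    return candidates[:keep]


def search(vertices, weight, target, k, l, rng, top=True):
    """Collect candidate parts of at most `target` (>= 1) vertices."""
    if len(vertices) <= target:
        return [frozenset(vertices)]
    found = []
    # Keep l bisections at the top level, a single one on every other level.
    for A, B in best_bisections(vertices, weight, k, l if top else 1, rng):
        found += search(A, weight, target, k, l, rng, top=False)
        found += search(B, weight, target, k, l, rng, top=False)
    return found


pairs = [("a", "b", 5), ("b", "c", 5), ("a", "c", 5),
         ("d", "e", 5), ("e", "f", 5), ("d", "f", 5), ("c", "d", 1)]
weight = {frozenset((u, v)): w for u, v, w in pairs}

found = search("abcdef", weight, target=3, k=4, l=2, rng=random.Random(0))
print([sorted(S) for S in found])
```

Because the l top-level bisections are explored independently, the returned parts can overlap, which is what allows overlapping dense subgraphs to be reported.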

Analyzing the Situation

The 2-approximation bound that we had is no longer necessarily valid for the KL-based algorithm.
The algorithm has met with some success in identifying clusters in simulated data, but needs more tuning with respect to size and the trimming of the data set.
  By trimming out small partitions that are found that are similar, we reduce overlap.
  It may now find too many graphs, or incorrect graphs, but this problem can be relieved by taking only the small portions of a certain density (say, some percentage of the final).
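One possible way to express the trimming step is sketched below in Python. The slide does not say how similarity is measured or what the density percentage is relative to, so both choices here are assumptions: similarity is taken to be Jaccard overlap of the vertex sets, and the density floor is a fraction of the best density found. The thresholds and names are illustrative only.

```python
def jaccard(A, B):
    """Overlap of two non-empty vertex sets, between 0 and 1."""
    return len(A & B) / len(A | B)


def trim(candidates, min_fraction=0.8, max_overlap=0.5):
    """Filter (density, vertex set) pairs.

    Keeps candidates whose density is at least `min_fraction` of the best
    density, and drops any candidate too similar to a denser one already kept.
    """
    if not candidates:
        return []
    ranked = sorted(candidates, key=lambda c: c[0], reverse=True)
    floor = min_fraction * ranked[0][0]
    kept = []
    for d, S in ranked:
        if d < floor:
            break  # everything after this is sparser still
        if all(jaccard(S, T) <= max_overlap for _, T in kept):
            kept.append((d, S))
    return kept


candidates = [
    (5.0, frozenset("abc")),
    (4.8, frozenset("abcd")),  # near-duplicate of the first set
    (4.5, frozenset("xyz")),
    (2.0, frozenset("pqr")),   # below 80% of the best density
]
print(trim(candidates))
```

Raising `max_overlap` allows more overlapping subgraphs through, while raising `min_fraction` discards more of the sparser candidates.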

Time it.

The original modification to Charikar runs in approximately O(|V||E|) time.
The new algorithm runs in approximately O(kl|V|^2 log|V|) time.
  k, l are due to generating the k partitions each time, and picking the top l at each step
  |V|^2 is a result of Kernighan-Lin
  log|V| is the result of continuing to partition
  In practice it runs very fast. Partitioning graphs of 10000+ vertices is possible in a reasonable amount of time.

In the future

The algorithm still needs to better trim the partitions it finds, and specifically needs to find partitions of more variable size.
  Could perhaps trim based on the density of the entire graph, or perhaps based on a maximum density subgraph (as found by the modified Charikar)
Already finds graphs of many sizes, but only considers the smallest at the end, so could be modified to include more of the larger partitions.

In the Future II

Future data will not be simulated, but will instead come from online sources.
  Running on a newsgroup-induced graph, for instance, can hopefully help identify groups interested in particular topics
Finding graphs based on email or portions of the web graph could help identify groups of friends or topic-related sites as well, and thus help predict communities.

So What?

By looking at not just a graph, but a series of time-based graphs, we can identify communities and how they change over time.
Using this method we can hope to identify rules which govern the changes of these communities and make predictions about their future actions.
The simulated data used was designed with this end in mind.

Summing Up

Finding multiple dense subgraphs of a graph is a relatively unexplored topic, especially finding dense subgraphs of large graphs (so that exact algorithms are unreasonable).
Prior work (such as Goldberg and Charikar) centered on finding a single densest subgraph.

Summing down

First algorithm: a modification of Charikar centered around removing edges and finding connected components.
Second algorithm: based on the Kernighan-Lin algorithm for finding close-to-optimal partitions, and recursing down to find small subgraphs that are generated by cutting a small number of edges.

The Summing

Still work to do:
Linking it back to the real data
  Internet data from newsgroups, email, etc.
  Using that to find communities over time
  Finding microlaws that govern them based on how the communities change over time
Finding better ways to trim data to ensure that the best candidates are found
