
Jerry Scripps

Date post: 21-Jan-2016
Page 1: Jerry Scripps

Network Mining

Jerry Scripps

Page 2: Jerry Scripps

Overview

- What is network mining?
- Motivation
- Preliminaries
  - definitions
  - metrics
  - network types
- Network mining techniques

Page 3: Jerry Scripps

What is Network Mining?

[Diagram: Network Mining shown together with related fields: Statistics, Graph Theory, Social Network Analysis, Machine Learning, Data Mining, Computer Science, Mathematics, Pattern Recognition]

Page 4: Jerry Scripps

What is Network Mining?
Border Disciplines

- Statistics
- Computer Science
- Physics
- Math
- Psychology
- Law Enforcement
- Sociology
- Military
- Biology
- Medicine
- Chemistry
- Business

Page 5: Jerry Scripps

What is Network Mining?

Examples:
- Discovering communities within collaboration networks
- Finding authoritative web pages on a given topic
- Selecting the most influential people in a social network

Page 6: Jerry Scripps

Network Mining – Motivation
Emerging Data Sets

- World wide web
- Social networking
- Collaboration databases
- Customer or Employee sets
- Genomic data
- Terrorist sets
- Supply Chains
- Many more…

Page 7: Jerry Scripps

Network Mining – Motivation
Direct Applications

- What is the community around msu.edu?
- What are the authoritative pages?
- Who has the most influence?
- Who is a likely member of a terrorist cell?
- Is this a news story about crime, politics or business?

Page 8: Jerry Scripps

Network Mining – Motivation
Indirect Applications

- Convert ordinary data sets into networks
- Integrate network mining techniques into other techniques

Page 9: Jerry Scripps

Preliminaries

- Definitions
- Metrics
- Network Types

Page 10: Jerry Scripps

Definitions

Node (vertex, point, object)

Link (edge, arc)

Community

Page 11: Jerry Scripps

Metrics

Node
- Degree
- Closeness
- Betweenness
- Clustering coefficient

Node Pair
- Graph distance
- Min-cut
- Common neighbors
- Jaccard's coefficient
- Adamic/Adar
- Preferential attachment
- Katz
- Hitting time
- Rooted PageRank
- SimRank
- Bibliographic metrics

Network
- Characteristic path length
- Clustering coefficient
- Min-cut
- Diameter
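A few of the metrics listed on this slide can be sketched in plain Python. This is a minimal illustration, assuming an undirected graph stored as a dict of neighbor sets; the function names and the toy graph are made up for the example and are not from the slides.

```python
from collections import deque

def degree(graph, node):
    """Node metric: number of links attached to the node."""
    return len(graph[node])

def clustering_coefficient(graph, node):
    """Node metric: fraction of the node's neighbor pairs that are themselves linked."""
    nbrs = sorted(graph[node])
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(
        1
        for i in range(k)
        for j in range(i + 1, k)
        if nbrs[j] in graph[nbrs[i]]
    )
    return 2.0 * links / (k * (k - 1))

def graph_distance(graph, source, target):
    """Node-pair metric: shortest path length by breadth-first search (None if unreachable)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if u == target:
            return dist[u]
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return None

def characteristic_path_length(graph):
    """Network metric: mean shortest path length over all connected node pairs."""
    nodes = sorted(graph)
    dists = [
        graph_distance(graph, a, b)
        for i, a in enumerate(nodes)
        for b in nodes[i + 1:]
    ]
    dists = [d for d in dists if d is not None]
    return sum(dists) / len(dists)

# Hypothetical toy network: a triangle a-b-c with a pendant node d attached to c.
toy = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}

print(degree(toy, "c"))
print(clustering_coefficient(toy, "c"))
print(graph_distance(toy, "a", "d"))
print(characteristic_path_length(toy))
```

The all-pairs breadth-first search is fine for a toy graph but would be too slow for networks of realistic size.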

Page 12: Jerry Scripps

Network Types – Random

Page 13: Jerry Scripps

Network Types – Small World

[Figure: three example networks labeled Regular, Small World, and Random]

Watts & Strogatz

Page 14: Jerry Scripps

Networks – Scale-free

[Figure: GVSU FaceBook degree distribution, Counts vs. Degree, linear axes]

[Figure: GVSU FaceBook degree distribution (log scale), Counts vs. Degree]

- Barabasi & Bonabeau
- Degree follows a power law ~ 1/k^n
- Can be found in a wide variety of real-world networks
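One rough way to check for a power law is to fit a straight line to the degree counts on log-log axes; the negative slope then estimates the exponent n in counts ~ 1/k^n. The sketch below assumes a dict mapping degree to count and uses synthetic counts, not the GVSU FaceBook data. Least-squares fitting on log-log axes is only a quick approximation.

```python
import math

def fit_power_law_exponent(degree_counts):
    """Estimate n in count ~ 1/k^n from the least-squares slope of log(count) vs. log(degree)."""
    points = [
        (math.log(k), math.log(c))
        for k, c in degree_counts.items()
        if k > 0 and c > 0
    ]
    m = len(points)
    mean_x = sum(x for x, _ in points) / m
    mean_y = sum(y for _, y in points) / m
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    return -sxy / sxx  # slope is negative for a decaying power law

# Synthetic, hypothetical degree counts that follow 1000 / k^2 exactly.
synthetic = {k: 1000.0 / k ** 2 for k in range(1, 51)}
exponent = fit_power_law_exponent(synthetic)
print(round(exponent, 3))
```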

Page 15: Jerry Scripps

Network recap

Network Type   Clustering coefficient   Characteristic path length   Power Law
Random         Low                      Low                          No
Regular        High                     High                         No
Small world    High                     Low                          ?
Scale-free     ?                        ?                            Yes

Page 16: Jerry Scripps

Techniques

- Link-Based Classification
- Link Prediction
- Ranking
- Influential Nodes
- Community Finding

Page 17: Jerry Scripps

Link-Based Classification

[Figure: a node with an unknown class label (?)]

- Include features from linked objects: building a single model on all features
- Fusion of link and attribute models

Page 18: Jerry Scripps

Link-Based Classification
Chakrabarti, et al.

- Copying data from neighboring web pages actually reduced accuracy
- Using the label from neighboring page improved accuracy

[Figure: web pages shown as feature vectors (010010, 011110, 111011, 101011); neighboring pages carry class labels A and B, and the target page is unlabeled (?)]

Page 19: Jerry Scripps

Link-Based Classification
Lu & Getoor

- Define vectors for attributes and links
  - Attribute data OA(X)
  - Link data LD(X) constructed using
    - mode (single feature – class of plurality)
    - count (feature for each class – count for neighbors)
    - binary (feature for each class – 0/1 if exists)

[Figure: an unlabeled node (?) with attribute vector 111011 and neighbors labeled A, A, and B; OA (attr) and LD (link) vectors, with example link vectors 2 1 0 and 1 1 0, feed Model 1 and Model 2]
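The three ways of building the link vector LD(X) can be sketched as follows. This is a minimal illustration, assuming the neighbors' class labels are already known; the function name `link_features` and the third class "C" are hypothetical, added only to make the vectors three wide.

```python
from collections import Counter

def link_features(neighbor_labels, classes):
    """Build mode, count, and binary link features from the labels of a node's neighbors."""
    counts = Counter(neighbor_labels)
    # count: one feature per class, the number of neighbors with that class
    count_vec = [counts.get(c, 0) for c in classes]
    # binary: one feature per class, 1 if any neighbor has that class
    binary_vec = [1 if n > 0 else 0 for n in count_vec]
    # mode: a single feature, the plurality class (ties go to the earlier class in `classes`)
    mode = max(classes, key=lambda c: counts.get(c, 0)) if neighbor_labels else None
    return {"mode": mode, "count": count_vec, "binary": binary_vec}

# A node with two neighbors of class A and one of class B.
features = link_features(["A", "A", "B"], classes=["A", "B", "C"])
print(features)
```

For neighbors labeled A, A, and B this gives the count vector 2 1 0 and the binary vector 1 1 0.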

Page 20: Jerry Scripps

Link-Based Classification
Lu & Getoor

Define probabilities for both

- Attribute:

$$P(c \mid w_o, OA(X)) = \frac{1}{\exp(-w_o^T \, OA(X) \, c) + 1}$$

- Link:

$$P(c \mid w_l, LD(X)) = \frac{1}{\exp(-w_l^T \, LD(X) \, c) + 1}$$

Class estimation:

$$\hat{C}(X) = P(c \mid w_o, OA(X)) \, P(c \mid w_l, LD(X))$$
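As a rough sketch of how the attribute and link probabilities combine, the code below evaluates a logistic model for each and multiplies the two. It assumes a two-class setting with c encoded as +1 or -1 and picks the class with the larger product; the weights and feature vectors are invented for illustration and do not come from the slides or from any trained model.

```python
import math

def logistic_prob(weights, features, c):
    """P(c | w, x) = 1 / (exp(-w^T x * c) + 1), with the class c encoded as +1 or -1."""
    score = sum(w * x for w, x in zip(weights, features))
    return 1.0 / (math.exp(-score * c) + 1.0)

def class_score(w_o, oa, w_l, ld, c):
    """Product of the attribute-model and link-model probabilities for class c."""
    return logistic_prob(w_o, oa, c) * logistic_prob(w_l, ld, c)

# Hypothetical weights and features (assumptions for this example only).
w_o = [0.5, -0.25, 1.0]   # attribute-model weights
w_l = [0.8, -0.6, 0.0]    # link-model weights
oa = [1, 0, 1]            # attribute vector OA(X)
ld = [2, 1, 0]            # link vector LD(X), count form

scores = {c: class_score(w_o, oa, w_l, ld, c) for c in (+1, -1)}
predicted = max(scores, key=scores.get)
print(predicted, scores)
```

Keeping the two models separate lets each be weighted and regularized on its own, which is the design the slide contrasts with a single model over all features.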

Page 21: Jerry Scripps

Collective Classification

- Uses both attributes and links
- Iteratively update the unlabeled instances
- message passing, loopy belief nets, etc.
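The iterative update can be illustrated with a much simpler stand-in than message passing or loopy belief nets. In this sketch, unlabeled nodes start from a guess (imagined here as coming from an attribute-only model) and are repeatedly relabeled with the majority label of their neighbors until nothing changes. The graph, the labels, and the majority-vote rule are all assumptions made for the example.

```python
from collections import Counter

def iterative_classification(graph, known, initial_guess, max_iters=10):
    """Relabel unlabeled nodes by neighbor majority vote until the labels stop changing."""
    labels = dict(initial_guess)
    labels.update(known)  # known labels are fixed and never overwritten
    for _ in range(max_iters):
        changed = False
        for node in sorted(graph):
            if node in known:
                continue
            votes = Counter(labels[n] for n in graph[node] if n in labels)
            if not votes:
                continue
            # ties are broken alphabetically so the result is deterministic
            best = max(sorted(votes), key=votes.get)
            if labels.get(node) != best:
                labels[node] = best
                changed = True
        if not changed:
            break
    return labels

# Hypothetical network: n1, n2 are known A; n5, n6 are known B; n3, n4 are unlabeled.
graph = {
    "n1": {"n2", "n3"},
    "n2": {"n1", "n3"},
    "n3": {"n1", "n2", "n4"},
    "n4": {"n3", "n5", "n6"},
    "n5": {"n4"},
    "n6": {"n4"},
}
known = {"n1": "A", "n2": "A", "n5": "B", "n6": "B"}
initial_guess = {"n3": "B", "n4": "B"}  # stand-in for an attribute-only prediction

result = iterative_classification(graph, known, initial_guess)
print(result["n3"], result["n4"])
```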

Page 22: Jerry Scripps

Link-Based Classification
Summary

- Using class of neighbors improves accuracy
- Using separate models for attribute and link data further improves accuracy
- Other considerations:
  - improvements are possible by using community information
  - knowledge of network type could also benefit classifier

Page 23: Jerry Scripps

Techniques

- Link-Based Classification
- Link Prediction
- Ranking
- Influential Nodes
- Community Finding

Page 24: Jerry Scripps

Link Prediction

Page 25: Jerry Scripps

Link Prediction
Liben-Nowell and Kleinberg

Tested node-pair metrics:
- Graph distance
- Neighborhood
  - Common neighbors
  - Jaccard's coefficient
  - Adamic/Adar
  - Preferential attachment
- Ensemble of paths
  - Katz
  - Hitting time
  - Rooted PageRank
  - SimRank
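The neighborhood-based scores are simple enough to sketch directly. The code below assumes an undirected graph stored as a dict of neighbor sets; the toy graph is hypothetical. A higher score for an unlinked pair is read as a higher chance that a link will form.

```python
import math

def common_neighbors(g, x, y):
    """Number of neighbors shared by x and y."""
    return len(g[x] & g[y])

def jaccard(g, x, y):
    """Shared neighbors divided by the size of the combined neighborhood."""
    union = g[x] | g[y]
    return len(g[x] & g[y]) / len(union) if union else 0.0

def adamic_adar(g, x, y):
    """Shared neighbors weighted by 1 / log(degree), so rare neighbors count more."""
    return sum(1.0 / math.log(len(g[z])) for z in g[x] & g[y] if len(g[z]) > 1)

def preferential_attachment(g, x, y):
    """Product of the two degrees."""
    return len(g[x]) * len(g[y])

# Hypothetical toy network; a and b are not linked but share neighbors c and d.
g = {
    "a": {"c", "d"},
    "b": {"c", "d", "e"},
    "c": {"a", "b"},
    "d": {"a", "b", "e"},
    "e": {"b", "d"},
}

print(common_neighbors(g, "a", "b"))
print(jaccard(g, "a", "b"))
print(adamic_adar(g, "a", "b"))
print(preferential_attachment(g, "a", "b"))
```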

Page 26: Jerry Scripps

Link Prediction - results

Page 27: Jerry Scripps

Link Prediction – newer methods

- maximum likelihood
- stochastic block model
- probabilistic

Page 28: Jerry Scripps

Link Prediction – summary

- There is room for growth – best predictor has accuracy of only around 9%
- Predicting collaborations is difficult
- New problem could be to predict the direction of the link

Page 29: Jerry Scripps

Techniques

- Link-Based Classification
- Link Prediction
- Ranking
- Influential Nodes
- Community Finding
- Link Completion

Page 30: Jerry Scripps

Ranking

Page 31: Jerry Scripps

Ranking – Markov Chain Based

- Random-surfer analogy
- Problem with cycles
- PageRank uses random vector
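The random-surfer idea can be sketched as a power iteration. At each step the surfer follows an outgoing link with probability `damping` and otherwise jumps to a random page, which is what lets the iteration settle even when the link structure has cycles. The damping value of 0.85, the uniform jump vector, the handling of pages without outgoing links, and the toy web graph are assumptions for this example.

```python
def pagerank(links, damping=0.85, iters=100):
    """Power iteration for PageRank on a dict mapping each page to its list of out-links."""
    nodes = sorted(links)
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iters):
        # random jump: every page receives an equal share of the (1 - damping) mass
        new = {u: (1.0 - damping) / n for u in nodes}
        for u in nodes:
            out = links[u]
            if out:
                share = damping * rank[u] / len(out)
                for v in out:
                    new[v] += share
            else:
                # page with no out-links: spread its rank evenly over all pages
                for v in nodes:
                    new[v] += damping * rank[u] / n
        rank = new
    return rank

# Hypothetical four-page web; page d has no incoming links.
web = {"a": ["b"], "b": ["c"], "c": ["a", "b"], "d": ["c"]}
rank = pagerank(web)
for page in sorted(rank, key=rank.get, reverse=True):
    print(page, round(rank[page], 4))
```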

Page 32: Jerry Scripps

Ranking – summary

- Other methods such as HITS and SALSA also based on Markov chain
- Ranking has been applied in other areas:
  - text summarization
  - anomaly detection

Page 33: Jerry Scripps

Techniques

- Link-Based Classification
- Link Prediction
- Ranking
- Influential Nodes
- Community Finding

Page 34: Jerry Scripps

Influence

Page 35: Jerry Scripps

Influence Maximization

- Problem: find the best nodes to activate
- Approaches:
  - degree – fast but not effective
  - greedy – effective but slow
  - improvements to greedy: degree heuristics and Shapley value
  - use communities
  - cost-benefit – probabilistic approach

Page 36: Jerry Scripps

Maximizing influence
model-based

- Problem – finding the k best nodes to activate to maximize the number of nodes activated
- Models:
  - independent cascade – when activated, a node has a one-time chance to activate each neighbor j with prob. p_ij
  - linear threshold – node becomes activated when the percent of its activated neighbors crosses a threshold
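A single run of the independent cascade model can be sketched as below. The sketch assumes a directed graph stored as a dict of out-neighbor lists and a caller-supplied function giving the activation probability for each link; the toy graph and the constant probabilities are hypothetical. In practice many runs would be averaged to estimate the expected spread.

```python
import random

def independent_cascade(graph, seeds, prob, rng):
    """Simulate one cascade: each newly active node gets one chance per inactive neighbor."""
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        next_frontier = []
        for u in frontier:
            for v in graph[u]:
                # u tries once to activate v, succeeding with probability prob(u, v)
                if v not in active and rng.random() < prob(u, v):
                    active.add(v)
                    next_frontier.append(v)
        frontier = next_frontier
    return active

# Hypothetical directed network.
graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": [], "e": ["a"]}

rng = random.Random(0)
always = independent_cascade(graph, {"a"}, lambda u, v: 1.0, rng)   # every attempt succeeds
never = independent_cascade(graph, {"a"}, lambda u, v: 0.0, rng)    # every attempt fails
sample = independent_cascade(graph, {"a"}, lambda u, v: 0.5, rng)   # one random outcome
print(sorted(always), sorted(never), sorted(sample))
```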

Page 37: Jerry Scripps

Maximizing influence
model-based

- Models: independent cascade & linear threshold
- A function f: S → S* can be created using either model
- Functions use Monte Carlo, hill-climbing solution
- Submodular functions, where for S ⊆ T

$$f(S \cup \{v\}) - f(S) \ge f(T \cup \{v\}) - f(T)$$

  are proven in another work to be NP-C, but by using a hill-climbing solution can get to within 1 − 1/e of optimum.
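The hill-climbing idea is to add, one node at a time, the node with the largest marginal gain in f. The sketch below uses a deterministic coverage function (seed nodes plus their direct neighbors) as a stand-in for f, because it is submodular and easy to check by hand; the model-based approach on the slide would instead estimate f by Monte Carlo simulation of a cascade model. The function names and the toy graph are hypothetical.

```python
def coverage(graph, seed_set):
    """Stand-in objective: number of nodes that are seeds or adjacent to a seed."""
    covered = set(seed_set)
    for u in seed_set:
        covered |= graph[u]
    return len(covered)

def greedy_select(graph, k, f=coverage):
    """Hill climbing: repeatedly add the node with the largest marginal gain in f."""
    chosen = set()
    for _ in range(k):
        best_node, best_gain = None, 0
        base = f(graph, chosen)
        for v in sorted(graph):
            if v in chosen:
                continue
            gain = f(graph, chosen | {v}) - base
            if gain > best_gain:
                best_node, best_gain = v, gain
        if best_node is None:  # no remaining node improves the objective
            break
        chosen.add(best_node)
    return chosen

# Hypothetical network with two hubs, h1 and h2.
graph = {
    "h1": {"a", "b", "c"},
    "h2": {"d", "e"},
    "a": {"h1"}, "b": {"h1"}, "c": {"h1"},
    "d": {"h2"}, "e": {"h2"},
}

seeds = greedy_select(graph, k=2)
print(sorted(seeds), coverage(graph, seeds))
```

The diminishing-returns property is visible here: adding node a to the empty set gains 2, while adding it after h1 is already chosen gains nothing.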

Page 38: Jerry Scripps

Maximizing influence – cost/benefit

Assumptions:
- product x sells for $100
- a discount of 10% can be offered to various prospective customers
- If customer purchases, profit is:
  - 90 if discount is offered
  - 100 if discount is not offered
- Expected lift in profit (ELP) from offering discount is: 90*P(buy|discount) - 100*P(buy|no discount)
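A quick worked example of the ELP arithmetic, using the $100 price and 10% discount from the slide. The purchase probabilities are invented for illustration: one customer who is swayed by the discount and one who would probably buy anyway.

```python
def expected_lift_in_profit(p_buy_discount, p_buy_no_discount, price=100.0, discount=0.10):
    """ELP = (discounted price) * P(buy | discount) - (full price) * P(buy | no discount)."""
    profit_with_discount = price * (1.0 - discount)
    return profit_with_discount * p_buy_discount - price * p_buy_no_discount

# Hypothetical customers (probabilities are assumptions, not data from the slides).
elp_persuadable = expected_lift_in_profit(0.50, 0.20)  # 90*0.50 - 100*0.20
elp_loyal = expected_lift_in_profit(0.90, 0.85)        # 90*0.90 - 100*0.85

print(round(elp_persuadable, 2), round(elp_loyal, 2))
```

The first customer has a positive lift, so the discount is worth offering; the second has a negative lift, because the discount mostly gives away margin on a sale that was likely to happen anyway.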

Page 39: Jerry Scripps

Maximizing influence – cost/benefit

- Goal is to find M that maximizes global ELP
- Three approximations used:
  - single pass
  - greedy
  - hill-climbing

$$ELP(X^k, Y, M) = \sum_{i=1}^{n} \left[ r_1 P(X_i = 1 \mid X^k, Y, f_i^1(M)) - r_0 P(X_i = 1 \mid X^k, Y, f_i^0(M)) - c \right]$$

where
- X_i is the decision of customer i to buy
- Y is vector of product attributes
- M is vector of marketing decisions
- f is a function to set the ith element of M
- r_0 and r_1 are revenue gained
- c is the cost of marketing

Page 40: Jerry Scripps

Comparison of approaches

                       Cost/benefit                     Model-based
Size of starting set   variable, based on max. lift     fixed
Uses attributes        yes                              no
Probabilities          extracted from data set          assigned to links

- An extension would be to spread influence to the largest number of communities
- Improvements can be made in speed

Page 41: Jerry Scripps

Techniques

- Link-Based Classification
- Link Prediction
- Ranking
- Influential Nodes
- Community Finding

Page 42: Jerry Scripps

Communities

Page 43: Jerry Scripps

Gibson, Kleinberg and Raghavan

[Flow diagram]
Query → Search Engine → Root Set → Base Set: add forward and back links → Use HITS to find top 10 hubs and authorities
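The HITS step can be sketched as alternating updates: a page's authority score is the sum of the hub scores of the pages linking to it, and its hub score is the sum of the authority scores of the pages it links to, with both vectors normalized each round. The tiny base set below is hypothetical, and the fixed iteration count is an assumption.

```python
import math

def hits(links, iters=50):
    """Iterate hub and authority scores on a dict mapping each page to its out-links."""
    nodes = sorted(set(links) | {v for outs in links.values() for v in outs})
    hub = {u: 1.0 for u in nodes}
    auth = {u: 1.0 for u in nodes}
    for _ in range(iters):
        # authority update: sum of hub scores of pages pointing here
        auth = {u: 0.0 for u in nodes}
        for u, outs in links.items():
            for v in outs:
                auth[v] += hub[u]
        norm = math.sqrt(sum(a * a for a in auth.values())) or 1.0
        auth = {u: a / norm for u, a in auth.items()}
        # hub update: sum of authority scores of pages pointed to
        hub = {u: sum(auth[v] for v in links.get(u, ())) for u in nodes}
        norm = math.sqrt(sum(h * h for h in hub.values())) or 1.0
        hub = {u: h / norm for u, h in hub.items()}
    return hub, auth

# Hypothetical base set: p1, p2, p3 link to candidate authorities x and y.
base_set = {"p1": ["x", "y"], "p2": ["x", "y"], "p3": ["x"], "x": [], "y": []}
hub, auth = hits(base_set)
top_authority = max(auth, key=auth.get)
print(top_authority, {u: round(h, 3) for u, h in hub.items()})
```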

Page 44: Jerry Scripps

Flake, Lawrence and Giles

- Uses Min-cut
- Start with seed set
- Add linked nodes
- Find nodes from outgoing links
- Create virtual source node
- Add virtual sink linking it to all nodes
- Find the min-cut of the virtual source and sink

Page 45: Jerry Scripps

Community Finding

- Girvan and Newman – divisive, repeatedly removes the link with the highest betweenness
- Clauset, et al. – agglomerative, uses modularity
- Shi & Malik – spectral clustering

Page 46: Jerry Scripps

Communities - summary

- There are many options for building communities around a small group of nodes
- Possible future directions
  - finding communities in networks having different link types
  - impact of network type on community finding techniques

