Subgraph Isomorphism Graph Challenge - draft -
Siddharth Samsi, Vijay Gadepally, Jeremy Kepner, Albert Reuther
http://GraphChallenge.org
Graph Challenge - 2
Outline
• Introduction
• Data Sets
• Static Graph Isomorphism Challenge
• Metrics
• Summary
Graph Challenge - 3
Introduction
• Previous challenges in machine learning, High Performance Computing and visual analytics include – YOHO, MNIST, HPC Challenge, ImageNet, VAST
• GraphChallenge encourages community approaches, such as DARPA HIVE, to develop new solutions for analyzing graphs derived from social media, sensor feeds, and scientific data to enable relationships between events to be discovered as they unfold in the field
• GraphChallenge organizers will provide specifications, data sets, data generators, and serial implementations in various languages
• GraphChallenge participants are encouraged to apply innovative hardware, software, and algorithm techniques to push the envelope of power efficiency, computational efficiency, and scale
• Submissions will be in the form of full conference write-ups to IEEE HPEC, which will allow participants to be evaluated on their complete solution
Graph Challenge - 4
Graph Challenge
• GraphChallenge seeks input from diverse communities to develop graph challenges that take the best of what has been learned from groundbreaking efforts such as GraphAnalysis, Graph500, FireHose, MiniTri, and GraphBLAS to create a new set of challenges to move the community forward
• Initial Graph Challenges
  – Static Graph Challenge: Sub-Graph Isomorphism
    • This challenge seeks to identify a given sub-graph in a larger graph
  – Streaming Graph Challenge: Stochastic Block Optimization
    • This challenge seeks to identify optimal blocks (or clusters) in a larger graph
Graph Challenge - 5
Static versus Streaming Mode
• Static graph processing – Given a large graph G – Evaluate ƒ(G)
• Two classes of streaming: stateless and stateful – Stateless: process data g as it goes by ƒ(g) – Stateful: add data g to a larger corpus G and then process corpus ƒ(G + g)
• Graph processing is often in the stateful category – Easy to simulate by partitioning G into streaming pieces g
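The two streaming modes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`count_triangles`, `partition`, `stateless`, `stateful`) are hypothetical, triangle counting stands in for ƒ, and the six-edge graph is the small example used in the triangle counting slides of this deck.

```python
from typing import Iterable, List, Set, Tuple

Edge = Tuple[int, int]


def count_triangles(edges: Iterable[Edge]) -> int:
    """Stand-in for f: number of triangles in an undirected simple graph."""
    unique: Set[Edge] = {tuple(sorted(e)) for e in edges if e[0] != e[1]}
    adj = {}
    for u, v in unique:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    # Each triangle is seen once per edge, hence the division by 3.
    return sum(len(adj[u] & adj[v]) for u, v in unique) // 3


def partition(edges: List[Edge], piece_size: int) -> List[List[Edge]]:
    """Simulate a stream by cutting G into pieces g."""
    return [edges[i:i + piece_size] for i in range(0, len(edges), piece_size)]


def stateless(pieces: List[List[Edge]]) -> List[int]:
    """Evaluate f(g) on each piece independently."""
    return [count_triangles(g) for g in pieces]


def stateful(pieces: List[List[Edge]]) -> List[int]:
    """Add each piece g to the corpus G, then evaluate f(G + g)."""
    corpus: List[Edge] = []
    results = []
    for g in pieces:
        corpus.extend(g)
        results.append(count_triangles(corpus))
    return results


G = [(1, 5), (2, 4), (2, 5), (3, 4), (3, 5), (4, 5)]
pieces = partition(G, 2)
print(stateless(pieces))  # no piece contains a whole triangle
print(stateful(pieces))   # triangles appear once the corpus is complete
```

In this sketch the stateless mode never sees a complete triangle, which is one way to see why graph processing usually falls in the stateful category.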
Graph Challenge - 6
Labels and Filtering
• Filtering on edge/vertex labels is often used when available to reduce the search space
– Filtering can be applied at initialization, during intermediate steps, or at the vertex level – Very problem dependent
• Some Graph Challenge data sets have labels and some data sets are unlabeled – Some participants will want to filter on labels
• Initial example implementations will work without labels – Participants can use labels if they choose
• GraphChallenge is judged by a panel – Panel will value the variations that are submitted
Graph Challenge - 7
Outline
• Introduction
• Data Sets
• Static Graph Isomorphism Challenge
• Metrics
• Summary
Graph Challenge - 8
Publicly Available Datasets Initial Datasets in Blue
• Social networks (10 data sets up to 4.8M vertices & 69M edges) – Online social networks, edges represent interactions between people
• Networks with ground-truth (6 data sets up to 66M vertices & 1.8B edges, e.g. Friendster) – Ground-truth network communities in social and information networks
• Communication networks (3 data sets up to 2.3M vertices & 5M edges) – Email communication networks with edges representing communication
• Citation networks (3 data sets up to 3.7M vertices & 16M edges) – Nodes represent papers, edges represent citations
• Collaboration networks (5 data sets up to 23K vertices & 198K edges) – Nodes represent scientists, edges represent collaborations
• Web graphs (4 data sets up to 875K vertices & 5.1M edges) – Nodes represent webpages and edges are hyperlinks
• Amazon networks (5 data sets up to 548K vertices & 3.4M edges) – Nodes represent products and edges link commonly co-purchased products
• Internet networks (9 data sets up to 62K vertices & 147K edges) – Nodes represent computers and edges communication
• Road networks (3 data sets up to 1.9M vertices & 2.8M edges) – Nodes represent intersections and edges roads connecting the intersections
• Autonomous systems (5 data sets up to 26K vertices & 106K edges) – Graphs of the internet
• Signed networks (10 data sets up to 4.8M vertices & 69M edges) – Networks with positive and negative edges (friend/foe, trust/distrust)
• Location-based online social networks (2 data sets up to 198K vertices & 950K edges) – Social networks with geographic check-ins
• Wikipedia networks, articles, and metadata (7 data sets up to 3.5M vertices & 250M edges) – Talk, editing, voting, and article data from Wikipedia
• Twitter and Memetracker (4 data sets up to 96M vertices & 476M edges) – Memetracker phrases, links and Tweets
• Online communities (3 data sets up to 2.3M images) – Data from online communities such as Reddit and Flickr
• Online reviews (6 data sets up to 34M product reviews) – Data from online review systems such as BeerAdvocate and Amazon
Stanford Large Network Dataset Collection snap.stanford.edu/data
Graph Challenge - 9
Publicly Available Datasets Initial Datasets in Blue
AWS Public Data Sets aws.amazon.com/public-data-sets
• Astronomy (1 data set, 180 GB) – Sloan Digital Sky Survey SQL MDF files
• Biology (4 data sets up to 200 TB) – Genome sequence data
• Climate (3 data sets, size growing daily) – Satellite imagery data
• Economics (10 data sets up to 220 GB) – Census, transaction, and transportation data
• Encyclopedic (10 data sets up to 541 TB - Common Crawl Corpus) – Various online encyclopedia data
• Geographic (3 data sets up to 125 GB) – Street maps
• Mathematics (1 data set, 160 GB) – University of Florida Sparse Matrix Collection
Yahoo! Webscope Datasets webscope.sandbox.yahoo.com
• Advertising and Market Data (4 data sets up to 3.7 GB) – Yahoo!'s auction-based platform for selling advertising space
• Competition Data (3 data sets up to 1.5 GB) – Data challenges run by Yahoo
• Computing Systems Data (5 data sets up to 8.8 GB) – Computer systems log data
• Graph and Social Data (3 data sets up to 5 GB) – Graph data from search, groups, and webpages
• Image Data (3 data sets up to 14 GB) – Flickr imagery and metadata
• Language Data (29 data sets up to 166 GB - Answers browsing behavior) – Wide range of question/answer data sets
• Ratings and Classification Data (10 data sets up to 83 GB; 1.5 TB dataset withdrawn) – Community preferences and data
Graph500.org Data Generator
• Used to generate world's largest power law graphs
• Can be modeled with Kronecker Product of a Recursive MATrix (R-MAT): $G^{\otimes k} = G^{\otimes k-1} \otimes G$ – Where "$\otimes$" denotes the Kronecker product of two matrices
• R-MAT: A Recursive Model for Graph Mining, Chakrabarti, Zhan & Faloutsos (2004)
• Mathematically Tractable Graph Generation and Evolution, Leskovec, Chakrabarti, Kleinberg & Faloutsos (2005)
Tools to generate (at various scales and parameters) data sets will be provided as part of the challenge
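The Kronecker recursion above can be illustrated with NumPy. This is a dense toy sketch, not the Graph500 generator itself: the helper names `kronecker_power` and `sample_edges` are hypothetical, and the 2x2 seed uses the R-MAT parameters A = 0.57, B = C = 0.19, D = 0.05 quoted in the embedded article's Table 2. Real generators sample edges recursively without ever forming the full matrix.

```python
import numpy as np


def kronecker_power(G: np.ndarray, k: int) -> np.ndarray:
    """Return the k-th Kronecker power of G, built as G^(k) = G^(k-1) (x) G."""
    if k < 1:
        raise ValueError("k must be at least 1")
    result = G
    for _ in range(k - 1):
        result = np.kron(result, G)
    return result


def sample_edges(P: np.ndarray, num_edges: int, seed: int = 0) -> np.ndarray:
    """Draw (row, col) pairs with probability proportional to the entries of P."""
    rng = np.random.default_rng(seed)
    flat = rng.choice(P.size, size=num_edges, p=P.ravel() / P.sum())
    rows, cols = np.unravel_index(flat, P.shape)
    return np.column_stack((rows, cols))


# Assumed seed matrix: R-MAT quadrant probabilities [[A, B], [C, D]].
rmat_seed = np.array([[0.57, 0.19],
                      [0.19, 0.05]])

P = kronecker_power(rmat_seed, 3)   # 8 x 8 edge-probability matrix (scale 3)
edges = sample_edges(P, 16)         # 16 sampled directed edges
print(P.shape, edges[:4])
```

Because the seed concentrates probability in one quadrant, repeated Kronecker products produce the skewed, power-law-like degree distribution the slide refers to.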
Graph Challenge - 10
GraphChallenge Data Formats
• Public data is available in a variety of formats – Linked list, tab separated, labeled/unlabeled
• Requires parsing and standardization
• Proposed formats – Tab separated triples in ASCII file with labels removed – MMIO ASCII format: math.nist.gov/MatrixMarket
The Matrix Market Exchange Formats: Initial Design, Boisvert, Pozo & Remington (1996)
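A minimal sketch of the two proposed formats, using only the Python standard library. The exact column layout of the challenge files is an assumption here: each line is taken to be `row<TAB>column<TAB>value` with 1-based integer vertex ids, and the helper names `read_triples` and `to_matrix_market` are hypothetical. The Matrix Market output uses the standard coordinate header line.

```python
import io
from typing import List, TextIO, Tuple

Triple = Tuple[int, int, int]


def read_triples(stream: TextIO) -> List[Triple]:
    """Parse tab separated (row, column, value) triples, skipping comments."""
    triples = []
    for line in stream:
        line = line.strip()
        if not line or line.startswith(("#", "%")):
            continue
        row, col, value = line.split("\t")
        triples.append((int(row), int(col), int(value)))
    return triples


def to_matrix_market(triples: List[Triple], num_vertices: int) -> str:
    """Render the triples as a Matrix Market coordinate file."""
    lines = ["%%MatrixMarket matrix coordinate integer general",
             f"{num_vertices} {num_vertices} {len(triples)}"]
    lines += [f"{r} {c} {v}" for r, c, v in triples]
    return "\n".join(lines) + "\n"


sample = "1\t5\t1\n5\t1\t1\n2\t4\t1\n4\t2\t1\n"
triples = read_triples(io.StringIO(sample))
mm_text = to_matrix_market(triples, num_vertices=5)
print(mm_text)
```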
[Figure panels: Wikipedia Voting, Wikipedia Adminship, Road Network, Twitter Data]
Amazon cloud services will be used to host particularly large data sets
Graph Challenge - 11
Outline
• Introduction
• Data Sets
• Static Graph Isomorphism Challenge
• Metrics
• Summary
Graph Challenge - 12
Static Subgraph Isomorphism Problem
Vertex-Based
• Given a sub-graph H and a larger graph G
• Is there a 1-to-1 mapping of the vertices in H to vertices in G such that every edge in H is also in G ?
Array-Based
• Given sub-graph adjacency array B and a larger graph adjacency array A
• A subgraph isomorphism exists iff there is a permutation array P where $B = P^{T} A P$
Bold uppercase indicates a matrix, Bold lowercase indicates a vector, lowercase indicates scalars, sets, ...
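The vertex-based definition can be checked by brute force on small inputs. The sketch below is an assumption-laden illustration, not a challenge reference implementation: `find_subgraph_mapping` is a hypothetical name, the 5-vertex graph is the example from the triangle counting slides, and the array check tests that every edge of B appears in $P^{T} A P$, which is the vertex-based condition and is slightly weaker than strict equality.

```python
from itertools import permutations
from typing import Optional, Tuple

import numpy as np


def find_subgraph_mapping(B: np.ndarray, A: np.ndarray) -> Optional[Tuple[int, ...]]:
    """Return image, where image[i] is the vertex of G assigned to vertex i of H,
    such that every edge of H maps onto an edge of G; None if no mapping exists."""
    h, g = B.shape[0], A.shape[0]
    for image in permutations(range(g), h):          # all 1-to-1 vertex mappings
        if all(A[image[i], image[j]]
               for i in range(h) for j in range(h) if B[i, j]):
            return image
    return None


# Larger graph G: 5-vertex example with edges 1-5, 2-4, 2-5, 3-4, 3-5, 4-5.
A = np.array([[0, 0, 0, 0, 1],
              [0, 0, 0, 1, 1],
              [0, 0, 0, 1, 1],
              [0, 1, 1, 0, 1],
              [1, 1, 1, 1, 0]])
triangle = np.ones((3, 3), dtype=int) - np.eye(3, dtype=int)   # H = triangle
clique4 = np.ones((4, 4), dtype=int) - np.eye(4, dtype=int)    # H = 4-clique

image = find_subgraph_mapping(triangle, A)
# Selection array P: column i is the unit vector of the chosen vertex image[i].
P = np.eye(A.shape[0], dtype=int)[:, list(image)]
print(image, bool(np.all(triangle <= P.T @ A @ P)))
print(find_subgraph_mapping(clique4, A))   # G contains no 4-clique
```

The search visits up to |G|!/(|G| - |H|)! candidate mappings, so it is only usable for very small H.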
Graph Challenge - 13
Scaling Observations
• Two key parameters govern the complexity of subgraph isomorphism – Size of G and H – Nominal complexity is $|G|^{|H|}$
• Large G and H may be too challenging
• Large G or H is more practical
• Small G (G and H of the same order) is less interesting – Focus of program is large graphs
• Large G and small H (G is much larger than H) is more interesting – Example: H is a triangle (i.e., triangle finding algorithm) – Example: H is a k-truss (i.e., k-truss algorithm)
Graph Challenge - 14
Triangle Counting
• Triangles are the most basic non-trivial subgraph – Set of three mutually adjacent vertices in a graph
• Number of triangles in a graph is an important metric used in applications such as – Social network mining (e.g., clustering coefficient calculation) – Link classification and recommendation – Cyber security – Intelligence – Functional biology – Spam detection
• Example of triangles in a Graph – Triangle 1 : Nodes 3,4,5 – Triangle 2 : Nodes 2,4,5
Counting and Sampling Triangles from a Graph Stream A. Pavan, K. Tangwongsan, S. Tirthapura & Kun-Lung Wu, VLDB 2013
[Figure: example graph with vertices 1-5 and triangles 1 and 2 labeled]
Graph Challenge - 15
Triangle Counting and Enumeration - Example Algorithm -
• Lower complexity array approach
• Triangle enumeration has big I/O
• Triangle counting has minimal I/O
Article
Graphing trillions of triangles
Paul Burkhardt, US National Security Agency, USA
Information Visualization, 1–10, © The Author(s) 2016, DOI: 10.1177/1473871616666393

Abstract
The increasing size of Big Data is often heralded but how data are transformed and represented is also profoundly important to knowledge discovery, and this is exemplified in Big Graph analytics. Much attention has been placed on the scale of the input graph but the product of a graph algorithm can be many times larger than the input. This is true for many graph problems, such as listing all triangles in a graph. Enabling scalable graph exploration for Big Graphs requires new approaches to algorithms, architectures, and visual analytics. A brief tutorial is given to aid the argument for thoughtful representation of data in the context of graph analysis. Then a new algebraic method to reduce the arithmetic operations in counting and listing triangles in graphs is introduced. Additionally, a scalable triangle listing algorithm in the MapReduce model will be presented followed by a description of the experiments with that algorithm that led to the current largest and fastest triangle listing benchmarks to date. Finally, a method for identifying triangles in new visual graph exploration technologies is proposed.

Keywords
Graph, scalable algorithms, triangle counting, visual analytics, parallel programming, MapReduce

Introduction
Data, data, data, and more data! Obtaining data is just the beginning of analysis, but heaping piles of raw data are rarely useful. How data are structured or represented facilitates analysis. Data must undergo transformation, translation, transmuting and other structuring, categorization, and organization operations to enable intelligent analysis. This includes making data more concise, more relevant, and structured to improve the performance of a query or algorithm. Ultimately, data form the raw material we need to pave the roads to knowledge, and visualization is the interface to insight. The entanglement between data representation and analysis is best exemplified in the graph-theoretic study of complex networks.

With the era of Big Data [1] comes the emergence of Big Graphs, [2] and while there has been attention to improving the capacity of graph primitives, such as Breadth-First Search, [3,4] it should not be overlooked that the answers to even simple questions, the algorithm output itself, can be tremendously large and complex. Consider the task of identifying all mutually connected triplets in a graph or network of entities, often depicted as a triangle where each vertex of the triangle represents an entity that is connected to the other entities by the edges of the triangle. Finding all such triangles in a graph requires a lot of computation and the number of triangles can be many times more than the number of entities. Despite a triangle being a simple graphic, visualizing all triangles in a graph is very difficult because the triangles overlap and are easily obscured.

Computing triangles in graphs is a fundamental operation in graph theory and has wide application in the analysis of social networks, [5] identifying dense subgraphs [6] and network motifs, [7] spam detection, [8] and uncovering hidden thematic layers in the World Wide Web. [9] Yet, given the attention to finding, counting, ...

Corresponding author: Paul Burkhardt, National Security Agency, 9800 Savage Road, Fort Meade, MD 20755, USA.
[Article excerpt, p. 8]
Graph500 (www.graph500.org) benchmark parameters. Table 2 lists the sizes of the graphs and the triangle counts computed by the algorithm. The results of this benchmark are given in Figure 8. The top performance was listing 1.6 billion triangles per minute! This was achieved on the RMAT scale 30 graph with $\Delta(G) > 200$ billion, which was 1.8 TB in size; to the best of this author's knowledge it is the largest and fastest triangle listing experiment. The RMAT 28 and larger graphs have a maximum degree on the order of $10^6$ but because of the aforementioned load-balancing technique, the neighbor pairing for high-degree vertices was distributed across many reduce tasks, demonstrating the efficacy of the approach. Moreover, the number of 2-paths output from the RMAT 30 graph was nearly two trillion and for the RMAT 32 graph exceeded eleven trillion. The experiment on the RMAT 32 graph did not complete owing to time constraints and not failure.

Visualizing trillions of triangles
Having established an algorithm capable of listing potentially trillions of triangles in a graph, the next step is to identify how to visualize the triangles. To begin, we need a scalable visual analytics platform but there are very few tools designed for large-scale graph exploration. The GreenHornet [26] visual analytics tool is ...

Figure 7. Triangle listing results for SNAP datasets.
Figure 8. Triangle listing results for Graph500 datasets.

Table 2. Graph500 RMAT graphs (A = 0.57, B = C = 0.19, D = 0.05).
Graph      n (vertices)     2m (edges)        Δ(G)
RMAT 24    16,777,216       268,435,456       2,127,854,845
RMAT 26    67,108,864       1,073,741,824     9,838,965,401
RMAT 28    268,435,456      4,294,967,296     44,970,850,296
RMAT 30    1,073,741,824    17,179,869,184    203,333,933,183
RMAT 32    4,294,967,296    68,719,476,736    < 11,068,947,204,458
[Article excerpt, p. 5]
which every vertex is connected to every other vertex, and a cycle, i.e. a path with identical endpoints. Triangles in graphs are useful for complex network analysis based solely on structural, i.e. adjacency, information. Let $\Delta(v)$ be the local count of triangles involving vertex $v$ and $\Delta(G)$ be the global count of triangles in $G$. There can be many more triangles than vertices and edges, and if the graph is completely connected, where $G$ itself is a clique, the total number of triangles, $\Delta(G)$, is

$$\Delta(G) = \binom{n}{3} \in \Theta(n^3)$$

The triangles in a graph can be easily identified by testing all possible triples, $\{u, v, w\}$, for $\{u, v\}, \{v, w\}, \{u, w\} \in E$, but this takes $O\!\left(\binom{n}{3}\right) \in O(n^3)$ time. If most of the triples were triangles and the task was to enumerate all triangles, then the approach is optimal. But generally this brute force method is highly inefficient. We will briefly review the standard method for counting triangles leading to a new matrix approach, then transition into the description of a new algorithm that can potentially enumerate trillions of triangles.

Counting triangles
Since cycles of length $n$ are diagonal elements in $A^n$ and a triangle is a cycle of length 3, then it is well known that the local and global triangle counts can be calculated by

$$\Delta(v) = \frac{1}{2} A^3_{vv}$$

$$\Delta(G) = \frac{1}{6} \sum_{v} A^3_{vv} = \frac{1}{6} \mathrm{Tr}\left(A^3\right)$$

where the symbol Tr will refer to the trace operator of a matrix. The $\frac{1}{2}$ factor in $\Delta(v)$ is to account for the cycles from $\{u, w\}$ incident to $v$, and the factor $\frac{1}{6} = \frac{1}{2} \times \frac{1}{3}$ in $\Delta(G)$ is because each triangle is counted thrice.

Computing triangles by $A^3$ by fast matrix multiplication admits the best runtime of $n^{\omega + o(1)}$, where $\omega \le 2.3728639$ is the fast matrix product exponent. [14] But the classic matrix product is used in practice, therefore practical improvements would be beneficial. A new matrix approach developed in 2013 by this author reduces the total number of arithmetic operations. [15] Here the element-wise multiplication operator for matrices is denoted by the $\circ$ operator.

Theorem 1. Given $G$ and the Hadamard product $(A^2 \circ A)$, then $\Delta(v) = \frac{1}{2} \sum_{v} (A^2 \circ A)_v$ and $\Delta(G) = \frac{1}{6} \sum_{ij} (A^2 \circ A)_{ij}$.

Proof. The paths of length two, i.e. 2-paths, are identified by $A^2$; then element-wise multiplication $A \circ A^2$ retains only those $\{v, u\}$ endpoints of 2-paths that are also the $\{v, u\} \in E$ endpoints of edges. Then the sum of elements in the $v$th column vector of $A^2 \circ A$ is the number of 3-cycles involving $v$. Using $\Delta(v) = \frac{1}{2} A^3_{vv}$ leads to

$$\Delta(v) = \frac{1}{2} A^3_{vv} = \frac{1}{2} \sum_{v} \left(A^2 \circ A\right)_v \qquad (1)$$

Now recall $\mathrm{Tr}(XY^{T}) = \sum_{ij} (X \circ Y)_{ij}$; then $\mathrm{Tr}(A^2 A^{T}) = \sum_{ij} (A^2 \circ A)_{ij}$. Finally, since $\Delta(G) = \frac{1}{6} \mathrm{Tr}(A^3)$, then

$$\Delta(G) = \frac{1}{6} \mathrm{Tr}\left(A^3\right) = \frac{1}{6} \mathrm{Tr}\left(A^2 A^{T}\right) = \frac{1}{6} \sum_{ij} \left(A^2 \circ A\right)_{ij} \qquad (2)$$

The corollary to this is that the number of arithmetic operations is cut in half when using classic matrix multiplication.

Corollary 1. The number of arithmetic operations for counting all triangles in $G$ by Theorem 1 using sparse matrix representation is $2N(n + 1) - 1 = O(mn)$, where $N = 2m$ is the number of non-zeros, which is approximately half the number of arithmetic operations if counting by $\mathrm{Tr}(A^3)$ using sparse matrix products.

Proof. There are a total of $2N$ multiplication and addition operations for a sparse matrix–vector product. Then $A^2$ is possible by direct application of sparse matrix–vector multiplication of $A$ with all $n$ column vectors in $A$ for a total of $2Nn$ arithmetic operations. Then $A \circ A^2$ takes $N$ operations for the element-wise multiplication and, finally, summing all non-zero elements from this is an additional $(N - 1)$ operations. Thus, $\sum_{ij} (A^2 \circ A)_{ij}$ requires $2N(n + 1) - 1 = O(mn)$ arithmetic operations, since $N = 2m$. In contrast, computing $A^3$ takes $4Nn$ operations and $\mathrm{Tr}(A^3)$ adds an extra $(N - 1)$ operations totaling $4Nn + N - 1$, which is of the order of twice the number of arithmetic operations, as

$$\frac{2Nn + 2N - 1}{4Nn + N - 1} \approx \frac{2Nn}{4Nn} = \frac{1}{2}$$

Using similar logic, it is easy to see that this result also holds for dense graphs and the local triangle count, $\Delta(v)$.
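The trace formula and the Hadamard-product formula from the article excerpt can be compared numerically. The following is a small dense NumPy sketch, assuming the 5-vertex example graph from the triangle counting slide (edges 1-5, 2-4, 2-5, 3-4, 3-5, 4-5); the function names are hypothetical, and a scalable implementation would use sparse matrices instead.

```python
import numpy as np

# Adjacency array of the example graph (vertices 1..5 map to indices 0..4).
A = np.array([[0, 0, 0, 0, 1],
              [0, 0, 0, 1, 1],
              [0, 0, 0, 1, 1],
              [0, 1, 1, 0, 1],
              [1, 1, 1, 1, 0]])


def triangles_trace(A: np.ndarray) -> int:
    """Global count via Tr(A^3) / 6."""
    return int(np.trace(A @ A @ A)) // 6


def triangles_hadamard(A: np.ndarray) -> int:
    """Global count via sum_ij (A^2 o A)_ij / 6, avoiding the second matrix product."""
    return int(((A @ A) * A).sum()) // 6


def local_triangles(A: np.ndarray) -> np.ndarray:
    """Local counts: half the column sums of A^2 o A."""
    return ((A @ A) * A).sum(axis=0) // 2


print(triangles_trace(A), triangles_hadamard(A))
print(local_triangles(A))
```

Both global counts agree on the two triangles {2,4,5} and {3,4,5}, and the local counts show that vertices 4 and 5 each take part in both.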
Graph Challenge - 16
Triangle Counting using miniTri - Example Algorithm -
• Array based algorithm – Given a graph adjacency array A and incidence array E
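$C = A E^{T}$, $n_T = \mathrm{nnz}(C)/3$ (E has one row per edge and one column per vertex)
• Multiplication is overloaded such that $C(i,j) = \{i,x,y\}$ iff $A(i,x) = A(i,y) = 1$ and $E^{T}(x,j) = E^{T}(y,j) = 1$
[Figure: example graph with vertices 1-5]
Example, with the incidence array labeled B as in the miniTri figure:

$$A = \begin{pmatrix} 0 & 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 1 & 1 \\ 0 & 0 & 0 & 1 & 1 \\ 0 & 1 & 1 & 0 & 1 \\ 1 & 1 & 1 & 1 & 0 \end{pmatrix}, \quad B = E^{T} = \begin{pmatrix} 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 1 & 0 \\ 0 & 0 & 1 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 & 1 & 1 \\ 1 & 1 & 1 & 0 & 0 & 1 \end{pmatrix}$$

$$C = A E^{T} = \begin{pmatrix} \emptyset & \emptyset & \emptyset & \emptyset & \emptyset & \emptyset \\ \emptyset & \emptyset & \emptyset & \emptyset & \emptyset & \{2,4,5\} \\ \emptyset & \emptyset & \emptyset & \emptyset & \emptyset & \{3,4,5\} \\ \emptyset & \{4,2,5\} & \{4,3,5\} & \emptyset & \emptyset & \emptyset \\ \emptyset & \emptyset & \emptyset & \{5,3,4\} & \{5,2,4\} & \emptyset \end{pmatrix}$$

miniTri represented in compact linear algebra-based operations
Kokkos/Qthreads Task-Parallel Approach to Linear Algebra Based Graph Analytics, Wolf, Edwards & Olivier, IEEE HPEC 2016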
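The miniTri formulation can be sketched with ordinary integer arithmetic instead of the overloaded multiplication. This substitution is an assumption of the sketch: with standard arithmetic, entry C(i,j) of the product of A and the transposed incidence array counts how many endpoints of edge j are adjacent to vertex i, so the non-empty set entries correspond to values equal to 2. The helper name `incidence_from_edges` is hypothetical, and the edge order follows the columns of the example incidence array.

```python
import numpy as np


def incidence_from_edges(edges, num_vertices: int) -> np.ndarray:
    """Unoriented incidence array E: one row per edge, one column per vertex."""
    E = np.zeros((len(edges), num_vertices), dtype=int)
    for row, (u, v) in enumerate(edges):
        E[row, u] = 1
        E[row, v] = 1
    return E


# Example graph, 0-based vertex ids, edges in the column order of the slide.
edges = [(0, 4), (1, 4), (2, 4), (2, 3), (1, 3), (3, 4)]
E = incidence_from_edges(edges, 5)

A = E.T @ E                      # vertex-by-vertex counts with degrees on the diagonal
A = A - np.diag(np.diag(A))      # drop the diagonal to get the adjacency array

C = A @ E.T                      # C[i, j] = endpoints of edge j adjacent to vertex i
hits = np.argwhere(C == 2)       # (vertex, edge) pairs that close a triangle
num_triangles = len(hits) // 3   # each triangle is found once per vertex

print(num_triangles)
print(hits.tolist())
```

The six hits sit at the same positions as the six non-empty entries of C in the example, and dividing by three gives the two triangles.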
Graph Challenge - 17
K-Truss
• A graph is a k-truss if each edge is part of at least k-2 triangles
• A generalization of a clique (a k-clique is a k-truss), ensuring a minimum level of connectivity within the graph
• Traditional technique to find a k-truss subgraph: – Compute the support for every edge – Remove any edges with support less than k-2 and update the list of edges – When all edges have support of at least k-2, we have a k-truss
Example 3-truss
Graphulo: Linear algebra graph kernels for NoSQL databases Gadepally, Bolewski, Hook, Hutchison, Miller & Kepner, IPDPS GABB 2015
Graph Challenge - 18
K Truss: Array Formulation - Example Algorithm -
• If E is the unoriented incidence array (rows are edges and columns are vertices) of graph G, and A is the adjacency array
• If G is a k-truss, the following must be satisfied – (E A == 2) 1 ≥ (k – 2), where 1 is a vector of ones, so each entry of the left side is the support of one edge
Graphulo: Linear algebra graph kernels for NoSQL databases, Gadepally, Bolewski, Hook, Hutchison, Miller & Kepner, IPDPS GABB 2015
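Algorithm
R = E A
x = find( (R == 2) 1 < k − 2 )
while x is not empty
    E_x = E(x,∶)
    E = E(x^c,∶)
    R = R(x^c,∶)
    R = R − E ( E_x^T E_x − diag(E_x^T E_x) )
    x = find( (R == 2) 1 < k − 2 )
Here x indexes the edges whose support is below k − 2, x^c is its complement, and E_x^T E_x − diag(E_x^T E_x) is the adjacency array of the removed edges.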
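The array formulation of k-truss can be sketched with dense NumPy arrays. This is an illustrative reading of the algorithm, not the Graphulo or challenge reference code: the function names are hypothetical, the graph is assumed to be simple and undirected, and the update uses the transposed product of the removed incidence rows with themselves so that the shapes conform (an n by n adjacency array of the removed edges).

```python
import numpy as np


def incidence_from_edges(edges, num_vertices: int) -> np.ndarray:
    """Unoriented incidence array: rows are edges, columns are vertices."""
    E = np.zeros((len(edges), num_vertices), dtype=int)
    for row, (u, v) in enumerate(edges):
        E[row, u] = 1
        E[row, v] = 1
    return E


def ktruss(E: np.ndarray, k: int) -> np.ndarray:
    """Return the incidence array of the k-truss of the graph described by E."""
    A = E.T @ E
    A = A - np.diag(np.diag(A))                  # adjacency array
    R = E @ A                                    # R[e, v] == 2 iff v closes a triangle on e
    x = np.flatnonzero((R == 2).sum(axis=1) < k - 2)
    while x.size:
        keep = np.setdiff1d(np.arange(E.shape[0]), x)   # complement of x
        Ex = E[x, :]                             # rows of the edges being removed
        E = E[keep, :]
        R = R[keep, :]
        removed = Ex.T @ Ex
        removed = removed - np.diag(np.diag(removed))   # adjacency of removed edges
        R = R - E @ removed                      # same as recomputing E @ A_new
        x = np.flatnonzero((R == 2).sum(axis=1) < k - 2)
    return E


# Example graph (0-based): edges 1-5, 2-4, 2-5, 3-4, 3-5, 4-5 of the slides.
edges = [(0, 4), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
E = incidence_from_edges(edges, 5)

truss3 = ktruss(E, 3)   # drops the pendant edge 1-5, which is in no triangle
truss4 = ktruss(E, 4)   # no edge survives: the graph has no 4-truss
print(truss3.shape[0], truss4.shape[0])
```

Updating R incrementally avoids recomputing the full product after each round of removals, which is the point of the last assignment in the loop.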
Graph Challenge - 19
Outline
• Introduction
• Data Sets
• Static Graph Isomorphism Challenge
• Metrics
• Summary
Graph Challenge - 20
Metrics
• Correctness – Number of triangles is exact for a given graph – Enumerating all k-trusses is exact for a given graph
• Performance – Total number of edges in graph (edges) – Execution time (seconds) – Rate (edges/second) – Energy (Watts) – Rate per energy (edges/second/Watt)
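The derived performance metrics follow directly from the three measured quantities. The sketch below is a hypothetical helper, not a prescribed reporting tool: the class name `RunMetrics` and its field names are assumptions, and the "Energy (Watts)" figure is interpreted as the average power drawn during the run, since watts are a unit of power.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunMetrics:
    """Measured quantities for one run of a challenge kernel."""
    edges: int        # total number of edges in the graph
    seconds: float    # execution time
    watts: float      # assumed: average power draw during the run

    @property
    def rate(self) -> float:
        """Rate in edges per second."""
        return self.edges / self.seconds

    @property
    def rate_per_watt(self) -> float:
        """Rate per energy in edges per second per Watt."""
        return self.rate / self.watts


run = RunMetrics(edges=1_000_000, seconds=2.0, watts=250.0)
print(f"{run.rate:.0f} edges/second, {run.rate_per_watt:.0f} edges/second/Watt")
```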
Graph Challenge - 21
Summary
• Static sub-graph isomorphism graph challenge consists of – Triangle finding – K-Truss
• An implementation will perform these algorithms on provided data sets and report the given metrics
• A submission to the IEEE HPEC Graph Challenge consists of a conference style paper describing the approach, implementation, innovations, and results
• Both hardware and software should be described in the solution – Innovative hardware solutions are of interest (in addition to best algorithm for hardware) – Special interest in performance for very large scale data sets (1-100 trillion edges)