CS2504, Spring 2007, © Dimitris Nikolopoulos
Directory-based Cache Coherence
• Three states for a block in directory-based coherence
  – Shared: one or more processors have the data cached; memory is up to date
  – Uncached: no processor has the data; the block is not valid in any cache
  – Exclusive: exactly one processor has the data; memory is out of date
• The directory tracks which processors have the data and which state the block is in (a rough sketch of a directory entry follows below)
  – Sends messages to the sharers on a write
  – Updates the list of sharers on new accesses and write-backs
• CC-NUMA: Cache-Coherent NUMA
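The slides do not give a concrete layout for a directory entry, but as a minimal sketch (all names invented for illustration) it can be modeled as a block state plus a sharer set, here a bitmask with one bit per processor:

```python
# Hypothetical directory entry: a state plus a sharer bitmask (one bit per processor).
from enum import Enum

class BlockState(Enum):
    UNCACHED = 0    # no cache holds the block
    SHARED = 1      # one or more caches hold a read-only copy; memory is up to date
    EXCLUSIVE = 2   # exactly one cache owns the block; memory may be stale

class DirectoryEntry:
    def __init__(self):
        self.state = BlockState.UNCACHED
        self.sharers = 0                      # bit i set => processor i has a copy

    def add_sharer(self, proc: int) -> None:
        self.sharers |= 1 << proc

    def remove_sharer(self, proc: int) -> None:
        self.sharers &= ~(1 << proc)

    def sharer_list(self) -> list[int]:
        return [i for i in range(self.sharers.bit_length()) if self.sharers >> i & 1]
```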
Actions in Directory-based Coherence
• Block is in the Uncached state:
  – Read miss: the requesting processor is sent the data and becomes the only sharing node; the block state is set to Shared
  – Write miss: the requesting processor becomes the only sharer; the block is made Exclusive, indicating that the only valid copy is cached
• Block is in the Shared state:
  – Read miss: the requesting processor is sent the data and is added to the sharing set
  – Write miss: the requesting processor is sent the data and made the only sharer; the block state is set to Exclusive; all other sharers are sent invalidate messages
• Block is in the Exclusive state:
  – Read miss: the owner is sent a fetch message and sets its state to Shared; the value is written back to memory and sent to the requestor; the sharing set is updated to include the requesting processor; the directory state becomes Shared
  – Data write-back: the value is committed to memory; the state is set to Uncached
  – Write miss: the requesting processor becomes the new owner; the sharing set is reset to contain only that processor; the block remains Exclusive
(A state-machine sketch of these transitions follows below.)
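A minimal sketch of the transitions listed above, assuming one such object per memory block and a plain dict per block standing in for the processors' caches; class and method names are invented, not the slides' notation:

```python
# Sketch of the directory's reaction to read misses, write misses, and write-backs.
from enum import Enum

class State(Enum):
    UNCACHED = 0
    SHARED = 1
    EXCLUSIVE = 2

class Directory:
    def __init__(self):
        self.state = State.UNCACHED
        self.sharers = set()     # processors currently holding a copy
        self.memory = 0          # the memory copy of the block

    def read_miss(self, proc, caches):
        if self.state == State.EXCLUSIVE:
            owner = next(iter(self.sharers))
            self.memory = caches[owner]       # fetch: the owner writes its value back
            self.state = State.SHARED
        self.sharers.add(proc)
        caches[proc] = self.memory            # data reply to the requestor
        if self.state == State.UNCACHED:
            self.state = State.SHARED

    def write_miss(self, proc, caches, value):
        if self.state == State.EXCLUSIVE:
            owner = next(iter(self.sharers))
            self.memory = caches[owner]       # fetch the old owner's value first
        for p in self.sharers - {proc}:
            caches.pop(p, None)               # invalidate all other sharers
        self.sharers = {proc}
        caches[proc] = value
        self.state = State.EXCLUSIVE

    def write_back(self, proc, caches):
        self.memory = caches.pop(proc)        # commit the owner's value to memory
        self.sharers.discard(proc)
        self.state = State.UNCACHED
```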
Example

A1 and A2 map to the same cache block; memory initially contains 0 at both addresses. Message abbreviations: WrMs = write miss, RdMs = read miss, Ftch = fetch, Inval. = invalidate, DaRp = data reply, WrBk = write back.

step                | P1 cache     | P2 cache     | Interconnect message | Directory         | Memory
P1: Write 10 to A1  |              |              | WrMs  P1 A1          | A1 Excl. {P1}     |
                    | Excl. A1 10  |              | DaRp  P1 A1 0        |                   |
P1: Read A1         | Excl. A1 10  |              |                      |                   |
P2: Read A1         |              | Shar. A1     | RdMs  P2 A1          |                   |
                    | Shar. A1 10  |              | Ftch  P1 A1 10       |                   | 10
                    |              | Shar. A1 10  | DaRp  P2 A1 10       | A1 Shar. {P1,P2}  | 10
P2: Write 20 to A1  |              | Excl. A1 20  | WrMs  P2 A1          |                   | 10
                    |              |              | Inval. P1 A1         | A1 Excl. {P2}     | 10
P2: Write 40 to A2  |              |              | WrMs  P2 A2          | A2 Excl. {P2}     | 0
                    |              |              | WrBk  P2 A1 20       | A1 Unca. {}       | 20
                    |              | Excl. A2 40  | DaRp  P2 A2 0        | A2 Excl. {P2}     | 0

(Original figure: Processor 1 and Processor 2, each with a cache, connected through the interconnect to the memory and its directory.)
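As an illustration only, the Directory sketch shown after the "Actions" list can be driven through the same five references; the comments mirror the table rows, and the per-block dicts stand in for P1's and P2's caches.

```python
# Hypothetical trace of the example: one Directory object and one cache dict per block.
caches_A1, caches_A2 = {}, {}
dir_A1, dir_A2 = Directory(), Directory()

dir_A1.write_miss(1, caches_A1, 10)  # P1: Write 10 to A1 -> A1 Excl. {P1}
                                     # P1: Read A1 hits in P1's cache, no directory action
dir_A1.read_miss(2, caches_A1)       # P2: Read A1 -> A1 Shar. {P1,P2}, memory = 10
dir_A1.write_miss(2, caches_A1, 20)  # P2: Write 20 to A1 -> A1 Excl. {P2}, P1 invalidated
dir_A1.write_back(2, caches_A1)      # A2 evicts A1 from P2's cache -> A1 Unca. {}, memory = 20
dir_A2.write_miss(2, caches_A2, 40)  # P2: Write 40 to A2 -> A2 Excl. {P2}

print(dir_A1.state, dir_A1.memory)   # State.UNCACHED 20
print(dir_A2.state, caches_A2[2])    # State.EXCLUSIVE 40
```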
Clusters
• A computer cluster is a group of tightly coupled computers that work together so closely that in many respects they can be viewed as a single computer
(Figure: Computer 1 through Computer N, each with a CPU, memory, and NIC, connected by a high-speed interconnect; shared storage is optional.)
Clusters
• Many applications can execute on these machines
  – Not as tightly coupled as SMPs or even CC-NUMA
  – Databases, web servers, multiprogramming, parallel programs
• These applications often require high availability
  – This mandates fault tolerance and reparability
• Clusters are composed of commercial off-the-shelf (COTS) components
  – COTS keeps prices very low for the processing power
• Most machines in the Top500 are clusters
Drawbacks of Clusters
• The cost of administering an N-machine cluster is comparable to administering N independent machines
  – The administration cost of a large MP is comparable to that of a single machine
• Slow connection through the I/O bus
  – SMPs communicate over the memory bus
• Single-threaded applications can use only 1/N of the available memory
Advantages of Clusters
• Biggest advantage: COST!
  – Made from commodity parts (computers, switches)
• Hot-swappability
  – When parts fail, they can be replaced without bringing down the whole system
  – This is very important when there are many parts
• The same properties allow for easy expandability
• Many people opt for hybrid clusters of SMPs
  – Called a constellation
Network Topologies
• There are MANY proposed network topologies
  – Many evaluation metrics
    • Latency on an unloaded network, message throughput, variation in performance
  – Many measures of cost
    • Number of switches, number of links, bits per link, link length
• Bandwidth is measured to compare topologies (worked formulas follow below)
  – Total network bandwidth: the best case
  – Bisection bandwidth: the worst case
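As a worked note (these are standard definitions, not spelled out in the slides): with per-link bandwidth $B$, $L$ links in total, and $L_{\mathrm{cut}}$ links crossing a worst-case bisection,

\[ BW_{\mathrm{total}} = L \cdot B, \qquad BW_{\mathrm{bisection}} = L_{\mathrm{cut}} \cdot B. \]

For a ring of $P$ nodes, $L = P$ and $L_{\mathrm{cut}} = 2$; for a fully connected network, $L = P(P-1)/2$ and $L_{\mathrm{cut}} = P^2/4$.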
Example Topologies
• The two most obvious topologies:
  – Ring topology: cheap and slow
  – Fully connected network: expensive and fast
Example Topologies
�Engineers seek to find something in between
−2D Grid or Mesh
N-cube tree
−Frequently, additional links are added to one of
these simple strategies to improve performance
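A small sketch comparing the cost and bisection width of these topologies; the link-count formulas are standard ones assumed here, not taken from the slides.

```python
# Link count and bisection width (links cut by a worst-case bisection) for P nodes.
# Multiply either count by the per-link bandwidth to get the metrics from the previous slide.
import math

def ring(p):
    return {"links": p, "bisection_links": 2}

def fully_connected(p):
    return {"links": p * (p - 1) // 2, "bisection_links": (p // 2) ** 2}

def mesh_2d(p):
    s = math.isqrt(p)                  # assumes p is a perfect square
    return {"links": 2 * s * (s - 1), "bisection_links": s}

for p in (16, 64):
    print(p, "nodes:", "ring", ring(p), "| full", fully_connected(p), "| mesh", mesh_2d(p))
```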
Interconnection Networks
• 10 Gbit Ethernet: an extension of traditional Ethernet
• Very high CPU overhead
  – Results in long latency
• Compatible with older Ethernet
  – Can have a system with both
• Uses standard Ethernet cable
Interconnection Networks
• Myrinet reduces protocol overhead
  – Often used directly by the application, bypassing the operating system
• Closer to peak network throughput
  – Low latency is very important for HPC applications
• Two fibre-optic cables: upstream and downstream
• Decreasing in popularity on the Top500 list
• Myri-10G: 10 Gbit/s
• Fault tolerant: can operate with a subset of hosts
• Designed specifically for clusters
Interconnection Networks
• InfiniBand
• A standardization effort that is just now coming to life
• Designed for many purposes
  – Cluster interconnect
  – System bus
  – Would allow the network to connect directly to the bus
• Very high throughput, perhaps the highest
  – Medium latency with OS bypass
Interconnection Performance Results
Graph borrowed from “Cluster Interconnect Overview”
VT's System X
• System X
  – Debuted at #3, now at #47
  – 3rd-fastest academic supercomputer
  – www.tcf.vt.edu/index.html
• Built from 1100 Apple Xserve G5 cluster nodes
  – Each node has 2 CPUs, 4 GB of memory, and an 80 GB hard drive
• Networking with InfiniBand (10 Gb/s)
• Runs Apple Mac OS X
• Max GFLOPS: 12,250
Checkpointing
• Applications executing on parallel systems may run for days, weeks, or even months
  – Need to ensure that nothing goes wrong
  – Don't want to lose that much work on a failure
• One solution is checkpointing (a minimal sketch follows below)
  – Take periodic images of the system state
  – In case of failure, the application can be restarted from the last image
  – Only the work since the last checkpoint is lost
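A minimal application-level checkpointing sketch; the slides describe only the idea, so the file name, interval, and "work" loop below are all invented for illustration.

```python
# Periodically serialize the state to disk; on restart, resume from the last checkpoint.
import os
import pickle

CHECKPOINT_FILE = "state.ckpt"      # assumed file name
CHECKPOINT_EVERY = 1000             # iterations between checkpoints (assumed)

def save_checkpoint(state):
    tmp = CHECKPOINT_FILE + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, CHECKPOINT_FILE)            # atomic rename: a crash never leaves a torn file

def load_checkpoint():
    if os.path.exists(CHECKPOINT_FILE):
        with open(CHECKPOINT_FILE, "rb") as f:
            return pickle.load(f)
    return {"iteration": 0, "result": 0.0}      # fresh start

state = load_checkpoint()
for i in range(state["iteration"], 1_000_000):
    state["result"] += 1.0 / (i + 1)            # stand-in for the real computation
    state["iteration"] = i + 1
    if state["iteration"] % CHECKPOINT_EVERY == 0:
        save_checkpoint(state)
```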
Checkpointing in Distributed Systems
• Machines WILL fail regularly in large systems
  – Left unchecked, a program that requires one day to execute on a large distributed system will NEVER finish
• Can take snapshots of all processes in an application
  – When a node fails, its work can be migrated to another system
• Checkpointing is expensive each time
  – Must balance the checkpointing overhead against the cost of lost work (a rule-of-thumb calculation follows below)
• Evergrid: company founded by Dr. Srinidhi Varadarajan at VT
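The slides state only the tradeoff; one common rule of thumb not given in the slides is Young's approximation, which picks a checkpoint interval near $\sqrt{2CM}$ for checkpoint cost $C$ and mean time between failures $M$. A quick calculation with invented numbers:

```python
# Young's approximation for the checkpoint interval (a standard rule of thumb, not from the slides).
import math

C = 5 * 60        # 5 minutes to write one checkpoint (assumed)
M = 12 * 3600     # one failure every 12 hours on average (assumed)

T_opt = math.sqrt(2 * C * M)
print(f"Checkpoint roughly every {T_opt / 60:.0f} minutes")   # ~85 minutes
```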
Grid Computing
• Grid computing is an emerging computing model that distributes processing across a parallel infrastructure of loosely connected computers
Grid Computing
• Models a giant, geographically distributed computer
  – Essentially a loosely connected cluster
  – Connected over the Internet rather than a dedicated interconnect
• Composed of both regular desktops and servers
  – Can be heterogeneous: no restrictions on architecture
• Not all applications can be ported to "the grid"
  – Suitable applications require large amounts of computation
  – ...and very little interprocessor communication
Grid Computing
• Condor: a cycle scavenger
  – Coordinates sharing of distributed, idle resources
  – Submit jobs to Condor; they may run on any idle machine
  – When your machine is idle, others' jobs may run on it
  – Can use checkpointing to migrate jobs to other machines
(Figure: your parallel job is placed on idle machines; machines that are in use are skipped.)
Grid Computing
• Donating your cycles to a good cause
  – Perform some computation to help with a larger effort
• Several screensavers available
  – SETI@Home
    • Analyzes radio-telescope data to search for extraterrestrial life
  – Folding@Home
    • Computes protein-folding properties in the search for cures to common diseases
    • 990 GFLOPS achieved
Google's Cluster of PCs
• Google requires continuous uptime
  – Users around the world submit queries all the time
  – Services about 1000 queries a second (more now)
  – Latency cannot exceed users' patience (0.5 sec)
• Google also continuously crawls the web
  – Keeps page information up to date
  – Stores a local copy of most pages to provide snippets
• Requires a LOT of compute power and storage
Google's Cluster of PCs
• To provide such service, Google maintains an estimated 450,000 servers at locations around the world
  – Uses low-end machines to save money
  – From 533 MHz Celerons to dual 1.4 GHz Pentiums
  – One or more 80 GB hard disks per server
  – 2–4 GB of memory per machine
• Many locations provide better service
  – Lower latency for international queries
  – Higher availability in case of a problem
Google's Cluster of PCs
• Even if the system can survive, we don't want to lose data when machines crash
  – The idea is to "store data reliably even in the presence of unreliable machines" – Jeffrey Dean, Google
• Google replicates data in a sophisticated way
  – Files are replicated over many "chunkservers"
  – A "master" server keeps track of where data is stored
Summary
• You learned:
  – Many different ways of categorizing multiprocessors, and what the categories mean
  – Issues relating to creating parallel programs
    • Writing parallel programs
    • Synchronization
  – Maintaining cache coherence in multiprocessors
    • Snoopy- and directory-based approaches
  – What clusters are and how they can be used