March 17, 2006 Zhiyi’s RSL 1
VODCA: View-Oriented, Distributed, Cluster-based Approach to parallel computing
Dr Zhiyi Huang
Dept of Computer Science
University of Otago
New Zealand

Motivation
- DSM applications are not as efficient as MPI on cluster computers
[Figure: chart comparing TMK and MPI at 2-p, 4-p, 8-p, 16-p, and 32-p; vertical axis ticks from 0 to 25]

VOPP
- VODCA is a system supporting View-Oriented Parallel Programming (VOPP)
- Why a new programming style?
  - Improve the performance of DSM applications on cluster computers
  - Provide a programming style better than MPI
  - Message passing is notorious as a difficult programming style

What is a view?
- Suppose $M$ is the set of data objects in shared memory
- A view is a group of data objects from the shared memory
  - $\forall V,\ V \subseteq M$
- Views must not overlap each other
  - $\forall V_i, V_j,\ i \neq j:\ V_i \cap V_j = \emptyset$
- Suppose there are $n$ views in shared memory
  - $\bigcup_{i=1}^{n} V_i = M$
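The three conditions on views (each view lies inside the shared memory, views do not overlap, and all views together cover the shared memory) can be checked mechanically. The sketch below is only an illustration under stated assumptions: data objects are modelled as integer indices 0..m_size-1, and `view_t`, `MAX_OBJECTS`, and `views_partition_memory` are hypothetical names, not part of VODCA.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_OBJECTS 64   /* illustrative upper bound on the size of M */

/* Hypothetical model of a view: a list of data-object indices. */
typedef struct {
    const int *objects;  /* indices of the data objects in this view */
    size_t count;        /* number of objects in this view */
} view_t;

/* Returns true when the views are pairwise disjoint and together contain
 * every object 0..m_size-1, i.e. when they partition M. */
bool views_partition_memory(const view_t *views, size_t n_views, size_t m_size)
{
    bool seen[MAX_OBJECTS] = { false };
    size_t covered = 0;

    assert(m_size <= MAX_OBJECTS);
    for (size_t i = 0; i < n_views; i++) {
        for (size_t k = 0; k < views[i].count; k++) {
            int obj = views[i].objects[k];
            if (obj < 0 || (size_t)obj >= m_size)
                return false;   /* the view is not contained in M */
            if (seen[obj])
                return false;   /* object already claimed: views overlap */
            seen[obj] = true;
            covered++;
        }
    }
    return covered == m_size;   /* every object of M belongs to some view */
}
```

With four objects, the views {0, 1} and {2, 3} pass the check, while {0, 1} and {1, 2} fail because object 1 appears in both.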

VOPP Requirements
- The programmer should divide the shared data into a number of views according to the data flow of the parallel algorithm.
- A view should consist of data objects that are always processed as an atomic set in a program.
- Views can be created and destroyed anytime.
- Each view has a unique view identifier.

VOPP Requirements (cont.)
- View primitives such as acquire_view and release_view must be used when a view is accessed.

    acquire_view(View_A);
    A = A + 1;
    release_view(View_A);

- acquire_Rview and release_Rview can be used when a view is only read by a processor.
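The difference between the two pairs of primitives can be illustrated with a small single-process model. This is a sketch under assumptions, not VODCA's implementation: it assumes that read-only acquisitions of the same view may overlap while an exclusive acquisition excludes everything else, and it uses hypothetical non-blocking `try_` variants (the real primitives would presumably wait instead of failing). The `view_state_t` bookkeeping is also invented for the example.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_VIEWS 16   /* illustrative limit on view identifiers */

/* Hypothetical per-view bookkeeping; not VODCA's real data structure. */
typedef struct {
    bool writer;   /* true while the view is held exclusively */
    int readers;   /* number of current read-only holders */
} view_state_t;

static view_state_t views[MAX_VIEWS];

/* Exclusive acquisition: succeeds only if nobody holds the view. */
bool try_acquire_view(int id)
{
    assert(id >= 0 && id < MAX_VIEWS);
    if (views[id].writer || views[id].readers > 0)
        return false;
    views[id].writer = true;
    return true;
}

/* Give up exclusive access to the view. */
void release_view(int id)
{
    assert(id >= 0 && id < MAX_VIEWS && views[id].writer);
    views[id].writer = false;
}

/* Read-only acquisition: assumed to succeed unless a writer holds the view. */
bool try_acquire_Rview(int id)
{
    assert(id >= 0 && id < MAX_VIEWS);
    if (views[id].writer)
        return false;
    views[id].readers++;
    return true;
}

/* Give up one read-only hold on the view. */
void release_Rview(int id)
{
    assert(id >= 0 && id < MAX_VIEWS && views[id].readers > 0);
    views[id].readers--;
}
```

In this model a second `try_acquire_view(1)` fails while the first holder has not released the view, whereas two `try_acquire_Rview(1)` calls can both succeed. That is why read-only access is worth declaring separately.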

Example
- A VOPP program for a producer/consumer problem

    if (prod_id == 0) {
        acquire_view(1);     /* exclusive access to view 1 */
        produce(x);
        release_view(1);
    }
    barrier(0);              /* wait until x has been produced */
    acquire_Rview(1);        /* read-only access to view 1 */
    consume(x);
    release_Rview(1);

Advantages of VOPP
- Keep the convenience of shared memory programming
- Focus on data partitioning and data access instead of data races and mutual exclusion
  - View primitives automatically achieve mutual exclusion
  - View primitives are not an extra burden
- The programmer can finely tune the parallel algorithm by careful view partitioning

Philosophy of VOPP
- Shared memory is a critical resource that needs to be used with care
  - If there is no need to use shared memory, don’t use it
  - A view should be justified before it is created

VOPP vs. MPI
- Easier for programmers than MPI
  - For problems like a task queue, programming with MPI is horrific.
- Can mimic any finely-tuned MPI program
  - Shared message -> view
  - Send/recv -> acquire_view
- Essential differences
  - View is location transparent
  - More barriers in VOPP

Implementation
- VODCA: View-Oriented, Distributed, Cluster-based Approach to parallel computing
- VODCA version 1.0
  - Released as open source software
  - A library that runs in user space
  - Based on View-based Consistency
  - Uses an efficient consistency protocol, VOUPID

View-based Consistency
- Condition for View-based Consistency
  - Before a processor Pi is allowed to access a view by calling acquire_view or acquire_Rview, all previous write accesses to data objects of the view must be performed with respect to Pi according to their causal order.
- In VOPP, barriers are only used for synchronization and have nothing to do with consistency maintenance for DSM.

Consistency protocols
- They are page based
- Update protocol
  - Modifications are applied to remote copies immediately
- Invalidation protocol
  - Use a write notice to invalidate a page
  - When the page is accessed, a page fault causes the fetch of diffs, which are applied to the page
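A diff records which parts of a page changed, so that a stale copy can be brought up to date without transferring the whole page. The sketch below assumes the common twin-and-compare approach used by TreadMarks-style systems: a pre-write copy of the page (the twin) is compared word by word with the modified page. The slides do not describe the diff format, so `diff_t`, `make_diff`, `apply_diff`, and the tiny `PAGE_WORDS` page size are illustrative assumptions.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_WORDS 8   /* tiny page for illustration only */

/* One changed word: where it is and what it now contains. */
typedef struct {
    size_t offset;     /* word index within the page */
    int value;         /* new content of that word */
} diff_entry_t;

/* A diff is the list of words that changed on one page. */
typedef struct {
    diff_entry_t entries[PAGE_WORDS];
    size_t count;
} diff_t;

/* Record every word that differs between the pre-write copy (twin)
 * and the modified page. */
void make_diff(const int *twin, const int *page, diff_t *out)
{
    out->count = 0;
    for (size_t i = 0; i < PAGE_WORDS; i++) {
        if (twin[i] != page[i]) {
            out->entries[out->count].offset = i;
            out->entries[out->count].value = page[i];
            out->count++;
        }
    }
}

/* Bring a stale copy of the page up to date by replaying a diff. */
void apply_diff(int *page, const diff_t *d)
{
    for (size_t i = 0; i < d->count; i++) {
        assert(d->entries[i].offset < PAGE_WORDS);
        page[d->entries[i].offset] = d->entries[i].value;
    }
}
```

Under an invalidation protocol, `apply_diff` would run inside the page-fault handler once the diffs named by the write notices have been fetched. An update protocol would run it as soon as the diff arrives.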

Consistency protocols (cont.)
- Home-based protocol
  - Based on the invalidation protocol, but
  - For each page, one copy is used as its home
  - When a diff is created, it is applied to the home copy immediately
  - When the page is accessed, a page fault causes the fetch of the home copy (pro: resolves the diff accumulation problem)

The VOUPID protocol
- View-Oriented Update Protocol with Integrated Diff
  - Based on the update protocol
  - Diffs of a page of a view are merged into a single diff
  - The single diff is used to update the page when the view is acquired
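Merging several diffs of the same page into one can be sketched as follows. This is a guess at the idea rather than VOUPID's actual data structure, which the slides do not describe: the integrated diff keeps at most one slot per word, diffs are folded in following the order of the writes, and a later write to the same word replaces the earlier one. All type and function names here are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PAGE_WORDS 8   /* tiny page for illustration only */

typedef struct {
    size_t offset;     /* word index within the page */
    int value;         /* new content of that word */
} diff_entry_t;

typedef struct {
    diff_entry_t entries[PAGE_WORDS];
    size_t count;
} diff_t;

/* Integrated diff: at most one pending value per word of the page. */
typedef struct {
    bool dirty[PAGE_WORDS];
    int value[PAGE_WORDS];
} integrated_diff_t;

/* Start with an empty integrated diff. */
void integrated_init(integrated_diff_t *acc)
{
    for (size_t i = 0; i < PAGE_WORDS; i++)
        acc->dirty[i] = false;
}

/* Fold one diff into the integrated diff. Diffs must be merged in the
 * order in which the writes happened, so later values win. */
void integrated_merge(integrated_diff_t *acc, const diff_t *d)
{
    for (size_t i = 0; i < d->count; i++) {
        size_t off = d->entries[i].offset;
        assert(off < PAGE_WORDS);
        acc->dirty[off] = true;
        acc->value[off] = d->entries[i].value;
    }
}

/* Update a page copy in one pass; returns the number of words written. */
size_t integrated_apply(const integrated_diff_t *acc, int *page)
{
    size_t written = 0;
    for (size_t i = 0; i < PAGE_WORDS; i++) {
        if (acc->dirty[i]) {
            page[i] = acc->value[i];
            written++;
        }
    }
    return written;
}
```

If two diffs both touch word 3, the integrated diff carries only the newer value. A processor acquiring the view then receives one diff per page instead of a chain of them.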

Experiment
- Use a cluster computer
  - The cluster computer, at Tsinghua Univ., consists of 128 Itanium 2 running Linux 2.4, connected by InfiniBand. Each node has two 1.3 GHz processors and 4 Gbytes of RAM. We ran two processes on each node.
- We used four applications: Integer Sort (IS), Gauss, Successive Over-Relaxation (SOR), and Neural Network (NN).

Related systems
- TreadMarks (TMK) is a state-of-the-art Distributed Shared Memory system based on traditional parallel programming.
- Message Passing Interface (MPI) is a standard for message passing-based parallel programming. We used LAM/MPI.

Performance of NN
[Figure: chart comparing VODCA, TMK, and MPI at 2-p, 4-p, 8-p, 16-p, and 32-p; vertical axis ticks from 0 to 35]

Performance of IS
[Figure: chart comparing VODCA, TMK, and MPI at 2-p, 4-p, 8-p, 16-p, and 32-p; vertical axis ticks from 0 to 25]

Performance of SOR
[Figure: chart comparing VODCA, TMK, and MPI at 2-p, 4-p, 8-p, 16-p, and 32-p; vertical axis ticks from 0 to 16]

Performance of Gauss
[Figure: chart comparing VODCA, TMK, and MPI at 2-p, 4-p, 8-p, 16-p, and 32-p; vertical axis ticks from 0 to 25]

Future work on VOPP
- More benchmarks/applications
- Performance evaluation on larger clusters
- Optimized implementation of barriers for VOPP
- More auxiliary utilities for VOPP programmers
- A view-based debugger for VOPP
- A fault-tolerant system for VODCA

Questions?