11 July 2005
UPC/SHMEM Language Analysis and Usability Study
Professor Alan D. George, Principal Investigator
Mr. Hung-Hsun Su, Sr. Research Assistant
Mr. Adam Leko, Sr. Research Assistant
Mr. Bryan Golden, Research Assistant
Mr. Hans Sherburne, Research Assistant
HCS Research Laboratory
University of Florida
PAT
Purpose and Method
Purpose
  Determine performance factors purely from the language’s perspective
  Gain insight into how best to incorporate performance measurement with various implementations
Method
  Create a complete and minimal factor list
  Analyze UPC and SHMEM (Quadrics) specs
  Analyze various UPC/SHMEM implementations, plus discussions with developers
Factor List Creation
Factor list developed based on observations from other studies (tools, analytical models, etc.)
  Ensures factors are measurable
  Provides insight into how they can be measured
Only basic events included to eliminate redundancy
Sufficient for time-based analysis and memory system analysis
Completion notification: calling thread waits for completion of a one-sided operation that it initiated
Synchronization: multiple threads wait for each other to complete a single task
Local access: refers only to access of a local shared (global) variable
Factor List
Factors: basic events
Computation
  Execution (useful work): Time
  System overhead: Compiler, runtime, OS, thread, I/O
Communication
  Bulk transfer: Variable name, size, count, transfer time, overhead
  Small transfer: Variable name, count, value, total time (transfer + overhead)
  Synchronization: Notify time, wait time, count, overhead
  Completion notification: Wait time, count, overhead
Memory
  Local access: Variable name, time, count
  Cache (remote data): Size, miss/hit count, variable name
Other
  Scalability: System size
  Resource management: I/O, etc.
SHMEM Analysis
Performed on Quadrics SHMEM specification and GPSHMEM library
Great similarity between implementations
Factors for each construct involve execution plus
  Small transfer (put/get)
  Synchronization (other)
Variations between implementations are troublesome
  A standard for the SHMEM/GPSHMEM function set is desirable
    General: provides user with a uniform library set
    PAT: reduces complexity of system (i.e., possibly only one wrapper library is sufficient)
Wrapper approach (e.g., PSHMEM) fits very well
  Can borrow many ideas from PMPI
  However, analysis of data transfers needs special care to handle one-sided communication
See Language Analysis sub-report for construct-factor assignments
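The wrapper idea noted above can be pictured with a small sketch. This is an illustration only, written in Python rather than against a real SHMEM library: `fake_put`, `Event`, and `TRACE` are hypothetical names, and the "put" simply copies bytes into a local buffer. The point is the shape of the interposition layer: the wrapper records a timed event and then forwards to the wrapped routine.

```python
import time
from dataclasses import dataclass
from functools import wraps


@dataclass
class Event:
    name: str       # name of the wrapped routine
    nbytes: int     # payload size passed by the caller
    seconds: float  # wall-clock time spent inside the local call


TRACE = []  # hypothetical in-memory trace buffer


def profiled(func):
    """Wrap func so that every call is timed and appended to TRACE."""
    @wraps(func)
    def wrapper(dest, src, nbytes, pe):
        start = time.perf_counter()
        try:
            return func(dest, src, nbytes, pe)
        finally:
            elapsed = time.perf_counter() - start
            TRACE.append(Event(func.__name__, nbytes, elapsed))
    return wrapper


@profiled
def fake_put(dest, src, nbytes, pe):
    # Stand-in for a one-sided put (assumption): copy into a local buffer.
    # A real wrapper would forward to the library's own entry point.
    dest[:nbytes] = src[:nbytes]
```

Calling `fake_put(bytearray(8), b"abcdefgh", 4, 1)` appends one `Event` to `TRACE`. Note that the recorded interval covers only the time spent in the local call; for a genuinely one-sided operation the remote completion happens later, which is the special care the slide refers to.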
UPC Analysis (1)
Performed on UPC spec. 1.1, Berkeley UPC, Michigan Tech UPC, and HP UPC (in progress)
See Language Analysis sub-report for construct-factor assignment
Specification analysis
  Educated guesses; attempts to cover all aspects of the language
  Too generic for PAT development
Implementations
  Many similarities between implementations
  Wrapper mentality works with UPC function constructs (PUPC proposal)
  Pre-processor needed to handle UPC non-function constructs
UPC Analysis (2)
Implementations (cont.)
  HP UPC
    Composed of UPC Compiler (compiler), Run-Time System (RTS), and (optional) Run-Time Environment (RTE)
    UPC global variable access translates to HW shared-memory access, which impacts time of instrumentation
    Waiting for Brian at HP to send details on UPC functions to complete construct-factor assignment
  GCC-UPC: will be studied after completion of HP UPC
UPC Specification Construct-Factor Table (1)
Shared variable
  Declaration: Execution, overhead, synchronization, local access
  Assignment: Execution, overhead, bulk/small transfer, completion, local access, cache
  upc_threadof, upc_resetphase, upc_addrfield: Execution
  upc_affinitysize: Execution, synchronization
Shared lock
  upc_lock_t (declaration): Execution, overhead, synchronization
  upc_global_lock_alloc, upc_lock_free: Execution, overhead, small transfer, synchronization, local access
  upc_all_lock_alloc, upc_lock, upc_unlock, upc_lock_attempt: Execution, overhead, small transfer, completion, local access
Global address space
  upc_global_alloc, upc_all_alloc: Execution, overhead, small transfer, synchronization
  upc_alloc, upc_local_alloc, upc_free: Execution, overhead, small transfer, completion, local access
UPC Specification Construct-Factor Table (2)
Environment
  MYTHREAD, THREADS: Execution, overhead
  upc_strict / upc_strict.h, upc_relaxed / upc_relaxed.h: N/A. These affect how the other constructs need to be tracked but by themselves do not relate to any factor.
  upc_global_exit: Execution, synchronization, resource management
Bulk memory transfer
  upc_memcpy, upc_memget, upc_memput, upc_memset: Execution, bulk/small transfer, completion, local access, cache
Synchronization
  upc_notify, upc_wait, upc_barrier, upc_fence: Execution, synchronization
Other
  upc_forall: Execution, synchronization, local access
  UPC_MAX_BLOCK_SIZE, upc_localsizeof, upc_blocksizeof, upc_elemsizeof (compile-time constants): Execution, small transfer
Berkeley UPC Analysis (1)
Based on version 2.0.1
Analysis at UPC level with some consideration at communication level
Noteworthy implementation details
  upc_all_alloc and upc_all_lock_alloc: use of all-to-all broadcast
  upc_alloc and upc_global_alloc behave like upc_local_alloc: double size of heap when running out of space
  Multiple mechanisms for implementing barrier
    HW supported (e.g., InfiniBand)
    Custom barrier (e.g., SHMEM/LAPI)
    Centralized (other, current); logarithmic dissemination (other, future)
Impact on PAT
  UPC-level-only instrumentation: 1 unit, less accurate
  UPC + communication-level instrumentation: multiple units, more accurate
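For readers unfamiliar with the logarithmic dissemination barrier listed among the barrier mechanisms, the sketch below shows the pattern in its common textbook form. It is not taken from the Berkeley UPC source; the function names are hypothetical. In round k, thread i signals thread (i + 2^k) mod n, and all threads are synchronized after ceil(log2 n) rounds.

```python
def dissemination_schedule(n):
    """Return one list of (sender, receiver) pairs per round for n threads."""
    rounds = (n - 1).bit_length()  # equals ceil(log2(n)) for n >= 1
    return [[(i, (i + 2 ** k) % n) for i in range(n)] for k in range(rounds)]


def everyone_heard_from_everyone(n):
    """Check that, after all rounds, each thread has transitively heard from all."""
    known = [{i} for i in range(n)]
    for pairs in dissemination_schedule(n):
        snapshot = [set(s) for s in known]  # signals in a round are concurrent
        for sender, receiver in pairs:
            known[receiver] |= snapshot[sender]
    return all(len(s) == n for s in known)
```

From an instrumentation standpoint, each round is a set of small transfers plus a wait, so a single UPC-level barrier event expands into several communication-level events.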
Berkeley UPC Analysis (2)
Noteworthy implementation details (cont.)
  Three different translations for upc_forall
    All tasks can be done by 1 thread: if statement followed by a regular for loop
    Tasks are cyclically distributed: for loop with stride equal to number of threads
    Tasks are block distributed: two-level for loops are used (outer loop is same as in second case; inner loop is a regular for loop over all elements in the block)
Impact on PAT: instrumentation needed before translation
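The three loop shapes described above can be modeled as follows. This is a sketch of the iteration sets only, written in Python for testability; it is not translator output, and the function and parameter names (`me` for MYTHREAD, `threads` for THREADS, `block` for the block size) are assumptions made for illustration.

```python
def single_owner(n, me, owner):
    # Shape 1: if (MYTHREAD == owner) for (i = 0; i < n; i++)
    return list(range(n)) if me == owner else []


def cyclic(n, me, threads):
    # Shape 2: for (i = MYTHREAD; i < n; i += THREADS)
    return list(range(me, n, threads))


def block_distributed(n, me, threads, block):
    # Shape 3: outer loop strides over blocks as in shape 2,
    # inner loop walks every element of the current block.
    iterations = []
    for start in range(me * block, n, threads * block):
        for i in range(start, min(start + block, n)):
            iterations.append(i)
    return iterations
```

Because the original upc_forall affinity expression disappears once one of these plain loops is emitted, a tool that wants to attribute work to the upc_forall construct has to insert its probes before translation.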
Berkeley UPC Construct-Factor Table (1)
Shared variable
  Declaration: Execution
  Pointer assignment: Execution
  Scalar variable/array assignment: Execution, small transfer if data is remote, completion
  upc_threadof, upc_resetphase, upc_addrfield, upc_affinitysize: Execution
Shared lock
  Shared lock declaration (upc_lock_t): Execution
  upc_global_lock_alloc
    Node 0: Execution, local access
    Other: Execution
    Rare case: + small transfer, completion
  upc_all_lock_alloc: Execution, synchronization, local access
  upc_lock, upc_lock_attempt: Execution, small transfer, completion
  upc_unlock, upc_lock_free: Execution, small transfer
Global address space
  upc_all_alloc: Execution, synchronization, local access
  upc_global_alloc, upc_alloc, upc_local_alloc
    Node 0: Execution, local access
    Other: Execution
    Rare case: + small transfer, completion
  upc_free: Execution, small transfer (if not owner), local access
Berkeley UPC Construct-Factor Table (2)
Environment
  MYTHREAD, THREADS: Execution
  upc_strict / upc_strict.h, upc_relaxed / upc_relaxed.h: N/A
  upc_global_exit: Execution, small transfer, synchronization (system dependent), local access
Bulk memory transfer
  upc_memget, upc_memput, upc_memcpy: Execution, bulk/small transfer, completion, local access (for upc_memcpy, see item 12 in the notes and observations section for detail)
Synchronization
  upc_notify: Execution, synchronization [notify]
  upc_wait: Execution, synchronization [wait]
  upc_barrier: Execution, synchronization
  upc_fence: Execution
Other
  upc_forall: Execution
  UPC_MAX_BLOCK_SIZE, upc_localsizeof, upc_blocksizeof, upc_elemsizeof: Execution (expanded to constant at translation time)
Michigan Tech UPC Analysis
Based on version 1.1
Noteworthy implementation details
  Uses centralized control for most control processes (i.e., split and non-split barriers, collective array allocation, collective lock allocation, and global exit)
  Based on a two-pthread system using a producer-consumer mechanism
    Program thread (producer): adds entries to appropriate send queues
    Communication thread (consumer): sends and processes requests via MPI (no aggregation of data for optimization; bulk transfer = x small transfers)
    Impact on PAT: transfer, completion, and synchronization are much harder to track
  Uses flat broadcast and tree broadcast
  Caching capability complicates analysis
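The two-thread producer-consumer arrangement described above can be sketched in a few lines. This is a simplified model, not MTU UPC code: the `run` function, the `None` sentinel, and the `delivered` list (standing in for the remote side) are assumptions, and no MPI call is made.

```python
import queue
import threading


def run(payload, chunk):
    """Enqueue payload as fixed-size pieces and let a second thread drain them."""
    send_q = queue.Queue()
    delivered = []  # stands in for data arriving at the remote side

    def communication_thread():
        # Consumer: process requests until the sentinel arrives.
        while True:
            item = send_q.get()
            if item is None:
                break
            delivered.append(item)  # a real runtime would issue a send here

    worker = threading.Thread(target=communication_thread)
    worker.start()

    # Producer (program thread): a bulk transfer becomes many small requests.
    for offset in range(0, len(payload), chunk):
        send_q.put(payload[offset:offset + chunk])
    send_q.put(None)

    worker.join()
    return delivered
```

In this arrangement the program thread returns from `put` as soon as the request is queued, so the time at which a transfer is requested and the time at which it is actually sent are observed on different threads. That separation is what makes transfer and completion events harder to track.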
MTU UPC Construct-Factor Table (1)
Shared variable
  Declaration: Execution, overhead, local access
  Pointer assignment: Execution
  Scalar variable/array assignment
    Owner: Execution, local access, cache
    Other: Execution, small transfer
  upc_threadof, upc_resetphase, upc_addrfield, upc_affinitysize: Execution
Shared lock
  Shared lock declaration (upc_lock_t): Execution
  upc_global_lock_alloc: Execution
  upc_all_lock_alloc: Execution, synchronization (2, barrier + tree broadcast)
  upc_lock, upc_unlock, upc_lock_attempt: Execution, synchronization
  upc_lock_free
    Owner: Execution
    Other: Execution, small transfer (strictly should be classified as completion notification, but the calling thread does not wait for a reply)
Global address space
  upc_global_alloc
    Node 0: Execution, local access
    Other: Execution, completion
  upc_all_alloc, upc_free: Execution, synchronization (2, barrier + tree broadcast), local access
  upc_alloc, upc_local_alloc: Execution, local access
MTU UPC Construct-Factor Table (2)
Environment
  MYTHREAD, THREADS: Execution
  upc_strict / upc_strict.h, upc_relaxed / upc_relaxed.h: Execution, cache
  upc_global_exit: Execution, synchronization
Bulk memory transfer
  upc_memcpy: Execution, small transfer, synchronization, local access (see item 10 in the notes and observations section for detail)
  upc_memget, upc_memput, upc_memset
    Owner: Execution, local access, cache
    Other: Execution, small transfer
Synchronization
  upc_notify: Execution, synchronization, cache
  upc_wait: Execution, synchronization, cache
  upc_barrier: Execution, small transfer, synchronization, cache
  upc_fence: Execution, cache
Other
  upc_forall: Execution
  UPC_MAX_BLOCK_SIZE, upc_localsizeof, upc_blocksizeof, upc_elemsizeof: Execution
Summary
Factor list and construct-factor assignment provide basis for practical event tracing in UPC and SHMEM
SHMEM
  Wrapper library approach appears ideal
  Push for SHMEM standardization will simplify development
UPC
  Hybrid pre-processor/wrapper library approach appears appropriate (compatible with GCC-UPC?)
Analysis provides insights on how to instrument UPC/SHMEM programs and raises awareness of possible difficulties
Usability: Purpose and Methods
Purpose
  Determine factors affecting usability of performance tools
  Determine how to incorporate knowledge about factors into PAT
Methods
  Elicit user feedback through a Performance Tool Usability Survey (survey generated after some literature reviews)
  Review and provide a concise summary of literature in area of usability for parallel performance tools
Outline
  Discuss common problems seen in performance tools
  Provide a discussion on factors influencing usability of performance tools
  Outline how to incorporate user-centered design into PAT
  Present guidelines to avoid common usability problems
Performance Tool Usability: General Performance Tool Problems
Difficult problem for tool developer
  Inherently unstable execution environment
  Monitoring behavior may disturb original behavior
  Short lifetime of parallel computers
Users
  Tools too difficult to use
  Too complex
  Unsuitable for real-world applications
  Users skeptical about value of tools
Discussion on Usability Factors* (1)
Ease-of-learning
  Concern
    Important for attracting new users
    Tool’s interface shapes user’s understanding of its functionality
    Inconsistency leads to confusion (e.g., providing defaults for some objects but not all)
  Possible solutions
    Strive for internally and externally consistent tool
    Stick to established conventions
    Provide uniform interface
    Target as many platforms as necessary so user can amortize time invested over many uses
Usefulness
  Concern: how directly tool helps user achieve their goal
  Possible solution: make common case simple even if that makes rare case complex
* C. Pancake, “Improving the Usability of Numerical Software through User-Centered Design,” The Quality of Numerical Software: Assessment and Enhancement, ed. B. Ford and J. Rice, pp. 44-60, 1997.
Discussion on Usability Factors (2)
Ease-of-use
  Concern: amount of effort required to accomplish work with tool is too high to justify tool’s use
  Possible solutions
    Do not force user to memorize information about interface; use menus, mnemonics, and other mechanisms
    Provide a simple interface
    Make all user-required actions concrete and logical
Throughput
  Concern: how tool contributes to user productivity in general
  Keep in mind that inherent goal of tool is to increase user productivity
User-Centered Design
Concept that usability should be driving factor in tool development
Based on premise that usability will only be achieved if design process is user-driven
Four-step model to incorporate user feedback* (chronological)
  Ensure initial functionality is based on user needs
    Solicit input directly from users
      MPI users (for information about existing tools)
      UPC/SHMEM users
      Sponsor
  Analyze how users identify and correct performance problems
    UPC/SHMEM users primarily
    Gain better idea of how the tool will actually be used on real programs
    Information from users then presented to sponsor for critique/feedback
  Implement incrementally
    Organize interface so that most useful features are best supported
    User evaluation of preliminary/prototype designs
    Maintain strong relationship with users to whom we have access
  Have users evaluate every aspect of tool’s interface, structure, and behavior
    Alpha/beta testing
    User tests should be performed at many points along the way
    Feature-by-feature refinement in response to specific user feedback
* S. Musil, G. Pigel, M. Tscheligi. “User Centered Monitoring of Parallel Programs with InHouse.” HCI ’94 Ancillary Proceedings, 1994.
Performance Tool Usability: Guidelines
Issues for Performance Tools and Solutions
  Many tools begin by presenting windows with detailed info on a performance metric
    Users prefer broader perspective on application behavior
  Some tools provide multiple views of program behavior
    Good idea, but need support for comparing different metrics
      For example, seeing that CPU utilization drops in the same place where L1 cache miss rate rises
    Also essential to provide source-code correlation to be useful
  User does not want info that cannot be used to fix code
Performance Tool Usability: Summary
Tool will not gain user acceptance until useable in real-world environment
Need to identify successful user strategies from existing tools for real applications
Devise ways to apply successful strategies to tool in an intuitive manner
Use this functionality in development of new tool
Presentation Methodology: Introduction
Why use visualizations?
  To facilitate user comprehension
  To convey complexity and intricacy of performance data
  To help bridge gap between raw performance data and performance improvements
When to use visualizations?
  On-line: visualization while application is running (can slow down execution significantly)
  Post mortem: after execution (usually based on trace data gathered at runtime)
What to visualize?
  Interactive displays to guide the user
  Default visualizations should provide high-level views
  Low-level information should be easily accessible
General Approaches to Performance Visualization
General Categories
  System/Application-independent: depict performance data for variety of systems and applications (most tools use this approach)
  Meta-tools: facilitate development of custom visualization tools
Other Categories
  On-line: visualization during execution
    Can be intrusive
    Volume of information may be too large to interpret without playback functionality
    Allows user to observe only interesting parts of execution without waiting
  Post mortem: visualization after execution
    Have to wait to see visualizations
    Easier to implement
    Less intrusion on application behavior
Useful Visualization Techniques
Animation
  Has been employed by various tools to provide program execution replay
  Most commonly animated events are communication operations
  Viewing data dynamically may illuminate bottlenecks more efficiently
  However, animation is usually very cumbersome in practice
Program graphs
  Generalized picture of entire system
Gantt charts
  De facto standard for displaying inter-process communication
Data access displays
  Each cell of 2D display is devoted to an element of the array
  Color distinguishes between local/remote and read/write
Critical path analysis
  Concerned with identifying program regions that contribute most to program execution time
  Graph depicts synchronization and communication dependencies among processes in program
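As a rough illustration of the critical path idea above, the sketch below finds the longest chain of dependent regions in a dependency graph. It assumes the graph is acyclic and that each region has a known duration; the `critical_path` function and its input layout are hypothetical and not part of any tool discussed here.

```python
def critical_path(durations, deps):
    """Return (total_time, path) for the longest dependency chain.

    durations: {region: time}
    deps: {region: [regions that must finish first]} (assumed acyclic)
    """
    finish = {}     # earliest finish time of each region
    best_pred = {}  # predecessor that determines each region's start

    def visit(region):
        if region in finish:
            return finish[region]
        start = 0.0
        best_pred[region] = None
        for pred in deps.get(region, []):
            t = visit(pred)
            if t > start:
                start, best_pred[region] = t, pred
        finish[region] = start + durations[region]
        return finish[region]

    for region in durations:
        visit(region)

    # Walk back from the region that finishes last.
    node = max(finish, key=finish.get)
    total = finish[node]
    path = []
    while node is not None:
        path.append(node)
        node = best_pred[node]
    return total, path[::-1]
```

Regions on the returned path are the ones whose speedup would shorten the run; time spent in regions off the path is hidden behind waiting elsewhere.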
Summary of Visualizations
Animation
  Advantages: Adds another dimension to visualizations
  Disadvantages: CPU intensive, cumbersome at times
  Requirements document: Advanced
  Used for: Various
Program Graphs (N-ary tree)
  Advantages: Built-in zooming; integration of high- and low-level data
  Disadvantages: Difficult to see inter-process data
  Requirements document: Other
  Used for: Comprehensive Program Visualization
Gantt Charts (Time histogram; Timeline)
  Advantages: Ubiquitous; intuitive
  Disadvantages: Not as applicable to shared memory as to message passing
  Requirements document: Core, Functional
  Used for: Communication Graphs
Data Access Displays (2D array)
  Advantages: Provides detailed information regarding the dynamics of shared data
  Disadvantages: Narrow focus; users may not be familiar with this type of visualization
  Requirements document: Core, Functional
  Used for: Data Structure Visualization
Kiviat Diagrams
  Advantages: Provides an easy way to represent statistical data
  Disadvantages: Can be difficult to understand
  Requirements document: Advanced
  Used for: Various statistical data (processor utilization, cache miss rates, etc.)
Event Graph Displays (Timeline)
  Advantages: Can be used to display multiple data types (event-based)
  Disadvantages: Provides mostly high-level information
  Requirements document: Advanced
  Used for: Inter-process dependency
Guidelines and Interface Evaluation
General Guidelines*
  Visualization should guide, not rationalize
  Scalability is crucial
  Color should inform, not entertain
  Visualization should be interactive
  Visualizations should provide meaningful labels
  Default visualization should provide useful information
  Avoid showing too much detail
  Visualization controls should be simple
Goals, Operators, Methods, and Selection Rules (GOMS)
  Formal user interface evaluation technique
  Way to characterize a set of design decisions from point of view of user
  Description of what user must learn; may be basis for reference documentation
  May be able to use GOMS analysis in design of PAT
  Knowledge described in a form that can actually be executed (there have been several fairly successful attempts to implement GOMS analysis in software, e.g., GLEAN)
  Various incarnations of GOMS with different assumptions useful for more specific analyses (KLM, CMN-GOMS, NGOMSL, CPM-GOMS, etc.)
* B. Miller. “What to Draw? When to Draw?: An Essay on Parallel Program Visualization,” Journal of Parallel and Distributed Computing, 18:2, pp. 265-269, 1993.
Simple GOMS Example: OS X
GOMS model for OS X
  Method for goal: delete a file
    Step 1. Think of file name and retain as first filespec (file specifier)
    Step 2. Accomplish goal: drag file to trash
    Step 3. Return with goal accomplished
  Method for goal: move a file
    Step 1. Think of file name and retain as first filespec
    Step 2. Think of destination directory name and retain as second filespec
    Step 3. Accomplish goal: drag file to destination
    Step 4. Return with goal accomplished
Simple GOMS Example: UNIX
GOMS model for UNIX
  Method for goal: delete a file
    Step 1. Recall that command verb is rm -f
    Step 2. Think of file name and retain as first filespec
    Step 3. Accomplish goal: enter and execute a command
    Step 4. Return with goal accomplished
  Method for goal: copy a file
    Step 1. Recall that command verb is cp
    Step 2. Think of file name and retain as first filespec
    Step 3. Think of destination directory name and retain as second filespec
    Step 4. Accomplish goal: enter and execute a command
    Step 5. Return with goal accomplished
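To show what "knowledge described in a form that can actually be executed" might look like at its simplest, the sketch below encodes the two delete-a-file methods from the OS X and UNIX examples as step lists and compares them. This is a toy comparison under stated assumptions, not a real GOMS analysis: the `METHODS` table and helper names are hypothetical, and counting steps is only a crude proxy for the operator timings a full GOMS model would assign.

```python
# Step lists transcribed from the two "delete a file" methods.
METHODS = {
    "OS X": [
        "Think of file name and retain as first filespec",
        "Accomplish goal: drag file to trash",
        "Return with goal accomplished",
    ],
    "UNIX": [
        "Recall that command verb is rm -f",
        "Think of file name and retain as first filespec",
        "Accomplish goal: enter and execute a command",
        "Return with goal accomplished",
    ],
}


def step_count(system):
    """Number of steps the user performs for the goal on this system."""
    return len(METHODS[system])


def memory_steps(system):
    """Steps that require the user to recall or hold something in memory."""
    return [s for s in METHODS[system] if s.startswith(("Recall", "Think"))]
```

Here the UNIX method has one more step than the OS X method, and the extra step is a recall step, which lines up with the ease-of-use guideline about not forcing users to memorize interface details.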
Summary
Plan for development
  Develop a preliminary interface that provides functionality required by user while conforming to visualization guidelines
  After preliminary design is complete, elicit user feedback
  During periods when user contact is unavailable, may be able to use GOMS analysis or another formal interface evaluation technique