High Performance Computing (HPC)
• As a service: NCI – Raijin
• Katana – local HPC cluster
Cloud Computing
• Research Cloud: NeCTAR
• Commercial Cloud: Amazon AWS, Microsoft Azure, etc.
• Seed money for exploration of new cloud technologies
Research Data
• Help with Data Management
• Assistance with data moves, storage, planning
Training
• 40+ courses run on campus, including:
• Introduction into Linux
• Getting started with HPC
• Parallel Programming
• Introduction to programming with Julia
• Programming courses for Python, R, Matlab, Excel
Consulting
• Help with developing / optimizing code
• Help getting started on Raijin and Katana
HPC Systems
                     Raijin                                     Katana
Owner                NCI (ANU)                                  RTS (UNSW)
Size                 Almost 4500 nodes                          Almost 200 nodes
Access               UNSW, Intersect and NCMAS.                 Free for small users. Buy-in for groups.
Project Storage      37 PB                                      1 PB
Node Interconnects   Up to 100 Gb/s                             Up to 10 Gb/s
Max Walltime         48 hours                                   200 hours
Best for             Large parallel jobs, complicated models.   Bioscience and genomics.
NCI Resources
• User and project management, including software, at Mancini / MyNCI (https://my.nci.org.au)
• Resources (compute and storage) are provided at the project level.
• NCMAS – Open: 3rd Sept, Close: 19th Oct, Announced: 11th Dec
• UNSW Scheme – Request at any time but annual application in November, Announced: 19th Dec
• 2 GB home, 72 GB short per project (can be increased by NCI). /g/data and MDSS are available on request to UNSW.
• The copyq queue is used for moving data around.
• Useful commands
• lquota – Shows how much storage you are using
• nci_account – Lists queues, including how much compute you have used
• nf_limits -P project -n ncpus -q queue – Shows memory, walltime, etc. limits
Your first steps
• Use the module avail command to see what is installed. Then use module help to get information about a module and module load to load it.
• Start with an interactive job. It is ok to request extra memory and CPU cores whilst you are figuring things out but you need to remember to trim the resources as you get more confident.
• Take note of what you are doing. You may want to open a text editor and copy your commands into it. That includes commands within your application.
• If you have a lot of similar jobs, use an array job on Katana. At NCI, see https://opus.nci.org.au/display/Help/How+to+submit+array+jobs+on+Raijin
• Have a look at the web site(s).
• https://opus.nci.org.au – NCI. It has everything! Instructions, software lists, queues, status, etc.
• https://www.hpc.science.unsw.edu.au – Katana
Job Script Options
                          Raijin                      Katana
Project Code              #PBS -P a99                 N/A
Job Queue                 #PBS -q normal              Automatic
Memory                    #PBS -l mem=300GB           #PBS -l vmem=300GB
Nodes and CPU Cores       #PBS -l ncpus=4             #PBS -l nodes=1:ppn=3
Job Walltime              #PBS -l walltime=12:00:00   #PBS -l walltime=12:00:00
Start in Current Dir.     cd $PBS_O_WORKDIR           cd $PBS_O_WORKDIR
Email when job finishes   #PBS -m ae                  #PBS -m ae
                          #PBS -M [email protected]   #PBS -M [email protected]
Katana job script
The same #PBS options that go into a job script can also be used for interactive jobs.
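A minimal sketch of what a Katana job script could look like, built only from the options listed in the Job Script Options table. The resource values, the email address and the final command are placeholders chosen for illustration, not recommended settings.

```shell
#!/bin/bash
# Sketch of a Katana job script. All values below are placeholders.
#PBS -l nodes=1:ppn=1
#PBS -l vmem=4gb
#PBS -l walltime=12:00:00
#PBS -m ae
#PBS -M your.name@example.com

# Start in the directory the job was submitted from.
# Outside PBS the variable is unset, so fall back to the current directory.
workdir="${PBS_O_WORKDIR:-.}"
cd "$workdir" || exit 1

# Replace this line with your real work (module load, then run your program).
echo "Job running in $(pwd)"
```

Submit it with qsub followed by the script name. For an interactive session the same resource options can be given on the command line instead, for example qsub -I -l nodes=1:ppn=1,vmem=4gb,walltime=1:00:00.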
Job Queues
Katana – automatic according to walltime
• 12 hours – Any node in the cluster
• 48 hours – Shared nodes plus your nodes
• 100 hours – Shared nodes plus your nodes
• 200 hours – Your nodes
NCI – You need to specify the queue that you want to use. The easy way to list the queues is to run nci_account.
• Visit https://opus.nci.org.au
• All nodes within each queue are identical.
• No production runs in Express queue.
• Special queues for big memory, KNL, GPU, etc. Check the queue status.
Nodes, CPU cores and Memory
• Katana - nodes=1:ppn=1 – 1 CPU core on 1 compute node
• Katana - nodes=1:ppn=12 – 12 CPU cores on 1 compute node
• Katana - nodes=2:ppn=12 – 12 CPU cores on each of 2 separate compute nodes (don’t use this unless you know what MPI is). If you do, speak to us.
• At NCI the number of nodes is selected automatically via ncpus. DO NOT SPECIFY NODES. Make sure you know how many cores the nodes in the queue have.
• Read the module help to see if there is a default number of cores (often called CPUs). Set the number of cores in your application to match your job request. Sometimes it is better to leave 1 CPU core for overhead.
• There is a full list of compute nodes on the web sites.
• You need to leave some memory for the operating system, so avoid requesting the full memory of a node (i.e. avoid 96, 128, 144, 256 GB, etc.). Request 2 GB less so that the operating system has some capacity.
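The memory rule of thumb above can be sketched as a small helper. The function name mem_request is hypothetical, and the 2 GB headroom is simply the figure quoted on this slide. It prints a Katana-style vmem line; on Raijin the option is mem rather than vmem.

```shell
# mem_request NODE_MEM_GB [HEADROOM_GB]
# Prints a vmem request that leaves headroom for the operating system.
# Hypothetical helper; the default headroom of 2 GB comes from the slide.
mem_request() {
    node_mem_gb=$1
    headroom_gb=${2:-2}
    echo "#PBS -l vmem=$((node_mem_gb - headroom_gb))gb"
}

# Example: a node with 128 GB of memory.
mem_request 128    # prints: #PBS -l vmem=126gb
```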
Once a Job has Run
THE END OF JOB EMAIL IS IMPORTANT! READ IT!
How long did your job run for?
If your job runs for less than 20 minutes, combine several calculations into one job.
If your job has multiple stages then make each stage a different job and chain them together. The easiest way is to have a qsub command at the end of the job but there are many ways of chaining jobs. Speak to us!
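One way to sketch the "qsub at the end of the job" approach is shown below. The helper name chain_next and the script name stage2.pbs are hypothetical. The helper only submits the next stage when the previous one exited successfully, and the QSUB variable is an assumption added here so the logic can be tried as a dry run without a scheduler.

```shell
# Command used to submit jobs; set QSUB=echo for a dry run.
QSUB="${QSUB:-qsub}"

# chain_next STATUS NEXT_SCRIPT
# Submits NEXT_SCRIPT only if STATUS (exit code of the stage just run) is 0.
chain_next() {
    if [ "$1" -eq 0 ]; then
        "$QSUB" "$2"
    else
        echo "stage failed with status $1; not submitting $2" >&2
        return 1
    fi
}

# Typical use as the last lines of a stage 1 job script:
#   ./run_stage1              (your real command goes here)
#   chain_next $? stage2.pbs
```

Checking the exit status first stops a failed stage from launching the stages that depend on it.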
How much memory did you use?
Can you reduce the amount of memory that you request next time?
How much CPU time did you use?
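If ppn=6 and cputime = 2 * walltime then only about a third of the requested CPU capacity was used, so don’t bother with more than 1 CPU core.
If you used 1 core and cputime is less than half of walltime, the job is probably spending its time waiting on file I/O, so consider using local scratch. Global scratch is for big files.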
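The comparison between CPU time and walltime can be put as a single number: CPU time divided by (walltime × cores requested). A minimal sketch, assuming both times from the end of job email have already been converted to seconds; the function name cpu_efficiency is hypothetical.

```shell
# cpu_efficiency CPUTIME_SECONDS WALLTIME_SECONDS NCORES
# Prints the percentage of the requested core time that was actually used
# (integer arithmetic, rounded down).
cpu_efficiency() {
    cputime=$1
    walltime=$2
    ncores=$3
    echo $(( 100 * cputime / (walltime * ncores) ))
}

# Example: 6 cores requested, 1 hour of walltime, 2 hours of CPU time.
cpu_efficiency 7200 3600 6    # prints 33
```

A value near 100 means the cores were busy for the whole job. A low value with several cores suggests requesting fewer cores, and a low value on a single core suggests the job is waiting on something other than the CPU.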
How do I figure out how my job is working (memory, CPU, I/O, etc.)?
https://opus.nci.org.au/display/Help/Debuggers%2C+Profilers+and+Simulators - NCI offers courses
NCI Helpful Links
• User account and project management https://my.nci.org.au
• Wiki page https://opus.nci.org.au
• HelpDesk [email protected] and/or https://help.nci.org.au
• Job History https://usersupport.nci.org.au/report/job_history
• Raijin Live Status http://nci.org.au/user-support/current-job-details/
• Software License Status http://nci.org.au/user-support/getting-help/license-status
Contacts
Data Management and ResData: Outreach Librarians
https://www.library.unsw.edu.au/study/about-unsw-library/contact-us/outreach-librarians
HPC, cloud, specialist storage questions: Research Technology Services
https://research.unsw.edu.au/research-technology-services or email [email protected]
OneDrive, Data Archive, Research Active storage questions: UNSW IT
https://www.it.unsw.edu.au/, 9385 1333, [email protected]
Data Classification issues: Data Governance
https://www.datagovernance.unsw.edu.au/
Research Integrity issues: Your faculty Research Integrity Advisor
Questions