High Performance Computing (HPC)

• As a service: NCI – Raijin
• Katana – local HPC cluster

Cloud Computing

• Research Cloud: NeCTAR
• Commercial Cloud: Amazon AWS, Microsoft Azure, etc.
• Seed money for exploration of new cloud technologies

Research Data

• Help with Data Management
• Assistance with data moves, storage, planning

[email protected]

Training

• 40+ courses run on campus, including:
  • Introduction to Linux
  • Getting started with HPC
  • Parallel Programming
  • Introduction to programming with Julia
  • Programming courses for Python, R, Matlab, Excel

Consulting

• Help with developing / optimizing code
• Help getting started on Raijin and Katana

[email protected]

HPC Systems

                     Raijin                                         Katana
Owner                NCI (ANU)                                      RTS (UNSW)
Size                 Almost 4500 nodes                              Almost 200 nodes
Access               UNSW, Intersect and NCMAS.                     Free for small users. Buy-in for groups.
Project Storage      37 PB                                          1 PB
Node Interconnects   Up to 100 Gb/s                                 Up to 10 Gb/s
Max Walltime         48 hours                                       200 hours
Best for             Large parallel jobs, complicated models.       Bioscience and genomics.

NCI Resources

• User and project management, including software, at Mancini / MyNCI (https://my.nci.org.au)

• Resources (compute and storage) are provided at the project level.

• NCMAS – Open: 3rd Sept, Close: 19th Oct, Announced: 11th Dec

• UNSW Scheme – Request at any time but annual application in November, Announced: 19th Dec

• 2 GB home, 72 GB short per project (can be increased by NCI); /g/data and MDSS available on request to UNSW.

• Copyq is used for moving data around.

• Useful commands

• lquota – How much storage you are using

• nci_account – List queues, including how much compute you have used

• nf_limits -P project -n ncpus -q queue – Show memory, walltime, etc. limits

Your first steps

• Use the module avail command to see what is installed. Then use module help and module load to get information and to load the module.

• Start with an interactive job. It is ok to request extra memory and CPU cores whilst you are figuring things out but you need to remember to trim the resources as you get more confident.

• Take note of what you are doing. You may want to open a text editor and copy your commands into it. That includes commands within your application.

• If you have a lot of similar jobs, use an array job on Katana or at NCI: https://opus.nci.org.au/display/Help/How+to+submit+array+jobs+on+Raijin

• Have a look at the web site(s).

• https://opus.nci.org.au – NCI. It has everything! Instructions, software lists, queues, status, etc.

• http://www.hpc.science.unsw.edu.au – Katana

Job Script Options

                          Raijin                       Katana
Project Code              #PBS -P a99                  N/A
Job Queue                 #PBS -q normal               Automatic
Memory                    #PBS -l mem=300GB            #PBS -l vmem=300GB
Nodes and CPU Cores       #PBS -l ncpus=4              #PBS -l nodes=1:ppn=3
Job Walltime              #PBS -l walltime=12:00:00    #PBS -l walltime=12:00:00
Start in Current Dir.     cd $PBS_O_WORKDIR            cd $PBS_O_WORKDIR
Email when job finishes   #PBS -m ae                   #PBS -m ae
                          #PBS -M [email protected]     #PBS -M [email protected]

This is a Katana job script

This is a sample job script. The same #PBS options can be used for interactive jobs.
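As a rough illustration of how the Katana options fit together, a job script might look like the sketch below. The resource values, the email address and the final echo command are placeholders made up for this example, not recommendations.

```shell
#!/bin/bash
# Sketch of a Katana job script; every value below is an example placeholder.
#PBS -l nodes=1:ppn=1
#PBS -l vmem=4GB
#PBS -l walltime=12:00:00
#PBS -m ae
#PBS -M your.name@example.com

# Start in the directory the job was submitted from.
# Outside PBS the variable is unset, so fall back to the current directory.
cd "${PBS_O_WORKDIR:-$PWD}"

# Placeholder for the real work: replace with your own commands.
echo "Job running in $(pwd)"
```

Submit the script with qsub followed by the script's file name. The #PBS lines are plain comments to bash, so the script can also be run directly for a quick check.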

Job Queues

Katana – automatic according to walltime

• 12 Hours – Any node in the cluster

• 48 hours – Shared nodes plus your nodes

• 100 hours – Shared nodes plus your nodes

• 200 hours – Your nodes
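The walltime tiers above can be sketched as a small lookup. The function name is hypothetical, and treating each limit as inclusive (exactly 12 hours still counts as the 12 hour tier) is an assumption, not something the scheduler documentation confirms here.

```shell
#!/bin/bash
# Hypothetical helper: report which Katana nodes a job can run on,
# based on the walltime tiers listed above. Limits are assumed inclusive.
katana_nodes_for_walltime() {
    local hours=$1
    if   [ "$hours" -le 12 ];  then echo "any node in the cluster"
    # The 48 hour and 100 hour tiers give access to the same set of nodes.
    elif [ "$hours" -le 100 ]; then echo "shared nodes plus your nodes"
    elif [ "$hours" -le 200 ]; then echo "your nodes"
    else
        echo "over the 200 hour maximum" >&2
        return 1
    fi
}

katana_nodes_for_walltime 12
katana_nodes_for_walltime 72
katana_nodes_for_walltime 150
```

The practical point is that a shorter walltime request gives the scheduler more nodes to choose from.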

NCI – You need to specify the queue that you want to use. The easy way to list the queues is to run “nci_account”.

• Visit https://opus.nci.org.au

• All nodes within each queue are identical.

• No production runs in Express queue.

• Special queues for big memory, KNL, GPU, etc. Check the queue status.


Nodes, CPU cores and Memory

• Katana - nodes=1:ppn=1 – 1 CPU core on 1 compute node

• Katana - nodes=1:ppn=12 – 12 CPU cores on 1 compute node

• Katana - nodes=2:ppn=12 – 12 CPU cores on each of 2 separate compute nodes. Don’t use this unless you know what MPI is. If you do, speak to us.

• At NCI, selecting the number of nodes happens automatically via ncpus. DO NOT SPECIFY NODES. Make sure you know how many cores the nodes in the queue have.

• Read the module help to see if there is a default number of cores (often called CPUs). Set the number of cores to match your job request. Sometimes it is better to have 1 CPU core for overhead.

• There is a full list of compute nodes on the web sites.

• You need to leave some memory for the operating system, i.e. avoid requesting a node's full memory (96, 128, 144, 256 GB, etc.). Request 2 GB less so that the operating system has some capacity.
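The memory advice in the last bullet can be sketched as simple arithmetic. The function name is hypothetical and the 128 GB node is only an example size from the list above.

```shell
#!/bin/bash
# Hypothetical helper: given a node's total memory in GB, print the amount
# to request so that 2 GB is left for the operating system.
mem_request_gb() {
    local node_gb=$1
    local os_reserve_gb=2
    echo $(( node_gb - os_reserve_gb ))
}

# Example: build the Katana memory line for a node with 128 GB.
echo "#PBS -l vmem=$(mem_request_gb 128)GB"
```

For a 128 GB node this prints a request for 126 GB.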

Once a Job has Run

THE END OF JOB EMAIL IS IMPORTANT! READ IT!

How long did your job run for?

If your job runs for less than 20 minutes, combine calculations into one longer job.

If your job has multiple stages then make each stage a different job and chain them together. The easiest way is to have a qsub command at the end of the job but there are many ways of chaining jobs. Speak to us!
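A minimal sketch of the "qsub at the end of the job" approach is shown below. The file name stage2.pbs and the helper function are made up for illustration, and the check for qsub is only there so the sketch can be tried on a machine without PBS.

```shell
#!/bin/bash
# Sketch of the end of a stage-one job script. The file name stage2.pbs is
# a made-up example for the next stage's job script.
submit_next_stage() {
    local next_script=$1
    if command -v qsub >/dev/null 2>&1; then
        # Queue the next stage now that this one has finished.
        qsub "$next_script"
    else
        # Not on a PBS system (for example when testing on a laptop).
        echo "qsub not found; would have submitted $next_script"
    fi
}

# ... the commands for stage one go here ...

# The next stage is only submitted if the script gets this far.
submit_next_stage stage2.pbs
```

Because the submission is the last command, stage two only enters the queue once stage one has finished its work.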

How much memory did you use?

Can you reduce the amount of memory that you request next time?

How much CPU time did you use?

If ppn=6 and cputime = 2 * walltime then don’t bother with more than 1 CPU core.

If you requested 1 core and cputime is less than half of walltime, the job is probably spending its time waiting on file access, so consider using local scratch. Global scratch is for big files.
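The CPU time check can be sketched as a percentage of the CPU capacity that was requested. The function name is hypothetical, and it assumes the CPU time and walltime from the end of job email have already been converted to seconds.

```shell
#!/bin/bash
# Hypothetical helper: percentage of the requested CPU capacity that a job
# actually used. All times are in seconds; the result is rounded down.
cpu_efficiency() {
    local cput=$1 walltime=$2 ncpus=$3
    echo $(( 100 * cput / (walltime * ncpus) ))
}

# ppn=6, one hour of walltime, two hours of CPU time.
cpu_efficiency 7200 3600 6
# One core, one hour of walltime, 25 minutes of CPU time.
cpu_efficiency 1500 3600 1
```

The first case prints 33, meaning two thirds of the requested cores sat idle. The second prints 41, the kind of figure that suggests the job is waiting on something other than the CPU.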

How do I figure out how my job is working (memory, CPU, I/O, etc.)?

https://opus.nci.org.au/display/Help/Debuggers%2C+Profilers+and+Simulators - NCI offers courses

NCI Helpful Links

• User account and project management https://my.nci.org.au

• Wiki page https://opus.nci.org.au

• HelpDesk [email protected] and/or https://help.nci.org.au

• Job History https://usersupport.nci.org.au/report/job_history

• Raijin Live Status http://nci.org.au/user-support/current-job-details/

• Software License Status http://nci.org.au/user-support/getting-help/license-status

Contacts

Data Management and ResData: Outreach Librarians

https://www.library.unsw.edu.au/study/about-unsw-library/contact-us/outreach-librarians

HPC, cloud, specialist storage questions: Research Technology Services

https://research.unsw.edu.au/research-technology-services or email [email protected]

OneDrive, Data Archive, Research Active storage questions: UNSW IT

https://www.it.unsw.edu.au/, 9385 1333, [email protected]

Data Classification issues: Data Governance

https://www.datagovernance.unsw.edu.au/

Research Integrity issues: Your faculty Research Integrity Advisor

Questions

Date post:	14-Jul-2020