Keeping Big Projects Under Control

Updated for 2017-02-15

[web] portal.biohpc.swmed.edu

[email] biohpc-help@utsouthwestern.edu

Overview

When working with large volumes of data, software and code, and coordinating efforts in a team, how can we keep everything under control?

- Keeping Data Under Control: organizing files and analyses, relevant to anyone using BioHPC for analysis work.
- Keeping Software Under Control: working with modules and software environments so that you use a consistent set of software across your project and team.
- Keeping Code Under Control: an overview of some techniques we use that help us keep our large projects, such as the BioHPC portal, under control.

*More advanced training sessions addressing these areas in more depth will be held later in 2017.

1 – Keeping Data Under Control

Arrange data carefully to:

- Avoid losing data
- Protect sensitive data
- Minimize duplicated data
- Make research reproducible
- …

How:

- Plan ahead
- Keep a good structure
- Set proper permissions
- Back up data
- …

1 – Keeping Data Under Control : Arrange Data Carefully

Folder and File Structure

Benefits:

- Easy to find files
- Greatly facilitates sharing a project with others
- Prevents contamination of raw data (proper permissions)

How:

- Define it in advance
- Limit the files you keep to only those that are strictly necessary
- Maintain a logical folder structure. Keep groups of files (raw data, final results) in separate but clearly labeled folders.

[Figure: example folder hierarchy. 1st level: by project. 2nd level: by data type, or by process/step. 3rd level: by user, or by job/date.]

*Tip: use symbolic links to minimize data duplication.
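The symbolic-link tip can be sketched in Python as follows. This is a minimal illustration, not a BioHPC tool: the helper name `link_shared_input` and the `project1/raw_data` layout are hypothetical, and the demo runs in a throwaway temporary directory.

```python
import os
import tempfile


def link_shared_input(source_file, analysis_dir):
    """Symlink source_file into analysis_dir instead of copying it."""
    os.makedirs(analysis_dir, exist_ok=True)
    link_path = os.path.join(analysis_dir, os.path.basename(source_file))
    # lexists() is True even for a dangling link, so we never overwrite one.
    if not os.path.lexists(link_path):
        os.symlink(os.path.abspath(source_file), link_path)
    return link_path


# Demo in a throwaway directory (hypothetical project layout).
root = tempfile.mkdtemp()
raw_dir = os.path.join(root, "project1", "raw_data")
os.makedirs(raw_dir)
raw_file = os.path.join(raw_dir, "sample1.fastq")
with open(raw_file, "w") as handle:
    handle.write("@read1\nACGT\n+\nIIII\n")

# The analysis folder sees the raw file, but only one copy exists on disk.
link = link_shared_input(
    raw_file, os.path.join(root, "project1", "alignment", "user1")
)
print(os.path.islink(link))
```

Because the link points at the original file, deleting or moving the raw data breaks the link, so this works best when raw data lives in a stable, read-only location.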

1 – Keeping Data Under Control : Data Integrity

Data integrity: accuracy & consistency

[Diagram: User, Web site, Database, Storage]

- Users have read access to storage to copy/download data
- Create/Delete/Edit is only allowed from the web site
- The web server updates both the DB and storage at the same time to guarantee data integrity

Training session: Database Design, Jun. 21st

1 – Keeping Data Under Control : Data Security

Data Security

- Securely remove personally identifying information and other restricted data as early as possible
- Principle of least privilege
- Fine-grained permissions with setfacl and getfacl
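As a small illustration of least privilege, the sketch below strips write permission from a raw data file using ordinary Unix mode bits. It does not use ACLs: setfacl and getfacl remain the tools for per-user or per-group rules beyond owner/group/other. The helper name `make_read_only` and the demo file are hypothetical.

```python
import os
import stat
import tempfile


def make_read_only(path):
    """Remove all write bits from path, leaving read/execute bits unchanged."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    write_bits = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH
    os.chmod(path, mode & ~write_bits)
    return stat.S_IMODE(os.stat(path).st_mode)


# Demo on a temporary file standing in for a raw data file.
fd, raw_file = tempfile.mkstemp(suffix=".fastq")
os.close(fd)
os.chmod(raw_file, 0o664)          # start as rw-rw-r--
new_mode = make_read_only(raw_file)
print(oct(new_mode))
```

Removing write access from raw data once it has been deposited makes accidental modification by an analysis script much less likely.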

1 – Keeping Data Under Control : Sharing

Lab/Dept level directories on BioHPC

Intra-department level directories (shared)

New files and folders created inside the intra-department folder can inherit its group ownership.

Tip: if you mv data into the folder, you need to apply the chgrp command to correct the group ownership.
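The sketch below illustrates both points in Python on a Unix system. It assumes group inheritance comes from the standard setgid bit on the directory (how BioHPC configures its shared folders is not stated on the slide), and the helper names are hypothetical. Moving a file keeps its original group, which is why the chgrp step is needed.

```python
import os
import shutil
import stat
import tempfile


def enable_group_inheritance(directory):
    """Set the setgid bit so new entries inherit the directory's group."""
    mode = stat.S_IMODE(os.stat(directory).st_mode)
    os.chmod(directory, mode | stat.S_ISGID)


def move_into_shared(src, shared_dir):
    """Move src into shared_dir, then fix its group (the chgrp step)."""
    dest = shutil.move(src, shared_dir)
    # -1 leaves the owner unchanged; only the group is updated.
    os.chown(dest, -1, os.stat(shared_dir).st_gid)
    return dest


# Demo in a throwaway directory standing in for a shared folder.
root = tempfile.mkdtemp()
shared = os.path.join(root, "shared")
os.mkdir(shared)
enable_group_inheritance(shared)

outside = os.path.join(root, "results.txt")
with open(outside, "w") as handle:
    handle.write("result\n")

moved = move_into_shared(outside, shared)
print(os.stat(moved).st_gid == os.stat(shared).st_gid)
```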

1 – Keeping Data Under Control : Backup

http://www.backup4all.com/kb/incremental-backup-118.html

Mirror/Full backup is the starting point for all other backups and contains all the data in the folders and files that are selected to be backed up.

/home2 (Mondays & Wednesdays)

/work (Fridays)

Incremental backup provides a faster method of backing up data than repeatedly running full/mirror backups.

/project (upon request)

What data should be backed up? How often?
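The difference between a full and an incremental backup can be sketched in a few lines of Python. This is only an illustration of the idea, not how BioHPC's backup system works: the function name `incremental_backup` and the use of file modification times to detect changes are assumptions made for the example.

```python
import os
import shutil
import tempfile
import time


def incremental_backup(source_dir, backup_dir, last_backup_time):
    """Copy files modified after last_backup_time; return their relative paths."""
    copied = []
    for dirpath, _dirnames, filenames in os.walk(source_dir):
        for name in filenames:
            src = os.path.join(dirpath, name)
            if os.path.getmtime(src) > last_backup_time:
                rel = os.path.relpath(src, source_dir)
                dest = os.path.join(backup_dir, rel)
                os.makedirs(os.path.dirname(dest), exist_ok=True)
                shutil.copy2(src, dest)  # copy2 also preserves timestamps
                copied.append(rel)
    return sorted(copied)


# Demo: one file last changed two hours ago, one changed just now.
src = tempfile.mkdtemp()
dst = tempfile.mkdtemp()
for name in ("old.txt", "new.txt"):
    with open(os.path.join(src, name), "w") as handle:
        handle.write(name + "\n")
now = time.time()
os.utime(os.path.join(src, "old.txt"), (now - 7200, now - 7200))

# Pretend the last backup ran one hour ago: only new.txt is copied.
changed = incremental_backup(src, dst, last_backup_time=now - 3600)
print(changed)
```

Passing `last_backup_time=0` copies everything, which corresponds to the full backup that the incremental runs build on.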

2 – Keeping Software Package Under Control

Difficulties

Versions

Dependencies

How to reproduce a research project?

Install everything from scratch? (Difficult and time consuming)

Solutions

Environment modules (partially)

Software Environments with Conda

2 – Keeping Software Package Under Control : Environment Modules

Environment Modules

- Provides for the dynamic modification of a user's environment via modulefiles
- Modules can be loaded and unloaded dynamically and atomically, in a clean fashion
- Keep different versions

Set up a private module folder under your home directory:

module load use.own
module avail

Module files are defined in ~/privatemodules/<software module>/

2 – Keeping Software Package Under Control : Environment Modules

Issue: incompatibility between software/modules

[Figure labels: V1.7, V1.8]

- Still need to install everything from scratch
- Still need to manually resolve dependency issues

2 – Keeping Software Package Under Control : Software Environments with Conda

- Package manager
- Cross-platform
- Open source, BSD license
- Created for Python programs but can package and manage any software
- Does not require administrator privileges

Traditional package managers: apt-get, yum, homebrew, pip, etc.

2 – Keeping Software Package Under Control : Software Environments with Conda

- conda install: Install a package
- conda remove: Remove a package
- conda update: Update a package
- conda list: List installed packages
- conda create: Create a new conda environment
- conda search: Search for a package
- source activate: Activate a conda environment
- source deactivate: Deactivate the current conda environment

* Packages are hard-linked into the environment to save disk space.

2 – Keeping Software Package Under Control : Software Environments with Conda

Try: update matplotlib to the newest version (1.5.0 -> 2.0.0)

Problem

Solution

Create an isolated conda environment to have your own set of installed and managed packages

2 – Keeping Software Package Under Control : Software Environments with Conda

Conda environment is installed in your home directory

Create a conda environment named test1 with latest anaconda package

2 – Keeping Software Package Under Control : Software Environments with Conda

Use the environment you just created

Install packages in the conda environment

2 – Keeping Software Package Under Control : Software Environments with Conda

User-defined packages

- anaconda search <software name>: list all user-defined packages
- anaconda show <user defined package name>: detailed info on a certain package

2 – Keeping Software Package Under Control : Software Environments with Conda

conda-forge: a community-led conda channel on GitHub

https://conda-forge.github.io/feedstocks

Bioconda: a channel for the conda package manager specializing in bioinformatics software

conda config --add channels conda-forge

conda config --add channels defaults

conda config --add channels r

conda config --add channels bioconda

Channel order: if you add multiple channels, the most recently added one has the highest priority.

Training session: Python, Nov. 15th
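To make the channel-order rule concrete, here is a toy Python model of what the four `conda config --add channels` commands do to the channel list. It does not call conda; `add_channel` is a hypothetical helper that only mimics the behaviour of putting the newly added channel at the front of the list.

```python
def add_channel(channels, name):
    """Model `conda config --add channels NAME`: the new channel goes first."""
    return [name] + [c for c in channels if c != name]


channels = []
for name in ["conda-forge", "defaults", "r", "bioconda"]:
    channels = add_channel(channels, name)

# Highest priority first.
print(channels)
```

With the commands in the order shown on the slide, bioconda ends up with the highest priority and conda-forge with the lowest.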

2 – Keeping Software Package Under Control : Software Environments with Conda

Using R with conda

Install all of the most popular packages with all of their dependencies:
conda install -c r r-essentials

Update all of the packages and their dependencies with one command:
conda update -c r r-essentials

Update a single package in R-Essentials:
conda update r-<package name>

Training session: Parallel R, Oct. 18th

3 – Keeping Code Under Control

Keeping large code projects under control requires consistency and modularity:

- A common development environment that’s easy to set up for new developers
- A version control strategy that can track and integrate everyone’s changes
- Modules of code that can be extended/bug-fixed without affecting other areas

Lots of tools available to help! Some we use:

A Great Reference

https://software-carpentry.org/lessons/

General lessons on Linux, Git,

Principles of good coding in multiple languages: Python, R, MATLAB

Focus on reproducible and open science

Case Study – BioHPC Portal

A large Python/Django project

1,237 code/config/doc files

500 commits

Complex interactions:

- SLURM Scheduler
- Usage Accounting
- Quotas
- User accounts
- Interactive Sessions

But – structured so new team members have a portal task as their first coding job

Modularity – a complex project, of many small applications

[Diagram: portal applications. Zinnia (news), Django-CMS (content), accounts, modules, otrs, sbatch, terminal, utils, quota plugin, slurmplugin, usage plugin, Celery scheduler, Templates]

Code Re-use

sbatch

- Provides web job submission
- Routines to submit jobs to the cluster

terminal

- Provides webGUI, Desktop etc.
- Can re-use cluster functions

Keeping Track of Changes

Use a structured Git workflow – keep the stable version safe, allow easy merging

- A stable branch with the live version
- Feature branches for each person adding a feature
- A develop branch to merge and test features

Git basics: April 17th

Advanced: July 19th

Complexity – External Interactions

[Diagram: the portal applications from the Modularity slide, connected to external systems: Nucleus Cluster, Accounting System, User Directory, Storage, Modules, Tickets, Web VNC]

Consistency and Ease of Entry

The BioHPC team can start developing with just:

git clone [email protected]:biohpc/biohpc_portal.git
cd biohpc_portal
vagrant up

This creates a development virtual machine with an emulated cluster, user directory etc.

Same everywhere – BioHPC workstation, laptop, at home…

Demo 1 - Vagrant

Vagrantfile and Provisioning Scripts

Defensive Coding – Assume Nothing! (or as little as possible)

Don’t trust inputs to be valid: check type, size, range etc.

Your code can be more reliable if you don’t trust the outside world.

A Norwegian woman mistyped her account number on an internet banking system. Instead of typing her 11-digit account number, she accidentally typed an extra digit, for a total of 12 numbers. The system discarded the extra digit, and transferred $100,000 to the (incorrect) account. A simple dialog box informing her that she had typed too many digits would have helped avoid this expensive error.

Olsen, Kai. “The $100,000 Keying error” IEEE Computer, August 2008
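A minimal Python sketch of the kind of check that would have caught the error in the story: reject input of the wrong length instead of silently truncating it. The function name and constant are hypothetical; the 11-digit length is taken from the story above.

```python
ACCOUNT_NUMBER_LENGTH = 11  # length taken from the story above


def validate_account_number(text):
    """Return the account number unchanged, or raise if it is malformed."""
    if not isinstance(text, str):
        raise TypeError("account number must be a string")
    cleaned = text.strip()
    if not cleaned.isdigit():
        raise ValueError("account number must contain digits only")
    if len(cleaned) != ACCOUNT_NUMBER_LENGTH:
        # Report the problem; never discard the extra digit.
        raise ValueError(
            "account number must have %d digits, got %d"
            % (ACCOUNT_NUMBER_LENGTH, len(cleaned))
        )
    return cleaned


try:
    validate_account_number("123456789012")  # 12 digits: one too many
except ValueError as err:
    print("Rejected:", err)
```

The check covers type, content and size in turn, so the caller gets a specific message for each kind of mistake.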

Assertions – a quick way to capture invalid input

You don’t always need to code complex checks and user feedback.

If it’s just important that you catch invalid data, use assertions.

[Slide shows Python and Matlab examples from Software Carpentry lessons]
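A minimal Python sketch of assertions used as pre- and post-conditions. This is not the Software Carpentry example shown on the slide; the `normalize` function is a hypothetical stand-in.

```python
def normalize(values):
    """Scale a list of non-negative numbers so that the largest becomes 1.0."""
    # Pre-conditions: fail fast on input the function cannot handle.
    assert len(values) > 0, "values must not be empty"
    assert all(v >= 0 for v in values), "values must be non-negative"
    largest = max(values)
    assert largest > 0, "at least one value must be positive"

    result = [v / largest for v in values]

    # Post-condition: a cheap check that the result is what we promised.
    assert max(result) == 1.0, "maximum of the result should be 1.0"
    return result


print(normalize([2, 4, 8]))
```

Note that Python skips assert statements when run with the -O option, so assertions suit internal sanity checks rather than validation of user-facing input.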

Defensive Coding – Assume Nothing! (or as little as possible)

Don’t trust files to exist (or not exist): always write sanity checks!

Check early, check often!

Note:

- Use a function to construct paths (safe across Linux, Mac, Windows)
- Check the path is valid when we construct it; don’t defer until use
- Use the language’s own exceptions, and catch them elsewhere
- Debug-level logging: speculative, but turns out to be useful
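The four notes can be sketched together in Python as below. The code shown on the original slide is not in the transcript, so this is a reconstruction of the idea under assumptions: `build_input_path` is a hypothetical helper and the demo uses a temporary directory.

```python
import logging
import os
import tempfile

logger = logging.getLogger(__name__)


def build_input_path(base_dir, *parts):
    """Join path components portably and verify the file exists right away."""
    path = os.path.join(base_dir, *parts)   # portable across operating systems
    logger.debug("Constructed input path: %s", path)
    if not os.path.isfile(path):
        # Use the language's own exception; the caller decides how to react.
        raise FileNotFoundError(path)
    return path


# Demo: one file that exists, one that does not.
workdir = tempfile.mkdtemp()
with open(os.path.join(workdir, "reads.fastq"), "w") as handle:
    handle.write("@read1\nACGT\n+\nIIII\n")

print(build_input_path(workdir, "reads.fastq"))
try:
    build_input_path(workdir, "missing.fastq")
except FileNotFoundError as err:
    print("Missing input:", err)
```

Checking at construction time means a long analysis fails in its first second rather than hours later when the file is finally opened.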

Defensive Coding – Assume Nothing! (or as little as possible)

Don’t trust external calls to work: try and catch

Calling external programs is a classic point of failure in bioinformatics code.

Note:

- Using a call that collects output from the command
- Catching errors from the called command and the more general OSError
- Raising the error to a higher-level try-catch with a message that makes sense to the end user
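A minimal Python sketch of the pattern in these notes, using `subprocess.check_output` from the standard library. The slide's original code is not in the transcript; `run_tool` is a hypothetical wrapper, and the demo calls the Python interpreter itself so that it runs anywhere.

```python
import subprocess
import sys


def run_tool(command):
    """Run command (a list of arguments) and return its combined output as text."""
    try:
        # check_output collects the output and raises on a non-zero exit code.
        output = subprocess.check_output(command, stderr=subprocess.STDOUT)
    except subprocess.CalledProcessError as err:
        details = err.output.decode(errors="replace").strip()
        raise RuntimeError(
            "'%s' failed with exit code %d: %s"
            % (command[0], err.returncode, details)
        ) from err
    except OSError as err:
        # Covers a missing executable, permission problems and similar.
        raise RuntimeError("could not start '%s': %s" % (command[0], err)) from err
    return output.decode(errors="replace")


print(run_tool([sys.executable, "-c", "print('hello')"]).strip())

try:
    run_tool(["no-such-aligner-installed-here"])
except RuntimeError as err:
    print("Reported to the user:", err)
```

Both failure modes surface as one exception type with a readable message, so the top-level code needs only a single handler.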

Testing & CI

Automated testing means you can check if changes break things easily

We provide GitLab CI so you can do this with your code

Case Study – BioHPC param_runner tool

Demo 2 – Overview of git.biohpc.swmed.edu CI

GitLab CI: July 19th

Test Code & Test Runner

In this case we are using pytest to write and run tests

Demo 3 – pytest run
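For readers without access to the demo, here is a small example of what pytest-style tests look like. It is not taken from param_runner: `parameter_grid` is a hypothetical function written for this illustration. pytest collects functions whose names start with `test_` and reports any failing assert.

```python
import itertools


def parameter_grid(options):
    """Return every combination of the given parameter values as a list of dicts."""
    names = sorted(options)
    combos = itertools.product(*(options[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]


def test_grid_size():
    grid = parameter_grid({"threads": [1, 2], "kmer": [21, 31, 41]})
    assert len(grid) == 6


def test_grid_contains_expected_combination():
    grid = parameter_grid({"threads": [1, 2], "kmer": [21, 31, 41]})
    assert {"kmer": 31, "threads": 2} in grid


def test_empty_options_give_single_empty_combination():
    assert parameter_grid({}) == [{}]
```

Saved as a file such as test_grid.py, these tests run with the `pytest` command in the same directory; the same command can then be used as the test step of a CI job.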

2017 Training Program – Coding & Managing Projects

3/15/17 Monitoring and troubleshooting on BioHPC

4/12/17 Git on BioHPC

5/17/17 Parallel Programming in Matlab on BioHPC (MDCS and parallel tool box)

6/21/17 Database System Design

7/19/17 Managing Software Projects in a Team

9/20/17 Data Security and Management

10/18/17 Parallel R/R on BioHPC

11/15/17 Python on BioHPC

