
Post on 21-May-2020


Keeping Big Projects Under Control

Updated for 2017-02-15

[web] portal.biohpc.swmed.edu

[email] biohpc-help@utsouthwestern.edu

Overview

When working with large volumes of data, software, and code, and coordinating efforts in a team, how can we keep everything under control?

- Keeping Data Under Control: organizing files and analyses, relevant to anyone using BioHPC for analysis work.

- Keeping Software Under Control: working with modules and software environments so that analyses use a consistent set of software across your project and team.

- Keeping Code Under Control: an overview of some techniques we use to keep our large projects, such as the BioHPC portal, under control.

*More advanced training sessions addressing these areas in more depth will be held later in 2017.

1 – Keeping Data Under Control

Arrange data carefully to:

- Avoid losing data
- Protect sensitive data
- Minimize duplicated data
- Make research reproducible
- …

How:

- Plan ahead
- Keep good structure
- Set proper permissions
- Back up data
- …

1 – Keeping Data Under Control : Arrange Data Carefully

Folder and File Structure

Benefits:

- Easy to find files
- Greatly facilitates sharing a project with others
- Prevents contamination of raw data (proper permissions)

Guidelines:

- Define it in advance
- Limit the files you keep to only those that are strictly necessary
- Maintain a logical folder structure: keep groups of files (raw data, final results) in separate but clearly labeled folders

Example hierarchies (from the slide diagram): 1st level by project; 2nd level by data type or by process/step; 3rd level by user or by job/date

*Tip: use symbolic links to minimize data duplication
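A minimal Python sketch of this idea: it creates a project-first folder layout and links to a raw file instead of copying it. The folder names (`raw_data`, `analysis`, `results`) and both helper functions are hypothetical, chosen only for illustration.

```python
import os
from pathlib import Path


def create_project_layout(root, project, data_types=("raw_data", "analysis", "results")):
    """Create <root>/<project>/<data_type>/ folders and return the project path."""
    project_dir = Path(root) / project
    for data_type in data_types:
        (project_dir / data_type).mkdir(parents=True, exist_ok=True)
    return project_dir


def link_instead_of_copy(source, link_path):
    """Point link_path at source with a symbolic link rather than duplicating the file."""
    link_path = Path(link_path)
    if link_path.is_symlink() or link_path.exists():
        raise FileExistsError(f"{link_path} already exists")
    os.symlink(Path(source).resolve(), link_path)
    return link_path
```

For example, `link_instead_of_copy(project / "raw_data" / "sample.txt", project / "analysis" / "sample.txt")` makes the raw file visible in the analysis folder without a second copy on disk.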

1 – Keeping Data Under Control : Data Integrity

Data integrity: accuracy & consistency

[Diagram components: User, Web site, Database, Storage]

Related training: Database Design, Jun. 21st

- User has read access at storage for copy/download data

- Create/Delete/Edit only allowed from web

- Web server will update both DB & Storage at the same time to guarantee the data integrity.
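One way to picture "update both DB & Storage at the same time" is a single unit of work that either records the file in the database and writes it to storage, or does neither. The sketch below uses SQLite and a local directory as stand-ins; the table name, columns and function names are assumptions, not the portal's actual implementation.

```python
import os
import sqlite3


def open_catalog(db_path):
    """Open (or create) a small catalog database with one row per stored file."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files (name TEXT PRIMARY KEY, size INTEGER NOT NULL)"
    )
    conn.commit()
    return conn


def add_file(conn, storage_dir, name, content):
    """Record a file in the database and write it to storage as one unit of work."""
    path = os.path.join(storage_dir, name)
    created = False
    try:
        # 'with conn' commits on success and rolls back if an exception escapes.
        with conn:
            conn.execute(
                "INSERT INTO files (name, size) VALUES (?, ?)", (name, len(content))
            )
            # Mode 'xb' refuses to overwrite a file that already exists.
            with open(path, "xb") as handle:
                created = True
                handle.write(content)
    except Exception:
        if created:
            os.remove(path)  # undo a partial write made by this call
        raise
    return path
```

If the write to storage fails, the database insert is rolled back, so the two never disagree about which files exist.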

1 – Keeping Data Under Control : Data Security

Data security:

- Securely remove personally identifying information and other restricted data as early as possible
- Follow the principle of least privilege
- Fine-grained permissions with setfacl and getfacl
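As a rough illustration of least privilege with ACLs, the sketch below builds `setfacl` and `getfacl` command lines in Python. It relies only on the standard `-m` (modify) and `-R` (recursive) options of `setfacl`; the user name and path in the usage note are hypothetical.

```python
import subprocess


def setfacl_command(path, user, perms="r-x", recursive=False):
    """Build a setfacl command granting `user` the given permissions on `path`."""
    if not perms or not set(perms) <= set("rwx-"):
        raise ValueError(f"invalid permission string: {perms!r}")
    command = ["setfacl"]
    if recursive:
        command.append("-R")
    command += ["-m", f"u:{user}:{perms}", path]
    return command


def getfacl_command(path):
    """Build a getfacl command that lists the ACL entries on `path`."""
    return ["getfacl", path]


def run(command):
    """Run a command and return its output; raises if the command fails."""
    return subprocess.check_output(command, universal_newlines=True)
```

For example, `run(setfacl_command("/project/lab/data", "alice"))` would give one collaborator read-only access without opening the folder to the whole group.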

1 – Keeping Data Under Control : Sharing


Lab/Dept level directories on BioHPC

Intra-department level directories

shared

The group ownership can be inherited by new files and folders created inside the intra-department folder

Tip: if you mv data into the folder, the files keep their original group, so you need to apply the chgrp command to correct the group ownership
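A small Python sketch of that check: it lists entries whose group differs from the shared folder's group and resets them, which is what `chgrp` would do by hand. It assumes group inheritance is implemented with the setgid bit on the shared directory; the function names are hypothetical.

```python
import os
import shutil
import stat


def inherits_group(directory):
    """True if the directory has the setgid bit, so new entries inherit its group."""
    return bool(os.stat(directory).st_mode & stat.S_ISGID)


def find_wrong_group(shared_dir):
    """Return paths under shared_dir whose group differs from the folder's group."""
    expected_gid = os.stat(shared_dir).st_gid
    mismatched = []
    for dirpath, dirnames, filenames in os.walk(shared_dir):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            if os.lstat(path).st_gid != expected_gid:
                mismatched.append(path)
    return mismatched


def fix_group(shared_dir):
    """Set the group of every mismatched path to the shared folder's group."""
    expected_gid = os.stat(shared_dir).st_gid
    fixed = find_wrong_group(shared_dir)
    for path in fixed:
        shutil.chown(path, group=expected_gid)
    return fixed
```

Running `find_wrong_group` after moving data in shows exactly which files still carry the old group before anything is changed.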

1 – Keeping Data Under Control : Backup

http://www.backup4all.com/kb/incremental-backup-118.html

Mirror/Full backup is the starting point for all other backups and contains all the data in the folders and files that are selected to be backed up.

- /home2 (Mondays & Wednesdays)
- /work (Fridays)

Incremental backup provides a faster method of backing up data than repeatedly running full/mirror backups.

- /project (upon request)

What data should be backed up? How often?
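To make the difference concrete, here is a minimal sketch of an incremental pass: only files modified since the previous backup are copied. This is an illustration of the concept, not how BioHPC's backup system works; the function name and timestamp convention are assumptions.

```python
import os
import shutil


def incremental_backup(source_dir, backup_dir, last_backup_time):
    """Copy only files modified after last_backup_time (seconds since the epoch)."""
    copied = []
    for dirpath, _dirnames, filenames in os.walk(source_dir):
        for name in filenames:
            source = os.path.join(dirpath, name)
            if os.path.getmtime(source) <= last_backup_time:
                continue  # unchanged since the last backup
            relative = os.path.relpath(source, source_dir)
            target = os.path.join(backup_dir, relative)
            os.makedirs(os.path.dirname(target), exist_ok=True)
            shutil.copy2(source, target)  # copy2 also preserves timestamps
            copied.append(relative)
    return sorted(copied)
```

A full/mirror backup corresponds to calling this with `last_backup_time=0`, which copies everything.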

2 – Keeping Software Package Under Control


Difficulties

Versions

Dependencies

How to reproduce a research project?

Install everything from scratch? (Difficult and time consuming)

Solutions

Environment modules (partially)

Software Environments with Conda

2 – Keeping Software Package Under Control : Environment Modules

Environment Modules:

- Provides for the dynamic modification of a user's environment via modulefiles
- Modules can be loaded and unloaded dynamically and atomically, in a clean fashion
- Lets you keep different versions of the same software

Set up a private module folder under the user's home directory:

module load use.own
module avail

Module files are defined in ~/privatemodules/<software module>/

Issue: incompatibilities between software/modules (e.g. V1.7 vs. V1.8)

- Still need to install everything from scratch
- Still need to manually solve dependency issues

2 – Keeping Software Package Under Control : Software Environments with Conda

Conda:

- Package manager
- Cross-platform
- Open source, BSD license
- Created for Python programs, but can package and manage any software
- Does not require administrator privileges

Traditional package managers: apt-get, yum, homebrew, pip, etc.

conda install: install a package
conda remove: remove a package
conda update: update a package
conda list: list installed packages
conda create: create a new conda environment
conda search: search for a package
source activate: activate a conda environment
source deactivate: deactivate the current conda environment

* Packages are hard-linked into the environment to save disk space


Try: Update matplotlib to newest version (1.5.0->2.0.0)

Problem

Solution

Create an isolated conda environment to have your own set of installed and managed packages

The conda environment is installed in your home directory

Create a conda environment named test1 with latest anaconda package


Use the environment you just created

Install packages in the conda environment

User-defined packages

anaconda search <software name>: list all user-defined packages

anaconda show <user defined package name>: detailed info for a certain package

conda-forge: a community-led conda channel hosted on GitHub

https://conda-forge.github.io/feedstocks

bioconda: a channel for the conda package manager specializing in bioinformatics software

conda config --add channels conda-forge

conda config --add channels defaults

conda config --add channels r

conda config --add channels bioconda

Channel order: if you add multiple channels, the most recently added one has the highest priority.

Related training: Python, Nov. 15th
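The channel-order rule can be worked through with a tiny model of `conda config --add channels`, which puts each newly added channel at the front of the list. The `add_channel` function is only a model of that behaviour, not conda's own code.

```python
def add_channel(channels, name):
    """Model of 'conda config --add channels NAME': the new channel goes first."""
    return [name] + [channel for channel in channels if channel != name]


channels = []
# Same order as the four 'conda config --add channels' commands above.
for name in ["conda-forge", "defaults", "r", "bioconda"]:
    channels = add_channel(channels, name)

print(channels)  # ['bioconda', 'r', 'defaults', 'conda-forge']
```

With the commands in the order shown, bioconda ends up with the highest priority and conda-forge with the lowest.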

Using R with conda

Install all of the most popular packages with all of their dependencies:
conda install -c r r-essentials

Update all of the packages and their dependencies with one command:
conda update -c r r-essentials

Update a single package in R-Essentials:
conda update r-<package name>

Related training: Parallel R, Oct. 18th

3 – Keeping Code Under Control

Keeping large code projects under control requires consistency and modularity:

- A common development environment that’s easy to set up for new developers
- A version control strategy that can track and integrate everyone’s changes
- Modules of code that can be extended/bug-fixed without affecting other areas

Lots of tools are available to help! Some we use:

A Great Reference

https://software-carpentry.org/lessons/

- General lessons on Linux, Git
- Principles of good coding in multiple languages: Python, R, MATLAB
- Focus on reproducible and open science

Case Study – BioHPC Portal

A large Python/Django project

- 1,237 code/config/doc files
- 500 commits

Complex interactions:

- SLURM scheduler
- Usage accounting
- Quotas
- User accounts
- Interactive sessions

But it is structured so that new team members can have a portal task as their first coding job

Modularity – a complex project, of many small applications

[Diagram of portal components: Zinnia (news), Django-CMS (content), accounts, modules, otrs, sbatch, terminal, utils, quota plugin, slurm plugin, usage plugin, Celery scheduler, templates]

Code Re-use

- sbatch: provides web job submission; contains routines to submit jobs to the cluster
- terminal: provides webGUI, Desktop, etc.; can re-use the cluster functions

Keeping Track of Changes

Use a structured Git workflow: keep the stable version safe, allow easy merging

- A stable branch with the live version
- Feature branches for each person adding a feature
- A develop branch to merge and test features

Related training: Git basics, April 17th; Advanced, July 19th

Complexity – External Interactions

[Diagram: the portal components listed under "Modularity" interact with external systems: Nucleus Cluster, Accounting System, User Directory, Storage, Modules, Tickets, Web VNC]

Consistency and Ease of Entry

BioHPC team can start developing with just:

git clone git@git.biohpc.swmed.edu:biohpc/biohpc_portal.git
cd biohpc_portal
vagrant up

This creates a development virtual machine with an emulated cluster, user directory, etc.

Same everywhere: BioHPC workstation, laptop, at home…

Demo 1 - Vagrant

Vagrantfile and Provisioning Scripts


Defensive Coding – Assume Nothing! (or as little as possible)

Don’t trust inputs to be valid: check type, size, range, etc.

Your code can be more reliable if you don’t trust the outside world.

A Norwegian woman mistyped her account number on an internet banking system. Instead of typing her 11-digit account number, she accidentally typed an extra digit, for a total of 12 numbers. The system discarded the extra digit, and transferred $100,000 to the (incorrect) account. A simple dialog box informing her that she had typed too many digits would have helped avoid this expensive error.

Olsen, Kai. “The $100,000 Keying error” IEEE Computer, August 2008
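A minimal sketch of the check that would have caught the keying error: reject input of the wrong length instead of silently truncating it. The function name and the 11-digit default are taken from the story for illustration only.

```python
def parse_account_number(text, length=11):
    """Return the account number if it is exactly `length` digits, else raise."""
    cleaned = text.strip()
    if not (cleaned.isascii() and cleaned.isdigit()):
        raise ValueError("account number must contain digits only")
    if len(cleaned) != length:
        raise ValueError(
            f"account number must have exactly {length} digits, got {len(cleaned)}"
        )
    return cleaned
```

Calling `parse_account_number("123456789012")` raises a `ValueError` that can be shown to the user, rather than dropping the twelfth digit.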

Assertions – a quick way to capture invalid input

You don’t always need to code complex checks and user feedback.

If it’s just important that you catch invalid data, use assertions:

Python

Matlab

Examples from Software Carpentry lessons
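The original slide examples are not reproduced in this transcript. The sketch below is a stand-in written in the same spirit, with a hypothetical function that uses assertions as preconditions on its input.

```python
def mean_read_length(lengths):
    """Return the mean of a non-empty list of positive read lengths."""
    # Preconditions: fail immediately, with a message, on invalid input.
    assert len(lengths) > 0, "need at least one read length"
    assert all(length > 0 for length in lengths), "read lengths must be positive"
    return sum(lengths) / len(lengths)


print(mean_read_length([100, 150]))  # 125.0
```

Note that Python skips `assert` statements when run with the `-O` option, so assertions suit sanity checks during development rather than validation of user input.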

Don’t trust files to exist (or not exist): always write sanity checks!

Check early, check often!

Note:

- Use a function to construct paths (safe across Linux, Mac, Windows)
- Check the path is valid when we construct it; don’t defer until use
- Use the language’s own exceptions, and catch them elsewhere
- Debug-level logging: speculative, but turns out to be useful
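The code from the original slide is not included in this transcript. A minimal sketch of the four points, using a hypothetical `build_input_path` helper, might look like this:

```python
import logging
import os

logger = logging.getLogger(__name__)


def build_input_path(base_dir, *parts):
    """Join path components portably and fail early if the file is missing."""
    # os.path.join uses the correct separator on Linux, Mac and Windows.
    path = os.path.join(base_dir, *parts)
    logger.debug("Constructed input path: %s", path)
    # Check now, when the path is built, not later when it is first used.
    if not os.path.isfile(path):
        # Raise the language's own exception; the caller decides how to handle it.
        raise FileNotFoundError(f"Expected input file does not exist: {path}")
    return path
```

The caller can catch `FileNotFoundError` at a higher level and report it in whatever way suits the application.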

Don’t trust external calls to work: use try and catch

Calling external programs is a classic point of failure in Bioinformatics code

Note:

- Use a call that collects output from the command
- Catch errors from the called command, and the more general OSError
- Raise the error to a higher-level try-catch with a message that makes sense to the end user
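The slide's own code is not part of this transcript. The following sketch shows one way to apply the three points with the standard `subprocess` module; `ToolError` and `run_tool` are hypothetical names.

```python
import subprocess


class ToolError(Exception):
    """Raised when an external program cannot be run or reports failure."""


def run_tool(command):
    """Run an external command (a list of arguments) and return its output as text."""
    try:
        # check_output collects the command's output and raises on a non-zero exit.
        return subprocess.check_output(
            command, stderr=subprocess.STDOUT, universal_newlines=True
        )
    except subprocess.CalledProcessError as error:
        raise ToolError(
            f"'{command[0]}' failed with exit code {error.returncode}:\n{error.output}"
        ) from error
    except OSError as error:
        # Covers a missing executable, permission problems, and similar failures.
        raise ToolError(f"Could not start '{command[0]}': {error}") from error
```

Code higher up can then catch `ToolError` once and show its message to the end user, instead of exposing a raw traceback.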

Testing & CI


Automated testing means you can check if changes break things easily

We provide GitLab CI so you can do this with your code

Case Study – BioHPC param_runner tool

Demo 2 – Overview of git.biohpc.swmed.edu CI

GitLab CI: July 19th

Test Code & Test Runner

In this case we are using pytest to write and run tests

Demo 3 – pytest run
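For readers without access to the demo, here is a small sketch of what pytest-style tests look like. The `expand_parameters` function is a hypothetical example and is not taken from the param_runner code.

```python
def expand_parameters(grid):
    """Return every combination of values from a dict of parameter -> list of values."""
    combinations = [{}]
    for name, values in grid.items():
        combinations = [
            dict(combo, **{name: value}) for combo in combinations for value in values
        ]
    return combinations


# pytest collects functions whose names start with 'test_' and runs their asserts.
def test_expand_parameters_counts_combinations():
    grid = {"threads": [1, 2], "kmer": [21, 31, 41]}
    assert len(expand_parameters(grid)) == 6


def test_expand_parameters_empty_grid():
    assert expand_parameters({}) == [{}]
```

Saved in a file whose name starts with `test_`, these tests run with the `pytest` command, which reports each failing assertion along with the values involved.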

2017 Training Program – Coding & Managing Projects


3/15/17 Monitoring and troubleshooting on BioHPC

4/12/17 Git on BioHPC

5/17/17 Parallel Programming in Matlab on BioHPC (MDCS and parallel tool box)

6/21/17 Database System Design

7/19/17 Managing Software Projects in a Team

9/20/17 Data Security and Management

10/18/17 Parallel R/R on BioHPC

11/15/17 Python on BioHPC