2/8/00 E. Buckley-Geer, CHEP 2000 1
Issues in managing HEP Software Development in a distributed environment
Elizabeth Buckley-Geer
Fermilab
CHEP 2000, Padova, Italy
Contents
Characterizing the problem
Key issues and solutions from CDF/D0 Collider Run II
Some thoughts on the development process
Conclusions
Characterizing the problem
Developer community of about 150 people (both collaborations) from North and South America, Europe, Asia, India, Russia
Widely varying quality of network connections between FNAL and remote locations
Widely varying abilities of groups to afford to purchase commercial tools
Characterizing the problem
One common denominator since mid-1997: everyone can buy a cheap PC and run Linux on it
  No more $10-20K workstations. Every member of the group can have a PC
  They don’t want to rely on connecting to a central machine at FNAL to do code development
  They want to make use of these PCs at their own location to do their code development
First release of CDF code for Linux was January 1998 – several years after the basic development environment was designed
The situation during Run I (CDF - but similar for D0)
Highly centralized code development.
  Could only realistically develop code on central machine at FNAL (VMS cluster) – no distributed development was supported even on other VMS systems
Code was ported to run on IRIX and AIX but only frozen releases were available on these platforms
Frozen releases were distributed to remote sites as tar files or VMS save sets
Development version of the code was available to desktop VMS nodes at FNAL from 1993 onwards but code could not be committed to repository from these machines
Run I development tools
Code was mostly Fortran with some small amounts of C. About 50 packages.
Used proprietary VMS tools for version control and package building (CMS and MMS)
Used vendor compilers and debuggers. Only UNIX vendors who supported VMS extensions were considered. Luckily the list was sufficiently long!
No serious use of design tools – some early attempts at D0 but didn’t survive
No tools to locate memory leaks due to the nature of the memory management packages in use – YBOS and ZEBRA
Goals for Run II development environment – early 1996
Obviously needed to migrate from VMS as a primary platform
Provide ability to do remote development – recognized as important even before the Linux revolution
Reduce the need for proprietary tools for base system
Handle move from Fortran to C++
Identify useful software engineering tools
Configuration Management Joint Project
Formed joint D0, CDF, FNAL Computing Division working group to study configuration management in early 1996 (see E248 for more on Run II joint projects)
Charge was to find and implement a common solution for CDF and D0 for software management
  Version control
  Package and release organization
  Building packages
  Distribution
  Validation
Configuration Management Joint Project
Group looked at existing tools in use in HEP and elsewhere
Chose
  CVS for version control with customizations from Sloan Digital Sky Survey (SDSS)
  SoftRelTools from BaBar for package organization and building
  UPS/UPD from FNAL for product setup and distribution tools
CVS
Run in client/server mode – adopted from SDSS
  Repository on server + cvsuser pseudo account running a restricted shell CVSH that only allows cvs commands to be executed
Local and remote access are identical so users do not need to be on a FNAL computer to access repository – necessary condition for remote development
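The restricted-shell idea can be sketched as a simple command filter. This is not the actual CVSH implementation, which the slides do not describe; the function name `is_allowed` and the policy (first word must be `cvs`, no shell metacharacters) are assumptions made for illustration.

```python
import shlex

# Characters that could chain or redirect to another command (assumed policy).
FORBIDDEN = set(";|&`$<>\n")

def is_allowed(command_line: str) -> bool:
    """Return True only if the line invokes cvs and nothing else.

    Hypothetical policy: reject shell metacharacters, then require that
    the first token is exactly "cvs".
    """
    if any(ch in FORBIDDEN for ch in command_line):
        return False
    try:
        tokens = shlex.split(command_line)
    except ValueError:  # e.g. unbalanced quotes
        return False
    return bool(tokens) and tokens[0] == "cvs"

for line in ["cvs server", "ls -l", "cvs update; rm -rf /"]:
    print(line, "->", "accepted" if is_allowed(line) else "rejected")
```

A real restricted shell would also have to control the environment and the path used to find the cvs binary, which this sketch leaves out.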
SoftRelTools (SRT)
Adapted from BaBar experiment
Uses cpp to create dependencies and gmake to build libraries & binaries
BaBar and FNAL agreed to diverge on development
  It was becoming difficult to add new features given the original structure of the package
Have since done a re-write (Spring 1999) of the package at FNAL to make it more maintainable
UPS – Unix Product Setup
FNAL product in use since 1991
Supports existence of multiple versions of a product. Choice is made using a ‘setup’ command.
Re-write for Run II
  Completed in summer 1998
  In use by both CDF and D0
Use of these tools at CDF
~65 code developers
1.3 million lines of code
  71% C++, 20% Fortran, 8% C, 0.6% Java + external packages
144 packages
Development release built every night on IRIX, TRU64, SUN, Linux
Daily build logs scanned for errors and reported to developers. Build logs are posted on web
Development builds lead to timely detection and fixing of bugs
Create frozen releases about every 2 months. Also create releases to capture code used for certain milestones.
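Scanning a nightly build log for errors can be sketched in a few lines. The slides do not describe the actual CDF scripts, so the pattern, the function name `scan_build_log`, the sample log lines, and the package name `TrackingPkg` are assumptions; real compiler and gmake messages differ between platforms.

```python
import re

# Hypothetical pattern: flag any line containing the word "error" or "fatal".
ERROR_PATTERN = re.compile(r"\b(error|fatal)\b", re.IGNORECASE)

def scan_build_log(lines):
    """Return (line_number, text) pairs for lines that look like errors."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if ERROR_PATTERN.search(line):
            hits.append((number, line.rstrip()))
    return hits

# Invented sample log for illustration.
sample_log = [
    "gmake[1]: Entering directory 'TrackingPkg'",
    "Foo.cc:12: error: 'bar' was not declared",
    "gmake[1]: *** [Foo.o] Error 1",
    "gmake[1]: Leaving directory 'TrackingPkg'",
]

for number, text in scan_build_log(sample_log):
    print(f"line {number}: {text}")
```

The flagged lines could then be grouped by package and mailed to the responsible developers, which is the reporting step the slide mentions.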
Use of these tools at CDF
Success of development rebuild varies. Somewhat correlated with number of files changed
[Figure: % of successful development builds by month, Jul-98 to Sep-99; vertical axis 0 to 90]
Use of these tools at D0
~60 code developers have write access to repository
Essentially 100% C++ except for external packages
280 packages – but big variation in size
Test release of entire package weekly on IRIX and Linux. Goal is to have operational reconstruction exe at the end of every release. Currently 80% success rate.
Production releases occur at intervals determined by the management. Used to capture important milestones and provide stable working versions.
5 production releases to date
Code Distribution
CDF has a set of custom scripts to distribute code to remote sites.
  Both frozen releases and development are distributed
  Fairly straightforward to get distribution.
  Currently fairly manpower intensive for development release on remote nodes – ½ FTE devoted to fixing problems
  Working on switching to UPD for ease of maintenance
No significant automatic code distribution happening in D0 yet
Code Distribution
Majority of distribution is to Linux machines
                  Linux   IRIX   TRU64   Solaris
Development          44      7       3         2
Frozen Release      115     13       6         2
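The Linux majority can be quantified directly from the counts above. The short calculation below gives roughly 79% of development distributions and 85% of frozen-release distributions going to Linux. The slide does not say what unit is being counted, so the numbers are treated simply as counts; the function name `linux_share` is an illustrative choice.

```python
platforms = ["Linux", "IRIX", "TRU64", "Solaris"]

# Counts copied from the slide, in the platform order listed above.
counts = {
    "Development": [44, 7, 3, 2],
    "Frozen Release": [115, 13, 6, 2],
}

def linux_share(row):
    """Fraction of a row's total that falls in the first (Linux) column."""
    return row[0] / sum(row)

for name, row in counts.items():
    print(f"{name}: {linux_share(row):.0%} Linux ({row[0]} of {sum(row)})")
```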
Compilers
We wanted to write code that adhered to the C++ ANSI standard – not get into the Fortran extensions quagmire!
GCC and vendor compilers were not thought sufficiently compliant in summer 1997
Chose KAI compiler from Kuck and Associates
  Compiler was available on the relevant platforms – including Linux
  Has led to issues with availability of KAI versions of external products that must be built with the CDF/D0 software – e.g. we paid for a port of Open Inventor
We still believe it was the right choice at the time but expect to use EGCS and vendor compilers in the future
Debuggers and other tools
Quality of the debugging tools has left a lot to be desired
  This was one of the few downsides of choosing KAI. Things have been particularly problematic on Linux
  Have purchased TotalView which is in use on IRIX and will shortly be available for Linux – seems to improve the situation
CASE tools – used GDPro and Rational Rose
  Mostly used to document design – did not use automatic code generation features
Purify and Insure++ used to look for memory leaks – but not currently available for Linux
Licensed products
Has been very beneficial to negotiate license agreements that cover use of a product by all Run II developers independent of their location
  Have done this with KAI, Open Inventor
  Get better price - all licenses must be ordered through Fermilab
Thoughts on the development process
Borrowing from the terminology and observations presented in “The Cathedral and the Bazaar” by Eric Raymond – O’Reilly Books
Our code is clearly Open Source because (by and large) it is freely available to anyone who wants to use it from another experiment
However, both CDF and D0 software projects are run using the traditional “cathedral” style of software development
This is necessitated by the requirements to provide schedules, obtain manpower resources from a limited pool, meet milestones and convince review committees that you know what you are doing
We can make some comparisons between aspects of the Open Source (aka Linux) model and what we are doing in HEP
Thoughts on the development process
“Treat your users as co-developers”
Two user communities in an experiment
  Those working on the software project – programmers and physicists
  The rest of the experiment – the physicist-user
The first group tends to be like the Linux community – working on the project because they are interested in the problem and want to improve the product
The second group just want to use the software to get physics results – they want to improve their physics analysis software but not the infrastructure
Thoughts on the development process
“Release early, release often”
CDF has shown that this leads to more timely bug fixes and shorter integration time and is very desirable for the project developers
However, it drives the physicist-user to distraction because he/she just wants something that works!
Have to have stable frozen releases in addition
Thoughts on the development process
Some of the skills necessary to co-ordinate a successful Open Source project are relevant to managing an HEP computing project
  Must have good people and communication skills
  Need to be able to attract people to the project and keep them interested and happy
  These can often be more important than possessing great technical prowess
It often feels like we are in a bazaar rather than a cathedral!
Conclusions
CDF and D0 are successfully managing their software development projects with ~ 60 – 70 developers per experiment and 1 million lines of C++ each
We are expected to have schedules, milestones and reviews which makes it unlikely that we can ever manage a project using the bazaar model
However, some of the Open Source concepts are applicable to HEP projects
Use of these tools at CDF
On days when the development release builds successfully we create a rawhide release. This satisfies developers who need up-to-date code but also need the whole release to actually build