The technology of the Human Protein Reference Database (draft, 2003)

Date post: 08-Jul-2015
Upload: kiran-jonnalagadda
Description:
Between 2002 and 2004, I managed the technology team that built the Human Protein Reference Database (http://hprd.org) at the Institute of Bioinformatics in Bangalore and Johns Hopkins University in Baltimore. These are my notes on the tech from sometime in 2003, rediscovered in 2014 when I was looking through old files.
Human Protein Reference Database An analysis of the technology powering the database and website, and how it was developed. Kiran Jonnalagadda
Transcript
Page 1: The technology of the Human Protein Reference Database (draft, 2003)

Human Protein Reference Database

An analysis of the technology powering the database and website,

and how it was developed.

Kiran Jonnalagadda

Facts About HPRD

• HPRD is a database of all disease-causing proteins in the human body.

• It is the most comprehensive database of its kind in the world today.

• Unlike most other biological databases, HPRD is protein-centric, not gene-centric.

Factors Leading to Choice of DB

• The biologists hadn’t settled on what information was to be stored and therefore the data type definitions changed often.

• Several data types were fairly similar to others but not the same.

• Future extensions had to be built by tech-savvy biologists with minimal assistance from programmers.

What We Used

• The Zope application server, comprising:
  – The Web publishing object framework.
  – ZODB, the object database storage system.
  – ZCatalog, the indexing and search system.
  – ZEO, the stand-alone database server for multiple front-end Web servers.

Why an RDBMS Was Not Suited

• Data type definitions changed frequently. In an RDBMS, this would have meant redefining tables every week.

• The code currently has about forty data classes. Imagine having that many data tables, plus tables for relationships between them, all under frequent revision.

How Zope Handled These Issues

• Zope is built on Python, which offers dynamic data structures.

• ZODB uses this ability to make the entire database look like one large data structure, transparently swapping unused parts to disk and recovering them as needed.

• ZCatalog indexes data for searching.
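
As a rough illustration of the "one large data structure" idea, the sketch below uses Python's standard-library shelve module as a stand-in for ZODB. It is not ZODB and not HPRD's code; the file name and the record contents are hypothetical. It only shows a store that reads like a dictionary while the values live on disk and are loaded when accessed.

```python
import os
import shelve
import tempfile

# Hypothetical location; shelve stands in for ZODB here (an assumption).
path = os.path.join(tempfile.mkdtemp(), "proteins")

# Write: the store behaves like an ordinary dictionary.
with shelve.open(path) as db:
    db["P1"] = {"name": "example protein", "interactions": ["P2"]}
    db["P2"] = {"name": "another protein", "interactions": []}

# Read: reopen the store; each record is loaded from disk when it is accessed.
with shelve.open(path) as db:
    record = db["P1"]
    partner = db[record["interactions"][0]]

print(record["name"], "->", partner["name"])
```

Unlike this stand-in, ZODB also tracks changes to stored objects and wraps them in transactions.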

At Zope’s Core is Python

• Python is a dynamic language.

• When I say dynamic, I mean everything is dynamic!

• Code, variables, classes, modules, everything can be modified at run-time.

• Most of Zope is built around this ability. Zope could not have been implemented in another language.
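
A minimal sketch of what run-time modification looks like, written for present-day Python 3 rather than the Python of 2003. The Protein class and its attributes are hypothetical and are not taken from the HPRD code.

```python
class Protein:
    """Hypothetical data class, used only for illustration."""

    def __init__(self, name):
        self.name = name


p = Protein("example")

# Add a new field to one instance at run-time.
p.localization = "nucleus"


# Add a new method to the class at run-time; existing instances see it too.
def describe(self):
    return f"{self.name} ({getattr(self, 'localization', 'unknown')})"


Protein.describe = describe
first = p.describe()

# Replace the method while the program is running.
Protein.describe = lambda self: self.name.upper()
second = p.describe()

print(first, "|", second)
```

The instance created before either change picks up both versions of the method, because attribute lookup happens at call time.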

Data Storage in Zope

• In Zope, data is stored in instances of a data class.

• The data class has variables, which are like fields, and methods, which manipulate data.

• Instances of a data class (objects) are stored in the ZODB, making the database.

• Objects can contain other objects, forming hierarchies.
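
The shape described above can be sketched in plain Python. The class names, ids and the sample sequence are hypothetical, and the sketch leaves out persistence entirely; it only shows fields, methods and containment.

```python
class Folder:
    """Hypothetical container: holds child objects by id."""

    def __init__(self, id):
        self.id = id
        self.children = {}

    def add(self, obj):
        self.children[obj.id] = obj
        return obj


class Protein(Folder):
    """Hypothetical data class: variables act as fields, methods work on them."""

    def __init__(self, id, sequence):
        super().__init__(id)
        self.sequence = sequence

    def length(self):
        return len(self.sequence)


# Objects contain other objects, forming a hierarchy.
root = Folder("root")
protein = root.add(Protein("P1", "MKTAYIAK"))
protein.add(Folder("interactions"))

print(root.children["P1"].length())
print(list(root.children["P1"].children))
```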

Components of Zope

• ZServer (formerly Medusa)
  – Handles incoming requests.
  – Does HTTP, FTP, WebDAV, XML-RPC; soon SOAP.

• ZPublisher
  – Maps URLs to objects and handles security.

• ZODB (Zope Object DataBase)
  – Stores objects on disk in a transactional DB.

• ZEO (Zope Enterprise Objects)
  – ZODB server for multiple Zope front-end servers.
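
Mapping URLs to objects can be pictured as walking the object hierarchy one path segment at a time. The sketch below is a simplified illustration of that idea, not ZPublisher's actual algorithm; the Node class, the traverse function and the example paths are hypothetical, and security checks are omitted.

```python
class Node:
    """Hypothetical published object with named children."""

    def __init__(self, title):
        self.title = title
        self.children = {}


def traverse(root, url_path):
    """Walk the hierarchy segment by segment; return None if a segment is missing."""
    obj = root
    for segment in url_path.strip("/").split("/"):
        if not segment:
            continue  # ignore empty segments, e.g. for the path "/"
        obj = obj.children.get(segment)
        if obj is None:
            return None
    return obj


root = Node("HPRD")
root.children["proteins"] = Node("Proteins")
root.children["proteins"].children["P1"] = Node("Example protein")

found = traverse(root, "/proteins/P1")
missing = traverse(root, "/proteins/P9")
print(found.title, missing)
```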

Security in Zope

• Security is fine grained.

• Security is defined around four concepts:
  – Users, Roles, Permissions and Hierarchies.

• A user is assigned one or more roles.

• A role is assigned a set of permissions.

• This set can be reassigned at different positions in the hierarchy.
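
The four concepts can be sketched as a lookup that starts at an object and walks up the hierarchy until it finds an assignment for the role. This is a simplified model of the description above, and Zope's real implementation differs in detail; the class, function, role and permission names are all hypothetical.

```python
class Node:
    """Hypothetical position in the object hierarchy."""

    def __init__(self, parent=None, role_permissions=None):
        self.parent = parent
        # Role -> set of permissions, as (re)assigned at this position only.
        self.role_permissions = role_permissions or {}

    def permissions_for(self, role):
        """Use the nearest assignment for the role, walking up the hierarchy."""
        node = self
        while node is not None:
            if role in node.role_permissions:
                return node.role_permissions[role]
            node = node.parent
        return set()


def allowed(user_roles, node, permission):
    """A user is allowed if any of their roles carries the permission here."""
    return any(permission in node.permissions_for(r) for r in user_roles)


root = Node(role_permissions={"Annotator": {"View", "Edit"}, "Anonymous": {"View"}})
# Reassign the Anonymous role's permission set lower in the hierarchy.
drafts = Node(parent=root, role_permissions={"Anonymous": set()})

print(allowed(["Anonymous"], root, "View"))
print(allowed(["Anonymous"], drafts, "View"))
print(allowed(["Annotator"], drafts, "Edit"))
```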

Security Outside Zope

• Zope’s security mechanism is limited to the Web front.

• It is applied only to objects that directly interface with the end-user.

• Code written in a module in the filesystem has no security restrictions. It can do anything.

Limitations in Zope

• The API for creating extensions (called Products) is complicated and poorly documented.

• The Property Manager interface is too primitive. It only handles the very basic data types such as strings, integers, boolean fields, selection lists and multi-line text.

Our Extensions to Zope

• A framework for separating Zope specifics from our data types, making it much simpler to add new data types.

• An extended property management system that could handle changes in data type definitions and automatically migrate data.
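
These notes do not describe how the extended property system migrated data, so the following is only one possible approach, sketched under assumptions: each stored object records the version of its definition, and pending upgrade steps are applied in order. Every name in it (Record, the field names, the migration functions) is hypothetical.

```python
class Record:
    """Hypothetical stored object that carries the version of its definition."""

    def __init__(self, **fields):
        self.schema_version = 1
        self.__dict__.update(fields)


# Hypothetical upgrade steps: each takes a record from version N to N + 1.
def v1_to_v2(rec):
    rec.aliases = []  # new field with a default value


def v2_to_v3(rec):
    rec.gene_symbol = rec.__dict__.pop("symbol", "")  # renamed field


MIGRATIONS = {1: v1_to_v2, 2: v2_to_v3}
CURRENT_VERSION = 3


def migrate(rec):
    """Apply pending upgrade steps in order until the record is current."""
    while rec.schema_version < CURRENT_VERSION:
        MIGRATIONS[rec.schema_version](rec)
        rec.schema_version += 1
    return rec


old = Record(name="example", symbol="EX1")
migrate(old)
print(old.schema_version, old.gene_symbol, old.aliases)
```

Running the migration when an object is loaded means old and new objects can coexist in the database while definitions keep changing.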

Part II: User Interface

The rationale behind decisions affecting how a user experiences the

database.

User Interface Design

• We started with exposing Zope’s hierarchy as the public user interface

• But there were some elements such as the category browser and the

Templates for the Web UI

• Choice of DTML and ZPT for templates.

• ZPT for templating system.

Part III: Project Management Lessons

What we learnt about managing a project across continents and distant

time zones.

Project Management Issues 1

• We learnt the hard way that a project manager’s place is with his team, not with the client.

• Productivity suffers in the absence of an effective collaboration tool.

• E-mail and instant messengers are not effective collaboration tools.

Project Management Issues 2

• Collaboration over e-mail imposes the burden of articulation on the communicator, which many dislike and therefore avoid.

• Instant messaging prevents collecting thoughts before presenting them and is therefore a poor planning tool.

Collaboration Tools

• We experimented with several collaboration systems, with varying effectiveness:
  – Phone calls.
  – Instant messengers.
  – Wikis.
  – Issue tracking software.
  – Mailing lists.

Phone Calls

• Next best thing to face-to-face discussions.

• But only connect two people unless non-standard equipment is used.

• International calls are usually too expensive for the resulting gain.

Instant Messengers

• Provide critical communication between geographically distributed team members.

• But the pressure of maintaining continuity in a conversation hinders pausing to gather thoughts.

• Typing is much slower than talking. Therefore little else gets done alongside.

Wikis

• The easy hyperlinking system of a wiki combined with structured text makes presenting information a snap.

• With a little code thrown in, Wikis could make a wonderful project management tool.

• A changed page notification system is needed or changes go unnoticed.

Issue Tracking Software

• We use Bugzilla to track issues.

• But in eight months of use, only 30 issues have been reported through it.

• The other few hundred were reported over e-mail, instant messengers and in person.

• Clearly, the problem is with Bugzilla’s usability. The search for a new system is on.

Mailing Lists

• E-mail is push media: the latest is always on top of your inbox.

• E-mail makes an effective to-do list in an interface the user is comfortable with.

• Mailing lists are e-mail in broadcast mode.

• Mailing lists have been the most effective collaboration tool we’ve used so far.

Issues With Programmers

• Programmer skill levels and attitudes vary.

• C programmers tend to write C code in Python.

• PHP programmers tend to write PHP code in Python.

• Learning Python is easy but thinking in Python takes a long time.

Programming Tools We Used

• CVS for source control.

• ViewCVS for a Web front-end to CVS.

• Vim in GUI mode for source editing (preferred editor of everyone in the team).

• The print statement for debugging.

Tools We Should Have Used

• WingIDE is a $35 interactive Python debugger that can be used with Zope. A few minutes of use would have more than paid for the hours of programmer time we instead spent debugging with the print statement.

Part IV: Things Needing Fixing

Mistakes we made during development, how they affect things

now, and how they can be fixed.

Naming Conventions

• We started with assuming HPRD was gene-centric and named several things as GeneSomething.

• In code, this can be considered just an identifier.

• But in a URL, such names can confuse users and need renaming.

Reusable Modules

• All of the code currently sits in one directory.

• Several important pieces are generic and not tied to how HPRD uses them.

• These modules could be separated and contributed independently to the open source code pool.

Data in Code

• There are bits of implementation-specific data embedded in code in some places, particularly related to graph generation.

• These were introduced as quick patches for a temporary problem but have remained in place for months now.

• These need to be taken out so that the code is truly reusable.

Documentation

• DocStrings needed in code.

• Consistent language in DocStrings.

• HTML documentation files to be distributed with code.
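
For illustration only, here is what a docstring in one consistent style might look like. The function, its arguments and the chosen docstring layout are hypothetical and are not taken from the HPRD code.

```python
def molecular_weight(sequence, residue_weights):
    """Return the summed weight of the residues in a sequence.

    Arguments:
    sequence -- string of one-letter residue codes
    residue_weights -- mapping from residue code to weight

    Raises KeyError if the sequence contains an unknown code.
    """
    return sum(residue_weights[code] for code in sequence)


# The docstring is available at run-time, so documentation can be generated from it.
summary = molecular_weight.__doc__.splitlines()[0]
print(summary)
```

Python's standard pydoc module can write HTML pages from docstrings like this (pydoc -w followed by a module name), which is one way to produce documentation files that ship with the code.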

