Board on Research Data and Information, National Research Council
“Changing Roles of Libraries in Support of Scientific Data Activities”
June 3, 2010
More Data, More Use, Less Lead Time: Scientific Data Activities at the National Library of Medicine
Betsy L. Humphreys
Deputy Director
National Library of Medicine
www.nlm.nih.gov
NLM & Scientific Data
• Data categories
  – Substances
  – Sequences
  – Clinical Research
  – Taxonomies/Nomenclatures/Ontologies
NLM & Scientific Data
• Challenges (aka Problems)
  – Much more data
    • Greater NIH/other investment in generating data
    • High throughput methods
    • New, unfunded mandate(s)
  – Much less lead time
    • Need to achieve standardization more rapidly
Growth In PubChem Tested Substances
Number of Studies Registered at ClinicalTrials.gov since May 1, 2005
[Chart: Records (y-axis, 0 to 100,000) by Week of (x-axis, weekly dates from 2005 through 1/24/2010). Annotations: ICMJE; FDAAA 801; ~25-30 / wk; ~250 / wk; ~320 / wk]
2,317 Results Records submitted (Sept 2008 – March 2010)
  – About 30 new results records per week; 80 re-submissions per week
  – Anticipate increase in rate as rules become clear and outreach continues
UMLS Metathesaurus – May 2010 version
NLM & Scientific Data
• Strengths
  – Mission & Track Record
    • Curation, Storage, Permanent Access, Standards, R & D
  – Robust Infrastructure
    • Staff Expertise, Advisory Structure, Computing, Communications
  – Connections between different kinds of data, information
  – Strong US partnerships and international collaborations
  – Heavy use
• Weaknesses
  – The “defects of our qualities”
  – Limited resources
  – Less user outreach/training than desirable
Hazardous Substances Data, 1978-
Toxic Release Inventory Data, 1987-
National Center for Biotechnology Information, 1988-
– Design, develop, implement, and manage automated systems for collection, storage, retrieval, analysis, & dissemination of knowledge concerning molecular biology, biochemistry, & genetics
– Perform research into advanced methods of computer-based information processing capable of representing and analyzing the vast number of biologically important molecules and compounds
– Enable persons engaged in biotechnology research and medical care to use these systems & methods
– Coordinate, as much as is practicable, efforts to gather biotechnology information on an international basis
Benzene – PubChem Bioassay Results
Entrez Web Traffic (Unique IP Addresses): 1999 - 2009
[Chart: unique IP addresses (y-axis, 100,000 to 1,000,000) by year (x-axis, 1998 to 2009)]
• ~2 million users a day
• 100 million hits a day
• 5 terabytes of data a day
• 3,500 web hits a second (peak)
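The Entrez traffic figures above can be put on a common scale with a little arithmetic. The following is only a rough sketch: the function name and the derived quantities are illustrative rather than anything shown on the slide, and it assumes 1 terabyte = 10^12 bytes and that the daily totals are spread evenly over 24 hours.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def traffic_summary(users_per_day, hits_per_day, bytes_per_day, peak_hits_per_sec):
    """Derive rough per-second and per-user figures from daily totals.

    Hypothetical helper for illustration; assumes load is averaged
    evenly across the whole day.
    """
    avg_hits_per_sec = hits_per_day / SECONDS_PER_DAY
    return {
        "avg_hits_per_sec": avg_hits_per_sec,
        "peak_to_avg": peak_hits_per_sec / avg_hits_per_sec,
        "hits_per_user": hits_per_day / users_per_day,
        "bytes_per_hit": bytes_per_day / hits_per_day,
    }

# Figures quoted on the slide (terabyte taken as 10**12 bytes, an assumption).
summary = traffic_summary(
    users_per_day=2_000_000,
    hits_per_day=100_000_000,
    bytes_per_day=5 * 10**12,
    peak_hits_per_sec=3_500,
)
for name, value in summary.items():
    print(f"{name}: {value:,.1f}")
```

Under these assumptions the average load is a little under 1,200 hits a second, so the quoted peak of 3,500 is roughly three times the daily average.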
PubChem Users per Day
Current Activities/Future Plans
• Continued emphasis on:
  – Improving the input
    • Tagging, standardization, explicit links (e.g., GenBank #s, NCT #s)
  – Increasing data curation efficiency
  – Use of “influentials” to promote standards, best practices
  – US Partnerships & International collaborations
  – Computer center efficiency, security
  – Better discovery, retrieval, display methods
PubMed Central Article Request Frequency - calendar 2009
• 0 times: 2%
• 1 time: 8%
• 2 times: 4%
• 3-5 times: 8%
• 6-10 times: 8%
• 11-100 times: 41%
• 100+ times: 28%
Available: 1.9 Million Articles
Used: 98%
Used > 10 times: 69%
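The summary figures for PubMed Central can be recovered from the request-frequency breakdown. A minimal sketch, assuming the percentages are shares of the 1.9 million available articles; the dictionary and function names are illustrative only.

```python
# Share of articles by number of requests in calendar 2009, as listed on
# the slide (assumed to be percent of available articles).
request_share = {
    "0 times": 2,
    "1 time": 8,
    "2 times": 4,
    "3-5 times": 8,
    "6-10 times": 8,
    "11-100 times": 41,
    "100+ times": 28,
}

def summarize(share):
    """Return (percent used at least once, percent used > 10 times, total)."""
    used = 100 - share["0 times"]
    used_over_10 = share["11-100 times"] + share["100+ times"]
    total = sum(share.values())
    return used, used_over_10, total

used, used_over_10, total = summarize(request_share)
print(used, used_over_10, total)  # prints: 98 69 99
```

The categories sum to 99% rather than 100%, which is consistent with each share having been rounded to a whole percent.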