Science, Data, You, and the Future: A variation on “The 3 Little
Pigs.” Which “Little Pig” will You be?!
A Presentation for “NSF Facilities Users’ Workshop: Working Together to Meet New Observational Challenges.”
September 24, 2007
Boulder, CO
Raymond McCord
Oak Ridge National Laboratory*
Oak Ridge, Tennessee
*Oak Ridge National Laboratory is operated by UT-Battelle, LLC, for the U.S. Department of Energy under contract DE-AC05-00OR22725
Outline
• Objectives
• Conclusions (already??)
• Storytelling
• Introduction to “the story”
• Evaluation of current and future issues
• Pathways to the future??
• An ending to “the story”
• Conclusions (again!!)
Objectives
• To present my assessment of current scientific data management practices and issues that need to be addressed in the future.
• To be informative, provocative, and entertaining
• To STOP YOU from thinking about supper!!??
Conclusions (already??)
• We are swamped with more information than we can access*
  – *Access is a broad topic (EPDUS = ????)
• Our current practices may not be sustainable and reliable.
  – Exponential vs linear capacity increases
  – Optimization is unbalanced
• Scientific expertise within data centers will improve future data access.
• Science and data management must be integrated.
• Many solutions are NOT technological, but behavioral.
  – Think - Training
• “Data Science” training must be developed and implemented.
• The needed changes will not happen by accident.
  – “My ~30 years of experience and systems observation suggests otherwise!!”
• Sooner is better than later.
Storytelling
• Storytelling is a VERY OLD form of “information technology” (IT)
  – Preservation and access
• When old IT meets new IT
  – Supercomputer implementation
    • Just go ask!
• Excuse for Analogies
  – Engages the listener
• “The 3 Little Pigs”
  – “Once upon a time…”
About Raymond
• Trained as a Theoretical Ecologist (landscape ecology)
  – Conducted extensive statistical analysis
• Scientific data analyst (to pay the bills)
  – Tired of rerunning analyses at the last minute to correct data management problems
• Data manager / System “whacker”
  – GIS implementation in “early PC days”
• Implementer and manager of progressively larger environmental information systems!!
  – Requires “research” outside of “science”
  – “Smell the fumes” of many scientific disciplines
  – Very few publications!!??
  – Acquired respect???
Credits
• The concepts presented are derived from managing environmental data and information systems over the past 30 years.
• Variations of these concepts were observed from many disciplines:
  – plant community research
  – impact assessment in marine systems
  – acid rain surveys
  – environmental monitoring and cleanup projects at DOE facilities
  – land use assessment
  – climate change research (atmospheric research)
• These concepts extend to other scientific disciplines.
Quotes from Raymond
• “Storing data is easy. Finding and using data later is NOT…”
• “Systematically and consistently organized data does not occur without cost.”
  – “The existence of ‘no cost’, well-organized data is not supported by the current situation.”
  – “Consider the results from previous science projects with ‘no cost’ for data archiving.”
• “The natural tendency over time for data and information is chaos. Effort must be exerted to overcome this.”
• “Data that a project has managed successfully may still not be ready to be archived (for permanent access).”
Pop Quiz (Wake UP!!)
• What is the “access combination” to my lock?
  – Hints:
    • “I love it”
    • X=(Yz)/12) - z+1
• How is my necktie related to:
  – Data?
  – Metadata?
  – Scientists?
  – 2 year old children?
• “Why do I care?”
(Answers near the end of the presentation.)
Story time…
“The 3 Little Pigs”
• Characters
  – The Wolf
  – First Pig builds a house of Straw
  – Second Pig builds a house of Sticks
  – Third Pig builds a house of Bricks
• What does this have to do with “Data, Science, You, and The Future…?”
The Wolf
• Unending appetite
• Out of control
• Bad mannered
• Too clever?
• Exponential growth in:
  – Data retention capacity and habits
  – Data re-use demands
• Significant chaos in:
  – Data automation styles
  – Data documentation
• Lack of training in
  – {ditto above!!}
Who will eat whom? Scientists or Data managers?
[Figure: “Data out” and “Data in” charts, Oct-95 through Oct-06; series: files, MB.]
The Straw Building Pig
• Gathered more and more of “what was at hand.”
• Wanted to go back to “being a pig”.
• Metadata catalogs
• Metadata harvesting
• Layers of ontologies
• Automated “data mining”??
• Can we sort through all of the details?
• What about recommendations and priorities for use?
• When will the “straw quality” improve?
  – Changing the “masses”
Stay out of the way of the Scientists
The Sticks Building Pig
• Did a bit more work to gather materials
• Used a bit more structure
• Did not have good specifications
• Only acted after First Pig failed
• Wanted to go back to “being a pig”. (after some effort)
• A mixture of data structures, metadata, and a few standards
• XML Links
• Automated data access
• Data warehouse
  – Business information concept
• How do we know the balance of structure, metadata, and standards?
• What is the evolutionary pathway?
• Many “Sticks” to choose from
• Can we show the improvement over “Straw”?
Work with the Scientists
The Bricks Building Pig
• Did a lot more work to gather (AND PREPARE) materials
• Used significantly more structure
• Required working with “a plan”
• Was a braggart over First and Second Pig
• Wanted to go back to “being a pig”. (after “winning”)
• Metadata standards and more standards
• Internet does not decide (distributed vs central)
• Removes ambiguity of definitions, but contents get “boxed”.
• What about Type I errors vs. Type II errors?
• An “odds box or junk bin” will always remain.
• “Bricks” are:
  – Hard to change
  – Slow and costly to make
• CHANGE is fundamental to SCIENCE (more later!!)
Defeat or stymie the Scientists?
Elements of Data Preservation for Future Access
• A “framework” for assessing improvements for the future
• Restricts flow like irregular plumbing
• “We want {more} Cake!!??”
Elements of “Permanent Access”…
• “Permanent access” to scientific information requires ALL of the following:
  – Existence
  – Permission
  – Discovery
  – Understanding
  – Support
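The “requires ALL of the following” rule can be sketched in code. This is only an illustration, assuming each element can be reduced to a yes/no judgment for a given dataset; the class and function names are hypothetical and do not come from the presentation or from any ARM software.

```python
from dataclasses import dataclass, fields


@dataclass
class AccessElements:
    """Yes/no assessment of the five elements for one dataset (hypothetical)."""
    existence: bool
    permission: bool
    discovery: bool
    understanding: bool
    support: bool


def has_permanent_access(elements: AccessElements) -> bool:
    # Permanent access holds only when every element is satisfied.
    return all(getattr(elements, f.name) for f in fields(elements))


def missing_elements(elements: AccessElements) -> list[str]:
    # Names of the elements that still block permanent access.
    return [f.name for f in fields(elements) if not getattr(elements, f.name)]


# A dataset that is stored, shareable, and findable, but undocumented and unsupported.
example = AccessElements(True, True, True, False, False)
print(has_permanent_access(example))
print(missing_elements(example))
```

A single missing element is enough to fail the check, which is the point of the slide: strong storage alone does not give permanent access.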
Existence
• Definition
  – Information is recorded and retained.
  – Information can be found and used by “experts”.
• Requirements
  – Information technology is used for recording and retention.
  – Scientists are trained and required to record and retain information.
• Issues
  – The availability of information technology will far surpass the “ability” to use it effectively.
    • Training will be needed to extend “ability” beyond the immediate need.
    • Training must include both fact and philosophy.
  – Plans to use information technology must be “pushed” beyond the immediate objectives.
    • Need to establish reasonable and more “global” plans and objectives.
“Why Don’t I Archive My Data?”
• No incentives - What’s in it for me?
• No acknowledgment - Does a dataset = a publication?
• Give up publication rights - Will somebody scoop me?
• Poor planning - It was not in “the Plan”.
• No resources - Who’s going to pay for it?
• No future - Who will support this later?
• Lack of training - What do I need to do first?
• Unsure about metadata content - How much is enough?
Permission
• Definition
  – Someone “beyond” the originator is allowed to acquire and use the data.
• Requirements
  – Scientists relinquish control of the data.
  – Sponsors and agencies relinquish control of the data.
  – “They” not only allow future use, but encourage it.
• Issues
  – Encourage data re-use.
    • Explain larger research objectives.
    • Reward data citation.
  – Balance openness and protection.
    • Allow early discovery.
    • Prevent resource abuse.
    • Protect individual privacy.
Discovery
• Definition
  – Starts with the inspiration to look “here” for “what you want”.
  – Includes knowing how to find “what you want”.
  – Ends with recognizing it when “what you want” is found.
• Requirements
  – Logical organization.
  – Good and meaningful metadata (categories and keywords).
  – Multiple pathways for discovery.
• Issues
  – Documentation must be significantly extended beyond the “local view” of the data.
  – Documentation development is “not career building” for scientists.
  – Interactions between “developers and users” must be sustained.
[Diagram, three panels. “An Initial View of Data…”: a single Measurement. “Single Experiment View”: a Measurement with date, sample ID, parameter name, and location. “Integrated System & Archive View”: the same Measurement plus QA flag, media, generator, method, records, and units, linked to definition tables (Sample def., Method def., Parameter def., QA def., Units def.), organization details (type, name, custodian, address, etc.), location details (coord., elev., type, depth), a Record system, and GIS.]
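The step from the single-experiment view to the integrated archive view can be sketched as two record types. The field names follow the labels in the diagram; the `value` field, the class names, the helper function, and the sample values are assumptions made for illustration and are not taken from the ARM Archive.

```python
from dataclasses import dataclass, asdict


@dataclass
class ExperimentMeasurement:
    """What one experiment typically records with a measurement."""
    value: float          # hypothetical: the measured number itself
    date: str
    sample_id: str
    parameter_name: str
    location: str


@dataclass
class ArchiveMeasurement(ExperimentMeasurement):
    """Extra context an integrated archive needs to keep the value usable."""
    units: str
    method: str
    generator: str
    media: str
    qa_flag: str


def to_archive_record(m: ExperimentMeasurement, **context: str) -> ArchiveMeasurement:
    # Raises TypeError if any of the archive-level fields is not supplied,
    # so incomplete context is caught at ingest rather than years later.
    return ArchiveMeasurement(**asdict(m), **context)


local = ExperimentMeasurement(7.2, "1995-10-01", "S-001", "pH", "Site A")
archived = to_archive_record(
    local, units="pH units", method="field probe",
    generator="field crew", media="surface water", qa_flag="ok",
)
print(archived)
```

Inside one experiment the extra fields are often "known by everyone" and left unrecorded; the archive view makes them mandatory because later users have no one to ask.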
Comparison of User Interface Options
([email protected], 1-888-ARM-DATA)
Each entry lists the interface name, the accessible data, and the “shopping” approach.
• ARM Data Browser
  – Accessible data: Routine ARM data
  – “I am not sure what I want. I need to see what you have available.” Browsing a hierarchy of availability summaries.
• Catalog Interface
  – Accessible data: Routine ARM data
  – “I know what I want. Do you have it?” Searching with predefined selection criteria.
• Thumbnail Browser
  – Accessible data: Most routine ARM data
  – “I will know what I want when I see it.” Searching with a combination of predefined selection criteria and visual review of data plots.
• Web Shopping Cart
  – Accessible data: Routine ARM data and some IOP data
  – “I need to read about what you have, then I will decide.” Discover areas of interest by browsing the ARM web documentation and collect items of interest.
• IOP Data Browser
  – Accessible data: IOP, special, PI, and beta data
  – “I need to look in the odd parts bin.” Direct access to IOP data. Navigate the /year/site/iop directory tree. Also use a narrow Google search.
Moving on to … Results-based searching
• An interface of “Statistical Views” (or data) under development for the ARM Archive.
• Not all users want “data.”
[Screenshots: user interface to select thumbnails of Statistical Views; detailed view of graph with options to order statistics, data, or data files.]
Understanding
• Definition
  – The interpretation of the full context of the information.
• Requirements
  – Descriptive metadata that correctly “matches up” information that was:
    • Generated from a variety of sources,
    • Collected for a variety of purposes,
    • Retained over a broad range of time.
  – “Understanding” applies to both:
    • Persons who read documentation.
    • Computers that “read” the data format.
• Issues
  – “Language barriers” must be overcome between scientific disciplines.
  – Inadequate documentation and software can make “data” useless.
  – Additional effort will need to be allocated beyond original purpose.
    • Trade off between: current quantity of measurements and future use.
Sequence of Information Birth
[Diagram: the integrated system and archive view of a Measurement (date, sample ID, parameter name, location, QA flag, media, generator, method, records, units, with their definition tables, Record system, and GIS), shown as the sequence in which each piece of information is created.]
Support
• Definition
  – Providing help and service beyond the creation of initial information and documentation.
• Requirements
  – Answers user questions beyond the initial documentation.
  – Responds to the evolution of information technology.
  – Includes scientific and technology expertise.
• Issues
  – Maintaining information does not:
    • Fit traditional science program planning.
    • Contain “whiz bang” appeal.
  – Requires development of new “career pathways”.
Research Implies Change …
[Diagram: a repeating cycle of Research, Discovery, New questions, New data requirements, then repeat…]
This is not always true for other information systems.
Issues to Consider about Change
• What will change?
• Which changes can be controlled?
• How are changes approved?
• How are users notified about changes?
• How and when can changes be “smoothed” in the cumulative “Archive” view?
Pathways to the Future??
EPDUS = ????? (1)
• Existence
  – Technology has pushed this out of control – a path to chaos
  – Dilution of value causes a recovery problem
  – Develop procedures for retention guidelines
• Permission
  – Plans that encourage permanent access to scientific data are a “management responsibility”
  – Consistent rules to protect privacy, resources, and propriety
• Discovery
  – Significant effort on cataloging and searching
  – Large scale data collections depend on rational metadata
  – Need interrelated discovery pathways (query, catalog, pictures)
  – Results-based views are still very limited from large scale data
  – Inspiration is “an undeveloped frontier”
EPDUS = ????? (2)
• Understanding
  – Expanding human and computer “interpretation” is difficult;
    • Does not keep up with the increase in diversity of information types
  – Web documentation has an inverted outline compared with scientific publications
    • Web users don’t read !!!
    • (??!!! More later !!!??)
• Support
  – Inclusion of scientific expertise in Data Centers is still debated and limited
  – Programmatic justification of Data Centers outside of (or after!!) the measurement program has limited “sponsor appeal”
“Inverted” Documentation Outline!?! Science Publication vs. “Web Reading”
• Science Publication
  – Abstract
  – Introduction
  – Literature review
  – Materials / methods
  – Results
  – Discussion
  – Conclusion
  – References
• “Web Reading”
  – Conclusions
  – Results
  – Abstract
  – Materials / methods
  – Discussion
  – Literature review
  – References
  – Introduction
Reference: McCord (200?) ???
Cross cutting issues
• Training about scientific data management
  – Some for all scientists, graduate program for “data scientists”
  – Reward system for scientific data “reuse”
• Feudal relationship between more “Science” and data preservation
  – More measurements and experiments
  – Bigger computers driving science
  – Stop it!! Cooperation is needed!!
• Scientific input needed for:
  – Metadata creation
    • Mesh with scientific planning
  – Defining priorities and recommendations
    • “An answer is better than NO answer!!” (Going for 0 points??!!)
  – Defining a reasonable boundary between “system and scientist”
    • Handshake needed for QA review, analysis tools, documentation, and automated discovery (?!?)
Looking from the Past to the Future
(Common questions from my peers.)
• “Should I computerize my data?” (~1974-1975)
• “Should I save my {computerized} data?” (~1978-1980)
• “Why would anyone want my data?” (~1980)
• “Can anyone else properly understand my data?” (~1985)
• “Can I have your data?” (~1990)
• “Can I find your data?” (~1993-1994)
• “Will I have to contact you to know how you used your data?” (~1998-1999)
• “Can you tell me who else has used your data?” (~2000 - ????)
• “Can you tell me where to find similar data?” (~2003 - ????)
• “Do you want to know (or get back) how I used your data?” (~2005 - ????)
• “Will you work together with me on ‘our’ data?” (????)
• “Can we work together with our and ‘their’ data?” (????)
• … What next … ??
[Timeline annotations: Interactive computing starts; PC gets common; www.??? takes off; Cheap storage; Collaboration is common; Internet “premie”]
Conclusions (again!!)
• We are swamped with more information than we can access*
  – *Access is a broad topic (EPDUS = ????)
• Our current practices may not be sustainable and reliable.
  – Exponential vs linear capacity increases
  – Optimization is unbalanced
• Scientific expertise within data centers will improve future data access.
• Science and data management must be integrated.
• Many solutions are NOT technological, but behavioral.
  – Think - Training
• “Data Science” training must be developed and implemented.
• The needed changes will not happen by accident.
  – “My ~30 years of experience and systems observation suggests otherwise!!”
• Sooner is better than later.
Story time … (again)
An Ending to the Story… (More conclusions)
• The best house is probably a combination of:
  – Bricks to build on
  – Sticks (wood) to bend and change with
  – Straw to rest on when sorting out the problem is too early.
• The Wolf needs to be tamed with:
  – Reduction in needless data management and documentation chaos and uninformed practices.
  – More thought (research??) about “our appetite” (priorities) for storage and retention.
• The Wolf and Pigs both need more training!!
• And they live happily ever after…!
Quiz Answers
Pop Quiz (Answer 1)
• What is the “access combination” to my Lock?
• Hints + missing hints
  – “I love it”
    • Decode numeric sequence from word lengths
  – {X=Yz – ((Y*z)/12)}
    • All unknowns are integers
    • Solution is the integer number showing the sequence
• Combination = 142
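The first hint can be decoded mechanically. The sketch below implements only the word-length rule stated on the slide; the function name is hypothetical, and it assumes every word is shorter than ten letters so that each word yields one digit.

```python
def combination_from_phrase(phrase: str) -> str:
    # Each word contributes one digit: its number of letters.
    return "".join(str(len(word)) for word in phrase.split())


print(combination_from_phrase("I love it"))  # prints 142
```

The phrase is easy to store and repeat, but it is useless without the decoding rule, much like data without its metadata.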
Pop Quiz (Answer 2)
• How is my necktie related to:
  – Data?
    • They all look alike at first?
  – Metadata?
    • Neckties distinguish the teddy bears
  – Scientists?
    • They distinguish data in varying and “unseen” ways
  – 2 year old children?
    • They distinguish teddy bears in varying and “unseen” ways
Both CRY when given the “wrong one”
Pop Quiz (Answer 3)
• “Why do I care?”
  – Better data access can turbo charge Science.
  – Things are a bigger mess than necessary.
  – Progress toward improvement is too passive and too slow.
  – Independently managing information from each project is like “paying rent” rather than “building equity.”
References
• Information about my current project
  – Atmospheric Radiation Measurement (ARM) Program www.arm.gov
  – ARM Archive www.archive.arm.gov
• Extended version of “The Three Little Pigs”
  – http://math-www.upb.de/~odenbach/pigs/pigs.html
  – Linked to a German Math professor’s web site??
  – An English version is presented
• Very good reference on Data, Science, and the need for new roles
  – “Long-Lived Digital Data Collections Enabling Research and Education in the 21st Century”
  – http://www.nsf.gov/pubs/2005/nsb0540/
  – Sponsored by NSF National Science Board
• Reports on other NSF cyber infrastructure activities to watch and encourage
  – http://www.nsf.gov/od/oci/reports.jsp