PROVENANCE MANAGEMENT & CITATIONS IN CURATED DATABASES
Kleisarchaki Sophia,
HY561, 05/05/09
ABOUT THE AUTHOR – PETER BUNEMAN
He works in the Database Group in the Laboratory for Foundations of Computer Science (University of Edinburgh).
He spent many years in the Database Group of the Department of Computer and Information Science at the University of Pennsylvania.
You can find him... in polynomial time.
CONTENTS
1st paper: “Provenance Management In Curated Databases” Peter Buneman, Adriane P. Chapman, James Cheney
2nd paper: “How to cite curated databases and how to make them citable” Peter Buneman
“Curated Databases” Peter Buneman, James Cheney, Wang-Chiew Tan
“Provenance in Databases (Tutorial Outline)” Peter Buneman, Wang-Chiew Tan
Before All..
CURATED DATABASES
What is a Curated Database?
The term “curated” comes from the Latin curare – to care for.
They are the result of a great deal of annotation, correction and transfer of data from other sources.
They are databases that are populated and updated with a great deal of human effort through the consultation, verification and aggregation of existing sources and the interpretation of new raw data.
CURATED DATABASES
What is a Curated Database NOT?
Curated databases are not warehouses. They are manually constructed by highly skilled scientists.
They are not views.
They are not computed automatically from existing datasets.
CURATED DATABASES
Notable examples of curated databases
UniProt (formerly called SwissProt): used in molecular biology.
CIA World Factbook: source of demographic data.
IUPHAR: receptor database. Maintained by volunteers.
Such databases are not confined to biology; they are also being developed in areas such as astronomy and geology. Wikipedia and other wikis are also curated in that they are the product of direct human effort.
Nuclear Protein Database (NPD).
Reference manuals, dictionaries and gazetteers.
CURATED DATABASES
What are the characteristics of a Curated Database?
Source. Data is copied and edited from existing sources, perhaps other curated databases. Knowing the origin (provenance) is important.
Annotation. In addition to core data, curated databases also contain annotations that carry additional pieces of information such as provenance.
Update. A common practice is to keep a working database up to date and to “publish” versions of it.
Schema and structure. Constructed “on the cheap”, usually stored in a text file. Almost inevitably the structure of the entries evolves over time.
PROVENANCE IN DATABASES (1/2)
Provenance, also called lineage and pedigree, describes the source and derivation of data.
Why is provenance important? It helps to:
  Determine the authenticity of a work.
  Establish the historical importance of a work by suggesting other artists who might have seen and been influenced by it.
  Determine the legitimacy of current ownership.
  Trust the data.
PROVENANCE IN DATABASES (2/2)
Overview of provenance
Provenance: describes the source and derivation of data.
  Workflow or coarse-grain provenance: records a complete history of the derivation of some data set.
  Dataflow or fine-grain provenance: the derivation of part of the resulting data set.
    Why-provenance: keeps the justification for the element appearing in the output.
    Where-provenance: the identification of the source elements from which the data in the target is copied.
WHERE-, WHY- PROVENANCE
Source table (restaurants):
Restaurant          Cost  Type      Zip
Peacock Alley       $$$   French    10022
Bull & Bear         $$$   Seafood   10022
Pacifica            $     Chinese   10013
Soho Kitchen & Bar  $     American  10022

NYHotels (Source table):
Hotel            Zip    Rating
Waldorf Astoria  10022  4.5
Holiday Inn DT   10013  4.0

View (JOIN, PROJECT):
Hotel            Restaurant          Cost  Rating
Waldorf Astoria  Peacock Alley       $$$   4.5
Waldorf Astoria  Bull & Bear         $$$   4.5
Holiday Inn DT   Pacifica            $     4.0
Waldorf Astoria  Soho Kitchen & Bar  $     4.5

Why? (Why-provenance)
Where? (Where-provenance)
1st paper
“Provenance Management In Curated Databases” Peter Buneman, Adriane P. Chapman, James Cheney
WHAT IS THE PROBLEM BEING ADDRESSED IN THE PAPER?
Database technology is employed not only to provide access to source data, but also to the derived knowledge of scientists who have interpreted the data.
Provenance or metadata describing creation, recording, ownership, processing, or version history is essential for assessing the value of such data.
What information should be retained?
How should it be managed?
WHAT IS THIS PAPER ABOUT?
Investigates general-purpose techniques for recording provenance for data that is copied among databases.
Describes an approach in which the user’s actions are tracked and recorded in a convenient, queryable form.
Presents an implementation of this technique and uses it to evaluate the feasibility of database support for provenance management.
CURATED DATABASES - EXAMPLE
Example
a) Copies records of some interesting proteins from a SwissProt webpage into her database.
b) Fixes the new entries so that the PTM (post-translational modification) found in SwissProt is not confused with her own.
c) Copies some publications from OMIM and NCBI.
d) One year later she finds a discrepancy between two PTMs.
THE PROBLEM
It is necessary to retain provenance information describing the source and version history of the data.
We focus on “fine-grained” provenance, which describes how data has moved through a network of databases.
We need to record both local modifications to the database (insert, delete, update) and global operations such as copying data from external sources.
Constraints:
1. There is no standard for storing or exchanging provenance.
2. Practices for identifying or locating data vary.
3. Past versions may not be archived.
4. Curators employ a variety of application programs that cannot be changed.
Figure labels: external source databases, local database, auxiliary provenance database.
OUR APPROACH (1/2)
The user’s actions are captured as a sequence of insert, delete, copy and paste operations by a provenance-aware application.
Provenance architecture
OUR APPROACH (2/2)
Implemented a naïve approach and several more sophisticated ones.
The naïve approach increases the time to process each update by 28%. The amount of provenance information stored is proportional to the size of the changed data.
Optimization techniques:
  Transactional provenance management.
  Hierarchical provenance management.
Together these optimizations reduce the added processing cost of provenance tracking to less than 5-10% per operation and reduce the storage cost by a factor of 5-7 relative to the naïve approach.
Typical provenance queries can be executed more efficiently.
MANUAL UPDATES AND PROVENANCE (1/2)
“Where does a piece of data come from?”
We need a means of describing the location of any data element.
Two assumptions:
  The database can be viewed as a tree.
  Labels on edges occur on at most one path.
(SwissProt/Release{20}/Q01780 identifies a specific entry.)
MANUAL UPDATES AND PROVENANCE (2/2)
Update operations are of the form:
u ::= ins {a : v} into p | del a from p | copy q into p
ins {a : v} into p: inserts an edge labeled a with value v into the subtree at p.
del a from p: deletes an edge and its subtree.
copy q into p: replaces the subtree at p with a copy of the subtree at location q.
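The three update operations can be sketched over a nested-dictionary tree. This is only an illustration under assumptions that are not in the slides: paths are written as '/'-separated edge labels, and the names resolve, ins, delete and copy_into are hypothetical helpers.

```python
from copy import deepcopy


def resolve(tree, path):
    """Follow the '/'-separated edge labels in path and return the subtree found."""
    node = tree
    for label in path.split("/"):
        node = node[label]
    return node


def ins(tree, p, a, v):
    """ins {a : v} into p: add an edge labeled a with value v under p."""
    target = resolve(tree, p)
    if a in target:
        raise KeyError(f"edge {a!r} already exists under {p!r}")
    target[a] = v


def delete(tree, p, a):
    """del a from p: remove the edge a and its whole subtree."""
    del resolve(tree, p)[a]


def copy_into(tree, q, p):
    """copy q into p: replace the subtree at p with a copy of the subtree at q."""
    parent_path, _, label = p.rpartition("/")
    parent = resolve(tree, parent_path) if parent_path else tree
    parent[label] = deepcopy(resolve(tree, q))


# S plays the role of a source database, T the curated target (assumed names).
world = {"S": {"a": {"x": 1, "y": 2}}, "T": {}}
ins(world, "T", "c", {})
copy_into(world, "S/a", "T/c")
delete(world, "T/c", "y")
```

The deep copy keeps the target independent of the source, so the later delete under T/c leaves S/a untouched.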
PROVENANCE TRACKING
Prov(Tid, Op, Loc, Src)
Provenance architecture: external source databases, local database, auxiliary provenance database.
NAÏVE PROVENANCE
Store one provenance record for each copied, inserted or deleted node.
Wasteful in terms of space.
Retains the maximum possible information about the user’s actions.
One transaction per line.
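A minimal sketch of what "one record per node" means for a copy, assuming the Prov(Tid, Op, Loc, Src) layout from the earlier slide. The tree encoding (nested dictionaries) and the helper names paths and naive_copy_records are assumptions made for illustration.

```python
from typing import NamedTuple, Optional


class Prov(NamedTuple):
    tid: int              # transaction identifier
    op: str               # "I" (insert), "D" (delete) or "C" (copy)
    loc: str              # target location
    src: Optional[str]    # source location, only meaningful for copies


def paths(subtree, root):
    """Yield the path of root and of every node below it."""
    yield root
    if isinstance(subtree, dict):
        for label, child in subtree.items():
            yield from paths(child, f"{root}/{label}")


def naive_copy_records(tid, subtree, q, p):
    """Naive provenance for 'copy q into p': one record per copied node."""
    return [Prov(tid, "C", loc, q + loc[len(p):]) for loc in paths(subtree, p)]


copied = {"x": 1, "y": {"z": 2}}
records = naive_copy_records(1, copied, "S/a", "T/a")
```

Here the copied subtree has four nodes (T/a, T/a/x, T/a/y, T/a/y/z), so four records are produced, which is where the space cost comes from.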
TRANSACTIONAL PROVENANCE
Actions are grouped into transactions larger than a single operation.
Store only provenance links describing the net changes resulting from a transaction.
  Details about intermediate states are not retained.
  Less precise than the naïve approach.
Number of transactional provenance records: i + d + c, where
  i: number of inserted nodes in the output.
  d: number of nodes deleted in the input.
  c: number of copied nodes in the output.
Entire update as one transaction.
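One way to read "net changes" is sketched below. It is an illustration under assumptions, not the paper's algorithm: the input is a list of per-node steps (op, loc, src) in execution order, existed_before is the set of locations present when the transaction started, and net_changes is a hypothetical name.

```python
def net_changes(steps, existed_before):
    """Collapse the per-node steps of one transaction into its net provenance links."""
    net = {}  # loc -> (op, src)
    for op, loc, src in steps:
        if op == "D":
            # Forget anything created at loc earlier in this transaction ...
            net.pop(loc, None)
            # ... and record a delete only if the node existed in the input.
            if loc in existed_before:
                net[loc] = ("D", None)
        elif op == "I":
            net[loc] = ("I", None)
        elif op == "C":
            # A copy of data created in this same transaction inherits that
            # data's link, so every link points back to the previous version.
            if src in net and net[src][0] != "D":
                net[loc] = net[src]
            else:
                net[loc] = ("C", src)
    return net


steps = [
    ("C", "T/a", "S/a"),
    ("I", "T/tmp", None),
    ("C", "T/b", "T/a"),
    ("D", "T/tmp", None),   # temporary data: leaves no trace
    ("D", "T/old", None),
]
links = net_changes(steps, existed_before={"T/old"})
```

In this example i = 0, d = 1 and c = 2, so three records remain, and the intermediate T/tmp node is not recorded at all.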
HIERARCHICAL PROVENANCE (1/2)
It is not necessary to store all of the provenance links explicitly.
The provenance of a child of a copied node can often be inferred from its parent’s provenance using a simple rule.
  Does not discard any information.
  Does not require the user to group operations into transactions.
Hierarchical version of the naïve approach: 25% smaller than Prov, but much larger savings are possible.
HIERARCHICAL PROVENANCE (2/2)
We can define the full provenance table as a view of the hierarchical table as follows:
  If the provenance is specified in HProv, then it is just copied into Prov.
  Otherwise, the provenance of every target path p/a not mentioned in HProv is q/a, provided p was copied from q.
Infer(t, p) ← ¬(∃x, q. HProv(t, x, p, q))
Prov(t, op, p, q) ← HProv(t, op, p, q)
Prov(t, C, p/a, q/a) ← Prov(t, C, p, q), Infer(t, p)
Prov(t, I, p/a, ⊥) ← Prov(t, I, p, ⊥), Infer(t, p)
Prov(t, D, p/a, ⊥) ← Prov(t, D, p, ⊥), Infer(t, p)
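The view can be sketched as a recursive lookup that follows the prose rule above: a path with no explicit HProv record inherits from its parent. The dictionary encoding of HProv and the function name full_prov are assumptions made for illustration.

```python
def full_prov(hprov, tid, path):
    """Return (op, src) for path in transaction tid, or None if it was unchanged.

    hprov maps (tid, loc) -> (op, src), holding only the explicit records.
    """
    if (tid, path) in hprov:
        return hprov[(tid, path)]          # explicit record: copy it into Prov
    parent, sep, label = path.rpartition("/")
    if not sep:
        return None                        # reached the top without a record
    inherited = full_prov(hprov, tid, parent)
    if inherited is None:
        return None
    op, src = inherited
    # A child of a copied node was copied from the matching child of the source;
    # children of inserted or deleted nodes have no source.
    return (op, f"{src}/{label}") if op == "C" else (op, None)


hprov = {
    (1, "T/a"): ("C", "S/a"),
    (1, "T/a/y"): ("I", None),
}
```

With these two stored records, T/a/x resolves to a copy from S/a/x, while T/a/y/z inherits the insert recorded at T/a/y.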
TRANSACTIONAL-HIERARCHICAL PROVENANCE
Combination of transactional and hierarchical provenance techniques.
Storage is i + d + C, where
  i: number of inserted nodes in the output.
  d: number of nodes deleted in the input.
  C: number of roots of copied subtrees that appear in the output.
Hierarchical version of (b).
Entire update as one transaction.
PROVENANCE QUERIES
Define some convenient views of the raw Prov table.
Unch(t, p) ← ¬(∃x, q. Prov(t, x, p, q))
  “p was unchanged during transaction t”
Ins(t, p) ← Prov(t, I, p, ⊥)
  “p was inserted during transaction t”
Del(t, p) ← Prov(t, D, p, ⊥)
  “p was deleted during transaction t”
Copy(t, p, q) ← Prov(t, C, p, q)
  “p was copied from q during transaction t”
PROVENANCE QUERIES
Define some convenient views of the raw Prov table.
From(t, p, q): “node p comes from q during transaction t”
  From(t, p, q) ← Copy(t, p, q)
  From(t, p, p) ← Unch(t, p)
Trace(p, t, q, u): “the data at location p at the end of transaction t “came from” the data at location q at the end of transaction u”
  Trace(p, t, p, t).
  Trace(p, t, q, u) ← Trace(p, t, r, s), Trace(r, s, q, u).
  Trace(p, t, q, t-1) ← From(t, p, q).
Let’s answer some… “simple” questions!
PROVENANCE QUERIES (1/2)
Q1: Src. What transaction first created the data at a location? (e.g. who entered your telephone number incorrectly?)
  Src(p) = {u | ∃q. Trace(p, tnow, q, u), Ins(u, q)}
Q2: Hist. What is the sequence of all transactions that copied a node to its current position?
  Hist(p) = {u | ∃q, r. Trace(p, tnow, q, u), Copy(u, q, r)}
Q3: Mod. What transactions are responsible for the creation or modification of the subtree under a node?
  Mod(p) = {u | ∃q, r. p ≤ q, Trace(q, tnow, r, u), ¬Unch(u, r)}
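The Src and Hist queries can be sketched by walking the provenance links backwards one transaction at a time. The assumptions here are mine: a full (naïve-style) Prov table stored as a dictionary from (tid, loc) to (op, src), a single timeline of integer transaction ids, and the hypothetical names step_back, trace, src and hist.

```python
def step_back(prov, t, p):
    """From(t, p, q): the location q the data at p came from during transaction t."""
    rec = prov.get((t, p))
    if rec is None:
        return p                        # Unch(t, p): the data stayed in place
    op, source = rec
    return source if op == "C" else None    # inserted or deleted: nothing earlier


def trace(prov, p, t_now):
    """Yield every (q, u) such that Trace(p, t_now, q, u) holds, newest first."""
    q, u = p, t_now
    yield q, u
    while u > 0:
        q = step_back(prov, u, q)
        if q is None:
            return
        u -= 1
        yield q, u


def src(prov, p, t_now):
    """Transactions that inserted the data now found at p."""
    return {u for q, u in trace(prov, p, t_now)
            if prov.get((u, q), ("", None))[0] == "I"}


def hist(prov, p, t_now):
    """Transactions that copied the data on its way to p, newest first."""
    return [u for q, u in trace(prov, p, t_now)
            if prov.get((u, q), ("", None))[0] == "C"]


prov = {
    (1, "S/a"): ("I", None),
    (2, "T/a"): ("C", "S/a"),
    (4, "T/b"): ("C", "T/a"),
}
```

For the data at T/b after transaction 5, the walk passes through T/a and ends at the insert into S/a in transaction 1.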
PROVENANCE QUERIES (2/2)
There are many interesting queries that mention both provenance and the raw data.
Q4: Project the A field out of relation R(Id, A, B) along with its current provenance.
  Q(x, Px) ← R(k, x, y), From(tnow, “R/” + k + “/A”, Px)
Such queries are tricky to write by hand.
Providing advanced support for provenance queries is future work.
Note: If some source databases do not track provenance, then queries stop following the chain of provenance there.
IMPLEMENTATION
Provenance architecture:
  Source database: OrganelleDB
  Target database: MiMI
  Auxiliary provenance database
  Wrappers for source and target databases
IMPLEMENTATION OF PROVENANCE TRACKING (1/2)
Naïve provenance
  Is a straightforward process of recording target and source information for every transaction that affects the target database.
  For a paste operation we add one record per node in the copied subtree.
Transactional provenance
  When a commit action occurs, CPDB stores the provenance links connecting the current version with its predecessor.
  No links corresponding to temporary data are stored.
  The implementation maintains a provlist of provenance links that will be added to the provenance store when the user commits.
IMPLEMENTATION OF PROVENANCE TRACKING (2/2)
Hierarchical Provenance
  Stores at most one record per operation.
  For a copy, stores the record connecting the root of the copied tree to the root of the source.
Hierarchical Transactional Provenance
  Maintains hierarchical provenance instead of naïve provenance records in provlist.
  Checks and removes redundant links from provlist.
  E.g. copy S/a to T/a, copy S/a/b to T/a/b: redundant links.
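The redundancy check can be sketched as follows: a copy link is dropped when the nearest kept ancestor link already implies it. The provlist encoding (a dictionary from target location to source location, copies only) and the names inferred_source and prune_redundant are assumptions for illustration, not CPDB's actual code.

```python
def inferred_source(kept, loc):
    """Source implied for loc by its nearest ancestor in kept, or None."""
    suffix = ""
    while "/" in loc:
        loc, _, label = loc.rpartition("/")
        suffix = f"/{label}{suffix}"
        if loc in kept:
            return kept[loc] + suffix
    return None


def prune_redundant(provlist):
    """Keep only the copy links that cannot be inferred from a shallower link."""
    kept = {}
    # Visit shallow paths first so that parents are decided before children.
    for loc in sorted(provlist, key=lambda path: path.count("/")):
        if inferred_source(kept, loc) != provlist[loc]:
            kept[loc] = provlist[loc]
    return kept


provlist = {"T/a": "S/a", "T/a/b": "S/a/b", "T/a/c": "S/x"}
pruned = prune_redundant(provlist)
```

The link for T/a/b disappears because copying S/a to T/a already implies it, while T/a/c is kept because its source differs from the inferred S/a/c.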
PROVENANCE QUERIES - IMPLEMENTATION
Src, Mod, Hist implemented as programs.
For naïve and transactional provenance, query directly the provenance store.
For hierarchical provenance, the provenance store corresponds to the HProv relation.
  Query the provenance store directly and compute the appropriate provenance links on the fly.
EVALUATION
The experiments focused primarily on the storage and processing requirements of provenance tracking for the different approaches.
  Query optimization and database tuning are left for future work.
Chose to use random sequences of copy-paste operations to simulate worst case behavior.
EXPERIMENTAL SETUP
Performed five sets of experiments.
Used six patterns of update operations.
Update patterns Deletion patterns
FIRST TWO EXPERIMENTS
First experiment. Figure 7: Number of entries in the provenance store after a variety of update patterns of length 3500.
Second experiment. Figure 8: Number of entries in the provenance store after mix and real update patterns of length 14000. The number at the top of each bar shows the physical size of the table.
N and T store 4 records per copy. H and HT store only 1 record.
SECOND EXPERIMENT
Figure 9 shows the time spent on storing provenance information for all the techniques.
Figure 9: The average amount of time for target database processing and for add, delete, copy and commit operations on the provenance store during 14000-mix update.
Copying in T is close to zero, because copies do not involve interaction with the provenance store.
SECOND EXPERIMENT
Figure 10: The overhead of provenance tracking per operation as a percentage of the time to perform each basic operation.
For the naïve approach, all operations require less than 30% of the processing time needed for interaction with the target DB.
H-provenance requires more time to process inserts than copies.
H-provenance treats deletes as naïve provenance does.
T-provenance: inserts and copies run essentially instantaneously, because no interaction with the target database or provenance store is needed.
THIRD EXPERIMENT
Measured the effects of deletes on provenance storage.
Figure 11: The effect of deletion on the provenance store. The notation (ac) indicates provenance table size when only add and copy operations are performed while (acd) includes deletes.
HT-provenance stores the fewest records among the approaches for each update pattern.
FOURTH EXPERIMENT
Figure 12: The effect of transaction size on provenance processing time.
Time to process a commit grows approximately linearly with transaction length.
FIFTH EXPERIMENT
Displays the time needed to perform basic provenance queries.
Figure 13: The time needed to perform basic provenance queries.
All three queries ran fastest for transactional provenance.
CONCLUSIONS
The experimental results affirm that provenance can be tracked and managed efficiently using our approach.
This is a promising first step towards providing powerful, general-purpose tools that will make life easier for scientific data curators and increase the reliability and transparency of the scientific record.
2nd paper
“How to cite curated databases and how to make them citable” Peter Buneman
WHAT IS THE PROBLEM BEING ADDRESSED IN THE PAPER?
Importance of citing databases. Citing something that:
  Has internal structure.
  Evolves over time.
Propose a stable citation system for IUPHAR. Describe:
  How to publish the database in a form that can be cited.
  How to ensure that the citations remain valid.
  How to generate and validate the citations automatically.
PRELIMINARIES (1/4)
Bard JB and Davies JA. Development, Databases and the Internet. Bioessays 17:999-1001
PRELIMINARIES (1/4)
Citations are used to identify the source material and provide some additional information.
Example: the citations
  Ann. Phys., Lpz 18 639-641
  Nature, 171, 737-738
while adequate for identification, hardly convey the importance of these publications.
PRELIMINARIES (2/4)
A citation does not give us a specific mechanism for retrieving a document.
  It is useful for finding what we are looking for.
  It is a structure that can be used by a variety of mechanisms such as online indexes and search engines.
A citation consists of two kinds of information.
Bard JB and Davies JA. Development, Databases and the Internet. Bioessays. 1995 Nov;17(11):999-1001.
The two kinds of information in the example citation:
  Location: Bioessays 17:999-1001
  Descriptive information (authorship, title, date): Bard JB and Davies JA. Development, Databases and the Internet. 1995 Nov.
PRELIMINARIES (3/4)
Requirements concerning citations:
  There is some “thing” that is being cited.
  The thing should be accessible.
  The thing should not change over time.
There are few accepted practices for supporting citation of data.
  Few standards.
  Little supporting technology.
PRELIMINARIES (4/4)
D1 For any citation C, <C> should remain fixed.
  Since databases change, this simple requirement is not always easy to maintain.
D2 Any citable thing T should contain a citation C such that <C> = T.
  Anything we cite should provide us with at least one way of citing it.
  This is not always done in journal publications.
  It is essential because:
  1. One wants confirmation that we have found the correct citation. Even if we have found T using some other citation C’ (<C> = <C’>), we want to be sure that they refer to the same thing.
  2. If we found <C> by some other means (e.g. a search engine), we want to know how to cite it.
CURRENT PRACTICE
On-line databases frequently give recommendations on how to cite them.
  They often omit version information.
  They fail to provide an adequate location.
The Columbia Guide to Online Style, although it discusses issues of permanence of links, does not mention D1 as one of its citation “principles”.
ISO690 standard deals with citations of parts of electronic documents.
STRUCTURAL ISSUES (1/5)
Databases have explicit structure. This offers the possibility of a citation using this
structure to home in on the relevant data.
Example (IUPHAR database)
Figure 1: Rough structure of the IUPHAR web interface.
The structure of what the user sees is not the same as the underlying database.
STRUCTURAL ISSUES (2/5)
Consider the following:
1. The IUPHAR database (C1) contains no information about Ginandtonicin.
   C1 should refer to the whole database.
2. The IUPHAR database (C2) lists five ligands for Melatonin receptor MT1.
   <C2> should be the web page for that receptor or maybe the receptor family page.
3. The IUPHAR database (C3) asserts that luzindole is an antagonist ligand for receptor MT1.
   Citing just that row or the table? Better, cite the receptor or its family.
Making the context too narrow can be as counterproductive as making it too wide.
STRUCTURAL ISSUES (3/5)
One citation is coarser than another if it refers to a higher structure (<C1> is coarser than <C2>).
D3 It should be possible to cite a database at varying degrees of coarseness.
In order to make further progress we have to look at the internal structure of a citation.
Example: Life Sci., 53, 393-398 (journal, volume number, pages).
Our understanding is based on a common structure of all journals.
STRUCTURAL ISSUES (4/5)
A “concrete syntax” for citations is a sequence {k1 = v1, k2 = v2, ...}, where k1, k2, ... are keywords and v1, v2, ... are associated values.
  Example: {Journal = “Life Sci.”, Number = 53, Pages = 393-398}
There is a natural “part of” relationship among citations.
  Example: {Journal = “Life Sci.”} is part of {Journal = “Life Sci.”, Number = 53}
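The “part of” relationship can be sketched by treating a citation as a dictionary of keyword/value pairs. The dictionary encoding and the name is_part_of are assumptions for illustration.

```python
def is_part_of(coarser, finer):
    """True if every keyword/value pair of coarser also appears in finer."""
    return all(key in finer and finer[key] == value
               for key, value in coarser.items())


journal = {"Journal": "Life Sci."}
volume = {"Journal": "Life Sci.", "Number": 53}
article = {"Journal": "Life Sci.", "Number": 53, "Pages": "393-398"}
```

Under this encoding the journal citation is part of the volume citation, which in turn is part of the article citation, but not the other way round.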
STRUCTURAL ISSUES (5/5)
D4 If C and C’ are citations and <C’> is coarser than <C> then the location information in C’ should be a part of the location information in C.
C’: {DB=IUPHAR, IUPHAR-Receptor-family=Melatonin}
C: {IUPHAR-Receptor-family=Melatonin}
TEMPORAL ISSUES (1/2)
The obvious way to deal with change in citation is to provide, in the citation, a version number.
  {DB=IUPHAR, Version=17, Family=Melatonin}
  Using time may be misleading.
To what does the version refer?
D5 Versions should be recorded at the database level.
  The rate of publication of versions is much slower than the rate of updates.
  Having such a citation obliges someone to keep past versions.
  It is possible to cite a range of versions: {..Version=2-8..}
TEMPORAL ISSUES (2/2)
Now, what is <{DB=IUPHAR, Family=Melatonin}>, a citation without a version number?
  The latest version of the database.
So, we need two words:
  One for a fixed citation.
  One for a “current link”, the place at which you may find the latest information.
A good job of distinguishing between “this” version, the “latest” version and previous versions of documents was presented.
PRESENTATION, CONTENT AND PRESERVATION
The structure of the cited “thing” is not necessarily the same as the structure of the underlying database.
The underlying database contains information (working notes etc.) that is not intended as part of the published material.
  We should not be making direct citations to the internal structure of the database.
The hierarchy that the user sees should be represented as an XML document.
AUTOMATICALLY GENERATING CITATIONS (1/3)
Inserting citation data manually is both time consuming and error prone.
Automatic generation of citations is a good check on the integrity of the document.
  Guarantees that the contents of the document are consistent with the citation.
  Gives guarantees on the descriptive information (e.g. there is at most one Title).
AUTOMATICALLY GENERATING CITATIONS (2/3)
A rule that generates location information:
{DB=IUPHAR, Version=$v, Family=$f} ← /Root[]/Version[Number=$’v]/Data[]/Family[FamilyName=$’f]
{DB=IUPHAR, Version=$v, Family=$f}: a concrete syntax of citations with variables.
The pattern is expressed in the syntax of XPath.
/Root[]: the database or document has a unique root.
Version[Number=$’v]: each Version must have a Number that uniquely identifies the node and provides a value for $v.
Data[]: indicates that for each Version, there is precisely one Data node.
Family[FamilyName=$’f]: each Family node has a FamilyName which uniquely identifies the family.
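The location rule can be approximated with Python's standard xml.etree.ElementTree. This is a sketch under assumptions: the XML layout (Number and FamilyName as child elements) and the sample document are invented for illustration, and the helpers only, add_unique and location_citations are hypothetical. It is not the rule engine described in the paper.

```python
import xml.etree.ElementTree as ET

# Assumed published form of the database; the real IUPHAR XML may differ.
DOC = """
<Root>
  <Version>
    <Number>11</Number>
    <Data>
      <Family><FamilyName>Calcitonin</FamilyName></Family>
      <Family><FamilyName>Melatonin</FamilyName></Family>
    </Data>
  </Version>
</Root>
"""


def only(parent, tag):
    """Return the single child named tag; fail if it is missing or repeated."""
    found = parent.findall(tag)
    if len(found) != 1:
        raise ValueError(f"expected exactly one <{tag}>, found {len(found)}")
    return found[0]


def add_unique(seen, key, what):
    """Enforce that key identifies its node uniquely among its siblings."""
    if key in seen:
        raise ValueError(f"duplicate {what}: {key}")
    seen.add(key)


def location_citations(root):
    """Generate one {DB, Version, Family} citation per Family node."""
    citations = []
    versions = set()
    for version in root.findall("Version"):
        v = only(version, "Number").text.strip()
        add_unique(versions, v, "Version Number")
        families = set()
        for family in only(version, "Data").findall("Family"):
            f = only(family, "FamilyName").text.strip()
            add_unique(families, f, "FamilyName")
            citations.append({"DB": "IUPHAR", "Version": v, "Family": f})
    return citations


citations = location_citations(ET.fromstring(DOC))
```

Because generation fails whenever a uniqueness condition is violated, running it doubles as the integrity check on the document.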
AUTOMATICALLY GENERATING CITATIONS (3/3)
A rule that generates description information:
{DB=IUPHAR, Version=$v, Family=$f, Receptor=$r, Contributors=$a, Editor=$e, Date=$d, DOI=$i} ← /Root[]/Version[Number=$’v, Editor=$?e, DOI=#.i, Date=$.d]/Data[]/Family[FamilyName=$’f]/Contributor-list/Contributor=$+a]/Receptor[ReceptorName=$’r]
Generates:
{DB=IUPHAR, Version=11, Family=Calcitonin, Receptor=CALCR, Contributors={Debbie Hay, David R. Poyner}, Editor=Tony Harmar, Date=Jan, 2006, DOI=10.1234}
$. : exactly one value.
$? : at most one value.
$+ : one or more values expected.
CONCLUSIONS
We have to do a modest amount of work in structuring the data appropriately in XML, after which citations can be specified and generated by some simple rules.