P. Missier, IDCC ‘16 – Feb. 2016
Data Trajectories: tracking reuse of published data for transitive credit attribution
Paolo Missier
School of Computing Science, Newcastle University, UK
IDCC’16, Amsterdam, Feb 24, 2016
A crowded space in Open Research Data (Repositories)
Data publication and reuse: a potential virtuous cycle
Publication
Reuse
Tracking
Partial credit
Article “reuse” == Article citation
• Easy, but limited semantics
Data reuse is more interesting / complicated:
• Data derivation can take many forms
• Multiple programs, information systems
• Multiple generations
1. What happens to published datasets after their publication?
2. Can we follow their trajectory through transformations?
3. Can we use this knowledge to quantify credit to data contributors?
Measuring data impact (see e.g. [1])
[1] Alex Ball, Monica Duke (2015). ‘How to Track the Impact of Research Data with Metrics’. DCC How-to Guides. Edinburgh: Digital Curation Centre. Available online: http://www.dcc.ac.uk/resources/how-guides
Data publication & reuse: a hypothetical scenario
Who gets credit for what?
How much credit should Alice, Bob, Charlie receive?
RO = “Research Object”
[Figure: reuse scenario with three contributors, Alice (1), Bob (2) and Charlie (3); research objects RO1 to RO5; activities P1 and P2; and elements labelled DR1 to DR3]
Recording reuse chains
Sequence of derivations viewed as a provenance graph
• W3C PROV compliant
Assignment and transitive propagation of credit
Assuming this graph can be constructed, an inductive definition of credit:
1. External credit:
• Can be assigned to any ROx in the graph at any time
• How? Don’t care: any (community-based) mechanism is OK
2. Transitively propagated partial credit:
• If ROy is reachable from ROx in the graph, then ROy should receive a portion of the credit given to ROx
Data trajectories
The trajectory DT(RO) of RO contains all RO’ on which RO has had an impact
For each RO, its credit is defined by induction on its trajectory graph:
External credit
Transitive credit
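The formula on this slide did not survive in the text, so the following is only an illustrative reading of the inductive definition, not the author's exact model. It assumes that credit is additive, that the graph is acyclic, and that the transfer function `f` passes on a fixed half of the downstream credit (the deck leaves `f` open). The names `derived_from`, `ext`, `f`, `direct_reusers` and `credit`, and all numeric values, are hypothetical.

```python
from functools import lru_cache

# Hypothetical reuse graph: derived_from[x] lists the ROs that x was derived from.
derived_from = {
    "RO2": ["RO1"],
    "RO3": ["RO1"],
    "RO4": ["RO1", "RO3"],
    "RO5": ["RO2", "RO3"],
}

# External credit, assigned by any community mechanism (made-up values).
ext = {"RO1": 0.0, "RO2": 1.0, "RO3": 2.0, "RO4": 4.0, "RO5": 0.0}


def f(downstream_credit: float) -> float:
    """Assumed credit transfer function: pass on a fixed half."""
    return 0.5 * downstream_credit


def direct_reusers(ro: str) -> list:
    """ROs that were derived directly from `ro`."""
    return [x for x, sources in derived_from.items() if ro in sources]


@lru_cache(maxsize=None)
def credit(ro: str) -> float:
    """cr(RO) = external credit + credit transferred from each direct reuser."""
    return ext[ro] + sum(f(credit(x)) for x in direct_reusers(ro))


print(credit("RO3"))  # 4.0
print(credit("RO1"))  # 4.5
```

In this sketch RO4 contributes to RO1 both directly and through RO3. Whether both paths should count is a modelling choice that the slides do not settle.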
Next steps
1. Define a suitable credit transfer function f
2. Build the provenance graph in practice
• Provlets and their composition
Credit propagation patterns - 1
Most general case:
RO has been reused r times, by activities a1 … ar:
Then, we consider patterns that involve a single activity a
Credit propagation patterns - 2
1. Single-input, single-output activity
We want RO to receive a fraction of the credit of RO’.
Credit transfer parameter through a: α
α models the value of the transformation a relative to its input data RO
High-value transformation: low α → low credit to RO
Simple transformation: high α → high credit to RO
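A minimal sketch of the single-input, single-output pattern, assuming that the credit passed back to RO is simply α times the credit of the derived object RO’, with α in [0, 1]. The slide's own formula is not available in the text, so this multiplicative form and the function name `transfer_single` are assumptions.

```python
def transfer_single(cr_derived: float, alpha: float) -> float:
    """Credit passed back to RO from RO' through activity a.

    Assumed form: alpha * cr(RO'), with alpha in [0, 1].
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha is assumed to lie in [0, 1]")
    return alpha * cr_derived


# A simple transformation (high alpha) returns most of the credit to RO;
# a high-value transformation (low alpha) keeps most of it for RO'.
print(transfer_single(8.0, alpha=0.75))  # 6.0
print(transfer_single(8.0, alpha=0.25))  # 2.0
```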
Credit propagation patterns - 3
2. Multi-input activity: RO is only one of n>1 inputs to a
We account for the relative importance of each of a’s inputs RO1 … ROn
Modelled using n new factors β1 … βn
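One way to picture the multi-input pattern is to split the transferred credit among the inputs according to per-input weights. The sketch below assumes that the weights sum to 1 and that they combine multiplicatively with the activity's transfer parameter; neither assumption is confirmed by the slide text, and `transfer_multi_input` is a hypothetical name.

```python
def transfer_multi_input(cr_derived: float, alpha: float, betas: dict) -> dict:
    """Split alpha * cr(RO') among the inputs of an activity.

    betas maps each input RO to its assumed relative importance;
    the weights are assumed to sum to 1.
    """
    if abs(sum(betas.values()) - 1.0) > 1e-9:
        raise ValueError("input weights are assumed to sum to 1")
    return {ro: alpha * beta * cr_derived for ro, beta in betas.items()}


# Charlie's activity uses RO1 and RO3; here RO1 is assumed to matter more.
shares = transfer_multi_input(8.0, alpha=0.5, betas={"RO1": 0.75, "RO3": 0.25})
print(shares)  # {'RO1': 3.0, 'RO3': 1.0}
```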
Credit propagation patterns - 4
3. Multi-input, multi-output activity: a generates m>1 outputs
Relative importance of derived data products RO’1 … RO’m: γ1 … γm
RO receives credit from each output RO’
These are all part of DT(RO)
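For the multi-output case, a rough sketch is to weight each output's credit by its relative importance before passing it back to RO. The weighted-sum form below, the per-output weights in `gammas`, and the function name `transfer_multi_output` are all assumptions made for illustration; the slide's formula is not available in the text.

```python
def transfer_multi_output(cr_outputs: dict, gammas: dict,
                          alpha: float, beta: float) -> float:
    """Credit passed back to one input RO from all outputs of an activity.

    Assumed form: alpha * beta * sum_j gamma_j * cr(RO'_j), where beta is
    the weight of this input and gamma_j the weight of output RO'_j.
    """
    return alpha * beta * sum(gammas[out] * cr for out, cr in cr_outputs.items())


# Bob's activity turns RO1 into RO2 and RO3 (made-up credits and weights).
received = transfer_multi_output(
    cr_outputs={"RO2": 4.0, "RO3": 8.0},
    gammas={"RO2": 0.25, "RO3": 0.75},
    alpha=0.5,
    beta=1.0,
)
print(received)  # 3.5
```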
Credit propagation patterns - unknown activity
When activity a is unknown, none of the parameters α, β, γ can be used
PROV-CONSTRAINTS (*): there exists some activity a such that:
Modelled using a derivation transfer parameter:
For n known derivations of RO:
(*) https://www.w3.org/TR/prov-constraints/#derivations
Credit from data to Agents
Agents are the actual people to whom the ROs are attributed
Each agent may be responsible for a set R of ROs.
The credit to this agent is simply:
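The formula itself is missing from the text. A natural reading of "simply" is a plain sum of cr(RO) over the agent's set R, which is what this sketch assumes. The credit values and the name `agent_credit` are hypothetical.

```python
def agent_credit(responsible_for: set, cr: dict) -> float:
    """Assumed form: the agent's credit is the sum of cr(RO) over its set R."""
    return sum(cr[ro] for ro in responsible_for)


# Made-up per-RO credits and attributions.
cr = {"RO1": 4.5, "RO2": 1.0, "RO3": 4.0}
attributed = {"Alice": {"RO1"}, "Bob": {"RO2", "RO3"}}

print(agent_credit(attributed["Alice"], cr))  # 4.5
print(agent_credit(attributed["Bob"], cr))    # 5.0
```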
Summary of credit model
RO reuse events → provenance statements about RO → complete provenance graph DT(RO) → cr(RO)
Three elements to cr(RO):
1. External credit that is independent of reuse
- May follow any community-based scoring scheme of data relevance
2. Credit propagation rules computed inductively from DT(RO)
- These formalise the notion of transitive credit
3. A collection of credit transfer parameters
- These account for the nature of the activities involved in DT(RO)
How it might work: a data reuse simulator
Events:
- Data re-use through an activity
- Adjustments to external credit
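A toy version of such a simulator can be sketched in a few lines. This is not the simulator shown in the talk: the event classes `Reuse` and `Adjust`, the `Simulator` class, the single per-activity parameter `alpha` (with no per-input or per-output weights), and the acyclic-graph assumption are all simplifications introduced here.

```python
from dataclasses import dataclass


@dataclass
class Reuse:
    """Data re-use through an activity: inputs are turned into outputs."""
    inputs: list
    outputs: list
    alpha: float = 0.5  # assumed transfer parameter for this activity


@dataclass
class Adjust:
    """Adjustment to the external credit of one RO."""
    ro: str
    delta: float


class Simulator:
    def __init__(self):
        self.ext = {}    # external credit per RO
        self.edges = []  # (derived RO, source RO, alpha)

    def apply(self, event):
        if isinstance(event, Reuse):
            for out in event.outputs:
                self.ext.setdefault(out, 0.0)
                for src in event.inputs:
                    self.ext.setdefault(src, 0.0)
                    self.edges.append((out, src, event.alpha))
        else:
            self.ext[event.ro] = self.ext.get(event.ro, 0.0) + event.delta

    def credit(self, ro):
        """External credit plus alpha-weighted credit of each direct reuser."""
        return self.ext[ro] + sum(
            alpha * self.credit(derived)
            for derived, src, alpha in self.edges
            if src == ro
        )


sim = Simulator()
for event in [
    Reuse(inputs=["RO1"], outputs=["RO2", "RO3"]),
    Adjust("RO3", 4.0),
    Reuse(inputs=["RO1", "RO3"], outputs=["RO4"]),
    Adjust("RO4", 2.0),
]:
    sim.apply(event)

print(sim.credit("RO1"))  # 3.5
```

Credit is recomputed from scratch on each query, which keeps the sketch short; a real simulator would more likely update credit incrementally as events arrive.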
Next steps
1. Define a suitable credit transfer function f
• Credit transfer parameters
2. Build the provenance graph in practice
• Provlets and their composition
Issues in building a graph of reuse events:
1. Modelling reuse events using PROV [easy]
2. Detecting and reporting reuse events in practice [hard!!]
Modelling reuse using PROV
Observable events:
• Alice generates RO1
• Bob reuses RO1, generating RO2, RO3
• Charlie reuses RO1 and RO3, generating RO4 through P2
• Unknown Agent reuses RO2 and RO3, generating RO5 through an unknown activity
Provlets are PROV document fragments generated by multiple, independent, autonomous Information Systems
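To make the idea concrete, the events above can be written down as small sets of PROV-style statements, one set per reporting system, and composed by set union. This sketch uses plain tuples rather than a PROV library. The relation names follow PROV-DM, but the activity name P1 for Bob's program is taken from the scenario figure, and the grouping into four provlets and the helpers `compose` and `derivations` are assumptions made here. Treating every used/wasGeneratedBy pair through the same activity as a derivation is a deliberate simplification; PROV itself does not make that inference.

```python
# Each provlet is a set of (relation, subject, object) statements.
provlet_alice = {
    ("wasAttributedTo", "RO1", "Alice"),
}
provlet_bob = {
    ("used", "P1", "RO1"),
    ("wasGeneratedBy", "RO2", "P1"),
    ("wasGeneratedBy", "RO3", "P1"),
    ("wasAssociatedWith", "P1", "Bob"),
}
provlet_charlie = {
    ("used", "P2", "RO1"),
    ("used", "P2", "RO3"),
    ("wasGeneratedBy", "RO4", "P2"),
    ("wasAssociatedWith", "P2", "Charlie"),
}
# Unknown agent and activity: only the derivations themselves are known.
provlet_unknown = {
    ("wasDerivedFrom", "RO5", "RO2"),
    ("wasDerivedFrom", "RO5", "RO3"),
}


def compose(*provlets):
    """Compose provlets by taking the union of their statements."""
    return set().union(*provlets)


def derivations(graph):
    """Return (derived, source) pairs, explicit or implied via a shared activity."""
    pairs = {(d, s) for rel, d, s in graph if rel == "wasDerivedFrom"}
    used = {(a, e) for rel, a, e in graph if rel == "used"}
    generated = {(e, a) for rel, e, a in graph if rel == "wasGeneratedBy"}
    pairs |= {(out, src) for out, a1 in generated for a2, src in used if a1 == a2}
    return pairs


graph = compose(provlet_alice, provlet_bob, provlet_charlie, provlet_unknown)
for derived, source in sorted(derivations(graph)):
    print(f"{derived} <- {source}")
```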
Provlets - I
Provlets - II
Provlets - III
Provlets - IV
Provlets generation and composition
Is this really practical?
Provlets are generated by multiple, independent, autonomous systems
• Not necessarily cooperative
• Especially in the long tail of science
No guarantee of
• Completeness
• Consistency, e.g. of RO PID usage
Alice misses out on credit due to missing dependencies RO2 → RO1, RO3 → RO1
Provenance and trajectories can be incomplete, partially disconnected
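The effect of a missing provlet can be illustrated with a simple reachability check: credit given to RO5 can only flow back to RO1 if the intermediate derivations were reported. The graphs below are hypothetical, and `upstream` is a helper written for this sketch.

```python
def upstream(ro, derived_from):
    """All ROs reachable from `ro` by following derivation edges."""
    seen, stack = set(), [ro]
    while stack:
        for src in derived_from.get(stack.pop(), ()):
            if src not in seen:
                seen.add(src)
                stack.append(src)
    return seen


# derived_from[x] lists the ROs that x was derived from.
complete = {"RO2": ["RO1"], "RO3": ["RO1"], "RO5": ["RO2", "RO3"]}
# Same scenario, but the derivations of RO2 and RO3 from RO1 were never reported.
incomplete = {"RO5": ["RO2", "RO3"]}

print("RO1" in upstream("RO5", complete))    # True
print("RO1" in upstream("RO5", incomplete))  # False
```

In the second graph the trajectory of RO1 is disconnected from RO5, so no transitive credit from RO5 can reach Alice.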
Challenges: A research agenda
Vision: tracking data re-use in the wild
1. Community efforts
• Incrementally instrument key systems to be provenance-friendly and cooperative
  • Python NoWorkflow
  • R
  • Workflows (Kepler, Taverna, Pegasus, VisTrails, …)
• Facilitate consistent use of PIDs
• Incentivise proactive reporting of re-use instances
2. Research into probabilistic provenance
• Can we estimate the likelihood of some of the missing derivations?
  • Uncertain graph management: a rich foundation
• Can we design robust credit models that incorporate uncertainty of derivation?
Selected references
• Bechhofer, S., De Roure, D., Gamble, M., Goble, C. & Buchan, I. (2010). Research Objects: Towards Exchange and Reuse of Digital Knowledge. Nature Precedings.
• Callaghan, S., Donegan, S., Pepler, S., Thorley, M., Cunningham, N., Kirsch, P., . . . Wright, D. (2012, May). Making Data a First Class Scientific Output: Data Citation and Publication by NERC’s Environmental Data Centres (Vol. 7) (No. 1).
• Katz, D. S. (2014). Transitive credit as a means to address social and technological concerns stemming from citation and attribution of digital products. Journal of Open Research Software, 2(1), e20.
• Moreau, L. & Groth, P. (2013, Sep). Provenance: An Introduction to PROV. Synthesis Lectures on the Semantic Web: Theory and Technology, 3(4), 1–129.
• Wallis, J. C., Rolando, E. & Borgman, C. L. (2013, Jul). If We Share Data, Will Anyone Use Them? Data Sharing and Reuse in the Long Tail of Science and Technology. PLoS ONE, 8(7), e67332.