Hazy Research Group Led by Christopher Ré (Zifei Shan, Mikhail Sushkov, Feiran Wang, Ce Zhang)
The Architecture of DeepDive
Rigorous Probabilistic Framework
Executive Summary
Simpler Feature Engineering
PaleoDeepDive
We gratefully acknowledge the support from Toshiba, Google, DARPA DEFT and XDATA, ONR, and NSF. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the view of Toshiba, Google, DARPA, ONR, NSF, or the US government.
Takeaways
DeepDive enables macroscopic science by building a “dark data” extraction system.
Developers should think about features, not algorithms. It is possible to abstract probabilistic inference for use by domain scientists using standard SQL and Python.
We can achieve quality comparable to (or better than) that of human volunteers.
We demonstrate these three takeaways!
DeepDive is the underlying framework
DeepDive has three features!
Unstructured text Structured Resources
Relational Input Tables
Variables, factors, and connections between variables and factors
Factor Graph
e.g., Appear(Taxon, Formation) relation
Relational Output Tables
Text spans Freebase, Bing results…
Statistical Inference & Learning
Feature Extraction
Feature extraction with Python & SQL
DeepDive supports different high-level languages for specifying a factor graph, e.g., Markov logic networks and WinBUGS.
PaleoDeepDive is built with a combination of Python and SQL
Python SQL
Input relations (e.g. Coref Task)
Phrase
PID  DOC  TEXT
P1   D1   Columbia Fm.
P2   D1   Columbia
Phrase A is coreferent to phrase B if the edit distance between A and B is smaller than 5 and they appear in the same document.
We write an SQL query to generate all phrase pairs that appear in the same document, and pair the query with a Python function.
SELECT t0.PID, t0.TEXT, t1.PID, t1.TEXT
FROM Phrase t0, Phrase t1
WHERE t0.DOC = t1.DOC
USEPYTHON pyfunc
We write a Python function to process all phrase pairs and make predictions.
def pyfunc(p1, t1, p2, t2):
    if edit_dist(t1, t2) < 5:
        emit("Coref", p1, p2)
DeepDive will learn the weight automatically
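The coreference rule above can be tried outside DeepDive with a minimal, self-contained sketch. It assumes that edit_dist means Levenshtein distance and replaces DeepDive's emit with a plain Python list; the helper name coref_candidates and the PHRASES rows (copied from the Phrase table) are illustrative, not part of DeepDive.

```python
def edit_dist(a, b):
    """Levenshtein distance, computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


# Rows of the Phrase relation: (PID, DOC, TEXT).
PHRASES = [("P1", "D1", "Columbia Fm."), ("P2", "D1", "Columbia")]


def coref_candidates(phrases, max_dist=5):
    """Mimic the self-join on DOC, then apply the edit-distance rule.

    Like the SQL query, this keeps self-pairs such as (P1, P1).
    """
    emitted = []
    for p1, d1, t1 in phrases:
        for p2, d2, t2 in phrases:
            if d1 == d2 and edit_dist(t1, t2) < max_dist:
                emitted.append(("Coref", p1, p2))
    return emitted


if __name__ == "__main__":
    for row in coref_candidates(PHRASES):
        print(row)
```

In this sketch the rule only proposes candidate pairs; how much the feature should count is left to the learned weight, as stated above.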
Extract Error analysis Extractor
Application
Write/Improve extractors
DeepDive is able to integrate a diverse set of signals & feedback
DeepDive supports an “E3 loop” for feature engineering
Domain Experts
Unstructured Data
Structured Knowledge Base
- Training labels - HTML documents - Scanned Articles - Maps, photos, images
- Freebase - Macrostrat - Dictionaries
- Training labels - Heuristics & Rules - Hard Constraints
More-Structured Signal
More-Supervised Signal
Crowd
The more signals we use, the better quality we can expect!
Our DimmWitted system is able to run Gibbs sampling at a rate of 100 million variables/sec! (http://arxiv.org/abs/1403.7550)
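For readers unfamiliar with Gibbs sampling on a factor graph, here is a toy, single-threaded sketch. It is not DimmWitted and makes no attempt at its speed; the function name gibbs_marginals, the factor representation (weight, variable indices, 0/1-valued function), and the two-variable example are all assumptions made for illustration.

```python
import math
import random


def gibbs_marginals(num_vars, factors, num_sweeps=5000, burn_in=500, seed=0):
    """Estimate P(x_v = 1) for Boolean variables by Gibbs sampling.

    factors: list of (weight, var_indices, fn), where fn maps a tuple of
    0/1 values (one per index in var_indices) to 0 or 1. A state's
    unnormalized log-probability is the sum of weight * fn(values).
    """
    rng = random.Random(seed)
    state = [rng.randint(0, 1) for _ in range(num_vars)]

    # Index the factors touching each variable so a resample is local.
    touching = [[] for _ in range(num_vars)]
    for factor in factors:
        for v in set(factor[1]):
            touching[v].append(factor)

    counts = [0] * num_vars
    for sweep in range(num_sweeps):
        for v in range(num_vars):
            score = [0.0, 0.0]
            for val in (0, 1):
                state[v] = val
                score[val] = sum(
                    w * fn(tuple(state[i] for i in idx))
                    for w, idx, fn in touching[v]
                )
            # Conditional probability of x_v = 1 given all other variables.
            p_one = 1.0 / (1.0 + math.exp(score[0] - score[1]))
            state[v] = 1 if rng.random() < p_one else 0
        if sweep >= burn_in:
            for v in range(num_vars):
                counts[v] += state[v]

    kept = num_sweeps - burn_in
    return [c / kept for c in counts]


# Hypothetical two-variable graph: a prior on x0 and an "agree" factor.
EXAMPLE_FACTORS = [
    (2.0, (0,), lambda vals: vals[0]),
    (1.5, (0, 1), lambda vals: int(vals[0] == vals[1])),
]

if __name__ == "__main__":
    print(gibbs_marginals(2, EXAMPLE_FACTORS))
```

Enumerating the four states of this small example by hand gives marginals of roughly 0.88 for x0 and 0.74 for x1, which the sampler should approach as the number of sweeps grows.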
Geoscience Research Group Led by Shanan Peters (Jackson Borchardt, Tim Foltz)
PaleoDeepDive: An Application
Statistical Inference using Familiar Data-Processing Languages
Try it out! http://deepdive.stanford.edu
How does climate change impact biodiversity?
T. Rex are found dating to the upper Cretaceous.
(“T. Rex”, “Cretaceous”)
380 volunteer scientists have manually read 11K journal articles since 1994!
Extract biodiversity-related relations from journal articles.
The DeepDive Approach
The central task of PaleoDeepDive is to extract relations from unstructured text automatically:
PaleoDeepDive achieves quality comparable to (or better than) that of human volunteers, at lower cost.
[Figure: total diversity (y-axis, 0 to 3000) versus geological time in M.A. (x-axis, 0 to 500), comparing Human and PaleoDeepDive.]
Different Sources of Signals
DeepDive uses a joint probability model that enables rigorous probabilistic interpretation
[Figure: accuracy versus output probability (both 0 to 1), with Actual and Ideal curves; # extractions (0K to 4K) versus output probability (0 to 0.8). Labels: Goal, Candidates for improvement, Output to users.]
Example Extractor
We expect that 8 of every 10 extractions with output probability 0.8 will be correct
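This expectation can be checked against labeled extractions by bucketing them on output probability and comparing each bucket's empirical accuracy with its probability range. The sketch below is one simple way to do that; the function name calibration_table, the ten-bucket default, and the SAMPLE data are illustrative assumptions, not DeepDive output.

```python
def calibration_table(predictions, num_buckets=10):
    """Group (probability, is_correct) pairs into equal-width buckets.

    Returns one row per non-empty bucket:
    (bucket_low, bucket_high, count, empirical_accuracy).
    """
    buckets = [[0, 0] for _ in range(num_buckets)]  # [count, num_correct]
    for prob, is_correct in predictions:
        # Probability 1.0 falls into the last bucket.
        idx = min(int(prob * num_buckets), num_buckets - 1)
        buckets[idx][0] += 1
        buckets[idx][1] += int(is_correct)

    rows = []
    for i, (count, correct) in enumerate(buckets):
        if count:
            rows.append((i / num_buckets, (i + 1) / num_buckets,
                         count, correct / count))
    return rows


# Made-up sample: ten extractions at probability 0.8, eight of them correct.
SAMPLE = [(0.8, True)] * 8 + [(0.8, False)] * 2

if __name__ == "__main__":
    for low, high, count, accuracy in calibration_table(SAMPLE):
        print(f"[{low:.1f}, {high:.1f}): n={count}, accuracy={accuracy:.2f}")
```

A well-calibrated system yields accuracies close to each bucket's probability range; buckets that fall well short are natural starting points for error analysis.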
PaleoDeepDive