Unifying Diverse Watershed Data to Enable Analysis
C van Ingen, D Agarwal, M Goode, J Gupchup, J Hunt, R Leonardson, M Rodriguez, N Li
Berkeley Water Center, Johns Hopkins University,
Lawrence Berkeley Laboratory, Microsoft Research,
University of California, Berkeley
Introduction
Over the past year, we’ve been exploring how to build and use a digital watershed in the cloud.
• Our focus is enabling end-user analysis
• Assumes data access will get better (thanks to CUAHSI and others)
• Bottom-up approach: start with the database and build up to the tool
• Just-in-time approach: build tools to solve science needs
• In the cloud, to free the scientist from any operational issues associated with the technology we use
http:/www.berkeley.edu/RussianRiver
Hydrologic Data Analysis Pipeline
[Diagram labels: Distributed Data Sets; Data Gateway; Analysis Gateway; Models, Analysis Tools; Knowledge discovery, Hypothesis testing, Water Synthesis; Dissemination; Data Archive; Data Transformations]
Challenge is to Connect Data, Resources, and People
Data Flow Pipeline
Agency web site, streaming sensor data, or other source
CSV Files
BWC SQL Server Database
BWC Data Cube
Reports, Excel Pivot Table, MatLab, ArcGIS
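The flow above (source, CSV files, database, summarized view) can be sketched in miniature. This is only an illustrative sketch: it uses SQLite from the Python standard library as a stand-in for the BWC SQL Server database, a GROUP BY query as a stand-in for the data cube, and the CSV columns, site names, and values are hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical CSV export; column names and values are illustrative only.
CSV_TEXT = """site,time,variable,value
SiteA,2007-01-01,discharge,12.5
SiteA,2007-01-02,discharge,14.0
SiteB,2007-01-01,discharge,3.2
"""

def load_csv(conn, text):
    """Stage CSV rows into one flat table (the database step)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS staged "
        "(site TEXT, time TEXT, variable TEXT, value REAL)"
    )
    rows = [
        (r["site"], r["time"], r["variable"], float(r["value"]))
        for r in csv.DictReader(io.StringIO(text))
    ]
    conn.executemany("INSERT INTO staged VALUES (?, ?, ?, ?)", rows)
    return len(rows)

def summarize(conn):
    """Count and average per site and variable (a stand-in for the cube step)."""
    cur = conn.execute(
        "SELECT site, variable, COUNT(*), AVG(value) FROM staged "
        "GROUP BY site, variable ORDER BY site, variable"
    )
    return cur.fetchall()

conn = sqlite3.connect(":memory:")
loaded = load_csv(conn, CSV_TEXT)
summary = summarize(conn)
```

A reporting client would then read only `summary`, never the raw CSV files.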
Key Schema Abstractions
• Data, ancillary data, and metadata
– Analyses often require combining time series data with fixed, or nearly fixed, ancillary data such as river mile, vegetative cover, or sediment grain size
– Ancillary data used as fixed property, time series, or event time window
– Metadata describing algorithms, measurement techniques, etc.
– Normalized table structure simplifies adding variables and cube building
• Versioning and folder-like collections
– Accommodate algorithm changes, temporal granularity, and derived quantities
– Track derivations through processing pipeline
– Define and track analysis “working set”
• Namespace translation
– Data assembly traverses different repositories, each with its own (useful?) name space
– Some repositories encode metadata in the variable name space (e.g. USGS turbidity)
Any access layer shares the same abstractions.
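Namespace translation can be pictured as a lookup from a repository-specific variable name to a shared name plus whatever metadata the source name encodes. The sketch below is an assumption about how such a table might look; the repository names, source variable names, and metadata keys are all hypothetical and are not taken from any real agency's naming scheme.

```python
# Hypothetical translation table: (repository, source name) maps to a
# canonical variable name plus the metadata encoded in the source name.
TRANSLATION = {
    ("agency_a", "turbidity_fnu_sensor1"): (
        "turbidity",
        {"units": "FNU", "method": "sensor1"},
    ),
    ("agency_b", "Turb"): ("turbidity", {}),
    ("agency_b", "Q"): ("discharge", {}),
}

def translate(repository, source_name):
    """Return (canonical_name, metadata) for a repository-specific name."""
    key = (repository, source_name)
    if key not in TRANSLATION:
        raise KeyError(f"no translation for {source_name!r} in {repository!r}")
    canonical, metadata = TRANSLATION[key]
    # Copy so callers cannot mutate the shared table.
    return canonical, dict(metadata)

name_a, meta_a = translate("agency_a", "turbidity_fnu_sensor1")
name_b, meta_b = translate("agency_b", "Turb")
```

Both lookups yield the same canonical name, so data from the two hypothetical repositories can be combined under one variable while the decoded metadata is kept alongside.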
Database Schema Subset
data: FK1 sitesetid, FK2 siteid, FK3 datumid, value, FK4 time, FK5 exdatumid, FK6 repeat, FK7 offsetid, FK8 qualityid
dataset: PK datasetid, howmade, createTime, lastAppendTime, lastModifyTime, appendOnlyTime, fixTime, deleteTime, FK2 creatorid, name, description
offset: PK offsetid, value, units
site: PK siteid, ..., name, ...
time: PK time
siteset: PK sitesetid, FK1 siteid, createTime, lastAppendTime, lastModifyTime, appendOnlyTime, fixTime, deleteTime, ingestChecksum, FK2 parentSitesetid, FK3 creatorid, name, description, howmade, path
dataset_siteset: FK1 datasetid, FK2 sitesetid
repeat: PK repeat
datumtype: PK datumid, shortname, units, name, offsetunits
exdatumtype: PK exdatumid, debris
quality: PK qualityid, qualityflags, gapflags
investigator: PK investigatorid, ..., name, ...
• Star schema for data, similar to CUAHSI ODM
• Ancillary data shredded like data
– Active over a time range
– Numeric or text
– Flows to the data cube as site attribute or time series data
• Two-level versioning maps to data sourcing
– Bound into a dataset version with spline filter
– Only the dataset flows to the data cube
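A reduced version of the star layout can be tried with SQLite. This is a sketch under stated assumptions, not the BWC schema itself: only the `site`, `datumtype`, and `data` tables are kept, with a few of the columns named on the slide, and the rows and units are hypothetical. It shows one consequence of the normalized structure: adding a variable means adding a row to `datumtype`, with no change to the fact table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE site (siteid INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE datumtype (
        datumid INTEGER PRIMARY KEY, shortname TEXT, units TEXT
    );
    CREATE TABLE data (
        siteid INTEGER REFERENCES site (siteid),
        datumid INTEGER REFERENCES datumtype (datumid),
        "time" TEXT,
        value REAL
    );
    """
)
# Hypothetical dimension and fact rows.
conn.executemany("INSERT INTO site VALUES (?, ?)", [(1, "Site A"), (2, "Site B")])
conn.executemany(
    "INSERT INTO datumtype VALUES (?, ?, ?)",
    [(1, "discharge", "cfs"), (2, "turbidity", "FNU")],
)
conn.executemany(
    "INSERT INTO data VALUES (?, ?, ?, ?)",
    [(1, 1, "2007-01-01", 12.5), (1, 2, "2007-01-01", 4.0), (2, 1, "2007-01-01", 3.2)],
)

def values_for(conn, shortname):
    """Join the fact table to its dimensions for one variable."""
    sql = """
        SELECT s.name, d."time", d.value
        FROM data AS d
        JOIN site AS s ON s.siteid = d.siteid
        JOIN datumtype AS t ON t.datumid = d.datumid
        WHERE t.shortname = ?
        ORDER BY s.name, d."time"
    """
    return conn.execute(sql, (shortname,)).fetchall()

discharge_rows = values_for(conn, "discharge")

# Adding a new variable needs only a new dimension row, not a schema change.
conn.execute("INSERT INTO datumtype VALUES (3, 'temperature', 'degC')")
conn.execute("INSERT INTO data VALUES (2, 3, '2007-01-01', 9.5)")
temperature_rows = values_for(conn, "temperature")
```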
[Figure: chart of data point count (vertical axis, 0 to 30,000) by site and year. Filters: Dataset = USGS Surface Water Data Jan 2007; Datumtype = Mean Discharge; Quality = All. The site axis lists USGS station names, e.g. RUSSIAN R NR GUERNEVILLE CA and DRY C NR CLOVERDALE CA.]
Datacube Basics
• A data cube is a database specifically for data mining (OLAP)
– Organizes data along dimensions such as time, site, or variable type
– Easy to group, filter, and aggregate data in a variety of ways
– Simple aggregations such as sum, min, or max can be pre-computed for speed
– Additional calculations such as median can be computed dynamically
• SQL Server Analysis Services (SSAS) provides the OLAP engine
• SQL Server Business Intelligence Development Studio is used to define and tune the cube
• Excel and other client tools enable simple browsing
– Minimizes the total software burden of writing queries (SQL or MDX)
[Figure: Discharge and Turbidity variability]
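The difference between pre-computed and dynamically computed aggregates can be shown without an OLAP engine. The sketch below is plain Python, not SSAS or MDX, and the fact rows are hypothetical. Count, sum, min, and max are stored per (site, year) cell and combined when rolling up across years; the median cannot be combined from cell summaries, so it is computed from the underlying values on request.

```python
from statistics import median

# Hypothetical fact rows: (site, year, value).
FACTS = [
    ("Site A", 2005, 10.0),
    ("Site A", 2005, 14.0),
    ("Site A", 2006, 8.0),
    ("Site B", 2005, 3.0),
]

def precompute(facts):
    """Pre-compute count, sum, min, and max for each (site, year) cell."""
    cells = {}
    for site, year, value in facts:
        cell = cells.setdefault(
            (site, year),
            {"count": 0, "sum": 0.0, "min": value, "max": value},
        )
        cell["count"] += 1
        cell["sum"] += value
        cell["min"] = min(cell["min"], value)
        cell["max"] = max(cell["max"], value)
    return cells

def roll_up(cells, site):
    """Combine cells across years for one site (assumes the site has data)."""
    parts = [c for (s, _year), c in cells.items() if s == site]
    return {
        "count": sum(c["count"] for c in parts),
        "sum": sum(c["sum"] for c in parts),
        "min": min(c["min"] for c in parts),
        "max": max(c["max"] for c in parts),
    }

def median_for(facts, site):
    """Median needs the raw values, so it is computed on demand."""
    return median(v for s, _year, v in facts if s == site)

cells = precompute(FACTS)
site_a = roll_up(cells, "Site A")
site_a_median = median_for(FACTS, "Site A")
```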
Daily Discharge Availability by Site by Year
• Each bar is a count of data points, color coded by reporting year
• The higher the bar, the more reported data
Learnings and Observations
• Simplifying data discovery speeds analysis
– Discovery is a necessary precursor step to analysis
– What data, where, when? At what quality?
• Versioning is critical
– Site-variable most naturally maps to analysis patterns
– Dataset too coarse; individual measurement too fine
– Ancillary data must be versioned as well
• Matching the scientific notion of time to commercial tools can be problematic
– Second month of the water year has 30 days in the US
– MODIS week
– Granularity varies widely
• Plan on a decode stage for name, location, time, quality
• Don’t forget historic (non-digital) data
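The water-year point can be made concrete with a small helper. The sketch assumes the US convention in which the water year runs from 1 October to 30 September and is named for the calendar year in which it ends; the function names are illustrative. Under that convention the second month is November, which is why it has 30 days, while a calendar-based tool would treat month 2 as February.

```python
import calendar
from datetime import date

def water_year(d):
    """Water year of a date: 1 Oct to 30 Sep, named for the year it ends."""
    return d.year + 1 if d.month >= 10 else d.year

def water_year_month(d):
    """Month index within the water year: October = 1, ..., September = 12."""
    return (d.month - 10) % 12 + 1

def days_in_water_year_month(wy, index):
    """Number of days in month `index` (1 to 12) of water year `wy`."""
    cal_month = (index + 8) % 12 + 1           # 1 -> October, 4 -> January
    cal_year = wy - 1 if cal_month >= 10 else wy
    return calendar.monthrange(cal_year, cal_month)[1]

wy = water_year(date(2006, 11, 15))            # falls in water year 2007
idx = water_year_month(date(2006, 11, 15))     # November is month 2
november_days = days_in_water_year_month(2007, 2)
```

A time dimension built for analysis would carry both the calendar fields and these water-year fields, so that either grouping is available to client tools.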
[Figure: plot of Annual Runoff [mm] (vertical axis, 0 to 2000) against Annual Precipitation [mm] (horizontal axis, 0 to 2000). Series: Ukiah (100 sq mi), Hopland (362 sq mi), Cloverdale (503 sq mi), Healdsburg (793 sq mi), Guerneville (1338 sq mi).]