Post on 30-Dec-2015
Benchmark dataset processing
P. Štěpánek, P. Zahradníček
Czech Hydrometeorological Institute (CHMI), Regional Office Brno, Czech Republic,
COST-ESO601 meeting, Tarragona, 9-11 March 2009
E-mail: petr.stepanek@chmi.cz
Outline
- Outliers (daily data)
- Detection (monthly data) + correction (daily data) (description of methodology)
- Benchmark dataset (monthly data)
Processing before any data analysis
Software: AnClim, ProClimDB
Data Quality Control: Finding Outliers
Two main approaches:
- using limits derived from interquartile ranges (time series)
- comparing values to values of neighbouring stations (spatial analysis)
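The first approach can be sketched as follows. This is a minimal illustration, not the ProClimDB implementation: the multiplier k and the sample values are assumptions, since the slides do not state the limits actually used.

```python
from statistics import quantiles


def iqr_limits(values, k=3.0):
    """Return (lower, upper) limits: q1 - k*IQR and q3 + k*IQR.

    k is an assumed multiplier; the source says only that limits are
    derived from interquartile ranges and set per element.
    """
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def flag_outliers(values, k=3.0):
    """Return the indices of values outside the IQR-based limits."""
    lower, upper = iqr_limits(values, k)
    return [i for i, v in enumerate(values) if v < lower or v > upper]


# Hypothetical daily values in °C (e.g. differences from a long-term mean).
series = [0.1, -0.2, 0.0, 0.3, -0.1, 0.2, 6.5, 0.1, -0.3, 0.0]
print(flag_outliers(series))
```

In practice k would be tuned separately for each meteorological element, as the conclusions of the quality control part recommend.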
[Figure: three time-series panels, 1950-2000; vertical axis from -4.0 to 10.0]
Example of outputs for outlier assessment
[Screenshot labels: list of neighbours; altitudes and distances of neighbours; neighbour stations' values; expected value; suspicious values]
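The spatial comparison can be illustrated with a short sketch. It assumes an inverse-distance-weighted mean as the "expected value" and a fixed difference threshold; both are illustrative choices, and the altitude differences listed in the output above are ignored here.

```python
def expected_value(neighbours, power=1.0):
    """Inverse-distance-weighted mean of neighbour values.

    neighbours: list of (value, distance_km) pairs with distance_km > 0.
    The weighting scheme is an assumption, not taken from the slides.
    """
    weights = [1.0 / d ** power for _, d in neighbours]
    weighted_sum = sum(w * v for w, (v, _) in zip(weights, neighbours))
    return weighted_sum / sum(weights)


def is_suspicious(candidate, neighbours, max_diff=5.0):
    """Flag the candidate when it differs too much from the expected value.

    max_diff is a hypothetical threshold in the units of the element.
    """
    return abs(candidate - expected_value(neighbours)) > max_diff


# Hypothetical neighbour readings: (temperature in °C, distance in km).
neighbours = [(12.0, 10.0), (14.0, 20.0), (13.0, 20.0)]
print(expected_value(neighbours))
print(is_suspicious(21.3, neighbours), is_suspicious(13.4, neighbours))
```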
Quality control
- Run for period 1961-2007, daily data (measured values at observation hours)
- All stations (200 climatological stations, 800 precipitation stations)
- All meteorological elements (T, TMA, TMI, TPM, SRA, SCE, SNO, E, RV, H, F), with parameters set individually
- Historical records will follow
Air temperature, number of outliers 1961-2007, from 3,431,000 station-days
[Bar chart: number of outliers (axis 0 to 1200) for T_07:00, T_14:00, T_21:00, T_AVG, TMA, TMI, TPM]
T – air temperature at obs. hour, TMA – daily maximum temp., TMI – daily min. temp., TPM – daily ground minimum temp.
Air temperature, number of outliers 1961-2007, from 3,431,000 station-days
[Chart: number of outliers per month, Jan-Dec (axis 0 to 250), for temperature at 07:00, 14:00, 21:00 and AVG]
Air temperature at obs. hour, AVG – daily average temp.
Air temperature, number of outliers 1961-2007, from 3,431,000 station-days
[Chart: number of outliers per month, Jan-Dec (axis 0 to 250), for max, min temperature and ground minimum (TMA, TMI, TPM)]
TMA – daily maximum temp., TMI – daily min. temp., TPM – daily ground minimum temp.
Air temperature, number of outliers 1961-2007
[Chart: number of outliers per one station (all observation hours, AVG), by year 1961-2007 (axis 0.000 to 0.120)]
Spatial distribution of precipitation stations
period 1961-2007, 600 stations, mean minimum distance: 7.5 km
Problematic detections (heavy rainfall)
Problematic detections (heavy rainfall), radar information
Precipitation, number of outliers 1961-2007, from 13,724,000 station-days
[Chart: number of precipitation outliers per month, Jan-Dec (axis 0 to 1000)]
Precipitation, number of outliers 1961-2007
[Chart: number of outliers per one station, by year 1961-2007 (axis 0.000 to 0.450)]
Conclusions
- Only a combination of several outlier detection methods leads to satisfactory results (detecting "real" outliers while suppressing false detections -> ensemble approach)
- Parameters (settings) have to be found individually for each meteorological element, and possibly also for each region (terrain complexity) and part of the year (noticeable annual cycle in the number of outliers)
- As with homogenization of time series, it is important to use measured values (e.g. from observation hours): outliers are masked in daily averages (and even more in monthly or annual ones)
- Errors were found in all elements and in all investigated countries (AT, CZ, SK, HU)
Experiences with homogenization in the Czech Republic
Homogenization
Change of measuring conditions -> inhomogeneities
Detecting inhomogeneities by SNHT (p=0.05, 950 series)
[Chart: inhomogeneities detected / % (axis 0 to 120) against amount of change in level / °C (0.1 to 1.0); legend: error in years (0, 1, 2, >2)]
Assessing Homogeneity - Problems
- most metadata are incomplete
- we depend on the results of statistical tests
- uncertainty in test results: correct detection of an inhomogeneity is problematic (for smaller amounts of change)
Proposed solution
- "Ensemble" approach: processing a large number of test results for each individual series
- get as many test results for each candidate series as possible
[Flowchart: Data Processing]
- Quality Control - Outliers (Interquartile Range, Comparing to Neighbours)
- Combining Near Stations
- Reference Series (from Correlations, from Distances)
- Homogeneity Testing (Alexandersson test, Bivariate Test, t-test, Mann-Whitney-Pettitt)
- Homogeneity Assessment (Probability)
- Adjusting Data
- Filling Missing Values
- Several Iterations
- Monthly, Seasonal and Annual Averages; Days, Months, seasons, year
How to increase number of test results
Creating Reference Series
- for monthly and daily data (each month individually)
- weighted/unweighted mean from neighbouring stations
- criteria used for station selection (or a combination of them):
  – best correlated / nearest neighbours (correlations from the first-differenced series)
  – limit correlation, limit distance
  – limit difference in altitudes
- neighbouring station series should be standardized to the test series AVG and/or STD (temperature - elevation, precipitation - variance)
- missing data are then not such a big problem
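The reference series construction described here can be sketched as below. This is a simplified illustration under stated assumptions: neighbours are selected only by correlation of first-differenced series, weights are the correlations themselves, and min_corr and max_neighbours are hypothetical settings. Distance and altitude limits are left out.

```python
from statistics import mean, pstdev


def first_differences(x):
    """Differences between consecutive values of a series."""
    return [b - a for a, b in zip(x, x[1:])]


def correlation(x, y):
    """Pearson correlation of two equally long, non-constant series."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def standardize_to(neighbour, candidate):
    """Rescale a neighbour to the candidate's mean (AVG) and std (STD)."""
    mn, sn = mean(neighbour), pstdev(neighbour)
    mc, sc = mean(candidate), pstdev(candidate)
    return [(v - mn) / sn * sc + mc for v in neighbour]


def reference_series(candidate, neighbours, min_corr=0.8, max_neighbours=5):
    """Correlation-weighted mean of the best correlated neighbours."""
    dc = first_differences(candidate)
    scored = []
    for series in neighbours:
        r = correlation(dc, first_differences(series))
        if r >= min_corr:  # limit correlation (assumed value)
            scored.append((r, series))
    scored.sort(key=lambda item: item[0], reverse=True)
    chosen = scored[:max_neighbours]
    if not chosen:
        raise ValueError("no neighbour passes the correlation limit")
    adjusted = [(r, standardize_to(s, candidate)) for r, s in chosen]
    total = sum(r for r, _ in adjusted)
    return [sum(r * s[i] for r, s in adjusted) / total
            for i in range(len(candidate))]


# Hypothetical series for one calendar month (one value per year).
candidate = [8.0, 9.0, 7.5, 8.5, 9.5, 8.0]
warmer = [v + 2.0 for v in candidate]        # same signal, higher level
unrelated = [5.0, 5.1, 5.3, 5.2, 5.0, 5.4]   # rejected by the limit
print(reference_series(candidate, [warmer, unrelated]))
```

Standardizing each neighbour before averaging is what makes gaps less harmful: a neighbour dropping out of the mean does not shift the level of the reference series.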
Relative homogeneity testing
Available tests:
– Alexandersson SNHT
– Bivariate test of Maronna and Yohai
– Mann-Whitney-Pettitt test
– t-test
– Easterling and Peterson test
– Vincent method
– …
20-year parts of the daily series (40 for monthly series, with 10 years overlap),
in SNHT splitting into subperiods at the position of a detected significant changepoint
(30-40 years per one inhomogeneity)
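As an illustration of the first test in the list, the single-shift SNHT statistic can be sketched as follows. It assumes the input is already a series of differences (or ratios) between candidate and reference series and uses the sample standard deviation for standardization. Critical values, which depend on series length, must be taken from published tables and are not included.

```python
from statistics import mean, stdev


def snht(q):
    """Single-shift SNHT on a candidate-minus-reference series q.

    Returns (t0, k): the maximum of T(k) = k*z1^2 + (n-k)*z2^2 and the
    number of values before the most likely shift. z1 and z2 are the
    means of the standardized series before and after position k.
    Assumes q is not constant.
    """
    n = len(q)
    m, s = mean(q), stdev(q)
    z = [(v - m) / s for v in q]
    best_t, best_k = 0.0, None
    for k in range(1, n):
        z1 = mean(z[:k])
        z2 = mean(z[k:])
        t = k * z1 ** 2 + (n - k) * z2 ** 2
        if t > best_t:
            best_t, best_k = t, k
    return best_t, best_k


# Hypothetical difference series with a 1 °C shift after the 6th value.
q = [0.1, -0.1, 0.0, 0.1, -0.1, 0.0, 1.1, 0.9, 1.0, 1.1, 0.9, 1.0]
t0, k = snht(q)
print(round(t0, 2), k)
```

When t0 is significant, the series can be split at position k and each part tested again, which corresponds to the splitting into subperiods mentioned above.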
Homogeneity assessment
Output example: Station Čáslav, 3rd segment, 1911-1950, n=40

Test Ref   I    II   III  IV   V    VI   VII  VIII IX   X    XI   XII  Win  Spr  Sum  Aut  Year
A    avg   1927 1929 1927 1927 1927 1928 1927 1926 1926 1926 1926 1926 1927 1927 1927 1926 1927
A    corr  1927 1927 1927 1927 1927 1928 1927 1926 1926 1926 1926 1926 1927 1927 1927 1926 1927
A    dist  1927 1928 1927 1927 1927 1928 1927 1926 1926 1926 1926 1926 1927 1927 1927 1926 1927
B    avg   1927 1928 1927 1927 1927 1928 1927 1926 1926 1926 1926 1926 1927 1927 1927 1926 1927
B    corr  1927 1927 1927 1927 1927 1928 1927 1926 1926 1926 1926 1926 1927 1927 1927 1926 1927
B    dist  1927 1928 1927 1927 1927 1928 1927 1926 1926 1926 1926 1926 1927 1927 1927 1926 1927

Further detected years (their month/season columns cannot be recovered from the transcript):
A avg:  1930
A corr: 1939, 1938, 1939, 1940, 1922, 1937, 1937, 1935
A dist: 1930, 1940, 1918
B avg:  1922
B corr: 1936, 1938, 1939, 1944, 1922, 1935, 1937, 1937, 1935, 1937
B dist: 1930, 1940, 1931, 1913, 1918
V corr: 1927, 1926, 1937, 1922, 1935, 1937
V dist: 1927, 1927, 1927, 1918
Homogeneity assessment, Output II example:
Summed numbers of detections for individual years

Begin  End   Length  Inhomogeneity number  % detected inhom  % possible inhom  End / Missing
1911   1950  40      140                   100               120
1927                 60                    43                51
1926                 37                    26                32
1928                 9                     6                 8                 4
1937                 7                     5                 6
1922                 4                     3                 3
1935                 4                     3                 3
1918                 3                     2                 3
1930                 3                     2                 3
1939                 3                     2                 3
1940                 3                     2                 3                 2
1938                 2                     1                 2
1913                 1                     1                 1                 3 3
1929                 1                     1                 1
1931                 1                     1                 1
1936                 1                     1                 1
1944                 1                     1                 1
1926   1927  2       97                    69                83
1926   1931  6       111                   79                95
1935   1940  6       20                    14                17
1911   1920  10      4                     3                 3
1921   1930  10      114                   81                97
1931   1940  10      21                    15                18
1941   1950  10      1                     1                 1
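The summing step can be reproduced with a few lines. The sketch below uses the per-year detection counts from the Čáslav example and recomputes the "% detected" column and the sums over year windows. The function names are illustrative, and the basis of the "% possible" column is not stated in the slides, so it is not reproduced.

```python
# Detections per year, taken from the Čáslav output above (1911-1950).
counts = {
    1927: 60, 1926: 37, 1928: 9, 1937: 7, 1922: 4, 1935: 4,
    1918: 3, 1930: 3, 1939: 3, 1940: 3, 1938: 2,
    1913: 1, 1929: 1, 1931: 1, 1936: 1, 1944: 1,
}


def share_of_detections(counts):
    """Percentage of all detections that fall on each year (rounded)."""
    total = sum(counts.values())
    return {year: round(100 * n / total) for year, n in counts.items()}


def window_sum(counts, begin, end):
    """Number of detections with begin <= year <= end."""
    return sum(n for year, n in counts.items() if begin <= year <= end)


shares = share_of_detections(counts)
print(sum(counts.values()), shares[1927], shares[1926])
print(window_sum(counts, 1926, 1927), window_sum(counts, 1921, 1930))
```

Summing over short windows such as 1926-1927 is useful because different tests and months often place the same break one year apart.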
Homogeneity assessment
Columns in the output: ID, ELEM, YEAR_INHOM, BEGIN, END, YEAR_COUNT, Y_POSSIBL, YEAR_END, MISSVALS, X_BEGIN_DA, X_END_DATE, X_BEGIN, X_END, LATITUDE, LONGITUDE, ALTITUDE, B_FULLNAME, REMARK, C_OBSERVER, C_ID

x    B1BOJK01    x  1985       41  14.24  12   23.3.1984  31.3.2003   # #  Bojkovice  change
     B1BOJK01    x  1985       41  14.24  12   23.3.1984  31.12.9999  # #  obs  Vladimˇr Maz lek  B1BOJK01
     B1BYSH01    x  1978       37  12.85
?    B1BYSH01    x  1979       33  11.46
?    B1BYSH01    x  1980       43  14.93
?    B1HLHO01    x  1965       31  10.76  4 1
     B1HOLE01    x  1976       33  11.46
     B1KROM01    x  1977 1978  31  10.76
x    B1RADE01    x  1994       44  15.28  2    1.1.1994   31.12.9999  # #  Radějov  change
     B1RADE01    x  1994       44  15.28  2    1.1.1994   31.12.9999  # #  obs  Josef Pˇ§a  B1RADE01
x    B1RYCH01    x  1973       49  17.01       1.5.1973   28.2.1991   # #  Vyškov, Rychtářov, č.157  change
     B1RYCH01    x  1973       49  17.01       1.9.1972   28.2.1991   # #  obs  Marie Hor kov  B1RYCH01
xx?  B1STRZ01    x  1987       53  18.40
     B1STRZ01    x  1988       30  10.42
     B1UHBR01    x  1983       31  10.76       18.2.1984  31.1.1999   # #  Uherský Brod, Močidla 354  change
     B1UHBR01    x  1983       31  10.76       18.2.1984  12.5.1993   # #  obs  Josef Kudela  B1UHBR01
x    B1UHBR01    x  1984       77  26.74       18.2.1984  31.1.1999   # #  Uherský Brod, Močidla 354  change
     B1UHBR01    x  1984       77  26.74       18.2.1984  12.5.1993   # #  obs  Josef Kudela  B1UHBR01
     B1VELI01    x  1978       31  10.76
?    B1VELI01    x  1977 1978  44  15.28
?    B1VKLO01    x  1984       29  10.07
x    B1VYSK01    x  1999       32  11.11  -1   1.4.1998   31.12.9999  # #  Vyškov, Dukelská 12  change
     B1VYSK01    x  1999       32  11.11  -1   1.4.1998   31.12.9999  # #  obs  VojtŘch Sur k  B1VYSK01
     B2BOSK01_r  x  1968       33  11.46
     B2BREC01    x  1968       35  12.15
     B2BRUM01    x  1989       51  17.71       1.2.1989   31.3.1994   # #  Brumov  change
     B2BRUM01    x  1989       51  17.71       1.2.1989   31.3.1994   # #  obs  Marta Paýˇzkov  B2BRUM01
[Figure: time-series plot, time axis 1911-1947; vertical axis from -1.0 to 0.8]
combining several outputs (sums of detections in individual years, metadata, graphs of differences/ratios, …)
Adjusting monthly data
- using reference series based on correlations
- adjustment: from differences/ratios 20 years before and after a change, monthly
- smoothing monthly adjustments (low-pass filter for adjacent values)
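A minimal sketch of the adjustment step follows, for an element handled by differences (such as temperature). The 20-year window comes from the slide. The circular filter weights (0.25, 0.5, 0.25) are an assumption, since the slides do not specify the low-pass filter, and the function names are illustrative.

```python
from statistics import mean


def monthly_adjustments(diffs, change_year, window=20):
    """Adjustment per calendar month from candidate-minus-reference data.

    diffs: {year: [12 monthly differences]}; change_year is the first
    year after the break. Returns the mean difference after the break
    minus the mean difference before it, for each of the 12 months.
    """
    before = [diffs[y] for y in diffs
              if change_year - window <= y < change_year]
    after = [diffs[y] for y in diffs
             if change_year <= y < change_year + window]
    return [mean(row[m] for row in after) - mean(row[m] for row in before)
            for m in range(12)]


def smooth(adjustments, weights=(0.25, 0.5, 0.25)):
    """Low-pass filter over adjacent months, wrapping December to January."""
    n = len(adjustments)
    w_prev, w_mid, w_next = weights
    return [w_prev * adjustments[(m - 1) % n]
            + w_mid * adjustments[m]
            + w_next * adjustments[(m + 1) % n]
            for m in range(n)]


# Hypothetical differences: a break of 0.4 °C starting in 2000.
diffs = {year: [0.0] * 12 for year in range(1990, 2000)}
diffs.update({year: [0.4] * 12 for year in range(2000, 2010)})
raw = monthly_adjustments(diffs, change_year=2000)
print(smooth(raw))
```

Adding the smoothed adjustment for a given month to the values before the break brings the earlier segment to the level of the later one. For elements handled by ratios (such as precipitation), the difference of means would be replaced by a ratio of means.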
Example: adjusting values - evaluation
[Chart: monthly values, I-XII]
Iterative homogeneity testing
– several iterations of homogeneity testing, results evaluation and series adjusting (3 iterations should be sufficient)
– the question of homogeneity of the reference series is thus solved:
  • possible inhomogeneities should be eliminated by using averages of several neighbouring stations
  • if this is not the case: in the next iteration the neighbours should already be homogenized
Filling missing values
- Before homogenization: influences correct detection of inhomogeneities
- After homogenization: more precise, since the data are not influenced by possible shifts in the series
- Dependence of tested series on reference series
Homogenization of the series in the Czech Republic
[Map with Prague and Brno marked]
Correlations between tested and reference series: air temperature
[Four boxplot panels (1 - 07h, 2 - 14h, 3 - 21h, 4 - AVG): correlation (axis 0.80 to 1.00) by month, Jan-Dec]
Boxplots: median, upper and lower quartiles (for 200 tested series)
Correlations between tested and reference series: precipitation, snow depth, new snow
[Three boxplot panels (1 - precip., 2 - snow depth, 3 - new snow): correlation (axis 0.00 to 1.00) by month, Jan-Dec]
Boxplots: median, upper and lower quartiles (for 800 tested series)
Correlations between tested and reference series: wind speed
[Four boxplot panels (1 - 07h, 2 - 14h, 3 - 21h, 4 - AVG): correlation (axis 0.00 to 1.00) by month, Jan-Dec]
Boxplots: median, upper and lower quartiles (for 200 tested series)
Homogeneity testing results: air temperature
Number of significant inhomogeneities (0.05) detected by the tests used (A, B tests, c and d reference series, altogether)
[Chart: number of detections (axis 0 to 1000) by month, I-XII, for T_07, T_14, T_21, T_AVG]
Amount of adjustments, averages of absolute values, T_AVG (air temperature)
[Chart: adjustment in °C (axis 0.0 to 1.0) by month, I-XII; legend: T_07, T_14, T_21, T_AVG]
Homogeneity testing results: precipitation
4 tests, 4 reference series, 12 months + 4 seasons and year
[Chart: number of detected inhomogeneities (significant) (axis 0 to 6000) by month, I-XII]
Amount of change (ratios, standardized to be >1.0), precipitation (reference series calculation based on correlations)
Boxplots: median, upper and lower quartiles (for 589 tested series)
[Chart: amount of change (standardized) (axis 1.000 to 1.300) by month, I-XII]
Correlation improvement
[Chart: correlation increase (axis -0.005 to 0.025) by month, I-XII]
Homogenization: final remarks, recommendations 1/2
- data quality control before homogenization is very important (if it is not part of it)
- using series of observation hours (complementary to daily AVG) is highly recommended (breaks manifest differently)
- be aware of the annual cycle of inhomogeneities, adjustments, …
- know the behaviour of spatial correlations (of the element being processed) in order to create reference series of sufficient quality …
Homogenization: final remarks, recommendations 2/2
Because of noise in the time series, the following makes sense:
- "Ensemble" approach to homogenization (combining information from different statistical tests, time frames, overlapping periods, reference series, meteorological elements, …)
- more information for assessing inhomogeneities, and so higher quality of homogenization when metadata are incomplete
Software used for data processing
- LoadData: application for downloading data from a central database (e.g. Oracle)
- ProClimDB: software for processing whole datasets (finding outliers, combining series, creating reference series, preparing data for homogeneity testing, extreme value analysis, RCM outputs validation, correction, …)
- AnClim: software for homogeneity testing
http://www.climahom.eu
[Screenshots: AnClim software, ProcData software, ProClimDB software]
Testing of benchmark dataset
- Fully automated (a single click), but it should not work like this in reality:
- the detection phase can be fully automated, but the decision about which breaks to correct should be made by a person (comparison with metadata, plots of differences/ratios, …)
Fully automatic detection phase in ProClimDB software: selecting neighbours for reference series
Fully automatic detection phase in ProClimDB software: reference series calculation and launching AnClim
Automation in AnClim: tests can be selected in advance
Test results processed back in ProClimDB
Non-automated decision about breaks before correction
http://www.climahom.eu