
JOURNAL OF GEOPHYSICAL RESEARCH: ATMOSPHERES, VOL. 119, 65–74, doi:10.1002/2013JD020195, 2014

Probabilistic forecasting for isolated thunderstorms using a genetic algorithm: The DC3 campaign

Christopher J. Hanlon,1 George S. Young,1 Johannes Verlinde,1 Arthur A. Small,2 and Satyajit Bose2

Received 15 May 2013; revised 7 September 2013; accepted 2 October 2013; published 14 January 2014.

[1] Researchers on the Deep Convective Clouds and Chemistry (DC3) field campaign in summer 2012 sought airborne in situ measurements of isolated thunderstorms in three different study regions: northeast Colorado, north Alabama, and a larger region extending from central Oklahoma through northwest Texas. Experiment objectives required thunderstorms that met four criteria. To sample thunderstorm outflow, storms had to be large enough to transport boundary-layer air to the upper troposphere and have a lifetime long enough to produce a large anvil. The storms had to be small enough to sample safely and isolated enough that experimenters could distinguish the impact of a particular thunderstorm from other convection in the area. To aid in the optimization of daily flight decisions, an algorithmic forecasting system was developed that produced probabilistic forecasts of suitable flight conditions for each of the three regions. Atmospheric variables forecast by a high-resolution numerical weather prediction model for each region were converted to probabilistic forecasts of suitable conditions using fuzzy logic trapezoids, which quantified the favorability of each variable. In parallel, the trapezoid parameters were tuned using a genetic algorithm and the favorability values of each of the atmospheric variables were weighted using a logistic regression. Results indicate that the automated forecasting system shows predictive skill over climatology in each region, with Brier skill scores of 16% to 45%. Averaged over all regions, the automated forecasting system showed a Brier skill score of 32%, compared to the 24% Brier skill score shown by human forecast teams.

Citation: Hanlon, C. J., G. S. Young, J. Verlinde, A. A. Small, and S. Bose (2014), Probabilistic forecasting for isolated thunderstorms using a genetic algorithm: The DC3 campaign, J. Geophys. Res. Atmos., 119, 65–74, doi:10.1002/2013JD020195.

1. Introduction

[2] Field campaigns in the atmospheric sciences typically require the deployment of limited resources under conditions of uncertainty about the evolving atmospheric state. In most cases, human forecasters use experience and heuristics to forecast the state of the atmosphere and convey this information to decision makers. Algorithmic decision recommendation systems using probabilistic forecasts have shown promise in improving upon traditional heuristic forecasting and decision-making methods for field campaigns studying boundary layer clouds [Small et al., 2011] and cirrus clouds [Hanlon et al., 2013]. Such a decision recommendation algorithm provides decision support each day of a field campaign conditional on both the forecast state of the atmosphere and the current state of the field campaign (e.g., the number of flight hours remaining and the number of days remaining), with the goal of maximizing the amount of data collected from a finite budget of resources.

1 Department of Meteorology, Penn State University, University Park, Pennsylvania, USA.
2 Venti Risk Management, State College, Pennsylvania, USA.

Corresponding author: C. J. Hanlon, Department of Meteorology, Penn State University, 503 Walker Bldg., University Park, PA 16802, USA. ([email protected])

©2013. American Geophysical Union. All Rights Reserved. 2169-897X/14/10.1002/2013JD020195

[3] The Deep Convective Clouds and Chemistry (DC3) project during late spring and early summer 2012 sought to sample isolated thunderstorms in three study regions, each region with a different climatology. The data collection stage of the DC3 project began on 16 May 2012, continuing through 30 June 2012. DC3 investigators deployed extensively instrumented aircraft to three study regions defined by the coverage of research-grade ground-based facilities to gather observations to improve understanding of the role of convective clouds in determining the composition and chemistry of the upper troposphere and lower stratosphere [Barth et al., 2012]. Observations were taken in three regions chosen for their coverage by ground-based facilities: northeast Colorado, north Alabama, and a larger region extending from central Oklahoma through northwest Texas which could be covered by mobile radars.

Figure 1. The three study regions for which probabilistic forecasts were generated: Alabama, Colorado, and Oklahoma/Texas.

[4] In order to build a decision recommendation system analogous to those implemented by Small et al. [2011] and Hanlon et al. [2013], a calibrated probabilistic forecasting system was required. The forecasting system needed to provide two major inputs for the decision recommendation system. First, for each day during the field experiment, the forecasting system needed to supply an estimated probability of regional weather conditions suitable for data collection using aircraft, conditional on the modeled state of the atmosphere. Second, the forecasting system had to provide a historical probability distribution of forecasts. To meet the requirements of the decision recommendation system, we developed an automated forecasting system rather than a forecasting system fed by human forecasts.

[5] While an automated forecasting system offers less nuance than human forecasts, the automated forecasting system offers the advantages of calibration and historical applicability. By calibrating forecasts to outcomes using historical model and radar data, an automated forecasting system removes systematic biases. In contrast, because field experiment forecasters often lack the opportunity to calibrate their forecasts to the particular problem's climatology, they may systematically over- or under-forecast the probability of suitable conditions. The development of the automated forecasting system also yields a historical probability distribution of forecasts, which provides context to the forecasting system that is essential for the decision recommendation system. Obtaining such a historical distribution of human forecasts is impossible or impractical for most applications. For these reasons, an automated forecasting system was developed for the DC3 campaign as input to a decision recommendation system. The forecasting system is the subject of this paper.

2. Definition of “Good” Conditions

[6] A quantitative assessment of the probability of suitable data collection conditions for a given time period necessarily requires a precise definition of suitable conditions. While an experienced human forecaster may be able to “eyeball” good conditions, an automated forecasting system requires an exact definition, in advance, applied consistently. A precise definition allows for the generation of pre-experiment statistical analysis but precludes potentially helpful tinkering with the definition during the experiment. Creating such a precise definition for the DC3 campaign required an interpretation of investigators' preexperiment documentation, interviews with principal investigators, and results of test flights. Having a definition of suitable conditions that matches the working definition used by researchers is critical to the value of the decision recommendation system. The DC3 campaign sought isolated, deep convection [Barth et al., 2012]. To build a training data set for the forecasting system, we need a way to quantitatively identify historical “good” conditions. For the purposes of this forecasting system, five subregions were considered. The Oklahoma-Texas region was represented by three subregions, each subregion defined by the horizontal extent of a National Weather Service (NWS) Doppler radar site: central Oklahoma (Twin Lake, KTLX), southwest Oklahoma (Frederick, KFDR), and northwest Texas (Lubbock, KLBB). The north Alabama subregion was defined by an approximation of the dual-Doppler coverage area from three radars: the Advanced Radar for Meteorological and Operational Research (ARMOR) located at the Huntsville airport, the University of Alabama at Huntsville Mobile Alabama X-band dual-polarimetric (MAX) radar, and the Hytop, AL (KHTX) NWS Doppler radar. This area was entirely covered by the KHTX radar, which was used to verify thunderstorm conditions in this subregion. The northeast Colorado subregion was defined by an approximation of the dual-Doppler coverage area from the Colorado State University (CSU) CSU-CHILL and CSU-Pawnee radars, modified by the assumption that planes could not fly west of the longitude of Boulder, CO, due to topography. This area was entirely covered by the KFTG radar site, which was used to verify thunderstorm conditions in this subregion. A map of the horizontal extent of the five subregions is shown in Figure 1. The historical set of complete volume scans at the National Climatic Data Center for all the sites was incomplete; therefore, only base reflectivity data were used to characterize the state of convection.

[7] The definition of suitable data collection conditions conformed to experiment objectives, which required aircraft sampling of thunderstorms that met several criteria. In order to sample thunderstorm outflow, storms had to be

Table 1. Criteria Required for Good Conditions During the DC3 Campaign Using Base Reflectivity Data^a

Criterion  Conditions
1          Contiguous area of 50 dBZ reflectivity in subregion
2          Contiguous 50 dBZ area > 20 km2 (40 km2 in OK/TX region)
3          80 km by 80 km box centered on area centroid has <250 km2 of 50 dBZ coverage
4          80 km by 80 km box centered on area centroid has <1200 km2 of 30 dBZ coverage

^a For an hour to be considered good, these four criteria must be met for 80% of the radar volume scans in the hour.


Figure 2. A sample fuzzy logic trapezoid. In this idealized example, the quality of CAPE is 1 for CAPE values between 1000 and 2000 J/kg and 0 for CAPE values below 500 J/kg and above 3000 J/kg and varies linearly along the sloped portions of the trapezoid. The trapezoid can be defined by its four vertices, at 500, 1000, 2000, and 3000 J/kg.

large enough to transport boundary-layer air to the upper troposphere and have a lifetime long enough to produce a large anvil. The storms had to be small enough to sample safely and isolated enough that experimenters could distinguish the impact of a particular thunderstorm from that of other convection in the area. Isolated thunderstorms and supercell thunderstorms were deemed to be ideal targets for the DC3 campaign. Larger-scale thunderstorm systems were considered to be too large during their mature stages but could be viable targets earlier in their development. Table 1 summarizes the quantitative criteria used to define “good” conditions for a particular radar volume scan. Criteria 1 and 2 ensured that convection was deep, while Criteria 3 and 4 ensured that convection was isolated and not too large. A good hour is defined as one during which at least 80% of radar scans are good, while a good day is defined as one with at least one good hour between 15 Z and 00 Z.
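The Table 1 criteria and the good-hour/good-day definitions above translate into a short check. The sketch below is illustrative only: the helper names and the precomputed scalar inputs (contiguous 50 dBZ area, box coverages) are assumptions, since the paper works directly from gridded base reflectivity.

```python
def scan_is_good(area_50dbz_km2, box_50dbz_km2, box_30dbz_km2,
                 min_area_km2=20.0):
    """Apply the four Table 1 criteria to one radar volume scan.

    area_50dbz_km2: largest contiguous 50 dBZ area in the subregion
    box_50dbz_km2:  50 dBZ coverage in the 80 km by 80 km box on its centroid
    box_30dbz_km2:  30 dBZ coverage in the same box
    min_area_km2:   20 km2 (40 km2 for the OK/TX subregions)
    """
    return (area_50dbz_km2 > 0.0 and            # criterion 1: 50 dBZ present
            area_50dbz_km2 > min_area_km2 and   # criterion 2: deep enough
            box_50dbz_km2 < 250.0 and           # criterion 3: not too large
            box_30dbz_km2 < 1200.0)             # criterion 4: isolated

def hour_is_good(scan_flags):
    """An hour is good if at least 80% of its volume scans are good."""
    return sum(scan_flags) >= 0.8 * len(scan_flags)

def day_is_good(hourly_flags):
    """A day is good if any hour between 15 Z and 00 Z is good."""
    return any(hourly_flags)
```

For example, a 30 km2 isolated storm passes all four criteria in the Alabama and Colorado subregions but fails criterion 2 under the stricter 40 km2 threshold used for OK/TX.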

3. Forecasting System Design and Implementation

[8] The automated forecasting system developed for the DC3 campaign transforms raw model data into probabilities of atmospheric conditions suitable for data collection. Rather than directly using model-derived reflectivity to make predictions, the automated forecasting system uses a postprocessor to convert certain model output variables to a probability of good conditions. Because it is trained on past model data and radar data, this conditional probability of good conditions is well calibrated: the forecasting system will not, in the long run, over- or under-forecast the probability of suitable conditions.

[9] The forecasting system is inspired by the requisite conditions for thunderstorm development outlined by Fawbush and Miller [1953]. Fawbush and Miller's four conditions required simultaneously for tornadic thunderstorm development were conditional instability, relatively dry air aloft, wind shear, and the presence of a lifting mechanism. Doswell III [1987] and Johns and Doswell III [1992] noted the importance of low-level moisture, conditional instability, and a lifting mechanism as ingredients necessary for deep convection, and Johns and Doswell III [1992] also noted the importance of wind shear for tornado-producing supercell thunderstorms. While the forecasting system was not seeking tornado development, we used these conditions as a proxy for the conditions required for the development of supercells, which were ideal targets for DC3. Our system represents these four conditions with four model-forecast meteorological predictors for each subregion. Conditional instability is approximated by subregion median mixed-layer convective available potential energy (MLCAPE), moisture aloft is approximated by subregion median 700 mbar relative humidity (RH) (500 mbar RH used for CO and TX subregions to account for higher elevation), wind shear is approximated by subregion median bulk Richardson number (BRN), and the presence of a lifting mechanism is approximated by subregion maximum 850 mbar vertical velocity (700 mbar vertical velocity used for CO and TX subregions to account for higher elevation). The use of low-level vertical velocity as a predictor is also intended to account for convective inhibition. If the modeled convective inhibition is stronger than the modeled lifting mechanisms, the subregion will not have high values of vertical velocity. The forecasting system was trained on these predictors rather than model-forecast radar reflectivity because the model reflectivity is sensitive to the model-resolved microphysics, introducing an unnecessary source of error.

[10] Predictors for the forecasting system were drawn from the 0000 UTC run of the National Center for Atmospheric Research (NCAR) 3 km Weather Research and Forecasting (WRF) model [Weisman et al., 2008]. Each hour between 1500 UTC and 0000 UTC during May and June in the period of record of the NCAR WRF was treated as an independent case of training data for the forecasting system. The period of record used for training included 3 years (2004 to 2006) when the model was run with 4 km resolution and 5 years (2007 to 2011) when the model was run with 3 km resolution. Corresponding radar data from each hour for each subregion were converted to a binary response

Figure 3. A diagram explaining the conversion of model-forecast predictors to forecast probabilities. The trapezoids fit by the genetic algorithm determine the values of Pi, i = 1, ..., 6, which serve as measures of predictor suitability where 1 is ideal and 0 is unsuitable. Based on historical forecast and verification data, a logistic regression is used with predictors Pi, yielding coefficients βi, i = 0, ..., 6. The coefficients are then combined with the predictor suitability values, giving a probability of good conditions Pgood = β0 + Σ(i=1 to 6) βiPi.


Table 2. Settings Used by the Genetic Algorithm^a

Setting                                    Value
CAPE lower bound (J/kg)                    0
BRN lower bound                            0
RH lower bound                             0
CAPE upper bound (J/kg)                    5000
BRN upper bound                            1000
RH upper bound                             1
W upper bound (m/s)                        18.5
Initial CAPE trapezoid parameters (J/kg)   unif(0,4000)
Initial BRN trapezoid parameters           unif(0,80)
Initial RH trapezoid parameters            unif(0,1)
Initial W trapezoid parameters (m/s)       unif(0,15)
Generations                                100
EliteCount                                 0
HybridFcn                                  @fminsearch
PopulationSize                             80
FitnessScalingFcn                          @fitscalingrank_4th_root

^a Lower bounds of the parameters for CAPE, BRN, and RH prevent negative values of parameters for those variables. Upper bounds of the parameters for CAPE, BRN, RH, and W are set on the order of the highest model-forecast values of those variables. Initial parameters are drawn from a uniform distribution and sorted from smallest to largest. “Generations” is the maximum number of iterations before the genetic algorithm stops. “EliteCount” is the number of individuals that survive to the next generation. “HybridFcn” is the function that continues optimization after the genetic algorithm terminates. “PopulationSize” is the number of individuals in the population. “FitnessScalingFcn” is the function that scales values of the fitness function. All other genetic algorithm settings are default settings from the Matlab Global Optimization toolbox.

variable as described in Table 1, denoting good hours and bad hours. Concurrent model data and radar data were available for approximately 3000 h for each subregion, constituting approximately 3000 cases of training data. The five subregions were treated separately in the forecasting system development because the convective climatologies differ for each.

[11] Though the use of a high-resolution numerical weather prediction model was deemed necessary in order to resolve the isolated convection sought by DC3 investigators, the use of a research-grade model presented issues. The limited amount of data available from the NCAR WRF forced a simplifying assumption that all hours in the training data are used to train the same model, regardless of time of day and day of season. This assumption implies that the diurnal and seasonal variation in the probability of isolated thunderstorm formation is explained entirely by the diurnal and seasonal variation of the model predictors. Given more training data, alternate forecast systems could be developed for different times of day and different times of the year. Another major consequence of using a research-grade model is that year-over-year changes in the physics, parameterizations, and resolution of the NCAR WRF could have significantly affected the forecast predictors. An operational model such as the Global Forecast System (GFS) would have offered a larger sample and stable model physics but exhibited less skill in resolving the relevant meteorology. Future work will test the sensitivity of the skill of the forecast system to the chosen model.

[12] As inspired by systems used by NCAR for other atmospheric applications [Williams et al., 2008], fuzzy logic trapezoids were used to transform the raw values of each predictor. For each predictor, the “suitability” of that variable is assumed to be expressible on a scale from 0 to 1. While in traditional fuzzy logic, this suitability value is treated as a probability, we use this value as a measure of parameter suitability in order to combine the suitability of multiple predictors in a calibrated fashion. The “variable suitability”

Figure 4. The trapezoids from each subregion, as fit by the genetic algorithm. The five subregions are represented by the corresponding NEXRAD site: northeast Colorado (FTG), north Alabama (HTX), central Oklahoma (TLX), southwest Oklahoma (FDR), and northwest Texas (LBB).


Figure 5. A sample forecast from the Alabama regional forecast team, issued on the morning of 20 May.

as a function of the variable is defined by a trapezoid function, for example, as shown in Figure 2. Figure 2 shows CAPE suitability versus CAPE. In this idealized example, subregion median CAPE below 500 J/kg or above 3000 J/kg is assigned a suitability value of 0 because too little CAPE likely produces no storms and too much CAPE likely produces storms that are too vigorous or too numerous to safely sample. In this example, CAPE between 1000 J/kg and 2000 J/kg is ideal and assigned a value of 1. On the sloped parts of the trapezoid, CAPE suitability varies linearly with CAPE. This trapezoid is defined by four parameters: its four vertices. Each of the four predictors has a corresponding trapezoid function, giving a total of 16 tunable parameters for each of the five subregions. The shape of each trapezoid for each predictor, defined by four parameters, conveys information about the relationship between the predictor and the probability of good conditions. Together, the shape of all four trapezoids for the four predictors yields a probabilistic forecast of good conditions.
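The trapezoidal suitability function can be written directly from its four vertices. The sketch below uses the idealized CAPE vertices of Figure 2 (500, 1000, 2000, 3000 J/kg), not the fitted parameters, which differ by subregion.

```python
def trapezoid_suitability(x, v1, v2, v3, v4):
    """Fuzzy-logic trapezoid: 0 outside [v1, v4], 1 on [v2, v3],
    varying linearly on the sloped sides. Vertices satisfy v1<=v2<=v3<=v4."""
    if x <= v1 or x >= v4:
        return 0.0
    if v2 <= x <= v3:
        return 1.0
    if x < v2:                        # rising edge
        return (x - v1) / (v2 - v1)
    return (v4 - x) / (v4 - v3)       # falling edge

# Idealized CAPE trapezoid from Figure 2 (J/kg)
def cape_quality(cape):
    return trapezoid_suitability(cape, 500.0, 1000.0, 2000.0, 3000.0)
```

With these vertices, 1500 J/kg of CAPE scores 1.0, 400 J/kg scores 0.0, and 750 J/kg falls halfway up the rising edge at 0.5.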

[13] The conversion from sets of “suitabilities” of predictors into a forecast probability employs a logistic regression. Six predictors (CAPE suitability, BRN suitability, midlevel RH suitability, low-level vertical velocity suitability, CAPE suitability × BRN suitability, and the product of all four suitability values), each with values between 0 and 1, are input into the logistic regression. These six predictors correspond to six factors believed to be useful predictors of the desired conditions; the relative importance of each predictor is determined by the logistic regression, which outputs regression coefficients corresponding to each predictor. The value of these six predictors aggregated using this tuned set of six regression coefficients produces a value between 0 and 1 corresponding to the probability of suitable conditions in a given hour. Figure 3 offers a visual demonstration of the progression from variable suitability to forecast probability.
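The six-predictor combination can be sketched as follows. The coefficient values here are illustrative placeholders (the fitted coefficients are not reported in the text), and a logistic link is assumed to keep the output in [0, 1].

```python
import math

def forecast_probability(p_cape, p_brn, p_rh, p_w, beta):
    """Combine four suitability values into a probability of good conditions.

    Six predictors feed the regression: the four suitabilities, the
    CAPE x BRN interaction, and the product of all four. `beta` holds an
    intercept beta_0 followed by six coefficients beta_1..beta_6
    (illustrative values only). A logistic link maps the linear
    combination onto [0, 1].
    """
    predictors = [p_cape, p_brn, p_rh, p_w,
                  p_cape * p_brn,                    # interaction term
                  p_cape * p_brn * p_rh * p_w]       # product of all four
    z = beta[0] + sum(b * p for b, p in zip(beta[1:], predictors))
    return 1.0 / (1.0 + math.exp(-z))
```

With all four suitabilities at 1 and positive coefficients, the forecast probability approaches its maximum; with all suitabilities at 0, only the intercept remains.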

[14] The forecasting system now has a total of 22 tunable parameters: the 16 vertices of the trapezoids corresponding to the 4 predictors and the 6 regression coefficients used to generate the final forecast probability. The tuning of the 16 parameters defining the trapezoids and the tuning of the six logistic regression coefficients occur in parallel using a genetic algorithm, an iterative nonlinear optimization tool from the field of artificial intelligence [Haupt and Haupt, 2004]. The genetic algorithm settings are displayed in Table 2. The genetic algorithm searches for a length-16 vector corresponding to the 16 trapezoid parameters such that the trapezoids yield the best possible forecast. The genetic algorithm initializes with 80 population members, each of which is a random guess at the length-16 vector of trapezoid parameters drawn from a uniform distribution over a reasonable range of values. Each of these 80 random population members is ranked and scored based on its forecast skill on the training data. A second “generation” of 80 population members is drawn by overweighting the most skilled population members from the first generation. The algorithm continues iteratively until converging on a solution of 16 trapezoid parameters or until a maximum number of generations (100) is reached.

[15] The genetic algorithm solves for the 16 trapezoid parameters such that the fit of the 6 logistic regression coefficients minimizes the Brier score [Brier, 1950] of the set of forecasts on the hourly training data. Because the Brier score represents the mean squared error of a probabilistic forecast, a more accurate forecasting system will have a lower Brier score. Ten instances of the genetic algorithm are run for each subregion as a genetic algorithm ensemble. The median of each parameter from the genetic algorithm ensemble is used in the forecasting system for that subregion.
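The Brier score used as the fitness target, and the Brier skill score quoted in the abstract, are both simple to compute. A minimal sketch, assuming the skill reference is the climatological Brier score:

```python
def brier_score(forecasts, outcomes):
    """Mean squared error between forecast probabilities (0..1) and
    binary outcomes (0 or 1); lower is better [Brier, 1950]."""
    if len(forecasts) != len(outcomes):
        raise ValueError("forecasts and outcomes must align")
    return sum((f - o) ** 2
               for f, o in zip(forecasts, outcomes)) / len(forecasts)

def brier_skill_score(bs, bs_reference):
    """Fractional improvement over a reference (e.g., climatological)
    Brier score; 0 means no skill over the reference, 1 is perfect."""
    return 1.0 - bs / bs_reference
```

For instance, taking the Colorado system row of Table 3 with the uncertainty term as the climatological reference, 1 - 0.139/0.214 gives a Brier skill score of about 35%, within the 16% to 45% range quoted in the abstract.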

[16] Figure 4 shows the trapezoids generated for each predictor in each subregion. The forecasting system prefers moderate values of MLCAPE. The BRN values are consistent with the range for supercell thunderstorms (10–40)

Figure 6. A diagram showing the reliability of the automated forecasting system during the DC3 campaign. In this diagram, binned forecast probability is plotted on the abscissa while the corresponding realized probability on all days in that bin is plotted on the ordinate. A more reliable forecasting system will more closely follow the x = y diagonal than a less reliable forecasting system. The sizes of the points are area-weighted by the number of forecasts in each bin. A forecasting system with more resolution will have more weight in the extreme bins (forecasts closer to 0% or 100%) than a forecasting system with less resolution.


Figure 7. A diagram showing the reliability of the human forecasters during the DC3 campaign. This figure is the same as Figure 6 but showing the forecasts from the humans rather than the automated system.

given by Weisman and Klemp [1982, 1986]. The system prefers high values of midlevel relative humidity, consistent with the K-index criteria for air mass thunderstorms [Reap and Foster, 1979]. In the two subregions (KFTG and KLBB) where 700 mbar vertical velocity was used instead of 850 mbar vertical velocity to account for differences in elevation, the system allows higher values of vertical velocity. In the other three regions, the forecast system seeks to avoid more violent updrafts.

[17] Experience has shown that in mesoscale-forced situations such as those relevant to the DC3 campaign, the mission suitability of a day can change rapidly, on time scales of minutes to hours. In order to represent this rapid evolution properly, the forecasting system generates forecast probabilities of good conditions for each hour of each afternoon. In order to match the DC3 decision cycle, these sets of hour-by-hour forecast probabilities need to be converted to the probability of suitable conditions occurring at any time during the day.

[18] A second logistic regression is used to convert a set of 10 hourly probabilities from 15 Z through 00 Z inclusive into a single daily probability. The hourly probabilities offer an upper bound and lower bound on the daily probability. If all forecast hours were independent, the daily upper bound probability of good conditions would be

fxUB = 1 – ∏(i=1 to 10) (1 – Pi),   (1)

where Pi is the hourly forecast probability for hour i. The daily lower bound probability can be no lower than the highest hourly probability,

fxLB = max(Pi).   (2)

[19] The logistic regression uses fxLB and fxUB as two predictors for each day, yielding a daily forecast probability. This allows the actual degree of serial correlation in the hourly probabilities from a single afternoon to be accounted for using the historical data. The Oklahoma region daily probability is defined as the maximum of the three daily probabilities for the three subregions,

POK = max(Ps),   (3)

where s = 1, 2, 3 denote the three subregions. This definition is based on the in-flight mobility of the aircraft and the statement from DC3 principal investigators that a successful flight to any of those subregions is equally acceptable to meet experiment objectives for the larger Oklahoma region. This calibrated conversion from hourly forecasts to daily forecasts is possible with the forecasting system because a record is available of what the system would have predicted in the past. For human forecasts, no such historical record is available for most field experiments, so calibrating hourly forecasts to daily forecasts is difficult.
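Equations (1) through (3) translate directly into code. Only the bound and maximum computations are sketched here; the fitted daily logistic regression that sits between them is not reported in the text.

```python
def daily_upper_bound(hourly_probs):
    """Eq. (1): daily probability if the hourly forecasts were independent,
    fxUB = 1 - product over i of (1 - P_i)."""
    prod = 1.0
    for p in hourly_probs:
        prod *= (1.0 - p)
    return 1.0 - prod

def daily_lower_bound(hourly_probs):
    """Eq. (2): the day can be no less likely than its best hour,
    fxLB = max(P_i)."""
    return max(hourly_probs)

def oklahoma_daily_probability(subregion_daily_probs):
    """Eq. (3): the OK/TX region takes the maximum over its three
    subregion daily probabilities."""
    return max(subregion_daily_probs)
```

For ten hours each forecast at 10%, the upper bound is 1 - 0.9^10, or about 65%, while the lower bound stays at 10%; the fitted regression places the daily probability between these two extremes according to the historical serial correlation.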

[20] Because only approximately 300 days of concurrent model and radar data were available in each region, the forecasting system used all available training data, leaving no independent test data.

Table 3. Results of a Murphy Decomposition of Brier Scores for Each Regional Forecast Team and the Automated Forecasting System in All Three Regions^a

                      Reliability   Resolution   Uncertainty   Brier Score
System   C Oklahoma      0.024        0.122        0.349         0.251
         SW Oklahoma     0.061        0.109        0.450         0.402
         NW Texas        0.019        0.065        0.420         0.374
         Colorado        0.024        0.099        0.214         0.139
         Alabama         0.065        0.221        0.394         0.238
Humans   OK/TX           0.237        0.297        0.498         0.439
         CO              0.108        0.079        0.219         0.248
         AL              0.052        0.260        0.401         0.193

^a Low values of "reliability" indicate a more reliable forecasting system, while high values of "resolution" indicate a forecasting system with better resolution. "Uncertainty" is a measure of sample climatology: higher values of uncertainty indicate climatology closer to 50%. (The slight differences between the uncertainty values in the AL and CO regions between the automated forecasting system and the human forecasters are due to a few days where forecasts were available from one system but not the other.) The Brier score is Reliability − Resolution + Uncertainty. Low values of Brier score indicate greater skill.



HANLON ET AL.: PROBABILISTIC FORECASTING USING A GA

Figure 8. A diagram showing the reliability of the Oklahoma human forecasters during the DC3 campaign. This figure is the same as Figures 6 and 7 but shows only the forecasts from the Oklahoma human forecasters.

The lack of independent test data increases the risk that the forecasting system is overly optimistic due to overfitting [Witten and Frank, 2005]. With the number of good days in each region on the order of tens, we conjectured that the cost of not using all training data outweighed the benefit of having independent test data. The performance of the forecasting system during the DC3 campaign serves as independent test data for the forecasting system. For future forecasting problems with more data available, the forecasting system can be cross-validated on testing data prior to the operational implementation of the system, reducing the risk of a model overfit to training data.
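For settings with more data, out-of-sample skill can be estimated before deployment with a k-fold scheme. This is a generic sketch, not the DC3 system: `fit` and `predict` are placeholders standing in for the genetic algorithm tuning and the probability generation described above.

```python
import random

def kfold_brier(predictors, outcomes, fit, predict, k=5, seed=0):
    """Estimate the out-of-sample Brier score by k-fold
    cross-validation.  `fit(train_x, train_y)` returns a model and
    `predict(model, x)` returns a probability; both are hypothetical
    hooks for the tuning and forecasting steps described in the text."""
    idx = list(range(len(predictors)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    total = 0.0
    for fold in folds:
        held_out = set(fold)
        # Fit only on days outside the held-out fold.
        train = [i for i in idx if i not in held_out]
        model = fit([predictors[i] for i in train],
                    [outcomes[i] for i in train])
        # Score the held-out days the model never saw.
        for i in fold:
            total += (predict(model, predictors[i]) - outcomes[i]) ** 2
    return total / len(predictors)
```

A cross-validated Brier score near the training-set score would suggest the tuned system is not badly overfit.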

4. Results

[21] On each day of the DC3 field campaign, a morning weather briefing occurred at 0830 CDT (1330 UTC) to discuss the expected weather conditions for the next several days. Regional forecast teams composed of human forecasters with significant expertise in each of the three regions issued a probabilistic forecast of thunderstorms in their region in 3 h time increments and 20% probability increments for the upcoming 2 days. The probabilistic forecast was presented as a percent chance of thunderstorms and the most probable storm mode: isolated, scattered, supercell, squall line, or mesoscale convective system. An example of a regional forecast is shown in Figure 5.

[22] At the same time, model output from the NCAR WRF model, available to all forecasting teams, was used by the automated forecasting system to generate a probabilistic forecast for each of the three regions. Hourly probabilistic forecasts were aggregated into a probability of suitable thunderstorms during the afternoon for each of the three research regions and displayed, along with a recommendation of which region to sample, if any. At each daily weather briefing a representative of the decision recommendation team provided the probability of suitable conditions in each subregion for the current day and the accompanying decision recommendation. The probabilistic forecasts from the automated forecasting system presented during these briefings were rounded to the nearest 5%; for a fairer comparison to the human forecasters, we have rounded these forecasts to the nearest 20% when evaluating their skill.

[23] The automated forecasting system showed skill over climatology in each of the three regions: the forecasts it issued showed a 45% Brier skill score improvement over climatology in Colorado, a 36% improvement in Alabama, and a 16% improvement in Oklahoma.
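The skill measure quoted here follows directly from its definition [Brier, 1950]; a minimal sketch:

```python
def brier_score(forecasts, outcomes):
    """Mean squared error of probabilistic forecasts against binary
    outcomes (1 = suitable conditions occurred, 0 = they did not)."""
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)

def brier_skill_score(forecasts, outcomes):
    """Skill relative to a constant climatological forecast:
    BSS = 1 - BS / BS_climatology.  Positive values beat climatology."""
    clim = sum(outcomes) / len(outcomes)
    bs = brier_score(forecasts, outcomes)
    bs_clim = brier_score([clim] * len(outcomes), outcomes)
    return 1.0 - bs / bs_clim
```

Because the Brier score is negatively oriented (lower is better), the skill score is expressed as the fractional reduction in error relative to climatology.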

5. Comparison With Human Forecast Teams

[24] Evaluating the skill of the automated forecasting system against the regional forecast teams was challenging, because the forecast teams produced forecasts in 3 h increments, while the automated forecasting system produced forecasts in hourly increments aggregated to a daily forecast, as required by the decision makers. Another challenge is the ambiguity of the precise meaning of the forecast probabilities from the forecast teams. A 3 h increment forecast probability of 40% at 1800 UTC could be interpreted several ways: a 40% probability of thunderstorms at any time between 1800 and 2100 UTC, at any time between 1730 and 1830 UTC, at any time between 1630 and 1930 UTC, exactly at 1800 UTC, or at any time that afternoon if the 1800 UTC conditions were to hold all afternoon. Furthermore, the forecasters for different regions, and even different forecasters for the same region, may have interpreted the forecast probabilities differently. By comparison, the forecasting system offers a precise, unambiguous definition of its forecast probabilities for the day, although in the aggregation process, information to guide the decision for the optimal flight time was lost.

[25] As a first attempt at comparison, the forecasts from the regional forecast teams were linearly interpolated to hourly forecasts. For example, as shown in Figure 5, the forecast probability of thunderstorms at 1500 UTC is 20% and the forecast probability at 1800 UTC is 40%. This forecast was interpolated to a 1600 UTC forecast probability of 27% and a 1700 UTC forecast probability of 33%. Based on the line graph presentation of the forecast probabilities in Figure 5, linear interpolation is a logical interpretation; the method of forecast presentation will influence the thought process of the decision maker. These hourly forecast probabilities were then verified against hourly radar data and compared to an hourly forecast consisting of the climatological hourly probability of suitable conditions. Using this method of comparison, the regional forecast teams performed worse than a climatological forecast, forecasting thunderstorms to occur much more often than climatology. We concluded that this method of aggregating the human forecast teams' probabilities poorly represented what the forecast teams meant.
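The interpolation step can be sketched as follows, using the values from the Figure 5 example:

```python
def interpolate_hourly(three_hourly):
    """Linearly interpolate a {hour: probability} forecast issued in
    3 h increments onto every hour between the first and last key."""
    hours = sorted(three_hourly)
    hourly = {}
    for lo, hi in zip(hours, hours[1:]):
        p_lo, p_hi = three_hourly[lo], three_hourly[hi]
        for h in range(lo, hi):
            # Fractional position of hour h between the two anchors.
            frac = (h - lo) / (hi - lo)
            hourly[h] = p_lo + frac * (p_hi - p_lo)
    # The final anchor hour keeps its issued value.
    hourly[hours[-1]] = three_hourly[hours[-1]]
    return hourly
```

For the 20% at 1500 UTC and 40% at 1800 UTC example, this yields roughly 27% at 1600 UTC and 33% at 1700 UTC, matching the values quoted above.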

[26] An alternate method of assessing the regional forecast teams' skill was to aggregate their interpolated hourly forecast probabilities into a daily forecast probability, using the same logistic regression coefficients that were used to aggregate the forecasting system's hourly forecasts into daily forecasts. This method of assessment showed a similar


result: worse performance than climatology for the regional forecast teams.

[27] A final method of verifying the skill of the regional forecast teams was to use the maximum hourly probability predicted by the regional forecast teams during the 1500 UTC to 0000 UTC period as the forecast probability of suitable thunderstorms occurring at some time during that period. Because using the maximum of the three-hour probabilities as an estimate of the probability of their union would underestimate the daily forecast probability if the subprobabilities were well calibrated, this method adjusts for the tendency of the regional forecast teams' hourly forecast probabilities to be too high. Using this method, the regional forecast teams demonstrated skill on average, with Brier skill scores showing an 8% skill reduction from climatology in Colorado, a 52% improvement over climatology in Alabama, and a 28% improvement over climatology in Oklahoma.
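A quick numeric check of why the maximum underestimates the union probability when the subperiod probabilities are well calibrated and independent:

```python
import math

def union_probability(probs):
    """Probability of at least one occurrence across the periods,
    assuming the periods are independent."""
    return 1.0 - math.prod(1.0 - p for p in probs)

# Three well-calibrated 3 h probabilities of 40% each:
three_hourly = [0.4, 0.4, 0.4]
# max() gives 0.40, while the independent union is 1 - 0.6**3 = 0.784,
# so taking the maximum is a deliberately conservative daily estimate.
```

This conservatism is what offsets the regional forecast teams' tendency toward probabilities that are too high.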

[28] After adjusting for the bias in their forecasts, the human forecasters performed better than climatology. Averaged over all regions, the automated forecasting system (BSS = 32%) showed a small advantage over the human regional forecasters (BSS = 24%). Figure 6 shows the reliability diagram for the automated forecasting system, aggregated over all regions, while Figure 7 shows the reliability diagram for the human forecasters, aggregated over all regions. The human forecasters issue more forecasts for probabilities above the 0% probability bin but are less reliable on these forecasts than the forecasting system.

6. Discussion

6.1. Difficulty in Quantifying the Definition of Good

[29] The forecasting system requires a specific definition of good conditions. The process of interpreting a quantitative definition from the DC3 operations plan required several iterations.

[30] Our first definition of suitable conditions attempted to include deep convection while excluding multicell or supercell thunderstorms that would be more difficult to sample safely with the aircraft. Feedback following a conference with DC3 principal investigators suggested that our upper bound on the size of convection was too restrictive: they were willing to fly near larger and more severe convection than our definition allowed. Supercell thunderstorms, which investigators considered to be ideal, were being excluded by our definition, resulting in systematically low probabilities of suitable conditions.

[31] A second definition was crafted to allow for larger isolated and supercell thunderstorms while still excluding mesoscale convective systems, which were deemed too complicated to allow reliable attribution of sampled outflow to a particular portion of the inflow boundary layer. During the instrument testing phase before the DC3 campaign, probabilities tuned on this definition were communicated to DC3 investigators. Investigators indicated that this second definition was not restrictive enough on the lower end of convective intensity. This definition was too generous in defining as suitable shallow, "popcorn" convection, which was not a viable target for the DC3 campaign because of its lack of upper tropospheric outflow.

[32] To eliminate shallow convection, the final definition introduced a minimum horizontal area coverage of 50 dBZ reflectivity. In the Colorado and Alabama regions, a thunderstorm needed to have at least 20 km² of contiguous 50 dBZ reflectivity to be defined as deep convection. In the Oklahoma region, this minimum horizontal area was 40 km². This minimum area was used as a proxy for thunderstorm depth, preventing unsuitable shallow convection from being considered suitable by the forecasting system. This third definition was accepted by the DC3 investigators, who found the probabilities tuned on this definition appropriate based on their knowledge of atmospheric conditions.
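A contiguous-area criterion of this kind can be checked on a gridded reflectivity field with a simple connected-component search. This is a sketch, not the campaign's verification code; the 2 km x 2 km cell size is an assumed grid spacing, not a value from the paper.

```python
from collections import deque

def max_contiguous_area(reflectivity, threshold=50.0, cell_area_km2=4.0):
    """Largest 4-connected area (km^2) at or above `threshold` dBZ in
    a 2-D reflectivity grid.  cell_area_km2 assumes a hypothetical
    2 km x 2 km grid cell."""
    rows, cols = len(reflectivity), len(reflectivity[0])
    seen = [[False] * cols for _ in range(rows)]
    best = 0
    for r in range(rows):
        for c in range(cols):
            if seen[r][c] or reflectivity[r][c] < threshold:
                continue
            # Breadth-first flood fill over one contiguous echo region.
            size, queue = 0, deque([(r, c)])
            seen[r][c] = True
            while queue:
                i, j = queue.popleft()
                size += 1
                for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ni, nj = i + di, j + dj
                    if (0 <= ni < rows and 0 <= nj < cols
                            and not seen[ni][nj]
                            and reflectivity[ni][nj] >= threshold):
                        seen[ni][nj] = True
                        queue.append((ni, nj))
            best = max(best, size)
    return best * cell_area_km2
```

A storm would then pass the Colorado/Alabama criterion when `max_contiguous_area(grid) >= 20.0`.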

[33] Effective forecasting and decision making are hampered without a clear definition of the desired environmental conditions. The problems with our initial definitions were an example of the difficulties of cross-disciplinary communication: the chemists and meteorologists in the field had a clear vision of what they wanted, but we struggled to correctly interpret their terminology. It is imperative that the team developing an automated forecasting system interact with the experiment principal investigators to ensure that the definition meets the experimental requirements. Moreover, care should be given in the design of the system to allow a fast reset of the forecasting and decision making system should the investigation team decide to modify their definition based on conditions encountered during the deployment. Having agreement on such a quantitative definition of suitable conditions from principal investigators before the experiment will not only help inform the development of an algorithmic decision system but also provide the investigator team access to a statistical analysis of the events they seek to study.

6.2. Murphy Decomposition of Forecasting Systems

[34] From a forecasting perspective, the relative strengths and weaknesses of the forecasting methods can be shown by a Murphy decomposition [Murphy, 1973, Table 2] of each forecasting method's Brier score. The Murphy decomposition segments the Brier score into three components: reliability, resolution, and uncertainty. "Reliability" is a measure of the calibration of the forecast system. For a perfectly reliable forecast, the observed frequency equals the forecast probability for each bin: if 20% is forecast 10 times, two of those instances will verify for a reliable forecast system. "Resolution" is a measure of the forecast system's ability to distinguish good days and bad days from climatological average days; a forecast of climatology every day would be perfectly reliable but offer no resolution. "Uncertainty" is a measure of sample climatology and is highest when the climatological probability of an event is 50%. The humans forecasting in the Colorado and Alabama regions rotated during the field season, while the Oklahoma region employed the same forecasters throughout. The decompositions are shown in Table 3. In the Colorado region, the resolution of the human forecast team is similar to that of the automated forecasting system, but the automated forecasting system is more reliable and thus scores better. In the Alabama region, the human forecasters were roughly as reliable as the automated forecasting system, but their forecasts showed greater resolution, leading to a better overall skill score.
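The decomposition can be computed directly from binned forecasts; a minimal sketch:

```python
def murphy_decomposition(forecasts, outcomes):
    """Split a Brier score into reliability, resolution, and
    uncertainty (Murphy, 1973), where BS = REL - RES + UNC.
    Forecasts sharing the same probability value form a bin."""
    n = len(forecasts)
    base = sum(outcomes) / n  # sample climatology
    bins = {}
    for f, o in zip(forecasts, outcomes):
        bins.setdefault(f, []).append(o)
    # Reliability: squared gap between each bin's forecast value and
    # its observed frequency (lower is better calibrated).
    reliability = sum(len(os) * (f - sum(os) / len(os)) ** 2
                      for f, os in bins.items()) / n
    # Resolution: how far the bins' observed frequencies sit from
    # climatology (higher means sharper discrimination).
    resolution = sum(len(os) * (sum(os) / len(os) - base) ** 2
                     for os in bins.values()) / n
    uncertainty = base * (1.0 - base)
    return reliability, resolution, uncertainty
```

Note the orientation: a forecaster can have poor reliability yet strong resolution, which is exactly the pattern discussed below for the Oklahoma human forecasts.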

[35] The results from the Oklahoma human forecasts offer a particularly striking example of the difference between Brier resolution and Brier reliability. Figure 8 shows the reliability diagram for the Oklahoma forecast team. The


human forecasts are unreliable: for example, 8 of 10 days when the 20% category was forecast by the Oklahoma forecasters verified as good. However, the diagram suggests that the Oklahoma forecasters were quite skillful at distinguishing good days from "bad" days. The 0% category was forecast by the Oklahoma forecasters 11 times; of these 11 days, none were good. Some other category was forecast 25 times; of these 25 days, 19 were good. This indicates that the experts in Oklahoma are excellent forecasters, as one might expect; however, the probabilities they submitted are systematically too low. The degree of miscalibration shown by the Oklahoma human forecast team is unlikely to occur in an automated system anchored to historical data.

6.3. Ambiguities in Human Probabilistic Forecasting

[36] Every day, field campaigns require making expensive decisions under probabilistic information, a process that demands exactitude in the specification of forecasts and the definition of suitable conditions. This precision in probabilistic forecasting is inherent to an automated forecasting system but is difficult for human forecasts to achieve because of the number of potential ambiguities associated with a human forecast.

[37] The interpretation of the regional forecast teams' probabilistic forecasts of convection presents a source of possible ambiguity. Each morning during DC3, the regional forecast teams issued probabilistic forecasts for each three-hour period for the next two days, as shown in Figure 5. In this format, the presentation of a forecast probability at, for example, 18 Z allows for several plausible interpretations, including the probability of conditions being present at any time between 1800 UTC and 2100 UTC, any time between 1630 UTC and 1930 UTC, any time between 1730 UTC and 1830 UTC, or exactly at 1800 UTC. Discussions with operational forecasters, both involved with and independent of DC3, indicated that there is no clear standard for interpreting the "valid time" of such a probabilistic forecast. The evaluation of the forecasters' skill is sensitive to the interpretation of the forecasts' time window. An automated forecasting system, however, is developed such that only a decision-relevant probabilistic forecast is issued, removing any ambiguity in the interpretation of the forecast time window.

[38] The probabilistic forecasts were issued by the regional forecast teams in 3 h increments rather than daily increments. This increased time resolution provides important guidance for field campaign investigators but obscures the decision problem. Because only one flight to one region can be undertaken per day, the DC3 decision cycle requires that flight decisions be made on daily intervals rather than three-hour intervals. Calculating a daily forecast probability from a set of subprobabilities contained in the same time period requires a covariance matrix of the subprobabilities. Facing the same issue, rather than calculate a covariance matrix of hourly probabilities from historical data, the automated forecasting system used a logistic regression anchored to historical forecast and verification data to convert subdaily probability forecasts to daily probability forecasts. No such record of historical forecast data exists for the human forecasters, making the aggregation of the subdaily probability forecasts into a decision-relevant daily forecast a difficult task.

[39] Another possible source of ambiguity in the human forecasts issued during the DC3 campaign arises from the constraint that the human forecasts were required to be issued in increments of 20%. For the evaluation of the skill of the automated forecasting system, forecasts were sorted into the same bins as the human forecasts. Automated forecasts below 10% were placed in the "0%" bin, while automated forecasts between 10% and 30% were placed in the "20%" bin. For this reason, the "0%" bin is evaluated as a forecast probability of 5%, the halfway point between the bounds of the bin, 0% and 10%. However, anecdotal evidence suggests that human forecasts are perhaps required to conform to the expectation that a 0% forecast corresponds to a probability of exactly zero, rather than a probability of approximately zero, while an automated forecast of 0% could be interpreted as "approximately 0%." If present, this effect suggests that human forecasts of 0% should be scored as "exactly 0%," not "in the 0%-10% bin," and some of the human forecasts of 20% should be scored as something less than 20%. It is cumbersome, if not impossible, to pick out which human forecasts are subject to this effect. We argue that the overall scores would be mostly unchanged: as shown in Figure 7, human forecasts in the 0% bin would improve under this change, but human forecasts in the 20% bin, which were already too high, would worsen under this change.
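The binning rule described above amounts to the following sketch of our reading of it; the bin edges at odd multiples of 10% follow from the 20% reporting increments, and scoring interior bins at their stated value is an assumption (for interior bins the stated value coincides with the bin midpoint anyway).

```python
def to_reporting_bin(p):
    """Sort an automated forecast probability into the human 20%
    reporting bins: [0, 0.1) -> 0%, [0.1, 0.3) -> 20%, ...,
    [0.9, 1.0] -> 100%."""
    for edge, bin_value in ((0.1, 0.0), (0.3, 0.2), (0.5, 0.4),
                            (0.7, 0.6), (0.9, 0.8)):
        if p < edge:
            return bin_value
    return 1.0

def scored_probability(bin_value):
    """Evaluate the 0% bin at its midpoint of 5%, as in the text;
    interior bins keep their stated value (assumed here)."""
    return 0.05 if bin_value == 0.0 else bin_value
```

Under this scheme an automated 4% forecast and a human "0%" forecast are scored identically, which is exactly the equivalence the paragraph above questions for human forecasters.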

[40] Ambiguity in the human probabilistic forecasts complicates the comparison of the human forecasters to the automated forecasting system by preventing a one-to-one comparison of the human forecasts to the automated forecasts. While this ambiguity is inconvenient for us when evaluating the performance of our forecasting system, it could also be inconvenient to decision makers relying on these forecasts. The nature of the automated forecasting system allows for these ambiguities to be resolved and translated into a probabilistic forecast on a decision-relevant time scale using quantitative decision-relevant criteria.

7. Conclusion

[41] The particular forecasting problem faced by DC3 investigators was many dimensional. A forecast of suitable data collection conditions needed to resolve, as a function of time throughout an afternoon, the location of a storm with respect to the spatial coverage of the ground-based facilities, the size of a storm, the presence of nearby convection, and the time duration of suitable conditions. Years of experience allow human forecasters to provide tremendous forecasting insight that may never be possible for an automated forecasting system, but an automated system may be better suited for the many dimensional forecasting problems faced by DC3 and other atmospheric field campaigns. In most regions, using reasonable interpretations of the human forecasts, the human forecasters showed Brier resolution better than that of the automated forecasting scheme. However, the skill advantage from the human forecasters' better Brier resolution was offset by the skill advantage from the automated forecasting system's better Brier reliability. The authors suggest that due to the numerous possible ambiguities


at various stages of the probabilistic forecasting process, relatively poor Brier reliability is likely to be a consistent problem for human forecasters in field campaign decision making applications.

[42] While the automated forecasting system implemented for the DC3 campaign showed skill in all regions comparable to the skill shown by teams of expert human forecasters, for most atmospheric field campaigns an automated forecasting system cannot replace forecasters. In the case of DC3, the automated forecasting system offers no information on spatial scales below the region level, nor can such a system provide the real-time forecasting support needed for flight decisions. Human forecasters offer the ability to anticipate a rapidly unfolding weather scenario and react to atmospheric conditions that can change on time scales on the order of minutes. A skilled human forecaster can draw on experience and physical intuition to forecast the possible timeline of events in a way that our automated forecasting system is unable to emulate.

[43] We argue that the genetic algorithm forecasting method used to forecast isolated thunderstorms for the DC3 campaign is applicable to other forecasting problems faced by investigators conducting atmospheric field research. For forecasting problems other than isolated thunderstorms, this forecasting system can be adapted by changing the relevant predictors to match the phenomenon of interest. To mitigate the risk of overfitting the genetic algorithm to training data, care should be taken to choose predictors that are physically related to the phenomenon of interest. Provided that model data and corresponding verification data are available, this forecasting system could be applicable to a variety of atmospheric applications. Future work will explore the sensitivity of this forecasting system to the numerical weather prediction model used to forecast predictors.

[44] The advantage of an automated forecasting system is that it will produce unambiguous, consistent, calibrated probabilities for the desired events. The availability of these forecasts will allow the human forecasters to focus all their attention on forecast situations where they thrive, taking advantage of the detailed physics and situational knowledge not available to the automated system. An automated forecasting system issuing accurate forecasts on a day and region scale allows for the use of an optimizing decision recommendation system, which has shown promise in optimizing field campaign resource deployment. We suggest that automating the day- and region-scale part of the forecast and decision process for field campaigns will allow human forecasters to focus on other crucial forecast challenges that this system cannot handle while maximizing field campaign data yield.

[45] Acknowledgments. The authors acknowledge financial support from National Science Foundation Atmospheric and Geospace Science grants AGS-1063692 and AGS-1063733. We are grateful to DC3 principal investigators Mary Barth, William Brune, Chris Cantrell, and Steven Rutledge for allowing us to apply our methodology to their field campaign. We would like to especially acknowledge Morris Weisman for making model data available to us and offering invaluable comments on our manuscript.

References

Barth, M., W. Brune, C. Cantrell, and S. Rutledge (2012), Deep Convective Clouds and Chemistry (DC3) operations plan. Available at http://www.eol.ucar.edu/projects/dc3/documents/DC3_Operations_Plan_4_Apr_2012.pdf.

Brier, G. W. (1950), Verification of forecasts expressed in terms of probability, Mon. Wea. Rev., 78, 1–3.

Doswell III, C. A. (1987), The distinction between large-scale and mesoscale contribution to severe convection: A case study example, Wea. Forecasting, 2, 3–16.

Fawbush, E. J., and R. C. Miller (1953), Forecasting tornadoes, United States Air Force Air University Quarterly Review, 1, 107–118.

Hanlon, C. J., J. B. Stefik, A. A. Small, J. Verlinde, and G. S. Young (2013), Statistical decision analysis for flight decision support: The SPartICus campaign, J. Geophys. Res. Atmos., 118, 4679–4688, doi:10.1002/jgrd.50237.

Haupt, R. L., and S. E. Haupt (2004), Practical Genetic Algorithms, 272 pp., John Wiley, Hoboken, N. J.

Johns, R. H., and C. A. Doswell III (1992), Severe local storms forecasting, Wea. Forecasting, 7, 588–612.

Murphy, A. H. (1973), A new vector partition of the probability score, J. Appl. Meteorol., 12, 595–600.

Reap, R. M., and D. S. Foster (1979), Automated 12–36 hour probability forecasts of thunderstorms and severe local storms, J. Appl. Meteor., 18, 1304–1315.

Small, A. A., J. B. Stefik, J. Verlinde, and N. C. Johnson (2011), The cloud hunter's problem: An automated decision algorithm to improve the productivity of scientific data collection in stochastic environments, Mon. Wea. Rev., 139, 2276–2289, doi:10.1175/2010MWR3576.1.

Weisman, M. L., and J. B. Klemp (1982), The dependence of numerically simulated convective storms on vertical wind shear and buoyancy, Mon. Wea. Rev., 110, 504–520.

Weisman, M. L., and J. B. Klemp (1986), Characteristics of isolated convective storms, in Mesoscale Meteorology and Forecasting, edited by P. S. Ray, pp. 353–354, Amer. Meteor. Soc., Boston.

Weisman, M. L., C. Davis, K. W. Manning, and J. B. Klemp (2008), Experiences with 0–36-h explicit convective forecasts with the WRF-ARW model, Wea. Forecasting, 23, 407–437, doi:10.1175/2007WAF2007005.1.

Williams, J. K., C. J. Kessinger, J. Abernethy, and S. Ellis (2008), Fuzzy logic applications, in Artificial Intelligence Methods in the Environmental Sciences, edited by A. Pasini, C. Marzban, and S. E. Haupt, pp. 347–377, Springer, New York, N. Y.

Witten, I. H., and E. Frank (2005), Data Mining: Practical Machine Learning Tools and Techniques, 145, Morgan Kaufmann, San Francisco, CA.


