A Hidden Markov Model for Rainfall Using Breakpoint Data

42 JOURNAL OF CLIMATE VOLUME 11

© 1998 American Meteorological Society


JOHN SANSOM

National Institute of Water and Atmospheric Research, Wellington, New Zealand

(Manuscript received 24 September 1996, in final form 14 February 1997)

ABSTRACT

Pluviographs, which are rainfall accumulation–time plots, indicate a strong tendency for rainfall intensity to abruptly change from one steady rate of fall to another with these steady rates persisting for some time. Digitizing from pluviographs the times of change from one steady rain rate to another yields breakpoint data, that is, a stream of data pairs consisting of the rainfall rate, which includes zero, and the duration of that rate. Breakpoints provide a complete record of rainfall with information on the rain rates and their durations during periods of continuous steady precipitation and on the durations of dry periods.

In a hidden Markov model (HMM), the state of the process at a given time is not known; only the values of the observables, and the range of possible states, are known. For rainfall, there is a hierarchy of states: a precipitation event is either taking place, or not; if one is, then there are episodes when the mechanism is convection (showers) and when it is large-scale uplift (rain); and finally, the current rate of rainfall and its duration will have particular values with periods of zero rate being the dry periods within an episode of a particular mechanism. Thus, there are five states: the time between events when no precipitation is possible, showery times when a shower is taking place, showery times when no shower is taking place, rain times with rain taking place, and dry intervals during a rainy time.

Such a model was initially fitted using the expectation maximization (EM) algorithm, but the parameters were reestimated using HMM fitting procedures, which also provided estimated probabilities of the transition matrix. The Viterbi algorithm was used to classify the individual points in the data stream. The rate and duration distributions’ parameters, the state transition probabilities, and the classification of the data accord with the view that during widespread rain there may be many changes of rain rate but little dry time, while during showers, shorter periods of steady precipitation tend to be interspersed with longer dry periods.

Discrepancies were found between the data and simulations made using the HMM’s estimated parameters. The major discrepancies were that the simulated dwell times within an episode were shorter than in the data and that the simulated number of episodes per event was greater. Merely restricting certain transitions did not increase the dwell times, but there were some indications that it might be necessary to change to a hidden semi-Markov model, to increase the number of states, or both.

1. Introduction

Pluviographs, which are rainfall accumulation–time plots, indicate a strong tendency for rainfall intensity to abruptly change from one steady rate of fall to another with these steady rates persisting for some time. Digitizing from pluviographs the times of change from one steady rain rate to another yields breakpoint data, that is, a stream of data pairs consisting of the rainfall rate, which includes zero, and the duration of that rate. Sansom (1992) and Barring (1992) give full details on the breakpoint representation of rainfall, which is essentially different from the traditional one, in which the accumulated total over some fixed period is noted or, in the case of tipping bucket gauges, the time of accumulation for a fixed amount is noted. Breakpoints provide a complete record of rainfall with information on intensities during periods of continuous precipitation, rather than merely mean rates during periods with generally a mixture of wet and dry times.

Corresponding author address: Dr. John Sansom, National Institute of Water and Atmospheric Research Ltd., P.O. Box 14-901 Kilbirnie, Wellington, New Zealand. E-mail: [email protected]

The data examined in this paper were digitized from the daily pluviographs of a Dines tilting siphon automatic rain gauge sited at Invercargill, New Zealand (46°25′S, 168°20′E) for the 15-yr period January 1972 to December 1986. Sansom (1987) gives details of the digitization scheme while Sansom (1988) has described some of the seasonal and diurnal features of this data. Sansom and Thomson (1992) showed that the breakpoint data could be statically modeled as a mixture of lognormal components, univariate for the dry periods and bivariate for the wet periods. They also proposed a dynamic model, suitable for use with breakpoint data, which was physically realistic since it recognized that more than one mechanism is responsible for rainfall generation, and that the mechanisms operate over variable length periods and at any one place and time only one mechanism can be operating. Further grounding for this model was provided by Sansom (1995a).

JANUARY 1998 SANSOM 43

The proposed model is dependent upon breakpoint data since with such data a more detailed view of precipitation can be taken than that possible when only the common “fixed fall” data is available. This view is laid out within the following definitions.

Event: A period of time during which the atmospheric conditions continuously give rise to a nonzero probability for the occurrence of precipitation. Within an event, dry times do occur, especially if the physical mechanism changes, but not to the extent of the interevent dry breaks when for a considerable period there is no chance of any precipitation.

Episode: A period of time within an event when the physical cause of the precipitation does not change, that is, the type of rain-generating mechanism during this time does not change. Dry breaks may occur within an episode.

Subepisode: Part of an episode in which there are no dry breaks.

Period or duration: Part or all of a subepisode during which the rate of accumulation of precipitation is constant. (These two terms are sometimes omitted as when “wets” or “drys” are referred to instead of wet periods, etc.)

Fixed fall: This is the common format for rain and is the amount accumulated over a period of time which is of a fixed length and is also fixed with respect to the clock, for example, daily data. It should be noted that such periods could be a part of, or encompass all of, any of those periods defined above.

A hierarchy is implied within these definitions with events consisting of episodes and episodes of durations with steady (or zero) rain rates. Any particular observation can be assigned a position within this hierarchy or equivalently it can be assigned to a state. There are five states involved: I, the time between events when no precipitation is possible; Sw, showery times when a shower is taking place; Sd, showery times when no shower is taking place; Rw, rain times with rain taking place; and Rd, dry intervals during a rainy time. A Markov model is a natural choice in such a situation; however, each observation is of the rain rate and its duration and no direct information is available concerning which state the system was in when the observation was made. Thus, the data were fitted to a hidden Markov model (HMM) using procedures from Rabiner (1989), which includes details of the Viterbi algorithm, that is, a method for assigning each observation to a state. More recent work (e.g., Leroux 1992; Bickel and Ritov 1996) has confirmed some of the underlying assumptions of Rabiner (1989), which remains a practical exposition for the fitting of HMMs.

The model allows for the distributions of rates and durations to differ from mechanism to mechanism, or state to state, and these distributions of rate and duration were taken to be lognormal, bivariate for the wet data and univariate for the dry. The fitting procedures are iterative and require initialization for both the transition matrix and the parameters of the state distributions. Rabiner (1989) suggests that uniform probabilities are sufficient to initialize the transition matrix, but values close to the eventual estimates are needed for the distributions’ parameters and these were obtained from applying an extension of the EM (expectation maximization) algorithm suitable for truncated/censored datasets. The EM algorithm was used by Sansom and Thomson (1992) to decompose the breakpoint data into components that could be attributed to different precipitation mechanisms. Doubt over the classification of some short-duration, low-intensity periods as being wet or dry motivated Sansom and Thomson (1997, manuscript submitted to J. Amer. Stat. Assoc., hereafter ST97; also Sansom 1995b) to modify the EM algorithm for situations where doubtful data is dropped.
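The lognormal state distributions described above can be written down directly. The following is a minimal sketch, assuming the usual parameterization in which the logarithm of the variate is normal; the function names and the parameter values are hypothetical and are not estimates from this paper.

```python
import math


def lognormal_pdf(x, mu, sigma):
    """Univariate lognormal density: log(x) ~ Normal(mu, sigma^2).

    Suitable, under the stated assumption, for the duration of a dry period.
    """
    if x <= 0:
        return 0.0
    z = (math.log(x) - mu) / sigma
    return math.exp(-0.5 * z * z) / (x * sigma * math.sqrt(2.0 * math.pi))


def bivariate_lognormal_pdf(rate, duration, mu_r, mu_d, sd_r, sd_d, rho):
    """Joint density of (rate, duration) when their logarithms are bivariate normal.

    rho is the correlation between log rate and log duration.
    """
    if rate <= 0 or duration <= 0:
        return 0.0
    zr = (math.log(rate) - mu_r) / sd_r
    zd = (math.log(duration) - mu_d) / sd_d
    q = (zr * zr - 2.0 * rho * zr * zd + zd * zd) / (1.0 - rho * rho)
    norm = 2.0 * math.pi * sd_r * sd_d * math.sqrt(1.0 - rho * rho)
    # The 1 / (rate * duration) factor is the Jacobian of the log transform.
    return math.exp(-0.5 * q) / (norm * rate * duration)


# Hypothetical parameter values, for illustration only.
dry_density = lognormal_pdf(1.0, mu=0.0, sigma=1.0)
wet_density = bivariate_lognormal_pdf(
    2.0, 0.5, mu_r=0.5, mu_d=-1.0, sd_r=0.8, sd_d=1.0, rho=-0.3
)
```

With rho set to zero the joint density factorizes into the product of the two univariate densities, which is a convenient check on the implementation.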

Section 3 presents the application of the EM algorithm to acquire initial values for the HMM fitting procedures; the results of applying these procedures are given in section 4, which is followed in section 5 by some discussion of the fit and of some simulations. First, however, the next section reviews rainfall models to show that the HMM is more physically based than other models, and gives some consideration to the concept of rainfall rate.

2. Rainfall models

Rainfall models can basically be divided between those that attempt to model the daily rainfall observations directly and those that model rainfall events and use either monthly, daily, or hourly data as verification. The former line goes back as far as Newnham (1916) with reviews by Woolhiser and Roldan (1982), Stern and Coe (1984), and Hutchinson (1995) among others, while the latter line probably started with Le Cam (1961) and has recently been reviewed by Burlando and Rosso (1993).

The occurrence of rainfall events is acknowledged to a certain extent in the modeling of daily rainfalls by first modeling the occurrence of wet days and then modeling the amount of rain on the wet days. To account for the persistence that is seen in the record of wet days, a first-order Markov model was proposed by Gabriel and Neumann (1962) so that the probability that a particular day is wet depends solely on whether the previous day was wet or dry, and the lengths of dry and wet spells are geometrically distributed. This simple model has proved effective and can be extended to account for seasonal variations (Woolhiser and Pegram 1979). When the model has seemed less effective, either the order of the Markov chain has been increased (Dennett et al. 1983), or distributions other than the geometric have been fitted to the lengths of the wet and dry spells (Roldan and Woolhiser 1982). However, parameter estimates for higher-order Markov chains can be unreliable, especially in dry areas. The other method also suffers from poor parameter estimation unless 25 or more years of observations are available. Overall, the first-order two-state Markov model generally fits the data adequately and is simpler than more elaborate models.
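The geometric spell lengths implied by the first-order two-state chain can be illustrated by simulation. The sketch below uses hypothetical transition probabilities (0.6 and 0.3), not values from any of the cited studies, and compares the simulated mean wet spell with the geometric mean 1/(1 − p), where p is the probability of a wet day following a wet day.

```python
import random


def simulate_wet_dry(n_days, p_wet_after_wet, p_wet_after_dry, seed=0):
    """Simulate a first-order two-state (wet/dry) Markov chain of daily states."""
    rng = random.Random(seed)
    wet = False  # arbitrary starting state
    days = []
    for _ in range(n_days):
        p = p_wet_after_wet if wet else p_wet_after_dry
        wet = rng.random() < p
        days.append(wet)
    return days


def spell_lengths(days, wet=True):
    """Lengths of consecutive runs of wet (or dry) days."""
    lengths, run = [], 0
    for d in days:
        if d == wet:
            run += 1
        elif run:
            lengths.append(run)
            run = 0
    if run:
        lengths.append(run)
    return lengths


def expected_spell_length(p_stay):
    """Mean of a geometric spell length when the state persists with probability p_stay."""
    return 1.0 / (1.0 - p_stay)


# Hypothetical probabilities, for illustration only.
days = simulate_wet_dry(100_000, p_wet_after_wet=0.6, p_wet_after_dry=0.3, seed=1)
wet_spells = spell_lengths(days, wet=True)
mean_wet = sum(wet_spells) / len(wet_spells)
```

For these values the theoretical mean wet spell is 2.5 days, and the simulated mean should lie close to it.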

To model the amount of rain that falls on wet days, the common assumption has been that the amounts of rain on successive wet days are independent and fit a standard distribution. The ones that have been used include the lognormal, exponential, gamma, and Weibull, with the gamma being the most popular. However, small but significant correlations have been observed between the length of the wet period and the rainfall amount and separate parameter estimates have been made after classifying days according to the wet/dry status of adjacent days (Katz 1977; Buishand 1978). Buishand defined three classes (i.e., solitary wet days, wet days between a wet and dry or a dry and wet, and wet days between wet days). He found significant differences in the mean rainfalls for each class and proposed a model in which these means depended on the wet day class. An alternative method of including the correlations has been the multistate first-order Markov model of Haan et al. (1976), where the transition probabilities are conditional on the rainfall amounts; Guzman and Torrez (1985) provided a simpler version.

A model that encompasses both the dry and wet days is the truncated power of a normal model and it has also been found to apply equally well to hourly and monthly data. In this model, the data are transformed by a power and then fitted to the upper tail of a normal distribution, which has been truncated at zero. Thus, it is a three-parameter model, that is, the mean of the normal (which may well be negative although the mean of the data is not), the normal’s standard deviation, and the power of the transformation. Both square and cube roots have been used (Stidd 1973; Richardson 1977), and Hutchinson et al. (1993) allowed for a spatially varying power and found that the goodness of fit improved if the truncation was set at a small positive value rather than zero where the fit remained acceptable. The model can be easily fitted for both spatial and seasonal variation and so is practical and useful, but it does lack a physical basis.
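A short sketch may help to make the three parameters of the truncated power of normal model concrete. It assumes that a latent normal variate Z is drawn, that a nonpositive Z corresponds to a dry period, and that a positive Z equals the rainfall amount raised to the chosen power; the parameter values are hypothetical.

```python
import math
import random


def dry_probability(mu, sigma, threshold=0.0):
    """P(Z <= threshold) for Z ~ Normal(mu, sigma^2): the modelled chance of no rain."""
    return 0.5 * (1.0 + math.erf((threshold - mu) / (sigma * math.sqrt(2.0))))


def sample_amount(rng, mu, sigma, power):
    """Draw one amount: amount**power = Z when Z > 0, otherwise the amount is zero."""
    z = rng.gauss(mu, sigma)
    return z ** (1.0 / power) if z > 0.0 else 0.0


# Hypothetical parameters: a negative normal mean with a cube-root transformation.
rng = random.Random(0)
amounts = [sample_amount(rng, mu=-0.5, sigma=1.0, power=1.0 / 3.0) for _ in range(20_000)]
dry_fraction = sum(1 for a in amounts if a == 0.0) / len(amounts)
mean_amount = sum(amounts) / len(amounts)
```

The example shows how a negative normal mean is compatible with a positive mean amount: more than half of the draws are dry, while the wet draws are strictly positive.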

The occurrence of rainfall in events needs to be explicitly recognized in order to establish a physical basis and this cannot be achieved by directly modeling daily rainfalls or, indeed, any fixed falls. The second line mentioned above attempts to do this by assuming that the starting points of rainfall events are distributed randomly along the time continuum and with each of these is an associated random amount and/or duration of rain. Of the point processes available (Cox and Isham 1980), the Poisson process provides the best compromise between simplicity and generality.

In the independent Poisson marks (IPM) model (Eagleson 1972; Bacchi et al. 1989), events occur as Poisson arrivals, each with an associated random vector mark of two variables: the average intensity through the event and the duration of the event, which is assumed to be short compared to the interarrival time of the events. The time variation of rain rate for this model consists of a series of rectangular approximations to the actual variation and the model has been extended by forming a closer approximation to the actual variation by using several rectangles. This Poisson rectangular pulse (PRP) model was developed by Rodriguez-Iturbe et al. (1987) and, although its time variation of intensity is closer to reality, it is less realistic than the IPM since the pulses will generally overlap, implying that a new regime initiates at a point before the prior one ceases. Also, despite its extra complications, it performs no better than the IPM.
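The structure of the IPM model, and the overlap problem noted for rectangular pulses, can be sketched as follows. The exponential distributions for the marks and all numerical values are assumptions made for illustration; the cited papers should be consulted for the distributions actually used.

```python
import random


def simulate_ipm(horizon_h, arrival_rate, mean_intensity, mean_duration, seed=0):
    """Poisson arrivals over [0, horizon_h), each marked with (intensity, duration).

    Marks are drawn from exponential distributions (an illustrative assumption).
    Returns a list of (start time in h, intensity in mm/h, duration in h).
    """
    rng = random.Random(seed)
    t, events = 0.0, []
    while True:
        t += rng.expovariate(arrival_rate)  # exponential interarrival times
        if t >= horizon_h:
            break
        intensity = rng.expovariate(1.0 / mean_intensity)
        duration = rng.expovariate(1.0 / mean_duration)
        events.append((t, intensity, duration))
    return events


def overlapping_pairs(events):
    """Count consecutive events whose earlier pulse has not ended when the next one starts."""
    return sum(
        1
        for (t0, _, d0), (t1, _, _) in zip(events, events[1:])
        if t0 + d0 > t1
    )


# Hypothetical values: one event per 48 h on average, lasting 3 h on average.
events = simulate_ipm(24.0 * 365, arrival_rate=1.0 / 48.0,
                      mean_intensity=2.0, mean_duration=3.0, seed=2)
total_depth_mm = sum(i * d for _, i, d in events)
n_overlaps = overlapping_pairs(events)
```

Because nothing constrains a pulse to finish before the next arrival, overlaps occur whenever durations are not negligible relative to the interarrival times, which is the difficulty the text describes.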

The events modeled by the IPM and PRP models are referred to as such in the literature but would more properly be called episodes from the definitions offered in the introduction. To model the events as defined there, the clustering of episodes needs to be included. The Neyman–Scott (NS) and Bartlett–Lewis (BL) processes (see Cox and Isham 1980) both do this but in slightly different ways. For both, events occur as a Poisson process, each with an associated random number of episodes, which in the NS case have random starts in relation to the event origin with no episode starting at that point, whereas in the BL case, it is the interarrival times of the episodes that are random and the first episode occurs at the event origin (Burlando and Rosso 1993). At each episode origin, a random pulse is generated as in the IPM model, and since there are no constraints between the lengths of the pulses and the arrival times of the episodes, consecutive pulses will often overlap leading to the PRP situation, but now with some physical basis for the pulses.

In these Poisson models (i.e., IPM, PRP, NS, and BL), the rate parameter can be estimated by counting the number of events that occur over the observation period and the other model parameters can be fitted from the statistics of the events. However, the crucial task is the delineation of the events. There is no standard method and results vary according to the duration of the fixed fall being used and the “critical duration” chosen to be that which separates adjacent events (Bonta and Rao 1988). To avoid such choices, an event model can be verified against data accumulated over a timescale longer than that of the events, that is, monthly data. Revfeim (1982) fitted a two-parameter model (i.e., rate of event occurrence and event size) to monthly data and later (Revfeim 1984) fitted a model which included the event duration.

Another way of avoiding the delineation of events or episodes is to model the subepisodes, or continuously wet periods, and the intervening dry periods. Such a model is an alternating renewal continuous time (ARCT) model and was first proposed by Green (1964) with exponentially distributed durations for the wet and dry periods. When verified against daily data, it performed as well as the Markov model of Gabriel and Neumann (1962), but Small and Morgan (1986) found difficulties when using hourly data. Hutchinson (1990) has extended the model by adding a transition state that is always dry and divides the absolutely dry spells from the wet spells, which only connect via the transition state. Hutchinson found the dwell times in the states to be mixed exponentials and that Green’s model was a natural generalization for daily timescales. As for the IPM model, rainfall amounts can be associated with the wet periods and Hutchinson (1991) replaced the constant, exponentially distributed rate by a serially correlated, gamma-distributed intensity process.
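An alternating renewal process with exponentially distributed wet and dry durations, as in the simplest form of the ARCT model, can be simulated in a few lines. This is a sketch only; the mean durations are hypothetical and the transition state of the extended model is not represented.

```python
import random


def simulate_arct(horizon_h, mean_wet_h, mean_dry_h, seed=0):
    """Alternate dry and wet periods with exponential durations until horizon_h.

    Returns a list of (is_wet, length in h); the last period is cut at the horizon.
    """
    rng = random.Random(seed)
    t, wet, periods = 0.0, False, []  # start dry (arbitrary choice)
    while True:
        mean = mean_wet_h if wet else mean_dry_h
        length = rng.expovariate(1.0 / mean)
        if t + length >= horizon_h:
            periods.append((wet, horizon_h - t))
            break
        periods.append((wet, length))
        t += length
        wet = not wet
    return periods


# Hypothetical means: 2 h wet periods separated by 10 h dry periods.
periods = simulate_arct(24.0 * 365, mean_wet_h=2.0, mean_dry_h=10.0, seed=3)
wet_hours = sum(length for is_wet, length in periods if is_wet)
wet_fraction = wet_hours / (24.0 * 365)
```

Over a long record the wet fraction should approach mean_wet_h / (mean_wet_h + mean_dry_h), here one-sixth.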

It should be emphasized that the CT in ARCT refers to continuous time and, thus, continuous time data is required. With discrete, or fixed-fall data, the delineation of a continuously wet period depends heavily on the period length of the fixed fall: if the period were long enough, then no or few dry periods would be found; for daily data, the problem reduces to the modeling of wet and dry days; and for high time resolution data, the number of alternations between wet and dry increases as the fixed-fall period decreases. Thus, using fixed-fall data in ARCT models has some intrinsic difficulties, which would not be suffered by breakpoint data if it were used, since it is continuous with the wet and dry times available.
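The dependence of the apparent wet and dry spells on the fixed-fall period can be shown by accumulating a short breakpoint record at several resolutions. The record below is invented for illustration, with rates in millimeters per hour and durations in whole minutes so that the arithmetic on times is exact.

```python
def fixed_fall(breakpoints, period_min):
    """Accumulate (rate mm/h, duration min) pairs into consecutive fixed-length totals (mm)."""
    total_min = sum(d for _, d in breakpoints)
    n_periods = -(-total_min // period_min)  # ceiling division
    totals = [0.0] * n_periods
    t = 0
    for rate_mm_h, duration_min in breakpoints:
        end = t + duration_min
        while t < end:
            k = t // period_min  # index of the fixed period containing time t
            step = min(end, (k + 1) * period_min) - t
            totals[k] += rate_mm_h * step / 60.0
            t += step
    return totals


def count_wet_spells(totals):
    """Number of runs of consecutive fixed periods with a nonzero total."""
    spells, in_spell = 0, False
    for amount in totals:
        if amount > 0.0 and not in_spell:
            spells += 1
        in_spell = amount > 0.0
    return spells


# Invented breakpoint record: two wet periods separated by a 30-min dry break.
record = [(0.0, 50), (2.0, 20), (0.0, 30), (6.0, 10), (0.0, 130)]
spells_10min = count_wet_spells(fixed_fall(record, 10))   # 2
spells_hourly = count_wet_spells(fixed_fall(record, 60))  # 1
```

At 10-min resolution the two wet periods remain distinct, whereas hourly totals merge them into a single wet spell, although the accumulated depth is the same in both cases.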

Apart from the purely descriptive ones such as the truncated power of normal, the models described above, as a minimum, cover the sequences of wet and dry days and the amounts of rain on wet days and, at most, recognize that rainfall episodes are clustered into events but have difficulty in delineating these events or episodes. This difficulty can be circumvented by modeling subepisodes, but these models, like those at the event–episode level, suffer from the discretization of rainfall data and lack any representation of the variation of rain rate through wet periods. A physically based model must retain the ideas of subepisodes, within episodes, within events and should ensure that episodes do not overlap. Furthermore, unlike most of the models above, this model needs to clearly recognize that precipitation is not generated by a single process.

The proposed model complies with all these requirements and the available breakpoint data is suitable to fit to such a model. Thus, unlike most models that attempt physical realism, it is unique in that the data to be used closely follow the short timescale variations and it is directly fitted to that data rather than being fitted to fixed falls from which much of the real behavior of rainfall is lost. Furthermore, the model will summarize the short-term variations as climatologically useful statistics such as the mean rain rate for convective precipitation, etc. Although the model would not appear to be easily extendable for spatial modeling, the summary statistics for a set of stations could be examined for spatial variation.

It should also be noted that the primary variable that is to be fitted in the HMM is the rate of precipitation rather than its accumulation as is usually the case. Generally, only accumulations are available since fixed-fall measurements, which include much dry time, are much easier to obtain than good estimates of rain rate. There are also some essential difficulties with the precision of rain-rate measurements since rainfall is a discrete process, and in the limit the rain rate during a period of steady rain will vary between zero, when a raindrop is not at the point of measurement, and a large value when a drop is present. However, this ambiguity can be resolved by considering the work of Marshall and Palmer (1948) and their successors (e.g., Joss and Waldvogel 1969; Torres et al. 1994) who showed that for a given process, the rainfall rate is dependent on the distribution of raindrop sizes. Thus, in much the same way that the temperature of a gas is a bulk measure of the movement of the molecules, the rain rate is a bulk measure of the numbers and sizes of raindrops. It is assumed that the breakpoint data provides a reliable measure of this ambient rain rate.

3. Initialization of the HMM

Within each state of the HMM for the breakpoint rainfall data, the observations that can be attributed to that state have a probability density whose parameters require initial values, which can subsequently be reestimated by the HMM-fitting procedure. According to Rabiner (1989), these initial values need to be close to the final estimates and are usually estimated from a training dataset in which the state of each observation is known. However, such a dataset is not available with the breakpoint data. This lack is circumvented by assuming that if the breakpoint data is statically¹ modeled as a finite mixture distribution, then the components (or subsets of components) of this will align with the HMM’s states. Furthermore, these components’ distributions will be close to those of the states of the dynamic² HMM.

Sansom and Thomson (1992) decomposed the mixture distribution of the breakpoint data using the EM algorithm of Dempster et al. (1977) described by Redner and Walker (1984). They found that the wet periods in the breakpoint data were composed of two major components representing contributions from the rain-generating mechanisms (i.e., the Rw and Sw states or modes) and two minor components of which one, designated as mode E, was shown by simulation to be due to the inherent imperfection of manual digitization and the other, designated as mode D, due to occasional confusion over whether a particular period was wet or dry. A component representing this confusion was also found in the dry periods which, in addition, had a component for each rain-generating mechanism (i.e., Rd and Sd). The interevent drys required two components designated as I and M, where the first of these is just a single dry period between precipitation events, while the second (i.e., M) was explained by Sansom (1995a) as due to some events being weak, not giving precipitation at the observation site, not being detected there, and thus giving rise to a multiple interevent dry period.

¹ In the sense that the temporal order of the data is ignored.
² In the sense that the temporal order of the data is taken into consideration.

FIG. 1. (Top) A scatterplot of the 40 570 wet breakpoint data points, each of which has a rain rate and a duration for which that rate persisted. The plot is on log–log axes and the center part has been contoured. (Middle) Four-component EM fit to the wet data where a small amount of truncation has been used such that data below the sloping line across the plot has been discarded. The lower of these lines shows the least amount of truncation that was used, and the upper the greatest amount before the pattern shown ceased to be that of the most probable fit. (Bottom) Similar to the middle but for higher truncations, i.e., levels between those shown by the sloping lines. In the top panel the truncation lines are also shown. In both the middle and bottom panels the same contours are used from mode to mode except where a mode is small and the lowest contour for the other modes cannot be used in which case the mode is shown by a dashed contour set at 95% of that mode’s maximum and the value of the contour as a percentage of the lowest contour.

Sansom and Thomson’s (1992) data were similar to that shown at the top of Fig. 1 and their results similar to the middle panel of Fig. 1 and the bottom of Fig. 2 with component E small and located away from the Rw component such that E and Sw could be taken together to represent the Sw state. On the other hand, D was collocated with significant mass of both the Rw and Sw components in the region of low rates and short periods and might well have given rise to erroneous estimates for the distribution parameters of both Rw and Sw. In an attempt to refine the Rw and Sw estimates, and despite the inherent difficulty that, because rainfall is a discontinuous process, ambiguity over the presence or absence of precipitation exists within any rainfall measurement system, Sansom (1995a) analyzed a recompiled breakpoint dataset in which a new criterion to differentiate between wet and dry periods within the manually digitized data had been applied. In this paper, a further attempt to refine the Rw and Sw estimates is made by dispensing with that part of the dataset where doubtful differentiation between wet and dry exists.

The rest of this section details the result of applying the EM algorithm with modifications as presented in Sansom (1995b) and ST97 to the truncated or censored breakpoint dataset. The wet data that was discarded was for the lower rates at all durations, but rather than discard any dry periods, all periods, both wet and dry, were analyzed as a whole. To finalize parameter estimates for the wet distributions, some of the “all periods” results were used as “fixed” values in a reanalysis. For those fixed components, their locations and scales were not updated to a new estimate with each iteration of the EM algorithm, but rather those parameters were treated as given constants and only their fractional representations were estimated.
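The idea of holding some components fixed while still estimating their fractional representations can be sketched with a plain EM iteration for a univariate normal mixture fitted to log durations. This is a simplification and not the modified algorithm of Sansom (1995b) and ST97: there is no truncation or censoring, the data are synthetic, and every name and starting value is hypothetical.

```python
import math
import random


def normal_pdf(y, mu, sd):
    z = (y - mu) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def em_step(log_x, weights, mus, sds, fixed):
    """One EM iteration; components flagged in `fixed` keep their location and scale."""
    k = len(weights)
    # E-step: posterior probability that each datum belongs to each component.
    resp = []
    for y in log_x:
        dens = [weights[j] * normal_pdf(y, mus[j], sds[j]) for j in range(k)]
        total = sum(dens)
        resp.append([d / total for d in dens])
    # M-step: fractions are always updated; locations and scales only if not fixed.
    new_w, new_mu, new_sd = [], [], []
    for j in range(k):
        nj = sum(r[j] for r in resp)
        new_w.append(nj / len(log_x))
        if fixed[j]:
            new_mu.append(mus[j])
            new_sd.append(sds[j])
        else:
            mu = sum(r[j] * y for r, y in zip(resp, log_x)) / nj
            var = sum(r[j] * (y - mu) ** 2 for r, y in zip(resp, log_x)) / nj
            new_mu.append(mu)
            new_sd.append(math.sqrt(var))
    return new_w, new_mu, new_sd


# Synthetic log durations from two well-separated components (illustration only).
rng = random.Random(4)
log_x = [rng.gauss(0.0, 0.5) for _ in range(300)] + [rng.gauss(3.0, 0.5) for _ in range(300)]

w, mu, sd = [0.5, 0.5], [0.0, 2.0], [0.5, 1.0]
fixed = [True, False]  # component 0 is treated as a given constant
for _ in range(40):
    w, mu, sd = em_step(log_x, w, mu, sd, fixed)
```

After the iterations the fixed component retains its initial location and scale exactly, while its fraction and all parameters of the free component have been reestimated.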

But first, it should be noted that a sufficiently long period with no rain can easily be recognized as dry while for short periods, some threshold rate of rain needs to be detected if the period is to be treated as wet. Furthermore, the shorter the period, the higher the threshold and the placement of the truncation–censoring line over the rate versus period plane was set accordingly (see Fig. 1). The difference between truncation and censoring should also be noted: in the former all information about certain data is lost, while in the latter, the count of the number of data being ignored is retained. It should also be noted that this can be estimated for the truncation case; that is, the size of the dataset before truncation, here designated N, can be estimated. Since it is not known how many periods have been misclassified with respect to their wet or dry status, truncation rather than censoring seemed more appropriate.

FIG. 2. (Top) Histogram of the 16 112 dry period lengths. (Middle) Histogram of all 56 682 period lengths, i.e., both the dry and the wets. (Bottom) Six-component EM fit to all the periods with the location with respect to duration of the Sw and Rw components of the bottom panel of Fig. 1 shown by vertical dashed lines.
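The estimation of N under truncation can be illustrated in one dimension. In the sketch below, which is only an analogy to the bivariate case treated here, durations are taken to be lognormal, data below a threshold are discarded, and N is estimated as the retained count divided by the fitted probability of being retained. The function names and numbers are hypothetical.

```python
import math


def lognormal_sf(x, mu, sigma):
    """P(X > x) when log X ~ Normal(mu, sigma^2)."""
    z = (math.log(x) - mu) / (sigma * math.sqrt(2.0))
    return 0.5 * math.erfc(z)


def estimate_total(n_retained, retained_probability):
    """Estimate the pre-truncation sample size N from the retained count."""
    return n_retained / retained_probability


# Hypothetical fitted parameters and truncation threshold.
p_retained = lognormal_sf(0.5, mu=0.0, sigma=1.0)
n_hat = estimate_total(n_retained=4000, retained_probability=p_retained)
n_discarded_hat = n_hat - 4000
```

Under censoring the discarded count would be known and would enter the likelihood directly; under truncation it is a by-product of the fit, as n_discarded_hat is here.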

Figure 1 shows the wet data at the top and the results of the decomposition of the truncated wet periods with the modal pattern for low truncation in the middle panel and for higher truncation in the bottom panel. The low truncation pattern is as described above with an N of about 42 000, which is only a little larger than the 40 570 observations of the dataset. For higher truncation, the E mode disappears or moves to a location (i.e., “?” in the bottom panel of Fig. 1), which is not supported by those simulations that earlier suggested that E was due to digitizing effects. Also, in the higher truncation pattern the D mode becomes very prominent, representing 30% of the estimated N, which is close to 50 000; thus, 40 570 − 50 000 × (1 − 0.3) ≈ 5500 dry periods were misclassified as wet, this being the observed count less the part of N not attributed to D. This number represents about 14% of the wet data, which is much higher than would be expected given the known performance of the gauge–digitizer system that is the source of the data.
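The count of misclassified periods for the higher truncation case can be checked numerically. The calculation below adopts one reading of the text, namely that the observed wet count less the part of the estimated N not attributed to mode D gives the number of dry periods recorded as wet; this reading is an assumption, chosen because it reproduces the quoted figures of about 5500 and 14%.

```python
n_observed = 40_570   # wet observations in the dataset
n_estimated = 50_000  # approximate estimated N for the higher truncation fit
d_fraction = 0.30     # share of the estimated N attributed to mode D

# Part of the estimated N belonging to the other modes (Rw, Sw and, if present, E).
not_d = n_estimated * (1.0 - d_fraction)

# Observed wet periods in excess of that part are taken to be misclassified drys.
misclassified = n_observed - not_d
share_of_wet = misclassified / n_observed
```

This gives about 5570 periods, or roughly 13.7% of the wet data, in agreement with the rounded values in the text.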

A similar calculation for the low truncation case also yields around 5000 drys misclassified as wets; thus, to avoid this area of doubt in an analysis of the dry periods, truncation is required so that only those periods longer than about 1 h remain. Such truncation would be severe, and from the top panel of Fig. 2, which shows a histogram of the dry data, it can be seen that the mode of the dry periods’ distribution is at about 1 h. However, even if there is doubt at times over whether a period is wet or dry, it can be assumed that all period lengths have been digitized sufficiently correctly and all periods, both wet and dry rather than only the dry periods, can be decomposed as a whole to give the dry components and the marginal duration distributions for the wet periods. A histogram of all the period lengths is given in the middle panel of Fig. 2.

The result of fitting all the periods is shown in the bottom panel of Fig. 2 in which the locations found for Rw and Sw in the more highly truncated wet data are shown by vertical lines; no components that might have aligned with the wet E and D modes could be found. Thus, the estimations found in this all-period analysis for Rw and Sw are close to those derived from the truncated wet data. Therefore, the all-periods fit can be repeated with the locations and scales of Rw and Sw fixed at the mean of the truncated-wet-data estimate and the all-periods estimate. The fit differed little from that at the bottom of Fig. 2 and the other four components, which align with previous estimates for Rd, Sd, I, and M, represented 18 726 observations, which should be compared with the 16 112 that were classified as dry. Thus, about 2600 drys appear to have been misclassified as wet periods in which case the actual number of wet periods that should be in the dataset is about 38 000 and this can be used as N in an EM fit to the censored wet data.

FIG. 3. (Top) The four dry components of the EM fit to all periods repeated from the bottom panel of Fig. 2 after allowing for those Rd, which were misclassified as wet. (Bottom) In the style of the lower panels of Fig. 1 showing the final four components fit to the wet data after fixing E, Sw, and Rw from a censored fit in which the location and sizes of Sw and Rw were fixed from a mean of the all periods fit and the more highly truncated wet data fit.

A three-mode (i.e., Rw, Sw, and E) censored-wet-data fit with N = 38 000 was obtained, but it bore little resemblance to the other fits and it was necessary to fix the locations of Rw and Sw to the mean values used before. Also, their relative sizes were fixed in the ratio found in the all-periods-fixed fit and a range of values from 1% to 10% for the representation of E was tried. It was found that with E’s representation at 5%, the scale parameters for Rw and Sw were close to the fixed values previously used. Also, the rate variate parameters for Rw and Sw were similar to the estimates from the truncated-wet-data fit and E was located in the area suggested by the simulation of the manual digitization process.

The final step in finding initial values for the state distributions’ parameters is to find the parameters for a component in the wet dataset that, given the Rw, Sw, and E modes, will represent those data that, although classified as wet, were really dry times. This can be achieved by finding a four-component (i.e., Rw, Sw, E, and D) fit to the full wet dataset but with all the parameters of the Rw, Sw, and E modes fixed to the estimates made by the censored-wet-data fit. The bottom panel of Fig. 3 shows the result of such a fit in which the D component represents 2434 of the 40 570 wet data and with respect to duration is located close to the Rd mode in the dry data. The top panel of Fig. 3 shows the components from the all-periods fit with fixed Rw and Sw, which are attributable to dry periods, but Rd’s representation has been reduced from 7264 data by the 2434 of these, which had been classified as wet, that is, the D mode of the bottom panel.
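The bookkeeping of period counts used in this section can be verified with integer arithmetic. All figures below are quoted from the text; only the variable names are introduced here.

```python
n_wet_recorded = 40_570       # periods classified as wet
n_dry_recorded = 16_112       # periods classified as dry
n_dry_components = 18_726     # observations represented by Rd, Sd, I and M in the all-periods fit

# Drys apparently recorded as wet, and the implied number of genuinely wet periods.
drys_recorded_as_wet = n_dry_components - n_dry_recorded  # about 2600
n_wet_implied = n_wet_recorded - drys_recorded_as_wet     # about 38 000

# Rd's representation after removing the data taken up by mode D in the wet fit.
rd_all_periods = 7264
d_mode = 2434
rd_reduced = rd_all_periods - d_mode

n_all_periods = n_wet_recorded + n_dry_recorded  # total in the middle panel of Fig. 2
```

The difference of 2614 is consistent with the "about 2600" quoted, and is of the same order as the 2434 data assigned to mode D in the final fit.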

4. Fitting the HMM

The procedures detailed in Rabiner (1989) were used to fit an HMM to the breakpoint dataset, with the transition probability matrix, P, initialized to the same probability for all transitions. The state distributions were initialized using the values illustrated in Fig. 3, with some of the components paired to form mixture distributions for some of the states, that is, Rd with D together composed the Rd state, Sw with E the Sw state, and I with M the I state. The Rw and Sd states had single-component distributions. The fitting resulted in new estimates for all these distributions, which are illustrated in Fig. 4, and it also gave an estimate for P, that is (rows give the current state and columns the next state),

\[
P \;=\;
\begin{array}{c|ccccc}
 & \mathrm{Rw} & \mathrm{Rd} & \mathrm{Sw} & \mathrm{Sd} & \mathrm{I} \\ \hline
\mathrm{Rw} & 0.742 & 0.113 & 0.115 & 0.023 & 0.006 \\
\mathrm{Rd} & 0.549 & 0.000 & 0.451 & 0.000 & 0.000 \\
\mathrm{Sw} & 0.044 & 0.068 & 0.349 & 0.433 & 0.106 \\
\mathrm{Sd} & 0.037 & 0.000 & 0.963 & 0.000 & 0.000 \\
\mathrm{I}  & 0.194 & 0.000 & 0.806 & 0.000 & 0.000
\end{array}
\]

The Viterbi algorithm was then used to determine the sequence of states³ that maximizes the probability of the observation sequence, and the value of this probability is also available.

³ Hence the proportional representation of each state.

FIG. 4. In the style of Fig. 3, showing the distributions of the HMM as reestimated using the values of Fig. 3 as initial values in the HMM fitting procedures.
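The Viterbi step can be sketched as follows. This is a minimal log-space illustration, not the fitting code used for this paper: the function name `viterbi`, the two-state toy chain, and all numerical values are assumptions chosen only to make the recursion concrete. In the rainfall application the per-observation likelihoods would come from the fitted rate and duration distributions of the five states.

```python
import math

def viterbi(log_pi, log_A, log_B):
    """Return the most probable state path and its log probability.

    log_pi[i]   : log initial probability of state i
    log_A[i][j] : log transition probability from state i to state j
    log_B[t][j] : log likelihood of observation t under state j
    """
    n_states = len(log_pi)
    delta = [log_pi[j] + log_B[0][j] for j in range(n_states)]
    backpointers = []
    for t in range(1, len(log_B)):
        new_delta, ptr = [], []
        for j in range(n_states):
            # best predecessor of state j at time t
            best_i = max(range(n_states), key=lambda i: delta[i] + log_A[i][j])
            ptr.append(best_i)
            new_delta.append(delta[best_i] + log_A[best_i][j] + log_B[t][j])
        delta = new_delta
        backpointers.append(ptr)
    last = max(range(n_states), key=lambda j: delta[j])
    path = [last]
    for ptr in reversed(backpointers):   # trace the best path backwards
        path.append(ptr[path[-1]])
    path.reverse()
    return path, delta[last]

def to_log(matrix):
    """Elementwise log; zero probabilities map to -inf."""
    return [[math.log(p) if p > 0 else float("-inf") for p in row] for row in matrix]

# Hypothetical two-state example (0 = "wet", 1 = "dry"); values are illustrative only.
pi = [0.5, 0.5]
A = [[0.9, 0.1], [0.1, 0.9]]
B = [[0.8, 0.2], [0.8, 0.2], [0.2, 0.8], [0.2, 0.8]]  # likelihood of each observation per state

path, log_prob = viterbi([math.log(p) for p in pi], to_log(A), to_log(B))
print(path, math.exp(log_prob))
```

Working in logarithms avoids numerical underflow, which matters for a sequence of tens of thousands of breakpoints.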

Figure 4 resembles Fig. 3 in broad outline, but the Rw and Rd modes in the HMM are located as for the slightly truncated EM algorithm fit (i.e., the middle panel of Fig. 1); only Rw is now larger and Rd is smaller than D. With regard to the components for the dry periods, their locations in Figs. 3 and 4 are similar, but the Sd and I modes appear larger in the HMM and the Rd and M smaller.

Except for the E and D modes, the labels used on the various wet and dry components of Figs. 1–4 have not as yet been justified in any way but merely selected to conform with anecdotal expectations. The association between the wet and dry components was fixed by minimizing the number of episodes, which is equivalent to maximizing rain-generating mechanism persistence.

The decision as to which pair of wet and dry modes can be assigned to rain and which to showers was made through a comparison with contemporary hourly manual weather observations. Both these methods were used in Sansom (1995a), where further details are given.

An episode has been defined as a period of time during which the precipitation mechanism remains constant but during which dry intervals can take place; thus, it is a time with a sequence of period labels like Rw . . . RwRdRw . . . RwRdRw, or like Sw . . . SwSdSw . . . SwSdSw. Of the periods, 16 441 were labeled Rw or Rd, and these were grouped into 1808 episodes, and 37 858 periods were labeled Sw or Sd in 3846 episodes, whereas a random mixture of 16 441 things of one kind with 37 858 things of another kind would on average produce 22 926 runs rather than 1808 + 3846 = 5654, which is equivalent to 176 standard deviations too few. On the other hand, if the Rd and Sd labels are swapped, then 26 213 are labeled Rw or Rd in 14 628 runs, and 28 085 are labeled Sw or Sd in 16 663 runs, but a total of 27 117 runs might be expected instead of 14 628 + 16 663 = 31 291, which is equivalent to 36 standard deviations too many. Thus, the chosen labeling minimizes the number of episodes, and the size of that number shows that much persistence exists in the data, as there are far fewer runs than might have been expected.
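The expected run counts and standard deviations quoted above are consistent with the standard two-category runs (Wald–Wolfowitz) formulas. The sketch below assumes that this is the calculation intended; the function name `runs_test` is hypothetical, and only the counts are taken from the text.

```python
import math

def runs_test(n1, n2, observed_runs):
    """Mean, standard deviation, and z score of the number of runs in a
    random arrangement of n1 items of one kind and n2 of another."""
    n = n1 + n2
    mean = 2.0 * n1 * n2 / n + 1.0
    var = 2.0 * n1 * n2 * (2.0 * n1 * n2 - n) / (n * n * (n - 1.0))
    sd = math.sqrt(var)
    return mean, sd, (observed_runs - mean) / sd

# Adopted labeling: 16 441 R periods and 37 858 S periods in 1808 + 3846 episodes.
mean_a, sd_a, z_a = runs_test(16441, 37858, 1808 + 3846)

# Rd and Sd labels swapped: 26 213 R and 28 085 S periods in 14 628 + 16 663 runs.
mean_b, sd_b, z_b = runs_test(26213, 28085, 14628 + 16663)

print(round(mean_a), round(z_a), round(mean_b), round(z_b))
```

Under these formulas the adopted labeling gives far fewer runs than a random arrangement would, and the swapped labeling gives more, which is the contrast drawn in the text.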

With regard to the comparison with manual weather observation, it was found that in the 15-yr period of the data, 75% of the hours labeled as I were also judged by the human observer to be a time of no precipitation. The remaining 25%, when the observer suggested some precipitation, were mainly times when adjacent showers were reported and so corresponded to those parts of the M mode that were really Sd’s. For those hours labeled as S, the manual observation agreed 88% of the time, and for R the agreement was 66%; however, with the HMM S and R labels switched, the agreements both dropped to 35%. It should be noted here that, despite the shortcomings of the HMM, which will be mentioned later, these levels of agreement for the adopted labeling and disagreement when the R, S labels were switched indicate some improvement in the HMM over the static model of Sansom (1995a).

The histograms of Fig. 5 show the distributions of episode lengths in terms of both the number of breakpoints and duration in hours; the means and standard deviations of the distributions are also shown. According to the definitions given in the introduction, events are composed of episodes and, if now R is used to denote the series of Rw’s and Rd’s within a rain episode and similarly for S, then, between any two interevent drys (i.e., I’s) there will be at least one R or S episode and possibly a string of RS episode pairs. The structure of events in terms of the number of episodes is given in the left-hand column of Table 1, where in the fourth row, n is the number of rain-episode–shower-episode (RS) pairs that occurred within an event.


FIG. 5. Histograms of episode lengths in terms of the number of breakpoints in the top two panels and in terms of hours in the next lower two panels. The bottom panel is for the interevent dry periods and is in terms of hours; an equivalent in terms of breakpoints is not shown since all such periods are just one breakpoint long. In the top right-hand corner of each panel the number of episodes and the mean and standard deviation of their lengths are given.

TABLE 1. Some statistics of events in terms of the number of episodes within the events.

Statistic                               No. in data    Mean no. in simulations
Total no. of events                     2572           2694
No. of events with a single S           1441            986
No. of events with a single R             22             17
No. of events like (S)(RS)^n(R)         1109           1691
  No. with an initial S                  647           1195
  No. followed by a R                     32             81
  No. with n = 0 (i.e., SR pairs)         23             31
  Maximum of n                             8             12
  Mean of n                             1.61           2.17

5. Discussion and simulations

To a certain extent, the reversion of the modal pattern in Fig. 4 from that of Fig. 3 to that of the middle panel of Fig. 1, which is similar to that used in Sansom (1995a), vindicates the discriminant analysis presented in that paper. Similarly, with Rd of Fig. 4 now suggesting that only 714 of the 40 570 wet periods should have been classified as dry, the performance of the gauge–digitizer system as a source for breakpoint data is also vindicated. It should also be noted that the relative sizes of I and M in Fig. 4 are similar to those found in Sansom (1995a).

Figures 4 and 5, as well as P and Table 1, show that the HMM conforms with anecdotal expectations that during widespread rain there may be changes of rain rate but little dry time, while during showers shorter periods of steady precipitation tend to be interspersed with longer dry periods. The model indicates that, for the location concerned, about 170 precipitation events occurred every year, with 56% of these being a single-shower episode and 43% a succession of RS episode pairs with an average of 1.61 such pairs in each event. However, 58% of these RS successions actually started with an S episode, and a few ended with an R, with most of these consisting of an SR pair. The events covered 23% of the available time, which was divided between rain and shower episodes as 3.7% and 19.3%, respectively, but 20% of the rain time was dry, as was 77% of the shower time. Rain episodes yielded about 63% of the total precipitation although there were over twice as many shower episodes, with an average duration of 6.6 h, while rain episodes only averaged 2.7 h. The mean duration of the interevent drys was about 39 h.
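Several of the percentages quoted above can be recovered, to rounding, from the counts in Table 1 and the mean episode lengths shown in Fig. 5. The sketch below is only an arithmetic cross-check; it assumes that the number of interevent dry periods equals the number of events (2572) and that total time is the sum of episode and interevent durations, neither of which is stated explicitly in the text.

```python
HOURS_PER_YEAR = 365.25 * 24

# Counts from Table 1 and mean durations (h) from Fig. 5.
n_events = 2572
n_rain, mean_rain_h = 1808, 2.7
n_shower, mean_shower_h = 3846, 6.6
n_dry, mean_dry_h = n_events, 39.1      # assumption: one interevent dry per event

rain_h = n_rain * mean_rain_h
shower_h = n_shower * mean_shower_h
total_h = rain_h + shower_h + n_dry * mean_dry_h

rain_frac = rain_h / total_h            # share of time in rain episodes
shower_frac = shower_h / total_h        # share of time in shower episodes
events_per_year = n_events / (total_h / HOURS_PER_YEAR)

single_shower_share = 1441 / n_events   # events consisting of a single S
rs_share = 1109 / n_events              # events like (S)(RS)^n(R)
initial_s_share = 647 / 1109            # of those, the share starting with an S

print(f"{rain_frac:.3f} {shower_frac:.3f} {events_per_year:.0f}")
```

The implied record length is close to the 15-yr period mentioned in section 4, which suggests the assumptions are reasonable.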

For a given transition matrix, the expected proportional representation for each state, $\pi$, can be found by solving $\pi P = \pi$ with the constraint $\sum_{i=1}^{M} \pi_i = 1$, where M is the number of states. For the P estimated from the breakpoint data, the expected count for the I’s was approximately as observed, but the Rw and Rd states were less represented in the data than might have been expected (by 1000 each), while the Sw and Sd were overrepresented (also by 1000 each). A number of simulations were run using the estimated P and state


TABLE 2. Some statistics of episodes in terms of the number of breakpoints (brkpts.) and hours within the episodes (NB: No. of runs ≡ No. of episodes).

                              Data                          Simulations
Episode type                  No. of brkpts.  No. of runs   No. of brkpts.   No. of runs
Rain episodes                 16 441          1808          18 433 ± 500     3702 ± 50
Shower episodes               37 858          3846          35 740 ± 500     5783 ± 100

                              Data                          Simulations
                              Mean     Std dev              Mean             Std dev
Rain episodes (brkpts.)        9.1      8.4                  5.0 ± 0.05       4.9 ± 0.15
Shower episodes (brkpts.)      9.8     12.1                  6.2 ± 0.10       5.9 ± 0.15
Rain episodes (h)              2.7      2.9                  1.5 ± 0.02       1.6 ± 0.05
Shower episodes (h)            6.6      8.5                  4.7 ± 0.10       6.9 ± 0.50
Interevent dry periods (h)    39.1     41.1                 36.5 ± 1.00      41.6 ± 2.00

distributions to assess how these differences would affect the episode and event statistics. In doing this, it was found that the observed counts for the Rw, Rd, Sw, Sd, and I states were equivalent to distances of −3.8, −21.8, 6.4, 8.7, and 1.8 standard deviations, respectively, from the expected counts.
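The expected proportional representation of the states follows from the stationary condition on the transition matrix. The sketch below finds it by repeated multiplication (power iteration), which is an assumption about method made for illustration; any linear solver would serve equally. The matrix entries are the estimates given in section 4, with rows renormalized because the printed values are rounded to three decimals.

```python
STATES = ["Rw", "Rd", "Sw", "Sd", "I"]

# Estimated transition matrix from section 4 (rows: current state, columns: next state).
P_RAW = [
    [0.742, 0.113, 0.115, 0.023, 0.006],
    [0.549, 0.000, 0.451, 0.000, 0.000],
    [0.044, 0.068, 0.349, 0.433, 0.106],
    [0.037, 0.000, 0.963, 0.000, 0.000],
    [0.194, 0.000, 0.806, 0.000, 0.000],
]

def normalise_rows(matrix):
    """Rescale each row to sum to one (the printed values are rounded)."""
    return [[p / sum(row) for p in row] for row in matrix]

def stationary(P, tol=1e-12, max_iter=100000):
    """Iterate pi <- pi P from a uniform start until pi stops changing."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(max_iter):
        new = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(new, pi)) < tol:
            return new
        pi = new
    return pi

P = normalise_rows(P_RAW)
pi = stationary(P)
for name, share in zip(STATES, pi):
    print(f"{name}: {share:.3f}")
```

Multiplying each share by the total number of breakpoints gives the expected state counts against which the observed counts are compared.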

Table 1 presented the statistics of both the actual and the simulated events, and Table 2 presents the statistics for both actual (repeated from Fig. 5) and simulated episodes. It can be seen from these tables that both the rain and shower episodes found in the data are longer on average, in terms of both the number of breakpoints and the temporal extent, than suggested by the simulations. Also, there are fewer episodes per event in the data than in the simulations, in which the standard error of the mean of n was 0.03; thus, the mean number in the data is about 18 standard deviations from the expected mean number.

Overall, it appears that the HMM allows easier exit from an episode than is found in the data, and some adjustment to the model is required. In the above fitting procedures no allowance for seasonality had been made, and an initial adjustment could be to fit the model on a month-by-month basis. However, when this was done for the initialization values, little variation was found, and when these were compared to fits for individual years, the interannual variability was also small but larger than the intraannual variation. Thus, it seems unlikely that allowing for seasonality would be sufficient adjustment to the HMM. Essentially, the required change would be to the dwell times, in terms of the number of breakpoints, which are too short, and since these times in a Markov model are geometrically distributed, the most direct means of increasing the dwell times is to adopt a distribution other than the geometric, in particular, one with its mode greater than unity. Such a model, which also requires the exclusion of self-transitions, would be a hidden semi-Markov model (HSMM), and fitting procedures are available (Rabiner 1989).
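The geometric dwell-time property referred to above can be written out explicitly. In the following, $D_i$ (the number of consecutive breakpoints spent in state $i$) is notation introduced here for illustration, and $p_{ii}$ is the corresponding diagonal element of P:

```latex
\[
\Pr(D_i = d) = p_{ii}^{\,d-1}\,(1 - p_{ii}), \quad d = 1, 2, \ldots,
\qquad
\mathrm{E}[D_i] = \frac{1}{1 - p_{ii}} .
\]
```

Because $\Pr(D_i = d)$ decreases with $d$ for any $p_{ii} < 1$, the mode is always at $d = 1$. With the estimates of section 4, $p_{\mathrm{RwRw}} = 0.742$ gives a mean dwell of about 3.9 breakpoints in Rw, and $p_{\mathrm{SwSw}} = 0.349$ gives about 1.5 in Sw. These are dwell times in single states rather than episode lengths, but episode lengths in the HMM inherit the same tendency toward a mode in the first class.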

The HSMM fitting procedures are significantly more complex than those for the HMM, and before turning to the HSMM, some adjustments to the HMM should be considered. These are of two kinds: first, by explicitly disallowing some transitions and thus restricting the connectivity of the states, and second, by increasing the number of states and thus allowing the HMM to find finer structure in the data. The former type of adjustment was attempted and some details are given below, but while pursuing this, some suggestion was found of a seven-state model with a greater likelihood than the five-state model. However, the physical interpretation of the states proved difficult, and the second type of adjustment was not attempted any further.

The degree of persistence found in the data exceeded that of the HMM, and any restrictions within P were, therefore, aimed at correcting this by reducing the options available for changes between episode types. In the estimated P, the transitions from or to Rd and Sd or I were not exactly zero but of the order $10^{-40}$ or less; however, these might have become significant if other elements of P were set to zero, and as a first restriction, they were set to zero so that all transitions between dry states were forbidden. A second restriction was to disallow changes between Rw and Sw so that episodes must change through a dry period, and a third was to insist that an Rd can only change to an Rw so that an R episode would always start and end with Rw states.
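The three restrictions can be expressed as sets of forbidden transitions applied to an initial transition matrix. The sketch below is one possible encoding and not the procedure used for this paper: the helper names are hypothetical, dry-state self-transitions are assumed to be included in the first restriction, and the renormalized matrices would serve only as starting values for refitting.

```python
from itertools import combinations

STATES = ["Rw", "Rd", "Sw", "Sd", "I"]
DRY = ["Rd", "Sd", "I"]

# Unrestricted estimate of P from section 4 (rounded values).
P_EST = [
    [0.742, 0.113, 0.115, 0.023, 0.006],
    [0.549, 0.000, 0.451, 0.000, 0.000],
    [0.044, 0.068, 0.349, 0.433, 0.106],
    [0.037, 0.000, 0.963, 0.000, 0.000],
    [0.194, 0.000, 0.806, 0.000, 0.000],
]

# Forbidden (from, to) pairs for each restriction described in the text.
RESTRICTIONS = {
    1: {(a, b) for a in DRY for b in DRY},           # no dry-to-dry transitions
    2: {("Rw", "Sw"), ("Sw", "Rw")},                 # no direct Rw <-> Sw changes
    3: {("Rd", s) for s in STATES if s != "Rw"},     # Rd may only change to Rw
}

def restrict(P, forbidden):
    """Zero the forbidden transitions and renormalise each row."""
    out = []
    for i, row in enumerate(P):
        kept = [0.0 if (STATES[i], STATES[j]) in forbidden else p
                for j, p in enumerate(row)]
        total = sum(kept)
        if total == 0.0:
            raise ValueError(f"state {STATES[i]} has no permitted exit")
        out.append([p / total for p in kept])
    return out

# Restrictions taken singly, in pairs, and all together: seven restricted models.
models = {}
for k in (1, 2, 3):
    for combo in combinations(sorted(RESTRICTIONS), k):
        forbidden = set().union(*(RESTRICTIONS[r] for r in combo))
        models[combo] = restrict(P_EST, forbidden)

print(sorted(models))
```

In the reestimation formulas of Rabiner (1989), a transition probability that starts at zero remains zero, so initializing with such a matrix is enough to enforce a restriction throughout the fit.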

Taking the three restrictions singly, in pairs, and all together gave seven other HMMs, which were compared to the HMM with no restrictions through the probabilities of the observation sequence given the particular model and through their π’s (the expected proportional representations of the states). With regard to the latter, for all models, the expected and observed population sizes of the states were significantly different and usually in the same sense as for the unrestricted model. Thus, by this measure, there was no improvement through imposing transition restrictions, and by the measure through the observation sequence probability, in only one instance was the unrestricted HMM value exceeded. This was with just the second restriction, when changes between Rw and Sw were forbidden, but what in other models was taken to be the Rd state no longer seemed to fill that role. Instead, over half the outward transitions from Rd were to Sw/Sd and a quarter to I, while for inward transitions to Rd half were from I and a quarter from Sw/Sd. Also, the size of the state had grown at the expense of Sw’s population size, suggesting that the state was more concerned with light showers, considering its location in the duration–rate plane, than with dry intervals in rain episodes.

6. Conclusions

The putative light shower state alluded to at the end of the last section suggests that there may be finer structure in the rainfall process than that modeled by the five states Rw, etc., in which case, if further rain-generating mechanisms are excluded, then one if not both of the R and S mechanisms will need to be divided into subclasses. This could certainly be handled within an HMM by allowing further states and would be acceptable physically since more than one class of synoptic situation gives rise to showers, and similarly for rain. Alternatively, further states could be introduced to allow for second-order effects where a transition between states may be influenced by the prior state and, for example, a state denoted by RRw would indicate that the current state is Rw and the prior one was either Rw or Rd.

There is no particular indication, apart from the excessive dwell times in the data compared to the model, that states like RRw may be required, but some indication of subclasses within showers can be seen in the top panel of Fig. 5. In that figure, the shape of the histogram is such that it might represent a mixture with modes at one and three breakpoints, that is, one for single light showers and another for longer showery episodes. However, in all of the nine simulations in which a mixture distribution⁴ was not used, the resulting equivalent histograms were of a similar shape, with the second class smaller than the first and third.

On the other hand, the rain breakpoint distribution in the second panel from the top of Fig. 5, where the mode is clearly not in the first class, implies that a distribution other than geometric for these is required, and hence an HSMM rather than an HMM should be fitted. Furthermore, in all the simulations the equivalent distribution’s mode was distinctly in the first class, and it should also be noted that to achieve with a mixture a mode away from the origin requires one of the components to be other than geometric. Thus, since merely restricting some transitions was insufficient to enable the HMM to adequately model the observations, the impetus for advancing to an HSMM appears stronger than for just including additional states to the five used in this paper.

⁴ It should be noted that a mixture of geometrics would still have a mode in the first class.

Despite the deficiencies of the HMM, it did, as noted earlier, give a closer match to manual hourly weather observations than the static model of Sansom (1995a), and the description given at the beginning of section 5 and the implications of Fig. 4 accord with the general anecdotal view of rainfall. Furthermore, it appears that there is more agreement between the data and this view than with the simulations, which suggested shorter episodes and more episodes per event than in the data. Thus, in a well-fitting HSMM with possibly more than five states and suitable transition restrictions it is possible that episodes may be yet longer and the number of episodes per event smaller, in which case even greater accord with the anecdotal view might be claimed.

It is unfortunate that, apart from the manual weather observations, no independent dataset exists that gives at a high temporal resolution an assessment of the ambient state of the atmosphere with regard to precipitation, that is, whether at a particular time it is R, S, or I. It is also unfortunate that much manual effort is required to produce the breakpoint data and that even with the greatest care there are inherent errors in the digitizing process. Both of these issues are currently being addressed: the first by locating breakpoint gauges in the vicinity of a weather radar from whose images it should be possible to assess which of R, S, or I is ambient; and the second with the development of processes to automatically yield breakpoint data from high temporal resolution gauges. However, the immediate future thrust will be to fit the currently available data to an HSMM with five or more states.

REFERENCES

Bacchi, B., P. Burlando, and R. Rosso, 1989: Extreme value analysis of stochastic models of point rainfall. Third Scientific Assembly of IAHS, Baltimore, MD, IAHS.

Barring, L., 1992: Comments on “Breakpoint representation of rainfall.” J. Appl. Meteor., 31, 1520–1524.

Bickel, P. J., and Y. Ritov, 1996: Inference in hidden Markov models I: Local asymptotic normality in the stationary case. Bernoulli, 2, 199–228.

Bonta, J. V., and A. R. Rao, 1988: Factors affecting the identification of independent rainstorm events. J. Hydrol., 98, 275–293.

Buishand, T. A., 1978: Some remarks on the use of daily rainfall models. J. Hydrol., 36, 295–308.

Burlando, P., and R. Rosso, 1993: Stochastic models of temporal rainfall: Reproducibility, estimation and prediction of extreme events. Stochastic Hydrology and Its Use in Water Resources Systems Simulation and Optimization, J. B. Marco, Ed., Kluwer Academic, 137–173.

Cox, D. R., and V. Isham, 1980: Point Processes. Chapman and Hall, 188 pp.

Dempster, A. P., N. M. Laird, and D. B. Rubin, 1977: Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Stat. Soc., Ser. B, 39, 1–38.

Dennett, M. D., J. A. Rodgers, and J. D. H. Keatinge, 1983: Simulation of a rainfall record for a new site of a new agricultural development: An example from northern Syria. Agric. Meteor., 29, 247–258.

Eagleson, P. S., 1972: Dynamics of flood frequency. Water Resour. Res., 8, 878–898.

Gabriel, K. R., and J. Neumann, 1962: A Markov chain model for daily rainfall occurrence at Tel Aviv. Quart. J. Roy. Meteor. Soc., 88, 90–95.

Green, J. R., 1964: A model for rainfall occurrence. J. Roy. Stat. Soc., Ser. B, 26, 345–353.

Guzman, A. G., and W. C. Torrez, 1985: Daily rainfall probabilities: Conditional on prior occurrence and amount of rain. J. Climate Appl. Meteor., 24, 1009–1014.

Haan, C. T., D. M. Allen, and J. O. Street, 1976: A Markov chain model of daily rainfall. Water Resour. Res., 12, 443–449.

Hutchinson, M. F., 1990: A point rainfall model based on a three-state continuous Markov occurrence process. J. Hydrol., 114, 125–148.

Hutchinson, M. F., 1991: Climatic analysis in data sparse regions. Climatic Risk in Crop Production, R. C. Muchow and J. A. Bellamy, Eds., CAB International, 55–71.

Hutchinson, M. F., 1995: Stochastic space–time weather models from ground-based data. Agric. Forest Meteor., 73, 237–264.

Hutchinson, M. F., C. W. Richardson, and P. T. Dyke, 1993: Normalization of rainfall across different time steps. Management of Irrigation and Drainage Systems, Park City, UT, Irrigation and Drainage Division, ASCE, U.S. Dept. of Agriculture, 432–439.

Joss, J., and A. Waldvogel, 1969: Raindrop size distribution and sampling size errors. J. Atmos. Sci., 26, 566–569.

Katz, R. W., 1977: Precipitation as a chain dependent process. J. Appl. Meteor., 16, 671–676.

Le Cam, L., 1961: A stochastic description of precipitation. Proc. Fourth Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, CA, Office of Ordinance Research, U.S. Army, 165–186.

Leroux, B. G., 1992: Maximum-likelihood estimation for hidden Markov models. Stochastic Processes and their Applications, 40, 127–143.

Marshall, J. S., and W. M. Palmer, 1948: Relation of raindrop size to intensity. J. Meteor., 5, 165–166.

Newnham, E. V., 1916: The persistence of wet and dry weather. Quart. J. Roy. Meteor. Soc., 42, 153–162.

Rabiner, L. R., 1989: A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE, 77, 257–285.

Redner, R. A., and H. F. Walker, 1984: Mixture densities, maximum likelihood, and the EM algorithm. Soc. Ind. Appl. Math., Rev., 26, 192–239.

Revfeim, K. J. A., 1982: Comments “On the study of a probability distribution for precipitation total.” J. Appl. Meteor., 21, 1942–1945; Corrigendum, 22, 502.

Revfeim, K. J. A., 1984: An initial model of the relationship between rainfall events and daily rainfalls. J. Hydrol., 75, 357–364.

Richardson, C. W., 1977: A model of stochastic structure of daily precipitation over an area. Colorado State University, Fort Collins Hydrology Paper 91.

Rodriguez-Iturbe, I., D. R. Cox, and V. Isham, 1987: Some models for rainfall based on stochastic point process. Proc. Roy. Soc. London, Ser. A, 410, 269–288.

Roldan, J., and D. A. Woolhiser, 1982: Stochastic daily precipitation models. 1. A comparison of occurrence processes. Water Resour. Res., 18, 1451–1459.

Sansom, J., 1987: Digitising pluviographs. J. Hydrol. N.Z., 26, 197–209.

Sansom, J., 1988: Rainfall variation at Invercargill, New Zealand. N.Z. J. Geol. Geophys., 31, 247–256.

Sansom, J., 1992: Breakpoint representation of rainfall. J. Appl. Meteor., 31, 1514–1519.

Sansom, J., 1995a: Rainfall discrimination and spatial variation using breakpoint data. J. Climate, 8, 624–636.

Sansom, J., 1995b: The breakpoint representation of rainfall. Proc. Sixth Int. Meeting on Statistical Climatology, Galway, Ireland, University College Galway, 355–358.

Sansom, J., and P. J. Thomson, 1992: Rainfall classification using breakpoint pluviograph data. J. Climate, 5, 755–764.

Small, M. J., and D. J. Morgan, 1986: The relationship between a continuous-time renewal model and a discrete Markov chain model of precipitation occurrence. Water Resour. Res., 22, 1422–1430.

Stern, R. D., and R. Coe, 1984: A model fitting analysis of daily rainfall data. J. Roy. Stat. Soc., Ser. A, 147, 1–34.

Stidd, C. K., 1973: Estimating the precipitation climate. Water Resour. Res., 9, 1235–1241.

Torres, D. S., J. M. Porra, and J. Creutin, 1994: A general formulation for raindrop size distribution. J. Appl. Meteor., 33, 1494–1502.

Woolhiser, D. A., and G. G. S. Pegram, 1979: Maximum likelihood estimation of Fourier coefficients to describe seasonal variations of parameters in stochastic daily precipitation models. J. Appl. Meteor., 18, 34–42.

Woolhiser, D. A., and J. Roldan, 1982: Stochastic daily precipitation models. 2. A comparison of distributions of amounts. Water Resour. Res., 18, 1461–1468.
