Post on 20-May-2020

Data science and engineering for local weather forecasts

Nikhil R Podduturi

Data {Scientist, Engineer}

November, 2016

Agenda

● About MeteoGroup
● Introduction to weather data
● Problem description
● Data science and weather forecasting
● Engineering
● Verification
● Results
● Questions

How many of you check weather forecasts frequently?

Weather data

1.5 TB/day

Types of data

Observations:
● WMO weather stations (e.g. surface, upper-air, ships, drifting buoys, aircraft, etc.)
● MeteoGroup measurement network

Satellite data

Radar data

User data

Numerical weather prediction model data

Numerical weather prediction models

● Complex and multidimensional data
● 5 NWP models from different providers
● Data size per day: 0.5 TB

Data science and weather forecasting

Outcome

● Took 24 hours for 24-hour forecasts
● Grid interval: 736 km
● Poor results

MeteoGroup forecasting system

3 years of NWP data + 3 years of observation data → Machine learning model → Trained model

Daily NWP data → Trained model → Forecasts

MeteoGroup forecasting system

Written in Pascal

Runs on in-house high-performance computing cluster

Limitations
● Hard to maintain
● Not very transparent
● Scalability

Problem description

Next generation forecasting system

● Cloud-based solution
● Transparent
● Scalable
● Improve forecasting accuracy

Baseline model

NWP data → Downscale to location → Interpolate missing values → Linear model

Outcome:
● Very fast
● Poor accuracy
● Multicollinearity
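
The baseline steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not MeteoGroup's implementation: bilinear interpolation for the downscaling step, linear gap filling for missing values, a single-predictor least-squares fit as the linear model, and all function names and numbers are hypothetical.

```python
def downscale_bilinear(grid, x, y):
    """Bilinearly interpolate a 2-D grid (list of rows) at fractional (x, y).

    x indexes columns and y indexes rows; both must lie inside the grid.
    """
    x0, y0 = int(x), int(y)
    x1 = min(x0 + 1, len(grid[0]) - 1)
    y1 = min(y0 + 1, len(grid) - 1)
    fx, fy = x - x0, y - y0
    top = grid[y0][x0] * (1 - fx) + grid[y0][x1] * fx
    bottom = grid[y1][x0] * (1 - fx) + grid[y1][x1] * fx
    return top * (1 - fy) + bottom * fy


def fill_missing(series):
    """Linearly interpolate None gaps; gaps at the edges copy the nearest known value."""
    known = [i for i, v in enumerate(series) if v is not None]
    if not known:
        raise ValueError("series has no known values")
    filled = list(series)
    for i, v in enumerate(series):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            filled[i] = series[right]
        elif right is None:
            filled[i] = series[left]
        else:
            w = (i - left) / (right - left)
            filled[i] = series[left] * (1 - w) + series[right] * w
    return filled


def fit_linear(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (xs must not be constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical 2x2 temperature grid (deg C) with a station at its centre.
grid = [[10.0, 12.0], [14.0, 16.0]]
at_station = downscale_bilinear(grid, 0.5, 0.5)

# Hypothetical downscaled NWP series with one gap, and matching observations.
nwp = fill_missing([10.0, None, 14.0, 16.0])
obs = [11.0, 13.0, 15.0, 17.0]
slope, intercept = fit_linear(nwp, obs)
```

With several NWP predictors instead of one, the same structure applies, but correlated predictors make the fitted coefficients unstable, which is the multicollinearity issue listed in the outcome.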

Iteration 1

● Address multicollinearity using feature selection
● Scale the features

NWP data → Downscale to location → Interpolate missing values → Feature selection → Scale features → Linear model

Outcome:
● Improved accuracy
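
One way to picture the two steps of this iteration is the sketch below. The slides do not say which feature selection method was used; dropping one feature from each highly correlated pair is an assumption chosen for illustration, as are the threshold, the feature names, and the values.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def drop_collinear(features, threshold=0.95):
    """Keep a feature only if |r| with every already-kept feature is below threshold.

    `features` maps a name to its column of values; insertion order decides
    which member of a correlated pair survives.
    """
    kept = {}
    for name, column in features.items():
        if all(abs(pearson(column, other)) < threshold for other in kept.values()):
            kept[name] = column
    return kept


def standardize(column):
    """Scale a non-constant column to zero mean and unit (population) standard deviation."""
    n = len(column)
    mean = sum(column) / n
    std = sqrt(sum((x - mean) ** 2 for x in column) / n)
    return [(x - mean) / std for x in column]


# Hypothetical predictors: temperature in Celsius and in Kelvin carry the
# same information, so only the first of the two is kept.
features = {
    "temp_c": [1.0, 2.0, 3.0, 4.0],
    "temp_k": [274.15, 275.15, 276.15, 277.15],
    "wind_ms": [5.0, 1.0, 4.0, 2.0],
}
selected = drop_collinear(features)
scaled = {name: standardize(col) for name, col in selected.items()}
```

Scaling after selection puts the remaining predictors on a comparable range, so no single feature dominates the fit because of its units.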

Iteration 2

● Model selection between linear and non-linear models
● Advanced feature selection

NWP data → Downscale to location → Interpolate missing values → Advanced feature selection → Scale features → Model selection (linear and non-linear models)

Outcome:
● On par with existing forecasting system
● Slow training
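
Model selection of this kind can be sketched as fitting each candidate and keeping the one with the lowest error on held-out data. The candidates below (a least-squares line and a k-nearest-neighbour regressor), the hold-out split, and the data are assumptions for illustration; the slides do not name the models that were compared.

```python
def fit_linear(xs, ys):
    """Least-squares line; returns a predict(x) function."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def fit_knn(xs, ys, k=2):
    """k-nearest-neighbour regression, used here as a simple non-linear model."""
    pairs = list(zip(xs, ys))

    def predict(x):
        nearest = sorted(pairs, key=lambda p: abs(p[0] - x))[:k]
        return sum(y for _, y in nearest) / len(nearest)

    return predict


def mean_absolute_error(predict, xs, ys):
    """Average absolute difference between predictions and targets."""
    return sum(abs(predict(x) - y) for x, y in zip(xs, ys)) / len(xs)


def select_model(candidates, train, valid):
    """Fit every candidate on `train`; return the name with the lowest validation MAE."""
    scores = {}
    for name, fit in candidates.items():
        model = fit(*train)
        scores[name] = mean_absolute_error(model, *valid)
    best = min(scores, key=scores.get)
    return best, scores


# Hypothetical data following y = x^2, which a straight line fits poorly.
train_x = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
train_y = [x * x for x in train_x]
valid_x = [-2.5, -0.5, 0.5, 2.5]
valid_y = [x * x for x in valid_x]

best, scores = select_model(
    {"linear": fit_linear, "knn": fit_knn},
    (train_x, train_y),
    (valid_x, valid_y),
)
```

Every candidate has to be trained before it can be compared, so the training cost grows with the number of candidates, which is consistent with the slow training noted in the outcome.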

Engineering to scale the product

Baseline model engineering

(Scikit-learn, NumPy, Keras with TensorFlow)

Model engineering

(Scikit-learn, NumPy, Keras with TensorFlow)

Good:
● Python ML ecosystem
● Familiarity among the team
● Test-driven and Agile development
● Fail fast

Bad:
● Not scalable

47000 * 15 * 360 model runs

● 47000: Locations
● 15: Weather attributes (e.g. temperature, wind, etc.)
● 360: Hours
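
The size of this workload can be checked with a few lines of arithmetic. The three factors come from the slide; the worker count is a hypothetical value used only to show how the runs could be divided.

```python
LOCATIONS = 47_000  # from the slide
ATTRIBUTES = 15     # weather attributes, e.g. temperature, wind
HOURS = 360         # forecast hours

total_runs = LOCATIONS * ATTRIBUTES * HOURS


def runs_per_worker(total, workers):
    """Upper bound on runs per worker when the work is split evenly (integer ceiling)."""
    return (total + workers - 1) // workers


# Hypothetical cluster size, chosen only for illustration.
per_worker = runs_per_worker(total_runs, workers=100)
print(f"{total_runs:,} model runs, at most {per_worker:,} per worker")
```

The runs are independent of each other, so they can be distributed across workers without coordination beyond scheduling.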

Scaling with Apache Airflow

Apache Airflow
• By Airbnb
• Apache project since early 2016

Directed Acyclic Graph (DAG)

Components
• UI
• Scheduler
• Executor(s)

Apache Airflow DAG

● Hooks (connections)
● Operators (tasks)
● Schedule
● Dependencies
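
The dependency idea behind a DAG can be illustrated without Airflow itself. The sketch below uses Python's standard `graphlib` module (Python 3.9+), not Airflow's API, and the task names are hypothetical. It shows how declared dependencies determine both a valid run order and which tasks may run in parallel.

```python
from graphlib import TopologicalSorter

# Hypothetical tasks; each key maps to the set of tasks it depends on.
dependencies = {
    "downscale_to_location": {"fetch_nwp_data"},
    "interpolate_missing_values": {"downscale_to_location"},
    "train_model": {"interpolate_missing_values", "fetch_observations"},
    "persist_model": {"train_model"},
}


def parallel_batches(deps):
    """Group tasks into batches; every task in a batch has all its dependencies done."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the graph is not acyclic
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches


# One valid sequential order, and the batches a scheduler could run concurrently.
run_order = list(TopologicalSorter(dependencies).static_order())
batches = parallel_batches(dependencies)
```

In Airflow terms, each task would be an operator, the mapping would be expressed as dependencies between operators, and the scheduler and executor would decide when and where each batch runs.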

Airflow and Mesos

[Diagram components: Continuous integration, deploy, Airflow scheduler, Mesos cluster, persist, AWS S3]

Verification

Model improvement cycle

Deploy DAG → Verify model → Improve DAG → (repeat)

Forecast verification

AWS S3 with models → Forecast Engine → JSON-LD

Verification metrics

● Mean absolute error
● Root mean squared error
● Mean error
● Heidke skill score
● Equitable threat score
● Probability density functions
● Error percentiles
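
Several of these metrics follow directly from their standard definitions, sketched below. The continuous scores compare forecast and observed values; the Heidke skill score and equitable threat score are computed from a 2x2 contingency table for a yes/no event. The event threshold and all sample values are hypothetical, and degenerate tables (zero denominators) are not handled.

```python
from math import sqrt


def mean_error(forecast, observed):
    """Bias: positive when the forecast is too high on average."""
    return sum(f - o for f, o in zip(forecast, observed)) / len(forecast)


def mean_absolute_error(forecast, observed):
    return sum(abs(f - o) for f, o in zip(forecast, observed)) / len(forecast)


def root_mean_squared_error(forecast, observed):
    return sqrt(sum((f - o) ** 2 for f, o in zip(forecast, observed)) / len(forecast))


def contingency(forecast, observed, threshold):
    """Count (hits, false alarms, misses, correct negatives) for the event value >= threshold."""
    hits = false_alarms = misses = correct_negatives = 0
    for f, o in zip(forecast, observed):
        f_event, o_event = f >= threshold, o >= threshold
        if f_event and o_event:
            hits += 1
        elif f_event:
            false_alarms += 1
        elif o_event:
            misses += 1
        else:
            correct_negatives += 1
    return hits, false_alarms, misses, correct_negatives


def heidke_skill_score(a, b, c, d):
    """HSS with a=hits, b=false alarms, c=misses, d=correct negatives."""
    return 2 * (a * d - b * c) / ((a + c) * (c + d) + (a + b) * (b + d))


def equitable_threat_score(a, b, c, d):
    """ETS: threat score adjusted for the hits expected by chance."""
    hits_random = (a + b) * (a + c) / (a + b + c + d)
    return (a - hits_random) / (a + b + c - hits_random)


# Hypothetical temperature forecasts and observations (deg C).
temp_fc = [1.0, 3.0, 5.0, 7.0]
temp_ob = [2.0, 3.0, 4.0, 9.0]
me = mean_error(temp_fc, temp_ob)
mae = mean_absolute_error(temp_fc, temp_ob)
rmse = root_mean_squared_error(temp_fc, temp_ob)

# Hypothetical precipitation amounts (mm); the event is "at least 1 mm".
rain_fc = [0.0, 2.0, 5.0, 0.0, 1.0, 0.0]
rain_ob = [0.0, 3.0, 0.0, 0.0, 2.0, 1.5]
table = contingency(rain_fc, rain_ob, threshold=1.0)
hss = heidke_skill_score(*table)
ets = equitable_threat_score(*table)
```

Both skill scores equal 1 for a perfect forecast and 0 for a forecast no better than chance, which makes them comparable across events with different base rates.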

Mean absolute error for different models (Temperature)

Probability distribution function for multiple models (Temperature)

Percentile graphs for each model (Temperature)

For a demo, please stop by the MG booth

Results

Cloud-based solution
● AWS S3, EC2, ElastiCache

Transparent
● Verification microservice

Scalable
● Mesos cluster
● Training time cut from a month to 5 hours (approx.)

Improve forecasting accuracy
● On par or better

Improvements

Hyperlocal

AWS Lambda integration

Iterate for more accuracy

Questions?

We are hiring!