Developing a Data Harvester in the Amazon Cloud for the
Automated Assimilation of Florida’s Healthy Beaches Reports into
the GCOOS Data Portal
Robert Currier, Mote Marine Laboratory
Dr. Barbara Kirkpatrick, TAMU/GCOOS
Overview
FL Department of Health monitors 34 coastal counties
E. coli/Enterococcus samples taken weekly
DOH data publicly available but no API
Original DOH website used standard HTML/CSS
Python “web scraping” app developed to harvest data
DOH outsourced website to commercial provider
We had no access to DOH staff or API for the data
In today’s “Big Data” world this is becoming typical:
What we built broke when data format changed
This is the story of how we fixed the harvester
Original Data Harvester
Written in Python
Used the ‘urllib’ library for web scraping
Data stored in MySQL database
Harvester ran nightly out of cron
App walked through list of counties and built url:
http://esetappsdoh.doh.state.fl.us/irm00beachwater/beachresults.apx?county=’sarasota’
Data returned as Python text object
Text object fed to regular expression for matching
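The flow on this slide can be sketched roughly as follows. This is not the original harvester: it uses Python 3's urllib.request, the county list is an illustrative subset, and the row pattern in ROW_RE is a hypothetical stand-in, since the slides show neither the original page markup nor the regular expression. The query string is built with urlencode, so it does not reproduce the quoting shown in the slide's URL, and the MySQL step is left out.

```python
import re
from urllib.parse import urlencode
from urllib.request import urlopen

# Base URL as shown on the slide.
BASE_URL = "http://esetappsdoh.doh.state.fl.us/irm00beachwater/beachresults.apx"

# Illustrative subset; the real harvester walked a full county list.
COUNTIES = ["sarasota", "manatee"]

# Hypothetical row layout: <td>beach</td><td>date</td><td>result</td>
ROW_RE = re.compile(
    r"<td>(?P<beach>[^<]+)</td>\s*"
    r"<td>(?P<date>\d{1,2}/\d{1,2}/\d{4})</td>\s*"
    r"<td>(?P<result>[^<]+)</td>"
)


def build_url(county):
    """Return the results URL for one county."""
    return BASE_URL + "?" + urlencode({"county": county})


def fetch(url):
    """Download a page and return it as text (network access required)."""
    with urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def parse_results(html):
    """Extract (beach, date, result) tuples from plain HTML."""
    return [
        (m.group("beach").strip(), m.group("date"), m.group("result").strip())
        for m in ROW_RE.finditer(html)
    ]
```

A nightly run would loop over COUNTIES, call parse_results(fetch(build_url(county))) for each, and store the rows. Note that parse_results returns an empty list when the page carries no matching HTML, which is how a silent format change would show up.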
Original Data Format
And Then It Stopped Working…
FL DOH suddenly (to us) outsourced in early 2013
New website used proprietary JavaScript and Maps
Plain HTML no longer sent to the browser
Instead, custom JavaScript was loaded
The JavaScript used AJAX and DOM manipulation
New Data Format
The Solution
Emulating a browser with Selenium
Portable software test framework for web applications
Can act like Firefox, Chrome and IE
Typically used for building automated tests
We repurposed and used as a virtual browser
As a browser Selenium can execute JavaScript
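A minimal sketch of using Selenium as a "virtual browser", assuming the selenium Python package and Firefox are installed. The function name fetch_rendered_html and its optional driver parameter are illustrative, not taken from the harvester.

```python
def fetch_rendered_html(url, driver=None):
    """Load `url` in a real browser and return the DOM after scripts have run.

    If no driver is passed, a Firefox session is started (assumes Selenium
    and Firefox are installed) and closed again afterwards.
    """
    own_driver = driver is None
    if own_driver:
        # Imported lazily so the function can be exercised with a stand-in
        # driver on machines without Selenium.
        from selenium import webdriver
        driver = webdriver.Firefox()
    try:
        driver.get(url)            # browser loads the page and runs its JavaScript
        return driver.page_source  # serialized DOM, not the raw HTTP response
    finally:
        if own_driver:
            driver.quit()
```

Content that arrives through AJAX can land after get() returns, so a real harvester would probably need an explicit wait for the results to appear before reading page_source.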
Soup’s On!
Selenium worked and we now had data available
But data was very unstructured and massively ugly
BeautifulSoup4 to the rescue…
And The Soup Was Tasty!
BeautifulSoup4 gave us back our “structured” data
Some modification needed to data parsing code as…
Locations, variables and dates were not on same line
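One way to handle fields that no longer share a line is to flatten the rendered page to text and regroup consecutive lines into records. The sketch below assumes, purely for illustration, that each sample appears as three consecutive text lines in the order location, variable, date; the slides do not show the real layout, and text_lines and group_records are hypothetical names.

```python
def text_lines(html):
    """Return the visible text of a page as a list of stripped strings."""
    from bs4 import BeautifulSoup  # third-party package: beautifulsoup4
    soup = BeautifulSoup(html, "html.parser")
    return list(soup.stripped_strings)


FIELDS = ("location", "variable", "date")  # assumed order, one value per line


def group_records(lines, fields=FIELDS):
    """Stitch consecutive lines back into one dict per sample."""
    size = len(fields)
    usable = len(lines) - len(lines) % size  # ignore a trailing partial record
    return [
        dict(zip(fields, lines[i:i + size]))
        for i in range(0, usable, size)
    ]
```

Each dict from group_records(text_lines(html)) can then go through the same matching and storage steps that the single-line records used before.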
The New Code Worked Perfectly
In Our Development Environment
But Failed Spectacularly When We Deployed
What Happened?
Amazon EC2 instances are “headless” servers
No display hardware
No graphics libraries (GTK+)
Since no graphics libraries, no browsers
Without a browser, we crash and burn
Adding A Virtual Head
http://joekiller.com provided us with a script that pulled the source and built GTK+ on our cloud server in under two hours. Thanks, Joe Lawson!
Unfortunately, the script bombed and didn’t build Firefox. We had to download the source and build by hand.
Now we had a working browser, but no monitor on which to display our output…
Getting A Head with XVFB
XVFB: The X virtual frame buffer
Performs all graphical operations in memory
Doesn’t show output
Primarily used for testing, but…
We repurposed, just like Selenium
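A sketch of how a script might give the browser a virtual display, assuming the Xvfb binary is on the PATH. The display number :99, the screen geometry and the helper names are arbitrary choices for illustration, not details from the talk.

```python
import os
import subprocess
from contextlib import contextmanager


def xvfb_command(display=":99", screen="1024x768x24"):
    """Build the command line that starts an in-memory X server."""
    return ["Xvfb", display, "-screen", "0", screen]


@contextmanager
def virtual_display(display=":99"):
    """Run Xvfb for the duration of the block and point DISPLAY at it."""
    proc = subprocess.Popen(xvfb_command(display))
    previous = os.environ.get("DISPLAY")
    os.environ["DISPLAY"] = display
    try:
        # Xvfb may need a moment to start; a real script would wait or
        # retry before launching the browser.
        yield display
    finally:
        # Restore the caller's environment, then stop the X server.
        if previous is None:
            os.environ.pop("DISPLAY", None)
        else:
            os.environ["DISPLAY"] = previous
        proc.terminate()
        proc.wait()
```

A browser started inside `with virtual_display():` inherits DISPLAY and renders into memory, so nothing has to be attached to the server.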
Automating The Process
Conclusions
Don’t be afraid to use untraditional data sources
But be prepared for your code to break
We live in a data rich environment
But most of the data is very messy/unstructured
So tread lightly, and don’t lose your head!
Thanks To:
Mote Marine Laboratory
Gulf Coast Ocean Observing Systems
Texas A&M Department of Oceanography
All the Free and Open Source Software developers
In Remembrance Of
Seth Vidal, creator of ‘yum’, friend and FOSS guru
Killed while biking on July 8th 2013 in Durham, NC