Posted: 17 Aug 2015 | Category: Data & Analytics | Uploaded by: godatadriven
GoDataDriven, proudly part of the Xebia Group
Real time data driven applications
Giovanni Lanzani, Data Whisperer
Using Python + pandas as back end
Who am I?
2008-2012: PhD Theoretical Physics
2012-2013: KPMG
2013-Now: GoDataDriven
Feedback: @gglanzani
Real-time, data driven app?
• Not plain store and retrieve;
• But store, {transform, enrich, analyse} and retrieve;
• Real-time: retrieval is not a batch process;
• App: something your mother could use:

  SELECT attendees FROM pydataberlin2014 WHERE password = '1234';
Get insight about event impact
Is it Big Data?
• Raw logs are on the order of 40 TB;
• We use Hadoop for storing, enriching and pre-processing
  (10 nodes, 24 TB per node).
Challenges
1. Privacy;
2. Huge pile of data;
3. Real-time retrieval;
4. Some real-time analysis.
1. Privacy
3. Real-time retrieval
• Harder than it looks;
• Large data;
• Retrieval is by date, center location and radius.
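Retrieval by center + radius can be sketched with a haversine filter. Everything below is an illustrative guess at the shape of the problem, not the talk's code: `POSTCODE_COORDS` is a tiny hypothetical postcode-to-coordinates table, and `get_postcodes` stands in for the real lookup.

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in km."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * asin(sqrt(a))  # mean Earth radius of 6371 km

# Hypothetical postcode -> (lat, lon) centroid table
POSTCODE_COORDS = {
    "1012AB": (52.374, 4.898),  # central Amsterdam
    "1017CC": (52.362, 4.893),  # also Amsterdam
    "3011AA": (51.920, 4.481),  # Rotterdam
}

def get_postcodes(center, radius_km):
    """All postcodes whose centroid lies within radius_km of `center`."""
    lat, lon = POSTCODE_COORDS[center]
    return [pc for pc, (la, lo) in POSTCODE_COORDS.items()
            if haversine_km(lat, lon, la, lo) <= radius_km]
```

For example, `get_postcodes("1012AB", 2.0)` returns only the two Amsterdam postcodes, while a 100 km radius also picks up Rotterdam.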
4. (Some) real-time analysis
Architecture
Front-end: an AngularJS app. Back-end: app.py plus helper.py, exposed over REST; the two exchange JSON.
JS - 1, JS - 2: two JavaScript snippets shown as slide images (code not captured).
Flask
from flask import Flask

app = Flask(__name__)

@app.route("/hello")
def hello():
    return "Hello World!"

if __name__ == "__main__":
    app.run()
app.py example
@app.route('/api/<postcode>/<date>/<radius>', methods=['GET'])
@app.route('/api/<postcode>/<date>', methods=['GET'])
def datapoints(postcode, date, radius=1.0):
    ...
    stats, timeline, points = helper.get_json(postcode, date, radius)
    return ...  # returns a JSON object for AngularJS
data example
date hour id_activity postcode hits delta sbi
2013-01-01 12 1234 1234AB 35 22 1
2013-01-08 12 1234 1234AB 45 35 1
2013-01-01 11 2345 5555ZB 2 1 2
2013-01-08 11 2345 5555ZB 55 2 2
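Rows like these can be loaded into pandas and filtered the way the back end does. A minimal sketch using the table's values; `filter_data` is a hypothetical helper guessing at the shape of the talk's `get_data`, not its actual code:

```python
import pandas as pd

# The data example above, as a DataFrame
data = pd.DataFrame({
    "date": ["2013-01-01", "2013-01-08", "2013-01-01", "2013-01-08"],
    "hour": [12, 12, 11, 11],
    "id_activity": [1234, 1234, 2345, 2345],
    "postcode": ["1234AB", "1234AB", "5555ZB", "5555ZB"],
    "hits": [35, 45, 2, 55],
    "delta": [22, 35, 1, 2],
    "sbi": [1, 1, 2, 2],
})

def filter_data(df, postcodes, dates):
    """SELECT * FROM df WHERE postcode IN postcodes AND date IN dates."""
    return df[df.postcode.isin(postcodes) & df.date.isin(dates)]

subset = filter_data(data, ["1234AB"], ["2013-01-01"])
```

`subset` holds the single row for postcode 1234AB on 2013-01-01.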
Levenshtein distance

def levenshtein(s, t):
    if len(s) == 0:
        return len(t)
    if len(t) == 0:
        return len(s)
    cost = 0 if s[-1] == t[-1] else 1
    return min(levenshtein(s[:-1], t) + 1,
               levenshtein(s, t[:-1]) + 1,
               levenshtein(s[:-1], t[:-1]) + cost)

levenshtein("search", "engine")       # 6
levenshtein("Amsterdam", "meetup")    # 7
levenshtein("Lplein", "Leidseplein")  # 5
Aside
• Memoize every recursive algorithm in Python!

from cytoolz import functoolz

@functoolz.memoize
def memo_levenshtein(s, t):
    if len(s) == 0:
        return len(t)
    if len(t) == 0:
        return len(s)
    cost = 0 if s[-1] == t[-1] else 1
    # recurse into the memoized function, so subproblems are cached too
    return min(memo_levenshtein(s[:-1], t) + 1,
               memo_levenshtein(s, t[:-1]) + 1,
               memo_levenshtein(s[:-1], t[:-1]) + cost)
• Memoize the algorithm if you're using Python!

a = "Marco Polo str."
b = "Marco Polo straat"

%timeit memo_levenshtein(a, b)  # 213 ns
%timeit levenshtein(a, b)       # 1.84 µs
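cytoolz isn't required for this trick: the standard library's functools.lru_cache gives the same memoization. A sketch on the same recursive Levenshtein, with the recursion going through the cached function:

```python
from functools import lru_cache

@lru_cache(maxsize=None)  # cache every (s, t) subproblem
def memo_levenshtein(s, t):
    if not s:
        return len(t)
    if not t:
        return len(s)
    cost = 0 if s[-1] == t[-1] else 1
    return min(memo_levenshtein(s[:-1], t) + 1,       # delete from s
               memo_levenshtein(s, t[:-1]) + 1,       # insert into s
               memo_levenshtein(s[:-1], t[:-1]) + cost)  # substitute
```

Because the recursive calls hit the cache, the exponential recursion collapses to the O(len(s) * len(t)) subproblem grid.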
helper.py example
def get_json(postcode, date, radius):
    ...
    lat, lon = get_lat_lon(postcode)
    postcodes = get_postcodes(postcode, radius)
    dates = date.split(';')
    data = get_data(postcodes, dates)

    stats = get_statistics(data, sbi)
    timeline = get_timeline(data, sbi)

    return stats, timeline, data.to_json(orient='records')
helper.py example
def get_statistics(data, sbi):
    sbi_df = data[data.sbi == sbi]   # select * from data where sbi = sbi
    hits = sbi_df.hits.sum()         # select sum(hits) from ...
    delta_hits = sbi_df.delta.sum()  # select sum(delta) from ...
    if delta_hits:
        percentage = (hits - delta_hits) / delta_hits
    else:
        percentage = 0

    return {"sbi": sbi, "total": hits, "percentage": percentage}
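To check the arithmetic, the same function can be run on a toy frame built from the data example (hits 35 + 45, deltas 22 + 35 for sbi 1):

```python
import pandas as pd

def get_statistics(data, sbi):
    sbi_df = data[data.sbi == sbi]   # select * from data where sbi = sbi
    hits = sbi_df.hits.sum()         # select sum(hits)
    delta_hits = sbi_df.delta.sum()  # select sum(delta)
    percentage = (hits - delta_hits) / delta_hits if delta_hits else 0
    return {"sbi": sbi, "total": hits, "percentage": percentage}

data = pd.DataFrame({"sbi": [1, 1, 2],
                     "hits": [35, 45, 2],
                     "delta": [22, 35, 1]})
stats = get_statistics(data, 1)
# total = 80, percentage = (80 - 57) / 57 ≈ 0.404
```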
helper.py example

def get_timeline(data, sbi):
    # select sum(hits), sum(delta) from data group by date, hour, sbi
    df_sbi = data.groupby(["date", "hour", "sbi"]).aggregate(sum)
    return df_sbi
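The GROUP BY analogy can be verified directly; a sketch with toy values and the column names from the data example, written with the equivalent `[["hits", "delta"]].sum()` spelling:

```python
import pandas as pd

data = pd.DataFrame({
    "date": ["2013-01-01", "2013-01-01", "2013-01-08"],
    "hour": [12, 12, 12],
    "sbi": [1, 1, 1],
    "hits": [10, 25, 45],
    "delta": [4, 18, 35],
})

# SELECT SUM(hits), SUM(delta) FROM data GROUP BY date, hour, sbi
timeline = data.groupby(["date", "hour", "sbi"])[["hits", "delta"]].sum()
```

The result is indexed by (date, hour, sbi): the two 2013-01-01 rows collapse into one with hits 35 and delta 22.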
helper.py example

def get_json(postcode, date, radius):
    ...
    lat, lon = get_lat_lon(postcode)
    postcodes = get_postcodes(postcode, radius)
    dates = date.split(';')
    data = get_data(postcodes, dates)

    stats = get_statistics(data)
    timeline = get_timeline(data, dates)

    return stats, timeline, data.to_json(orient='records')
Who has my data?
• The first iteration was a (pre-)POC, with less data (3 GB vs 500 GB);
• Time constraints;
• Oops:

import pandas as pd
...
source_data = pd.read_csv("data.csv", ...)
...

def get_data(postcodes, dates):
    result = filter_data(source_data, postcodes, dates)
    return result
Pros and cons of "everything is a df"
Pro:
• Fast!!
• Use what you know
• NO DBA’s!
• We all love CSV’s!
Contra:
• Doesn’t scale;
• Huge startup time;
• NO DBA’s!
• We all hate CSV’s!
If you want to go down this path
• Set the dataframe index wisely;
• Align the data to the index:

  source_data.sort_index(inplace=True)

• Beware of modifications of the original dataframe!
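A synthetic sketch of that advice, assuming lookups by postcode: setting the index to the lookup key and sorting it is what lets pandas locate rows without scanning the whole frame (the data here is random; the real frame was 500 GB of logs).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "postcode": rng.integers(1000, 9999, size=100_000),
    "hits": rng.integers(0, 100, size=100_000),
})

# Set the index wisely: index on the column you look up by ...
indexed = df.set_index("postcode")
# ... and align the data to it, as the slide advises
indexed.sort_index(inplace=True)

pc = int(df["postcode"].iloc[0])  # a postcode known to exist
subset = indexed.loc[pc]          # fast lookup on the sorted index
```

On a sorted index `.loc` can binary-search; on an unsorted one it falls back to a full scan, which is what "align the data to the index" avoids.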
If you want to go down this path
The reason pandas is faster is because I came up with a better algorithm
New architecture
Same front-end (AngularJS, talking REST/JSON to app.py and helper.py), but the data now lives behind database.py, which talks to the data store via psycopg2. The helper signature changes accordingly:

  data = get_data(postcodes, dates)      # before: filter an in-memory dataframe
  data = get_data(db, postcodes, dates)  # after: query the database
Handling geo-data

def get_json(postcode, date, radius):
    """ ... """
    lat, lon = get_lat_lon(postcode)
    postcodes = get_postcodes(postcode, radius)
    dates = date.split(';')
    data = get_data(postcodes, dates)

    stats = get_statistics(data)
    timeline = get_timeline(data, dates)

    return stats, timeline, data.to_json(orient='records')
Issues?!
• With a radius of 10 km in Amsterdam you get ~10k postcodes, and the SQL becomes:

  SELECT * FROM datapoints
  WHERE date IN date_array
  AND postcode IN postcode_array;

• Even with an index on date and postcode, single queries ran for more than 20 minutes.
Postgres + Postgis (2.x)
PostGIS is a spatial database extender for PostgreSQL. It adds support for geographic objects, allowing location queries in SQL:

SELECT *
FROM datapoints
WHERE ST_DWithin(lon, lat, 1500)
AND dates IN ('2013-02-30', '2013-02-31');
-- every point within 1.5 km from (lat, lon), on imaginary dates

Other DBs?
Steps to solve it
1. Align data on disk by date;
2. Use the temporary table trick:

   CREATE TEMPORARY TABLE tmp (
       postcodes TEXT NOT NULL PRIMARY KEY
   );
   INSERT INTO tmp (postcodes) VALUES postcode_array;

   SELECT *
   FROM tmp
   JOIN datapoints d ON d.postcode = tmp.postcodes
   WHERE d.dt IN dates_array;

3. Lose precision: 1234AB → 1234.
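The temporary-table trick can be sketched end to end. The talk used Postgres via psycopg2; this sketch uses the stdlib's sqlite3 so it runs anywhere, with table and column names following the slide (and a tiny made-up datapoints table):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE datapoints (dt TEXT, postcode TEXT, hits INTEGER)")
conn.executemany("INSERT INTO datapoints VALUES (?, ?, ?)", [
    ("2013-01-01", "1234", 35),
    ("2013-01-01", "5555", 2),
    ("2013-01-08", "1234", 45),
])

postcode_array = ["1234"]        # in production: ~10k postcodes
dates_array = ["2013-01-01"]

# Instead of a giant IN (...) list of postcodes, load them into a
# temporary table and let the planner JOIN against it.
conn.execute("CREATE TEMPORARY TABLE tmp (postcodes TEXT NOT NULL PRIMARY KEY)")
conn.executemany("INSERT INTO tmp (postcodes) VALUES (?)",
                 [(p,) for p in postcode_array])

query = (
    "SELECT d.dt, d.postcode, d.hits FROM tmp "
    "JOIN datapoints d ON d.postcode = tmp.postcodes "
    "WHERE d.dt IN (%s)" % ",".join("?" * len(dates_array))
)
rows = conn.execute(query, dates_array).fetchall()
# rows == [("2013-01-01", "1234", 35)]
```

The join touches only the postcodes in tmp, which is what makes the real query fast once the data is also aligned on disk by date.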
Take home messages
1. Geospatial problems are “hard” and can kill your queries;
2. Not everybody has infinite resources: be smart and KISS!
3. SQL or NoSQL? (Size, schema)
We’re hiring / Questions? / Thank you!
Giovanni Lanzani, Data Whisperer