Date post: | 30-Aug-2014 |
Category: | Technology |
Upload: | benjamin-bengfort |
Building Data Products with Python
District Data Labs
Links to various resources
Introduction to Python: http://bit.ly/1gJ73Tt
Github Repository: http://bit.ly/1eLBzki
About the Instructor
Benjamin Bengfort
Data Science:
● MS Computer Science from North Dakota State
● PhD Candidate in CS at the University of Maryland
● Data Scientist at Cobrain Company in Bethesda, MD
● Board member of Data Community DC
● Lecturer at Georgetown University
Python Programmer:
● Python developer for 7 years
● Open source contributor
● My work on Github: https://github.com/bbengfort
I am available to collaborate and answer questions for all of my students.
Twitter: twitter.com/bbengfort
LinkedIn: linkedin.com/in/bbengfort
Github: github.com/bbengfort
Email: [email protected]
About the Teaching Assistant
Keshav Magge
● MS Computer Science from University of Houston
● Lead Data/Software Engineer at Cobrain Company in Bethesda, MD
Python Programmer:
● Python developer for 7 years
● Plone/Zope for 2 years, Django for 5 years
● My work on Github: https://github.com/keshavmagge
Reach out to me to talk about all things Python/data or just about life.
Twitter: twitter.com/keshavmagge
LinkedIn: linkedin.com/pub/keshav-magge/12/a2a/324/
Github: github.com/keshavmagge
Email: [email protected]
Building Data Products
Hilary Mason
“A data product is a product that is based on the combination of data and algorithms.”
Mike Loukides
“A data application acquires its value from the data itself, and creates more data as a result. It’s not just an application with data; it’s a data product. Data science enables the creation of data products.”
The Data Science Pipeline
Data Ingestion
Data Munging and Wrangling
Computation and Analyses
Modeling and Application
Reporting and Visualization
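The stages listed above can be sketched as a chain of small functions, each consuming the previous stage's output. This is only a toy illustration under stated assumptions: the function names, the in-memory list of book votes, and the "most votes wins" model are all hypothetical and are not taken from the course code.

```python
from collections import Counter


def ingest(source):
    """Data Ingestion: pull raw records from a source (here, any iterable)."""
    return list(source)


def wrangle(raw):
    """Data Munging and Wrangling: drop empty records and normalize titles."""
    return [title.strip().title() for title in raw if title and title.strip()]


def analyze(clean):
    """Computation and Analyses: count the votes for each title."""
    return Counter(clean)


def model(counts):
    """Modeling and Application: pick the title with the most votes."""
    return counts.most_common(1)[0][0]


def report(choice):
    """Reporting and Visualization: present the result to a human."""
    return f"Next read: {choice}"


# The pipeline is just the stages applied in order.
PIPELINE = [ingest, wrangle, analyze, model, report]


def run(source, stages=PIPELINE):
    """Feed the source through every stage and return the final output."""
    data = source
    for stage in stages:
        data = stage(data)
    return data
```

For example, `run([" dune", "Emma", "DUNE "])` passes three raw votes through all five stages. The toy `model` stage assumes at least one valid vote survives wrangling.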
Data Ingestion
● There is a world of data out there. How do we get it? Web crawlers, APIs, sensors? Python and other web scripting languages are custom made for this task.
● The real question is: how can we deal with such a giant volume and velocity of data?
● Big Data and Data Science often require ingestion specialists!
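A minimal sketch of an ingestion step, assuming raw responses are written to disk untouched so later stages can re-process them. The `ingest` function, the hashed file names, and the injectable `fetch` callable are hypothetical choices for illustration; the default fetcher uses the standard library's `urllib.request`.

```python
import hashlib
import time
import urllib.request
from pathlib import Path


def fetch_url(url, timeout=10):
    """Download a URL and return the raw response body as bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def ingest(urls, raw_dir, fetch=fetch_url, delay=0.0):
    """Fetch each URL, store the body unmodified, and return the saved paths."""
    raw_dir = Path(raw_dir)
    raw_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    for url in urls:
        body = fetch(url)
        # Name the file after a hash of the URL (an assumption of this sketch)
        # so that re-running the ingest overwrites instead of duplicating.
        name = hashlib.sha1(url.encode("utf-8")).hexdigest() + ".raw"
        path = raw_dir / name
        path.write_bytes(body)
        saved.append(path)
        time.sleep(delay)  # optional pause between requests
    return saved
```

Passing `fetch` as an argument keeps the network call swappable, which makes the function easy to exercise in tests without touching a live API.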
Data Wrangling
● Warehousing the data means storing the data in as raw a form as possible.
● Extract, transform, and load operations move data to operational storage locations.
● Filtering, aggregation, normalization and denormalization all ensure data is in a form it can be computed on.
● Annotated training sets must be created for ML tasks.
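A small sketch of the extract, transform, and load idea, assuming raw review records arrive as dictionaries with a 1 to 5 rating. The standard library's `sqlite3` stands in for SQLAlchemy here, and the `reviews` table, its columns, and the function names are hypothetical.

```python
import sqlite3


def transform(raw_rows):
    """Filter out incomplete rows and normalize ratings from 1-5 to 0-1."""
    for row in raw_rows:
        title = (row.get("title") or "").strip()
        rating = row.get("rating")
        if not title or rating is None:
            continue  # filtering: skip rows that cannot be computed on
        yield row.get("reader"), title, (float(rating) - 1.0) / 4.0


def load(conn, rows):
    """Load transformed rows into the operational table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS reviews (reader TEXT, title TEXT, rating REAL)"
    )
    conn.executemany("INSERT INTO reviews VALUES (?, ?, ?)", rows)
    conn.commit()


def average_ratings(conn):
    """Aggregate: mean normalized rating per title."""
    query = "SELECT title, AVG(rating) FROM reviews GROUP BY title ORDER BY title"
    return dict(conn.execute(query).fetchall())
```

A typical use would be `conn = sqlite3.connect(":memory:")` followed by `load(conn, transform(raw_rows))`, after which `average_ratings(conn)` returns one aggregated value per book.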
Computation and Analyses
● Hypothesis-driven computation includes the design and development of predictive models.
● Many models have to be trained or constrained into a computational form like a graph database, and this is time consuming.
● Other data products like indices, relations, classifications, and clusters may be computed.
Modeling and Application
This is the part we’re most familiar with: supervised classification and unsupervised clustering, using Bayes, Logistic Regression, Decision Trees, and other models.
This is also where the money is.
Reporting and Visualization
● Often overlooked, this part is crucial, even if we have data products.
● Humans recognize patterns better than machines. Human feedback is crucial in Active Learning and remodeling (error detection).
● Mashups and collaborations generate more data, and therefore more value!
Don’t forget feedback! (Active Learning for Data Products)
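One common way to use human feedback is uncertainty sampling: ask people about the items the model is least sure of, then fold their answers back into the training labels. The sketch below assumes the model exposes a probability per item; the function names and the 0/1 label encoding are hypothetical, and this is one possible strategy rather than the approach used in the course.

```python
def most_uncertain(scores, k=3):
    """Return the k item ids whose predicted probability is closest to 0.5."""
    ranked = sorted(scores, key=lambda item: abs(scores[item] - 0.5))
    return ranked[:k]


def apply_feedback(labels, item, liked):
    """Return a copy of the labels with one human judgment recorded."""
    updated = dict(labels)
    updated[item] = 1 if liked else 0
    return updated
```

After collecting feedback for the selected items, the model would be refit on the updated labels and the loop repeated.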
What we’re going to build today
SCIENCE BOOKCLUB!!
● A book club that chooses what to read via a recommender system.
● Ingests GoodReads data and returns feedback on books.
● The statistical model is a non-negative matrix factorization.
● Reporting uses Jinja (almost a web app).
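A minimal numpy sketch of non-negative matrix factorization using the multiplicative update rule, one standard way to compute it and not necessarily what the course code does. It assumes a readers-by-books ratings matrix `V` and, as a simplification, treats unrated books (zeros) as real values rather than missing ones. The function name and its parameters are hypothetical.

```python
import numpy as np


def nmf(V, rank, iterations=200, eps=1e-9, seed=0):
    """Factor a non-negative matrix V (m x n) into W (m x rank) and H (rank x n)."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    # Start from small positive random factors.
    W = rng.random((m, rank)) + eps
    H = rng.random((rank, n)) + eps
    for _ in range(iterations):
        # Multiplicative updates keep every entry non-negative;
        # eps guards against division by zero.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H
```

The product `W @ H` approximates `V`, so its entries in positions a reader has not rated can be read as predicted scores for a recommender.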
Workflow
1. Setting up a Python skeleton
2. Creating and Running Tests
3. Wading in with a configuration
4. Ingestion with urllib and requests
5. Creating a command line admin with argparse
6. Wrangling with BeautifulSoup and SQLAlchemy
7. Modeling with numpy
8. Reporting with Jinja2
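The command line admin from step 5 might look roughly like the sketch below, which uses `argparse` subcommands. The program name `octavo-admin`, the `ingest` and `recommend` subcommands, and their options are assumptions made for illustration, not the actual interface built in the course.

```python
import argparse


def build_parser():
    """Create a parser with one subcommand per pipeline task."""
    parser = argparse.ArgumentParser(
        prog="octavo-admin",  # hypothetical program name
        description="Administrative commands for the book club data product.",
    )
    subparsers = parser.add_subparsers(dest="command", required=True)

    # Hypothetical subcommand: download raw data from one or more URLs.
    ingest = subparsers.add_parser("ingest", help="download raw data")
    ingest.add_argument("urls", nargs="+", help="one or more URLs to fetch")
    ingest.add_argument("--outdir", default="raw", help="where to store raw files")

    # Hypothetical subcommand: fit the model and print recommendations.
    recommend = subparsers.add_parser("recommend", help="recommend books")
    recommend.add_argument("--rank", type=int, default=2, help="number of latent factors")

    return parser
```

Calling `build_parser().parse_args(["ingest", "http://example.com"])` yields a namespace whose `command` attribute says which task to dispatch to.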
Octavo Architecture (really clear DSP)
● Ingestion Module: requests.py
● Raw Data Storage
● Wrangling Module: BeautifulSoup, SQLAlchemy
● Computational Data Storage
● Recommender Module: Numpy
● Reporting Module: Jinja2, Matplotlib
How to tackle this course ...
Lean into it: absorb as much as possible, and don’t worry about falling behind. It will be in your head!
Then afterwards, let’s all digest it together (keep in touch).