Getting started with data science and machine learning
GradEx Workshop, 2019-02-21
Mher Kazandjian
About me
Not about me
https://tinyurl.com/y3lsmcnk
Data science and Machine learning
Learning objectives
- Pointers on data science and machine learning from a practitioner's perspective
- How to get started: software and skill requirements
- Data science and machine learning exercise (hands-on)
Data science - applications
- Advertising
- Linguistics
- Astronomy
- Forensics
- Intelligence / security
- Weather forecasting
- Financial/economic forecasting
What data science is not
- Building models
- Visualizing data
- Writing custom programs to process data
What data science is
- Using data to create as much impact as possible toward solving a given problem
- Providing insight
- Building data products
- Making product recommendations
To achieve this, it may be necessary to use tools such as:
- Building models
- Visualizing data
- Writing custom programs to process data
Big Data
- Data volumes started to grow dramatically after 2004
- Datasets became increasingly unstructured
- Traditional techniques were too slow and inadequate to handle the growth
- New tools and paradigms emerged
- The term "data science" was coined in the late 2000s
- Machine learning and AI are among the main beneficiaries of big data
  - e.g., in 2007 a deep neural network beat traditional models for the first time
Scales
Data volumes:
- A few GB up to PB or more

Processors:
- A multi-core machine -> 1000s of servers with 10,000s of cores
- Accelerators (GPUs, FPGAs, ASICs)

Memory:
- Several GB to TB

Network:
- Gbit and beyond
New software technologies
- Map-reduce
- Hadoop
- Spark
- NoSQL databases
  - Elasticsearch
  - MongoDB
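Map-reduce, the paradigm behind Hadoop and Spark, can be illustrated with a minimal single-machine word-count sketch in plain Python (a toy; real frameworks distribute the map and reduce phases across many machines):

```python
# A minimal word-count sketch of the map-reduce paradigm.
# Real frameworks (Hadoop, Spark) run these phases on a cluster.
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield word, 1

def reduce_phase(pairs):
    """Reduce: sum the counts emitted for each distinct word."""
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

docs = ["big data big models", "big clusters"]
print(reduce_phase(map_phase(docs)))
# {'big': 3, 'data': 1, 'models': 1, 'clusters': 1}
```

The key property is that the map phase is embarrassingly parallel and the reduce phase only needs pairs grouped by key, which is what makes the paradigm scale out.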
New software technologies
- Cloud computing
- Data science services
- Pre-deployed software that users can interact with to:
  - process data
  - run models
  - find data
Cloud providers save you time when it comes to setting up and configuring hardware and software.
- Good for:
  - fast prototyping
  - fast testing
  - fast deployment
Machine learning
- When traditional (algorithmic) methods fail, use machine learning
- The success of machine learning techniques is enabled by the abundance of data
- E.g., hand-written classification code is replaced by a model trained on data:
[Figure: input features (Length = 4.5 m, Width = 2 m, color = red, Mass = 1800 kg, Has glass = True), encoded as variables x, y, z, t, w and fed to a model that outputs class probabilities: Car 95%, School bus 4%, Door 0.9%, Tree 0.1%]
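To make the contrast concrete, here is a minimal sketch (hypothetical feature names and toy data, not from the slides): a hand-written rule versus a perceptron that learns its threshold from labelled examples:

```python
# Hand-written rules: the thresholds are chosen by a human.
def is_car_rules(length_m, width_m, mass_kg, has_glass):
    return 3.0 < length_m < 6.0 and 1.5 < width_m < 2.5 \
        and mass_kg > 800 and has_glass

# The machine-learning alternative: learn the decision boundary from
# labelled examples. A one-feature perceptron sketch (toy data).
def train_perceptron(samples, labels, lr=0.1, epochs=50):
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            pred = 1 if w * x + b > 0 else 0
            w += lr * (y - pred) * x
            b += lr * (y - pred)
    return w, b

masses = [1.8, 1.5, 0.02, 0.05, 2.2, 0.005]   # tonnes; cars vs non-cars
labels = [1, 1, 0, 0, 1, 0]
w, b = train_perceptron(masses, labels)
print(all((1 if w * m + b > 0 else 0) == y for m, y in zip(masses, labels)))
# True
```

The rule version encodes human knowledge explicitly; the learned version only needs data, which is why it wins once features multiply and rules become unmanageable.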
Machine learning
- Logistic regression
- Support vector machines
- Neural networks
  - Convolutional neural networks
  - Recurrent neural networks
  - Long short-term memory networks (LSTM)
  - Generative adversarial networks (GAN)
Machine learning
Applications
- Object detection, image recognition and classification
  - License plate recognition
  - Object classification
- Speech recognition
  - Siri, Google Assistant, Alexa, …
- Self-driving cars
  - Tesla, Volvo, Uber, Apple (probably soon)
  - Already outperform humans on average (precision and safety)
- AI in every TV (soon)
  - Built-in home assistant
  - Upscaling / super-resolution
- Face recognition and feature matching
  - Humanless terminals for passport checks
- Games
  - Go, Chess, StarCraft 2
Exercise
This looks quite steep
- No need to be an expert in data science to work at a good company; e.g.:
  - good analytic skills
  - some courses (Coursera material is more than enough)
  - some (self-)training (e.g. on www.kaggle.com)
  - average coding skills
  - contributing to open-source projects (gives you good exposure)
  - writing a blog
This looks quite steep
- Companies are just starting to integrate data science and AI into their products
  - lots of opportunities, and more to come in the near future
- Many problems that benefit our everyday life use simple models such as logistic regression
- 90% of the effort goes into getting good data; the rest is using that data, e.g. to produce a visualization or to train a model that classifies or predicts a trend/class
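A minimal sketch of logistic regression on toy one-dimensional data, fit by stochastic gradient descent in plain Python (in practice you would reach for scikit-learn or a similar library):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logreg(xs, ys, lr=0.5, epochs=1000):
    """Fit a one-feature logistic regression by stochastic gradient descent."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            p = sigmoid(w * x + b)   # predicted probability of class 1
            w -= lr * (p - y) * x    # gradient of the log-loss w.r.t. w
            b -= lr * (p - y)        # ... and w.r.t. b
    return w, b

# Toy data: class 1 whenever x > 0
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 0, 1, 1, 1]
w, b = train_logreg(xs, ys)
preds = [1 if sigmoid(w * x + b) >= 0.5 else 0 for x in xs]
print(preds)  # [0, 0, 0, 1, 1, 1]
```

Despite its simplicity, this model family (with more features and regularization) carries a large share of real-world production workloads.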
Typical workflow
- Download/collect/store/generate data
- Transform/filter/enhance/re-sample/augment data
- Label/classify/sort data
- Train/model data
- Use models to understand the data / solve a problem / extract a solution / reduce dimensionality
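As an illustration, the whole pipeline can be mocked end-to-end in a few lines of plain Python (toy synthetic data and a deliberately trivial "model"; every step stands in for the real thing):

```python
import random

random.seed(0)

# 1. generate/collect: two synthetic populations
raw = [random.gauss(0, 1) for _ in range(200)] + \
      [random.gauss(4, 1) for _ in range(200)]

# 2. transform/filter: drop extreme outliers
clean = [x for x in raw if abs(x) < 10]

# 3. label: values above 2 belong to class 1 (toy labelling rule)
labels = [1 if x > 2 else 0 for x in clean]

# 4. train: a one-parameter "model" -- the midpoint between class means
mean0 = sum(x for x, y in zip(clean, labels) if y == 0) / labels.count(0)
mean1 = sum(x for x, y in zip(clean, labels) if y == 1) / labels.count(1)
threshold = (mean0 + mean1) / 2

# 5. use the model: classify a new observation
print(1 if 3.5 > threshold else 0)  # 1
```

Steps 1-3 are where the bulk of real effort goes; the modelling in step 4 is usually the short part.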
Top languages / packages
- Python
  - NumPy, SciPy, pandas, Keras, TensorFlow, PyTorch
- R
- Java
- Scala
- MATLAB
DS and ML at AUB
- GPUs for deep learning
- On campus:
  - Mid-range problems: 8x Nvidia K20m (available now)
  - High-end cards: 2x V100 (fall 2019) for problems needing up to 64 GB of GPU RAM
  - On-demand Spark cluster (experimental)
  - R / RStudio (available / on demand)
  - Jupyter notebooks (available / on demand)
  - Data processing up to 1 TB on disk and 0.5 TB in RAM
  - Aggregated number of cores for research: ~800
- On the cloud:
  - Azure (via grant / dept / faculty funding)
Hands on demo
Demo 1: Data science
Official exam grades
https://colab.research.google.com/drive/1qiMUfiSPkR8oVvpyP0HgBnUJwFiS4wcY
Demo 2: Simple but useful neural network (not deep)
A classifier
https://colab.research.google.com/drive/1-Ui0SPgYaYCvdWR6LlnsFtFbBYm9ffqQ
Dataset layout
A 3-D NumPy array, e.g. of shape (3, 3, 4)
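The layout can be sketched in plain Python (with NumPy you would create it as `np.zeros((3, 3, 4))` and read `arr.shape` directly):

```python
# A 3-D array of shape (3, 3, 4): 3 blocks, each holding 3 rows of 4 values.
# Plain-Python stand-in for np.zeros((3, 3, 4)).
data = [[[0.0] * 4 for _ in range(3)] for _ in range(3)]

# Recover the shape by measuring each nesting level.
shape = (len(data), len(data[0]), len(data[0][0]))
print(shape)  # (3, 3, 4)
```

Real datasets follow the same layout with far larger dimensions, e.g. (samples, rows, features).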
Demo 3: Super resolution out of the box
A generative adversarial network (kind of)
https://github.com/fperazzi/proSR
Thank you for your attention