Python for Data Science - TDC 2015

Date post: 21-Apr-2017
Upload: gabriel-moreira
PYTHON FOR DATA SCIENCE Gabriel Moreira Machine Learning Engineer @gspmoreira 2015
Transcript
Page 1: Python for Data Science - TDC 2015

PYTHON FOR DATA SCIENCE

Gabriel Moreira
Machine Learning Engineer

@gspmoreira

2015

Page 2: Python for Data Science - TDC 2015

Why so much buzz?

Page 3: Python for Data Science - TDC 2015

https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century

Page 4: Python for Data Science - TDC 2015

Big Data

Page 5: Python for Data Science - TDC 2015

ONLINE PERSONALIZATION

Page 6: Python for Data Science - TDC 2015

WHAT IS DATA SCIENCE

http://drewconway.com

Page 7: Python for Data Science - TDC 2015

WHAT IS A DATA SCIENTIST

"A Data Scientist is someone with deliberate dual personality who can first build a curious business case defined with a telescopic vision and can then dive deep with microscopic lens to sift through DATA to reach the goal while defining and executing all the intermittent tasks."

http://www.datasciencecentral.com/profiles/blogs/are-you-a-data-scientist

Page 8: Python for Data Science - TDC 2015

Data Science MetroMap Curriculum

http://nirvacana.com/thoughts/becoming-a-data-scientist/

Page 9: Python for Data Science - TDC 2015
Page 10: Python for Data Science - TDC 2015

TYPES OF ANALYTICS

Investigative Analytics (Consumers: Humans)

Operational Analytics (Consumers: Machines)

http://blog.cloudera.com/blog/2014/03/why-apache-spark-is-a-crossover-hit-for-data-scientists/

https://hbr.org/2014/08/the-question-to-ask-before-hiring-a-data-scientist/

Page 11: Python for Data Science - TDC 2015

DATA SCIENCE IS IOSEMN

Inquire, Obtain, Scrub, Explore, Model, iNterpret

[Hilary Mason, Data Scientist]

Page 12: Python for Data Science - TDC 2015

PYTHON IS IOSEMN

Inquire, Obtain, Scrub, Explore, Model, iNterpret

jsOutsider

Page 13: Python for Data Science - TDC 2015

ANALYTICS CASE CORPORATE SOCIAL NETWORKS

Page 14: Python for Data Science - TDC 2015

Full Data Analysis demo available in IPython Notebook

bit.ly/python4ds_nb

Page 15: Python for Data Science - TDC 2015

Investigative Analytics

Consumers: Humans

Page 16: Python for Data Science - TDC 2015

Inquire, Obtain, Scrub, Explore, Model, iNterpret

Page 17: Python for Data Science - TDC 2015

INQUIRE

1. Which communities are more popular?

2. Is the user engagement increasing?

3. What is the distribution of publishing time?

4. What is the distribution of user interactions?

5. Is there a relationship between publishing hour and number of interactions?

Page 18: Python for Data Science - TDC 2015

Inquire, Obtain, Scrub, Explore, Model, iNterpret

Page 19: Python for Data Science - TDC 2015

OBTAIN

•Download data from another location (e.g., a web page or server)

•Query data from a database (e.g., MySQL or Oracle)

•Extract data from an API (e.g., Twitter, Facebook)

•Extract data from another file (e.g., an HTML file or spreadsheet)

•Generate data yourself (e.g., reading sensors or taking surveys)

Page 20: Python for Data Science - TDC 2015

READING INTERACTIONS FROM CSV
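
The code shown on this slide is not part of the transcript. As a rough illustration of the step, here is a minimal sketch using only the standard library. The sample data and the column names (user_id, post_id, interaction_type) are assumptions made for the example, not the talk's dataset.

```python
import csv
import io

# Hypothetical sample standing in for the interactions file; the column
# names are assumptions for illustration only.
SAMPLE_CSV = """user_id,post_id,interaction_type
u1,p1,like
u2,p1,comment
u1,p2,like
"""

def read_interactions(file_obj):
    """Return the rows of a CSV file as a list of dicts keyed by the header."""
    return list(csv.DictReader(file_obj))

# With a real file: read_interactions(open("interactions.csv", newline=""))
interactions = read_interactions(io.StringIO(SAMPLE_CSV))
print(len(interactions), interactions[0]["interaction_type"])
```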

Page 21: Python for Data Science - TDC 2015

READING POSTS FROM JSON LINES
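
A JSON Lines file holds one JSON object per line, so it can be parsed line by line. The sketch below assumes hypothetical fields (post_id, community, text) and skips blank lines; the code actually shown in the talk is not in the transcript.

```python
import io
import json

# Hypothetical sample: one JSON document per line. Field names are
# assumptions for illustration only.
SAMPLE_JSONL = """{"post_id": "p1", "community": "architecture", "text": "JMeter tips"}

{"post_id": "p2", "community": "agile", "text": "Retrospectives"}
"""

def read_json_lines(file_obj):
    """Parse each non-blank line as an independent JSON object."""
    return [json.loads(line) for line in file_obj if line.strip()]

posts = read_json_lines(io.StringIO(SAMPLE_JSONL))
print(posts[0]["community"])
```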

Page 22: Python for Data Science - TDC 2015

Inquire, Obtain, Scrub, Explore, Model, iNterpret

Page 23: Python for Data Science - TDC 2015

SCRUB

Page 24: Python for Data Science - TDC 2015

SCRUB

Page 25: Python for Data Science - TDC 2015

SCRUB

Page 26: Python for Data Science - TDC 2015

SCRUB

Dealing with nulls
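
One common way to deal with nulls is to replace them with a per-column default. The sketch below is an assumption about what such a step could look like, not the code from the slide; the helper name fill_nulls and the sample fields are hypothetical.

```python
def fill_nulls(rows, defaults):
    """Return copies of rows where None or empty-string values get a default."""
    cleaned = []
    for row in rows:
        fixed = dict(row)  # copy, so the input rows are left untouched
        for key, default in defaults.items():
            if fixed.get(key) in (None, ""):
                fixed[key] = default
        cleaned.append(fixed)
    return cleaned

# Hypothetical records with missing values.
rows = [
    {"post_id": "p1", "community": "agile", "likes": 3},
    {"post_id": "p2", "community": None, "likes": None},
]
cleaned = fill_nulls(rows, {"community": "unknown", "likes": 0})
print(cleaned[1])
```

Dropping incomplete rows is the usual alternative; which one fits depends on how much data is missing and on the question being asked.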

Page 27: Python for Data Science - TDC 2015

SCRUB

Page 28: Python for Data Science - TDC 2015

Inquire, Obtain, Scrub, Explore, Model, iNterpret

Page 29: Python for Data Science - TDC 2015

1 - WHICH COMMUNITIES ARE MORE POPULAR?

Page 30: Python for Data Science - TDC 2015

1 - WHICH COMMUNITIES ARE MORE POPULAR?
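
The charts for this question are not in the transcript. Assuming popularity means the number of posts per community, a ranking can be sketched with a counter; the sample posts and the community field are hypothetical.

```python
from collections import Counter

# Hypothetical posts; only the community field matters here.
posts = [
    {"post_id": "p1", "community": "architecture"},
    {"post_id": "p2", "community": "agile"},
    {"post_id": "p3", "community": "architecture"},
    {"post_id": "p4", "community": "architecture"},
    {"post_id": "p5", "community": "agile"},
    {"post_id": "p6", "community": "devops"},
]

def community_ranking(posts):
    """Return (community, post_count) pairs, most popular first."""
    return Counter(post["community"] for post in posts).most_common()

ranking = community_ranking(posts)
print(ranking)
```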

Page 31: Python for Data Science - TDC 2015

2 - IS USER ENGAGEMENT INCREASING?

Page 32: Python for Data Science - TDC 2015

2 - IS USER ENGAGEMENT INCREASING?
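
One way to approach this question is to count interactions per month and look at the trend. The sketch below assumes timestamps in "YYYY-MM-DD HH:MM" format, which is an assumption about the data and not something stated in the talk.

```python
from collections import Counter
from datetime import datetime

# Hypothetical interaction timestamps.
timestamps = [
    "2015-01-10 09:15",
    "2015-01-22 14:00",
    "2015-02-03 10:30",
    "2015-02-14 16:45",
    "2015-02-27 11:05",
]

def interactions_per_month(timestamps):
    """Return (month, count) pairs in chronological order."""
    months = Counter(
        datetime.strptime(ts, "%Y-%m-%d %H:%M").strftime("%Y-%m")
        for ts in timestamps
    )
    return sorted(months.items())

monthly = interactions_per_month(timestamps)
print(monthly)
```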

Page 33: Python for Data Science - TDC 2015

3 - WHAT IS THE DISTRIBUTION OF PUBLISHING TIME?

Page 34: Python for Data Science - TDC 2015

4 - WHAT IS THE DISTRIBUTION OF USER INTERACTIONS?

Page 35: Python for Data Science - TDC 2015

4 - WHAT IS THE DISTRIBUTION OF USER INTERACTIONS?

Page 36: Python for Data Science - TDC 2015

4 - WHAT IS THE DISTRIBUTION OF USER INTERACTIONS?

Page 37: Python for Data Science - TDC 2015

5 - RELATIONSHIP BETWEEN PUBLISHING TIME AND NUMBER OF INTERACTIONS?

Page 38: Python for Data Science - TDC 2015

5 - RELATIONSHIP BETWEEN PUBLISHING TIME AND NUMBER OF INTERACTIONS?

Page 39: Python for Data Science - TDC 2015

5 - RELATIONSHIP BETWEEN PUBLISHING TIME AND NUMBER OF INTERACTIONS?

Page 40: Python for Data Science - TDC 2015

5 - RELATIONSHIP BETWEEN PUBLISHING TIME AND NUMBER OF INTERACTIONS?

http://viverdeblog.com/melhoresahorarios-para-postar-nas-redes-sociais/
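
A simple way to probe this relationship is to average interactions per publishing hour and to compute a Pearson correlation. Both are sketched here on hypothetical (hour, interactions) pairs. Pearson only captures linear association, and the hour of day is cyclic, so the per-hour averages are usually the more informative of the two.

```python
import math
from collections import defaultdict

# Hypothetical (publishing_hour, interaction_count) pairs, one per post.
posts = [(9, 12), (9, 8), (14, 3), (20, 5), (20, 7)]

def mean_interactions_by_hour(pairs):
    """Average interaction count for each publishing hour."""
    by_hour = defaultdict(list)
    for hour, count in pairs:
        by_hour[hour].append(count)
    return {hour: sum(c) / len(c) for hour, c in sorted(by_hour.items())}

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

by_hour = mean_interactions_by_hour(posts)
hours, counts = zip(*posts)
r = pearson(hours, counts)
print(by_hour, round(r, 2))
```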

Page 41: Python for Data Science - TDC 2015

Operational Analytics

Consumers: Machines

Page 42: Python for Data Science - TDC 2015

Inquire, Obtain, Scrub, Explore, Model, iNterpret

Page 43: Python for Data Science - TDC 2015

Operational Analytics Tasks example

Find Related Posts

1. Discover the most relevant words in the posts

2. Find related posts, with similar content

Page 44: Python for Data Science - TDC 2015

1 - RELEVANT WORDS IN A POST

TF-IDF: the most "relevant" terms in a document are those that are frequent in that document and rare in the other documents.
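
The definition above can be sketched in a few lines. This is the plain variant, tf * log(N / df), written with the standard library only; it is an illustration and not the code from the talk, and libraries often apply smoothed variants that give different numbers. The sample documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(documents)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            # Term frequency in this document times inverse document frequency.
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical, already tokenized posts.
docs = [
    "jmeter tests in ruby".split(),
    "jmeter performance tests".split(),
    "agile retrospectives in practice".split(),
]
weights = tf_idf(docs)
top_term = max(weights[0], key=weights[0].get)
print(top_term)
```

In the first document, "ruby" gets the highest weight because it appears in no other document, while "jmeter" and "tests" are shared with the second one.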

Page 45: Python for Data Science - TDC 2015

1 - RELEVANT WORDS IN A POST

Page 46: Python for Data Science - TDC 2015

1 - RELEVANT WORDS IN A POST

Page 47: Python for Data Science - TDC 2015

1 - RELEVANT WORDS IN A POST

Page 48: Python for Data Science - TDC 2015

BONUS - GLOBAL RELEVANT TERMS [ALL POSTS]

Page 49: Python for Data Science - TDC 2015

2 - SIMILAR POSTS

Cosine Similarity: a measure of similarity between two vectors, defined as the cosine of the angle between them.
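
For two vectors a and b, the cosine is the dot product divided by the product of their lengths. A minimal sketch, assuming each post is represented as a sparse {term: weight} dict (for example, TF-IDF weights); the representation and the function name are choices made for this example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors given as dicts."""
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm_a * norm_b)

post_a = {"jmeter": 1.0, "ruby": 1.0}
post_b = {"jmeter": 1.0, "performance": 1.0}
post_c = {"agile": 1.0}
print(cosine_similarity(post_a, post_b), cosine_similarity(post_a, post_c))
```

With non-negative weights the result lies between 0 (no shared terms) and 1 (same direction), so the most similar post is the one with the highest score.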

Page 50: Python for Data Science - TDC 2015

2 - SIMILAR POSTS

Page 51: Python for Data Science - TDC 2015

2 - SIMILAR POSTS

Original Post

Did you ever wonder how great it would be if you could write your jmeter tests in ruby ? This projects aims to do so. If you use it on your project just let me now. On the Architecture Academy you can read how jmeter can be used to validate your Architecture.

modulo 13 arch definition architecture validation | academia de arquitetura

Most similar post (cosine similarity = 0.30)

Foram disponibilizados no site Enterprise Architecture, na parte de Knowledge Base de performance, alguns how-tos relacionados a testes de performance. Entre eles, como definir os requisitos (throughput, cálculo de threads para o JMeter etc.), utilização do JMeter, geração de massa de dados e monitoramento.

planning and executing performance testing | enterprise architecture - how to identify performance acceptance criteria | enterprise architecture - how to geracao de massa de dados | enterprise architecture - how to jmeter | enterprise architecture - how to monitoramento | enterprise architecture

Page 52: Python for Data Science - TDC 2015

SIMILAR PEOPLE!

Page 53: Python for Data Science - TDC 2015

Inquire, Obtain, Scrub, Explore, Model, iNterpret

Page 54: Python for Data Science - TDC 2015

INTERPRET

•Drawing conclusions from your data

•Evaluating what your results mean

•Communicating your result

Page 55: Python for Data Science - TDC 2015

DATA PRODUCTS

"If information has context and the context is interactive, insights are not predictable."

[Agile Data Science, O’Reilly, 2014]

Page 56: Python for Data Science - TDC 2015

SENTIMENT ANALYSIS

bit.ly/eleicoes2014debatesbt

Analytical Dashboard

Page 57: Python for Data Science - TDC 2015

SENTIMENT ANALYSIS

Analytical Dashboard

bit.ly/eleicoes2014debatesbt

Page 58: Python for Data Science - TDC 2015

NETWORK ANALYSIS

https://linkedjazz.org/network/js

Page 59: Python for Data Science - TDC 2015

What about Python for Big Data?

Page 60: Python for Data Science - TDC 2015

PYTHON ON HADOOP

Streaming

HADOOPY

Pig UDFs in Jython

Page 61: Python for Data Science - TDC 2015

HADOOP STREAMING

Hadoop Streaming: allows MapReduce jobs to be written as any executable script, including Python scripts.
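
With Hadoop Streaming, the mapper and the reducer exchange tab-separated key/value lines over standard input and output, and the reducer receives its input sorted by key. The word-count sketch below is an assumption-level illustration, not code from the talk. It writes both steps as functions over lines so the flow can be simulated locally; in a real job each would be its own script reading sys.stdin, and the job submission command is not covered here.

```python
from itertools import groupby

def mapper(lines):
    """Emit 'word<TAB>1' for every word in the input lines."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"

def reducer(sorted_lines):
    """Sum the counts of consecutive lines that share the same key."""
    pairs = (line.split("\t") for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"

def run_locally(lines):
    """Simulate map -> sort (shuffle) -> reduce in a single process."""
    return list(reducer(sorted(mapper(lines))))

result = run_locally(["Python on Hadoop", "hadoop streaming with python"])
print(result)
```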

Page 62: Python for Data Science - TDC 2015

HADOOP STREAMING

http://workingsweng.com.br/2014/04/clusterizando-raios-com-hadoop-e-k-means-em-map-reduce/

K-Means with Python on MapReduce

140,000 lightning strikes on 28/02/2014 in 137 data files

Running on Amazon Elastic MapReduce

•Instances: 10 m1.small

•Time (k=10): 10 iterations => 32 minutes

•Time (k=50): 50 iterations => 164 minutes
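
K-Means fits MapReduce naturally: the map step assigns each point to its nearest centroid, the reduce step averages the points of each cluster into a new centroid, and each iteration is one such pass. The single-process sketch below illustrates that structure under the assumption of 2-D points; it is not the implementation from the linked post, and all names and sample coordinates are hypothetical.

```python
from collections import defaultdict

def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )

def map_step(points, centroids):
    """Map: emit (centroid_index, point) for every point."""
    for point in points:
        yield nearest(point, centroids), point

def reduce_step(assignments, centroids):
    """Reduce: replace each centroid by the mean of its assigned points."""
    groups = defaultdict(list)
    for index, point in assignments:
        groups[index].append(point)
    updated = list(centroids)  # centroids without points stay where they are
    for index, members in groups.items():
        updated[index] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return updated

def k_means(points, centroids, iterations):
    """Run a fixed number of map/reduce passes."""
    for _ in range(iterations):
        centroids = reduce_step(map_step(points, centroids), centroids)
    return centroids

# Hypothetical 2-D points forming two obvious groups.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
centroids = k_means(points, [(0.0, 0.0), (10.0, 10.0)], iterations=10)
print(centroids)
```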

Page 63: Python for Data Science - TDC 2015
Page 64: Python for Data Science - TDC 2015

IS DATA SCIENTIST THE NEW WEBMASTER?

Page 65: Python for Data Science - TDC 2015

[Doing Data Science, O’Reilly, 2014]

Page 66: Python for Data Science - TDC 2015

DATA SCIENCE COURSES

• Introduction to Data Science (Univ. of Washington)

• Data Science specialization (Johns Hopkins)

• Intro to Hadoop and MapReduce (Cloudera)

• Machine Learning (Stanford)

• Statistical Learning (Stanford)

• Mining Massive Datasets (Stanford)

• Scalable Machine Learning (Berkeley)

http://workingsweng.com.br/2014/04/cursos-mooc-e-especializacoes-em-data-science/

Page 67: Python for Data Science - TDC 2015

BOOKS

Page 68: Python for Data Science - TDC 2015

Happy data geeking!

Page 69: Python for Data Science - TDC 2015

Gabriel Moreira

@gspmoreira

http://about.me/gspmoreira

Thank you!

2015

PYTHON FOR DATA SCIENCE

Slides: http://bit.ly/python4ds_tdc

