
Automated data analysis with Python

Date post: 06-May-2015
Upload: gramener
Description:
Talk at #pyconindia2012
Page 1: Automated data analysis with Python

AUTOMATED DATA ANALYSIS

WITH PYTHON (PART II)

[email protected]

Page 2: Automated data analysis with Python

DO WE FOLLOW PEP8?

Page 3: Automated data analysis with Python

author         repo             filename                      errno  name                                                     count
Michael0x2a    axe              interpreter-parsing_rules.py  E231   missing whitespace after ','                             14177
egirault       googleplay       api-googleplay_pb2.py         E121   continuation line indentation is not a multiple of four  12953
mdwrigh2       pyice            parsetab.py                   E231   missing whitespace after ','                              5452
steviesteveo   projecteuler     euler22.py                    E231   missing whitespace after ','                              5162
xiongchiamiov  Mirage           mirage.py                     W191   indentation contains tabs                                 4593
Ariel          team             Ariel-compiler-Sintactic.py   E231   missing whitespace after ','                              4489
albertz        PyCParser        cparser.py                    W191   indentation contains tabs                                 3041
pombredanne    PyCParser        cparser.py                    W191   indentation contains tabs                                 3025
cshen          PyCParser        cparser.py                    W191   indentation contains tabs                                 3025
bohr           PyCParser        cparser.py                    W191   indentation contains tabs                                 2988
fj             PyCParser        cparser.py                    W191   indentation contains tabs                                 1863
steviesteveo   projecteuler     euler42.py                    E231   missing whitespace after ','                              1786
steviesteveo   projecteuler     wordlist.py                   E231   missing whitespace after ','                              1785
aixp           pycoco           Core.py                       E111   indentation is not a multiple of four                     1760
mdoege         3NewsFeed        newsfeed.py                   W191   indentation contains tabs                                 1738
mdoege         3NewsFeed        newsfeed.py                   E101   indentation contains mixed spaces and tabs                1738
ebegoli        EMLPy            w3c_ir_assertions.py          W191   indentation contains tabs                                 1650
AvsPmod        AvsPmod          AvsP.py                       E501   line too long (80 > 79 characters)                        1598
chikuzen       AvsPmod          AvsP.py                       E501   line too long (80 > 79 characters)                        1598
tweetr         python           twitter-twitterapi.py         E101   indentation contains mixed spaces and tabs                1507
duartebarbosa  googletranslate  Languages.py                  E101   indentation contains mixed spaces and tabs                1422
duartebarbosa  googletranslate  Languages.py                  W191   indentation contains tabs                                 1422
nrub           python           twitter-twitterapi.py         E101   indentation contains mixed spaces and tabs                1297
danudey        python           twitter-twitterapi.py         E101   indentation contains mixed spaces and tabs                1278
idan           python           twitter-twitterapi.py         E101   indentation contains mixed spaces and tabs                1278

Page 4: Automated data analysis with Python

import sys
import pandas as pd

data = pd.read_csv(sys.argv[1])

Page 5: Automated data analysis with Python

import sys
import pandas as pd

data = pd.read_csv(sys.argv[1])
# Total count per error name, smallest first
print(data.groupby('name')['count'].sum().sort_values())

tab after keyword                                   1
blank lines found after function decorator          2
tab after operator                                 28
tab before keyword                                 31
unexpected indentation                             41
expected an indented block                         78
multiple spaces after keyword                     120
...
blank line contains whitespace                  40543
no spaces around keyword / parameter equals     41858
indentation is not a multiple of four           44109
missing whitespace around operator              47286
indentation contains mixed spaces and tabs      52633
line too long (80 > 79 characters)              78201
missing whitespace after ','                    91612
indentation contains tabs                      168842
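
As a rough illustration of the aggregation on this slide, the sketch below builds a tiny in-memory table with the same `name` and `count` columns and totals the counts per error name. The rows and the `filename` values are made up for the example, not taken from the talk's data, and it assumes a recent pandas, where `sort_values` replaces the older `sort`.

```python
import pandas as pd

# Hypothetical rows in the same shape as the PEP8 report: one row per
# (file, error) pair, with the number of times that error occurred.
data = pd.DataFrame({
    'filename': ['a.py', 'a.py', 'b.py', 'c.py'],
    'name': ['indentation contains tabs', "missing whitespace after ','",
             'indentation contains tabs', "missing whitespace after ','"],
    'count': [5, 2, 7, 1],
})

# Total occurrences of each error across all files, smallest first.
totals = data.groupby('name')['count'].sum().sort_values()
print(totals)
```

Selecting the `count` column before calling `sum()` keeps the text columns out of the aggregation.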

Page 6: Automated data analysis with Python

LET’S TAKE MARKS

Page 7: Automated data analysis with Python

DIST_CODE        DOB        Day  Caste   B/G  Med  Cond  Total  SCHOOL_NAME                       Kannada  English  Hindi  Maths  Science  Social
CHIKKABALLAPUR   13-Jul-95  Thu  ST      G    K    N     111    PRIYADHARSHINI HIGH SCHOOL        46  7  10  30  8  10
GADAG            09-Feb-95  Thu  OTHERS  B    E    N     458    LOYALA HIGH SCHOOL GADAG          86  69  52  70  90  91
MANGALORE        27-Oct-95  Fri  OTHERS  B    K    N     390    GOVT.HIGH SCHOOL KOKKADA          105  35  65  76  67  42
BELGAUM          15-Jun-95  Thu  ST      B    M    N     151    MADYAMIKA VIDYALAYA BELAVATTI     14  23  25  26
MADHUGIRI        11-Sep-95  Mon  OTHERS  B    K    N     240    SRI KALIDASA VIDYAVARDHAKA H.S.   57  35  35  48  30  35
KOLAR            08-May-95  Mon  OTHERS  B    E    N     363    DR.AMBEDKAR HIGH SCHOOL           57  63  60  61  62  60
BIJAPUR          24-May-95  Wed  OTHERS  B    K    N     451    LOYOLA HIGH SCHOOL STATION BACK   90  51  87  79  81  63
UDUPI            05-Feb-96  Mon  SC      B    K    N     239    GOVT JUNIOR COLLEGE BAILOOR       54  30  65  30  30  30
BANGALORE NORTH  20-Oct-95  Fri  OTHERS  G    E    N     530    ST MARY'S HIGH SCHOOL NO 1 T      92  78  69  77
GULBARGA         03-Jan-95  Tue  OTHERS  G    K    N     397    GOVERNMENT HIGH SCHOOL ANDOLA,    96  47  61  65  67  61
BELGAUM          10-May-94  Tue  CAT-1   B    K    N     111    GOVERNMENT HIGH SCHOOL SULEBHAVI  21  35  9  22  18  6
BIJAPUR          10-Jul-95  Mon  OTHERS  B    K    N     380    H G P U COLLEGE SINDAGI BIJAPUR   87  43  69  65  60  56
CHIKODI          25-Apr-95  Tue  OTHERS  B    K    N     408    GOVERNMENT HIGH SCHOOL            94  54  85  47  63  65
SHIMOGA          18-Dec-95  Mon  SC      G    K    N     215    SAHYADRI HIGH SCHOOL SHIMOGA      44  35  40  31  30  35
BIJAPUR          18-Nov-93  Thu  SC      B    K    N     157    TILAGUL HIGH SCHOOL TILAGUL       29  12  35  20  31  30
KOLAR            26-Sep-93  Sun  SC      B    K    N     237    GOVERNMENT HIGH SCHOOL MEDIHAL    55  30  37  30  38  47
KOPPAL           01-Jun-93  Tue  OTHERS  B    K    N     254    GOVERNMENT HIGH SCHOOL HIRE       38  42  37  53  49  35
CHIKKABALLAPUR   21-Apr-96  Sun  OTHERS  B    K    N     251    GOVT. HIGH SCHOOL KADALAVENI      77  40  53  40  26  15
CHIKODI          25-Nov-95  Sat  OTHERS  B    M    N     477    ARUN SHAMARAO PATIL HIGH SCHOOL   70  80  66  77
BELGAUM          16-Feb-95  Thu  OTHERS  G    U    N     307    BEGUM LATIFA GIRLS HIGH SCHOOL    44  9  50  56

Page 8: Automated data analysis with Python

import sys
import pandas as pd

data = pd.read_csv(sys.argv[1])
# Average marks per district, lowest total first
print(data.groupby('DIST_CODE').mean(numeric_only=True).sort_values('TOTAL_MARKS'))

            TOTAL_MARKS    Kannada  ...  Social Science
DIST_CODE
BIDAR        245.018650  56.594794  ...       40.368867
YADGIR       285.778553  63.193738  ...       48.891916
MADHUGIRI    291.869219  73.725051  ...       43.854291
...
CHIKODI      354.548775  79.675186  ...       58.088485
SIRSI        356.859926  82.086493  ...       56.168686
UDUPI        358.532346  82.697818  ...       50.479084

Page 9: Automated data analysis with Python
Page 10: Automated data analysis with Python

[Figure: six panels, one per subject: KANNADA, ENGLISH, HINDI, MATHS, SCIENCE, SOCIAL SCIENCE]

Page 11: Automated data analysis with Python

HOW DO WE GENERALISE?

Page 12: Automated data analysis with Python

Groups (Dimensions)
Things you can group by: Place, Categories, Attributes
string, datetime, int

Numbers (Metrics)
Things you can measure: Sizes, Values, Growth, Frequencies
float, int
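
Since `int` appears under both groups and numbers, a program has to decide per column which role an integer plays. One possible heuristic, sketched below on made-up data, treats floats as numbers and treats an integer column as a number only when it takes many distinct values. The `split_columns` helper and its `max_levels` threshold are assumptions for illustration, not part of the talk.

```python
import pandas as pd
from pandas.api.types import is_float_dtype, is_integer_dtype


def split_columns(data, max_levels=3):
    """Split column names into (groups, numbers).

    Floats are always numbers. Integers are numbers only when they take
    more than `max_levels` distinct values (a hypothetical heuristic).
    Everything else is a group.
    """
    groups, numbers = [], []
    for col in data.columns:
        series = data[col]
        if is_float_dtype(series):
            numbers.append(col)
        elif is_integer_dtype(series) and series.nunique() > max_levels:
            numbers.append(col)
        else:
            groups.append(col)
    return groups, numbers


# Made-up rows: `grade` is an integer code, `kJ` an integer measurement.
data = pd.DataFrame({
    'category': ['dairy', 'dairy', 'cereals', 'cereals', 'icecream'],
    'grade': [1, 2, 1, 2, 1],
    'kJ': [216, 250, 1222, 1812, 1804],
    'rate': [0.21, 0.21, 1.57, 1.14, 0.75],
})

groups, numbers = split_columns(data)
print(groups)
print(numbers)
```

Here `grade` has only two distinct values and lands in the groups, while `kJ` has five and lands in the numbers. The right threshold depends on the dataset.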

Page 13: Automated data analysis with Python

category  title                                                  kJ  rate
dairy     Activia Pouring Natural Yogurt 1X950g                 216  0.21
dairy     Activia Pouring Strawberry Yogurt 1X950g              250  0.21
dairy     Activia Pouring Vanilla Yogurt 1X950g                 263  0.21
icecream  Almondy Daim 400G                                    1804  0.75
icecream  Almondy Toblerone 400G                               1850  0.5
cereals   Alpen 10 Pack Lite Summer Fruits Cereal Bars 210G    1222  1.57
cereals   Alpen 10Pk Fruit Nut And Chocolate Cereal Bars 290G  1812  1.14
cereals   Alpen Coconut And Chocolate Cereal Bars 5Pk 145G     1863  1.24
cereals   Alpen Fruit And Nut With Chocolate Cereal Bar 5X29g  1812  1.24
cereals   Alpen High Fruit 650G                                1439  0.4
cereals   Alpen Light Bars Chocolate And Orange 5X21g          1246  1.71
cereals   Alpen Light Chocolate And Fudge Bar 5X21g            1264  1.71
cereals   Alpen Light Sultana & Apple Bars 5Pk 105G            1197  1.71
cereals   Alpen Light Summer Fruits Bars 5Pk 105G              1222  1.71
cereals   Alpen No Added Sugar 1.3Kg                           1488  0.31
cereals   Alpen No Added Sugar 560G                            1488  0.46
cereals   Alpen Original 1.5Kg                                 1509  0.27
cereals   Alpen Original Muesli 750G                           1509  0.35
cereals   Alpen Raspberry And Yoghurt Cereal Bars 5x29g        1748  1.24
cereals   Alpen Strawberry With Yoghurt Cereal Bar 5X29g       1756  1.24
dairy     Alpro Natural Yofu 500G                                    0.28
dairy     Alpro Raspberry Vanilla Yofu 4X125g                        0.35
dairy     Alpro Strawberry And Fof Soya Yofu 4X125g                  0.35
dairy     Alpro Vanilla Yofu 500G                                    0.28

Which categories of food are light? Which are inexpensive?

Page 14: Automated data analysis with Python

import sys
import pandas as pd

data = pd.read_csv(sys.argv[1])
groups = data.dtypes[data.dtypes != float].index
numbers = data.dtypes[data.dtypes == float].index

>>> groups
Index([category, title], dtype=object)

>>> numbers
Index([kJ, rate], dtype=object)

Page 15: Automated data analysis with Python

import sys
import pandas as pd

data = pd.read_csv(sys.argv[1])
groups = data.dtypes[data.dtypes != float].index
numbers = data.dtypes[data.dtypes == float].index

for group in groups:
    # Average of every numeric column within each group
    ave = data.groupby(group).mean(numeric_only=True)
    for num in numbers:
        print(ave.sort_values(num, ascending=False))

Page 16: Automated data analysis with Python
Page 17: Automated data analysis with Python

LET’S APPLY THIS

MARKS
TRAINS
CRICKET

Page 18: Automated data analysis with Python

[Figure: three panels comparing Afghanistan’s s/r with Australia’s s/r, each on an axis running from 55 to 75]

High probability that s/r is different

Difference is large compared to the spread

Average probability that s/r is different

Low probability that s/r is different
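
The idea on this slide, that a difference matters only relative to the spread, can be sketched with a two-sample t statistic: the gap between the two means divided by its standard error. The numbers below are made up, and the Welch form of the statistic is an assumption chosen for the illustration; the talk does not say which variant it has in mind.

```python
from math import sqrt
from statistics import mean, variance


def t_statistic(a, b):
    """Welch's t: difference in means divided by its standard error."""
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error


# Both pairs have means of 60 and 70, so the difference is 10 in each case.
tight_a, tight_b = [59, 60, 61], [69, 70, 71]    # small spread
wide_a, wide_b = [40, 60, 80], [50, 70, 90]      # large spread

t_tight = t_statistic(tight_b, tight_a)
t_wide = t_statistic(wide_b, wide_a)
print(t_tight, t_wide)
```

The same difference of 10 gives a much larger statistic when the spread is small, which corresponds to a higher probability that the two averages really differ.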

Page 19: Automated data analysis with Python

WELCOME TO STATS 201

scipy.stats.mstats.ttest_ind
scipy.stats.mstats.f_oneway
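
A minimal sketch of how the two tests relate, on made-up samples: `ttest_ind` compares exactly two groups, `f_oneway` compares two or more, and with two groups they give the same answer. For simplicity this uses the plain `scipy.stats` functions rather than the masked-array `mstats` variants named on the slide; the sample values and team names are hypothetical.

```python
from scipy.stats import f_oneway, ttest_ind

# Made-up samples for two groups (not data from the talk).
team_a = [58.0, 61.5, 63.0, 66.5, 70.0]
team_b = [62.0, 66.0, 69.5, 71.0, 74.5]

# t-test: compares the means of exactly two groups
# (pooled variance, which is the default).
t, p_ttest = ttest_ind(team_a, team_b)

# One-way ANOVA: compares the means of two or more groups.
F, p_anova = f_oneway(team_a, team_b)

# With two groups the tests agree: F equals t squared and the p-values match.
print(t ** 2, F)
print(p_ttest, p_anova)
```

Because `f_oneway` accepts any number of samples, it is the natural choice when a column has more than two groups.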

Page 20: Automated data analysis with Python

import sys
import pandas as pd
from scipy.stats.mstats import f_oneway

data = pd.read_csv(sys.argv[1])
groups = data.dtypes[data.dtypes != float].index
numbers = data.dtypes[data.dtypes == float].index

for group in groups:
    grouped = data.groupby(group)
    ave = grouped.mean(numeric_only=True)
    for num in numbers:
        # One sample per group, with missing values dropped
        samples = [values.dropna() for _, values in grouped[num]]
        F, prob = f_oneway(*samples)
        print(prob)
        print(ave.sort_values(num, ascending=False))

Page 21: Automated data analysis with Python

LET’S APPLY THIS

GROCERIES
CRICKET
TRAINS

Page 22: Automated data analysis with Python
Page 23: Automated data analysis with Python

import sys
import pandas as pd
from scipy.stats.mstats import f_oneway

data = pd.read_csv(sys.argv[1])
groups = data.dtypes[data.dtypes != float].index
numbers = data.dtypes[data.dtypes == float].index

for group in groups:
    grouped = data.groupby(group)
    ave = grouped.mean(numeric_only=True)
    for num in numbers:
        # One sample per group, with missing values dropped
        samples = [values.dropna() for _, values in grouped[num]]
        F, prob = f_oneway(*samples)
        # Best group's average relative to the overall average
        improvement = ave[num].max() / data[num].mean() - 1
        print(improvement, prob)
        # print(ave.sort_values(num, ascending=False))
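
To make the `improvement` figure concrete, here is a small worked example on made-up marks for two hypothetical schools. It computes the same ratio: the best group's average divided by the overall average, minus one.

```python
import pandas as pd

# Made-up marks for two hypothetical schools.
data = pd.DataFrame({
    'school': ['A', 'A', 'B', 'B'],
    'marks': [10.0, 20.0, 30.0, 40.0],
})

ave = data.groupby('school')['marks'].mean()   # A: 15.0, B: 35.0
overall = data['marks'].mean()                 # 25.0

# How far the best group's average sits above the overall average.
improvement = ave.max() / overall - 1
print(improvement)
```

The best school averages 35 against an overall 25, so the improvement is about 0.4, or 40%. Read together with `prob`, it says how much could be gained and how likely the gap is to be real.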

Page 24: Automated data analysis with Python

LET’S APPLY THIS

GROCERIES
CRICKET
MARKS
TRAINS

Page 25: Automated data analysis with Python

Hypotheses Data Insight

Data Autolysis

TAKE ANY DATASET

THROW IT AT A PROGRAM

GET INSIGHTS

Page 26: Automated data analysis with Python

DIRECTIONS

CROSS TABULATIONS
CORRELATIONS
OUTLIERS
HULLS

Page 27: Automated data analysis with Python

We handle terabyte-size data

via non-traditional analytics and visualise it in real-time.

A data analytics and visualisation company

We’re recruiting

[email protected]

