
Aileen Nielsen - NoSQL Python: making data frames work for you in a non-rectangular world

Date post: 16-Apr-2017
Upload: pydata
Transcript
Page 1: Aileen Nielsen - NoSQL Python: making data frames work for you in a non-rectangular world

NO-SQL PYTHON

Aileen Nielsen

Software Engineer, One Drop, NYC

[email protected]

1

Page 2: Aileen Nielsen - NoSQL Python: making data frames work for you in a non-rectangular world

OUTLINE

1. WHY? (OTHER THAN THE TRENDY NAME)

2. HOW?

3. WHY? (AGAIN)

2

Page 3: Aileen Nielsen - NoSQL Python: making data frames work for you in a non-rectangular world

1. WHY?

3

Page 4: Aileen Nielsen - NoSQL Python: making data frames work for you in a non-rectangular world

LET’S START WITH STANDARD SQL-LIKE, TIDY DATA

Name Day Score

Allen 1 25

Joe 3 17

Joe 2 14

Mary 2 14

Mary 1 11

Allen 3 9

Mary 3 9

Joe 1 1

4

Page 5: Aileen Nielsen - NoSQL Python: making data frames work for you in a non-rectangular world

LET’S START WITH STANDARD SQL-LIKE, TIDY DATA

What makes this data tidy?

Name Day Score

Allen 1 25

Joe 3 17

Joe 2 14

Mary 2 14

Mary 1 11

Allen 3 9

Mary 3 9

Joe 1 1

5

Page 6: Aileen Nielsen - NoSQL Python: making data frames work for you in a non-rectangular world

LET’S START WITH STANDARD SQL-LIKE, TIDY DATA

What makes this data tidy?

• Observations are in rows

Name Day Score

Allen 1 25

Joe 3 17

Joe 2 14

Mary 2 14

Mary 1 11

Allen 3 9

Mary 3 9

Joe 1 1

6

Page 7: Aileen Nielsen - NoSQL Python: making data frames work for you in a non-rectangular world

LET’S START WITH STANDARD SQL-LIKE, TIDY DATA

What makes this data tidy?

• Observations are in rows

• Variables are in columns

Name Day Score

Allen 1 25

Joe 3 17

Joe 2 14

Mary 2 14

Mary 1 11

Allen 3 9

Mary 3 9

Joe 1 1

7

Page 8: Aileen Nielsen - NoSQL Python: making data frames work for you in a non-rectangular world

LET’S START WITH STANDARD SQL-LIKE, TIDY DATA

What makes this data tidy?

• Observations are in rows

• Variables are in columns

• Contained in a single data set

Name Day Score

Allen 1 25

Joe 3 17

Joe 2 14

Mary 2 14

Mary 1 11

Allen 3 9

Mary 3 9

Joe 1 1

8

Page 9: Aileen Nielsen - NoSQL Python: making data frames work for you in a non-rectangular world

LET’S START WITH STANDARD SQL-LIKE, TIDY DATA

What makes this data tidy?

• Observations are in rows

• Variables are in columns

• Contained in a single data set

Name Day Score

Allen 1 25

Joe 3 17

Joe 2 14

Mary 2 14

Mary 1 11

Allen 3 9

Mary 3 9

Joe 1 1

But can you tell me anything useful about this data set?

9

Page 10: Aileen Nielsen - NoSQL Python: making data frames work for you in a non-rectangular world

LET’S START WITH STANDARD SQL-LIKE, TIDY DATA

Name Day Score

Allen 1 25

Joe 3 17

Joe 2 14

Mary 2 14

Mary 1 11

Allen 3 9

Mary 3 9

Joe 1 1

Sure. These are easy to see:

• Highest score

• Lowest score

• Total observations

10

Page 11: Aileen Nielsen - NoSQL Python: making data frames work for you in a non-rectangular world

LET’S START WITH STANDARD SQL-LIKE, TIDY DATA

Name Day Score

Allen 1 25

Joe 3 17

Joe 2 14

Mary 2 14

Mary 1 11

Allen 3 9

Mary 3 9

Joe 1 1

Not-so-easy

• How many people?

• Who’s doing the best?

• Who’s doing the worst?

• How are individuals doing?
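
The easy and not-so-easy questions above can both be answered in pandas, but the per-person ones need a groupby. A minimal sketch, assuming the table is typed in by hand and that "doing the best" means the highest mean score (that definition is an assumption, not something the slides state):

```python
import pandas as pd

# The tidy table from the slide, typed in by hand.
df = pd.DataFrame({
    "Name":  ["Allen", "Joe", "Joe", "Mary", "Mary", "Allen", "Mary", "Joe"],
    "Day":   [1, 3, 2, 2, 1, 3, 3, 1],
    "Score": [25, 17, 14, 14, 11, 9, 9, 1],
})

# Easy: one pass over a single column.
highest = df["Score"].max()
lowest = df["Score"].min()
total = len(df)

# Not so easy: every question needs the rows regrouped by person first.
n_people = df["Name"].nunique()
mean_by_person = df.groupby("Name")["Score"].mean()
best = mean_by_person.idxmax()    # assumed definition of "doing the best"
worst = mean_by_person.idxmin()

print(highest, lowest, total, n_people, best, worst)
```

Every per-person question starts with the same regrouping step, which is the friction the talk is pointing at.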

11

Page 12: Aileen Nielsen - NoSQL Python: making data frames work for you in a non-rectangular world

HOW ABOUT NOW?

What Changed?

• The data’s still tidy, but we’ve changed the organizing principle

Name Score Day

Allen 25 1

Mary 11 1

Joe 1 1

Mary 14 2

Joe 14 2

Joe 17 3

Allen 9 3

Mary 9 3

12

Page 13: Aileen Nielsen - NoSQL Python: making data frames work for you in a non-rectangular world

OK HOW ABOUT NOW? (LAST TIME I PROMISE)

Name Ordered Scores

Joe [1, 14, 17]

Mary [11, 14, 9]

Allen [25, NA, 9]

13

Page 14: Aileen Nielsen - NoSQL Python: making data frames work for you in a non-rectangular world

OK HOW ABOUT NOW? (LAST TIME I PROMISE)

Name Ordered Scores

Joe [1, 14, 17]

Mary [11, 14, 9]

Allen [25, NA, 9]

This data’s NOT TIDY but...

14

Page 15: Aileen Nielsen - NoSQL Python: making data frames work for you in a non-rectangular world

OK HOW ABOUT NOW? (LAST TIME I PROMISE)

This data’s NOT TIDY but...

I can eyeball it easily

Name Ordered Scores

Joe [1, 14, 17]

Mary [11, 14, 9]

Allen [25, NA, 9]

15

Page 16: Aileen Nielsen - NoSQL Python: making data frames work for you in a non-rectangular world

OK HOW ABOUT NOW? (LAST TIME I PROMISE)

This data’s NOT TIDY but...

I can eyeball it easily

And new questions become interesting and easier to answer

Name Ordered Scores

Joe [1, 14, 17]

Mary [11, 14, 9]

Allen [25, NA, 9]

16

Page 17: Aileen Nielsen - NoSQL Python: making data frames work for you in a non-rectangular world

OK HOW ABOUT NOW? (LAST TIME I PROMISE)

• How many students are there?

• Who improved?

• Who missed a test?

• Who was kind of meh?

Name Ordered Scores

Joe [1, 14, 17]

Mary [11, 14, 9]

Allen [25, NA, 9]
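
With one row per student, these questions turn into short apply() calls. A sketch under stated assumptions: the missing score is stored as None, and "improved" is taken to mean strictly increasing scores over the tests actually taken (both are illustrative choices, not from the slides):

```python
import pandas as pd

# One entry per student; each value is that student's ordered list of scores.
nested = pd.Series(
    {"Joe": [1, 14, 17], "Mary": [11, 14, 9], "Allen": [25, None, 9]},
    name="Ordered Scores",
)

n_students = len(nested)

# Who missed a test? Any None in the list.
missed = nested.apply(lambda scores: any(s is None for s in scores))

def improved(scores):
    """True if the scores that exist are strictly increasing (assumed definition)."""
    taken = [s for s in scores if s is not None]
    return all(a < b for a, b in zip(taken, taken[1:]))

improved_flags = nested.apply(improved)

print(n_students)
print(list(missed[missed].index))
print(list(improved_flags[improved_flags].index))
```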

17

Page 18: Aileen Nielsen - NoSQL Python: making data frames work for you in a non-rectangular world

DON’T GET MAD

I’m not saying to kill tidy

Name Ordered Scores

Joe [1, 14, 17]

Mary [11, 14, 9]

Allen [25, NA, 9]

18

Page 19: Aileen Nielsen - NoSQL Python: making data frames work for you in a non-rectangular world

DON’T GET MAD

I’m not saying to kill tidy

But I worry we don’t use certain methods more often because it’s not as easy as it could be.

Name Ordered Scores

Joe [1, 14, 17]

Mary [11, 14, 9]

Allen [25, NA, 9]

19

Page 20: Aileen Nielsen - NoSQL Python: making data frames work for you in a non-rectangular world

BEFORE I GOT INTO THE ‘NOSQL’ MINDSET I SIGHED WHEN ASKED QUESTIONS LIKE…

• App analytics What usage patterns do we see in our long-term app users? How do those patterns evolve over time at the individual level?

20

Page 21: Aileen Nielsen - NoSQL Python: making data frames work for you in a non-rectangular world

BEFORE I GOT INTO THE ‘NOSQL’ MINDSET I SIGHED WHEN ASKED QUESTIONS LIKE THESE

• App analytics What usage patterns do we see in our long-term app users? How do those patterns evolve over time at the individual level?

• Health research Can we predict early on in an experiment what’s likely to happen? Do our experiments need to be as long as they are?

21

Page 22: Aileen Nielsen - NoSQL Python: making data frames work for you in a non-rectangular world

BEFORE I GOT INTO THE ‘NOSQL’ MINDSET I SIGHED THINKING ABOUT…

• App analytics What usage patterns do we see in our long-term app users? How do those patterns evolve over time at the individual level?

• Health research Can we predict early on in an experiment what’s likely to happen? Do our experiments need to be as long as they are?

• Consumer research Do people like things because they like them or because of the ordering they saw them in?

22

Page 23: Aileen Nielsen - NoSQL Python: making data frames work for you in a non-rectangular world

I ALSO TENDED NOT TO ASK NO-SQL QUESTIONS TOO OFTEN

• Status quo bias: humans tend to take whatever default is presented. That happens in data analysis too.

23

Page 24: Aileen Nielsen - NoSQL Python: making data frames work for you in a non-rectangular world

I ALSO TENDED NOT TO ASK NO-SQL QUESTIONS TOO OFTEN

• Status quo bias: humans tend to take whatever default is presented. That happens in data analysis too.

• Endowment effect: humans tend to want what they already have and think it’s more valuable than what’s offered for a trade.

24

Page 25: Aileen Nielsen - NoSQL Python: making data frames work for you in a non-rectangular world

I ALSO TENDED NOT TO ASK NO-SQL QUESTIONS TOO OFTEN

• Status quo bias: humans tend to take whatever default is presented. That happens in data analysis too.

• Endowment effect: humans tend to want what they already have and think it’s more valuable than what’s offered for a trade.

• Especially deep finding: humans are lazy

25

Page 26: Aileen Nielsen - NoSQL Python: making data frames work for you in a non-rectangular world

26

IT’S TRUE. YOU CAN ANSWER THESE QUESTIONS WITH THE TIDY DATA FRAMES I JUST SHOWED YOU.

You can always ‘reconstruct’ these trajectories of what happened by making a data frame per user

>>> df
    Name  Day  Score
0  Allen    1     25
1    Joe    3     17
2    Joe    2     14
3   Mary    2     14
4   Mary    1     11
5  Allen    3      9
6   Mary    3      9
7    Joe    1      1

Option 1:
>>> no_sql_df = df.groupby('Name').apply(lambda df: list(df.sort_values(by='Day')['Score']))
>>> no_sql_df
Name
Allen        [25, 9]
Joe      [1, 14, 17]
Mary     [11, 14, 9]

Page 27: Aileen Nielsen - NoSQL Python: making data frames work for you in a non-rectangular world

IT’S TRUE. YOU CAN ANSWER THESE QUESTIONS WITH THE TIDY DATA FRAMES I JUST SHOWED YOU.

27

You can always ‘reconstruct’ these trajectories of what happened by making a data frame per user

>>> df
    Name  Day  Score
0  Allen    1     25
1    Joe    3     17
2    Joe    2     14
3   Mary    2     14
4   Mary    1     11
5  Allen    3      9
6   Mary    3      9
7    Joe    1      1

Option 2:
>>> new_list = []
>>> for name, group in df.groupby('Name'):
...     # list() so the (day, score) pairs are materialized, not left as a zip iterator
...     new_list.append({name: list(zip(group['Day'], group['Score']))})
...
>>> new_list
[{'Allen': [(1, 25), (3, 9)]}, {'Joe': [(3, 17), (2, 14), (1, 1)]}, {'Mary': [(2, 14), (1, 11), (3, 9)]}]

Page 28: Aileen Nielsen - NoSQL Python: making data frames work for you in a non-rectangular world

IT’S TRUE. YOU CAN ANSWER THESE QUESTIONS WITH THE TIDY DATA FRAMES I JUST SHOWED YOU.

28

You can always ‘reconstruct’ these trajectories of what happened by making a data frame per user

>>> df
    Name  Day  Score
0  Allen    1     25
1    Joe    3     17
2    Joe    2     14
3   Mary    2     14
4   Mary    1     11
5  Allen    3      9
6   Mary    3      9
7    Joe    1      1

Option 3:
>>> def process(new_df):
...     return [new_df[new_df['Day']==i]['Score'].values[0] if i in list(new_df['Day']) else None for i in range(1,4)]
...
>>> df.groupby(['Name']).apply(process)
Name
Allen    [25, None, 9]
Joe        [1, 14, 17]
Mary       [11, 14, 9]
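
A different route to the same per-person trajectories is to pivot first and let pandas fill the gaps. This is a swapped-in technique, not one of the three options on the slides, and it marks the missing day with NaN rather than None:

```python
import math
import pandas as pd

df = pd.DataFrame({
    "Name":  ["Allen", "Joe", "Joe", "Mary", "Mary", "Allen", "Mary", "Joe"],
    "Day":   [1, 3, 2, 2, 1, 3, 3, 1],
    "Score": [25, 17, 14, 14, 11, 9, 9, 1],
})

# One row per person, one column per day; missing (Name, Day) pairs become NaN.
wide = df.pivot(index="Name", columns="Day", values="Score")

# Collapse each row into an ordered list of scores.
trajectories = pd.Series(wide.values.tolist(), index=wide.index)

print(trajectories)
```

Because NaN forces the scores to float, the lists hold 25.0 rather than 25; whether that matters depends on what you do next.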

Page 29: Aileen Nielsen - NoSQL Python: making data frames work for you in a non-rectangular world

LET’S BE HONEST…NO ONE WANTS THAT TO BE A FIRST STEP TO EVERY QUERY…AND INCREASINGLY WE’LL BE REQUIRED TO MAKE THESE SORTS OF QUERIES

• Google ads (well…maybe less so in Europe)

29

Page 30: Aileen Nielsen - NoSQL Python: making data frames work for you in a non-rectangular world

LET’S BE HONEST…NO ONE WANTS THAT TO BE A FIRST STEP TO EVERY QUERY…AND INCREASINGLY WE’LL BE REQUIRED TO MAKE THESE SORTS OF QUERIES

• Google ads (well…maybe less so in Europe)

• Wearable sensors

30

Page 31: Aileen Nielsen - NoSQL Python: making data frames work for you in a non-rectangular world

LET’S BE HONEST…NO ONE WANTS THAT TO BE A FIRST STEP TO EVERY QUERY…AND INCREASINGLY WE’LL BE REQUIRED TO MAKE THESE SORTS OF QUERIES

• Google ads (well…maybe less so in Europe)

• Wearable sensors

• The unit of an observation should be the actor not the particular action observed at a particular time.

31

Page 32: Aileen Nielsen - NoSQL Python: making data frames work for you in a non-rectangular world

LET’S BE HONEST…NO ONE WANTS THAT TO BE A FIRST STEP TO EVERY QUERY…AND INCREASINGLY WE’LL BE REQUIRED TO MAKE THESE SORTS OF QUERIES

• Google ads (well…maybe less so in Europe)

• Wearable sensors

• The unit of an observation should be the actor not the particular action observed at a particular time.

• Maybe we should rethink what we mean by ‘observations’

32

Page 33: Aileen Nielsen - NoSQL Python: making data frames work for you in a non-rectangular world

• High scalability

• Distributed computing

• Schema flexibility

• Semi-structured data

• No complex relationships

• Schemas change all the time

• Patterns change all the time

• Same units of interest repeating new things

33

We don’t look for No-SQL because we have No-SQL databases... We have No-SQL Databases because we have no-SQL data.

Page 34: Aileen Nielsen - NoSQL Python: making data frames work for you in a non-rectangular world

WHAT IS NO-SQL PYTHON?

Data that doesn’t seem like it fits in a data frame

• Arbitrarily nested data

• Ragged data

• Comparative time series

34

Page 35: Aileen Nielsen - NoSQL Python: making data frames work for you in a non-rectangular world

WHERE DO WE FIND NO-SQL DATA?

Here’s where I’ve found it…

• Physics lab

• Running data

• Health data

• Reddit

35

Page 36: Aileen Nielsen - NoSQL Python: making data frames work for you in a non-rectangular world

2. HOW?

36

Page 37: Aileen Nielsen - NoSQL Python: making data frames work for you in a non-rectangular world

GETTING THE DATA INTO PYTHON WHEN IT’S STRAIGHTFORWARD

• Scenario: you’re grabbing a bunch of NoSQL data from an API or from a NoSQL db.

• We’ll stick with JSON since it’s a common format.

• Best case scenario: you’ll take everything however you can get it. In this case stick with pandas. json_normalize works great.

37

Page 38: Aileen Nielsen - NoSQL Python: making data frames work for you in a non-rectangular world

JSON_NORMALIZE WORKS PRETTY WELL

38

{"samples": [
  {"name": "Jane Doe",
   "age": 42,
   "profession": "architect",
   "series": [
     {"day": 0, "measurement_value": 0.97},
     {"day": 1, "measurement_value": 1.55},
     {"day": 2, "measurement_value": 0.67}
   ]},
  {"name": "Bob Smith",
   "hobbies": ["tennis", "cooking"],
   "age": 37,
   "series": {"day": 0, "measurement_value": 1.25}}
]}

Page 43: Aileen Nielsen - NoSQL Python: making data frames work for you in a non-rectangular world

JSON_NORMALIZE WORKS PRETTY WELL

43

{"samples": [
  {"name": "Jane Doe",
   "age": 42,
   "profession": "architect",
   "series": [
     {"day": 0, "measurement_value": 0.97},
     {"day": 1, "measurement_value": 1.55},
     {"day": 2, "measurement_value": 0.67}
   ]},
  {"name": "Bob Smith",
   "hobbies": ["tennis", "cooking"],
   "age": 37,
   "series": {"day": 0, "measurement_value": 1.25}}
]}

>> with open(json_file) as data_file:
>>     data = json.load(data_file)
>> normalized_data = json_normalize(data['samples'])

Easy to process

>> print(normalized_data['series'][0][1])
>> {u'measurement_value': 1.55, u'day': 1}

Easy to add columns

>> normalized_data['length'] = normalized_data['series'].apply(len)

Basically, it just works
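
If you do want the nested series flattened into rows, json_normalize can do that too via its record_path and meta parameters. A minimal sketch, assuming pandas 1.0 or later (where the function is available as pd.json_normalize) and assuming Bob's single-object "series" should be wrapped in a list first:

```python
import pandas as pd

data = {"samples": [
    {"name": "Jane Doe", "age": 42, "profession": "architect",
     "series": [{"day": 0, "measurement_value": 0.97},
                {"day": 1, "measurement_value": 1.55},
                {"day": 2, "measurement_value": 0.67}]},
    {"name": "Bob Smith", "hobbies": ["tennis", "cooking"], "age": 37,
     "series": {"day": 0, "measurement_value": 1.25}},
]}

# Bob's "series" is a single object rather than a list; wrap it so every
# sample has the same shape before flattening.
for sample in data["samples"]:
    if isinstance(sample["series"], dict):
        sample["series"] = [sample["series"]]

# One row per measurement, with the person's name and age repeated alongside.
long_df = pd.json_normalize(data["samples"], record_path="series",
                            meta=["name", "age"])
print(long_df)
```

This gives you the tidy view back when you need it, while the nested column stays available for per-person questions.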

Page 47: Aileen Nielsen - NoSQL Python: making data frames work for you in a non-rectangular world

USING SOME PROGRAMMER STUFF ALSO HELPS

47

class dfList(list):
    def __init__(self, originalValue):
        # accept a list as-is; coerce anything else to a list first
        if isinstance(originalValue, list):
            list.__init__(self, originalValue)
        else:
            list.__init__(self, list(originalValue))

    def __getitem__(self, item):
        # ITEM_TO_GET names the field to pull out of each element; it is defined elsewhere
        result = list.__getitem__(self, item)
        try:
            return result[ITEM_TO_GET]
        except Exception:
            return result

    def __iter__(self):
        for i in range(len(self)):
            yield self.__getitem__(i)

    def __call__(self):
        # catch-all: here, the mean of the extracted values
        return sum(self)/list.__len__(self)

• Subclass an iterable to shorten your apply() calls

• In particular, you need to override at least __getitem__ and __iter__

• You should probably override __init__ as well for the case of inconsistent format

• Then __call__ can be a catch-all adjustable function. It is best to have it call a class function, which you can adjust at will anytime.
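
Here is how such a subclass might be used with apply(). This is a sketch, not the speaker's class: the name SeriesList, the class attribute `field`, and the choice to wrap a lone record in a one-element list are all assumptions made for illustration.

```python
import pandas as pd

class SeriesList(list):
    """Hypothetical list subclass: indexing returns one field of each record."""
    field = "measurement_value"  # assumed field name; reassign on the class to change it

    def __init__(self, original):
        # Accept a list of records, or a single record (inconsistent format).
        super().__init__(original if isinstance(original, list) else [original])

    def __getitem__(self, item):
        result = super().__getitem__(item)
        try:
            return result[self.field]
        except (TypeError, KeyError, IndexError):
            return result

    def __iter__(self):
        for i in range(len(self)):
            yield self[i]

    def __call__(self):
        # Catch-all summary: the mean of the extracted field.
        return sum(self) / len(self)

col = pd.Series([
    [{"day": 0, "measurement_value": 0.97},
     {"day": 1, "measurement_value": 1.55},
     {"day": 2, "measurement_value": 0.67}],
    {"day": 0, "measurement_value": 1.25},
])

wrapped = col.apply(SeriesList)          # one SeriesList per row
means = wrapped.apply(lambda s: s())     # the apply() call stays short
print(means)
```

Setting `SeriesList.field = "day"` afterwards would make the same apply() calls summarize the day numbers instead.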

Page 48: Aileen Nielsen - NoSQL Python: making data frames work for you in a non-rectangular world

CUSTOM CLASSES PAIR NICELY WITH CLASS METHODS

48

class Test:
    def __init__(self, name):
        self.name1 = name

    def print_class_instance(instance):
        print(instance.name1)

    def print_self(self):
        self.__class__.print_class_instance(self)

>>> test1 = Test('test1')
>>> test1.print_self()
test1
>>> def new_printing(instance):
...     print("Now I'm printing a constant string")
...
>>> test1.print_self()
test1
>>> Test.print_class_instance = new_printing
>>> test1.print_self()
Now I'm printing a constant string

• Design flexible classes that often reference class methods rather than instance methods

• Then as you are processing data, you can quickly swap out methods to call different field names in the event of highly nested JSON

• Data processing is faster and no mental gymnastics or annoying parse efforts required

Page 49: Aileen Nielsen - NoSQL Python: making data frames work for you in a non-rectangular world

GETTING NOSQL DATA: COMMONLY-ENCOUNTERED PROBLEMS

• CSVs with arrays

• Highly-nested JSON

• Unknown or Unreliably formatted API results

49

Page 50: Aileen Nielsen - NoSQL Python: making data frames work for you in a non-rectangular world

SOMETIMES YOU GET WEIRD CSV FILES…

• Sometimes your problem is as simple as getting a csv file with nested data

50

Page 51: Aileen Nielsen - NoSQL Python: making data frames work for you in a non-rectangular world

SOMETIMES YOU GET WEIRD CSV FILES…

• Sometimes your problem is as simple as getting a csv file with nested data

• This is pretty straightforward to deal with…use regex and common Python string operations to clean up the data

51

Page 52: Aileen Nielsen - NoSQL Python: making data frames work for you in a non-rectangular world

SOMETIMES YOU GET WEIRD CSV FILES…

• Sometimes your problem is as simple as getting a csv file with nested data

• This is pretty straightforward to deal with…use regex and common Python string operations to clean up the data

• Apply() is your best friend

52

Page 53: Aileen Nielsen - NoSQL Python: making data frames work for you in a non-rectangular world

SOMETIMES YOU GET WEIRD CSV FILES…

• Sometimes your problem is as simple as getting a csv file with nested data

• This is pretty straightforward to deal with…use regex and common Python string operations to clean up the data

• Apply() is your best friend

• Common problem: spaces between the “,” and the column name or column value. Pass the skipinitialspace parameter to avoid it: df = pd.read_csv("in.csv", sep=",", skipinitialspace=True)
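
A quick illustration of the skipinitialspace point, using an in-memory CSV whose contents are made up for this sketch:

```python
import io
import pandas as pd

raw = "name, age\njoe, 28\nmary, 36\n"

# Default parsing keeps the space after the comma as part of the column name.
plain = pd.read_csv(io.StringIO(raw))

# skipinitialspace=True drops whitespace that follows the delimiter.
clean = pd.read_csv(io.StringIO(raw), skipinitialspace=True)

print(list(plain.columns))
print(list(clean.columns))
```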

53

Page 57: Aileen Nielsen - NoSQL Python: making data frames work for you in a non-rectangular world

SOMETIMES YOU GET WEIRD CSV FILES…

57

name,favorites,age
joe,"[madonna,elvis,u2]",28
mary,"[lady gaga, adele]",36
allen,"[beatles, u2, adele, rolling stones]"

This isn’t even that weird

>> df = pd.read_csv(file_name, sep =",")

Downright straightforward

Hmmm….

>> print(df['favorites'][0][1])
>> m

Regex to the rescue…Python’s exceptionally easy string parsing is a huge asset for No-SQL parsing

>> df['favorites'] = df['favorites'].apply(lambda s: s[1:-1].split(','))
>> print(df['favorites'][0][1])
>> elvis
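
A slightly more defensive version of the same idea, sketched with an in-memory copy of the file. The helper name parse_list is hypothetical; it also trims the spaces after commas and tolerates cells that are not strings:

```python
import io
import pandas as pd

raw = '''name,favorites,age
joe,"[madonna,elvis,u2]",28
mary,"[lady gaga, adele]",36
allen,"[beatles, u2, adele, rolling stones]"
'''

df = pd.read_csv(io.StringIO(raw))

def parse_list(cell):
    """Turn a string like '[a, b, c]' into ['a', 'b', 'c']."""
    if not isinstance(cell, str):
        return []                      # e.g. NaN for a missing cell
    inner = cell.strip().strip("[]")
    return [item.strip() for item in inner.split(",") if item.strip()]

df["favorites"] = df["favorites"].apply(parse_list)
print(df["favorites"][0][1])
print(df["favorites"][1])
```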

Page 58: Aileen Nielsen - NoSQL Python: making data frames work for you in a non-rectangular world

WHAT ABOUT THIS ONE?

58

name,favorites,age
joe,[madonna,elvis,u2],28
mary,[lady gaga, adele],36
allen,[beatles, u2, adele, rolling stones]

This isn’t even that weird

>> df = pd.read_csv(file_name, sep =",")

Downright straightforward?

Actually this fails miserably

>> print(df['favorites'])
>> joe [madonna elvis u2]
mary [lady gaga adele] 36
Name: name, dtype: object

We need more regex…this time before applying read_csv()....

Page 61: Aileen Nielsen - NoSQL Python: making data frames work for you in a non-rectangular world

61

name,favorites,age
joe,[madonna,elvis,u2],28
mary,[lady gaga, adele],36
allen,[beatles, u2, adele, rolling stones]

Missing quotes around arrays:

Basically, put in the quotation marks to help out read_csv()

import re
import pandas as pd

pattern = r"(\[.*\])"
with open(file_name) as f, open(write_file, 'w') as write_f:
    for line in f:
        # wrap each bracketed array in quotes so read_csv() treats it as one field
        new_line = re.sub(pattern, r'"\1"', line)
        write_f.write(new_line)

new_df = pd.read_csv(write_file)

With multiple arrays per row, you’re gonna need to accommodate the greedy nature of regex:

pattern = r"(\[.*?\])"
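
The difference between the greedy and non-greedy patterns shows up as soon as a row holds two arrays. A small sketch on a made-up line:

```python
import re

line = "joe,[madonna,elvis],[pizza,tacos],28\n"

# Greedy: .* runs to the LAST closing bracket, so both arrays get one pair of quotes.
greedy = re.sub(r"(\[.*\])", r'"\1"', line)

# Non-greedy: .*? stops at the FIRST closing bracket, so each array is quoted on its own.
lazy = re.sub(r"(\[.*?\])", r'"\1"', line)

print(greedy, end="")
print(lazy, end="")
```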

Page 62: Aileen Nielsen - NoSQL Python: making data frames work for you in a non-rectangular world

62

THAT WAS A LOT OF TEXT…ALMOST DONE

Page 63: Aileen Nielsen - NoSQL Python: making data frames work for you in a non-rectangular world

SOMETIMES YOU GET JSON AND YOU KNOW THE STRUCTURE, YOU JUST DON’T LIKE IT

• Use json_normalize() and then shed columns you don’t want. You’ve seen that today already (slides 32-38).

• Use some magic: sh with jq module to simplify your life…you can pick out the fields you want with jq either on the command line or with sh

• jq has a straightforward, easy to learn syntax: . = value, [] = array operation, etc…

63

cat = sh.cat
jq = sh.jq
rule = """[{name: .samples[].name, days: .samples[].series[].day}]"""
out = jq(rule, cat(_in=json_data)).stdout
uni_out = out.decode('utf-8')  # assumption: the slide does not show how uni_out is derived from out
json.loads(uni_out)

Page 64: Aileen Nielsen - NoSQL Python: making data frames work for you in a non-rectangular world

AND SOMETIMES YOU HAVE NO IDEA WHAT’S IN AN ENORMOUS JSON FILE

• Inconsistent or undocumented API

• Legacy Mongo database

• Someone handed you some gnarly JSON because they couldn’t parse it

64

Page 65: Aileen Nielsen - NoSQL Python: making data frames work for you in a non-rectangular world

YOU’RE A PROGRAMMER…USE ITERATORS

• The ijson module is an iterator JSON parser…you can deal with structure one bit at a time

• This also gives you a great opportunity to make data parsing decisions as you go

• This isn’t fast, but it’s also not fast to shoot from the hip when you’re talking about gnarly JSON

65

Page 66: Aileen Nielsen - NoSQL Python: making data frames work for you in a non-rectangular world

YOU’RE A PROGRAMMER…USE ITERATORS

66

with open(file_name, 'rb') as file:
    results = ijson.items(file, "samples.item")
    for newRecord in results:
        record = newRecord
        for k in record.keys():
            if isinstance(record[k], dict):
                recursive_check(record[k])
            if isinstance(record[k], list):
                recursive_check(record[k])
        process(record)

Page 67: Aileen Nielsen - NoSQL Python: making data frames work for you in a non-rectangular world

YOU’RE A PROGRAMMER…USE ITERATORS

67

from collections import defaultdict
from importlib import import_module

total_dict = defaultdict(lambda: False)

def recursive_check(d):
    if isinstance(d, dict):
        if not total_dict[tuple(sorted(d.keys()))]:
            class_name = input("Input the new classname with a space and then the file name defining the class ")
            mod = import_module(class_name)
            cls = getattr(mod, class_name)
            total_dict[tuple(sorted(d.keys()))] = cls
        for k in d.keys():
            # recurse into the value, not the key
            new_class = recursive_check(d[k])
            if new_class:
                d[k] = new_class(**d[k])
        return total_dict[tuple(sorted(d.keys()))]
    elif isinstance(d, list):
        for i in range(len(d)):
            new_class = recursive_check(d[i])
            if new_class:
                d[i] = new_class(**d[i])
    else:
        return False

• Basically, you can build custom classes or generate appropriate named tuples as you go.

• This lets you know what you have and lets you build data structures to accommodate what you have.

• Storing these objects in a class rather than a simple dictionary again gives you the option to customize .__call__() to your needs
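
The named-tuple variant of this idea can be sketched without any prompting. The names registry and to_records are hypothetical, and the sketch assumes every JSON key is a valid Python identifier (namedtuple rejects anything else):

```python
from collections import namedtuple

registry = {}  # maps a sorted tuple of keys to a generated namedtuple type

def to_records(value):
    """Recursively replace dicts with namedtuples, one type per distinct key set."""
    if isinstance(value, dict):
        keys = tuple(sorted(value))
        if keys not in registry:
            # A new shape: generate a type for it and remember it.
            registry[keys] = namedtuple("Record%d" % len(registry), keys)
        return registry[keys](**{k: to_records(v) for k, v in value.items()})
    if isinstance(value, list):
        return [to_records(v) for v in value]
    return value

samples = [
    {"name": "Jane Doe", "age": 42, "profession": "architect",
     "series": [{"day": 0, "measurement_value": 0.97},
                {"day": 1, "measurement_value": 1.55}]},
    {"name": "Bob Smith", "hobbies": ["tennis", "cooking"], "age": 37,
     "series": {"day": 0, "measurement_value": 1.25}},
]

records = to_records(samples)
print(len(registry))                        # number of distinct shapes seen
print(records[0].series[1].measurement_value)
```

After one pass, registry doubles as a catalogue of every record shape in the file, which is often the first thing you want to know about an undocumented source.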

Page 68: Aileen Nielsen - NoSQL Python: making data frames work for you in a non-rectangular world

YOU’RE A PROGRAMMER…USE ITERATORS

68

• Basically, you can build custom classes or generate appropriate Named tuples as you go. This lets you know what you have and lets you build data structures to accommodate what you have.

• Again remember that class methods can easily be adjusted dynamically, so it’s good to code classes with instances that reference class methods.

Page 69: Aileen Nielsen - NoSQL Python: making data frames work for you in a non-rectangular world

3. WHY? (AGAIN)

69

Page 70: Aileen Nielsen - NoSQL Python: making data frames work for you in a non-rectangular world

CLUSTERING TIME SERIES

• Reports of clustering and classifying time series are surprisingly rare

• Methods are computationally demanding, O(N²)… but we’re getting there

• Relatedly ‘classification’ can also be used for series-related predictions

• Can use many commonly applied clustering algorithms once you get the distance metric

70

http://www.cse.ust.hk/vldb2002/VLDB2002-proceedings/slides/S12P01slides.pdf

Page 71: Aileen Nielsen - NoSQL Python: making data frames work for you in a non-rectangular world

WHEN DO PEOPLE GO RUNNING?

71

Page 72: Aileen Nielsen - NoSQL Python: making data frames work for you in a non-rectangular world

WHEN DO PEOPLE GO RUNNING?

72

Actually, I made these plots with R…

Page 73: Aileen Nielsen - NoSQL Python: making data frames work for you in a non-rectangular world

NANO-SCALE PHYSICS

73

Meisner et al., J. Am. Chem. Soc. 2012, 134, 20440−20445

• You can build an electrical circuit which has a single molecule as its narrowest part

• It turns out it’s quite easy to distinguish different molecules depending on their trajectory as you pull on them

• Particularly their summed behavior looks quite different

• Suggests that we could cluster and identify individual measurements with reasonable certainty

Page 74: Aileen Nielsen - NoSQL Python: making data frames work for you in a non-rectangular world

74

• Several months of pulling the top 25 threads off Reddit’s front page shows significantly different trends for different subreddits.

REDDIT

Page 76: Aileen Nielsen - NoSQL Python: making data frames work for you in a non-rectangular world

76

• Several months of pulling the top 25 threads off Reddit’s front page shows significantly different trends for different subreddits.

• Some kinds of posts don’t last long (r/TwoX and r/videos)

• r/personalfinance shows a remarkable ability to have a second peak/second life on the front page

• r/videos do great but burn out quickly

REDDIT

Page 77: Aileen Nielsen - NoSQL Python: making data frames work for you in a non-rectangular world

QUICK: HOW IT WORKS I.

77

Page 78: Aileen Nielsen - NoSQL Python: making data frames work for you in a non-rectangular world

QUICK: HOW IT WORKS II.

78

• O(N²) in theory

• Various lower bounding techniques significantly reduce processing time

• Dynamic programming problem

http://wearables.cc.gatech.edu/paper_of_week/DTW_myths.pdf

Page 79: Aileen Nielsen - NoSQL Python: making data frames work for you in a non-rectangular world

WHY THE FANCY METHOD?

79

Euclidean distance matches ts3 to ts1, despite our intuition that ts1 and ts2 are more alike.

http://www.cse.ust.hk/vldb2002/VLDB2002-proceedings/slides/S12P01slides.pdf

http://nbviewer.jupyter.org/github/alexminnaar/time-series-classification-and-clustering/blob/master/Time%20Series%20Classification%20and%20Clustering.ipynb
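
The linked references concern dynamic time warping (DTW). A plain-Python sketch of the textbook O(N²) dynamic program follows, with none of the lower-bounding speedups. The three series are made up for illustration (the plots from the slide are not in this transcript): ts2 is ts1 with its bump shifted later, and ts3 is a flat line.

```python
import math

def dtw_distance(a, b):
    """Textbook DTW: cost[i][j] is the cheapest alignment of a[:i] with b[:j]."""
    n, m = len(a), len(b)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: step in a only, step in b only, step in both.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

ts1 = [0, 1, 2, 1, 0, 0, 0, 0]
ts2 = [0, 0, 0, 1, 2, 1, 0, 0]   # same bump, two steps later
ts3 = [0.5] * 8                  # flat line, no bump at all

print(euclidean(ts1, ts2), euclidean(ts1, ts3))
print(dtw_distance(ts1, ts2), dtw_distance(ts1, ts3))
```

Euclidean distance compares point by point, so the shifted bump is penalized twice and the flat line looks closer; DTW is allowed to stretch the time axis, so it recognizes the two bumps as the same shape.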

Page 80: Aileen Nielsen - NoSQL Python: making data frames work for you in a non-rectangular world

BIKE-SHARING STANDS

80

http://ofdataandscience.blogspot.nl/2013/03/capital-bikeshare-time-series-clustering.html?m=1

Page 81: Aileen Nielsen - NoSQL Python: making data frames work for you in a non-rectangular world

FUTURE RESEARCH POSSIBILITIES

81

http://wearables.cc.gatech.edu/paper_of_week/DTW_myths.pdf

Page 82: Aileen Nielsen - NoSQL Python: making data frames work for you in a non-rectangular world

WHY?

Time series classification and related metrics can be one more thing to know…or even several more things to know

82

Name    Ordered Scores   Score Trajectory Type   Number of Tests   Predicted Score For Next Test

Joe     [1, 14, 17]      good                    3                 19

Mary    [11, 14, 9]      meh                     3                 11

Allen   [25, NA, 9]      underachiever           2                 35

Score Trajectory Type: info from classification

Number of Tests: info from easy apply() calls

Predicted Score For Next Test: info from prediction

Page 83: Aileen Nielsen - NoSQL Python: making data frames work for you in a non-rectangular world

THE SHORT VERSION

• Pandas is already well-adapted to the No-SQL world

83

Page 84: Aileen Nielsen - NoSQL Python: making data frames work for you in a non-rectangular world

THE SHORT VERSION

• Pandas is already well-adapted to the No-SQL world

• Make your data format work for you

84

Page 85: Aileen Nielsen - NoSQL Python: making data frames work for you in a non-rectangular world

THE SHORT VERSION

• Pandas is already well-adapted to the No-SQL world

• Make your data format work for you

• Comparative time series go hand-in-hand with the increasing availability of No-SQL data. Everything is a time series if you look hard enough.

85

Page 86: Aileen Nielsen - NoSQL Python: making data frames work for you in a non-rectangular world

THE SHORT VERSION

• Pandas is already well-adapted to the No-SQL world

• Make your data format work for you

• Comparative time series go hand-in-hand with the increasing availability of No-SQL data. Everything is a time series if you look hard enough.

• Non-time series collections are also informative. This was just one example of what you can do.

86

Page 87: Aileen Nielsen - NoSQL Python: making data frames work for you in a non-rectangular world

THANK YOU

87

