Posted: 16-Apr-2017 · Uploaded by: pydata · Category: Data & Analytics
OUTLINE
1. WHY? (OTHER THAN THE TRENDY NAME)
2. HOW?
3. WHY? (AGAIN)

1. WHY?
LET'S START WITH STANDARD SQL-LIKE, TIDY DATA

Name   Day  Score
Allen    1     25
Joe      3     17
Joe      2     14
Mary     2     14
Mary     1     11
Allen    3      9
Mary     3      9
Joe      1      1
What makes this data tidy?
• Observations are in rows
• Variables are in columns
• Contained in a single data set

But can you tell me anything useful about this data set?
Sure. These are easy to see:
• Highest score
• Lowest score
• Total observations
Not-so-easy:
• How many people?
• Who's doing the best?
• Who's doing the worst?
• How are individuals doing?
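To make the "not-so-easy" part concrete, here is a minimal stdlib sketch (no pandas; the row tuples are the table above) of the grouping work those questions force on you before you can even look at one person:

```python
from collections import defaultdict

# Row-per-observation data, exactly as in the tidy table above.
rows = [("Allen", 1, 25), ("Joe", 3, 17), ("Joe", 2, 14), ("Mary", 2, 14),
        ("Mary", 1, 11), ("Allen", 3, 9), ("Mary", 3, 9), ("Joe", 1, 1)]

# You have to group by person before any per-person question is answerable.
scores = defaultdict(dict)
for name, day, score in rows:
    scores[name][day] = score

print(len(scores))                    # how many people? -> 3
best = max(scores, key=lambda n: sum(scores[n].values()))
print(best)                           # highest total score
print(sorted(scores["Joe"].items()))  # one person's day-by-day trajectory
```

None of this is hard, but every per-person question starts with the same regrouping step.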
HOW ABOUT NOW?

What changed?
• The data's still tidy, but we've changed the organizing principle

Name   Score  Day
Allen     25    1
Mary      11    1
Joe        1    1
Mary      14    2
Joe       14    2
Joe       17    3
Allen      9    3
Mary       9    3
OK HOW ABOUT NOW? (LAST TIME I PROMISE)

Name   Ordered Scores
Joe    [1, 14, 17]
Mary   [11, 14, 9]
Allen  [25, NA, 9]
This data's NOT TIDY, but…
• I can eyeball it easily
• And new questions become interesting and easier to answer:
• How many students are there?
• Who improved?
• Who missed a test?
• Who was kind of meh?
DON'T GET MAD

I'm not saying to kill tidy.
But I worry we don't use certain methods more often because it's not as easy as it could be.
BEFORE I GOT INTO THE 'NOSQL' MINDSET, I SIGHED THINKING ABOUT…

• App analytics: What usage patterns do we see in our long-term app users? How do those patterns evolve over time at the individual level?
• Health research: Can we predict early on in an experiment what's likely to happen? Do our experiments need to be as long as they are?
• Consumer research: Do people like things because they like them, or because of the ordering they saw them in?
I ALSO TENDED NOT TO ASK NO-SQL QUESTIONS TOO OFTEN

• Status quo bias: humans tend to take whatever default is presented. That happens in data analysis too.
• Endowment effect: humans tend to want what they already have and think it's more valuable than what's offered for a trade.
• Especially deep finding: humans are lazy.
IT'S TRUE. YOU CAN ANSWER THESE QUESTIONS WITH THE TIDY DATA FRAMES I JUST SHOWED YOU.

You can always 'reconstruct' these trajectories of what happened by making a data frame per user.

>>> df
    Name  Day  Score
0  Allen    1     25
1    Joe    3     17
2    Joe    2     14
3   Mary    2     14
4   Mary    1     11
5  Allen    3      9
6   Mary    3      9
7    Joe    1      1

Option 1:
>>> no_sql_df = df.groupby('Name').apply(lambda df: list(df.sort_values(by='Day')['Score']))
>>> no_sql_df
Name
Allen        [25, 9]
Joe      [1, 14, 17]
Mary     [11, 14, 9]
Option 2:
>>> new_list = []
>>> for name, group in df.groupby('Name'):
...     new_list.append({name: list(zip(group['Day'], group['Score']))})
...
>>> new_list
[{'Allen': [(1, 25), (3, 9)]}, {'Joe': [(3, 17), (2, 14), (1, 1)]}, {'Mary': [(2, 14), (1, 11), (3, 9)]}]
Option 3:
>>> def process(new_df):
...     return [new_df[new_df['Day'] == i]['Score'].values[0]
...             if i in list(new_df['Day']) else None
...             for i in range(1, 4)]
...
>>> df.groupby(['Name']).apply(process)
Name
Allen    [25, None, 9]
Joe       [1, 14, 17]
Mary      [11, 14, 9]
LET'S BE HONEST… NO ONE WANTS THAT TO BE A FIRST STEP TO EVERY QUERY… AND INCREASINGLY WE'LL BE REQUIRED TO MAKE THESE SORTS OF QUERIES

• Google ads (well… maybe less so in Europe)
• Wearable sensors
• The unit of an observation should be the actor, not the particular action observed at a particular time.
• Maybe we should rethink what we mean by 'observations'
• High scalability
• Distributed computing
• Schema flexibility
• Semi-structured data
• No complex relationships
• Schemas change all the time
• Patterns change all the time
• Same units of interest repeating new things
WHAT IS NO-SQL PYTHON?

We don't look for No-SQL because we have No-SQL databases… we have No-SQL databases because we have No-SQL data.

Data that doesn't seem like it fits in a data frame:
• Arbitrarily nested data
• Ragged data
• Comparative time series
WHERE DO WE FIND NO-SQL DATA?
Here’s where I’ve found it…
• Physics lab
• Running data
• Health data
2. HOW?
GETTING THE DATA INTO PYTHON WHEN IT'S STRAIGHTFORWARD

• Scenario: you're grabbing a bunch of NoSQL data from an API or from a NoSQL db.
• We'll stick with JSON since it's a common format.
• Best-case scenario: you'll take everything however you can get it. In this case, stick with pandas. json_normalize works great.
JSON_NORMALIZE WORKS PRETTY WELL

{"samples": [
    {
        "name": "Jane Doe",
        "age": 42,
        "profession": "architect",
        "series": [
            {"day": 0, "measurement_value": 0.97},
            {"day": 1, "measurement_value": 1.55},
            {"day": 2, "measurement_value": 0.67}
        ]
    },
    {
        "name": "Bob Smith",
        "hobbies": ["tennis", "cooking"],
        "age": 37,
        "series": {"day": 0, "measurement_value": 1.25}
    }
]}
>>> with open(json_file) as data_file:
...     data = json.load(data_file)
>>> normalized_data = json_normalize(data['samples'])

Easy to process:
>>> print(normalized_data['series'][0][1])
{u'measurement_value': 1.55, u'day': 1}

Easy to add columns:
>>> normalized_data['length'] = normalized_data['series'].apply(len)

Basically, it just works.
USING SOME PROGRAMMER STUFF ALSO HELPS

class dfList(list):
    def __init__(self, originalValue):
        if originalValue.__class__ is list().__class__:
            self = originalValue
        else:
            self = list(originalValue)

    def __getitem__(self, item):
        result = list.__getitem__(self, item)
        try:
            return result[ITEM_TO_GET]
        except:
            return result

    def __iter__(self):
        for i in range(len(self)):
            yield self.__getitem__(i)

    def __call__(self):
        return sum(self) / list.__len__(self)

• Subclass an iterable to shorten your apply() calls
• In particular, you need to override at least __getitem__ and __iter__
• You should probably override __init__ as well for the case of inconsistent formats
• Then __call__ can be a catch-all adjustable function… best to load it up with a call to a class function, which you can adjust at will anytime.
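As a runnable illustration of the same idea, here is a condensed sketch (the ITEM_TO_GET value and the sample records are made up, not from the slides): indexing the list reaches into each nested record, and calling it averages the extracted field.

```python
ITEM_TO_GET = 'measurement_value'  # hypothetical field to extract

class dfList(list):
    def __getitem__(self, item):
        result = list.__getitem__(self, item)
        try:
            return result[ITEM_TO_GET]   # reach inside a nested dict
        except (TypeError, KeyError, IndexError):
            return result                # fall back to the raw element

    def __iter__(self):
        for i in range(len(self)):
            yield self[i]                # iterate through the override

    def __call__(self):
        vals = list(self)                # extracted field values
        return sum(vals) / len(vals)

series = dfList([{'day': 0, 'measurement_value': 0.97},
                 {'day': 1, 'measurement_value': 1.55}])
print(series[1])   # 1.55
print(series())    # mean measurement_value across records
```

With a column of these, `df['series'].apply(dfList)` would let plain indexing do the digging for you.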
CUSTOM CLASSES PAIR NICELY WITH CLASS METHODS

class Test:
    def __init__(self, name):
        self.name1 = name

    def print_class_instance(instance):
        print(instance.name1)

    def print_self(self):
        self.__class__.print_class_instance(self)

>>> test1 = Test('test1')
>>> test1.print_self()
test1
>>> def new_printing(instance):
...     print("Now I'm printing a constant string")
...
>>> test1.print_self()
test1
>>> Test.print_class_instance = new_printing
>>> test1.print_self()
Now I'm printing a constant string

• Design flexible classes that often reference class methods rather than instance methods
• Then as you are processing data, you can quickly swap out methods to call different field names in the event of highly nested JSON
• Data processing is faster, and no mental gymnastics or annoying parse efforts required
GETTING NOSQL DATA: COMMONLY-ENCOUNTERED PROBLEMS

• CSVs with arrays
• Highly-nested JSON
• Unknown or unreliably formatted API results
SOMETIMES YOU GET WEIRD CSV FILES…

• Sometimes your problem is as simple as getting a csv file with nested data
• This is pretty straightforward to deal with… use regex and common Python string operations to clean up the data
• apply() is your best friend
• Common problem: spaces between "," and the column name or value. Use a parameter to avoid this: df = pd.read_csv("in.csv", sep=",", skipinitialspace=True)
name,favorites,age
joe,"[madonna,elvis,u2]",28
mary,"[lady gaga, adele]",36
allen,"[beatles, u2, adele, rolling stones]"

This isn't even that weird.

>>> df = pd.read_csv(file_name, sep=",")
Downright straightforward.

Hmmm…
>>> print(df['favorites'][0][1])
m

The 'array' came in as a plain string, so indexing gives you characters.

Regex to the rescue… Python's exceptionally easy string parsing is a huge asset for No-SQL parsing:
>>> df['favorites'] = df['favorites'].apply(lambda s: s[1:-1].split(','))
>>> print(df['favorites'][0][1])
elvis
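The same cleanup works without pandas. Here is a stdlib sketch (the inline sample data stands in for the file): the csv module already understands the quoted fields, and the bracketed list is split by hand.

```python
import csv
import io

# Inline stand-in for the quoted-array csv file shown above.
raw = ('name,favorites,age\n'
       'joe,"[madonna,elvis,u2]",28\n'
       'mary,"[lady gaga, adele]",36\n')

# csv handles the quoting; we strip the brackets and split the list.
rows = list(csv.DictReader(io.StringIO(raw)))
favorites = [r["favorites"][1:-1].split(',') for r in rows]
print(favorites[0])  # ['madonna', 'elvis', 'u2']
```

A strip() per item would tidy the leftover spaces in entries like "lady gaga, adele".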
WHAT ABOUT THIS ONE?

name,favorites,age
joe,[madonna,elvis,u2],28
mary,[lady gaga, adele],36
allen,[beatles, u2, adele, rolling stones]

This isn't even that weird.

>>> df = pd.read_csv(file_name, sep=",")
Downright straightforward?

Actually this fails miserably — without quotes, the commas inside the brackets split into extra columns:
>>> print(df['favorites'])
joe     [madonna elvis u2]
mary    [lady gaga adele]    36
Name: name, dtype: object

We need more regex… this time before applying read_csv().
name,favorites,age
joe,[madonna,elvis,u2],28
mary,[lady gaga, adele],36
allen,[beatles, u2, adele, rolling stones]

Missing quotes around arrays: basically, put in the quotation marks to help out read_csv().

pattern = r"(\[.*\])"
with open(file_name) as f:
    for line in f:
        new_line = line
        match = re.finditer(pattern, line)
        try:
            m = next(match)
            while m:
                replacement = '"' + m.group(1) + '"'
                new_line = new_line.replace(m.group(1), replacement)
                m = next(match)
        except StopIteration:
            pass
        with open(write_file, 'a') as write_f:
            write_f.write(new_line)

new_df = pd.read_csv(write_file)

With multiple arrays per row, you're gonna need to accommodate the greedy nature of regex:
pattern = r"(\[.*?\])"
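A shorter route to the same quoting fix is a single re.sub per line (my sketch, not from the slides); the non-greedy group already handles multiple arrays per row:

```python
import re

# Wrap each bracketed array in quotes so read_csv stops splitting on
# the commas inside it. Non-greedy .*? keeps multiple arrays separate.
line = 'joe,[madonna,elvis,u2],28'
quoted = re.sub(r'(\[.*?\])', r'"\1"', line)
print(quoted)  # joe,"[madonna,elvis,u2]",28

two = re.sub(r'(\[.*?\])', r'"\1"', 'a,[1,2],[3,4],b')
print(two)     # a,"[1,2]","[3,4]",b
```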
THAT WAS A LOT OF TEXT…ALMOST DONE
SOMETIMES YOU GET JSON AND YOU KNOW THE STRUCTURE, YOU JUST DON'T LIKE IT

• Use json_normalize() and then shed the columns you don't want. You've seen that today already.
• Use some magic: the sh module with jq to simplify your life… you can pick out the fields you want with jq either on the command line or through sh
• jq has a straightforward, easy-to-learn syntax: . = value, [] = array operation, etc.

cat = sh.cat
jq = sh.jq
rule = """[{name: .samples[].name, days: .samples[].series[].day}]"""
out = jq(rule, cat(_in=json_data)).stdout
json.loads(out)
AND SOMETIMES YOU HAVE NO IDEA WHAT’S IN AN ENORMOUS JSON FILE
• Inconsistent or undocumented API
• Legacy Mongo database
• Someone handed you some gnarly JSON because they couldn’t parse it
YOU’RE A PROGRAMMER…USE ITERATORS
• The ijson module is an iterator JSON parser…you can deal with structure one bit at a time
• This also gives you a great opportunity to make data parsing decisions as you go
• This isn’t fast, but it’s also not fast to shoot from the hip when you’re talking about gnarly JSON
with open(file_name, 'rb') as f:
    results = ijson.items(f, "samples.item")
    for record in results:
        for k in record.keys():
            if isinstance(record[k], dict):
                recursive_check(record[k])
            if isinstance(record[k], list):
                recursive_check(record[k])
        process(record)
total_dict = defaultdict(lambda: False)

def recursive_check(d):
    if isinstance(d, dict):
        if not total_dict[tuple(sorted(d.keys()))]:
            class_name = input("Input the new class name with a space and then the file name defining the class ")
            mod = import_module(class_name)
            cls = getattr(mod, class_name)
            total_dict[tuple(sorted(d.keys()))] = cls
        for k in d.keys():
            new_class = recursive_check(d[k])
            if new_class:
                d[k] = new_class(**d[k])
        return total_dict[tuple(sorted(d.keys()))]
    elif isinstance(d, list):
        for i in range(len(d)):
            new_class = recursive_check(d[i])
            if new_class:
                d[i] = new_class(**d[i])
    else:
        return False

• Basically, you can build custom classes or generate appropriate named tuples as you go.
• This lets you know what you have and lets you build data structures to accommodate what you have.
• Storing these objects in a class rather than a simple dictionary again gives you the option to customize __call__() to your needs.
• Again, remember that class methods can easily be adjusted dynamically, so it's good to code classes with instances that reference class methods.
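The "generate named tuples as you go" idea can be sketched without the interactive class loader above (this minimal version and its sample record are my own, under the assumption that one record type per distinct key set is what you want):

```python
from collections import namedtuple

# Cache one namedtuple type per distinct key set, keyed by sorted keys.
_types = {}

def to_record(d):
    keys = tuple(sorted(d.keys()))
    if keys not in _types:
        _types[keys] = namedtuple('Record_%d' % len(_types), keys)
    # Recurse so nested dicts become nested records too.
    return _types[keys](**{k: to_record(v) if isinstance(v, dict) else v
                           for k, v in d.items()})

rec = to_record({"name": "Jane Doe",
                 "series": {"day": 0, "measurement_value": 0.97}})
print(rec.series.day)  # 0
```

Every record with the same key set shares a type, so you learn the shapes in your data as a side effect of parsing it.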
3. WHY? (AGAIN)
CLUSTERING TIME SERIES

• Reports of clustering and classifying time series are surprisingly rare
• Methods are computationally demanding, O(N²)… but we're getting there
• Relatedly, 'classification' can also be used for series-related predictions
• Can use many commonly applied clustering algorithms once you have the distance metric

http://www.cse.ust.hk/vldb2002/VLDB2002-proceedings/slides/S12P01slides.pdf
WHEN DO PEOPLE GO RUNNING?

(Actually, I made these plots with R…)
NANO-SCALE PHYSICS

Meisner et al., J. Am. Chem. Soc. 2012, 134, 20440−20445

• You can build an electrical circuit which has a single molecule as its narrowest part
• It turns out it's quite easy to distinguish different molecules depending on their trajectory as you pull on them
• In particular, their summed behavior looks quite different
• Suggests that we could cluster and identify individual measurements with reasonable certainty
• Several months of pulling the top 25 threads off Reddit's front page shows significantly different trends for different subreddits.
• Some kinds of posts don't last long (r/TwoX and r/videos)
• r/personalfinance shows a remarkable ability to have a second peak/second life on the front page
• r/videos do great but burn out quickly
QUICK: HOW IT WORKS I.

QUICK: HOW IT WORKS II.

• O(N²) in theory
• Various lower-bounding techniques significantly reduce processing time
• Dynamic programming problem

http://wearables.cc.gatech.edu/paper_of_week/DTW_myths.pdf
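The dynamic program behind dynamic time warping fits in a few lines. This is a minimal sketch of the textbook O(N·M) recurrence only, with none of the lower-bounding tricks the link above discusses:

```python
def dtw(a, b):
    """Dynamic-time-warping distance between two numeric sequences."""
    n, m = len(a), len(b)
    INF = float('inf')
    # D[i][j] = cost of best alignment of a[:i] with b[:j].
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: match, insert, delete.
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

print(dtw([1, 2, 3], [1, 2, 2, 3]))  # 0.0: warping absorbs the repeat
print(dtw([1, 2, 3], [2, 3, 4]))
```

Unlike Euclidean distance, a repeated or shifted sample costs nothing here, which is exactly why DTW matches our intuition about which series are alike.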
WHY THE FANCY METHOD?

Euclidean distance matches ts3 to ts1, despite our intuition that ts1 and ts2 are more alike.

http://www.cse.ust.hk/vldb2002/VLDB2002-proceedings/slides/S12P01slides.pdf
http://nbviewer.jupyter.org/github/alexminnaar/time-series-classification-and-clustering/blob/master/Time%20Series%20Classification%20and%20Clustering.ipynb
BIKE-SHARING STANDS

http://ofdataandscience.blogspot.nl/2013/03/capital-bikeshare-time-series-clustering.html?m=1
FUTURE RESEARCH POSSIBILITIES
http://wearables.cc.gatech.edu/paper_of_week/DTW_myths.pdf
WHY?

Time series classification and related metrics can be one more thing to know… or even several more things to know.

Name   Ordered Scores  Score Trajectory Type  Number of Tests  Predicted Score for Next Test
Joe    [1, 14, 17]     good                   3                19
Mary   [11, 14, 9]     meh                    3                11
Allen  [25, NA, 9]     underachiever          2                35

Trajectory type: info from classification. Predicted score: info from prediction. Number of tests: info from easy apply() calls.
THE SHORT VERSION

• Pandas is already well-adapted to the No-SQL world
• Make your data format work for you
• Comparative time series go hand-in-hand with the increasing availability of No-SQL data. Everything is a time series if you look hard enough.
• Non-time-series collections are also informative. This was just one example of what you can do.
THANK YOU