Silico-paleontology with graph databases
Rooting through the relics of digital evolution
Nic McPhee & David Donatucci (w/ Thomas Helmuth)
Division of Science and Mathematics
University of Minnesota, Morris
Morris, Minnesota, USA
May 2015
Genetic Programming Theory and Practice
University of Michigan
Ann Arbor, MI
McPhee & Donatucci (UMN Morris) Graph database analysis of GP dynamics May 2015, GPTP, Ann Arbor MI 1 / 26
The Big Picture
Genetic programming clearly works.
But we rarely know why or how.
Databases allow examination of the internal interactions of a run.
Graph databases better suited for this than relational databases.
Silico-paleontology can help us understand and improve our tools.
Outline
1 What do we know? (And how do we talk about it?)
2 Using a graph database
3 Let’s go exploring!
4 Conclusions
What do we know? (And how do we talk about it?)
Outline
1 What do we know? (And how do we talk about it?)
    We throw so much away
    Summary results are highly lossy
    Plots are better (but can still obscure details)
    Can we zoom in to individual runs?
2 Using a graph database
3 Let’s go exploring!
4 Conclusions
We keep/see/share so little
EC research has the potential to generate huge amounts of data.
What do we normally do with that data?
We normally throw it away, and paleontologists weep!
https://www.flickr.com/photos/blmoregon/14566767645/
https://www.flickr.com/photos/nicmcphee/1323950471
Oooh – a table of results!
             Treatment
Problem     L     T     I
RSWN       55    13    17
SYL        22     1     2
SLB        75    19    10
NTZ        57    15     7
These show successes on 4 problems for 3 different treatments (L = lexicase, T = tournament, I = implicit fitness sharing)
L seems to be winning
But why?!?!?
What’s actually happening in all those matings and crossovers and mutations that makes the difference?
Let’s draw pretty pictures
[Figure: error diversity (y-axis, 0.00 to 1.00) vs. generation (x-axis, 0 to 300), in two panels: lexicase (top) and tourney (bottom)]
So much more data!
Diversity over time across all the runs.
L’s diversity (top) is consistently higher than T (bottom).
That might be important (and supports some hypotheses).
Still, this mushes all the runs together.
And that likely obscures interesting things.
Zooming in
[Figure: error diversity (y-axis) vs. generation (x-axis) for a single run]
Focusing on one successful L run now.
Three big diversity changes:
    First 15 generations have a sharp drop then steep rise
    Around generation 40 a sharp drop and rise
    Sharp drop at end just before a solution is found
What’s happening at those sections of the run?
We want to be able to dig through a run and see what happened.
Using a graph DB
Outline
1 What do we know? (And how do we talk about it?)
2 Using a graph database
    Goals
    Neo4j
    Cypher
3 Let’s go exploring!
4 Conclusions
Goals
We want to store and analyze all the individuals and their relationships.
Ancestry relationships are naturally modeled with a graph.
So graph databases seem a natural tool for the relationship part.
www.hokstad.com/family-tree-using-graphviz-and-ruby
[Figure 1 from Burlacu et al., 2013: Distribution of fitness, genealogies and root lineages in the population graph. Panels: (a) Distribution of fitness values; (b) Genealogies in the last generation; (c) Root lineages in the last generation; (d) Genealogy of the best individual; (e) Root lineage of the best individual. Color scale: fitness value (Pearson’s R2), 0.0 to 1.0.]
[Burlacu et al., 2013]
Neo4j graph database
Part of the new-ish NoSQL movement
Neo4j’s initial release was 2007
Started to take off in 2010
Represent individuals as nodes
Represent parent-child relationships as edges
Easy to represent complex relationships
Easy to search for relationships
Efficient recursive queries, esp. compared to traditional databases
http://neo4j.com
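The node-and-edge model can be sketched in plain Python, without Neo4j. This is only an illustration under assumptions: the `Individual` fields and the `RunGraph` class are hypothetical names, the ids follow the `generation:index` labels used later in the talk, and the actual work stored the data in Neo4j rather than in in-memory dictionaries.

```python
from dataclasses import dataclass, field


@dataclass
class Individual:
    """A node: one individual in one generation (field names are hypothetical)."""
    uid: str            # e.g. "86:261" = generation 86, index 261
    generation: int
    total_error: int


@dataclass
class RunGraph:
    """Individuals as nodes, PARENT_OF relationships as directed edges."""
    nodes: dict = field(default_factory=dict)      # uid -> Individual
    children: dict = field(default_factory=dict)   # parent uid -> set of child uids
    parents: dict = field(default_factory=dict)    # child uid -> set of parent uids

    def add_individual(self, ind):
        self.nodes[ind.uid] = ind
        self.children.setdefault(ind.uid, set())
        self.parents.setdefault(ind.uid, set())

    def add_parent_of(self, parent_uid, child_uid):
        # One PARENT_OF edge, stored in both directions for fast lookup.
        self.children[parent_uid].add(child_uid)
        self.parents[child_uid].add(parent_uid)


graph = RunGraph()
graph.add_individual(Individual("86:261", 86, 4034))
graph.add_individual(Individual("87:719", 87, 0))
graph.add_parent_of("86:261", "87:719")
print(graph.parents["87:719"])   # the parents of 87:719
```

Storing each edge in both directions trades memory for cheap lookups of both ancestors and offspring, which are the two questions asked most often in the rest of the talk.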
Cypher query language
Neo4j uses the Cypher query language.
Fundamental elements of Cypher queries:
START
MATCH
WHERE
RETURN
Uses "ASCII art" to describe relationships:
(p)-->(c)
(p)-[r:PARENT_OF]->(c)
Can model (complex) paths
Find Nic’s parents:
(Nic)<-[:PARENT_OF]-(p)
Find all Nic’s grandparents:
(Nic)<-[:PARENT_OF*2]-(gp)
Find all Nic’s ancestors at most 5 steps back:
(Nic)<-[:PARENT_OF*1..5]-(a)
Find all Nic’s siblings:
(Nic)<-[:PARENT_OF]-()-[:PARENT_OF]->(s)
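For readers who do not have Neo4j at hand, the four path patterns above can be approximated in plain Python. This is a sketch, not how Cypher evaluates them: the function names and the tiny family are hypothetical, and `parents` is assumed to map each child to the set of its parents.

```python
def ancestors_within(parents, start, max_steps):
    """Nodes reachable by following 1..max_steps parent links from start."""
    found = set()
    frontier = {start}
    for _ in range(max_steps):
        frontier = {p for node in frontier for p in parents.get(node, ())}
        if not frontier:
            break
        found |= frontier
    return found


def exactly_n_steps(parents, start, n):
    """Nodes exactly n parent links above start (n=2 gives grandparents)."""
    frontier = {start}
    for _ in range(n):
        frontier = {p for node in frontier for p in parents.get(node, ())}
    return frontier


def siblings(parents, children, start):
    """Other children of any of start's parents."""
    sibs = {c for p in parents.get(start, ()) for c in children.get(p, ())}
    sibs.discard(start)
    return sibs


# Hypothetical family: child -> set of parents.
parents = {
    "Nic": {"Mom", "Dad"},
    "Sib": {"Mom", "Dad"},
    "Mom": {"Gran"},
}
# Invert to parent -> set of children.
children = {}
for child, ps in parents.items():
    for p in ps:
        children.setdefault(p, set()).add(child)

print(sorted(parents["Nic"]))                        # parents
print(sorted(exactly_n_steps(parents, "Nic", 2)))    # grandparents
print(sorted(ancestors_within(parents, "Nic", 5)))   # ancestors, at most 5 steps
print(sorted(siblings(parents, children, "Nic")))    # siblings
```

Each helper needs its own loop, whereas the Cypher patterns express the same traversals in a single line, which is the convenience the slide is pointing at.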
Let’s go exploring!
Outline
1 What do we know? (And how do we talk about it?)
2 Using a graph database
3 Let’s go exploring!
    Setup
    Comparing the end-games
4 Conclusions
What are we exploring?
Tom Helmuth provided a lot of data:
    A number of program synthesis problems taken from intro computing texts
    Three different selection mechanisms: Lexicase, tournament, and implicit fitness sharing (IFS)
    All using the Clojush implementation of Lee Spector’s PushGP system
    https://github.com/lspector/Clojush
    Population size 1,000; ≤ 300 generations
    See [Helmuth and Spector, 2015] for more.
We used the batch-import tool and custom scripts to import into Neo4j.
https://github.com/jexp/batch-import
Only just the beginning
We have data from hundreds of runs
Currently a very “by hand” process
Definitely learned valuable things about:
    The behavior of lexicase
    Role of alternation (a type of crossover) in PushGP
    Impact of test cases on evolutionary dynamics
We’ll look at results from two runs:
    Both successful on replace-space-with-newline problem
    One using lexicase (sol’n found in 88 gens)
    One using tournament selection (sol’n found in 151 gens)
How did we construct a winner?
How is a winner constructed at the end of a run?
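This query finds all ancestors of a winner (zero total_error), as parent-child pairs, going back at most 8 steps:
// Winners are the individuals with zero total error.
MATCH (w) WHERE w.total_error = 0
// (p)-->(c) is an edge whose child c is a winner or at most 7 steps above one.
MATCH (p)-->(c)-[*0..7]->(w)
RETURN DISTINCT id(p), id(c);
The 8-step limit is fairly arbitrary; it returns a small enough set to visualize.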
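The same traversal can be sketched in plain Python to make the step counting explicit. The function name and the small example graph are hypothetical; `parents` is assumed to map each child id to the set of its parent ids, and `total_error` to map each id to its total error.

```python
def winner_ancestry_edges(parents, total_error, max_steps=8):
    """Distinct (parent, child) edges within max_steps of any zero-error node."""
    winners = {n for n, err in total_error.items() if err == 0}
    edges = set()
    frontier = set(winners)              # nodes 0 steps above a winner
    for _ in range(max_steps):
        next_frontier = set()
        for child in frontier:
            for parent in parents.get(child, ()):
                edges.add((parent, child))
                next_frontier.add(parent)
        frontier = next_frontier         # one step further from the winners
    return edges


# Hypothetical run: w is the only winner.
parents = {"w": {"c", "d"}, "c": {"b"}, "b": {"a"}}
total_error = {"w": 0, "c": 5, "d": 7, "b": 9, "a": 11}

print(sorted(winner_ancestry_edges(parents, total_error, max_steps=2)))
print(sorted(winner_ancestry_edges(parents, total_error)))
```

The first loop iteration collects edges whose child is a winner (the `*0` case in the query), and the last collects edges whose child is `max_steps - 1` steps above a winner, which matches the `*0..7` range when `max_steps` is 8.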
Comparing the end-games
Ancestry of the winner(s) looks very different
Tournament selection (below): Single winner w/ high branching factor
Lexicase (right): 45 winners w/ much lower branching factor
[Figure (below): tournament-selection ancestry graph of the winner, generations 142 to 150]
[Figure (right): lexicase ancestry graph of the winners, generations 79 to 87, with labeled individuals 80:220, 82:447, 83:047, 83:124, 83:619, 84:319, 85:086, 86:261, 87:719, 87:941, 87:947, and a node for 42 other winners]
Lexicase selection
[Figure: lexicase ancestry graph of the winners, generations 79 to 87, with labeled individuals 80:220, 82:447, 83:047, 83:124, 83:619, 84:319, 85:086, 86:261, 87:719, 87:941, 87:947, and a node for 42 other winners]
A number of observations:
    45(!) “winning” individuals
    Individual “86:261” is (a) parent of all 45
    Individual “86:261” is a parent of 934 (of 1,000) individuals in next generation
Seriously?!? 934 offspring?!?
Turns out to be an extreme case of a common phenomenon with lexicase
Nodes marked with diamonds all had at least 100 offspring
Shaded diamonds also have at least 5 offspring that are winners or ancestors of winners
What’s the total error (fitness) of “86:261”?
    4,034(!)
    Bottom quartile!
    But had 934 offspring!
Failed to return on 4 cases (error 1,000 each)
Got 2 other answers wrong (error 17 each)
Terrible total error, but perfect on 194 of 200 tests
Great for lexicase!
What’s the total error (fitness) of “85:086”?
    100,000!
    Rank 971 out of 1,000
    But had 180 offspring
Got all the “print” cases
Failed to return value for all 100 “return” cases (error 1,000 each)
Terrible total error, but perfect on 100 of 200 tests
Fine for lexicase
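To see why such individuals do well, here is a minimal sketch of lexicase selection as it is commonly described: shuffle the test cases, then repeatedly keep only the candidates with the lowest error on the next case. It is not the Clojush implementation. The rival "generalist" individual is hypothetical; only the first error profile (4 cases at 1,000, 2 at 17, 194 at 0, total 4 × 1,000 + 2 × 17 = 4,034) comes from the slide for “86:261”.

```python
import random


def lexicase_select(errors, rng):
    """Pick one individual index, given errors[i][t] = error of i on test t."""
    candidates = list(range(len(errors)))
    cases = list(range(len(errors[0])))
    rng.shuffle(cases)                       # random test-case order
    for t in cases:
        best = min(errors[i][t] for i in candidates)
        candidates = [i for i in candidates if errors[i][t] == best]
        if len(candidates) == 1:
            break
    return rng.choice(candidates)


# Individual 0 mimics "86:261": 4 failures (1,000), 2 wrong (17), 194 perfect.
specialist = [1000] * 4 + [17] * 2 + [0] * 194
# Hypothetical rival: a small error on every one of the 200 tests.
generalist = [5] * 200
errors = [specialist, generalist]

print(sum(specialist), sum(generalist))      # total errors

rng = random.Random(0)
wins = sum(lexicase_select(errors, rng) == 0 for _ in range(1000))
print(wins)   # how often the high-total-error individual is selected
```

In this two-individual setup the specialist wins whenever the first shuffled case is one of its 194 perfect cases, even though a selection scheme based on total error would prefer the generalist.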
High proportion of mutations:
    Roughly half the offspring in this graph created via mutation
    Probably why there’s less branching
Tournament selection
[Figure: tournament-selection ancestry graph of the winner, generations 142 to 150]
Much broader: 42 ancestors of a winner for tournament 9 gens back; 14 for lexicase
About two-thirds created via crossover, so more branching than lexicase
Number of ancestors of “winners” over time
Gens from winner   Lexicase   Tournament
       1               4           2
       2               6           4
       3               7           6
       4               6          10
       5               7          13
       6               9          20
       7              10          30
       8              14          33
       9              14          42
      10              22          63
     ...             ...         ...
      18              58         297
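A table like this can be computed by walking parent links one generation at a time and counting the distinct individuals reached. The sketch below assumes that each row counts the ancestors exactly that many generations back, which fits the lexicase column dropping from 7 to 6; the function name and the example graph are hypothetical.

```python
def ancestor_counts(parents, winners, max_gens):
    """counts[g - 1] = number of distinct ancestors exactly g generations back."""
    counts = []
    frontier = set(winners)
    for _ in range(max_gens):
        # Replace the frontier by the distinct parents of everything in it.
        frontier = {p for child in frontier for p in parents.get(child, ())}
        counts.append(len(frontier))
    return counts


# Hypothetical run with two winners that share a parent.
parents = {
    "w1": {"a", "b"},
    "w2": {"a"},
    "a": {"x", "y"},
    "b": {"y", "z"},
}
print(ancestor_counts(parents, {"w1", "w2"}, 3))
```

Because the frontier is a set, an individual that is an ancestor along several paths is counted once per generation, so heavy reuse of a few parents keeps the counts small.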
12 most fecund individuals
Lexicase   Tournament
   934         24
   657         23
   594         23
   590         21
   433         20
   326         20
   297         19
   294         19
   285         19
   283         18
   279         18
   271         18
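A ranking like this reduces to counting children per parent. The sketch below assumes that "offspring" means distinct children, and that the run is available as a list of (parent, child) pairs; the function name and the example edges are hypothetical.

```python
from collections import Counter


def most_fecund(edges, k):
    """Top-k (parent, offspring_count) pairs from (parent, child) edges."""
    offspring = {}
    for parent, child in edges:
        offspring.setdefault(parent, set()).add(child)   # distinct children
    counts = Counter({p: len(cs) for p, cs in offspring.items()})
    return counts.most_common(k)


# Hypothetical edges; the repeated ("a", "c1") pair is counted once.
edges = [
    ("a", "c1"), ("a", "c2"), ("a", "c3"),
    ("b", "c1"), ("b", "c4"),
    ("d", "c5"),
    ("a", "c1"),
]
print(most_fecund(edges, 2))
```

Collecting children into sets first keeps an individual that was selected as both parents of the same child from being counted twice for that child.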
Conclusions
Still early days, but we can definitely see some useful things:
    Differences in ways selection mechanisms work
    Support for hypotheses (e.g., Tom’s paper)
    Evidence for importance of crossover in PushGP
    Impact of test cases on evolutionary dynamics
Future Work
    Automate more of the work
    Examine more runs/problems/etc.
    Explore how to include this “on-line”
Thanks!
Thank you for your time and attention!
Thanks to M. Kirbie Dramdahl (University of Minnesota, Morris), and to Lee Spector’s Computational Intelligence group (Hampshire College) for ideas and feedback.
Contacts: [email protected]
Questions?
References
Burlacu, B., Affenzeller, M., Kommenda, M., Winkler, S., and Kronberger, G. (2013). Visualization of genetic lineages and inheritance information in genetic programming. In Proceedings of the 15th Annual Conference Companion on Genetic and Evolutionary Computation, GECCO ’13 Companion, pages 1351–1358, New York, NY, USA. ACM.
Helmuth, T. and Spector, L. (2015). General program synthesis benchmark suite. In Proceedings of the 17th Annual Conference on Genetic and Evolutionary Computation, GECCO ’15, New York, NY, USA. ACM.