
Hadoop and Marklogic: Using the Genetic Algorithm to generate Source Code

Date post: 20-Jan-2015
Upload: jimfuller2009
Description:
Results of an experimental approach of using MarkLogic/Hadoop to generate source code using map reduce methods.
Transcript
Page 1: Hadoop and Marklogic: Using the Genetic Algorithm to generate Source Code

MarkLogic and Hadoop – Genetic Algorithm

Jim Fuller
email: [email protected]
twitter: @xquery
Senior Engineer, Europe
19/09/12

Page 2: Hadoop and Marklogic: Using the Genetic Algorithm to generate Source Code

James Fuller

http://exslt.org

http://www.xmlprague.cz

http://jim.fuller.name

@xquery

@perl6

XSLT UK 2001

Senior engineer

Page 3: Hadoop and Marklogic: Using the Genetic Algorithm to generate Source Code

Overview

• Genetic Algorithm Refresher
• Marklogic/Hadoop architecture for implementing GA
• Installing Hadoop
• Installing MarkLogic Connector
• Problem Statement
• Review of GA process runs
• Summary

Page 4: Hadoop and Marklogic: Using the Genetic Algorithm to generate Source Code

What's the Problem?

• Big data breathes life into older algorithmic approaches

• I thought it would be interesting to turn the 'big data' problem on its head (code versus data)

• Demonstrate hadoop with MarkLogic, each playing to the other's strengths

Page 5: Hadoop and Marklogic: Using the Genetic Algorithm to generate Source Code

Get out of your comfort zone

• This talk is slightly different than the description … 150 slides! Part I.

• It's got hadoop/marklogic and the genetic algorithm, but I have focused on the process and early results

• Doing data science means pushing yourself out of your comfort zone

• Start simple, then iterate

Page 6: Hadoop and Marklogic: Using the Genetic Algorithm to generate Source Code

Genetic Algorithm Refresher

• The Genetic Algorithm ( GA ) is a model of the evolution of a population of artificial individuals emulating Darwinian Selection.

• Each individual is a chromosome which contains discrete units of information (genes).

• The driving force behind the search for new and better solutions is the retention and combination of good partial solutions to a problem

Page 7: Hadoop and Marklogic: Using the Genetic Algorithm to generate Source Code

Abridged Genetic Algorithm

• The Fundamental Theorem of Genetic Algorithms

M(H, t): # of individuals in population 't' with the schema 'H'.
f(H): average fitness of the individuals with the schema 'H'.
F: average fitness of the entire population.
p1: probability of the schema being destroyed by crossover.
p2: probability of the schema being destroyed by mutation.
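Assuming the slide showed the usual statement of the theorem (Holland's schema theorem), it reads as follows in the notation above. This is a reconstruction from the standard textbook form, not a copy of the slide's formula; E[·] denotes the expected value.

```latex
E\bigl[M(H,\,t+1)\bigr] \;\ge\; M(H,\,t)\,\frac{f(H)}{F}\,\bigl(1 - p_1 - p_2\bigr)
```

In words: schemata whose average fitness f(H) is above the population average F, and which are unlikely to be broken up by crossover or mutation, are expected to gain representation in the next generation.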

Page 8: Hadoop and Marklogic: Using the Genetic Algorithm to generate Source Code

GA operations

• Reproduction: An individual is perfectly replicated to a new population

• Crossover ( Recombination ): Parental material is recombined to create offspring to join new population

• Mutation: random changes (is key for pushing past local optima)

• Permutation: reordering

• Editing: evaluation to a terminal

• Encapsulation: single indivisible function

• Decimation: removal of individuals

Page 9: Hadoop and Marklogic: Using the Genetic Algorithm to generate Source Code

Typical GA Process

Step 0. Create a random initial population of individuals

Step 1. Evaluate the fitness of each individual

Step 2. Select individuals according to their fitness, which will participate in generating offspring (moms+dads)

Step 3. Apply primary and secondary genetic operations to generate new offspring population

Step 4. Repeat steps 1, 2 and 3 for X generations

Step 5. Choose the fittest individual of the last generation, based on the stop criteria
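The steps above can be sketched as a short loop. This is a minimal illustration on a toy bit-string problem, not the MarkLogic/Hadoop implementation from the talk; the names `evolve`, `random_bits`, `count_zeros` and `one_point` are hypothetical, and only reproduction and crossover are applied.

```python
import random


def evolve(init, fitness, crossover, pop_size=500, generations=51,
           p_crossover=0.9, rng=random):
    """Toy GA loop; fitness is standardized (lower is better, 0 is perfect)."""
    # Step 0: random initial population
    population = [init(rng) for _ in range(pop_size)]
    for _ in range(generations):                      # Step 4: repeat
        # Step 1: evaluate every individual
        scores = [fitness(ind) for ind in population]
        # Step 2: selection weights favour low (good) scores
        weights = [1.0 / (1.0 + s) for s in scores]
        next_gen = []
        while len(next_gen) < pop_size:
            if rng.random() < p_crossover:
                # Step 3: crossover of two selected parents gives 2 offspring
                mom, dad = rng.choices(population, weights=weights, k=2)
                next_gen.extend(crossover(mom, dad, rng))
            else:
                # Step 3: reproduction copies one selected individual
                next_gen.append(rng.choices(population, weights=weights, k=1)[0])
        population = next_gen[:pop_size]
    # Step 5: fittest individual of the last generation
    return min(population, key=fitness)


# Toy problem (an assumption for illustration): evolve a string of eight 1s.
def random_bits(rng, n=8):
    return [rng.randint(0, 1) for _ in range(n)]


def count_zeros(bits):
    return bits.count(0)


def one_point(mom, dad, rng):
    cut = rng.randrange(1, len(mom))
    return mom[:cut] + dad[cut:], dad[:cut] + mom[cut:]
```

A call such as `evolve(random_bits, count_zeros, one_point, pop_size=50, generations=20)` returns the best bit string found; the defaults mirror the M=500, G=51 parameters used later in the talk.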

Page 10: Hadoop and Marklogic: Using the Genetic Algorithm to generate Source Code

Endemic GA Problems

• Finding the optimal solution to complex, high-dimensional, multimodal problems often requires a very expensive fitness function

• Hard to pose the problem statement, e.g. the stop criteria are not clear in every problem

• Premature convergence on local optima

Page 11: Hadoop and Marklogic: Using the Genetic Algorithm to generate Source Code

Bit strings vs Lisp Parse Trees

(+ (* 2 3) 4) evaluates to 10, and its symbolic expression looks like:

[Figure: parse tree with + at the root; its children are the subtree for (* 2 3) and the leaf 4]

Hierarchical computer programs are more expressive than linear strings
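To make the parse-tree idea concrete, here is a small sketch that represents the expression as nested tuples and evaluates it recursively. The tuple encoding and the `evaluate` helper are assumptions made for illustration, not something shown in the talk.

```python
import operator

# Function set for this toy example: operator symbol -> binary function
OPS = {"+": operator.add, "*": operator.mul}


def evaluate(node):
    """Evaluate a parse tree given as (operator, arg, arg, ...) tuples."""
    if isinstance(node, tuple):
        op, *args = node
        values = [evaluate(arg) for arg in args]
        result = values[0]
        for value in values[1:]:
            result = OPS[op](result, value)
        return result
    return node  # a terminal (a plain number)


# (+ (* 2 3) 4) as a tree: "+" at the root, "*" subtree and 4 as children
tree = ("+", ("*", 2, 3), 4)
print(evaluate(tree))
```

Because the program is a tree, a genetic operation can swap or replace a whole subtree such as `("*", 2, 3)` and still leave a valid program, which is much harder to guarantee with a flat bit string.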

Page 12: Hadoop and Marklogic: Using the Genetic Algorithm to generate Source Code

XSLT – markup is useful!

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
  <xsl:template match="a">
    <d/>
    <c/>
  </xsl:template>
</xsl:stylesheet>

[Figure: the same stylesheet drawn as a parse tree, with <xsl:stylesheet/> at the root, <xsl:template/> below it, and <d/> <c/> as leaves]

Obvious difficulties to address: different node types and xpath

Page 13: Hadoop and Marklogic: Using the Genetic Algorithm to generate Source Code

Objective: Generate an xslt program that transforms source xml into result xml which is equivalent to target xml

Terminal Set: <a/> <b/> <c/> <d/>

Function Set: Subset of xslt instructions

Fitness Cases: One fitness case

Raw fitness: Treediffmerge result, node count + standard diff

Standardized fitness: Same as raw fitness, approaching 0 is better fitness

Parameters: M=500, G=51

Page 14: Hadoop and Marklogic: Using the Genetic Algorithm to generate Source Code

Source XML

<a>
  <b>
    <c>
      <d></d>
    </c>
  </b>
</a>

Page 15: Hadoop and Marklogic: Using the Genetic Algorithm to generate Source Code

Target XML – clear stop criteria

<a>
  <b>
    <c>
      <d></d>
    </c>
  </b>
</a>

Page 16: Hadoop and Marklogic: Using the Genetic Algorithm to generate Source Code

Generation zero

• XML Instance Generator, which is part of the Sun Multi-Schema Validator

• Sun Multi-Schema Validator

• The following can do it
  – OxygenXML
  – Visual Studio
  – Eclipse

• Ended up using IBM XML Generate: very old, you supply it a schema and it generates example xml

Page 17: Hadoop and Marklogic: Using the Genetic Algorithm to generate Source Code

Step 1a: Evaluate against Input

[Figure: each xslt in the XSLT generation is applied to Source.xml; the transformation produces result.xml]

MarkLogic evaluates each xslt and stores the result in a property on the xslt document itself

Page 18: Hadoop and Marklogic: Using the Genetic Algorithm to generate Source Code

Step 1b: Evaluate Fitness

[Figure: as in Step 1a, each xslt in the XSLT generation is applied to Source.xml to produce result.xml; Hadoop then evaluates the fitness of each result]

Fitness is computed with treediffmerge + standard diff

Page 19: Hadoop and Marklogic: Using the Genetic Algorithm to generate Source Code

XML Diff issues

• Many diff algorithms are based on a paper published in 1976 by J. W. Hunt and M. D. McIlroy, An Algorithm for Differential File Comparison

• XML has a structure; text-based diff programs do not take this into account

• Simple example: <footie/> versus <footie></footie>. Logically these are equal

• XML canonicalization helps!
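The <footie/> example can be checked with a few lines of Python. This is a sketch using the standard library's C14N 2.0 support (`xml.etree.ElementTree.canonicalize`, Python 3.8+); the talk does not say which canonicalization tool was used, and the `canonical` helper name is hypothetical.

```python
from xml.etree.ElementTree import canonicalize


def canonical(xml_text):
    """Return the canonical (C14N 2.0) form, ignoring surrounding whitespace in text."""
    return canonicalize(xml_text, strip_text=True)


short_form = "<footie/>"
long_form = "<footie></footie>"

# A plain text comparison sees two different strings ...
print(short_form == long_form)
# ... but their canonical forms are identical.
print(canonical(short_form) == canonical(long_form))
```

Running both documents through canonicalization before diffing removes differences that are purely lexical, so the fitness measure only reacts to real structural differences.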

Page 20: Hadoop and Marklogic: Using the Genetic Algorithm to generate Source Code

TREEDIFFMERGE DIFFERENCE RESULTS

<?xml version="1.0" encoding="UTF-8"?>
<diff xmlns:diff='http://diff.org'>
  <diff:insert dst="1">
    <a>
      <b>
        <c>
          <d />
        </c>
      </b>
    </a>
  </diff:insert>
</diff>

<?xml version="1.0" encoding="UTF-8"?>
<root/>

<?xml version="1.0" encoding="UTF-8"?>
<diff xmlns:diff='http://diff.org'>
  <diff:copy src="2" dst="1">
    <diff:copy src="16" dst="2" />
  </diff:copy>
</diff>

<?xml version="1.0" encoding="utf-8"?>
<root><a/><a><a><c/><c><a><d/></a><c/></c></a><b><b/><a/><c/><b> <c> <d/> </c> </b></b><a/></a><d><a><c/><a/><a/></a><c/></d><c/></root>

XML Canonize + TreeDiffMerge

Page 21: Hadoop and Marklogic: Using the Genetic Algorithm to generate Source Code

Simple if we match: we are done!

<?xml version="1.0" encoding="UTF-8"?>
<diff />

<?xml version="1.0" encoding="utf-8"?>
<root><a> <b> <c> <d/> </c> </b></a></root>

Page 22: Hadoop and Marklogic: Using the Genetic Algorithm to generate Source Code

MarkLogic/Hadoop Architecture Interlude

[Figure: architecture diagram; labels: MarkLogic, Connector API via XDBC]

Page 23: Hadoop and Marklogic: Using the Genetic Algorithm to generate Source Code

From Hadoop pov

Page 24: Hadoop and Marklogic: Using the Genetic Algorithm to generate Source Code

Hadoop Installation Recipe

• installing Hadoop (setting up a single node cluster)
  – brew install hadoop
  – make sure ssh is set up properly
  – generate id_rsa and id_rsa.pub
  – append pub to auth keys
    • cat id_rsa.pub >> authorized_keys
  – enable Remote Login on Mac OS X

• configure hadoop
  – edit core-site.xml
  – edit mapred-site.xml

• ssh localhost
  – format hdfs
    • hadoop namenode -format

• bin/start-all.sh
  – if it asks for a password, you have a problem with your ssh setup

• to check that all is well
  – run jps
  – ps ax | grep hadoop | wc -l
  – Check
    • http://localhost:50030/jobtracker.jsp
    • http://localhost:50060/tasktracker.jsp
    • http://localhost:50070/dfshealth.jsp

Page 25: Hadoop and Marklogic: Using the Genetic Algorithm to generate Source Code

Installing ML Hadoop Connector

• Copy latest xcc and connector jars to hadoop lib

• Copy ml-examples jar as well

• Copy ml hadoop conf to hadoop conf

Page 26: Hadoop and Marklogic: Using the Genetic Algorithm to generate Source Code

Starting it all Up

• Start marklogic

• Create database

• Create xdbc connection (how hadoop/ml communicate)

• Edit marklogic-hello-world.xml

• Make sure hadoop is started

Page 27: Hadoop and Marklogic: Using the Genetic Algorithm to generate Source Code

Starting it all Up

• Load test Data via query console

xquery version "1.0-ml";

let $hello := <data><child>hello mom</child></data>
let $world := <data><child>world event</child></data>

return (
  xdmp:document-insert("hello.xml", $hello),
  xdmp:document-insert("world.xml", $world)
)

Page 28: Hadoop and Marklogic: Using the Genetic Algorithm to generate Source Code

Run hello world example

• bin/start-all.sh

• hadoop jar lib/marklogic-xcc-examples-6.0.20120914.jar com.marklogic.mapreduce.examples.HelloWorld

• Review https://gist.github.com/2484318

Page 29: Hadoop and Marklogic: Using the Genetic Algorithm to generate Source Code

Fitness (hadoop) step

• Applies XML canonicalization

• Performs treediffmerge, outputs and writes to original xslt document xml property

• Performs text diff and writes to original xslt document xml property
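A rough sketch of the text-diff half of this step is shown below. It assumes Python's standard `difflib` as a stand-in for the "standard diff"; TreeDiffMerge is not reproduced here, and the `raw_fitness` name and the one-tag-per-line tokenisation are illustrative choices, not the talk's implementation.

```python
import difflib
from xml.etree.ElementTree import canonicalize


def raw_fitness(result_xml, target_xml):
    """Count differing tokens between canonical result and target (0 = match)."""
    tokens = []
    for doc in (result_xml, target_xml):
        canonical = canonicalize(doc, strip_text=True)
        # Put each tag on its own line so the line-based diff sees structure
        tokens.append(canonical.replace("><", ">\n<").splitlines())
    diff = difflib.ndiff(tokens[0], tokens[1])
    # ndiff prefixes removed lines with "- " and added lines with "+ "
    return sum(1 for line in diff if line.startswith(("+ ", "- ")))


target = "<a><b><c><d></d></c></b></a>"
print(raw_fitness("<a><b><c><d/></c></b></a>", target))  # same document
print(raw_fitness("<root/>", target))                      # far from target
```

A result that is not well-formed XML would raise a parse error in `canonicalize`; a real fitness step would need to catch that and assign such individuals a very poor score.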

Page 30: Hadoop and Marklogic: Using the Genetic Algorithm to generate Source Code

Step 2. Select individuals

• Probabilistic selection to choose which individuals participate in genetic operation

Selected XSLT population

Select individuals for genetic operations, based on their fitness
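Probabilistic selection can be sketched as fitness-proportionate ("roulette wheel") sampling. The talk does not state which selection scheme was used, so this is one common choice; the `select` helper is hypothetical and expects standardized fitness scores, where lower is better.

```python
import random


def select(population, standardized, k, rng=random):
    """Sample k individuals; lower standardized fitness means higher odds."""
    # Turn "lower is better" scores into positive sampling weights
    weights = [1.0 / (1.0 + score) for score in standardized]
    return rng.choices(population, weights=weights, k=k)


stylesheets = ["xslt-1", "xslt-2", "xslt-3"]
scores = [0, 5, 50]  # xslt-1 already matches the target
print(select(stylesheets, scores, k=4))
```

Sampling is done with replacement, so a fit individual can be picked several times while a poor one still keeps a small chance of contributing material to the next generation.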

Page 31: Hadoop and Marklogic: Using the Genetic Algorithm to generate Source Code

About fitness

• Raw fitness: the natural representation in terms of the specific problem (primitive counting of nodes in the treediffmerge patch)

• Standardized fitness: the lower the better

• Adjusted fitness: lies between 0-1

• Normalized fitness: lies between 0-1 with sum of fitness values = 1

• In our case the lower the number of 'different' nodes the better, so use standardized fitness
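The relationship between these measures can be written down in a few lines. The formulas follow Koza's definitions (adjusted = 1 / (1 + standardized), normalized = adjusted divided by the population total); the function names are illustrative.

```python
def adjusted(standardized):
    """Map standardized fitness (0 is best) into (0, 1], where 1 is best."""
    return 1.0 / (1.0 + standardized)


def normalized(standardized_scores):
    """Adjusted fitness rescaled so the whole population sums to 1."""
    adj = [adjusted(s) for s in standardized_scores]
    total = sum(adj)
    return [a / total for a in adj]


scores = [0, 1, 3]            # e.g. number of 'different' nodes per individual
print([adjusted(s) for s in scores])
print(normalized(scores))
```

Normalized fitness can be used directly as selection probabilities, which is why it is defined to sum to 1.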

Page 32: Hadoop and Marklogic: Using the Genetic Algorithm to generate Source Code

Step 3. Apply Primary Genetic Operations

Selected XSLT population

New generation

Reproduction

Individual reproduced into new generation

Page 33: Hadoop and Marklogic: Using the Genetic Algorithm to generate Source Code

Step 3. Primary Genetic Operations

Selected XSLT population

New generation

Creates 2 offspring

‘Mom’

‘Dad’

Crossover ( Recombination )

Select parents then crossover creates 2 offspring

Page 34: Hadoop and Marklogic: Using the Genetic Algorithm to generate Source Code

Step 3. Primary Genetic Operations: Crossover ( Recombination )

‘Dad XSLT’‘Mom XSLT’

‘offspring xslt’

‘offspring xslt’

New generation

Swap nodes between selected parent xslt

Page 35: Hadoop and Marklogic: Using the Genetic Algorithm to generate Source Code

Crossover with xquery

xquery version "1.0-ml";
import module namespace mem = "http://xqdev.com/in-mem-update" at "/MarkLogic/appservices/utils/in-mem-update.xqy";

let $mom :=
  <xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:template match="/" as="item()*">
      <bar>help</bar>
    </xsl:template>
    <xsl:template match="text()" as="item()*"/>
  </xsl:stylesheet>
let $dad :=
  <xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:template match="/" as="item()*">
      <a><b><c>test</c></b></a>
    </xsl:template>
  </xsl:stylesheet>
let $momCount := fn:count($mom//.)
let $dadCount := fn:count($dad//.)
(: never want root node :)
let $momRdm := xdmp:random($momCount - 2) + 2
let $dadRdm := xdmp:random($dadCount - 2) + 2
(: node selection :)
let $momNode := ($mom//.)[$momRdm]
let $dadNode := ($dad//.)[$dadRdm]
(: crossover :)
let $newMom := mem:node-replace( $momNode, $dadNode )
let $newDad := mem:node-replace( $dadNode, $momNode )
return
  <result>
    <newMom>{$newMom}</newMom>
    <newDad>{$newDad}</newDad>
  </result>
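For readers without a MarkLogic instance, the same subtree swap can be tried in plain Python. This is a loose analogue of the XQuery above using `xml.etree.ElementTree`, not a port of it: it only swaps element nodes (the XQuery can also pick text nodes), and the `crossover` and `swap_points` names are hypothetical.

```python
import copy
import random
import xml.etree.ElementTree as ET


def swap_points(root):
    """Every (parent, index) slot that holds a non-root element."""
    return [(parent, i) for parent in root.iter() for i in range(len(parent))]


def crossover(mom, dad, rng=random):
    """Swap one randomly chosen subtree between copies of the two parents.

    Raises IndexError if a parent has no child elements to swap.
    """
    mom, dad = copy.deepcopy(mom), copy.deepcopy(dad)  # leave parents intact
    m_parent, m_idx = rng.choice(swap_points(mom))
    d_parent, d_idx = rng.choice(swap_points(dad))
    # Exchange the two subtrees in place
    m_parent[m_idx], d_parent[d_idx] = d_parent[d_idx], m_parent[m_idx]
    return mom, dad


mom = ET.fromstring("<r><bar>help</bar></r>")
dad = ET.fromstring("<r><a><b><c>test</c></b></a></r>")
new_mom, new_dad = crossover(mom, dad)
print(ET.tostring(new_mom, encoding="unicode"))
print(ET.tostring(new_dad, encoding="unicode"))
```

Never selecting the root mirrors the "never want root node" comment in the XQuery: swapping whole documents would just exchange the parents rather than recombine them.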

Page 36: Hadoop and Marklogic: Using the Genetic Algorithm to generate Source Code

Step 3. Secondary Genetic Operations

• Mutation: a form of random crossover

• Permutation: reorganize nodes

• Editing: evaluate a set of nodes

• Encapsulation: takes a branch and replaces it with 1 indivisible node

• Decimation: removes individuals based on domain-specific criteria

Page 37: Hadoop and Marklogic: Using the Genetic Algorithm to generate Source Code

Step 3. Secondary Genetic Operations: mutation

‘selected XSLT’

Pick a node and randomly mutate

Completely new set of instructions

‘offspring xslt’
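Mutation as described here (pick a node, replace it with a completely new set of instructions) might look like the sketch below. It works on plain XML elements drawn from the talk's terminal set rather than on real xslt instructions, and `random_subtree` and `mutate` are hypothetical helpers.

```python
import copy
import random
import xml.etree.ElementTree as ET

TERMINALS = ["a", "b", "c", "d"]  # the terminal set from the problem statement


def random_subtree(rng, depth=2):
    """Grow a small random tree of terminal elements."""
    node = ET.Element(rng.choice(TERMINALS))
    if depth > 0:
        for _ in range(rng.randint(0, 2)):
            node.append(random_subtree(rng, depth - 1))
    return node


def mutate(individual, rng=random):
    """Return a copy with one randomly chosen subtree replaced by a new one."""
    mutant = copy.deepcopy(individual)
    slots = [(parent, i) for parent in mutant.iter() for i in range(len(parent))]
    if not slots:
        # Nothing below the root yet: grow a first subtree instead
        mutant.append(random_subtree(rng))
        return mutant
    parent, idx = rng.choice(slots)
    parent[idx] = random_subtree(rng)
    return mutant


selected = ET.fromstring("<root><a><b/></a></root>")
print(ET.tostring(mutate(selected), encoding="unicode"))
```

Because the replacement is generated from scratch, mutation can introduce material that no parent contains, which is what lets a run escape a local optimum.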

Page 38: Hadoop and Marklogic: Using the Genetic Algorithm to generate Source Code

Step 3. Secondary Genetic Operations: permutation

‘selected XSLT’ ‘offspring xslt’

Permuted node order
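Permutation only reorders existing nodes. A minimal sketch, again on generic XML elements and with a hypothetical `permute` helper, shuffles the children of one randomly chosen node:

```python
import copy
import random
import xml.etree.ElementTree as ET


def permute(individual, rng=random):
    """Return a copy in which one node's children are randomly reordered."""
    offspring = copy.deepcopy(individual)
    # Only nodes with at least two children can be reordered
    candidates = [node for node in offspring.iter() if len(node) > 1]
    if candidates:
        node = rng.choice(candidates)
        children = list(node)
        rng.shuffle(children)
        node[:] = children
    return offspring


selected = ET.fromstring("<r><a/><b/><c/></r>")
print(ET.tostring(permute(selected), encoding="unicode"))
```

No nodes are added or removed, so permutation explores orderings of material the individual already has.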

Page 39: Hadoop and Marklogic: Using the Genetic Algorithm to generate Source Code

Step 3. Secondary Genetic Operations: editing

‘selected XSLT’ ‘offspring xslt’

Replace node with evaluated expression

Page 40: Hadoop and Marklogic: Using the Genetic Algorithm to generate Source Code

Step 3. Secondary Genetic Operations: encapsulation

‘selected XSLT’ ‘define new function’

Identify useful subtrees and encapsulate by defining new function

‘XSLT’

Page 41: Hadoop and Marklogic: Using the Genetic Algorithm to generate Source Code

Step 3. Secondary Genetic Operations

decimation

Identify very poor fitness individuals and remove from population

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0"></xsl:stylesheet>

<xsl:stylesheet/>
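Decimation reduces to a filter over the scored population. In this sketch the cut-off `worst_allowed` is a hypothetical, domain-specific parameter; the talk does not give the criterion that was used.

```python
def decimate(population, scores, worst_allowed):
    """Keep only individuals whose standardized fitness is at most worst_allowed."""
    return [ind for ind, score in zip(population, scores) if score <= worst_allowed]


population = ["xslt-1", "xslt-2", "empty-stylesheet"]
scores = [2, 7, 40]  # an empty stylesheet produces nothing, so it scores badly
print(decimate(population, scores, worst_allowed=10))
```

Removing hopeless individuals such as the empty stylesheet early saves fitness evaluations in later generations.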

Page 42: Hadoop and Marklogic: Using the Genetic Algorithm to generate Source Code

Initial tests

• Initial population = 500, generations = 51

• Set initial genetic operation probabilities:
  – 90% crossover on selected individuals
  – 10% reproduction on selected individuals
  – 0% secondary operations on selected individuals

Page 43: Hadoop and Marklogic: Using the Genetic Algorithm to generate Source Code

Results

• runs faster with more servers … extreme scale out – unusual for GA

• Arrived quickly at a 'correct' solution

• Though in some runs the local optimum was a 'wrong solution', e.g. an embedded literal

• Need to constrain xpath (baby steps)

• Need to constrain terminal set

• Enhance fitness definition

Page 44: Hadoop and Marklogic: Using the Genetic Algorithm to generate Source Code

Source XML

<a><b>

<c><d></d>

</c></b>

</a>

Page 45: Hadoop and Marklogic: Using the Genetic Algorithm to generate Source Code

Target XML

<a>
  <b/>
  <c/>
  <d/>
</a>

Page 46: Hadoop and Marklogic: Using the Genetic Algorithm to generate Source Code

Results

• Needed larger generations / more individuals

• Mutation operation needed to kick out of local optima

Page 47: Hadoop and Marklogic: Using the Genetic Algorithm to generate Source Code

Summary

• This approach can be applied to any language parse tree (xquery with xqueryparser.xq)

• Difficulties with little languages being embedded

• Today, commercially applicable to generating mapping solutions, more research required

• Illustrates applying the strengths of ML and Hadoop together

• Will place code and results on github soon …

Page 48: Hadoop and Marklogic: Using the Genetic Algorithm to generate Source Code

References

• John R. Koza, Genetic Programming, MIT Press, 1992

• J. W. Hunt and M. D. McIlroy, An Algorithm for Differential File Comparison, 1976

