Page 1

Pat Langley
Center for the Study of Language and Information
Stanford University, Stanford, California
http://cll.stanford.edu/~langley
langley@csli.stanford.edu

Computational Discovery of Explanatory Process Models

Thanks to N. Asgharbeygi, K. Arrigo, S. Bay, A. Pohorille, J. Sanchez, K. Saito, and J. Shrager for their contributions to this research.

Page 2

Data Mining vs. Scientific Discovery

There exist two computational paradigms for discovering explicit knowledge from data.

The data mining movement develops computational methods that:
induce predictive models from large, often business, data sets;
cast models as decision trees, logical rules, or other notations invented by AI researchers.

In contrast, computational scientific discovery focuses on:
constructing models from (often small) scientific data sets;
stating them in formalisms invented by scientists and engineers.

Both approaches draw on heuristic search to find regularities in data, but they differ considerably in their emphases.

Page 3

In Memoriam

Three years ago, computational scientific discovery lost two of its founding fathers:

Herbert A. Simon (1916 – 2001)
Jan M. Zytkow (1945 – 2001)

Both contributed to the field in many ways: posing new problems, inventing methods, training students, and organizing meetings.

Moreover, both were interdisciplinary researchers who contributed to computer science, psychology, philosophy, and statistics.

Herb Simon and Jan Zytkow were excellent role models whom we should all aim to emulate.

Page 4

Time Line for Research on Computational Scientific Discovery

[Timeline figure spanning 1979–2000, charting discovery systems such as Bacon.1–Bacon.5, Abacus, Coper, Fahrenheit, E*, Tetrad, IDSN, Hume, ARC, DST, GPN, LaGrange, SDS, SSF, RF5, LaGramge, Dalton, Stahl, RL, Progol, Gell-Mann, BR-3, Mendel, Pauli, Stahlp, Revolver, Dendral, AM, Glauber, NGlauber, IDSQ, Live, IE, Coast, Phineas, AbE, Kekada, Mechem, CDP, Astra, GPM, HR, and BR-4. Legend: numeric laws, qualitative laws, structural models, process models.]

Page 5

Successes of Computational Scientific Discovery

Over the past decade, systems of this type have helped discover new knowledge in many scientific fields:

qualitative chemical factors in mutagenesis (King et al., 1996)
quantitative laws of metallic behavior (Sleeman et al., 1997)
qualitative conjectures in number theory (Colton et al., 2000)
temporal laws of ecological behavior (Todorovski et al., 2000)
reaction pathways in catalytic chemistry (Valdes-Perez, 1994)

Each has led to publications in the refereed scientific literature (e.g., Langley, 2000), but none of them focused on systems science.

Page 6

The Nature of Systems Science

Disciplines like Earth science and computational biology differ from traditional fields in that they:

focus on synthesis rather than analysis in their operation;
rely on computer modeling as one of their central methods;
develop system-level models with many variables and relations;
evaluate their models on observational, not experimental, data.

Developing and testing such models are complex tasks that would benefit from computational aids.

However, existing methods for computational scientific discovery were not designed with systems science in mind.

Page 7

Observations from the Ross Sea

Page 8

Inductive Process Modeling

Our response is to design, construct, and evaluate computational methods for inductive process modeling, which:

represent scientific models as sets of quantitative processes;
use these models to predict and explain observational data;
search a space of process models to find good candidates;
utilize background knowledge to constrain this search.

This framework has great potential for aiding systems science, but it raises new computational challenges.

Page 9

Challenges of Inductive Process Modeling

Process model induction differs from typical learning tasks in that:

process models characterize behavior of dynamical systems;
variables are continuous but can have discontinuous behavior;
observations are not independently and identically distributed;
models may contain unobservable processes and variables;
multiple processes can interact to produce complex behavior.

Compensating factors include a focus on deterministic systems and the availability of background knowledge.

Page 10

Issue 1: Representing Scientific Models

To assist system scientists' modeling efforts, we must first encode candidate models that:

address observational rather than experimental data;
deal with dynamic systems that change over time;
have an explanatory rather than a descriptive character;
are causal in that they describe chains of effects;
contain quantitative relations and qualitative structure.

We need some formal way to represent such models that can be interpreted computationally.

Page 11

Why Are Existing Formalisms Inadequate?

[Slide illustrates four existing notations:]

systems of equations, e.g.,
d[ice_mass,t] = -(18 · heat) / 6.02
d[water_mass,t] = (18 · heat) / 6.02

regression trees, e.g., splits on B > 6, C > 0, and C > 4 with numeric leaves 14.3, 18.7, 11.5, 16.9

Horn clause programs, e.g.,
gcd(X,X,X).
gcd(X,Y,D) :- X<Y, Z is Y-X, gcd(X,Z,D).
gcd(X,Y,D) :- Y<X, gcd(Y,X,D).

hidden Markov models [state-transition diagram with observations such as x=12, y=18 and transition probabilities 0.3, 0.7, 1.0]

Page 12

A Process Model for an Aquatic Ecosystem

model Ross_Sea_Ecosystem

variables: phyto, nitro, residue, light, growth_rate, effective_light, ice_factor
observables: phyto, nitro, light, ice_factor

process phyto_loss
  equations: d[phyto,t,1] = -0.1 · phyto
             d[residue,t,1] = 0.1 · phyto

process phyto_growth
  equations: d[phyto,t,1] = growth_rate · phyto

process phyto_uptakes_nitro
  conditions: nitro > 0
  equations: d[nitro,t,1] = -1 · 0.204 · growth_rate · phyto

process growth_limitation
  equations: growth_rate = 0.23 · min(nitrate_rate, light_rate)

process nitrate_availability
  equations: nitrate_rate = nitrate / (nitrate + 5)

process light_availability
  equations: light_rate = effective_light / (effective_light + 50)

process light_attenuation
  equations: effective_light = light · ice_factor

Page 13

Advantages of Quantitative Process Models

Process models are a good target for discovery systems because:

they embed quantitative relations within qualitative structure
that refer to notations and mechanisms familiar to scientists;
they provide dynamical predictions of changes over time;
they offer causal and explanatory accounts of phenomena;
while retaining the modularity needed to support induction.

Quantitative process models provide an important alternative to formalisms used currently in computational discovery.

Page 14

Issue 2: Generating Predictions and Explanations

To utilize or evaluate a given process model, we must simulate its behavior over time:

specify initial values for input variables and time step size;
on each time step, determine which processes are active;
solve active algebraic/differential equations with known values;
propagate values and recursively solve other active equations;
when multiple processes influence the same variable, assume their effects are additive.

This performance method makes specific predictions that we can compare to observations (see the sketch below).
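
To make this scheme concrete, here is a minimal Python sketch of forward-Euler simulation over a simplified two-process model; the process set, coefficients, initial values, and step size are illustrative placeholders rather than the exact Ross Sea model or the actual IPM simulator.

```python
# Minimal sketch of the simulation scheme above: forward-Euler integration of a
# simplified two-process model. The processes, coefficients, initial values, and
# step size are illustrative placeholders, not the exact Ross Sea model.

def simulate(initial, dt=0.1, steps=1000):
    state = dict(initial)
    trajectory = [dict(state)]
    for _ in range(steps):
        derivs = {var: 0.0 for var in state}
        # Process: phytoplankton loss (no condition, so always active).
        derivs["phyto"] += -0.1 * state["phyto"]
        # Process: nutrient-limited growth, active only while its condition holds.
        if state["nitro"] > 0:
            growth_rate = 0.23 * state["nitro"] / (state["nitro"] + 5.0)
            derivs["phyto"] += growth_rate * state["phyto"]
            derivs["nitro"] += -0.204 * growth_rate * state["phyto"]
        # Effects of multiple processes on one variable are assumed additive.
        for var in state:
            state[var] += dt * derivs[var]
        trajectory.append(dict(state))
    return trajectory

# e.g., simulate({"phyto": 0.5, "nitro": 10.0})[-1]
```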

Page 15

Issue 3: Encoding Background Knowledge

To constrain candidate models, we can utilize available background knowledge about the domain.

Previous work has encoded background knowledge in terms of:

Horn clause programs (e.g., Towell & Shavlik, 1990)
context-free grammars (e.g., Dzeroski & Todorovski, 1997)
prior probability distributions (e.g., Friedman et al., 2000)

However, none of these notations are familiar to domain scientists, which suggests the need for another approach.

Page 16

Generic Processes as Background Knowledge

Our framework casts background knowledge as generic processes that specify:

the variables involved in a process and their types;
the parameters appearing in a process and their ranges;
the forms of conditions on the process; and
the forms of associated equations and their parameters.

Generic processes are building blocks from which one can compose a specific process model (one possible encoding is sketched below).
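
As one illustration, the sketch below shows how such a building block might be encoded in Python; the GenericProcess class and its fields are hypothetical names, not necessarily the representation used in IPM.

```python
# One possible encoding of a generic process as a reusable building block; the
# class and field names here are hypothetical, not the actual IPM implementation.
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class GenericProcess:
    name: str
    variables: Dict[str, str]                     # role -> required type, e.g. {"S": "species"}
    parameters: Dict[str, Tuple[float, float]]    # parameter -> allowed range
    conditions: List[str] = field(default_factory=list)   # e.g. ["N > tau"]
    equations: List[str] = field(default_factory=list)    # equation templates

# Example: a generic exponential loss of a species into a detritus pool.
exponential_loss = GenericProcess(
    name="exponential_loss",
    variables={"S": "species", "D": "detritus"},
    parameters={"alpha": (0.0, 1.0)},
    equations=["d[S,t,1] = -alpha * S", "d[D,t,1] = alpha * S"],
)
```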

Page 17

Generic Processes for Aquatic Ecosystems

generic process exponential_loss
  variables: S{species}, D{detritus}
  parameters: α [0, 1]
  equations: d[S,t,1] = -α · S
             d[D,t,1] = α · S

generic process remineralization
  variables: N{nutrient}, D{detritus}
  parameters: β [0, 1]
  equations: d[N,t,1] = β · D
             d[D,t,1] = -β · D

generic process grazing
  variables: S1{species}, S2{species}, D{detritus}
  parameters: ρ [0, 1], γ [0, 1]
  equations: d[S1,t,1] = γ · ρ · S1
             d[D,t,1] = (1 - γ) · ρ · S1
             d[S2,t,1] = -ρ · S1

generic process constant_inflow
  variables: N{nutrient}
  parameters: ν [0, 1]
  equations: d[N,t,1] = ν

generic process nutrient_uptake
  variables: S{species}, N{nutrient}
  parameters: τ [0, ∞], μ [0, 1], δ [0, 1]
  conditions: N > τ
  equations: d[S,t,1] = μ · S
             d[N,t,1] = -δ · μ · S

Page 18

Issue 4: Inducing Process Models

Induction: generic processes + training data → process model

Generic processes:

process exponential_growth
  variables: P {population}
  equations: d[P,t] = [0, 1, ∞] · P

process logistic_growth
  variables: P {population}
  equations: d[P,t] = [0, 1, ∞] · P · (1 - P / [0, 1, ∞])

process constant_inflow
  variables: I {inorganic_nutrient}
  equations: d[I,t] = [0, 1, ∞]

process consumption
  variables: P1 {population}, P2 {population}, nutrient_P2
  equations: d[P1,t] = [0, 1, ∞] · P1 · nutrient_P2
             d[P2,t] = -[0, 1, ∞] · P1 · nutrient_P2

process no_saturation
  variables: P {number}, nutrient_P {number}
  equations: nutrient_P = P

process saturation
  variables: P {number}, nutrient_P {number}
  equations: nutrient_P = P / (P + [0, 1, ∞])

Induced process model:

model AquaticEcosystem

variables: nitro, phyto, zoo, nutrient_nitro, nutrient_phyto
observables: nitro, phyto, zoo

process phyto_exponential_growth
  equations: d[phyto,t] = 0.1 · phyto

process zoo_logistic_growth
  equations: d[zoo,t] = 0.1 · zoo · (1 - zoo / 1.5)

process phyto_nitro_consumption
  equations: d[nitro,t] = -1 · phyto · nutrient_nitro
             d[phyto,t] = 1 · phyto · nutrient_nitro

process phyto_nitro_no_saturation
  equations: nutrient_nitro = nitro

process zoo_phyto_consumption
  equations: d[phyto,t] = -1 · zoo · nutrient_phyto
             d[zoo,t] = 1 · zoo · nutrient_phyto

process zoo_phyto_saturation
  equations: nutrient_phyto = phyto / (phyto + 0.5)

Page 19

A Method for Process Model Induction

We have implemented the IPM algorithm, which induces process models from generic components in four stages:

1. Find all ways to instantiate known generic processes with specific variables, subject to type constraints;
2. Combine instantiated processes into candidate generic models subject to additional constraints (e.g., number of processes);
3. For each generic model, carry out search through parameter space to find good coefficients;
4. Return the parameterized model with the best overall score.

The evaluation metric can be squared error or description length, e.g., M_D = (M_V + M_C) · log(n) + n · log(M_E). A sketch of these stages follows.
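
The following sketch illustrates stages 1, 2, and 4, reusing the hypothetical GenericProcess encoding from the earlier sketch; the function names, the binding scheme, and the score arguments are assumptions for illustration only.

```python
# Sketch of stages 1, 2, and 4 above, reusing the hypothetical GenericProcess
# encoding sketched earlier: type-consistent instantiation, bounded combination
# into candidate structures, and the description-length score from this slide.
import itertools
import math

def instantiations(generic, typed_vars):
    """All type-consistent bindings of a generic process's roles to variables."""
    roles = list(generic.variables.items())       # (role, required type) pairs
    candidates = [[v for v, t in typed_vars.items() if t == req] for _, req in roles]
    for combo in itertools.product(*candidates):
        if len(set(combo)) == len(combo):         # never bind one variable twice
            yield (generic.name, dict(zip((role for role, _ in roles), combo)))

def candidate_structures(instantiated, max_processes):
    """Combine instantiated processes into candidate model structures."""
    instantiated = list(instantiated)
    for k in range(1, max_processes + 1):
        yield from itertools.combinations(instantiated, k)

def description_length(m_v, m_c, n, m_e):
    """M_D = (M_V + M_C) * log(n) + n * log(M_E), as given above."""
    return (m_v + m_c) * math.log(n) + n * math.log(m_e)
```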

Page 20

Estimating Parameters in Process Models

To estimate the parameters for each generic model structure, the IPM algorithm:

1. Selects random initial values that fall within ranges specified in the generic processes;
2. Improves these parameters using the Levenberg-Marquardt method until it reaches a local optimum;
3. Generates new candidate values through random jumps along dimensions of the parameter vector and continues the search;
4. If no improvement occurs after N jumps, restarts the search from a new random initial point.

This multi-level method gives reasonable fits to time-series data from a number of domains, but it is computationally intensive (a sketch of the scheme appears below).
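
A minimal sketch of this multi-level search, assuming SciPy's Levenberg-Marquardt least-squares routine as the local optimizer; the residual callback, parameter ranges, and restart and jump counts are placeholders supplied by the caller rather than the actual IPM code.

```python
# Minimal sketch of the multi-level parameter search described above, assuming
# SciPy's Levenberg-Marquardt routine as the local optimizer. The residual
# callback should simulate the model and return prediction errors.
import numpy as np
from scipy.optimize import least_squares

def fit_parameters(residuals, ranges, n_restarts=10, n_jumps=20, rng=None):
    rng = rng or np.random.default_rng()
    lo = np.array([r[0] for r in ranges])
    hi = np.array([r[1] for r in ranges])
    best_params, best_cost = None, np.inf
    for _ in range(n_restarts):
        params = rng.uniform(lo, hi)          # 1. random start within declared ranges
        stale = 0
        while stale < n_jumps:
            fit = least_squares(residuals, params, method="lm")   # 2. local refinement
            if best_params is None or fit.cost < best_cost:
                best_params, best_cost = fit.x, fit.cost
                stale = 0
            else:
                stale += 1
            params = best_params.copy()       # 3. random jump along one dimension
            i = rng.integers(len(params))
            params[i] = rng.uniform(lo[i], hi[i])
        # 4. no improvement after n_jumps jumps: restart from a new random point
    return best_params, best_cost
```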

Page 21

More Issues in Process Model Induction

Inductive process modeling raises a number of issues that have clear analogues in other paradigms:

identifying conditions on component processes
inferring initial values of unobservable variables
keeping the structural search space tractable
reducing variance to mitigate overfitting effects

We have demonstrated promising responses to these problems within the IPM framework.

Page 22

Evaluation of the IPM Algorithm

To demonstrate IPM's ability to induce process models, we ran it on synthetic data for a known system:

1. We used the aquatic ecosystem model to generate data sets over 100 time steps for the variables nitro and phyto;
2. We replaced each 'true' value x with x · (1 + r · n), where r followed a Gaussian distribution (μ = 0, σ = 1) and n > 0;
3. We ran IPM on these noisy data, giving it type constraints and generic processes as background knowledge.

In two experiments, we let IPM determine the initial values and thresholds given the correct structure; in a third study, we let it search through a space of 256 generic model structures.
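
For illustration, the noise model above can be written as a one-line transformation; the array name and noise level in the usage comment are placeholders.

```python
# The noise model above in one function: each 'true' value x becomes
# x * (1 + r * n), with r drawn from a standard Gaussian and n the noise level.
import numpy as np

def add_multiplicative_noise(values, n, rng=None):
    rng = rng or np.random.default_rng()
    values = np.asarray(values, dtype=float)
    return values * (1.0 + rng.standard_normal(values.shape) * n)

# e.g., add_multiplicative_noise(phyto_series, n=0.05) for 5% noise
```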

Page 23

Experimental Results with IPM

The main results of our studies with IPM on synthetic data were:

1. The system infers accurate estimates for the initial values of unobservable variables like zoo and residue;
2. The system induces estimates of condition thresholds on nitro that are close to the target values; and
3. The MDL criterion selects the correct model structure in all runs with 5% noise, but only 40% of runs with 10% noise.

These results suggest that the basic approach is sound, but that we should consider additional MDL schemes and other responses to overfitting.

Page 24

Observations from the Ross Sea

Page 25

Results on Training Data from Ross Sea

Page 26

Results on Test Data from Ross Sea

Page 27

Collecting Data on Photosynthetic Processes

[Diagram of a continuous-culture (chemostat) experiment: external stimuli (e.g., light) are applied after an adaptation period and an equilibrium period, mRNA/cDNA samples are taken for microarray traces, and the health of the culture is tracked over time. Image sources: wwwscience.murdoch.edu.au/teach, www.affymetrix.com]

Page 28

Gene Expressions for Cyanobacteria

Page 29

Generic Processes for Photosynthesis Regulation

generic process translation
  variables: P{protein}, M{mRNA}
  parameters: α [0, 1]
  equations: d[P,t,1] = α · M

generic process transcription
  variables: M{mRNA}, R{rate}
  equations: d[M,t,1] = R

generic process regulate_one
  variables: R{rate}, S{signal}
  parameters: α [-1, 1]
  equations: R = α · S

generic process regulate_two
  variables: R{rate}, S{signal}
  parameters: α [-1, 1], β [0, 1]
  equations: R = α · S
             d[S,t,1] = -β · S

generic process automatic_degradation
  variables: C{concentration}
  conditions: C > 0
  parameters: α [0, 1]
  equations: d[C,t,1] = -α · C

generic process controlled_degradation
  variables: D{concentration}, E{concentration}
  conditions: D > 0, E > 0
  parameters: β [0, 1]
  equations: d[D,t,1] = -β · E
             d[E,t,1] = -β · E

generic process photosynthesis
  variables: L{light}, P{protein}, R{redox}, S{ROS}
  parameters: α [0, 1], β [0, 1]
  equations: d[R,t,1] = α · L · P
             d[S,t,1] = β · L · P

Page 30

A Process Model for Photosynthetic Regulation

model photo_regulation

variables: light, mRNA, protein, ROS, redox, transcription_rate
observables: light, mRNA

process photosynthesis
  equations: d[redox,t,1] = 0.0155 · light · protein
             d[ROS,t,1] = 0.019 · light · protein

process protein_translation
  equations: d[protein,t,1] = 7.54 · mRNA

process mRNA_transcription
  equations: d[mRNA,t,1] = transcription_rate

process regulate_one_1
  equations: transcription_rate = 0.99 · light

process regulate_two_2
  equations: transcription_rate = 1.203 · redox
             d[redox,t,1] = -0.0002 · redox

process automatic_degradation_1
  conditions: protein > 0
  equations: d[protein,t,1] = -1.91 · protein

process controlled_degradation_1
  conditions: redox > 0, ROS > 0
  equations: d[redox,t,1] = -0.0003 · ROS
             d[ROS,t,1] = -0.0003 · ROS

Page 31

Predictions from Best Parameterized Model

Page 32

Electric Power on the International Space Station

Page 33

Results on Battery Test Data

Page 34

Results on Data from Rinkobing Fjord

Page 35

Issue 5: Interfacing with Scientists

Because few scientists want to be replaced, we are developing an interactive environment that lets users:

specify a quantitative process model of the target system;
display and edit the model's structure and details graphically;
simulate the model's behavior over time and situations;
compare the model's predicted behavior to observations;
invoke a revision module in response to detected anomalies.

The environment offers computational assistance in forming and evaluating models but lets the user retain control.

Page 36

Viewing and Editing a Process Model

Page 37

Results of Revising the NPP Model

Initial model:

E = 0.56 · T1 · T2 · W
T2 = 1.18 / [(1 + e^(0.2 · (Topt - Tempc - 10))) · (1 + e^(0.3 · (Tempc - Topt - 10)))]
PET = 1.6 · (10 · Tempc / AHI)^A · PET-TW-M
SR ∈ {3.06, 4.35, 4.35, 4.05, 5.09, 3.06, 4.05, 4.05, 4.05, 5.09, 4.05}

RMSE on training data = 465.212 and r² = 0.799

Revised model:

E = 0.353 · T1^0.00 · T2^0.08 · W^0.00
T2 = 0.83 / [(1 + e^(1.0 · (Topt - Tempc - 6.34))) · (1 + e^(1.0 · (Tempc - Topt - 11.52)))]
PET = 1.6 · (10 · Tempc / AHI)^A · PET-TW-M
SR ∈ {0.61, 3.99, 2.44, 10.0, 2.21, 2.13, 2.04, 0.43, 1.35, 1.85, 1.61}

Cross-validated RMSE = 397.306 and r² = 0.853 [15% reduction]
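
As a worked reading of the revised equations (as reconstructed above), the snippet below evaluates T2 and E; the input values are arbitrary placeholders, not data from the study.

```python
# Worked reading of the revised equations above (as reconstructed here); the
# input values are arbitrary placeholders, not data from the study.
import math

def t2_revised(topt, tempc):
    return 0.83 / ((1 + math.exp(1.0 * (topt - tempc - 6.34)))
                   * (1 + math.exp(1.0 * (tempc - topt - 11.52))))

def e_revised(t1, t2, w):
    return 0.353 * (t1 ** 0.00) * (t2 ** 0.08) * (w ** 0.00)

print(e_revised(t1=1.0, t2=t2_revised(topt=20.0, tempc=18.0), w=0.8))
```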

Page 38

Intellectual Influences

Our approach to computational discovery incorporates ideas from many traditions:

computational scientific discovery (e.g., Langley et al., 1983);
theory revision in machine learning (e.g., Towell, 1991);
qualitative physics and simulation (e.g., Forbus, 1984);
languages for scientific simulation (e.g., STELLA, MATLAB);
interactive tools for data analysis (e.g., Schneiderman, 2001).

Our work combines, in novel ways, insights from machine learning, AI, programming languages, and human-computer interaction.

Page 39

Contributions of the Research

In summary, our work on computational scientific discovery has, in responding to various challenges, produced:

a new formalism for representing scientific process models;
a computational method for simulating these models' behavior;
an encoding for background knowledge as generic processes;
an algorithm for inducing process models from time-series data;
an interactive environment for model construction/utilization.

We have demonstrated this approach to model creation on domains from Earth science, microbiology, and engineering.

Page 40

Directions for Future Research

Despite our progress to date, we need further work in order to:

produce additional results on other scientific data sets
develop improved methods for fitting model parameters
extend the approach to handle data sets with missing values
implement heuristic methods for searching the structure space
utilize knowledge of subsystems to further constrain search
augment the modeling environment to make it more usable

Inductive process modeling has great potential to speed progress in systems science and engineering.

Page 41

End of Presentation

