
How to Grow a Mind: Statistics, Structure, and Abstraction

Joshua B. Tenenbaum,1* Charles Kemp,2 Thomas L. Griffiths,3 Noah D. Goodman4

Science 331, 1279–1285 (11 March 2011); DOI: 10.1126/science.1192788

1Department of Brain and Cognitive Sciences, Computer Science and Artificial Intelligence Laboratory (CSAIL), Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, MA 02139, USA. 2Department of Psychology, Carnegie Mellon University, Pittsburgh, PA 15213, USA. 3Department of Psychology, University of California, Berkeley, Berkeley, CA 94720, USA. 4Department of Psychology, Stanford University, Stanford, CA 94305, USA.

*To whom correspondence should be addressed. E-mail: [email protected]

In coming to understand the world—in learning concepts, acquiring language, and grasping causal relations—our minds make inferences that appear to go far beyond the data available. How do we do it? This review describes recent approaches to reverse-engineering human learning and cognitive development and, in parallel, engineering more humanlike machine learning systems. Computational models that perform probabilistic inference over hierarchies of flexibly structured representations can address some of the deepest questions about the nature and origins of human thought: How does abstract knowledge guide learning and reasoning from sparse data? What forms does our knowledge take, across different domains and tasks? And how is that abstract knowledge itself acquired?

The Challenge: How Does the Mind Get So Much from So Little?

For scientists studying how humans come to understand their world, the central challenge is this: How do our minds get so much from so little? We build rich causal models, make strong generalizations, and construct powerful abstractions, whereas the input data are sparse, noisy, and ambiguous—in every way far too limited. A massive mismatch looms between the information coming in through our senses and the outputs of cognition.

Consider the situation of a child learning the meanings of words. Any parent knows, and scientists have confirmed (1, 2), that typical 2-year-olds can learn how to use a new word such as “horse” or “hairbrush” from seeing just a few examples. We know they grasp the meaning, not just the sound, because they generalize: They use the word appropriately (if not always perfectly) in new situations. Viewed as a computation on sensory input data, this is a remarkable feat. Within the infinite landscape of all possible objects, there is an infinite but still highly constrained subset that can be called “horses” and another for “hairbrushes.” How does a child grasp the boundaries of these subsets from seeing just one or a few examples of each? Adults face the challenge of learning entirely novel object concepts less often, but they can be just as good at it (Fig. 1).

Generalization from sparse data is central in learning many aspects of language, such as syntactic constructions or morphological rules (3). It presents most starkly in causal learning: Every statistics class teaches that correlation does not imply causation, yet children routinely infer causal links from just a handful of events (4), far too small a sample to compute even a reliable correlation! Perhaps the deepest accomplishment of cognitive development is the construction of larger-scale systems of knowledge: intuitive theories of physics, psychology, or biology or rule systems for social structure or moral judgment. Building these systems takes years, much longer than learning a single new word or concept, but on this scale too the final product of learning far outstrips the data observed (5–7).

Philosophers have inquired into these puzzles for over two thousand years, most famously as “the problem of induction,” from Plato and Aristotle through Hume, Whewell, and Mill to Carnap, Quine, Goodman, and others in the 20th century (8). Only recently have these questions become accessible to science and engineering by viewing inductive learning as a species of computational problems and the human mind as a natural computer evolved for solving them.

The proposed solutions are, in broad strokes, just what philosophers since Plato have suggested. If the mind goes beyond the data given, another source of information must make up the difference. Some more abstract background knowledge must generate and delimit the hypotheses learners consider, or meaningful generalization would be impossible (9, 10). Psychologists and linguists speak of “constraints;” machine learning and artificial intelligence researchers, “inductive bias;” statisticians, “priors.”

This article reviews recent models of human learning and cognitive development arising at the intersection of these fields. What has come to be known as the “Bayesian” or “probabilistic” approach to reverse-engineering the mind has been heavily influenced by the engineering successes of Bayesian artificial intelligence and machine learning over the past two decades (9, 11) and, in return, has begun to inspire more powerful and more humanlike approaches to machine learning.

As with “connectionist” or “neural network” models of cognition (12) in the 1980s (the last moment when all these fields converged on a common paradigm for understanding the mind), the labels “Bayesian” or “probabilistic” are merely placeholders for a set of interrelated principles and theoretical claims. The key ideas can be thought of as proposals for how to answer three central questions:

1) How does abstract knowledge guide learning and inference from sparse data?

2) What forms does abstract knowledge take, across different domains and tasks?

3) How is abstract knowledge itself acquired?

We will illustrate the approach with a focus on two archetypal inductive problems: learning concepts and learning causal relations. We then briefly discuss open challenges for a theory of human cognitive development and conclude with a summary of the approach’s contributions.

We will also draw contrasts with two earlier approaches to the origins of knowledge: nativism and associationism (or connectionism). These approaches differ in whether they propose stronger or weaker capacities as the basis for answering the questions above. Bayesian models typically combine richly structured, expressive knowledge representations (question 2) with powerful statistical inference engines (questions 1 and 3), arguing that only a synthesis of sophisticated approaches to both knowledge representation and inductive inference can account for human intelligence. Until recently it was not understood how this fusion could work computationally. Cognitive modelers were forced to choose between two alternatives (13): powerful statistical learning operating over the simplest, unstructured forms of knowledge, such as matrices of associative weights in connectionist accounts of semantic cognition (12, 14), or richly structured symbolic knowledge equipped with only the simplest, nonstatistical forms of learning: checks for logical inconsistency between hypotheses and observed data, as in nativist accounts of language acquisition (15). It appeared necessary to accept either that people’s abstract knowledge is not learned or induced in a nontrivial sense from experience (hence essentially innate) or that human knowledge is not nearly as abstract or structured (as “knowledge-like”) as it seems (hence simply associations). Many developmental researchers rejected this choice altogether and pursued less formal approaches to describing the growing minds of children, under the headings of “constructivism” or the “theory theory” (5). The potential to explain how people can genuinely learn with abstract structured knowledge may be the most distinctive feature of Bayesian models: the biggest reason for their recent popularity (16) and the biggest target of skepticism from their critics (17).

The Role of Abstract Knowledge

Over the past decade, many aspects of higher-level cognition have been illuminated by the mathematics of Bayesian statistics: our sense of similarity (18), representativeness (19), and randomness (20); coincidences as a cue to hidden causes (21); judgments of causal strength (22) and evidential support (23); diagnostic and conditional reasoning (24, 25); and predictions about the future of everyday events (26).

The claim that human minds learn and reason according to Bayesian principles is not a claim that the mind can implement any Bayesian inference. Only those inductive computations that the mind is designed to perform well, where biology has had time and cause to engineer effective and efficient mechanisms, are likely to be understood in Bayesian terms. In addition to the general cognitive abilities just mentioned, Bayesian analyses have shed light on many specific cognitive capacities and modules that result from rapid, reliable, unconscious processing, including perception (27), language (28), memory (29, 30), and sensorimotor systems (31). In contrast, in tasks that require explicit conscious manipulations of probabilities as numerical quantities—a recent cultural invention that few people become fluent with, and only then after sophisticated training—judgments can be notoriously biased away from Bayesian norms (32).

At heart, Bayes’s rule is simply a tool for answering question 1: How does abstract knowledge guide inference from incomplete data? Abstract knowledge is encoded in a probabilistic generative model, a kind of mental model that describes the causal processes in the world giving rise to the learner’s observations as well as unobserved or latent variables that support effective prediction and action if the learner can infer their hidden state. Generative models must be probabilistic to handle the learner’s uncertainty about the true states of latent variables and the true causal processes at work. A generative model is abstract in two senses: It describes not only the specific situation at hand, but also a broader class of situations over which learning should generalize, and it captures in parsimonious form the essential world structure that causes learners’ observations and makes generalization possible.

Bayesian inference gives a rational framework for updating beliefs about latent variables in generative models given observed data (33, 34). Background knowledge is encoded through a constrained space of hypotheses H about possible values for the latent variables, candidate world structures that could explain the observed data. Finer-grained knowledge comes in the “prior probability” P(h), the learner’s degree of belief in a specific hypothesis h prior to (or independent of) the observations. Bayes’s rule updates priors to “posterior probabilities” P(h|d) conditional on the observed data d:

$$P(h \mid d) = \frac{P(d \mid h)\,P(h)}{\sum_{h' \in H} P(d \mid h')\,P(h')} \propto P(d \mid h)\,P(h) \qquad (1)$$

The posterior probability is proportional to the product of the prior probability and the likelihood P(d|h), measuring how expected the data are under hypothesis h, relative to all other hypotheses h′ in H.

To illustrate Bayes’s rule in action, suppose we observe John coughing (d), and we consider three hypotheses as explanations: John has h1, a cold; h2, lung disease; or h3, heartburn. Intuitively only h1 seems compelling. Bayes’s rule explains why. The likelihood favors h1 and h2 over h3: only colds and lung disease cause coughing and thus elevate the probability of the data above baseline. The prior, in contrast, favors h1 and h3 over h2: Colds and heartburn are much more common than lung disease. Bayes’s rule weighs hypotheses according to the product of priors and likelihoods and so yields only explanations like h1 that score highly on both terms.
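
As a minimal sketch of Eq. 1 in action, the fragment below (in Python) runs the coughing example end to end. The prior and likelihood numbers are invented for illustration; only their qualitative ordering follows the argument above.

    # Hypotheses: explanations for the observed datum d = "John is coughing".
    # Priors: colds and heartburn are common; lung disease is rare.
    prior = {"cold": 0.60, "lung disease": 0.05, "heartburn": 0.35}
    # Likelihoods P(d|h): colds and lung disease cause coughing; heartburn does not.
    likelihood = {"cold": 0.8, "lung disease": 0.9, "heartburn": 0.01}

    # Bayes's rule: posterior is proportional to likelihood * prior,
    # normalized over the hypothesis space H.
    score = {h: likelihood[h] * prior[h] for h in prior}
    evidence = sum(score.values())
    posterior = {h: s / evidence for h, s in score.items()}

    for h in sorted(posterior, key=posterior.get, reverse=True):
        print(f"P({h} | cough) = {posterior[h]:.3f}")
    # Only "cold" scores highly on both prior and likelihood, so it dominates.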

Fig. 1. Human children learning names for object concepts routinely make strong generalizations from just a few examples. The same processes of rapid generalization can be studied in adults learning names for novel objects created with computer graphics. (A) Given these alien objects and three examples (boxed in red) of “tufas” (a word in the alien language), which other objects are tufas? Almost everyone selects just the objects boxed in gray (75). (B) Learning names for categories can be modeled as Bayesian inference over a tree-structured domain representation (2). Objects are placed at the leaves of the tree, and hypotheses about categories that words could label correspond to different branches. Branches at different depths pick out hypotheses at different levels of generality (e.g., Clydesdales, draft horses, horses, animals, or living things). Priors are defined on the basis of branch length, reflecting the distinctiveness of categories. Likelihoods assume that examples are drawn randomly from the branch that the word labels, favoring lower branches that cover the examples tightly; this captures the sense of suspicious coincidence when all examples of a word cluster in the same part of the tree. Combining priors and likelihoods yields posterior probabilities that favor generalizing across the lowest distinctive branch that spans all the observed examples (boxed in gray).



The same principles can explain how people learn from sparse data. In concept learning, the data might correspond to several example objects (Fig. 1) and the hypotheses to possible extensions of the concept. Why, given three examples of different kinds of horses, would a child generalize the word “horse” to all and only horses (h1)? Why not h2, “all horses except Clydesdales”; h3, “all animals”; or any other rule consistent with the data? Likelihoods favor the more specific patterns, h1 and h2; it would be a highly suspicious coincidence to draw three random examples that all fall within the smaller sets h1 or h2 if they were actually drawn from the much larger h3 (18). The prior favors h1 and h3, because as more coherent and distinctive categories, they are more likely to be the referents of common words in language (1). Only h1 scores highly on both terms. Likewise, in causal learning, the data could be co-occurrences between events; the hypotheses, possible causal relations linking the events. Likelihoods favor causal links that make the co-occurrence more probable, whereas priors favor links that fit with our background knowledge of what kinds of events are likely to cause which others; for example, a disease (e.g., cold) is more likely to cause a symptom (e.g., coughing) than the other way around.
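
A sketch of this trade-off under the "size principle" of ref. 18, which assumes each example is sampled uniformly from the concept's true extension. The extension sizes and prior values below are invented placeholders, not numbers from the reviewed work.

    # Three nested hypotheses for the word "horse", with hypothetical
    # extension sizes and priors favoring coherent, distinctive categories.
    hypotheses = {
        "h1: all horses":                (100,    0.45),
        "h2: horses minus Clydesdales":  (95,     0.10),
        "h3: all animals":               (10_000, 0.45),
    }
    n = 3  # three examples, all horses and none of them Clydesdales

    # Size principle: under random sampling from the extension of h,
    # P(examples | h) = (1 / |h|)^n, so smaller hypotheses are favored.
    score = {h: (1.0 / size) ** n * prior
             for h, (size, prior) in hypotheses.items()}
    total = sum(score.values())
    for h, s in score.items():
        print(f"P({h} | data) = {s / total:.4f}")
    # h2 loses on the prior, h3 loses on the likelihood; only h1 wins on both.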

The Form of Abstract Knowledge

Abstract knowledge provides essential constraints for learning, but in what form? This is just question 2. For complex cognitive tasks such as concept learning or causal reasoning, it is impossible to simply list every logically possible hypothesis along with its prior and likelihood. Some more sophisticated forms of knowledge representation must underlie the probabilistic generative models needed for Bayesian cognition.

In traditional associative or connectionist approaches, statistical models of learning were defined over large numerical vectors. Learning was seen as estimating strengths in an associative memory, weights in a neural network, or parameters of a high-dimensional nonlinear function (12, 14). Bayesian cognitive models, in contrast, have had most success defining probabilities over more structured symbolic forms of knowledge representations used in computer science and artificial intelligence, such as graphs, grammars, predicate logic, relational schemas, and functional programs. Different forms of representation are used to capture people’s knowledge in different domains and tasks and at different levels of abstraction.

In learning words and concepts from examples, the knowledge that guides both children’s and adults’ generalizations has been well described using probabilistic models defined over tree-structured representations (Fig. 1B) (2, 35). Reasoning about other biological concepts for natural kinds (e.g., given that cows and rhinos have protein X in their muscles, how likely is it that horses or squirrels do?) is also well described by Bayesian models that assume nearby objects in the tree are likely to share properties (36). However, trees are by no means a universal representation. Inferences about other kinds of categories or properties are best captured by using probabilistic models with different forms (Fig. 2): two-dimensional spaces or grids for reasoning about geographic properties of cities, one-dimensional orders for reasoning about values or abilities, or directed networks for causally transmitted properties of species (e.g., diseases) (36).

Knowledge about causes and effects more generally can be expressed in a directed graphical model (9, 11): a graph structure where nodes represent variables and directed edges between nodes represent probabilistic causal links. In a medical setting, for instance (Fig. 3A), nodes might represent whether a patient has a cold, a cough, a fever or other conditions, and the presence or absence of edges indicates that colds tend to cause coughing and fever but not chest pain; lung disease tends to cause coughing and chest pain but not fever; and so on.
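
To show how such a graph supports diagnostic inference, here is a hedged sketch that encodes the cold/lung-disease fragment above as a directed graphical model with noisy-OR links and answers a query by brute-force enumeration. The edge structure follows the text, but every numerical parameter is invented.

    from itertools import product

    P_COLD, P_LUNG = 0.2, 0.01   # hypothetical disease base rates
    LEAK = 0.01                  # symptoms also arise spontaneously at this rate
    # strength[symptom][disease]: chance the disease alone produces the symptom;
    # a missing entry encodes a missing edge in the graph.
    strength = {
        "cough":      {"cold": 0.7, "lung": 0.8},
        "fever":      {"cold": 0.4},   # no lung disease -> fever edge
        "chest pain": {"lung": 0.6},   # no cold -> chest pain edge
    }

    def p_symptom(symptom, cold, lung):
        """Noisy-OR: the symptom is absent only if every active cause fails."""
        parents = {"cold": cold, "lung": lung}
        p_absent = 1.0 - LEAK
        for disease, s in strength[symptom].items():
            if parents[disease]:
                p_absent *= 1.0 - s
        return 1.0 - p_absent

    def joint(cold, lung, cough):
        """P(cold, lung, cough); the unobserved symptoms marginalize out."""
        p = (P_COLD if cold else 1 - P_COLD) * (P_LUNG if lung else 1 - P_LUNG)
        pc = p_symptom("cough", cold, lung)
        return p * (pc if cough else 1 - pc)

    # Diagnostic query by enumeration: how probable is a cold given a cough?
    num = sum(joint(1, lung, 1) for lung in (0, 1))
    den = sum(joint(c, l, 1) for c, l in product((0, 1), repeat=2))
    print(f"P(cold | cough) = {num / den:.3f}")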

Such a causal map represents a simple kind of intuitive theory (4), but learning causal networks from limited data depends on the constraints of more abstract knowledge. For example, learning causal dependencies between medical conditions is enabled by a higher-level framework theory (37) specifying two classes of variables (or nodes), diseases and symptoms, and the tendency for causal relations (or graph edges) to run from diseases to symptoms, rather than within these classes or from symptoms to diseases (Fig. 3, A to C). This abstract framework can be represented by using probabilistic models defined over relational data structures such as graph schemas (9, 38), templates for graphs based on types of nodes, or probabilistic graph grammars (39), similar in spirit to the probabilistic grammars for strings that have become standard for representing linguistic knowledge (28). At the most abstract level, the very concept of causality itself, in the sense of a directed relationship that supports intervention or manipulation by an external agent (40), can be formulated as a set of logical laws expressing constraints on the structure of directed graphs relating actions and observable events (Fig. 3D).

Each of these forms of knowledge makes different kinds of prior distributions natural to define and therefore imposes different constraints on induction. Successful generalization depends on getting these constraints right. Although inductive constraints are often graded, it is easiest to appreciate the effects of qualitative constraints that simply restrict the hypotheses learners can consider (i.e., setting priors for many logically possible hypotheses to zero). For instance, in learning concepts over a domain of n objects, there are 2^n subsets and hence 2^n logically possible hypotheses for the extension of a novel concept. Assuming concepts correspond to the branches of a specific binary tree over the objects, as in Fig. 1B, restricts this space to only n − 1 hypotheses. In learning a causal network over 16 variables, there are roughly 10^46 logically possible hypotheses (directed acyclic graphs), but a framework theory restricting hypotheses to bipartite disease-symptom graphs reduces this to roughly 10^23 hypotheses. Knowing which variables belong to the disease and symptom classes further restricts this to roughly 10^18 networks. The smaller the hypothesis space, the more accurately a learner can be expected to generalize, but only as long as the true structure to be learned remains within or near (in a probabilistic sense) the learner’s hypothesis space (10). It is no coincidence then that our best accounts of people’s mental representations often resemble simpler versions of how scientists represent the same domains, such as tree structures for biological species. A compact description that approximates how the grain of the world actually runs offers the most useful form of constraint on inductive learning.
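
Two of these counts can be reproduced directly. The sketch below counts labeled directed acyclic graphs with Robinson's recurrence, then counts bipartite networks assuming the 6 "disease" and 10 "symptom" variables of Fig. 3 with edges allowed only from diseases to symptoms (the intermediate ~10^23 figure additionally sums over unknown class assignments and is not reproduced here).

    from functools import lru_cache
    from math import comb

    @lru_cache(maxsize=None)
    def count_dags(n):
        """Robinson's recurrence for the number of DAGs on n labeled nodes."""
        if n == 0:
            return 1
        return sum((-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k))
                   * count_dags(n - k) for k in range(1, n + 1))

    print(f"all DAGs on 16 variables: {count_dags(16):.1e}")  # ~10^46
    # With disease/symptom class labels known, each of the 6 x 10 possible
    # disease -> symptom links is independently present or absent:
    print(f"known bipartite classes:  {2 ** (6 * 10):.1e}")   # ~10^18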

The Origins of Abstract Knowledge

The need for abstract knowledge and the need to get it right bring us to question 3: How do learners learn what they need to know to make learning possible? How does a child know which tree structure is the right way to organize hypotheses for word learning? At a deeper level, how can a learner know that a given domain of entities and concepts should be represented by using a tree at all, as opposed to a low-dimensional space or some other form? Or, in causal learning, how do people come to correct framework theories such as knowledge of abstract disease and symptom classes of variables with causal links from diseases to symptoms?

The acquisition of abstract knowledge or new inductive constraints is primarily the province of cognitive development (5, 7). For instance, children learning words initially assume a flat, mutually exclusive division of objects into nameable clusters; only later do they discover that categories should be organized into tree-structured hierarchies (Fig. 1B) (41). Such discoveries are also pivotal in scientific progress: Mendeleev launched modern chemistry with his proposal of a periodic structure for the elements. Linnaeus famously proposed that relationships between biological species are best explained by a tree structure, rather than a simpler linear order (premodern Europe’s “great chain of being”) or some other form.

Such structural insights have long been viewed by psychologists and philosophers of science as deeply mysterious in their mechanisms, more magical than computational. Conventional algorithms for unsupervised structure discovery in statistics and machine learning—hierarchical clustering, principal components analysis, multidimensional scaling, clique detection—assume a single fixed form of structure (42). Unlike human children or scientists, they cannot learn multiple forms of structure or discover new forms in novel data. Neither traditional approach to cognitive development has a fully satisfying response: Nativists have assumed that, if different domains of cognition are represented in qualitatively different ways, those forms must be innate (43, 44); connectionists have suggested these representations may be learned but in a generic system of associative weights that at best only approximates trees, causal networks, and other forms of structure people appear to know explicitly (14).

Recently cognitive modelers have begun to answer these challenges by combining the structured knowledge representations described above with state-of-the-art tools from Bayesian statistics. Hierarchical Bayesian models (HBMs) (45) address the origins of hypothesis spaces and priors by positing not just a single level of hypotheses to explain the data but multiple levels: hypothesis spaces of hypothesis spaces, with priors on priors. Each level of a HBM generates a probability distribution on variables at the level below. Bayesian inference across all levels allows hypotheses and priors needed for a specific learning task to themselves be learned at larger or longer time scales, at the same time as they constrain lower-level learning. In machine learning and artificial intelligence (AI), HBMs have primarily been used for transfer learning: the acquisition of inductive constraints from experience in previous related tasks (46). Transfer learning is critical for humans as well (SOM text and figs. S1 and S2), but here we focus on the role of HBMs in explaining how people acquire the right forms of abstract knowledge.
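
Here is a minimal sketch of "priors on priors", in the spirit of learning overhypotheses (46): each bag of marbles has its own color proportion theta, and a shared hyperparameter pair (alpha, beta) describing how those proportions vary across bags is itself inferred. The data, grids, and brute-force grid inference below are invented for illustration and are not taken from any of the reviewed models.

    from itertools import product
    from math import comb

    # Observed draws per bag: (black draws, total draws). Most bags look pure.
    data = [(4, 4), (5, 5), (4, 4), (0, 4)]

    thetas = [i / 20 for i in range(1, 20)]   # grid over each bag's theta
    hypers = [0.1, 0.5, 1.0, 2.0, 5.0]        # grid over alpha and beta

    def beta_weight(theta, a, b):
        """Unnormalized Beta(a, b) density; normalized below over the grid."""
        return theta ** (a - 1) * (1 - theta) ** (b - 1)

    marginal = {}
    for a, b in product(hypers, repeat=2):
        w = [beta_weight(t, a, b) for t in thetas]
        z = sum(w)
        p = 1.0
        for k, n in data:   # integrate each bag's theta out under Beta(a, b)
            p *= sum(wi * comb(n, k) * t ** k * (1 - t) ** (n - k)
                     for wi, t in zip(w, thetas)) / z
        marginal[(a, b)] = p   # with a flat hyperprior, posterior is prop. to this

    a, b = max(marginal, key=marginal.get)
    print(f"most probable hyperparameters: alpha={a}, beta={b}")
    # A U-shaped Beta prior (small alpha and beta) wins: the model has learned
    # the overhypothesis that bags tend to be near-pure, so a single draw from
    # a new bag already licenses a strong generalization about that bag.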

Kemp and Tenenbaum (36, 47) showed how HBMs defined over graph- and grammar-based representations can discover the form of structure governing similarity in a domain. Structures of different forms—trees, clusters, spaces, rings, orders, and so on—can all be represented as graphs, whereas the abstract principles underlying each form are expressed as simple grammatical rules for growing graphs of that form. Embedded in a hierarchical Bayesian framework, this approach can discover the correct forms of structure (the grammars) for many real-world domains, along with the best structure (the graph) of the appropriate form (Fig. 2). In particular, it can infer that a hierarchical organization for the novel objects in Fig. 1A (such as Fig. 1B) better fits the similarities people see in these objects, compared to alternative representations such as a two-dimensional space.

Fig. 2. Kemp and Tenenbaum (47) showed how the form of structure in a domain can be discovered by using a HBM defined over graph grammars. At the bottom level of the model is a data matrix D of objects and their properties, or similarities between pairs of objects. Each square of the matrix represents whether a given feature (column) is observed for a given object (row). One level up is the structure S, a graph of relations between objects that describes how the features in D are distributed. Intuitively, objects nearby in the graph are expected to share similar feature values; technically, the graph Laplacian parameterizes the inverse covariance of a gaussian distribution with one dimension per object, and each feature is drawn independently from that distribution. The highest level of abstract principles specifies the form F of structure in the domain, in terms of grammatical rules for growing a graph S of a constrained form out of an initial seed node. Red arrows represent P(S|F) and P(D|S), the conditional probabilities that each level specifies for the level below. A search algorithm attempts to find both the form F and the structure S of that form that jointly maximize the posterior probability P(S,F|D), a function of the product of P(D|S) and P(S|F). (A) Given as data the features of animals, the algorithm finds a tree structure with intuitively sensible categories at multiple scales. (B) The same algorithm discovers that the voting patterns of U.S. Supreme Court judges are best explained by a linear “left-right” spectrum. (C) Subjective similarities among colors are best explained by a circular ring. (D) Given proximities between cities on the globe, the algorithm discovers a cylindrical representation analogous to latitude and longitude: the cross product of a ring and a chain. (E) Given images of realistically synthesized faces varying in two dimensions, race and masculinity, the algorithm successfully recovers the underlying two-dimensional grid structure: a cross product of two chains.



Hierarchical Bayesian models can also be used to learn abstract causal knowledge, such as the framework theory of diseases and symptoms (Fig. 3), and other simple forms of intuitive theories (38). Mansinghka et al. (48) showed how a graph schema representing two classes of variables, diseases and symptoms, and a preference for causal links running from disease to symptom variables can be learned from the same data that support learning causal links between specific diseases and symptoms and be learned just as fast or faster (Fig. 3, B and C). The learned schema in turn dramatically accelerates learning of specific causal relations (the directed graph structure) at the level below. Getting the big picture first—discovering that diseases cause symptoms before pinning down any specific disease-symptom links—and then using that framework to fill in the gaps of specific knowledge is a distinctively human mode of learning. It figures prominently in children’s development and scientific progress but has not previously fit into the landscape of rational or statistical learning models.


Fig. 3. HBMs defined over graph schemas can explain how intuitive theories are acquired and used to learn about specific causal relations from limited data (38). (A) A simple medical reasoning domain might be described by relations among 16 variables: The first six encode presence or absence of “diseases” (top row), with causal links to the next 10 “symptoms” (bottom row). This network can also be visualized as a matrix (top right, links shown in black). The causal learning task is to reconstruct this network based on observing data D on the states of these 16 variables in a set of patients. (B) A two-level HBM formalizes bottom-up causal learning or learning with an uninformative prior on networks. The bottom level is the data matrix D. The second level (structure) encodes hypothesized causal networks: a grayscale matrix visualizes the posterior probability that each pairwise causal link exists, conditioned on observing n patients; compare this matrix with the black-and-white ground truth matrix shown in (A). The true causal network can be recovered perfectly only from observing very many patients (n = 1000; not shown). With n = 80, spurious links (gray squares) are inferred, and with n = 20 almost none of the true structure is detected. (C) A three-level nonparametric HBM (48) adds a level of abstract principles, represented by a graph schema. The schema encodes a prior on the level below (causal network structure) that constrains and thereby accelerates causal learning. Both schema and network structure are learned from the same data observed in (B). The schema discovers the disease-symptom framework theory by assigning variables 1 to 6 to class C1, variables 7 to 16 to class C2, and a prior favoring only C1 → C2 links. These assignments, along with the effective number of classes (here, two), are inferred automatically via the Bayesian Occam’s razor. Although this three-level model has many more degrees of freedom than the model in (B), learning is faster and more accurate. With n = 80 patients, the causal network is identified near perfectly. Even n = 20 patients are sufficient to learn the high-level C1 → C2 schema and thereby to limit uncertainty at the network level to just the question of which diseases cause which symptoms. (D) A HBM for learning an abstract theory of causality (62). At the highest level are laws expressed in first-order logic representing the abstract properties of causal relationships, the role of exogenous interventions in defining the direction of causality, and features that may mark an event as an exogenous intervention. These laws place constraints on possible directed graphical models at the level below, which in turn are used to explain patterns of observed events over variables. Given observed events from several different causal systems, each encoded in a distinct data matrix, and a hypothesis space of possible laws at the highest level, the model converges quickly on a correct theory of intervention-based causality and uses that theory to constrain inferences about the specific causal networks underlying the different systems at the level below.



Although this HBM imposes strong and valuable constraints on the hypothesis space of causal networks, it is also extremely flexible: It can discover framework theories defined by any number of variable classes and any pattern of pairwise regularities on how variables in these classes tend to be connected. Not even the number of variable classes (two for the disease-symptom theory) need be known in advance. This is enabled by another state-of-the-art Bayesian tool, known as “infinite” or nonparametric hierarchical modeling. These models posit an unbounded amount of structure, but only finitely many degrees of freedom are actively engaged for a given data set (49). An automatic Occam’s razor embodied in Bayesian inference trades off model complexity and fit to ensure that new structure (in this case, a new class of variables) is introduced only when the data truly require it.

The specific nonparametric distribution on node classes in Fig. 3C is a Chinese restaurant process (CRP), which has been particularly influential in recent machine learning and cognitive modeling. CRP models have given the first principled account of how people form new categories without direct supervision (50, 51): As each stimulus is observed, CRP models (guided by the Bayesian Occam’s razor) infer whether that object is best explained by assimilation to an existing category or by positing a previously unseen category (fig. S3). The CrossCat model extends CRPs to carve domains of objects and their properties into different subdomains or “views,” subsets of properties that can all be explained by a distinct way of organizing the objects (52) (fig. S4). CRPs can be embedded in probabilistic models for language to explain how children discover words in unsegmented speech (53), learn morphological rules (54), and organize word meanings into hierarchical semantic networks (55, 56) (fig. S5). A related but novel nonparametric construction, the Indian buffet process (IBP), explains how new perceptual features can be constructed during object categorization (57, 58).
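
For concreteness, here is a sketch of the CRP itself: each new observation joins an existing cluster ("table") with probability proportional to the cluster's size, or starts a new one with probability proportional to a concentration parameter alpha, so the number of categories grows with the data rather than being fixed in advance.

    import random

    def crp(n_customers, alpha, seed=0):
        """Sample a partition of n_customers from a CRP with concentration alpha."""
        rng = random.Random(seed)
        tables = []        # current occupancy of each table (cluster sizes)
        assignments = []
        for i in range(n_customers):
            # Existing tables get weight = occupancy; a new table gets alpha.
            r = rng.uniform(0, i + alpha)   # total weight so far is i + alpha
            cumulative, choice = 0.0, len(tables)
            for t, size in enumerate(tables):
                cumulative += size
                if r < cumulative:
                    choice = t
                    break
            if choice == len(tables):
                tables.append(1)            # open a new table: a new category
            else:
                tables[choice] += 1
            assignments.append(choice)
        return assignments, tables

    _, tables = crp(50, alpha=1.0)
    print("cluster sizes:", tables)  # typically a few large clusters plus a long tail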

More generally, nonparametric hierarchical models address the principal challenge human learners face as knowledge grows over a lifetime: balancing constraint and flexibility, or the need to restrict hypotheses available for generalization at any moment with the capacity to expand one’s hypothesis spaces, to learn new ways that the world could work. Placing nonparametric distributions at higher levels of the HBM yields flexible inductive biases for lower levels, whereas the Bayesian Occam’s razor ensures the proper balance of constraint and flexibility as knowledge grows.

Across several case studies of learning abstract knowledge—discovering structural forms, causal framework theories, and other inductive constraints acquired through transfer learning—it has been found that abstractions in HBMs can be learned remarkably fast from relatively little data compared with what is needed for learning at lower levels. This is because each degree of freedom at a higher level of the HBM influences and pools evidence from many variables at levels below. We call this property of HBMs “the blessing of abstraction.” It offers a top-down route to the origins of knowledge that contrasts sharply with the two classic approaches: nativism (59, 60), in which abstract concepts are assumed to be present from birth, and empiricism or associationism (14), in which abstractions are constructed but only approximately, and only slowly in a bottom-up fashion, by layering many experiences on top of each other and filtering out their common elements. Only HBMs thus seem suited to explaining the two most striking features of abstract knowledge in humans: that it can be learned from experience, and that it can be engaged remarkably early in life, serving to constrain more specific learning tasks.

Open Questions

HBMs may answer some questions about the origins of knowledge, but they still leave us wondering: How does it all start? Developmentalists have argued that not everything can be learned, that learning can only get off the ground with some innate stock of abstract concepts such as “agent,” “object,” and “cause” to provide the basic ontology for carving up experience (7, 61). Surely some aspects of mental representation are innate, but without disputing this Bayesian modelers have recently argued that even the most abstract concepts may in principle be learned. For instance, an abstract concept of causality expressed as logical constraints on the structure of directed graphs can be learned from experience in a HBM that generalizes across the network structures of many specific causal systems (Fig. 3D). Following the “blessing of abstraction,” these constraints can be induced from only small samples of each network’s behavior and in turn enable more efficient causal learning for new systems (62). How this analysis extends to other abstract concepts such as agent or object and whether children actually acquire these concepts in such a manner remain open questions.

Although HBMs have addressed the acquisition of simple forms of abstract knowledge, they have only touched on the hardest subjects of cognitive development: framework theories for core common-sense domains such as intuitive physics, psychology, and biology (5–7). First steps have come in explaining developing theories of mind, how children come to understand explicit false beliefs (63) and individual differences in preferences (64), as well as the origins of essentialist theories in intuitive biology and early beliefs about magnetism in intuitive physics (39, 38). The most daunting challenge is that formalizing the full content of intuitive theories appears to require Turing-complete compositional representations, such as probabilistic first-order logic (65, 66) and probabilistic programming languages (67). How to effectively constrain learning with such flexible representations is not at all clear.
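
To give the flavor of this idea, here is a toy sketch written in Python rather than an actual probabilistic programming language such as that of ref. 67: the model is an ordinary recursive program that makes random choices, so it can express unbounded structure that no fixed graphical model captures, and conditioning can be approximated by keeping only the runs consistent with an observation (rejection sampling). The query below is invented for illustration.

    import random

    rng = random.Random(0)

    def flip(p):
        return rng.random() < p

    def flips_until_success(p):
        """A recursive random program (a geometric distribution): there is
        no a priori bound on the flip sequences it can generate."""
        return 1 if flip(p) else 1 + flips_until_success(p)

    # Condition by rejection: keep runs consistent with the observation x > 2,
    # then query the conditioned program.
    samples = [flips_until_success(0.5) for _ in range(100_000)]
    conditioned = [x for x in samples if x > 2]
    estimate = sum(x > 4 for x in conditioned) / len(conditioned)
    print(f"P(x > 4 | x > 2) ≈ {estimate:.3f}")  # memorylessness: exactly 0.25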

Lastly, the project of reverse-engineering the mind must unfold over multiple levels of analysis, only one of which has been our focus here. Marr (68) famously argued for analyses that integrate across three levels: The computational level characterizes the problem that a cognitive system solves and the principles by which its solution can be computed from the available inputs in natural environments; the algorithmic level describes the procedures executed to produce this solution and the representations or data structures over which the algorithms operate; and the implementation level specifies how these algorithms and data structures are instantiated in the circuits of a brain or machine. Many early Bayesian models addressed only the computational level, characterizing cognition in purely functional terms as approximately optimal statistical inference in a given environment, without reference to how the computations are carried out (25, 39, 69). The HBMs of learning and development discussed here target a view between the computational and algorithmic levels: cognition as approximately optimal inference in probabilistic models defined over a learner’s subjective and dynamically growing mental representations of the world’s structure, rather than some objective and fixed world statistics.

Much ongoing work is devoted to pushing Bayesian models down through the algorithmic and implementation levels. The complexity of exact inference in large-scale models implies that these levels can at best approximate Bayesian computations, just as in any working Bayesian AI system (9). The key research questions are as follows: What approximate algorithms does the mind use, how do they relate to engineering approximations in probabilistic AI, and how are they implemented in neural circuits? Much recent work points to Monte Carlo or stochastic sampling–based approximations as a unifying framework for understanding how Bayesian inference may work practically across all these levels, in minds, brains, and machines (70–74). Monte Carlo inference in richly structured models is possible (9, 67) but very slow; constructing more efficient samplers is a major focus of current work. The biggest remaining obstacle is to understand how structured symbolic knowledge can be represented in neural circuits. Connectionist models sidestep these challenges by denying that brains actually encode such rich knowledge, but this runs counter to the strong consensus in cognitive science and artificial intelligence that symbols and structures are essential for thought. Uncovering their neural basis is arguably the greatest computational challenge in cognitive neuroscience more generally—our modern mind-body problem.
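
As a toy example of the sampling-based approximations mentioned above (a generic Metropolis sampler, not an algorithm drawn from refs. 70–74): the sampler draws from a posterior known only up to its normalizing constant, here the bias of a coin observed to land heads 8 times in 10 flips under a flat prior. The model and numbers are invented for illustration.

    import random

    rng = random.Random(0)
    HEADS, FLIPS = 8, 10

    def unnormalized_posterior(theta):
        """Binomial likelihood times a flat prior on (0, 1); no normalizer needed."""
        if not 0.0 < theta < 1.0:
            return 0.0
        return theta ** HEADS * (1.0 - theta) ** (FLIPS - HEADS)

    theta, samples = 0.5, []
    for _ in range(50_000):
        proposal = theta + rng.gauss(0.0, 0.1)   # symmetric random-walk proposal
        ratio = unnormalized_posterior(proposal) / unnormalized_posterior(theta)
        if rng.random() < ratio:                 # Metropolis accept/reject step
            theta = proposal
        samples.append(theta)

    kept = samples[5_000:]                       # discard burn-in
    print(f"posterior mean ≈ {sum(kept) / len(kept):.3f}")  # exact: 9/12 = 0.75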



Conclusions

We have outlined an approach to understanding cognition and its origins in terms of Bayesian inference over richly structured, hierarchical generative models. Although we are far from a complete understanding of how human minds work and develop, the Bayesian approach brings us closer in several ways. First is the promise of a unifying mathematical language for framing cognition as the solution to inductive problems and building principled quantitative models of thought with a minimum of free parameters and ad hoc assumptions. Deeper is a framework for understanding why the mind works the way it does, in terms of rational inference adapted to the structure of real-world environments, and what the mind knows about the world, in terms of abstract schemas and intuitive theories revealed only indirectly through how they constrain generalizations.

Most importantly, the Bayesian approach lets us move beyond classic either-or dichotomies that have long shaped and limited debates in cognitive science: “empiricism versus nativism,” “domain-general versus domain-specific,” “logic versus probability,” “symbols versus statistics.” Instead we can ask harder questions of reverse-engineering, with answers potentially rich enough to help us build more humanlike AI systems. How can domain-general mechanisms of learning and representation build domain-specific systems of knowledge? How can structured symbolic knowledge be acquired through statistical learning? The answers emerging suggest new ways to think about the development of a cognitive system. Powerful abstractions can be learned surprisingly quickly, together with or prior to learning the more concrete knowledge they constrain. Structured symbolic representations need not be rigid, static, hard-wired, or brittle. Embedded in a probabilistic framework, they can grow dynamically and robustly in response to the sparse, noisy data of experience.

References and Notes

1. P. Bloom, How Children Learn the Meanings of Words (MIT Press, Cambridge, MA, 2000).
2. F. Xu, J. B. Tenenbaum, Psychol. Rev. 114, 245 (2007).
3. S. Pinker, Words and Rules: The Ingredients of Language (Basic, New York, 1999).
4. A. Gopnik et al., Psychol. Rev. 111, 3 (2004).
5. A. Gopnik, A. N. Meltzoff, Words, Thoughts, and Theories (MIT Press, Cambridge, MA, 1997).
6. S. Carey, Conceptual Change in Childhood (MIT Press, Cambridge, MA, 1985).
7. S. Carey, The Origin of Concepts (Oxford Univ. Press, New York, 2009).
8. P. Godfrey-Smith, Theory and Reality (Univ. of Chicago Press, Chicago, 2003).
9. S. Russell, P. Norvig, Artificial Intelligence: A Modern Approach (Prentice Hall, Upper Saddle River, NJ, 2009).
10. D. McAllester, in Proceedings of the Eleventh Annual Conference on Computational Learning Theory [Association for Computing Machinery (ACM), New York, 1998], p. 234.
11. J. Pearl, Probabilistic Reasoning in Intelligent Systems (Morgan Kaufmann, San Francisco, CA, 1988).
12. J. McClelland, D. Rumelhart, Eds., Parallel Distributed Processing: Explorations in the Microstructure of Cognition (MIT Press, Cambridge, MA, 1986).
13. S. Pinker, How the Mind Works (Norton, New York, 1997).
14. T. Rogers, J. McClelland, Semantic Cognition: A Parallel Distributed Processing Approach (MIT Press, Cambridge, MA, 2004).
15. P. Niyogi, The Computational Nature of Language Learning and Evolution (MIT Press, Cambridge, MA, 2006).
16. T. L. Griffiths, N. Chater, C. Kemp, A. Perfors, J. B. Tenenbaum, Trends Cogn. Sci. 14, 357 (2010).
17. J. L. McClelland et al., Trends Cogn. Sci. 14, 348 (2010).
18. J. B. Tenenbaum, T. L. Griffiths, Behav. Brain Sci. 24, 629 (2001).
19. J. Tenenbaum, T. Griffiths, in Proceedings of the 23rd Annual Conference of the Cognitive Science Society, J. D. Moore, K. Stenning, Eds. (Erlbaum, Mahwah, NJ, 2001), pp. 1036–1041.
20. T. Griffiths, J. Tenenbaum, in Proceedings of the 23rd Annual Conference of the Cognitive Science Society, J. D. Moore, K. Stenning, Eds. (Erlbaum, Mahwah, NJ, 2001), pp. 370–375.
21. T. L. Griffiths, J. B. Tenenbaum, Cognition 103, 180 (2007).
22. H. Lu, A. L. Yuille, M. Liljeholm, P. W. Cheng, K. J. Holyoak, Psychol. Rev. 115, 955 (2008).
23. T. L. Griffiths, J. B. Tenenbaum, Cognit. Psychol. 51, 334 (2005).
24. T. R. Krynski, J. B. Tenenbaum, J. Exp. Psychol. Gen. 136, 430 (2007).
25. M. Oaksford, N. Chater, Trends Cogn. Sci. 5, 349 (2001).
26. T. L. Griffiths, J. B. Tenenbaum, Psychol. Sci. 17, 767 (2006).
27. A. Yuille, D. Kersten, Trends Cogn. Sci. 10, 301 (2006).
28. N. Chater, C. D. Manning, Trends Cogn. Sci. 10, 335 (2006).
29. R. M. Shiffrin, M. Steyvers, Psychon. Bull. Rev. 4, 145 (1997).
30. M. Steyvers, T. L. Griffiths, S. Dennis, Trends Cogn. Sci. 10, 327 (2006).
31. K. P. Körding, D. M. Wolpert, Nature 427, 244 (2004).
32. A. Tversky, D. Kahneman, Science 185, 1124 (1974).
33. E. T. Jaynes, Probability Theory: The Logic of Science (Cambridge Univ. Press, Cambridge, 2003).
34. D. J. C. Mackay, Information Theory, Inference, and Learning Algorithms (Cambridge Univ. Press, Cambridge, 2003).
35. F. Xu, J. B. Tenenbaum, Dev. Sci. 10, 288 (2007).
36. C. Kemp, J. B. Tenenbaum, Psychol. Rev. 116, 20 (2009).
37. H. M. Wellman, S. A. Gelman, Annu. Rev. Psychol. 43, 337 (1992).
38. C. Kemp, J. B. Tenenbaum, S. Niyogi, T. L. Griffiths, Cognition 114, 165 (2010).
39. T. L. Griffiths, J. B. Tenenbaum, in Causal Learning: Psychology, Philosophy, and Computation, A. Gopnik, L. Schulz, Eds. (Oxford Univ. Press, Oxford, 2007), pp. 323–345.
40. J. Woodward, Making Things Happen: A Theory of Causal Explanation (Oxford Univ. Press, Oxford, 2003).
41. E. S. Markman, Categorization and Naming in Children (MIT Press, Cambridge, MA, 1989).
42. R. N. Shepard, Science 210, 390 (1980).
43. N. Chomsky, Rules and Representations (Basil Blackwell, Oxford, 1980).
44. S. Atran, Behav. Brain Sci. 21, 547 (1998).
45. A. Gelman, J. B. Carlin, H. S. Stern, D. B. Rubin, Bayesian Data Analysis (Chapman and Hall, New York, 1995).
46. C. Kemp, A. Perfors, J. B. Tenenbaum, Dev. Sci. 10, 307 (2007).
47. C. Kemp, J. B. Tenenbaum, Proc. Natl. Acad. Sci. U.S.A. 105, 10687 (2008).
48. V. K. Mansinghka, C. Kemp, J. B. Tenenbaum, T. L. Griffiths, in Proceedings of the 22nd Conference on Uncertainty in Artificial Intelligence, R. Dechter, T. Richardson, Eds. (AUAI Press, Arlington, VA, 2006), pp. 324–331.
49. C. Rasmussen, in Advances in Neural Information Processing Systems (MIT Press, Cambridge, MA, 2000), vol. 12, pp. 554–560.
50. J. R. Anderson, Psychol. Rev. 98, 409 (1991).
51. T. L. Griffiths, A. N. Sanborn, K. R. Canini, D. J. Navarro, in The Probabilistic Mind, N. Chater, M. Oaksford, Eds. (Oxford Univ. Press, Oxford, 2008).
52. P. Shafto, C. Kemp, V. Mansinghka, M. Gordon, J. B. Tenenbaum, in Proceedings of the 28th Annual Conference of the Cognitive Science Society (Erlbaum, Mahwah, NJ, 2006), pp. 2146–2151.
53. S. Goldwater, T. L. Griffiths, M. Johnson, Cognition 112, 21 (2009).
54. M. Johnson, T. L. Griffiths, S. Goldwater, in Advances in Neural Information Processing Systems (MIT Press, Cambridge, MA, 2007), vol. 19, pp. 641–648.
55. T. L. Griffiths, M. Steyvers, J. B. Tenenbaum, Psychol. Rev. 114, 211 (2007).
56. D. Blei, T. Griffiths, M. Jordan, J. Assoc. Comput. Mach. 57, 1 (2010).
57. T. L. Griffiths, Z. Ghahramani, in Advances in Neural Information Processing Systems (MIT Press, Cambridge, MA, 2006), vol. 18, pp. 475–482.
58. J. Austerweil, T. L. Griffiths, in Advances in Neural Information Processing Systems (MIT Press, Cambridge, MA, 2009), vol. 21, pp. 97–104.
59. N. Chomsky, Language and Problems of Knowledge: The Managua Lectures (MIT Press, Cambridge, MA, 1986).
60. E. S. Spelke, K. Breinlinger, J. Macomber, K. Jacobson, Psychol. Rev. 99, 605 (1992).
61. S. Pinker, The Stuff of Thought: Language as a Window into Human Nature (Viking, New York, 2007).
62. N. D. Goodman, T. D. Ullman, J. B. Tenenbaum, Psychol. Rev. 118, 110 (2011).
63. N. Goodman et al., in Proceedings of the 28th Annual Conference of the Cognitive Science Society (Erlbaum, Mahwah, NJ, 2006), pp. 1382–1387.
64. C. Lucas, T. Griffiths, F. Xu, C. Fawcett, in Advances in Neural Information Processing Systems (MIT Press, Cambridge, MA, 2009), vol. 21, pp. 985–992.
65. B. Milch, B. Marthi, S. Russell, in ICML 2004 Workshop on Statistical Relational Learning and Its Connections to Other Fields, T. Dietterich, L. Getoor, K. Murphy, Eds. (Omnipress, Banff, Canada, 2004), pp. 67–73.
66. C. Kemp, N. Goodman, J. Tenenbaum, in Proceedings of the 30th Annual Meeting of the Cognitive Science Society (Publisher, City, Country, 2008), pp. 1606–1611.
67. N. Goodman, V. Mansinghka, D. Roy, K. Bonawitz, J. Tenenbaum, in Proceedings of the 24th Conference on Uncertainty in Artificial Intelligence (AUAI Press, Corvallis, OR, 2008), vol. 22, p. 23.
68. D. Marr, Vision (W. H. Freeman, San Francisco, CA, 1982).
69. J. B. Tenenbaum, T. L. Griffiths, in Advances in Neural Information Processing Systems, T. Leen, T. Dietterich, V. Tresp, Eds. (MIT Press, Cambridge, MA, 2001), vol. 13, pp. 59–65.
70. A. N. Sanborn, T. L. Griffiths, D. J. Navarro, in Proceedings of the 28th Annual Conference of the Cognitive Science Society (Erlbaum, Mahwah, NJ, 2006), pp. 726–731.
71. S. D. Brown, M. Steyvers, Cognit. Psychol. 58, 49 (2009).
72. R. Levy, F. Reali, T. L. Griffiths, in Advances in Neural Information Processing Systems, D. Koller, D. Schuurmans, Y. Bengio, L. Bottou, Eds. (MIT Press, Cambridge, MA, 2009), vol. 21, pp. 937–944.
73. J. Fiser, P. Berkes, G. Orbán, M. Lengyel, Trends Cogn. Sci. 14, 119 (2010).
74. E. Vul, N. D. Goodman, T. L. Griffiths, J. B. Tenenbaum, in Proceedings of the 31st Annual Conference of the Cognitive Science Society (Erlbaum, Mahwah, NJ, 2009), pp. 148–153.
75. L. Schmidt, thesis, Massachusetts Institute of Technology, Cambridge, MA (2009).
76. We gratefully acknowledge the suggestions of R. R. Saxe, M. Bernstein, and J. M. Tenenbaum on this manuscript and the collaboration of N. Chater and A. Yuille on a forthcoming joint book expanding on the methods and perspectives reviewed here. Grant support was provided by Air Force Office of Scientific Research, Office of Naval Research, Army Research Office, NSF, Defense Advanced Research Projects Agency, Nippon Telephone and Telegraph Communication Sciences Laboratories, Qualcomm, Google, Schlumberger, and the James S. McDonnell Foundation.

Supporting Online Material
www.sciencemag.org/cgi/content/full/331/6022/1279/DC1
SOM Text
Figs. S1 to S5
References

10.1126/science.1192788
