Occidental College Department of Cognitive Science
Reflection Expanded
Expanding the Cognitive Reflection Task
Samuel C. Boland
Spring 2013 Senior Comprehensive Project
The Cognitive Reflection Task (Frederick, 2005) is designed to measure the tendency of an individual to engage in cognitive reflection, or the propensity to think about one's own responses analytically. In this study, I explain the CRT, the meaning behind it, and various correlating measures, and I propose an expansion of the test. Twelve new questions were tested against the original three on various measures such as SAT/ACT score, age, gender, and cognitive heuristics-and-biases tasks; of these, six were selected for possible inclusion in the CRT due to high correlations.
Boland 1
INTRODUCTION
COGNITIVE REFLECTION AND DECISION MAKING
The Cognitive Reflection Task (CRT), first created by economist Shane Frederick in 2005 (“Cognitive
Reflection and Decision Making”; Frederick, 2005), is a test designed to measure “cognitive reflection.” He drew on
Stanovich and West (2000) for a formal definition of this cognitive ability. According to the authors, human
cognition can be generally characterized into one of two systems, the exact nature of which differs depending on
the construct being measured. Regardless, they say that most cognitive systems have a “System 1” and a “System
2” format. System 1 is fast and immediate. It is utilized in such actions as driving, walking, simple arithmetic, and
any other non-cognitively taxing activity. System 2, on the other hand, is slow and analytic; it must be specifically
activated, and requires sustained effort and active concentration to maintain. It is implicated in complex tasks that
require active concentration, such as learning a new skill, complex mathematics, reading dense books, or writing a
paper. System 1 is the default system for most activities – it would make little sense to devote intense cognitive
ability to walking in a straight line, or most of any daily tasks. Thus, it is more easily activated. The activation of
System 2 requires a specific desire to do so, and sustained motivation and ability throughout.
Frederick’s paper analyzes the relationship between individual affinity towards cognitive reflection and
other cognitive measures. In order to do this, Frederick created a series of questions that were designed to activate
both System 1 and System 2. Specifically, these questions must have a “pre-potent,” or “gut” response: one which
seems immediately obvious, but which is, however, incorrect. The question then needs a correct, analytically
derivable response, which can only be arrived at by the application of System 2. However, System 2 will only be
activated if the participant “catches” themselves, or notes that they have made an error. To have done that in the
first place, they must have been reviewing their previous actions and responses, in effect “reflecting” upon their
recent mental past. Hence the name – the Cognitive Reflection Task.
To this end, Frederick created three questions that satisfy this pre-potency condition. They are:
1. A bat and a ball cost $1.10 in total. The bat costs $1.00 more than the ball. How much does the ball cost?
a. Intuitive Answer: 10 cents.
b. Correct Answer: 5 cents.
c. Why: People seem to be eager to simply subtract $1.00 from $1.10, as it is a cognitively simple
procedure. However, after a small amount of reflection, it becomes obvious that this way of
completing the question violates the stipulation that they together cost $1.10, as $1.10 + $0.10 =
$1.20. Instead, they must solve a simple pair of equations: {X + Y = 1.10, X − Y = 1.00}, solving for Y.
d. This question comes from Kahneman and Frederick (2002) and Kahneman and Frederick (2005),
and formed the springboard from which Frederick created the next two questions.
2. If it takes 5 machines 5 minutes to make 5 widgets, how long would it take 100 machines to make 100
widgets?
a. Intuitive Answer: 100 minutes.
b. Correct Answer: 5 minutes.
c. Why: It takes 1 machine 5 minutes to make 1 widget. So, it will take 100 machines 5 minutes to
make 100 widgets. The intuitive response is to scale all of the variables up to 100 – however, in
creating this question Frederick picked a special case where all of the numbers were the same,
which may instantiate a mental schema wherein X = Y = Z for all members of the set, producing
the incorrect answer.
3. In a lake, there is a patch of lily pads. Every day, the patch doubles in size. If it takes 48 days for the patch
to cover the entire lake, how long would it take for the patch to cover half of the lake?
a. Intuitive Answer: 24 days.
b. Correct Answer: 47 days.
c. Why: It seems that people tend to assume linearity in mental calculations, perhaps because many
spectra in day to day life operate linearly, or at least approximately linearly on the scales that we
perceive. However, a doubling every day is an exponential function, which must be taken into
account in order to correctly answer this question.
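The arithmetic behind all three original items can be checked with a short sketch. This is purely illustrative and not part of any cited study:

```python
# Minimal arithmetic checks for the three original CRT items.

# 1. Bat and ball: bat + ball = 1.10 and bat - ball = 1.00.
# Adding the equations gives 2 * bat = 2.10, so bat = 1.05, ball = 0.05.
total, difference = 1.10, 1.00
bat = (total + difference) / 2
ball = total - bat
print(round(ball, 2))        # 0.05 -- five cents, not ten

# 2. Widgets: 5 machines / 5 widgets / 5 minutes gives each machine a
# rate of 1 widget per 5 minutes; time = widgets / (machines * rate).
rate = 5 / (5 * 5)           # widgets per machine per minute
print(100 / (100 * rate))    # 5.0 minutes, not 100

# 3. Lily pads: daily doubling means the fraction covered on day d is
# 2 ** (d - 48), so half coverage is one day before full coverage.
print(2.0 ** (47 - 48))      # 0.5
print(2.0 ** (24 - 48))      # ~6e-8 -- day 24 is nowhere near half
```

In each case the intuitive answer comes from the linear shortcut that the code makes explicit and then corrects.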
In Kahneman & Frederick (2005), Frederick notes that “The critical feature of this [bat and ball] problem is
that anyone who reports 10 cents has obviously not taken the trouble to check his or her answer. The surprisingly
high rate of errors in this easy problem illustrates how lightly system 2 monitors the output of system 1: People are
often content to trust a plausible judgment that quickly comes to mind.”
CORRELATING MEASURES TO THE ORIGINAL CRT
Frederick was interested in how this measure might correlate with other cognitive measures. Specifically,
he was interested in various measures of economic cognition. One of these is Temporal Discounting, or the
tendency/ability to put off a small immediate reward for a larger later reward, and the ability to accurately gauge
whether it is a better choice to receive a reward (such as money) presently or later, including an understanding of
the effects of inflation and compound interest. One question in this vein that was utilized in the present study is:
“Would you rather be given $3400 this month, or $3800 next month?” The high-CRT group preferred
receiving the larger amount of money later, with high statistical significance and N = 806. Frederick explained this task in terms
of “Annual Discount Rate,” or at what percentage of annual interest one would require for a certain amount of
money A to grow to amount of money B. Frederick stated that the annual rate of $3400 to $3800 = 280%, much
higher than any other sort of savings program, and so the best answer must be to wait, according to his judgment.
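Frederick's 280% figure can be reproduced by annualizing the implied one-month growth. The sketch below is an illustration of the calculation, not part of the original study:

```python
# Annualize the one-month growth from $3400 to $3800 by compounding
# the monthly ratio over 12 months: (3800/3400)**12 - 1.
monthly_growth = 3800 / 3400            # ~1.118 per month
annual_rate = monthly_growth ** 12 - 1
print(round(annual_rate * 100))         # 280 percent (rounded)
```

No ordinary savings vehicle approaches this rate, which is why waiting is treated as the correct answer.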
In the field of Risk Aversion, the high-CRT groups were more willing to gamble when the utility of an
uncertain payoff was much higher than that of a certain payoff, whereas the low-CRT groups were more eager to
take the certain money, and not risk losing it. For some questions of this sort that were presented, the expected
value would be maximized by picking the gamble. On others, it would be maximized by picking the certain money.
To quote Frederick: “In the domain of gains, the high CRT group was more willing to gamble – particularly when
the gamble had higher expected value, but, notably, even when it did not.” (Frederick, 2005).
Another example was sensitivity to the Gambler’s Fallacy. The gambler’s fallacy is the belief that
statistically independent events are actually causally linked in some way, generally involving a sense of “luck.”
Some individuals appear to believe that “luck” is some sort of tally, keeping track of wins and losses through time,
and that a string of losses increases the chances for a later victory. One question designed to elicit this response is:
“When playing slot machines, people win something about 1 in every 10 times. Julie, however, has just won on her
first three plays. What are her chances of winning the next time she plays?”
Frederick found high correlations between these and related measures and the CRT, seemingly pointing
towards an underlying relationship between cognitive reflection and these various measures of economic
cognition. Risk aversion, temporal discounting, and sensitivity to the gambler’s fallacy all correlate with the ability
to reflect upon immediate actions and their implications, even when they are not immediately obvious. (Frederick,
2005)
Frederick found further strong correlations between the CRT and other measurements such as the SAT, the ACT, the WPT (Wonderlic Personnel Test), and the NFC (Need for Cognition scale), with all correlations significant at p < .01. Further, he found a significant correlation between gender and CRT score, with men tending to perform better than women.
The correlation with the SAT is to be expected, as the SAT Verbal and Math both require cognitive
reflection to perform well on. While the questions in the SAT do not generally have a pre-potent response like the
CRT questions do, they do frequently require reframing the question in order to answer correctly. Thus, a
moderate but significant correlation is not a surprise.
Frederick collected thousands of data points across multiple locations. These were primarily colleges and
universities, but some sampling of the general public was also conducted. Data was collected from MIT, Princeton,
Carnegie Mellon, Harvard, University of Michigan at Ann Arbor, Bowling Green University, University of Michigan
at Dearborn, Michigan State University, and the University of Toledo. He also collected data from the public at a
Boston fireworks display and a web-based survey. In total, 3428 people participated in Frederick’s study, although
not all filled out all fields (that is, not all had taken the SAT or ACT, and so were not part of analyses on those
subjects).
So it seems that the CRT is a rather powerful test, tapping into some low-level construct that is shared by
such seemingly disparate cognitive faculties as economic cognition, SAT and ACT scores, the WPT, and even a test
designed solely to measure a respondent’s desire to think (the NFC).
TOPLAK, STANOVICH, AND WEST (2010)
Heuristics-and-biases questions are those designed to measure an individual’s propensity to fall into
common cognitive traps. Thinking deeply about complex issues is time-consuming and cognition-intensive, and so
humans seem to have created or been born with certain mental shortcuts, or heuristics, that allow solving complex
problems quickly with little effort. However, sometimes these heuristics act more like illogical biases, causing
suboptimal performance (Tversky & Kahneman, 1974, 1983).
A 2010 paper by Toplak et al sought to find further correlations between the CRT and measures of
cognitive ability, specifically “heuristics-and-biases” questions. (Toplak, Stanovich and West; 2010) Through a series
of regression analyses, they found that the CRT was a more potent predictor of performance on heuristics-and-
biases questions than other more traditional predictors such as self-report measures. They approached the CRT as
a test of one’s propensity towards being a “cognitive miser,” that is, the propensity to expend the least amount of
effort possible to come to a conclusion. Previous literature has found a strong connection between such cognitive
miserhood and common reasoning errors (Stanovich, 2009b; Tversky and Kahneman, 1974). One explanation they
put forward for the CRT’s efficacy in this field is that, unlike most other measurements
designed to probe miserly cognitive behavior, the CRT contains the aforementioned “pre-potent” response, as well
as a correct response, meaning that a strong immediate response must be actively inhibited in favor of a less
obvious one, a cognitively expensive procedure that cognitive misers would not engage in.
CORRELATING MEASURES OF TOPLAK ET AL.
Toplak et al utilized 15 classic heuristics-and-biases tasks drawn from multiple studies designed to
measure various subfields of human cognition, such as probabilistic reasoning, hypothetical thought, and statistical
thinking. Not all of the questions utilized correlated significantly in Toplak’s study, but the combined aggregate
measure of these questions correlated at a .49 level, with P < .001, demonstrating the existence of a link between
the two.
PROFESSOR SHTULMAN’S ADDITIONAL QUESTIONS
Professor Shtulman added two additional questions to the CRT, which I am familiar with from my summer
research experience at Occidental College in summer 2012. The purpose of this was to make the CRT section of the
exam five questions long, the same as the other portions of the exam, so as to not tip off test-takers that the CRT
section is different. These questions were:
1. A house contains a living room and a den that are perfectly square. The living room has four times the
square footage of the den. If the walls in the den are 10 feet long, how long are the walls in the living
room?
1. Intuitive Answer: 40 feet.
2. Correct answer: 20 feet.
3. Why: The area of a square room grows with the square of the length of its walls; however, it is
cognitively easier to perform the calculation assuming linearity.
2. A store owner reduced the price of a pair of $100 shoes by 10%. The next week, he reduced it by a further
10%. How much do the shoes cost now?
1. Intuitive Answer: $80.
2. Correct answer: $81.
3. Why: The second reduction was based on the already-reduced price. Thus the final price is not
$100 − 20% = $80, but rather $100 − 10% = $90, then $90 − 10% = $81.
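Both of Professor Shtulman's items reduce to two-line calculations; the following sketch (illustrative only) makes the non-linear structure explicit:

```python
import math

# Walls scale with the square root of area: the den's 10 ft walls give
# 100 sq ft; four times that area is 400 sq ft, so walls of sqrt(400) ft.
living_room_wall = math.sqrt(4 * 10 ** 2)
print(living_room_wall)    # 20.0 ft, not 40

# Successive 10% discounts compound multiplicatively, not additively.
final_price = 100 * 0.9 * 0.9
print(final_price)         # 81.0, not 80
```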
PROLIFERATION OF THE CRT
A footnote in Toplak et al. (2010) points out that the CRT is in danger of becoming a self-report measure,
rather than a performance measure, due to the proliferation of the questions on the internet and between people.
Toplak et al note that the ultimate answer to this quandary is the creation of more CRT items that vary in surface
characteristics. An entirely non-rigorous analysis seems to support this assertion. A Google search for “bat ball
riddle,” referencing the first question of the original CRT, returns over 1.8 million search results. In and of itself this
is not surprising; what is surprising is the sheer number of relevant results multiple pages into the search – at the
present moment, relevant answers can still be found up to page 20,
a remarkable feat anecdotally speaking. This could point to a number of things – Google could have changed its
search algorithms, for example – and the cause of this phenomenon cannot be ascertained in such a cursory
analysis. That being said, the possibility exists that individuals are spreading the CRT questions through the internet
and other channels of communication, aided by the small size and high memorability of the questions. This is
especially dangerous for the CRT, as the very nature of the questions relies on individuals noticing that their
answers were wrong – if a tainted participant were to take the CRT and see a problem that he is familiar with, he
may likely (and correctly) assume that the other questions in the set were of a similar nature, thus putting him on
guard against the very thing that the CRT was looking to test for in the first place.
EXPANSION OF THE CRT
As stated before, this is a problem for experimenters who wish to utilize the CRT. The idea behind the
current study is to expand the CRT, inoculating it somewhat, and for a short while, against the possibility of
corruption by individuals who take the test with pre-existing knowledge of the questions. Further, if successfully
extended, this larger CRT could provide a pool of questions for researchers to draw upon, increasing the utility of the
test – the possibilities of expanding the CRT are discussed in the Discussion section, below.
To that end, the point of this study is to expand the CRT. The plan is simple: First, find questions that
are candidates for inclusion in the CRT. Specifically, questions that are “CRT-Like.” That is, they have a pre-potent
response that is incorrect, and a correct response that requires the recruitment of System 2. Second, find
measures that correlate and do not correlate with the original CRT (to achieve both convergent and divergent
validity). Then, combine the correlating measures and the expanded CRT into one large test, and administer it.
Once that is complete, examine the correlations between the original CRT questions, the new CRT questions, and
the correlating measures. The set of new CRT questions will require pruning to remove those that do not correlate
with the CRT on the given measures. This will be accomplished by multiple passes of analysis, determining how
each individual question reacts to all possible correlating measures, and seeing if it reacts similarly to the original
CRT questions and that measure. If the new question correlates with measures similarly to the original CRT, then it
is acceptable. If it does not, then it is removed.
After searching through books of riddles, LSAT practice exams, and various internet sites devoted to
riddles and jokes, I came up with a list of 10 potential new questions for the CRT. They were all, in the end, pulled
from anonymous internet sources – no physical books or LSAT practice questions were used, as I could not find
CRT-like questions that fit my criteria. It may seem odd to pull questions from the internet in order to guard against
their proliferation on the internet, but this is not a real problem. First, almost all information available in books is now
available on the internet, and security through obscurity of the source (such as an old book of riddles) is not a
particularly powerful defense once the source is discovered and digitized. Second, the point of this expansion is not to
create a bullet-proof system of questions, but to reduce the damage to the CRT in the event that an individual
knows one of the questions. The damage in mindset of the participant is unavoidable – if they know that one of
the questions is a “trick,” then they may extend that idea to all of the questions. However, if a participant knows
the bat-and-ball question from the original CRT, that would invalidate 1/3 of the test. If, however, they know one
answer to the expanded CRT, they will have invalidated only 1/n of the questions, where n is the expanded size. Unless
one of the original questions were somehow invalidated, n will always be greater than or equal to 3, and the expanded
test will thus be a better buffer against statistical invalidation of the test for that participant. The new questions are listed
below in no particular order.
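The buffer argument above is just a ratio; a two-line sketch makes it concrete (15 items is a hypothetical expanded size, chosen for illustration):

```python
# Fraction of the test a participant invalidates by knowing one item:
# 1/3 for the original CRT vs. 1/n for an n-item expansion.
for n in (3, 15):
    print(n, round(1 / n, 3))   # 3 -> 0.333, 15 -> 0.067
```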
THE ADDITIONAL QUESTIONS
1. “Some months contain 30 days, others contain 31 days. How many contain 28 days?”
1. Intuitive Response: Only one month (February) has 28 days.
2. Correct Response: 12 months, as every month contains at least 28 days.
2. “A red clock and a blue clock are both broken. The red clock doesn’t move at all. The blue clock moves but
loses 24 seconds every day. Which clock is more accurate?”
1. Intuitive Response: The blue clock, as it is at least still running.
2. Correct Response: The red clock, as a stopped clock is correct twice a day, while the blue clock must
drift through a full 12 hours (assuming it is an analog clock – if digital with an AM/PM indicator or on
military time, a full 24 hours) in 24-second increments per day until, once every 1,800 (or 3,600)
days, it is on time the entire day. This further assumes that it loses its 24 seconds once per day in
one large chunk at the end – if the clock instead runs uniformly slowly, so that it
cumulatively loses 24 seconds over the course of the entire day, it will be even less accurate.
3. You are in third place in a race. You overtake the person in second place. What place are you in now?
1. Intuitive Response: First place. Since you just beat second place, you must be in first.
2. Correct Response: Second place. You passed up the previous second-place runner, who is now in
third place, but the original first-place individual is still in front of you.
4. “You have a book of matches and enter a cold, dark room. You know that in the room there is an oil lamp,
a candle, and a heater. What do you light first?”
1. Intuitive Response: Any of the above depending on the preferences of light vs. heat for the
individual.
2. Correct response: The matches must be lit before anything else.
5. “Divide 30 by ½ and add 10. What is the answer?”
1. Intuitive Response: 25. ((30/2) + 10) = 25.
2. Correct response: 70. It says to divide by ½, not multiply by ½. So it is ((30/.5) + 10) = 70.
3. This is very similar to SAT Reading Comprehension questions, which sometimes seek to actively
obscure the answer through non-standard wording of a problem.
6. “If within a family there are nine brothers, and each brother has one sister, how many people are within
the family including the mother and father?”
1. Intuitive Response: 20. If each brother has one (unique) sister, then there are 9 brothers, 9 sisters,
and the 2 parents = 20 people.
2. Correct Response: Nowhere does it mention that each brother has a unique sister, rather that
they have *a* sister. Thus, the real answer is 9 brothers + 1 sister + 2 parents = 12 people.
7. “An airplane travelling at 400 mph crashes on the US/Canadian border. Where are the survivors buried?”
1. Intuitive Response: Either where they are from, or where their family wishes for them to be
buried.
2. Correct Response: Survivors are not buried, as they survived – burying them would be cruel.
8. “If it takes 20 minutes to hard-boil one goose egg, how long would it take to hard-boil 4?”
1. Intuitive Response: 80 minutes.
2. Correct Response: 20 minutes – just put them all in the same pot.
9. “A Doctor gives you three (3) pills, and tells you to take one every half an hour. How long will it be until
you no longer have any pills?”
1. Intuitive Response: 1.5 hours. Three * 30 minutes = 1.5 hours.
2. Correct Response: 1 hour. This is a fence-post counting problem – if you take one pill
immediately, you have two left. When you take another half an hour later, you will have one left.
Finally, when you take the last pill one hour after the first, you will have 0 left. Thus you will have
no pills after one hour.
10. “You have a ribbon that is 30 inches long. How many cuts with a pair of scissors would it take to divide it
into inch long pieces?”
1. Intuitive Answer: 30.
2. Correct Answer: 29.
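Several of the candidate items above also reduce to small calculations; the sketch below (illustrative only) checks the clock-drift, pill-timing, and ribbon arithmetic:

```python
# Q2: the blue clock loses 24 s/day and must drift a full 12 hours
# (43,200 s on an analog face) before it reads correctly again.
print((12 * 60 * 60) // 24)      # 1800 days (3600 for a 24-hour clock)

# Q9: fence-post counting for the pills -- doses at 0, 30, and 60 minutes.
doses, interval = 3, 30
print((doses - 1) * interval)    # 60 minutes, not 90

# Q10: cutting a 30-inch ribbon into 1-inch pieces needs one fewer cut
# than the number of pieces.
pieces = 30
print(pieces - 1)                # 29 cuts
```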
All of these questions are, at the very least, weakly CRT-like, in that they have a pre-potent response and a
correct response. In some cases, however, the pre-potent response is not particularly strong, such as in the questions
involving the clocks or the ribbon. Further, not every question is strictly mathematical in nature, as the original
questions were. However, I still believe that they may tap into the difference between Systems 1 and 2, and
effectively measure an individual’s proclivity to engage in cognitive reflection.
THE CORRELATING MEASURES
These measures were taken from Frederick (2005), Toplak et al (2010), and a variety of other sources. The
reported sources are the original studies in which these questions were used, as far as I can ascertain. For others,
such as the gambler’s fallacy question, no original could be determined, and so the source that the question was
found in is cited.
1. Gambler’s Fallacy: When playing slot machines, people win something about 1 in every 10 times. Julie,
however, has just won on her first three plays. What are her chances of winning the next time she plays?
(Frederick, 2005)
a. The point of this question is to gauge an individual’s proclivity to believe in “luck,” “karma,”
“fate,” or, more specifically, the cognitive fallacy that unrelated probabilistic events are actually
related. The correct answer was 1/10, .1, 10%, or equivalent – any other answer was coded as
false.
2. Sample Size Sensitivity: A game of squash can be played to either 9 or 15 points. Player A is a better
player than player B. Which amount of points to finish the game (9 or 15) gives A a higher chance of
winning? (Kahneman & Tversky, 1982)
a. The correct answer is 15. Much like a flipped coin, which with enough flips will settle toward a
50-50 distribution reflecting the true underlying probabilities, a game with more points tends to
settle toward the players’ underlying skill difference more reliably than a game with fewer
points.
3. Regression to the Mean: “After the first two weeks of the major league baseball season, newspapers
begin to print the top 10 batting averages. Typically, after 2 weeks, the leading batter often has an
average of about .450. However, no batter in major league history has ever averaged .450 at the end of
the season. Why do you think this is?” (Lehmann, Lempert and Nisbett; 1988)
a. When a batter is known to be hitting for a high average, pitchers bear down more when they
pitch to him.
b. Pitchers tend to get better over the course of a season, as they get more in shape. As pitchers
improve, they are more likely to strike out batters, so batters’ averages go down.
c. A player’s high average at the beginning of the season may be just luck. The longer season
provides a more realistic test of a batter’s skill.
d. A batter who has such a hot streak at the beginning of the season is under a lot of stress to
maintain his performance record. Such stress adversely affects his playing.
e. When a batter is known to be hitting for a high average, he stops getting good pitches to hit.
Instead, pitchers “play the corners” of the plate because they don’t mind walking him.
i. The only correct answer is C.
ii. This question, much like the previous one, tests how well an individual understands such
statistical concepts as the law of large numbers and regression to the mean.
4. Covariational Reasoning: A doctor has been working on a cure for a mysterious disease. Finally, he
created a drug that he thinks will cure people of the disease. Before he can begin to use it regularly, he
has to test the drug. He selected 300 people who had the disease and gave them the drug to see what
happened. He also observed 100 people who had the disease but who were not given the drug. When the
treatment was used, 200 people were cured, and 100 were not. When the treatment was NOT used, 75
people were cured, and 25 people were not. On a scale of 1 to 10, how strong of an effect did the
treatment have, if any, either positive or negative? (Toplak et al., 2010)
a. As can be seen, this is not a good treatment. When the treatment was used, 200/300, or 2/3 of
people were cured. When the treatment was not used, 75/100 or ¾ were cured. Thus, the
treatment was actually either slightly negative or completely useless.
b. Any answer under 5 was scored as correct. This may need to be changed.
5. Methodological Reasoning in everyday life: The city of Middleopolis has had an unpopular police chief for
a year and a half. He is a political appointee who is a crony of the mayor, and he had little previous
experience in police administration when he was appointed. The mayor has recently defended the chief in
public, announcing that in the time since he took office, crime rates decreased by 12%. Which of the
following pieces of evidence would most deflate the mayor's claim that his chief is competent? (Lehman
et al; 1988)
a. The crime rate in the city closest to Middleopolis in location and size has fallen by 18%.
b. An independent survey of the citizens of Middleopolis reports 40% more crimes than appear in
the police records.
c. Common sense indicates that there is little that a police chief can do to lower crime rates, as
these are mostly social and economic matters beyond his or her control.
d. The police chief was discovered to have business contracts with people in organized crime.
i. Only A contains specific comparative data – all of the others are unfounded conjecture,
however plausible they may seem.
6. Sunk Cost Fallacy: This was composed of two questions. The first part was:
a. “Imagine that you are staying in a hotel room, and that you have just paid $9.95 for a pay-per-
view movie. Five minutes into the movie, you find yourself bored with it. Do you change the
channel or continue watching the movie?”
b. And the second part is: “Imagine that you are staying in a hotel room, flipping channels on the TV.
You come across a movie that is just starting. Five minutes into the movie, you find yourself
bored with it. Do you change the channel or continue watching the movie?” (Toplak et al; 2010)
i. If a participant selected the same answer to both of these questions, they were deemed
as correct. If they chose different responses, they were deemed as incorrect. This was
designed to measure sensitivity to the Sunk Cost fallacy. The sunk cost fallacy is the
tendency of a person to continue an unpleasant activity if they expended value (time,
money) acquiring it.
ii. It should be noted that no individual who stated that they would switch the channel in
the pay condition reported that they would stay on the channel in the free condition. It
was only when an individual spent money that they were willing to sit through a movie
that they did not like – they’ve already paid, or so the reasoning goes, so they should get
their “money’s worth.” This is unreasonable, as the money is already gone and you
could be spending your time in a more useful way.
c. Outcome Bias Questions: Like the previous question, this came in two parts. (Baron and Hershey;
1988)
i. Part one: “There is a 55-year old man with a serious heart condition. He had an
operation to fix the problem, which succeeded. The probability of him dying from the
surgery was 8%. Please rate how good of a decision this was on the following scale, with
1 being “incorrect, a very bad decision” and 7 being “clearly correct, an excellent
decision.”
ii. Part two: “There is a 55 year old man with a hip condition. He had an operation to fix
the problem, which did not succeed - the old man died on the operating table. The
probability of him dying from the surgery was 2%. Please rate how good of a decision
this was on the following scale, with one being “incorrect, a very bad decision” and 7
being “clearly correct, an excellent decision:”
1. The participant was rated as correct only if they rated part two as a better
decision than part one. This is because, even though the patient in part two died,
his surgery carried only a 2% risk of death – a quarter of the risk faced by the
patient in part one, who just so happened to survive. Judging a decision by how it
turned out is outcome bias, reflected in the phrase “hindsight is 20/20.”
7. Temporal Discounting 1: “Would you rather be given $3400 right now or $3800 one month from now?”
a. This was coded as correct if the individual chose the second answer. This is because, in the
current and foreseeable economic climate, interest rates would not allow $3400 to change to
$3800 in one month. However, it *is* conceivable to turn $3400 into $3800 through other
activities, such as arbitrage or short-term loans. This was pointed out to me after the experiment
was conducted. (All temporal discounting questions taken from Frederick, 2005)
8. Temporal Discounting 2: “What is the highest amount of money that you would pay for a book that you
really want to be shipped to you overnight?”
a. This was coded as a 1 for correct and a 0 for incorrect. To determine this, the given answers were
averaged, and all above the average were given a 0, and all below were given a 1.
9. Temporal Discounting 3:
a. “On a scale of 1 to 10, where 1 is very little and 10 is quite a bit, please rate how much you think
about monetary inflation.”
b. This was not scored as correct or incorrect, but was used as a scale.
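The scoring logic for a few of these measures can be sketched in code. In the illustration below, the point-win probability of 0.6 and the sample answers are assumed values for demonstration, not data from any of the cited studies:

```python
import random

# Measure 2 (sample size): Monte Carlo estimate of the better player's
# chance of winning a race to `target` points, when they win each point
# independently with probability p.
def win_probability(p, target, trials=20000, seed=1):
    random.seed(seed)
    wins = 0
    for _ in range(trials):
        a = b = 0
        while a < target and b < target:
            if random.random() < p:
                a += 1
            else:
                b += 1
        wins += (a == target)
    return wins / trials

p9, p15 = win_probability(0.6, 9), win_probability(0.6, 15)
print(p15 > p9)    # True -- the longer game favors the better player

# Measure 4 (covariation): cure rates with and without the treatment.
treated_rate = 200 / 300       # ~0.667
untreated_rate = 75 / 100      # 0.75
print(treated_rate < untreated_rate)   # True -- no positive effect

# Measure 8 (Temporal Discounting 2): mean-split scoring -- answers at
# or below the sample mean score 1, answers above it score 0.
def mean_split(amounts):
    mean = sum(amounts) / len(amounts)
    return [1 if a <= mean else 0 for a in amounts]

print(mean_split([5, 10, 20, 100]))    # [1, 1, 1, 0] (mean = 33.75)
```

Note that the source text leaves answers exactly at the mean unspecified for Measure 8; the sketch assigns them a 1.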
METHODS
PARTICIPANTS AND PROCEDURE
A total of 59 participants (43 from within Occidental College, 16 from the general public) took part in the
study. Individuals from outside the college were recruited in an attempt to reduce the effect of "WEIRD"
populations (Henrich et al., 2010). Participants were recruited through social networking, Sona Systems, and
word of mouth; those who were eligible received .5 course credits for participating. The test was
administered as a Google form, one version for within Occidental that collected @oxy.edu email addresses and
one for outside the college that was completely anonymous. Participants' ages ranged from college-aged to the
mid-50s, with all over 18 years of age. The test was not timed, but anecdotal reports suggest it took
approximately 30-45 minutes to complete.
TASKS AND MEASURES
Participants completed a combined survey of demographic data, self-reported testing data (SAT and ACT
scores as well as age), the above-mentioned heuristics-and-biases tasks, the original CRT (from here on referred to
as the 'oCRT'), Professor Shtulman's expanded CRT, and my 10 CRT question candidates (the 'mCRT'). (For further
discussion, Professor Shtulman's expansion will be combined with mine under the moniker "mCRT.") Mean
performance on the oCRT was 1.55 questions correct among the Occidental students alone. Mean performance on
the mCRT is reported later, as its candidate questions must first be filtered.
RESULTS
INTER-CRT CORRELATIONS
Question                        Correlation to CRT    P value
Living Room/Den                 .346**                .007
Family of Brothers              .443**                .000
The Doctor gives you 3 pills    .572**                .000
Months with 28 days             .314*                 .016
Passing in a Race               -.185                 .161
Planecrash                      .364**                .005
Divide by ½                     .287*                 .027
Ribbon Cut                      .250                  .056
Shoe Price                      .294*                 .024
Goose Egg                       .354**                .006
Clocks                          .124                  .312
Matches                         .224                  .088
(Correlations are marked for ease of viewing: * = P < .05, ** = P < .01.) These are the first-pass zero-order
correlations between the aggregated oCRT (labeled "CRT" in the table above) and the questions
comprising the sCRT and mCRT. The majority of the questions do indeed correlate with the CRT. Only a few do not
correlate at all: "Race," "Ribboncut," "Clocks," and "Matches." Specifically, "Clocks" and "Race" have
very high P values, at .312 and .161 respectively, while "Ribboncut" and "Matches" come closer to
significance, with P = .056 and .088 respectively, possibly hinting at correlations had there been more
participants. Thus, "Clocks" and "Race" are removed from the analysis, while "Ribboncut" and "Matches" stay for
now. This reduces the list of new questions from 12 to 10.
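This first-pass filter can be reproduced mechanically from the table's P values. The dictionary below hard-codes those values, and the P < .10 threshold (drop clearly non-significant items, keep borderline ones) is my reading of the decision rule described above, not a rule stated by Frederick:

```python
# P values from the inter-CRT correlation table above.
p_values = {
    "Living Room/Den": .007, "Family of Brothers": .000,
    "The Doctor gives you 3 pills": .000, "Months with 28 days": .016,
    "Passing in a Race": .161, "Planecrash": .005,
    "Divide by 1/2": .027, "Ribbon Cut": .056,
    "Shoe Price": .024, "Goose Egg": .006,
    "Clocks": .312, "Matches": .088,
}

# Keep items at or near significance; drop those far from it.
kept = [q for q, p in p_values.items() if p < .10]
dropped = [q for q, p in p_values.items() if p >= .10]
print(len(kept), dropped)  # 10 kept; "Passing in a Race" and "Clocks" dropped
```

Running this keeps the borderline "Ribbon Cut" and "Matches" items while dropping "Race" and "Clocks," matching the 12-to-10 reduction described in the text.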
CRT/SAT CORRELATIONS
The next pass of analysis concerned the correlations between the CRT, the mCRT, and the SAT.
The SAT is useful as a comparative statistic for the CRT because of the nature of SAT questions, specifically those in
the reading and writing portions of the test. SAT reading comprehension questions generally give an individual a
passage of text to read, and then ask questions designed to measure their comprehension of the
passage. SAT writing questions provide a prompt and ask test-takers to write a long-form essay critically
responding to it. These sorts of questions would, of necessity, require reflection on the knowledge,
either knowledge recently received (as in the reading comprehension portion) or analysis of knowledge that one
already has (as in the writing portion). For example, one practice SAT reading comprehension question located
here (http://www.majortests.com/sat/reading-comprehension-test01) asks "What is the
author implying in the above text?" Another asks which definition of a word best fits the context of the
passage. Both of these questions require the activation of Stanovich and West's cognitive System 2, and the
question on implication specifically requires internal reflection to decide among multiple possible interpretations,
similar to the decisions that must be made when answering CRT questions.
The first pass of this second analysis for SAT correlations revealed a few important pieces of information.
First, neither the CRT nor the mCRT correlates significantly (or even near significance) with the SAT math subscore.
For the other subscores, the oCRT correlates with the SAT reading comprehension subtest at .441*, P
= .013, and with the SAT writing subtest at .503**, P = .002. The mCRT correlates with SAT reading comprehension
at .330, P = .070, and with the SAT writing subtest at .517**, P = .002. So, both the oCRT and the mCRT
correlate with the SAT writing subtest at a high and very significant level, whereas the mCRT does not correlate
with (but approaches correlation with) the SAT reading comprehension test. It would appear that a few questions
exist within the mCRT that do not correlate with the SAT the same way the questions in the oCRT do. To determine
which questions were the culprits, the aggregate scores were split and the questions compared to the SAT subscores
individually.
To provide a baseline for further analysis, the oCRT was first split and compared question by question to the SAT, to
see how the CRT questions (which are usually used only in the aggregate) compare individually, giving a
base level of comparison for the additional questions. As expected from the aggregate data, none of the oCRT
questions correlates with the SAT math score. Surprisingly, however, two of the three original questions do not
correlate with any SAT subscore, although both are somewhat close (P <= .14 for SAT reading, P <= .097 for
SAT writing) and may reach significance with more data. This is important because it shows that the oCRT is
not a monolithic bloc with regard to its correlations with the SAT.
This analysis was repeated with the mCRT. Four questions were found to be very far from correlating with
any of the SAT subscores, with P values at approximately .8 across the board: Months,
Ribboncut, Shoe Price, and Matches. These questions were removed from further analysis, lowering the size of the
mCRT from the previous 10 questions to 6. Upon removing these four questions and rechecking the correlations
between the oCRT, mCRT, and SAT, the correlation between the mCRT and SAT reading comprehension is
now at P < .05, within the same range of .05 > P > .01 as the oCRT.
CRT/HEURISTICS-AND-BIASES CORRELATIONS
Frederick and Toplak both found correlations between the CRT and various measures of economic and cognitive
heuristics. Attempting to replicate their results, I utilized ten such questions, detailed above
in the introduction. At first the data was quite muddied, and no correlations could be found. To see what was
wrong, I examined the correlations between each heuristics-and-biases question and the oCRT/mCRT. What I
found was that questions having to do with money, specifically the temporal discounting questions, did not
perform well at Occidental: neither the oCRT nor the mCRT correlated significantly with any of them, at odds with
Frederick's study.
This lack of correlation was found within the Occidental College participant pool in particular; the
outside-Occidental pool was too small to yield significant data on its own. Questions such as "How much would
you pay to have a book shipped to you overnight" were particularly strange, with answers ranging from 0 to 120
dollars within Occidental. This might be due to the nature of the college experience and the change in
technology between the time of Frederick's study in 2005 and now. Today, a large number of people own Kindles,
iPhones, iPads, and notebooks, and internet access is generally much quicker than it was 8+ years ago when
Frederick performed his study. It could be that easy access to online materials has lowered the
number of physical books that college students are required to order. When a student does order a book,
it may be because they need it immediately for a class; that might be the only time a student orders a physical
book from the internet, necessitating expedited shipping and a higher payment, and skewing the economically
"correct" answer.
Another question that did not correlate, but should have according to Frederick, asked
participants whether they would prefer to be given $3400 immediately or $3800 in one month. Frederick's
justification for picking the latter is that the one-month gain implies an annual discount rate of roughly
280%, higher than can be found in any official investment. However, some participants who took my test
did not see it that way: some claimed that they could easily turn $3400 into
$3800 in less than a month, with some to spare, while others claimed that they needed the money now to pay
pressing bills and could not wait one month. Whether this is peculiar to Occidental College, I cannot say. It should
be noted that the gambler's fallacy question did make it through this pass, so not all measures of economic cognition
were inefficacious. Further, the oCRT and the mCRT both failed to correlate, indicating a problem not in the test but
in the participant pool.
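Frederick's 280% figure can be checked by compounding the one-month gain over twelve months; the arithmetic below is my own sanity check, not a calculation reproduced from the original paper:

```python
# $3,400 now vs. $3,800 in one month: compound the monthly gain
# over 12 months to get the implied annual rate.
monthly_growth = 3800 / 3400             # ~1.118, i.e. ~11.8% per month
implied_annual = monthly_growth ** 12 - 1
print(f"{implied_annual:.0%}")           # ~280% per year
```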
These questions were removed, creating an adjusted aggregate bias measure. To mirror Frederick's
analysis on these measures, the oCRT and mCRT scores were each split into two groups, one low and one high.
Individuals were assigned to the low group if they answered fewer than N/2 questions correctly (where N is the
number of questions), and to the high group if they answered more than N/2 correctly. Low-CRT
individuals, on both the mCRT and the oCRT, scored significantly lower on the adjusted aggregate
biases tasks; conversely, high-CRT individuals on both tests scored significantly higher on the adjusted
biases tasks. Not only that, but the oCRT and mCRT acted almost identically: oCRT_high correlated with the
adjusted bias aggregate at .304, P = .019; mCRT_high at .364, P = .005; oCRT_low at
-.512, P < .001; and mCRT_low at -.378, P = .003.
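The high/low split described above can be sketched as follows. The function name is mine, and (following the strict inequalities in the text) a score of exactly N/2 is assigned to neither group:

```python
def crt_group(score, n_questions):
    """Assign an individual to the 'low' or 'high' CRT group.
    Fewer than N/2 correct -> low; more than N/2 correct -> high.
    A score of exactly N/2 is left unassigned (None), since the text
    defines only the strict inequalities."""
    half = n_questions / 2
    if score < half:
        return "low"
    if score > half:
        return "high"
    return None
```

For the 3-question oCRT this puts scores of 0-1 in the low group and 2-3 in the high group; on an even-length test such as a 6-question mCRT, a score of exactly 3 falls in neither group.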
CRT/DEMOGRAPHIC CORRELATIONS
Despite Frederick's finding that gender influenced the outcome of his test, with females tending
to perform worse than males, no correlation with gender was found on either the oCRT or the mCRT. However, a
correlation was found between both CRTs and age, a measure not used by Frederick, Toplak, or any
other CRT-based experiment that I read. The oCRT correlates with age at .294, P = .024. The mCRT correlates
at .253, P = .053. This is above .05; however, considering how close it is, this is likely a problem of test power,
and I am confident that having more individuals in the pool would have allowed it to reach significance.
FINAL OCRT/MCRT CORRELATION
The previous two passes of analysis did not require the removal of any more questions from the mCRT,
leaving the expanded test at 6 mCRT questions plus 3 oCRT questions, for a total of 9 CRT questions. The mCRT and
oCRT aggregates correlate with each other at .653, P < .001.
DISCUSSION
To save space, the entire list of final questions is not presented here; they are the Airplane
question, the Doctor/Pill question, the Sister/Family question, the Goose Egg question, the Divide by ½ question,
and the Living Room/Den question (as well as the original 3 CRT questions). All of these require at least some
modicum of mathematical thought except the Airplane question, which, quite frankly, is a rather surprising
candidate to make it through the gauntlet of correlations. Its inclusion may be an artifact of overfitting
the questions to the data, or of Occidental's peculiarities. It is also possible that it taps into the same
underlying cognitive reflection abilities as the other questions, and so does indeed deserve a spot on the list.
There are some possible problems with the collected data. Primarily, correlations that should have
existed were occasionally absent, specifically between the CRT scores (oCRT and mCRT) and the SAT
math score, gender, and the economically minded heuristics-and-biases questions. However, it is important to note
that neither the oCRT nor the mCRT correlated with these measures; it would have been much worse if one had
correlated while the other did not, which would mean they acted very differently on one of these measures.
Since both fail to correlate at similar levels, this may be considered a measure of divergent validity.
That being said, the expanded mCRT correlates almost identically with the original oCRT on many
disparate measures, including SAT reading and writing scores, wide-ranging cognitive heuristics problems, and age.
Further, the individual questions comprising the mCRT correlate with the aggregate oCRT, many of them on more
than one measure, and all significantly.
It seems that the mCRT is in fact a viable candidate for expansion of the CRT. More rigorous testing would
of course be required, along with a much larger pool of participants, but these questions preliminarily
seem to travel with the CRT in a way that hints that both are tapping into the same underlying cognitive
process or proclivity: that both are measuring, to various degrees, cognitive reflection.
This is desirable not only for protecting the CRT against invalidation through
proliferation, but also for creating a larger pool of questions that researchers can draw upon, each measuring slightly
different aspects of cognitive reflection ability, which may ease further lines of study with the CRT. One possibility
would be to use the CRT, and any expansions of it, to measure the origin and malleability of reflective cognition.
For example, are these scores generally constant throughout life? A single 3-question test could not answer that, as
it would require repeating the same questions at each measurement. But a larger 9-question test, such as the
expanded CRT presented here, would allow for three measurements of three questions each, for a broader perspective
on how this skill acts through time. Another possibility is to test whether CRT scores can be changed through
training, another repeated-measures design that would require more than the original 3-question
test.
WORKS CITED
Baron, J., & Hershey, J. (1988). Outcome bias in decision evaluation. Journal of Personality and Social Psychology,
54, 569-579.
Frederick, S. (2005). Cognitive Reflection and Decision Making. Journal of Economic Perspectives, 19(4), 25-42.
Henrich, J., Heine, S., & Norenzayan, A. (2010). The weirdest people in the world? Behavioral and Brain Sciences,
33(2/3), 1-75.
Kahneman, D., & Frederick, S. (2002). Representativeness revisited: Attribute substitution in intuitive judgment.
In Heuristics and Biases: The Psychology of Intuitive Judgment. New York: Cambridge University Press.
Kahneman, D. & Frederick, S. (2005). A model of heuristic judgment. The Cambridge Handbook of Thinking and
Reasoning, 267-293.
Tversky, A., & Kahneman, D. (1974). Judgment under uncertainty: Heuristics and biases. Science, 185, 1124-1131.
Kahneman, D., & Tversky, A. (1982). On the study of statistical intuitions. Cognition, 11, 123-141.
Tversky, A., & Kahneman, D. (1983). Extensional versus intuitive reasoning: The conjunction fallacy in probability
judgment. Psychological Review, 90(4), 293-315.
Lehman, D., Lempert, R., & Nisbett, R. (1988). The effect of graduate training on reasoning. American Psychologist,
43, 431-442.
Stanovich, K. E. (2009). What intelligence tests miss: The psychology of rational thought. New Haven: Yale
University Press.
Stanovich, K. E., & West, R. F. (2000). Individual differences in reasoning: Implications for the rationality
debate? Behavioral and Brain Sciences, 23(5), 645-665.
Toplak, M., West, R., & Stanovich, K. (2011). The Cognitive Reflection Test as a predictor of performance on
heuristics-and-biases tasks. Memory & Cognition, 39, 1275-1289.