Occidental College Department of Cognitive Science
Reflection Expanded
Expanding the Cognitive Reflection Task
Samuel C. Boland
Spring 2013 Senior Comprehensive Project
The Cognitive Reflection Task (Frederick, 2005) is designed to measure the tendency of an individual to engage in cognitive reflection, or the propensity to think about one's own responses analytically. In this study, I explain the CRT, the meaning behind it, and various correlating measures, and I propose an expansion of the test. Twelve new questions were tested against the original three on various measures such as SAT/ACT score, age, gender, and cognitive heuristics-and-biases tasks; of these, six were selected for possible inclusion in the CRT due to high correlations.
Boland 1
INTRODUCTION
COGNITIVE REFLECTION AND DECISION MAKING
The Cognitive Reflection Task (CRT), first created by economist Shane Frederick in 2005 (“Cognitive
Reflection and Decision Making”; Frederick, 2005), is a test designed to measure “cognitive reflection.” He drew on
Stanovich and West (2000) for a formal definition of this cognitive ability. According to the authors, human
cognition can be generally characterized into one of two systems, the exact nature of which differs depending on
the construct being measured. Regardless, they say that most cognitive systems have a “System 1” and a “System
2” format. System 1 is fast and immediate. It is utilized in such actions as driving, walking, simple arithmetic, and
any other non-cognitively taxing activity. System 2, on the other hand, is slow and analytic; it must be specifically
activated, and requires sustained effort and active concentration to maintain. It is implicated in complex tasks that
require active concentration, such as learning a new skill, complex mathematics, reading dense books, or writing a
paper. System 1 is the default system for most activities – it would make little sense to devote intense cognitive
ability to walking in a straight line, or most of any daily tasks. Thus, it is more easily activated. The activation of
System 2 requires a specific desire to do so, and sustained motivation and ability throughout.
Frederick’s paper analyzes the relationship between individual affinity towards cognitive reflection and
other cognitive measures. In order to do this, Frederick created a series of questions that were designed to activate
both System 1 and System 2. Specifically, these questions must have a “pre-potent,” or “gut” response: one which
seems immediately obvious, but which is, however, incorrect. The question then needs a correct, analytically
derivable response, which can only be arrived at by the application of System 2. However, System 2 will only be
activated if the participant “catches” themselves, or notes that they have made an error. To have done that in the
first place, they must have been reviewing their previous actions and responses, in effect “reflecting” upon their
recent mental past. Hence the name – the Cognitive Reflection Task.
To this end, Frederick created three questions that satisfy this pre-potency condition. They are:
1. A bat and a ball cost $1.10 in total. The bat costs $1.00 more than the ball. How much does the ball cost?
a. Intuitive Answer: 10 cents.
b. Correct Answer: 5 cents.
c. Why: People seem to be eager to simply subtract $1.00 from $1.10, as it is a cognitively simple
procedure. However, after a small amount of reflection, it becomes obvious that this way of
completing the question violates the stipulation that they together cost $1.10, as $1.10 + $0.10 =
$1.20. Instead, they must solve a simple pair of equations: {X + Y = 1.10, X − Y = 1.00}, solving for Y.
d. This question comes from Kahneman and Frederick (2002) and Kahneman and Frederick (2005),
and formed the springboard from which Frederick created the next two questions.
2. If it takes 5 machines 5 minutes to make 5 widgets, how long would it take 100 machines to make 100
widgets?
a. Intuitive Answer: 100 minutes.
b. Correct Answer: 5 minutes.
c. Why: It takes 1 machine 5 minutes to make 1 widget. So, it will take 100 machines 5 minutes to
make 100 widgets. The intuitive response is to scale all of the variables up to 100 – however, in
creating this question Frederick picked a special case where all of the numbers were the same,
which may instantiate a mental schema wherein X = Y = Z for all members of the set, producing
the incorrect answer.
3. In a lake, there is a patch of lily pads. Every day, the patch doubles in size. If it takes 48 days for the patch
to cover the entire lake, how long would it take for the patch to cover half of the lake?
a. Intuitive Answer: 24 days.
b. Correct Answer: 47 days.
c. Why: It seems that people tend to assume linearity in mental calculations, perhaps because many
spectra in day to day life operate linearly, or at least approximately linearly on the scales that we
perceive. However, a doubling every day is an exponential function, which must be taken into
account in order to correctly answer this question.
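The arithmetic behind all three original items can be checked with a short sketch. This is purely illustrative and not part of any cited study:

```python
# Minimal arithmetic checks for the three original CRT items.

# 1. Bat and ball: bat + ball = 1.10 and bat - ball = 1.00.
# Adding the equations gives 2 * bat = 2.10, so bat = 1.05, ball = 0.05.
total, difference = 1.10, 1.00
bat = (total + difference) / 2
ball = total - bat
print(round(ball, 2))        # 0.05 -- five cents, not ten

# 2. Widgets: 5 machines / 5 widgets / 5 minutes gives each machine a
# rate of 1 widget per 5 minutes; time = widgets / (machines * rate).
rate = 5 / (5 * 5)           # widgets per machine per minute
print(100 / (100 * rate))    # 5.0 minutes, not 100

# 3. Lily pads: daily doubling means the fraction covered on day d is
# 2 ** (d - 48), so half coverage is one day before full coverage.
print(2.0 ** (47 - 48))      # 0.5
print(2.0 ** (24 - 48))      # ~6e-8 -- day 24 is nowhere near half
```

In each case the intuitive answer comes from the linear shortcut that the code makes explicit and then corrects.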
In Kahneman & Frederick (2005), Frederick notes that “The critical feature of this [bat and ball] problem is
that anyone who reports 10 cents has obviously not taken the trouble to check his or her answer. The surprisingly
high rate of errors in this easy problem illustrates how lightly system 2 monitors the output of system 1: People are
often content to trust a plausible judgment that quickly comes to mind.”
CORRELATING MEASURES TO THE ORIGINAL CRT
Frederick was interested in how this measure might correlate with other cognitive measures. Specifically,
he was interested in various measures of economic cognition. One of these is Temporal Discounting, or the
tendency/ability to put off a small immediate reward for a larger later reward, and the ability to accurately gauge
whether it is a better choice to receive a reward (such as money) presently or later, including an understanding of
the effects of inflation and compound interest. One question in this vein that was utilized in the present study is:
“Would you rather be given $3400 this month, or $3800 next month?” The high-CRT group preferred
receiving the larger amount of money later, with high statistical significance and N = 806. Frederick explained this task in terms
of “Annual Discount Rate,” or at what percentage of annual interest one would require for a certain amount of
money A to grow to amount of money B. Frederick stated that the annual rate of $3400 to $3800 = 280%, much
higher than any other sort of savings program, and so the best answer must be to wait, according to his judgment.
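Frederick's 280% figure can be reproduced by annualizing the implied one-month growth. The sketch below is an illustration of the calculation, not part of the original study:

```python
# Annualize the one-month growth from $3400 to $3800 by compounding
# the monthly ratio over 12 months: (3800/3400)**12 - 1.
monthly_growth = 3800 / 3400            # ~1.118 per month
annual_rate = monthly_growth ** 12 - 1
print(round(annual_rate * 100))         # 280 percent (rounded)
```

No ordinary savings vehicle approaches this rate, which is why waiting is treated as the correct answer.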
In the field of Risk Aversion, the high-CRT groups were more willing to gamble when the utility of an
uncertain payoff was much higher than that of a certain payoff, whereas the low-CRT groups were more eager to
take the certain money, and not risk losing it. For some questions of this sort that were presented, the expected
value would be maximized by picking the gamble. On others, it would be maximized by picking the certain money.
To quote Frederick: “In the domain of gains, the high CRT group was more willing to gamble – particularly when
the gamble had higher expected value, but, notably, even when it did not.” (Frederick, 2005).
Another example was sensitivity to the Gambler’s Fallacy. The gambler’s fallacy is the belief that
statistically independent events are actually causally linked in some way, generally involving a sense of “luck.”
Some individuals appear to believe that “luck” is some sort of tally, keeping track of wins and losses through time,
and that a string of losses increases the chances for a later victory. One question designed to elicit this response is:
“When playing slot machines, people win something about 1 in every 10 times. Julie, however, has just won on her
first three plays. What are her chances of winning the next time she plays?”
Frederick found high correlations between these and related measures and the CRT, seemingly pointing
towards an underlying relationship between cognitive reflection and these various measures of economic
cognition. Risk aversion, temporal discounting, and sensitivity to the gambler’s fallacy all correlate with the ability
to reflect upon immediate actions and their implications, even when they are not immediately obvious. (Frederick,
2005)
Frederick found further strong correlations between the CRT and other measurements such as the SAT, the ACT, the WPT (Wonderlic Personnel Test), and the NFC (Need for Cognition scale), with all correlations significant at p < .01. Further, he found a significant correlation between gender and CRT score, with men tending to perform better than women.
The correlation with the SAT is to be expected, as the SAT Verbal and Math both require cognitive
reflection to perform well on. While the questions in the SAT do not generally have a pre-potent response like the
CRT questions do, they do frequently require reframing the question in order to answer correctly. Thus, a
moderate but significant correlation is not a surprise.
Frederick collected thousands of data points across multiple locations. These were primarily colleges and
universities, but some sampling of the general public was also conducted. Data was collected from MIT, Princeton,
Carnegie Mellon, Harvard, University of Michigan at Ann Arbor, Bowling Green University, University of Michigan
at Dearborn, Michigan State University, and the University of Toledo. He also collected data from the public at a
Boston fireworks display and a web-based survey. In total, 3428 people participated in Frederick’s study, although
not all filled out all fields (that is, not all had taken the SAT or ACT, and so were not part of analyses on those
subjects).
So it seems that the CRT is a rather powerful test, tapping into some low-level construct that is shared by
such seemingly disparate cognitive faculties as economic cognition, SAT and ACT scores, the WPT, and even a test
designed solely to measure a respondent’s desire to think (the NFC).
TOPLAK, STANOVICH, AND WEST (2010)
Heuristics-and-biases questions are those designed to measure an individual’s propensity to fall into
common cognitive traps. Thinking deeply about complex issues is time-consuming and cognition-intensive, and so
humans seem to have created or been born with certain mental shortcuts, or heuristics, that allow solving complex
problems quickly with little effort. However, sometimes these heuristics act more like illogical biases, causing
suboptimal performance (Tversky & Kahneman, 1974, 1983).
A 2010 paper by Toplak et al sought to find further correlations between the CRT and measures of
cognitive ability, specifically “heuristics-and-biases” questions. (Toplak, Stanovich and West; 2010) Through a series
of regression analyses, they found that the CRT was a more potent predictor of performance on heuristics-and-
biases questions than other more traditional predictors such as self-report measures. They approached the CRT as
a test of one’s propensity towards being a “cognitive miser,” that is, the propensity to expend the least amount of
effort possible to come to a conclusion. Previous literature has found a strong connection between such cognitive
miserhood and common reasoning errors (Stanovich, 2009b; Tversky and Kahneman, 1974). One explanation they
put forward for the CRT’s efficacy in this field is that, unlike most other measurements
designed to probe miserly cognitive behavior, the CRT contains the aforementioned “pre-potent” response, as well
as a correct response, meaning that a strong immediate response must be actively inhibited in favor of a less
obvious one, a cognitively expensive procedure that cognitive misers would not engage in.
CORRELATING MEASURES OF TOPLAK ET AL.
Toplak et al utilized 15 classic heuristics-and-biases tasks drawn from multiple studies designed to
measure various subfields of human cognition, such as probabilistic reasoning, hypothetical thought, and statistical
thinking. Not all of the questions utilized correlated significantly in Toplak’s study, but the combined aggregate
measure of these questions correlated at a .49 level, with P < .001, demonstrating the existence of a link between
the two.
PROFESSOR SHTULMAN’S ADDITIONAL QUESTIONS
Professor Shtulman added two additional questions to the CRT, which I am familiar with from my summer
research experience at Occidental College in summer 2012. The purpose of this was to make the CRT section of the
exam five questions long, the same as the other portions of the exam, so as to not tip off test-takers that the CRT
section is different. These questions were:
1. A house contains a living room and a den that are perfectly square. The living room has four times the
square footage of the den. If the walls in the den are 10 feet long, how long are the walls in the living
room?
1. Intuitive Answer: 40 feet.
2. Correct answer: 20 feet.
3. Why: The area of a square room grows with the square of the length of its walls; however, it is
cognitively easier to perform the calculation assuming linearity.
2. A store owner reduced the price of a pair of $100 shoes by 10%. The next week, he reduced it by a further
10%. How much do the shoes cost now?
1. Intuitive Answer: $80.
2. Correct answer: $81.
3. Why: The second reduction was based on the already-reduced price. Thus the final price is not
$100 − 20% = $80, but rather $100 − 10% = $90, then $90 − 10% = $81.
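Both of Professor Shtulman's items reduce to two-line calculations; the following sketch (illustrative only) makes the non-linear structure explicit:

```python
import math

# Walls scale with the square root of area: the den's 10 ft walls give
# 100 sq ft; four times that area is 400 sq ft, so walls of sqrt(400) ft.
living_room_wall = math.sqrt(4 * 10 ** 2)
print(living_room_wall)    # 20.0 ft, not 40

# Successive 10% discounts compound multiplicatively, not additively.
final_price = 100 * 0.9 * 0.9
print(final_price)         # 81.0, not 80
```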
PROLIFERATION OF THE CRT
A footnote in Toplak et al. (2010) points out that the CRT is in danger of becoming a self-report measure,
rather than a performance measure, due to the proliferation of the questions on the internet and between people.
Toplak et al note that the ultimate answer to this quandary is the creation of more CRT items that vary in surface
characteristics. An entirely non-rigorous analysis seems to support this assertion. A Google search for “bat ball
riddle,” referencing the first question of the original CRT, returns over 1.8 million search results. In and of itself this
is not surprising; what is surprising is the sheer number of relevant results multiple pages into the search – at the
present moment, relevant answers can still be found up to page 20,
a remarkable feat anecdotally speaking. This could point to a number of things – Google could have changed its
search algorithms, for example – and the cause of this phenomenon cannot be ascertained in such a cursory
analysis. That being said, the possibility exists that individuals are spreading the CRT questions through the internet
and other channels of communication, aided by the small size and high memorability of the questions. This is
especially dangerous for the CRT, as the very nature of the questions relies on individuals noticing that their
answers were wrong – if a tainted participant were to take the CRT and see a problem that he is familiar with, he
may likely (and correctly) assume that the other questions in the set were of a similar nature, thus putting him on
guard against the very thing that the CRT was looking to test for in the first place.
EXPANSION OF THE CRT
As stated before, this is a problem for experimenters who wish to utilize the CRT. The idea behind the
current study is to expand the CRT, inoculating it somewhat, and for a short while, against the possibility of
corruption by individuals who take the test with pre-existing knowledge of the questions. Further, if successfully
extended, this larger CRT could provide a pool of questions for researchers to draw upon, increasing the utility of the
test – the possibilities of expanding the CRT are discussed in the Discussion section, below.
To that end, the point of this study is to expand the CRT. The plan is simple: First, find questions that
are candidates for inclusion in the CRT. Specifically, questions that are “CRT-Like.” That is, they have a pre-potent
response that is incorrect, and a correct response that requires the recruitment of System 2. Second, find
measures that correlate and do not correlate with the original CRT (to achieve both convergent and divergent
validity). Then, combine the correlating measures and the expanded CRT into one large test, and administer it.
Once that is complete, examine the correlations between the original CRT questions, the new CRT questions, and
the correlating measures. The set of new CRT questions will require pruning to remove those that do not correlate
with the CRT on the given measures. This will be accomplished by multiple passes of analysis, determining how
each individual question reacts to all possible correlating measures, and seeing if it reacts similarly to the original
CRT questions and that measure. If the new question correlates with measures similarly to the original CRT, then it
is acceptable. If it does not, then it is removed.
After searching through books of riddles, LSAT practice exams, and various internet sites devoted to
riddles and jokes, I came up with a list of 10 potential new questions for the CRT. They were all, in the end, pulled
from anonymous internet sources – no physical books or LSAT practice questions were used, as I could not find
CRT-like questions that fit my criteria. It may seem odd to pull questions from the internet in order to guard against
their proliferation on the internet, but this is not a real problem. First, almost all information available in books is now
available on the internet, and security through obscurity of the source (such as an old book of riddles) is not a
particularly powerful defense once the source is discovered and digitized. Second, the point of this expansion is not to
create a bullet-proof system of questions, but to reduce the damage to the CRT in the event that an individual
knows one of the questions. The damage in mindset of the participant is unavoidable – if they know that one of
the questions is a “trick,” then they may extend that idea to all of the questions. However, if a participant knows
the bat-and-ball question from the original CRT, that would invalidate 1/3 of the test. If, however, they know one
answer to the expanded CRT, they will have invalidated only 1/n of the questions, where n is the expanded size. Unless
one of the original questions were somehow invalidated, n will always be greater than or equal to 3, and the expanded
test will thus be a better buffer against statistical invalidation of the test for that participant. The new questions are listed
below in no particular order.
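The buffer argument above is just a ratio; a two-line sketch makes it concrete (15 items is a hypothetical expanded size, chosen for illustration):

```python
# Fraction of the test a participant invalidates by knowing one item:
# 1/3 for the original CRT vs. 1/n for an n-item expansion.
for n in (3, 15):
    print(n, round(1 / n, 3))   # 3 -> 0.333, 15 -> 0.067
```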
THE ADDITIONAL QUESTIONS
1. “Some months contain 30 days, others contain 31 days. How many contain 28 days?”
1. Intuitive Response: Only one month (February) has 28 days.
2. Correct Response: 12 months, as every month contains at least 28 days.
2. “A red clock and a blue clock are both broken. The red clock doesn’t move at all. The blue clock moves but
loses 24 seconds every day. Which clock is more accurate?”
1. Intuitive Response: The blue clock, as it is at least still running.
2. Correct Response: The red clock, as a stopped clock is correct twice a day, while the blue clock must
drift through a full 12 hours (assuming it is an analog clock – if digital with an AM/PM indicator or on
military time, a full 24 hours) in 24-second increments per day until, once every 1,800 (or 3,600)
days, it is on time the entire day. This further assumes that it loses its 24 seconds once per day in
one large chunk at the end – if the clock instead runs uniformly slowly, so that it
cumulatively loses 24 seconds over the course of the entire day, it will be even less accurate.
3. You are in third place in a race. You overtake the person in second place. What place are you in now?
1. Intuitive Response: First place. Since you just beat second place, you must be in first.
2. Correct Response: Second place. You passed up the previous second-place runner, who is now in
third place, but the original first-place individual is still in front of you.
4. “You have a book of matches and enter a cold, dark room. You know that in the room there is an oil lamp,
a candle, and a heater. What do you light first?”
1. Intuitive Response: Any of the above depending on the preferences of light vs. heat for the
individual.
2. Correct response: The matches must be lit before anything else.
5. “Divide 30 by ½ and add 10. What is the answer?”
1. Intuitive Response: 25. ((30/2) + 10) = 25.
2. Correct response: 70. It says to divide by ½, not multiply by ½. So it is ((30/.5) + 10) = 70.
3. This is very similar to SAT Reading Comprehension questions, which sometimes seek to actively
obscure the answer through non-standard wording of a problem.
6. “If within a family there are nine brothers, and each brother has one sister, how many people are within
the family including the mother and father?”
1. Intuitive Response: 20. If each brother has one (unique) sister, then there are 9 brothers, 9 sisters,
and the 2 parents = 20 people.
2. Correct Response: Nowhere does it mention that each brother has a unique sister, rather that
they have *a* sister. Thus, the real answer is 9 brothers + 1 sister + 2 parents = 12 people.
7. “An airplane travelling at 400 mph crashes on the US/Canadian border. Where are the survivors buried?”
1. Intuitive Response: Either where they are from, or where their family wishes for them to be
buried.
2. Correct Response: Survivors are not buried, as they survived – burying them would be cruel.
8. “If it takes 20 minutes to hard-boil one goose egg, how long would it take to hard-boil 4?”
1. Intuitive Response: 80 minutes.
2. Correct Response: 20 minutes – just put them all in the same pot.
9. “A Doctor gives you three (3) pills, and tells you to take one every half an hour. How long will it be until
you no longer have any pills?”
1. Intuitive Response: 1.5 hours. Three * 30 minutes = 1.5 hours.
2. Correct Response: 1 hour. This is a fence-post counting problem – if you take one pill
immediately, you have two left. When you take another half an hour later, you will have one left.
Finally, when you take the last pill one hour after the first, you will have 0 left. Thus you will have
no pills after one hour.
10. “You have a ribbon that is 30 inches long. How many cuts with a pair of scissors would it take to divide it
into inch long pieces?”
1. Intuitive Answer: 30.
2. Correct Answer: 29.
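Several of the candidate items above also reduce to small calculations; the sketch below (illustrative only) checks the clock-drift, pill-timing, and ribbon arithmetic:

```python
# Q2: the blue clock loses 24 s/day and must drift a full 12 hours
# (43,200 s on an analog face) before it reads correctly again.
print((12 * 60 * 60) // 24)      # 1800 days (3600 for a 24-hour clock)

# Q9: fence-post counting for the pills -- doses at 0, 30, and 60 minutes.
doses, interval = 3, 30
print((doses - 1) * interval)    # 60 minutes, not 90

# Q10: cutting a 30-inch ribbon into 1-inch pieces needs one fewer cut
# than the number of pieces.
pieces = 30
print(pieces - 1)                # 29 cuts
```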
All of these questions are, at the very least, weakly CRT-like, in that they have a pre-potent response and a
correct response. In some cases, however, the pre-potent response is not particularly strong, such as in the questions
involving the clocks or the ribbon. Further, not every question is strictly mathematical in nature, as the original
questions were. However, I still believe that they may tap into the difference between Systems 1 and 2, and
effectively measure an individual’s proclivity to engage in cognitive reflection.
THE CORRELATING MEASURES
These measures were taken from Frederick (2005), Toplak et al (2010), and a variety of other sources. The
reported sources are the original studies in which these questions were used, as far as I can ascertain. For others,
such as the gambler’s fallacy question, no original could be determined, and so the source that the question was
found in is cited.
1. Gambler’s Fallacy: When playing slot machines, people win something about 1 in every 10 times. Julie,
however, has just won on her first three plays. What are her chances of winning the next time she plays?
(Frederick, 2005)
a. The point of this question is to gauge an individual’s proclivity to believe in “luck,” “karma,”
“fate,” or, more specifically, the cognitive fallacy that unrelated probabilistic events are actually
related. The correct answer was 1/10, .1, 10%, or equivalent – any other answer was coded as
false.
2. Sample Size Sensitivity: A game of squash can be played to either 9 or 15 points. Player A is a better
player than player B. Which amount of points to finish the game (9 or 15) gives A a higher chance of
winning? (Kahneman & Tversky, 1982)
a. The correct answer is 15. Much like a flipped coin, which with enough flips will settle toward a
50-50 distribution reflecting the true underlying probabilities, a game with more points tends to
settle toward the players’ underlying skill difference more reliably than a game with fewer
points.
3. Regression to the Mean: “After the first two weeks of the major league baseball season, newspapers
begin to print the top 10 batting averages. Typically, after 2 weeks, the leading batter often has an
average of about .450. However, no batter in major league history has ever averaged .450 at the end of
the season. Why do you think this is?” (Lehmann, Lempert and Nisbett; 1988)
a. When a batter is known to be hitting for a high average, pitchers bear down more when they
pitch to him.
b. Pitchers tend to get better over the course of a season, as they get more in shape. As pitchers
improve, they are more likely to strike out batters, so batters’ averages go down.
c. A player’s high average at the beginning of the season may be just luck. The longer season
provides a more realistic test of a batter’s skill.
d. A batter who has such a hot streak at the beginning of the season is under a lot of stress to
maintain his performance record. Such stress adversely affects his playing.
e. When a batter is known to be hitting for a high average, he stops getting good pitches to hit.
Instead, pitchers “play the corners” of the plate because they don’t mind walking him.
i. The only correct answer is C.
ii. This question, much like the previous one, tests how well an individual understands such
statistical concepts as the law of large numbers and regression to the mean.
4. Covariational Reasoning: A doctor has been working on a cure for a mysterious disease. Finally, he
created a drug that he thinks will cure people of the disease. Before he can begin to use it regularly, he
has to test the drug. He selected 300 people who had the disease and gave them the drug to see what
happened. He also observed 100 people who had the disease but who were not given the drug. When the
treatment was used, 200 people were cured, and 100 were not. When the treatment was NOT used, 75
people were cured, and 25 people were not. On a scale of 1 to 10, how strong of an effect did the
treatment have, if any, either positive or negative? (Toplak et al., 2010)
a. As can be seen, this is not a good treatment. When the treatment was used, 200/300, or 2/3 of
people were cured. When the treatment was not used, 75/100 or ¾ were cured. Thus, the
treatment was actually either slightly negative or completely useless.
b. Any answer under 5 was scored as correct. This may need to be changed.
5. Methodological Reasoning in everyday life: The city of Middleopolis has had an unpopular police chief for
a year and a half. He is a political appointee who is a crony of the mayor, and he had little previous
experience in police administration when he was appointed. The mayor has recently defended the chief in
public, announcing that in the time since he took office, crime rates decreased by 12%. Which of the
following pieces of evidence would most deflate the mayor's claim that his chief is competent? (Lehman
et al; 1988)
a. The crime rate in the city closest to Middleopolis in location and size has fallen by 18%.
b. An independent survey of the citizens of Middleopolis reports 40% more crimes than appear in
the police records.
c. Common sense indicates that there is little that a police chief can do to lower crime rates, as
these are mostly social and economic matters beyond his or her control.
d. The police chief was discovered to have business contracts with people in organized crime.
i. Only A contains specific comparative data – all of the others are unfounded conjecture,
however plausible they may seem.
6. Sunk Cost Fallacy: This was composed of two questions. The first part was:
a. “Imagine that you are staying in a hotel room, and that you have just paid $9.95 for a pay-per-
view movie. Five minutes into the movie, you find yourself bored with it. Do you change the
channel or continue watching the movie?”
b. And the second part is: “Imagine that you are staying in a hotel room, flipping channels on the TV.
You come across a movie that is just starting. Five minutes into the movie, you find yourself
bored with it. Do you change the channel or continue watching the movie?” (Toplak et al; 2010)
i. If a participant selected the same answer to both of these questions, they were deemed
as correct. If they chose different responses, they were deemed as incorrect. This was
designed to measure sensitivity to the Sunk Cost fallacy. The sunk cost fallacy is the
tendency of a person to continue an unpleasant activity if they expended value (time,
money) acquiring it.
ii. It should be noted that no individual who stated that they would switch the channel in
the pay condition reported that they would stay on the channel in the free condition. It
was only when an individual spent money that they were willing to sit through a movie
that they did not like – they’ve already paid, or so the reasoning goes, so they should get
their “money’s worth.” This is unreasonable, as the money is already gone and you
could be spending your time in a more useful way.
c. Outcome Bias Questions: Like the previous question, this came in two parts. (Baron and Hershey;
1988)
i. Part one: “There is a 55-year old man with a serious heart condition. He had an
operation to fix the problem, which succeeded. The probability of him dying from the
surgery was 8%. Please rate how good of a decision this was on the following scale, with
1 being “incorrect, a very bad decision” and 7 being “clearly correct, an excellent
decision.”
ii. Part two: “There is a 55 year old man with a hip condition. He had an operation to fix
the problem, which did not succeed - the old man died on the operating table. The
probability of him dying from the surgery was 2%. Please rate how good of a decision
this was on the following scale, with one being “incorrect, a very bad decision” and 7
being “clearly correct, an excellent decision:”
1. The participant was rated as correct only if they rated part two as a better
decision than part one. This is because, even though the patient in part two died,
his surgery carried only a 2% risk of death – a quarter of the risk faced by the
patient in part one, who just so happened to survive. Judging a decision by how it
turned out is outcome bias, reflected in the phrase “hindsight is 20/20.”
7. Temporal Discounting 1: “Would you rather be given $3400 right now or $3800 one month from now?”
a. This was coded as correct if the individual chose the second answer. This is because, in the
current and foreseeable economic climate, interest rates would not allow $3400 to change to
$3800 in one month. However, it *is* conceivable to turn $3400 into $3800 through other
activities, such as arbitrage or short-term loans. This was pointed out to me after the experiment
was conducted. (All temporal discounting questions taken from Frederick, 2005)
8. Temporal Discounting 2: “What is the highest amount of money that you would pay for a book that you
really want to be shipped to you overnight?”
a. This was coded as a 1 for correct and a 0 for incorrect. To determine this, the given answers were
averaged, and all above the average were given a 0, and all below were given a 1.
9. Temporal Discounting 3:
a. “On a scale of 1 to 10, where 1 is very little and 10 is quite a bit, please rate how much you think
about monetary inflation.”
b. This was not scored as correct or incorrect, but was used as a scale.
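The scoring logic for a few of these measures can be sketched in code. In the illustration below, the point-win probability of 0.6 and the sample answers are assumed values for demonstration, not data from any of the cited studies:

```python
import random

# Measure 2 (sample size): Monte Carlo estimate of the better player's
# chance of winning a race to `target` points, when they win each point
# independently with probability p.
def win_probability(p, target, trials=20000, seed=1):
    random.seed(seed)
    wins = 0
    for _ in range(trials):
        a = b = 0
        while a < target and b < target:
            if random.random() < p:
                a += 1
            else:
                b += 1
        wins += (a == target)
    return wins / trials

p9, p15 = win_probability(0.6, 9), win_probability(0.6, 15)
print(p15 > p9)    # True -- the longer game favors the better player

# Measure 4 (covariation): cure rates with and without the treatment.
treated_rate = 200 / 300       # ~0.667
untreated_rate = 75 / 100      # 0.75
print(treated_rate < untreated_rate)   # True -- no positive effect

# Measure 8 (Temporal Discounting 2): mean-split scoring -- answers at
# or below the sample mean score 1, answers above it score 0.
def mean_split(amounts):
    mean = sum(amounts) / len(amounts)
    return [1 if a <= mean else 0 for a in amounts]

print(mean_split([5, 10, 20, 100]))    # [1, 1, 1, 0] (mean = 33.75)
```

Note that the source text leaves answers exactly at the mean unspecified for Measure 8; the sketch assigns them a 1.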
METHODS
PARTICIPANTS AND PROCEDURE
A total of 59 participants (43 from within Occidental College, 16 from the general public) took part in the
study. Individuals from outside the college were recruited in an attempt to reduce the effect of "WEIRD"
populations (Henrich et al., 2010). Participants were recruited through social networking, Sona Systems, and
word of mouth; those who were eligible received .5 course credits for participating. The test was
administered as a Google form, one version for within Occidental that collected @oxy.edu email addresses and
one for outside the college that was completely anonymous. Participants' ages ranged from college-aged to the
mid-50s, with all over 18 years of age. The test was not timed, but anecdotal reports suggest it took
approximately 30-45 minutes to complete.
TASKS AND MEASURES
Participants completed a combined survey of demographic data, self-reported testing data (SAT and ACT
scores as well as age), the above-mentioned heuristics-and-biases tasks, the original CRT (from here on referred to
as the 'oCRT'), Professor Shtulman's expanded CRT, and my 10 CRT question candidates (the 'mCRT'). (For further
discussion, Professor Shtulman's expansion will be combined with mine under the moniker "mCRT.") Mean
performance on the oCRT was 1.55 questions correct among the Occidental students alone. Mean performance on
the mCRT is reported later, as its candidate questions must first be filtered.
RESULTS
INTER-CRT CORRELATIONS
Question                        Correlation to CRT    P value
Living Room/Den                 .346**                .007
Family of Brothers              .443**                .000
The Doctor gives you 3 pills    .572**                .000
Months with 28 days             .314*                 .016
Passing in a Race               -.185                 .161
Planecrash                      .364**                .005
Divide by ½                     .287*                 .027
Ribbon Cut                      .250                  .056
Shoe Price                      .294*                 .024
Goose Egg                       .354**                .006
Clocks                          .124                  .312
Matches                         .224                  .088
(Correlations are marked for ease of viewing: * = P < .05, ** = P < .01.) These are the first-pass zero-order
correlations between the aggregated oCRT (labeled "CRT" in the table above) and the questions
comprising the sCRT and mCRT. The majority of the questions do indeed correlate with the CRT. Only a few do not
correlate at all: "Race," "Ribboncut," "Clocks," and "Matches." Specifically, "Clocks" and "Race" have
very high P values, at .312 and .161 respectively, while "Ribboncut" and "Matches" come closer to
significance, with P = .056 and .088 respectively, possibly hinting at correlations had there been more
participants. Thus, "Clocks" and "Race" are removed from the analysis, while "Ribboncut" and "Matches" stay for
now. This reduces the list of new questions from 12 to 10.
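This first-pass filter can be reproduced mechanically from the table's P values. The dictionary below hard-codes those values, and the P < .10 threshold (drop clearly non-significant items, keep borderline ones) is my reading of the decision rule described above, not a rule stated by Frederick:

```python
# P values from the inter-CRT correlation table above.
p_values = {
    "Living Room/Den": .007, "Family of Brothers": .000,
    "The Doctor gives you 3 pills": .000, "Months with 28 days": .016,
    "Passing in a Race": .161, "Planecrash": .005,
    "Divide by 1/2": .027, "Ribbon Cut": .056,
    "Shoe Price": .024, "Goose Egg": .006,
    "Clocks": .312, "Matches": .088,
}

# Keep items at or near significance; drop those far from it.
kept = [q for q, p in p_values.items() if p < .10]
dropped = [q for q, p in p_values.items() if p >= .10]
print(len(kept), dropped)  # 10 kept; "Passing in a Race" and "Clocks" dropped
```

Running this keeps the borderline "Ribbon Cut" and "Matches" items while dropping "Race" and "Clocks," matching the 12-to-10 reduction described in the text.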
CRT/SAT CORRELATIONS
The next pass of analysis concerned the correlations between the CRT, the mCRT, and the SAT.
The SAT is useful as a comparative statistic for the CRT because of the nature of SAT questions, specifically those in
the reading and writing portions of the test. SAT reading comprehension questions generally give an individual a
passage of text to read, and then ask questions designed to measure their comprehension of the
passage. SAT writing questions provide a prompt and ask test-takers to write a long-form essay critically
responding to it. These sorts of questions would, of necessity, require reflection on the knowledge,
either knowledge recently received (as in the reading comprehension portion) or analysis of knowledge that one
already has (as in the writing portion). For example, one practice SAT reading comprehension question located
here (http://www.majortests.com/sat/reading-comprehension-test01) asks "What is the
author implying in the above text?" Another asks which definition of a word best fits the context of the
passage. Both of these questions require the activation of Stanovich and West's cognitive System 2, and the
question on implication specifically requires internal reflection to decide among multiple possible interpretations,
similar to the decisions that must be made when answering CRT questions.
The first pass of this second analysis for SAT correlations revealed a few important pieces of information.
First, neither the CRT nor the mCRT correlates significantly (or even near significance) with the SAT math subscore.
For the other subscores, the oCRT correlates with the SAT reading comprehension subtest at .441*, P
= .013, and with the SAT writing subtest at .503**, P = .002. The mCRT correlates with SAT reading comprehension
at .330, P = .070, and with the SAT writing subtest at .517**, P = .002. So, both the oCRT and the mCRT
correlate with the SAT writing subtest at a high and very significant level, whereas the mCRT does not correlate
with (but approaches correlation with) the SAT reading comprehension test. It would appear that a few questions
exist within the mCRT that do not correlate with the SAT the same way the questions in the oCRT do. To determine
which questions were the culprits, the aggregate scores were split and the questions compared to the SAT subscores
individually.
To provide a baseline for further analysis, the oCRT was first split and compared question by question to the SAT, to
see how the CRT questions (which are usually used only in the aggregate) compare individually, giving a
base level of comparison for the additional questions. As expected from the aggregate data, none of the oCRT
questions correlates with the SAT math score. Surprisingly, however, two of the three original questions do not
correlate with any SAT subscore, although both are somewhat close (P <= .14 for SAT reading, P <= .097 for
SAT writing) and may reach significance with more data. This is important because it shows that the oCRT is
not a monolithic bloc with regard to its correlations with the SAT.
This analysis was repeated with the mCRT. Four questions were found to be very far from correlating with
any of the SAT subscores, with P values at approximately .8 across the board: Months,
Ribboncut, Shoe Price, and Matches. These questions were removed from further analysis, lowering the size of the
mCRT from the previous 10 questions to 6. Upon removing these four questions and rechecking the correlations
between the oCRT, mCRT, and SAT, the correlation between the mCRT and SAT reading comprehension is
now at P < .05, within the same range of .05 > P > .01 as the oCRT.
CRT/HEURISTICS-AND-BIASES CORRELATIONS
Frederick and Toplak both found correlations between the CRT and various measures of economic and cognitive
heuristics. Attempting to replicate their results, I utilized ten such questions, detailed above
in the introduction. At first the data was quite muddied, and no correlations could be found. To see what was
wrong, I examined the correlations between each heuristics-and-biases question and the oCRT/mCRT. What I
found was that questions having to do with money, specifically the temporal discounting questions, did not
perform well at Occidental: neither the oCRT nor the mCRT correlated significantly with any of them, at odds with
Frederick's study.
This lack of correlation was found within the Occidental College participant pool in particular; the
outside-Occidental pool was too small to yield significant data on its own. Questions such as "How much would
you pay to have a book shipped to you overnight" were particularly strange, with answers ranging from 0 to 120
dollars within Occidental. This might be due to the nature of the college experience and the change in
technology between the time of Frederick's study in 2005 and now. Today, a large number of people own Kindles,
iPhones, iPads, and notebooks, and internet access is generally much quicker than it was 8+ years ago when
Frederick performed his study. It could be that easy access to online materials has lowered the
number of physical books that college students are required to order. When a student does order a book,
it may be because they need it immediately for a class; that might be the only time a student orders a physical
book from the internet, necessitating expedited shipping and a higher payment, and skewing the economically
"correct" answer.
Another question that did not correlate, but should have according to Frederick, asked
participants whether they would prefer to be given $3400 immediately or $3800 in one month. Frederick's
justification for picking the latter is that the one-month gain implies an annual discount rate of roughly
280%, higher than can be found in any official investment. However, some participants who took my test
did not see it that way: some claimed that they could easily turn $3400 into
$3800 in less than a month, with some to spare, while others claimed that they needed the money now to pay
pressing bills and could not wait one month. Whether this is peculiar to Occidental College, I cannot say. It should
be noted that the gambler's fallacy question did make it through this pass, so not all measures of economic cognition
were inefficacious. Further, the oCRT and the mCRT both failed to correlate, indicating a problem not in the test but
in the participant pool.
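Frederick's 280% figure can be checked by compounding the one-month gain over twelve months; the arithmetic below is my own sanity check, not a calculation reproduced from the original paper:

```python
# $3,400 now vs. $3,800 in one month: compound the monthly gain
# over 12 months to get the implied annual rate.
monthly_growth = 3800 / 3400             # ~1.118, i.e. ~11.8% per month
implied_annual = monthly_growth ** 12 - 1
print(f"{implied_annual:.0%}")           # ~280% per year
```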
These questions were removed, creating an adjusted aggregate bias measure. To mirror Frederick's
analysis on these measures, the oCRT and mCRT scores were each split into two groups, one low and one high.
Individuals were assigned to the low group if they answered fewer than N/2 questions correctly (where N is the
number of questions), and to the high group if they answered more than N/2 correctly. Low-CRT
individuals, on both the mCRT and the oCRT, scored significantly lower on the adjusted aggregate
biases tasks; conversely, high-CRT individuals on both tests scored significantly higher on the adjusted
biases tasks. Not only that, but the oCRT and mCRT acted almost identically: oCRT_high correlated with the
adjusted bias aggregate at .304, P = .019; mCRT_high at .364, P = .005; oCRT_low at
-.512, P < .001; and mCRT_low at -.378, P = .003.
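The high/low split described above can be sketched as follows. The function name is mine, and (following the strict inequalities in the text) a score of exactly N/2 is assigned to neither group:

```python
def crt_group(score, n_questions):
    """Assign an individual to the 'low' or 'high' CRT group.
    Fewer than N/2 correct -> low; more than N/2 correct -> high.
    A score of exactly N/2 is left unassigned (None), since the text
    defines only the strict inequalities."""
    half = n_questions / 2
    if score < half:
        return "low"
    if score > half:
        return "high"
    return None
```

For the 3-question oCRT this puts scores of 0-1 in the low group and 2-3 in the high group; on an even-length test such as a 6-question mCRT, a score of exactly 3 falls in neither group.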
CRT/DEMOGRAPHIC CORRELATIONS
Despite Frederick's finding that gender influenced the outcome of his test, with females tending
to perform worse than males, no correlation with gender was found on either the oCRT or the mCRT. However, a
correlation was found between both CRTs and age, a measure not used by Frederick, Toplak, or any
other CRT-based experiment that I read. The oCRT correlates with age at .294, P = .024. The mCRT correlates
at .253, P = .053. This is above .05; however, considering how close it is, this is likely a problem of test power,
and I am confident that having more individuals in the pool would have allowed it to reach significance.
FINAL OCRT/MCRT CORRELATION
The previous two passes of analysis did not require the removal of any more questions from the mCRT,
leaving the expanded test at 6 mCRT questions plus 3 oCRT questions, for a total of 9 CRT questions. The mCRT and
oCRT aggregates correlate with each other at .653, P < .001.
DISCUSSION
To save space, the entire list of final questions is not presented here; they are the Airplane
question, the Doctor/Pill question, the Sister/Family question, the Goose Egg question, the Divide by ½ question,
and the Living Room/Den question (as well as the original 3 CRT questions). All of these require at least some
modicum of mathematical thought except the Airplane question, which, quite frankly, is a rather surprising
candidate to make it through the gauntlet of correlations. Its inclusion may be an artifact of overfitting
the questions to the data, or of Occidental's peculiarities. It is also possible that it taps into the same
underlying cognitive reflection abilities as the other questions, and so does indeed deserve a spot on the list.
There are some possible problems with the collected data. Primarily, correlations that should have
existed were occasionally absent, specifically between the CRT scores (oCRT and mCRT) and the SAT
math score, gender, and the economically minded heuristics-and-biases questions. However, it is important to note
that neither the oCRT nor the mCRT correlated with these measures; it would have been much worse if one had
correlated while the other did not, which would mean they acted very differently on one of these measures.
Since both fail to correlate at similar levels, this may be considered a measure of divergent validity.
That being said, the expanded mCRT correlates almost identically with the original oCRT on many
disparate measures, including SAT reading and writing scores, wide-ranging cognitive heuristics problems, and age.
Further, the individual questions comprising the mCRT correlate with the aggregate oCRT, many of them on more
than one measure, and all significantly.
It seems that the mCRT is in fact a viable candidate for expansion of the CRT. More rigorous testing would
of course be required, along with a much larger pool of participants, but these questions preliminarily
seem to travel with the CRT in a way that hints that both are tapping into the same underlying cognitive
process or proclivity: that both are measuring, to various degrees, cognitive reflection.
This is desirable not only for protecting the CRT against invalidation through
proliferation, but also for creating a larger pool of questions that researchers can draw upon, each measuring slightly
different aspects of cognitive reflection ability, which may ease further lines of study with the CRT. One possibility
would be to use the CRT, and any expansions of it, to measure the origin and malleability of reflective cognition.
For example, are these scores generally constant throughout life? A single 3-question test could not answer that, as
it would require repeating the same questions at each measurement. But a larger 9-question test, such as the
expanded CRT presented here, would allow for three measurements of three questions each, for a broader perspective
on how this skill acts through time. Another possibility is to test whether CRT scores can be changed through
training, another repeated-measures design that would require more than the original 3-question
test.
WORKS CITED
Baron, J., & Hershey, J. (1988). Outcome bias in decision evaluation. Journal of Personality and Social Psychology,
54, 569-579.
Frederick, S. (2005). Cognitive Reflection and Decision Making. Journal of Economic Perspectives, 19(4), 25-42.
Henrich, J., Heine, S., & Norenzayan, A. (2010). The weirdest people in the world? Behavioral and Brain Sciences,
33(2/3), 1-75.
Kahneman, D., & Frederick, S. (2002). Representativeness revisited: Attribute substitution in intuitive judgment.
In Heuristics and Biases: The Psychology of Intuitive Judgment. New York: Cambridge University Press.
Kahneman, D. & Frederick, S. (2005). A model of heuristic judgment. The Cambridge Handbook of Thinking and
Reasoning, 267-293.
Tversky, A., & Kahneman, D. (1974). Judgment under uncertainty: Heuristics and biases. Science, 185, 1124-1131.
Kahneman, D., & Tversky, A. (1982). On the study of statistical intuitions. Cognition, 11, 123-141.
Tversky, A., & Kahneman, D. (1983). Extensional versus intuitive reasoning: The conjunction fallacy in probability
judgment. Psychological Review, 90(4), 293-315.
Lehman, D., Lempert, R., & Nisbett, R. (1988). The effect of graduate training on reasoning. American Psychologist,
43, 431-442.
Stanovich, K. E. (2009). What intelligence tests miss: The psychology of rational thought. New Haven: Yale
University Press.
Stanovich, K. E., & West, R. F. (2000). Individual differences in reasoning: Implications for the rationality
debate? Behavioral and Brain Sciences, 23(5), 645-665.
Toplak, M., West, R., & Stanovich, K. (2011). The Cognitive Reflection Test as a predictor of performance on
heuristics-and-biases tasks. Memory & Cognition, 39, 1275-1289.