Tracking Attitudes and Behaviors to Improve Games Ramon Romero Game Developers Conference 2008.


Tracking Attitudes and Behaviors to Improve Games

Ramon Romero, Game Developers Conference 2008

What is it?

http://www.mgsuserresearch.com

The website is filled with useful information we have presented in previous talks on the subject of User Research.

Creating a Feedback Loop For Game Designers

Lend audience insights
Detect problems
Create opportunities to fix those problems
Prior to release

User Advocate / Data Champion
Research expert

We have a lot of experience working with game developers and adapting our approaches to the challenges of the development schedule.

In our setup we have a person who is entirely devoted to the problem of representing and working on user data

Using Formal Research Methods From a Variety of Research Disciplines

Industry Researchers
Usability Engineer
Human Factors Engineer

Academic Researchers
Cultural Anthropologist, Ethnographer
Experimental Psychologists

○ Cognitive, Social, Developmental, Behavioral

There are multiple research types that can provide value to game developers at different points in the development cycle

What is it? Some people will talk about ‘logging’ or automation to refer to the same thing. But what do we mean…

Tracking Real-time User Experience (TRUE)

This refers to logging the things that matter most to the play experience. Where did people die, what killed them, what were their opponents carrying…

We will re-present the key points from this diagram through the remainder of this presentation

Note how our report viewer is active – the arrow is pointing towards that report – the viewer is actively trying to obtain information

TRUE System

Critical events
Surrounding context

On the TRUE System slide(s) we will be rebuilding that diagram.

Critical events and Surrounding context are things we measure.

Critical events are relatively easy to understand. Think of major progress (beating a boss) and major setbacks (dying, losing all your cash) as Critical.

Surrounding context refers to related information (what were they holding, what level were they).
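To make those two ideas concrete, here is a minimal sketch (in Python, with illustrative field names rather than the actual TRUE schema) of a logged critical event paired with its surrounding context:

```python
# A minimal sketch of a TRUE-style event record: a critical event
# plus the surrounding context captured alongside it. Field names
# are illustrative, not the actual schema used by the TRUE system.
from dataclasses import dataclass, field
from typing import Dict, Any

@dataclass
class CriticalEvent:
    event_type: str              # e.g. "player_death", "boss_defeated"
    timestamp: float             # seconds since session start
    mission: str
    encounter: str
    context: Dict[str, Any] = field(default_factory=dict)  # surrounding context

# Example: log a death together with the context needed to answer "why?"
death = CriticalEvent(
    event_type="player_death",
    timestamp=28741.2,
    mission="Mission 6",
    encounter="Encounter 12",
    context={"cause": "UNKNOWN", "weapon_held": "plasma rifle", "player_level": 3},
)
print(death)
```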

Our first example.

Imagine it is the Summer of 2004 and that we just asked a number of real consumers to play through all of Halo 2 over the course of a weekend of testing.

That test has just completed and now we are going to look over the results of that test, just like the ‘active viewer’ from the diagram. Time is pressing, the game will be out this Fall and we need to turn feedback around fast…

…average deaths per mission… …for all missions in the game…

Mission 6 ≈ 8+ hours in…

Although in the course of regular analysis we would examine everything, our eyes are drawn to the spike for Mission 6. And so we click on the bar in the chart

Mission 6 ≈ 8+ hours in…

And now we see the details of death counts for individual encounters across the entire mission. An encounter is the smallest meaningful chunk of a Mission to a designer at Bungie. It can be a conversation or a cutscene but most often it will be a firefight. Results from firefights are below…

“interesting” results ≈ 8+ hours in…

In regular analysis we would examine all 5 spikes… but for this talk we will focus in on one example with “interesting” results. Now let’s say you want to get even more information so you click on this bar.

Cause of Player Death

Flood Human: 56%
UNKNOWN: 16%
Flood Elite: 16%
Other: 8%
Sentinels: 4%

And now we see who is causing the various deaths. The Flood Humans are the greatest source of problems and so we perhaps could retune them in some fashion.

But our “interesting” result was the high number of Unknowns. Something occurred here which we did not anticipate, and hence had no label for it. 16% was very high. The highest we saw in other parts of the game was 1-2%. And so wanting to learn more we click on the Unknown portion of the chart…
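Under the hood this kind of drill-down is just successive grouping of the same logged death events. A rough sketch of that aggregation, assuming hypothetical record fields and made-up data:

```python
# Sketch of the drill-down: the same list of death events grouped at
# progressively finer levels. Data and field names are illustrative.
from collections import Counter

deaths = [
    {"mission": "Mission 6", "encounter": "Encounter 12", "cause": "Flood Human"},
    {"mission": "Mission 6", "encounter": "Encounter 12", "cause": "UNKNOWN"},
    {"mission": "Mission 6", "encounter": "Encounter 3",  "cause": "Flood Elite"},
    # ... one record per logged death ...
]

per_mission = Counter(d["mission"] for d in deaths)
per_encounter = Counter(d["encounter"] for d in deaths if d["mission"] == "Mission 6")
cause_counts = Counter(
    d["cause"] for d in deaths
    if d["mission"] == "Mission 6" and d["encounter"] == "Encounter 12"
)

total = sum(cause_counts.values())
for cause, n in cause_counts.most_common():
    print(f"{cause}: {100 * n / total:.0f}%")
```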

Video plays… multiple players are drawn to their own doom… to a pit that is too attractive and looks too much like the correct way to go…

…nearly everyone is fooled by the pit once… they fall straight in…

And this is how we fixed that pit

TRUE System

Critical events
Surrounding context
Supporting information: Video

Deaths, both averages and raw counts

What killed them

In this example video acts as the back-up plan. Something occurred that we could not anticipate.

Rather than plowing through hours and hours of video to discover the source of the problem, the TRUE system points us right to the problem, so we can use those hours to think about solutions.

TRUE Principles

Instrument to answer ‘why’

When building a TRUE system it is important to be able to use the data to tell a story once you have collected it. Otherwise the data has a tendency to sit unused, or findings remain mysterious and unhelpful.

Another example…

In Forza 2 one of the modes of play is a Time Trial. You beat the time on a given track and you earn a car. We tested every one of them and will focus in on the results from the Tsukuba Short Circuit…

Time Trial Summary: Tsukuba Short

% of participants passed: 0%
Target Time to Beat the Trial (seconds): 45.7
Average Time (seconds): 84.9

…where things did not go well

On first appearance things look really bad. People are averaging runs that are nearly 40 seconds off. But that is a little misleading…

Arcade | Time Trial | Tsukuba Short

You see we actually had people run the race 10 times, and so it is a better data presentation to break out results per run. Averages are not always your friend. Click ahead to look through individual results…
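As a sketch of what "break out results per run" means in practice, the per-participant view can be computed directly from the lap logs. The lap times below are made up for illustration:

```python
# Sketch: break lap times out per participant and per run instead of
# relying on a single overall average. Times are illustrative.
from statistics import mean

runs = {  # participant -> lap time (seconds) for each of their 10 runs
    "P1": [95.0, 88.2, 71.4, 60.3, 55.1, 52.0, 49.9, 47.1, 46.2, 46.0],
    "P2": [120.4, 90.1, 60.2, 47.3, 46.1, 46.0, 45.9, 47.2, 46.3, 45.8],
}

overall_average = mean(t for laps in runs.values() for t in laps)
print(f"Overall average: {overall_average:.1f}s")   # hides the improvement curve

for pid, laps in runs.items():
    print(f"{pid}: best {min(laps):.1f}s, last {laps[-1]:.1f}s, trend {laps}")
```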

Tsukuba – P1

This participant improved over time, nearly beating it but not quite

Tsukuba – P2

She nearly beat it on 4 separate occasions

Tsukuba – P3

Wow – dramatic changes… and was able to put together one decent lap

Tsukuba – P4

Lots of good runs but even that last one was not good enough

Tsukuba – P5

Another case of dramatic change.

Data loss can occur. Sometimes our logging system fails, or the game crashes or a participant needs to use the restroom. All of this means that sometimes we do not learn the full story for all participants

Tsukuba – P6

Several close runs but no cigars

Time Trial Summary: Tsukuba Short

% of participants passed: 0%
Target Time to Beat the Trial (seconds): 45.7
Average Time (seconds): 84.9

UR suggested: 50.2 seconds
Design decided: 48.8 seconds

So 84.9 was the wrong number to pay attention to. Instead we focused on each individual’s best run and how they progressed over time. And then we made a suggestion…
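The talk does not say exactly how the 50.2-second suggestion was derived, but one plausible way to reason about a candidate target is to ask what share of participants' best runs would have beaten it. A hedged sketch with illustrative numbers:

```python
# Sketch: evaluate a candidate target time by asking how many
# participants' best runs would have beaten it. The actual method
# behind the 50.2s suggestion is not described in the talk.
best_runs = [46.0, 45.8, 51.3, 46.5, 49.0, 47.7]   # illustrative best laps

def pass_rate(target: float) -> float:
    return sum(1 for t in best_runs if t <= target) / len(best_runs)

for candidate in (45.7, 48.8, 50.2):
    print(f"target {candidate:.1f}s -> {pass_rate(candidate):.0%} would pass")
```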

It has been suggested that, in the face of all this data, a development team will lose control of their vision. This is not the case. The Game Development team makes all the final decisions about how to adjust their game.

The Game Designers looked over the same data but they knew that the cars were not done being tuned and decided a different number would work well. The next few slides show you how well their number worked…

Adjusted Target Time

Tsukuba – P1, Adjusted

Tsukuba – P2, Adjusted

Tsukuba – P3, Adjusted

Tsukuba – P4, Adjusted

Tsukuba – P5, Adjusted

Tsukuba – P6, Adjusted

TRUE System

Critical events
Surrounding context
Supporting information: Video
Data visualization: Needs iteration. LOTS!!

Data visualizations are a key aspect of the TRUE system. You have seen a bunch already and they are not always so straightforward to create

TRUE Principles

Instrument to answer ‘why’
Make the key findings pop

Intent
Designers must declare intent

This is the goal of those visualizations. No theoretical discussions. No debates. Clear, clear, clear findings. If not, then the data visualizations actually work against you, no matter how clear they are to you.

Luckily there are a few Game Designers around who can help you out with this. Once they declare intent then working together you can create those visualizations. Examples of what we mean by intent…

Time Trial Summary: Tsukuba Short

% of participants passed: 0%
Target Time to Beat the Trial (seconds): 45.7
Average Time (seconds): 84.9

Beatable within 10 tries

We already saw the intent statement from the Forza 2 designers… but there was more to it.

This is an excellent statement. The statement not only helps determine the nature of the visualization, it also helped us determine precisely how to test it… i.e., let’s give them 10 laps and see what happens.

It’s also a really easy example of design intent, as was our Halo 2 example earlier… people need to die at the “intended” rate…

But there are much harder cases…

Crackdown is a successful title that was released in 2007…

It is a non-linear game…

This creates difficulties when attempting to map out intent.

If people die too much then they are supposed to find another way around. At any one time players can be doing anything…

The many ways to play Crackdown…

Users must find their own fun…

The intent statement is very broad… and so we found that certain aspects of the game’s intent were not really declared but were understood only in the context of the play experience…

The experience players had with Agency Nodes, also referred to as Supply Points, is an interesting example…

In the game these points are places where you go to re-supply your weapons… they also double as re-spawn points.

Video plays… opening play sequence in Crackdown (starting after opening cutscene completes)… run around a little… find car… drive out of Agency starting point… takes a minute or two…

Video plays… jump ahead… we found some fighting… run around… eventually die… and respawn… back at the Agency… where we started… now we have to go through the same tired sequence of finding a car and driving out of there before we can get back to the action…

People were not finding the orange supply points which, again, are respawn points that players could use to get back to the action sooner. So this meant death was more punishing than the Designers intended.

Using TRUE we started tracking how long it took players to find the orange beacons…

Users must find their own fun…

Average Time (mins) to First Agency Node

Test 7: 31.5
Test 8: 21.2
Test 9: 21.1
Test 11: 14.1

…but first they should find…

How many times do you think people died in 31 minutes of play… quite a few it turns out… more than 50 times for one poor individual.
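A "time till first X" metric like the one above can be computed straightforwardly from an event stream. A minimal sketch, assuming hypothetical event types and field names:

```python
# Sketch of a "time till first X" metric: for each participant, the
# elapsed minutes until their first Agency Node (supply point) event.
# Event data and field names are illustrative.
from statistics import mean

events = [
    {"participant": "P1", "minutes": 31.5, "type": "found_agency_node"},
    {"participant": "P2", "minutes": 12.0, "type": "player_death"},
    {"participant": "P2", "minutes": 27.4, "type": "found_agency_node"},
    # ... full event stream per participant ...
]

first_node = {}
for e in sorted(events, key=lambda e: e["minutes"]):
    if e["type"] == "found_agency_node":
        first_node.setdefault(e["participant"], e["minutes"])

print(f"Average time to first Agency Node: {mean(first_node.values()):.1f} min")
```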

So we made a few adjustments to make sure that players would notice these beacons and things improved over time.

And returning to our key point… the intent statement needed to evolve, and it did so iteratively.

TRUE Principles

Instrument to answer ‘why’
Make the key findings pop

Intent
Designers must declare intent
UR must find a way to measure it

Once the intent is declared (or discovered), the act of measuring and analyzing it can be really straightforward, as in all of the examples so far.

But sometimes the measurements can be misleading or unclear...

Valhalla is a multiplayer map in Halo 3. It was available as early as the Alpha test period.

In the distance is one of the towers on this map. Towers are places where players will spawn into the game, vehicles are usually nearby, and there are a pair of transport devices that will also shoot players out into the environment. Players are expected to fight for control over them.

On the next slide we will look at another picture of the same tower. This time it will be a small red blob on the right side of the picture.

H3 Alpha

Here it is…

Everywhere you see red is a spot where relatively MORE deaths occurred than in the black and grey areas. We call this a heat map. The deeper the red, the hotter the spot, the more violence people are committing.
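For readers wanting to emulate this, a death heat map is essentially a 2D histogram of logged death positions. A minimal sketch, with made-up map bounds and coordinates:

```python
# Sketch: build a death "heat map" by binning logged death positions
# into a 2D grid. Map bounds and positions are illustrative.
from collections import Counter

MAP_MIN, MAP_MAX, CELL = 0.0, 512.0, 16.0   # world units per heat-map cell

death_positions = [(120.5, 300.2), (118.0, 297.9), (402.7, 88.1)]  # (x, y) per death

grid = Counter()
for x, y in death_positions:
    cell = (int((x - MAP_MIN) // CELL), int((y - MAP_MIN) // CELL))
    grid[cell] += 1

# Hotter cells = more deaths; a renderer would map counts to red intensity.
for cell, count in grid.most_common(5):
    print(f"cell {cell}: {count} deaths")
```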

Anyway, the neat thing is the huge problem we can see here. You see it, right?

No, not this… Not the other tower…

And not here either…

It’s here… and easiest to understand when we look at the beta results for comparison.

Do not feel bad if you missed it. The User Researcher working on the product almost missed it too.

H3 Alpha

Users must use the entire map…

People were not using this part of the map…

So an assumption in the design intent was found and declared, and then the adjustment was relatively straightforward. They changed the direction that the transport devices would shoot players so that they could experience all parts of the map. And the beta results showed that the adjustment worked out as hoped.

But let’s not gloss over the key point here. The User Researcher working by him or herself might have missed this, because not all aspects of design intent will be clearly declared every time.

TRUE Principles

Instrument to answer ‘why’
Make the key findings pop

Intent
Designers must declare intent
UR must find a way to measure it

Design and UR analyze together

Designers are experts at the experience they are trying to create, so naturally they should help with the analysis. In the example you just saw, the Designers at Bungie saw the problem instantly where the User Researcher nearly missed it.

All the more reason to concentrate on the visualizations and ensure the findings are instantly understandable.

TRUE System

Critical events
Surrounding context
Supporting information: Video
Data visualization: Needs iteration. LOTS!!

The initial version of this chart was a giant indistinct red blob, informing us that our alpha participants died just about everywhere. It took time and repeated revision to make the heat map meaningful.

The next few slides are visualizations noting where players are standing (measurements pulled every 15 seconds) in a level broken down by encounter.

Encounters are the smallest meaningful chunk of an individual mission to a Bungie Designer. An encounter can be a firefight as in the Halo 2 example earlier.

In this case we are looking at encounters of the other variety (NPCs are speaking) where there is no action.
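Mechanically, these plots come from grouping the periodic position samples by whichever encounter was active when each sample was taken. A small sketch with illustrative sample records:

```python
# Sketch: group periodic position samples (one every 15 seconds) by the
# encounter that was active when each sample was taken. Sample data and
# field names are illustrative.
from collections import defaultdict

samples = [
    {"participant": "P1", "encounter": "Encounter 1", "pos": (10.2, 4.5)},
    {"participant": "P2", "encounter": "Encounter 1", "pos": (11.0, 3.9)},
    {"participant": "P1", "encounter": "Encounter 2", "pos": (42.7, 8.1)},
    # ... one sample per participant per 15 seconds ...
]

by_encounter = defaultdict(list)
for s in samples:
    by_encounter[s["encounter"]].append(s["pos"])

for encounter, positions in sorted(by_encounter.items()):
    print(f"{encounter}: {len(positions)} samples")   # each would be one dot on the plot
```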

Encounter 1

Here we see all the places where the participants were standing during Encounter 1.

We are not distinguishing among participants – all points where a participant ever stood are noted.

Encounter 2

Now we see that in Encounter 2 (encounters proceed based on a combination of player movement and timing) players are beginning to move forward and also appear to be milling around in the starting area of the mission.

Encounter 3

Note that there are more dots visible now. The number of participants has not changed; we are seeing greater indication of people spreading out.

The overall pattern continues – they move forward… and then they move back.

Remember this entire sequence contains no action.

Encounter 4

And now we see even more spreading out…

Yet some people are still wandering back here.

There are numerous steps we could take to attempt to learn what the problem is. In the TRUE system we could click on any individual dot and watch video to observe what transpired. But let’s be honest – if you are planning to emulate this system the video will be the last thing you get functioning.

What else could inform you about what is going on?

Maybe the participants themselves can be of service…

Whenever we tested Halo 2 or Halo 3, a survey was integrated into the play experience. Every 3 minutes, participants are given a chance to tell us how they feel about the experience.

I know what you are thinking. Every 3 minutes…

People get used to the survey. Every 3 minutes… they get used to it.

How do we know? Well we test our own methods and ask people how they feel about the surveys.

And they get used to the surveys…
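For anyone building their own version, the 3-minute prompt is conceptually just a timer-driven sampler that records a rating plus some context. A hedged sketch; the callback names and record shape are invented for illustration:

```python
# A sketch of a timer-driven in-game attitude prompt. The 3-minute
# cadence is from the talk; the function names, callbacks, and record
# shape are illustrative only.
import time

SURVEY_INTERVAL = 3 * 60   # seconds between prompts during real sessions

def run_survey_loop(get_rating, get_context, log, session_length,
                    interval=SURVEY_INTERVAL):
    """Every `interval` seconds, record a rating plus basic context."""
    start = time.monotonic()
    next_prompt = start + interval
    while time.monotonic() - start < session_length:
        now = time.monotonic()
        if now >= next_prompt:
            log({"t": round(now - start, 1),
                 "rating": get_rating(),        # e.g. forced-choice 1-5
                 "context": get_context()})     # e.g. current mission, position
            next_prompt += interval
        time.sleep(0.25)

# Tiny demo with stub callbacks and a shortened interval/session:
records = []
run_survey_loop(get_rating=lambda: 4,
                get_context=lambda: {"mission": "Mission 1"},
                log=records.append,
                session_length=2.0,
                interval=0.5)
print(records)
```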

Lots of Brown. What was that again?

Here we are at the beginning of the first mission (did I mention that?). People have not even fired a weapon, and they are so frustrated with the game that they are done with it… they are ready to quit.

People’s attitudes seem like an important thing to know…

The purple and green dots clue us in. People are feeling lost. The brown dots tell us how extreme the problem is. We fixed the problem by providing more guidance and the problem dissipated.

TRUE System

Critical events
Surrounding context
Supporting information: Video
Data visualization
Measure attitude

The 4th and final pillar of the TRUE system.

There were several things going on all at once in that example so we will break out the benefits of measuring attitude into separate examples.

On the next slide we will again be examining a mission in Halo 3. This time dots represent spots where the player was killed and color indicates what killed them.

The level is the first one where players face a Scarab (a boss fight). Purple dots tell you where players were killed by the Scarab.

The boss fight…

We tested again one week later… Weren’t purple dots the boss?

Where did all the tan dots come from?

Oh… the Grunts…

Hold it… the Grunts??!?!…

It turns out we are talking about a bug. This testing took place at the Normal difficulty level. But there was a bug introduced into the code somewhere such that whenever a Grunt was behind the controls of a vehicle, their AI was suddenly reset to maximum difficulty, turning them into, in the words of one participant…

“…grunts of death…”

At the end of every mission we asked participants how they felt about the play experience. We also included several open-ended questions, and this was one of the responses.

The bug was identified, fixed and dismissed.

Fine… but we were discussing measures of attitude, how does this matter?

“…grunts of death…”

When preparing this presentation the User Researcher working on the product (John Hopson) was struggling to remember good examples (we ran a multitude of studies, generating hundreds if not thousands of individual findings). Yet within a matter of moments he pulled this single individual finding out of his head with little effort.

It was memorable…

TRUE System

Critical events
Surrounding context
Supporting information: Video
Data visualization
Measure attitude

Memorable quotes

When dealing with large sets of data that we are attempting to interpret quickly, it is important to discover certain hooks or anchors around which to tell the story of what happened. Without those anchors the dataset, no matter how powerful or complete, is much more likely to be ignored.

Different people are convinced by different kinds of information. Some need clear breakdowns and percentages of who did what and where; others need to hear the story of what transpired. These memorable quotes are critical to helping your audience (and yourself) understand what is going on.

Time Trial Summary: Tsukuba Short

% of participants passed: 0%
Target Time to Beat the Trial (seconds): 45.7
Average Time (seconds): 84.9

Beatable within 10 tries...

…and the experience should feel appropriately challenging…

It turns out the Design intent statement had multiple clauses…

“How challenging was this race?”

After each set of 10 races we asked them how they felt…

Those who dislike this so-called “subjective” data tend to criticize the fact that players only complain in one direction. As you can see looking at the data from the Sunset Peninsula track, this is not always the case.

In the games industry we are always attempting to build new and novel experiences. This means that we could be wrong in either direction and the behavioral data by itself does not necessarily communicate that. Perhaps people could have completed their 10 laps of the Tsukuba Short circuit and indicated that it represented a reasonable challenge as with the Suzuka Circuit. Even with a well understood mode of gameplay such as a Time Trial we can be wrong in unpredictable fashion.

“How challenging was this race?”

Attitude indicates valence
Attitude indicates magnitude of dissatisfaction

So we need some outside opinions to help us understand the “polarity” or “valence” of the experience, and to inform us whether or not there is a problem.

We also get a statement of priority. Nearly 80% of the participants felt the Tsukuba Short circuit was harder than it needed to be, making it the worst circuit we tested that day. If we were in a crunch and could only fix a few things in the game, then this data would help us make the right decision about where to expend resources.
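That priority reading can be produced by ranking circuits on the share of participants answering on the "too hard" side of the challenge question. A sketch with invented response data:

```python
# Sketch: turn per-circuit "How challenging was this race?" responses
# into a priority ranking by the share of participants answering on the
# "too hard" side of the scale. Response data is illustrative.
responses = {   # circuit -> ratings, e.g. 1 = way too easy ... 5 = way too hard
    "Tsukuba Short":    [5, 5, 4, 5, 4, 3],
    "Suzuka Circuit":   [3, 3, 2, 3, 4, 3],
    "Sunset Peninsula": [2, 1, 2, 3, 2, 2],
}

def too_hard_share(ratings):
    return sum(1 for r in ratings if r >= 4) / len(ratings)

for circuit, ratings in sorted(responses.items(),
                               key=lambda kv: too_hard_share(kv[1]),
                               reverse=True):
    print(f"{circuit}: {too_hard_share(ratings):.0%} rated it harder than it needed to be")
```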

Tsukuba – All, Adjusted

Attitude says nothing about the magnitude of the fix

Those who hate this form of data tend to point out this issue. And they are absolutely right to do so. The behavioral data gives you precise information on how to fix an issue. But don’t let that downplay the significance of acquiring attitudinal data. Without it you may be focusing on the wrong things.

TRUE System

Critical events
Surrounding context
Supporting information: Video
Data visualization
Measure attitude
Memorable quotes
Valence
Priority

One final point is that these measures are less susceptible to the so-called subjectivity problem. If we conducted that same Forza 2 test 100 times, the memorable quotes would differ based on the individuals, but the Valence and Priority findings would not differ substantially. These metrics are statistically reliable when used appropriately.

TRUE Principles

Instrument to answer ‘why’
Make the key findings pop

Intent
Designers declare intent
UR must find a way to measure it

Design and UR analyze together
Behavior and Attitude receive equal weight

Another way to state this point is to say that each type of metric holds primacy in its arena of mastery. But neither type of data should leave its arena

Final example in the presentation…

Test 12

Imagine another one of the all-weekend tests has just completed. This is Test 12, so the Agency Node/Supply Point problem is basically behind us. People played a total of 12 hours of the game.

There are tons of data available to us. We could examine:
• Deaths via heat map or by enemy type
• Where people go
• How far people progressed in the RPG component of the game
• Which boss fights they discovered and/or completed
• What skill types players are using to defeat enemies

We have to turn the data around tomorrow so the team can make adjustments as needed. Remember it’s a non-linear game and so the intent statement is not incredibly good at leading your queries…

Users must find their own fun…

Actually it does kind of point towards one question…

“How fun was this game?”

Naturally at the end of that 12 hours of play we asked people how they felt about the experience.

Let’s take a look…

3.8

We use a 5-point scale, so that’s pretty good but not fantastic. You want to be up and over 4.1 or 4.2 if possible. Looking at the average response is a little weird so…
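Since the average on a 5-point item can hide the shape of the responses, it helps to look at the distribution alongside it. A tiny sketch with made-up ratings:

```python
# Sketch: on a 5-point "How fun was this game?" item, the average can
# hide the shape of the responses, so look at the distribution too.
# Ratings are illustrative.
from collections import Counter
from statistics import mean

ratings = [4, 4, 3, 5, 4, 3, 4, 4, 5, 2]

print(f"Average: {mean(ratings):.1f}")
counts = Counter(ratings)
for value in range(5, 0, -1):
    print(f"{value}: {'#' * counts[value]}")
```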

“How fun was this game?”

It’s really just another valence question…

And things look pretty good. Usually a great deal of our efforts are spent on reducing the negatives but we appear to have removed most of the primary frustrations from the experience…

So if we want to amp the positive… where should we go…?

What was fun…?

“I really liked the overall feeling of mobility the game possesses. Especially once my agility greatly improved…”

“the super hero like abilities”

“I like leveling up the most. Seeing my guy get stronger and jump higher made the game good.”

“being able to upgrade my character”

Let’s check the open-ended responses. Maybe there is something memorable to think on…

It seems like there is a trend… what is it in Crackdown that makes people feel like a Super Hero…?

THEOREM… GIVEN: AGILITY = FUN?

They are mostly talking about jumping and running fast, both of which are tied to the Agility statistic, a player statistic that they can increase by collecting agility orbs within the game.

How would we verify that…? There was some data on this… how far did people get on the RPG skillset…

What level were they…?

By the end of the study (12 hours of play)…

Just focusing on agility…

And once leveled up by as much as two stars, players can jump onto rooftops with ease and should feel substantially faster.

Let’s take just these folks who got to 2 stars... And see how they responded…

“How fun was this game?”

…to this question…

And they all fell in this bucket…

Interesting but it’s not absolute proof of a relationship. We need far more evidence to prove this. We discussed the possibility with the developers…
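The slicing itself is simple: segment participants by the behavioral measure (agility stars reached) and compare their attitude responses. A sketch with invented data; as noted, this is a lead to verify, not proof of a causal relationship:

```python
# Sketch: slice the end-of-study fun ratings by a behavioral segment
# (here, whether a participant reached 2+ agility stars). Data is
# illustrative; the talk treats this as a lead to verify, not proof.
from statistics import mean

participants = [
    {"id": "P1", "agility_stars": 2, "fun": 5},
    {"id": "P2", "agility_stars": 0, "fun": 3},
    {"id": "P3", "agility_stars": 3, "fun": 5},
    {"id": "P4", "agility_stars": 1, "fun": 4},
]

high = [p["fun"] for p in participants if p["agility_stars"] >= 2]
low  = [p["fun"] for p in participants if p["agility_stars"] < 2]
print(f"2+ agility stars: mean fun {mean(high):.1f} (n={len(high)})")
print(f"under 2 stars:    mean fun {mean(low):.1f} (n={len(low)})")
```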

TRUE System

Critical events
Surrounding context
Supporting information: Video
Data visualization
Measure attitude
Memorable quotes
Valence
Priority
Can direct data mining…

Quick aside: ‘can’, not ‘shall’, not ‘will’. Those memorable quotes can be lighthouses in the mist.

Anyway, we discussed the possibility with the developers. Meaning that while all of this was going on…

Players needed to treat this as a goal…

And these ones too…

Several adjustments were made with the intent of drawing users’ attention to the agility orbs, in some cases changing their locations to make them more accessible.

So there would be a lot more of this going on…

Test 13

And then we retested…

“How fun was this game?”

Checked our results…

4.5

And felt pretty good about where the game was going…

Tracking Real-time User Experience (TRUE) – System

Critical events
Surrounding context
Supporting information: Video
Data visualization
Measure attitude
Memorable quotes
Valence
Priority

There is lots of cool stuff in the TRUE system. We are stressing the efficacy of attitudinal measures because when the behavioral data comes rolling in there is a tendency to downplay attitude and this is not an advisable practice.

Tracking Real-time User Experience (TRUE) – Principles

Instrument to answer ‘why’
Make the key findings pop

Intent
Designers declare intent
UR must find a way to measure it

Design and UR analyze together
Behavior and Attitude receive equal weight
○ Behavior measures Design intent
○ Attitude validates Design intent

In truth, multiple clauses or not, the design intent statement will always be a behavioral statement… “we want/expect players to do ‘x’”…

The second part… “and we want players to feel good about ‘x’”… i.e., acquiring opinions about the intended experience, is the determining factor in your overall success. You need to measure attitude and ensure the metrics receive appropriate primacy in the face of overwhelming (and cool) behavioral data.

Tracking Real-time User Experience (TRUE) – Do it yourself…

General Measures
• Overall Status: location, timestamp, ‘minor’ progress, items
• Attitudes: linked to general status; forced choice; open-ended
• Critical Events

Critical Events by Game Type
• Event Games: measure outcomes
• Linear Games: ‘major’ progress, ‘major’ setbacks
• Non-linear Games: time till… x Attitude!

Measuring EVERYTHING impedes analysis
We may expand on these points in future talks…
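As a compact "do it yourself" starting point, the general measures above could map onto three record types: periodic status, attitudes linked to that status, and critical events. This is only a sketch with assumed names and fields, not a specification:

```python
# A compact "do it yourself" sketch tying the general measures together:
# periodic status records, attitude records linked to status, and
# critical events. Names and fields are illustrative, not a spec.
from dataclasses import dataclass, field
from typing import Dict, Any, Optional

@dataclass
class StatusRecord:                 # overall status, sampled on a timer
    timestamp: float
    location: tuple
    minor_progress: Dict[str, Any] = field(default_factory=dict)
    items: Dict[str, int] = field(default_factory=dict)

@dataclass
class AttitudeRecord:               # forced-choice and open-ended, linked to status
    timestamp: float
    forced_choice: int              # e.g. a 1-5 rating
    open_ended: Optional[str] = None
    linked_status: Optional[StatusRecord] = None

@dataclass
class CriticalEventRecord:          # 'major' progress/setbacks, or "time till x"
    timestamp: float
    event_type: str
    context: Dict[str, Any] = field(default_factory=dict)

print(AttitudeRecord(timestamp=600.0, forced_choice=4))
```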

Thanks to…

Bungie Design and Bungie Engineering
Halo 2 and Halo 3 User Research: ○ John Hopson, Kris Moreno, Randy Pagulayan

Turn 10: Forza 2
User Research: ○ Daniel Gunn, Tracey Sellar

Real Time Worlds: Crackdown
User Research: ○ Jerome Hagen, Eric Schuh

Microsoft Game Studios
User Research Manager: ○ Dennis Wixon
Game Essentials Director: ○ David Holmes
Corporate Vice President: ○ Shane Kim

http://www.mgsuserresearch.com