Legal Disclaimer
INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS
OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS
DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL
ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO
SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A
PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER
INTELLECTUAL PROPERTY RIGHT.
UNLESS OTHERWISE AGREED IN WRITING BY INTEL, THE INTEL PRODUCTS ARE NOT DESIGNED NOR INTENDED FOR
ANY APPLICATION IN WHICH THE FAILURE OF THE INTEL PRODUCT COULD CREATE A SITUATION WHERE PERSONAL
INJURY OR DEATH MAY OCCUR.
Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not
rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel
reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities
arising from future changes to them. The information here is subject to change without notice. Do not finalize a
design with this information.
The products described in this document may contain design defects or errors known as errata which may cause
the product to deviate from published specifications. Current characterized errata are available on request.
Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your
product order.
Copies of documents which have an order number and are referenced in this document, or other Intel literature,
may be obtained by calling 1-800-548-4725, or go to: http://www.intel.com/design/literature.htm
Software and workloads used in performance tests may have been optimized for performance only on Intel
microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer
systems, components, software, operations, and functions. Any change to any of those factors may cause the
results to vary. You should consult other information and performance tests to assist you in fully evaluating your
contemplated purchases, including the performance of that product when combined with other products.
Any software source code reprinted in this document is furnished under a software license and may only be used
or copied in accordance with the terms of that license.
Intel, the Intel logo, and Ultrabook, are trademarks of Intel Corporation in the US and/or other countries.
Copyright © 2012-2013 Intel Corporation. All rights reserved.
*Other names and brands may be claimed as the property of others.
Optimization Notice
Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804
Table of Contents
Introduction .............................................................................................................................................. 1
Welcome ............................................................................................................................................... 1
About the Camera ................................................................................................................................. 2
High-Level Design Principles ..................................................................................................................... 5
Input Modalities .................................................................................................................................... 5
Design Philosophy ................................................................................................................................. 6
Multimodality ........................................................................................................................................ 7
Gesture Design Guidelines ........................................................................................................................ 8
Capture Volumes ................................................................................................................................... 8
Occlusion ............................................................................................................................................... 8
High-Level Mid-Air Gesture Recommendations ................................................................................... 9
Recognized Poses ................................................................................................................................ 11
Universal Gesture Primitives ............................................................................................................... 12
Other Considerations .......................................................................................................................... 15
Samples and API .................................................................................................................................. 17
Voice Design Guidelines .......................................................................................................................... 19
High-Level Voice Design Recommendations ....................................................................................... 19
Voice Recognition ............................................................................................................................... 19
Speech Synthesis ................................................................................................................................. 21
Samples and API .................................................................................................................................. 21
Face Tracking Design Guidelines ............................................................................................................. 22
High-Level Recommendations ............................................................................................................ 22
Samples and API .................................................................................................................................. 22
Visual Feedback Guidelines..................................................................................................................... 23
High-Level Recommendations ............................................................................................................ 23
Representing the User ........................................................................................................................ 23
Representing Objects .......................................................................................................................... 25
2D vs. 3D ............................................................................................................................................. 26
Traditional UI Elements ....................................................................................................................... 26
Integrating the Keyboard, Mouse, and Touch .................................................................................... 27
Questions or Suggestions? ...................................................................................................................... 28
Introduction
Welcome
Participate in the revolution of Perceptual Computing! Imagine new ways of navigating your world with
more senses and sensors integrated into the computing platform of the future. Give your users a new,
natural, engaging way to experience applications, and have fun while doing it. At Intel we are excited to
provide the tools as the foundation for this journey with the Intel® Perceptual Computing SDK—and look
forward to seeing what you come up with. Over the next few months, you will be able to incorporate
new capabilities into your applications including close-range hand gestures, finger articulation, speech
recognition, face tracking, and augmented reality experiences to fundamentally change how people
interact with their PCs.
Perceptual Computing is about bringing exciting user experiences
through new human-computing interactions where devices sense and
perceive the user’s actions in a natural, immersive, and intuitive way.
This document is intended to help you create innovative, enjoyable, functional, consistent, and powerful
user interfaces for the Perceptual Computing applications of the future. In particular, it will help you:
Develop compelling user experiences appropriate for the platform.
Design intuitive and approachable interactions.
Make proper use of different input modalities.
Remember, Perceptual Computing is a new field, and the technology gets better literally every week.
Don’t just design for today; as a designer and developer you will need to be creatively agile in designing
for extensibility, modularity, and scalability for tomorrow’s capabilities. We’ll share new updates with
you as they become available!
About the Camera
Intel has announced the release of a peripheral device for use in Perceptual Computing applications: the
CREATIVE* Interactive Gesture Camera. This is the first, but not necessarily only, technology platform
from Intel that will be able to sense gesture, voice, and other input modalities. The guidelines in this
document apply to this device, but also apply, in a broader sense, to other potential technology
platforms.
The following are some of the critical specifications of the CREATIVE Interactive Gesture Camera:
Size: 4.27 in × 2.03 in × 2.11 in (10.8cm × 5.2cm × 5.4cm)
Weight: 9.56 oz (271 grams)
Power: Single USB 2.0 (<2.5W)
RGB Camera
Native Resolution: 720p (1280×720 pixels)
Frame Rate: 30fps
FOV: 73 degrees diagonal
Range: 0-23 feet (0m-7.01m)
RGB + Depth frame sync
IR Depth Sensor (3D depth mapping)
Native Resolution: QVGA (320×240 pixels)
Frame Rate: 30fps
FOV: 73 degrees diagonal
Range: 6 inches to 3.25 feet (15 cm to 100 cm)
Ranging Technology: Time-of-flight
Audio
Dual-array microphones
Recommended System Configuration
PC with 2nd or 3rd generation Intel® Core™ processor
Windows* 7 with Service Pack 1 or higher / Windows 8 Desktop UI
4GB system memory
USB 2.0 port
[Figure: the camera hardware, with callouts for the dual-array microphones, HD 720p image sensor, power LED indicator, multi-attach base, and 3D depth sensor]
Physical Device Configuration
You’ll want your app to work on a variety of platforms. Users might be running your application on a
notebook, Ultrabook™ device, All-in-one, convertible, tablet, or traditional PC and monitor. These
different platforms present different ergonomic limitations. Keep in mind the following variables:
Screen size
Smaller laptops and Ultrabook systems commonly have 13-inch screens and, occasionally, have even
smaller screens. Desktops may have 24-inch screens or larger. This presents a design challenge for
generating application UI and related artwork and for designing interactions. You must be flexible in
supporting different display sizes.
Screen distance
Users are normally closer to laptop screens than desktop ones because laptop screens and keyboards
are attached. Likewise, a laptop screen is often lower than a desktop one, relative to the user’s face and
hands.
When using a laptop, a user’s hands tend to be very close to the screen. The screen is usually
lower, relative to the user’s head.
When using a desktop, a user’s hands are farther away from the screen. The screen is also higher,
relative to the user’s head.
Camera configuration
The Perceptual Computing camera is designed to be mounted on top of the monitor. Design your
application assuming that this is the location of the camera. The camera is typically pointed at the user
such that the user’s head and the upper portion of the user’s torso are in view. This supports common
use cases such as video-conferencing. The camera will be placed at different heights on different
platforms. For a large desk mounted display, the camera height could be even with the top of the user’s
head, oriented to look down at the user. For an Ultrabook device on the user’s lap, the camera could be
much lower, angled up at the user. Your application should support these different camera
configurations.
Proper camera mounting on a stand-alone monitor.
Proper camera mounting on a laptop.
You should be flexible in supporting different screen sizes and camera configurations since this will
impact the user’s interaction space.
High-Level Design Principles
To design a successful app for the Perceptual Computing platform, you must understand its strengths.
The killer apps for Perceptual Computing will not be the ones that we have seen on traditional
platforms, or even more recent platforms such as phones and tablets.
Input Modalities
What sets the Perceptual Computing platform apart from traditional platforms are the new and
different input modalities. You’ll want to understand the strengths of these modalities, and incorporate
them into your app appropriately. It can be especially powerful to combine multiple modalities. For
example, users can often coordinate simultaneous physical and voice input operations, making
interaction richer and less taxing.
Mid-air hand gestures. Allows for very rich and engaging interaction with 2D or
3D objects. Allows easier, more literal direct manipulation. However, mid-air
gesture can be tiring over long periods, and precision is limited.
Touch. Also very concrete and easy to understand, with the additional benefit
of having tactile feedback to touch events. However, touch is limited to 2D
interaction. It is not as flexible as mid-air gesture.
Voice. Human language is a powerful and compelling means of expression.
Voice is also useful when a user is not within range of a computer’s other
sensors. Environmental noise and social appropriateness should be
considered.
Mouse. The best modality for the accurate indication of a 2D point. Large-scale
screen movements can be made with small mouse movements.
Keyboard. Currently the best and most common modality for consistent and accurate text input. Useful for easy and reliable shortcuts.
Design Philosophy
Designing and implementing applications for the Perceptual Computing platform requires a very
different mindset than designing for traditional platforms, such as Windows* or Mac* OS X, or even
newer platforms like iOS* or Android*. When designing your app, you’ll want it to be:
Reality-inspired, but not a clone of reality. You should draw inspiration from the real-world.
Perceptual Computing builds off of our natural skills used in every-day life. Every day we use our
hands to pick up and manipulate objects and our voices to communicate. Leverage these natural
human capabilities. However, do not slavishly imitate reality. In a virtual environment, we can
relax the rules of the physical world to make interaction easier. For example, it is very difficult
for a user to precisely wrap their virtual fingers around a virtual object in order to pick it up.
With the Intel® Perceptual Computing SDK, it may be easier for a user to perform a grasp action
within a short proximity of a virtual object in order to pick it up.
Literal, not abstract. Visual cues and interaction styles built from real-world equivalents are
easier to understand than abstract symbolic alternatives. Also, symbolism can vary by geography
and culture, and doesn’t necessarily translate. Literal design metaphors, such as switches and
knobs, are culturally universal.
Intuitive. Your application should be approachable and immediately usable. Visual cues should
be built in to guide the user. Voice input commands should be based around natural language
usage, and your app should be flexible and tolerant in interpreting input.
Consistent. Similar operations in different parts of your application should be performed in
similar ways. Where guidelines for interaction exist, as described in this document, you should
follow them. Consistency across applications in the Perceptual Computing ecosystem builds
understanding and trust in the user.
Extensible. Keep future SDK enhancements in mind. Unlike mouse interfaces, the power,
robustness, and flexibility of Perceptual Computing platforms will improve over time. How will
your app function in the future when sensing of hand poses improves dramatically? How about
when understanding natural language improves? Design your app such that it can be improved
as technology improves and new senses are integrated together.
Reliable. It only takes a small number of false positives to discourage a user from your
application. Focus on simplicity where possible to minimize errors.
Intelligently manage persistence. For example, if a user’s hand goes out of the field of view of
the camera, make sure that your application doesn’t crash or do something completely
unexpected. Intelligently handle such types of situations and provide feedback.
Designed to strengths. Mid-air gesture input is very different from mouse input or touch input.
Each modality has its strengths and weaknesses—use each when appropriate.
Contextually appropriate. Are you designing a game? A medical application? A corporate
content-sharing application? Make sure that the interactions you provide match the context. For
example, you expect to have more fun interactions in a game, but may want more
straightforward interactions in a more serious context. Pay attention to modalities (e.g., don’t
rely on voice in a noisy environment).
Take user-centered design seriously. Even the best designs need to be tested by the intended users.
Don’t do this right before you plan to launch your application or product. Unexpected issues will come
up and require you to redesign your application. Make sure you know who your audience is before
choosing the users you work with.
Multimodality
As we add more to our SDK, you will have additional sensors and inputs to play with. Make sure to
design smartly—don’t use all types of input just for the sake of it, but also make sure to take advantage
of combining different input modalities both synchronously and asynchronously. This will make it a
more exciting and natural experience for the user, and can minimize fatigue of the hands, fingers, or
voice. Having a few different modalities working in unison can also inspire confidence in the user that
they are conveying the proper information. For example, use your hand to swipe through images, and
use your voice to email the ones you like to a friend. Design in such a way that extending to different
modalities and combinations of modalities is easy. Make sure that it is comfortable for the user to
switch between modalities both mentally and physically. Also keep in mind that some of your users may
prefer certain modalities over others, or have differing abilities.
Gesture Design Guidelines
In this section we describe best practices for designing and implementing mid-air hand input (gesture)
interactions.
Capture Volumes
It is important to be aware of the sensing capabilities of your platform when designing and
implementing your application. A camera has a certain field-of-view, or capture volume, beyond which it
can’t see anything. Furthermore, most depth sensing cameras have minimum and maximum sensing
distances. The camera cannot sense objects closer than the minimum distance or farther than the
maximum distance.
The capture volume of the
camera is visualized as a frustum defined by near and far planes
and a field-of-view.
The user is performing a hand gesture that is captured in the
camera’s capture volume.
The user is performing a hand gesture outside of the capture
volume. The camera will not see this gesture.
Capture volume constraints limit the practical range of motion of the user and the general interaction
space. Especially in games, enthusiastic users can inadvertently move outside of the capture volume.
Feedback and interaction must take these situations into account.
When performing gestures, the user is expected to lean back in the chair in a relaxed position. The
user's hands move around a virtual plane roughly 12 inches away from the camera. This virtual plane
serves two purposes: (a) it activates hand tracking when the user's hand is within 12 inches of
the camera; (b) the swipe gestures use the plane to distinguish between a left swipe and a right swipe.
It is also recommended that the user's head always be at least eight inches away from the user's hands.
The hand-tracking software cannot reliably distinguish a hand from a head if they are too close to each
other.
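The range and activation-plane rules above can be sketched as a simple check. This is an illustrative Python sketch, not SDK code; the constants come from the depth range stated earlier (6 inches to 3.25 feet, i.e., 15 cm to 100 cm) and the roughly 12-inch (~30 cm) virtual activation plane described here.

```python
# Illustrative sketch (not actual SDK code): checking whether a tracked
# hand position falls inside the camera's usable capture volume, using
# the documented depth range (15 cm to 100 cm) and the ~12-inch (~30 cm)
# virtual activation plane described above.

DEPTH_MIN_CM = 15.0          # camera cannot sense closer than this
DEPTH_MAX_CM = 100.0         # camera cannot sense farther than this
ACTIVATION_PLANE_CM = 30.0   # ~12 inches; hands nearer than this activate tracking

def in_capture_volume(depth_cm):
    """True if the hand is within the sensor's depth range at all."""
    return DEPTH_MIN_CM <= depth_cm <= DEPTH_MAX_CM

def hand_tracking_active(depth_cm):
    """True if the hand is close enough to cross the virtual activation plane."""
    return in_capture_volume(depth_cm) and depth_cm <= ACTIVATION_PLANE_CM
```

An application can use a check like this to decide when to warn the user that a hand has drifted out of the interaction space.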
Occlusion
For applications involving mid-air gestures, keep in mind the problem of a user's hands occluding the user's view of the screen. It is awkward if users raise a hand to grab an object on the screen but then can't see the object because their hand blocks it. When mapping the hand to screen coordinates, do so in such a way that the hand does not sit in the user's line of sight to the object being manipulated.
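One simple way to follow this advice is to offset the hand-to-screen mapping so the on-screen cursor sits above the user's physical hand. The following is a hypothetical sketch; the offset value and function are illustrative, not part of the SDK.

```python
# Hypothetical sketch of one way to honor the occlusion advice above:
# apply a fixed vertical offset when mapping the tracked hand to screen
# coordinates, so the cursor sits above the user's physical hand and the
# hand does not block the object being manipulated.

def map_hand_to_screen(hand_x, hand_y, screen_w, screen_h, y_offset_px=120):
    """Map normalized hand coordinates (0..1) to screen pixels, shifted up."""
    x = hand_x * screen_w
    y = hand_y * screen_h - y_offset_px   # lift the cursor above the hand
    # clamp so the offset never pushes the cursor off-screen
    x = min(max(x, 0), screen_w - 1)
    y = min(max(y, 0), screen_h - 1)
    return x, y
```

The right offset depends on screen size and camera placement, so treat it as a tunable parameter rather than a constant.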
High-Level Mid-Air Gesture Recommendations
For many Perceptual Computing applications, mid-air gestures will be the primary input modality.
Consider the following points when designing your interaction and when considering gesture choices:
Where possible make use of our universal gesture primitives. Introduce your own gesture
primitives only when there is a compelling reason to do so. A small set of general-purpose
natural gestures is preferable to a larger set of specialized gestures. As more apps come out,
users will come to expect certain primitives, which will improve the perceived intuitiveness.
Stay away from abstract gestures that require users to memorize a sequence or a pose.
Abstract gestures are gestures that do not have a real-life equivalent and don’t fit any existing
mental models. An example of a confusing pose is “thumbs down” to delete something. An
example of a better delete gesture is to place or throw an item in a trash can.
Poses vs. gestures. Be aware of the different types of gestures. Poses are sustained postures,
ones like clenching a fist to select an item, and dynamic gestures are those like swiping to turn a
page. Figure out which make more sense for different interactions, and be clear in
communicating which is needed at any given point.
Innate vs. learned gestures. Some gestures will be natural to the user (e.g., grabbing an object
on the screen), while some will have to be learned (e.g., waving to escape a mode). Make sure
you keep the number of gestures small for a low cognitive load on the user.
Be aware of which gestures should be actionable. What will you do if the user fixes her hair,
drinks some coffee, or turns to talk to a friend? Make your gestures specific enough to be safe in
these situations so that incidental movements do not disrupt the experience.
Relative vs. absolute motion. Relative motion allows users to reset their current hand
representation on the screen to a location more comfortable for their hand (e.g., as one would
lift a mouse and reposition it so that it is still on a mouse pad). Absolute motion preserves
spatial relationships. Applications should use the motion model that makes the most sense for
the particular context.
Design your gestures to be ergonomically comfortable. If the user gets tired or uncomfortable,
they will likely stop using your application.
Gesturing left-and-right is easier than up-and-down. Whenever presented with a choice, design
for movement in the left-right directions for ease and ergonomic considerations.
Two hands when appropriate. Some tasks, like zooming, are best performed with two hands.
Support bi-manual interaction where appropriate.
Handedness. Be aware of supporting both right- and left-handed gestures.
Flexible thresholds. Make sure your code can accommodate hands of varying sizes and amounts
of motor control. Some people may have issues with the standard settings and the application
will need to work with them. For example, to accommodate an older person with hand jitter,
the jitter threshold should be customizable. Another example is accommodating a young child
or an excitable person who makes much larger gestures than you might expect.
Teach the gestures to the users. Provide users with a tutorial for your application, or show
obvious feedback that guides them when first using the application. You could have an option to
turn this training off after a certain amount of time or number of uses.
Give an escape plan. Make it easy for the user to back out of a gesture or a mode, or reset.
Consider providing the equivalent of a traditional “home button.”
Be aware of your gesture engagement models. You may choose to design a gesture such that
the system only looks for it once the user has done something to engage the system first (e.g.,
spoken a command, made a thumbs up pose).
Design for the right space. Be aware of designing for a larger world space (e.g. with larger
gestures, more arm movement) versus a smaller more constrained space (e.g. manipulating a
single object). Distinguish between environmental and object interaction.
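The flexible-thresholds recommendation above can be sketched as a configurable dead-zone filter. This is an illustrative Python sketch, not SDK code; the threshold value is an assumption an application would expose as a user setting.

```python
# Illustrative sketch of the "flexible thresholds" advice: a configurable
# dead-zone filter. Movements smaller than jitter_threshold are ignored;
# the threshold can be raised for users with hand jitter or lowered for
# users with fine motor control. The default value is an assumption.

class JitterFilter:
    def __init__(self, jitter_threshold=0.01):
        self.jitter_threshold = jitter_threshold  # user-adjustable setting
        self.last = None

    def update(self, x, y):
        """Return a filtered position, holding still on sub-threshold moves."""
        if self.last is None:
            self.last = (x, y)
            return self.last
        lx, ly = self.last
        if abs(x - lx) < self.jitter_threshold and abs(y - ly) < self.jitter_threshold:
            return self.last          # small tremor: hold position
        self.last = (x, y)            # deliberate move: follow the hand
        return self.last
```

A real application would likely smooth rather than hold outright, but the key point stands: the cutoff should be adjustable per user.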
Recognized Poses
A pose and a gesture are two distinct things. A pose is a sustained posture, while a gesture is a
movement between poses. Here are the poses that we currently recognize as part of the SDK.
Openness
Using our SDK, you can discern between an open hand and a closed hand by looking at the
LABEL_OPEN and LABEL_CLOSE attributes, respectively.
Thumbs Up and Thumbs Down
“Thumbs up” and “thumbs down” poses can be recognized by looking at the LABEL_POSE_THUMB_UP
and LABEL_POSE_THUMB_DOWN attributes, respectively. These could be used, for example, to confirm
or cancel a verbal command.
Peace
The peace sign pose can be recognized by looking at the LABEL_POSE_PEACE attribute. This could be
used as a trigger command, for example.
Big5
The Big 5 pose can be recognized by looking at the LABEL_POSE_BIG5 attribute. Depending on the
context of the application, this pose could be used to stop some sort of action (or to turn off voice
commands, for example), or to initiate a gesture.
Universal Gesture Primitives
We have defined some gestures that are reserved for pre-defined actions. In general, these gestures
should be used only for these actions. Conversely, when these actions exist in your application, they
should generally be performed using the given gestures. Providing feedback for these gestures is critical,
and is discussed in the Visual Feedback Guidelines section. We don’t require that you conform to these
guidelines, but if you depart from these guidelines you should have a compelling user experience reason
to do so. This set of universal gestures will become learned by users as standard and will become more
expansive over time.
Partial support for these gestures exists in the SDK. Some gestures are supported in their entirety, some
are supported in a limited number of poses, and some are not yet supported. We plan to provide more
complete support as the SDK matures.
Grab and Release
The gesture for grabbing an on-screen object is shown below. The user should start with a unique pose
in order to start the sequence. The user should then have her fingers and thumb apart, and then bring
them together into the grab pose. The reverse action, moving the fingers and thumb apart, releases the
object. Limited grab and release functionality can be achieved through the "openness" parameter (a
value from 0 to 100 indicating the level of palm openness) and the fingertips (e.g., LABEL_FINGER_THUMB,
LABEL_FINGER_INDEX) exposed by the SDK. For more reliable detection, you can also detect the top,
middle, and bottom of the hand (e.g., LABEL_HAND_MIDDLE).
A user grabs an object.
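Grab and release can be derived from the openness value with a little hysteresis so the state does not flicker near a single cutoff. This is an illustrative Python sketch, not SDK code; the 30/70 thresholds are assumptions, not SDK values.

```python
# Sketch of grab/release detection from the SDK's hand "openness" value
# (0-100, as described above). The hysteresis thresholds (30/70) are
# assumptions chosen to avoid flicker near a single cutoff.

class GrabDetector:
    GRAB_BELOW = 30     # openness below this => hand considered closed
    RELEASE_ABOVE = 70  # openness above this => hand considered open

    def __init__(self):
        self.grabbing = False

    def update(self, openness):
        """Feed the latest openness reading; returns True while grabbing."""
        if not self.grabbing and openness < self.GRAB_BELOW:
            self.grabbing = True      # fingers closed: grab the object
        elif self.grabbing and openness > self.RELEASE_ABOVE:
            self.grabbing = False     # fingers opened: release the object
        return self.grabbing
```

The gap between the two thresholds is what keeps a half-open hand from rapidly grabbing and releasing.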
Move
After grabbing an object, the user moves her hand to move the object. Some of the general guidelines
for the design of basic grabbable objects are:
It should be obvious to the user which objects can be moved and which cannot be moved.
If the interface relies heavily on grabbing and moving, it should be obvious to the user where a
grabbed object can be dropped. It may be useful to provide snappable behavior.
Objects should be large enough to account for slight hand jitter.
Objects should be far enough apart so users won’t inadvertently grab the wrong object.
If the hand becomes untracked while the user is moving an object, the moved object should reset to its
origin, and the tracking failure should be communicated to the user. This functionality can be realized
through hand tracking with an openness value indicating a closed hand.
A user moves an object.
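The reset-on-tracking-loss behavior above can be sketched as a small amount of state on each movable object. This is an illustrative Python sketch under assumed names; it is not SDK code.

```python
# Sketch of the tracking-loss guideline above: while an object is being
# moved, losing hand tracking snaps the object back to where the move
# started and signals that the UI should tell the user what happened.

class MovableObject:
    def __init__(self, x, y):
        self.pos = (x, y)
        self.origin = None   # set while a move is in progress

    def start_move(self):
        self.origin = self.pos

    def move_to(self, x, y):
        self.pos = (x, y)

    def end_move(self):
        self.origin = None   # move committed: new position is final

    def on_tracking_lost(self):
        """Reset to the start of the move; caller should also show feedback."""
        if self.origin is not None:
            self.pos = self.origin
            self.origin = None
            return True      # signal that feedback should be shown
        return False
```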
Pan
If the application supports panning, this should be done using a flat hand. Panning engages once the hand is made mostly flat, and translating the flat hand pans the view. Once the hand relaxes into a natural, slightly curled pose, which can be detected via the hand openness parameter, panning ends. Note that if a single pan does not move the view far enough, the user must relax the hand, move it back, and pan again.
A user pans the view.
Zoom
If the application supports zooming, this should be done using two flat hands. Zooming engages once
both hands become mostly flat. Zooming is then coupled to the distance between the two hands (similar
to pinch-zooming for touch). Zoom functionality requires an action to disengage the zooming; otherwise,
the user cannot escape without changing the zoom.
Resizing an object is very similar. Instead of keeping both hands open, one hand grabs one side of the
object while the second hand grabs the other side. The user then moves the hands relative to one
another, either closer together to shrink the object or farther apart to grow it. Once the user
releases one hand, the resize operation ends.
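The coupling between hand distance and zoom level described above can be sketched as a ratio against the distance at the moment of engagement, much like pinch-zooming for touch. This is an illustrative Python sketch under assumed names, not SDK code.

```python
# Sketch of the two-handed zoom above: once both hands engage (become
# flat), the zoom factor is coupled to the ratio of the current distance
# between the hands to the distance at the moment of engagement.

class TwoHandZoom:
    def __init__(self):
        self.start_dist = None
        self.zoom = 1.0

    def engage(self, hand_dist):
        self.start_dist = hand_dist   # both hands just became flat

    def update(self, hand_dist):
        """Update the zoom factor from the current hand separation."""
        if self.start_dist:
            self.zoom = hand_dist / self.start_dist
        return self.zoom

    def disengage(self):
        """Explicit disengage, so the user can escape without changing zoom."""
        self.start_dist = None
```

Note how `disengage` implements the escape requirement: after it is called, further hand movement leaves the zoom level untouched.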
Wave
The gesture for resetting, escaping a mode, or moving up a hierarchy is shown below. The user quickly
waves her hand back and forth. This is a general purpose “get-me-out-of-here” gesture. You can find this
in the SDK under LABEL_HAND_WAVE.
A user waves to reset a mode.
Circle
The circle gesture, LABEL_HAND_CIRCLE, is recognized when the user extends all fingers and moves the
hand in a circle. This could be used for selection or resetting, for example.
Swipe
Swipes are basic navigation gestures. However, it is technically challenging to recognize swipes accurately: from the camera's viewpoint, there are many cases in which a left swipe looks exactly like a right swipe when the user performs multiple swipes in a row. The same applies to up and down swipes. You can find swipes in the SDK under LABEL_NAV_SWIPE_LEFT, LABEL_NAV_SWIPE_RIGHT, LABEL_NAV_SWIPE_UP, and LABEL_NAV_SWIPE_DOWN.
To avoid confusion, the user should perform the swipe gestures as follows:
Imagine there is a virtual plane about 12 inches away from the camera. The swipes must first go into the plane,
travel inside the plane from left to right or right to left, and then go out of the plane.
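The enter-travel-exit rule above can be sketched as a small classifier over a sequence of hand readings. This is an illustrative Python sketch, not the SDK's recognizer; the plane depth follows the ~12-inch figure in the text, and the travel threshold is an assumption.

```python
# Sketch of the virtual-plane swipe rule above: a swipe counts only if the
# hand enters the plane (comes within ~12 in / 30 cm of the camera),
# travels laterally while inside it, then exits. Direction comes from net
# horizontal travel while inside the plane. Thresholds are assumptions.

PLANE_DEPTH_CM = 30.0   # ~12 inches from the camera
MIN_TRAVEL = 0.15       # minimum normalized horizontal travel to count

def classify_swipe(samples):
    """samples: list of (x, depth_cm) hand readings over time.
    Returns 'left', 'right', or None."""
    inside = [x for x, d in samples if d <= PLANE_DEPTH_CM]
    if len(inside) < 2:
        return None
    # require that the hand actually exits the plane at the end
    if samples[-1][1] <= PLANE_DEPTH_CM:
        return None
    travel = inside[-1] - inside[0]
    if travel >= MIN_TRAVEL:
        return "right"
    if travel <= -MIN_TRAVEL:
        return "left"
    return None
```

Requiring the exit step is what lets the recognizer tell a left swipe from the return stroke of a right swipe when the user swipes repeatedly.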
Other Considerations
Hand Agnosticism
All one-handed gestures can be performed with either the right or left hand. For two-handed gestures where the sequence of operations matters (e.g., grabbing an object with both hands for the resize gesture), the hand choice for starting the operation does not matter.
A user circles with her hand to move to the next level of a game.
Finger Count Independence
For many gestures, the number of fingers extended does not matter. For example, the pan operation can be performed with all fingers extended, or only a few. Restrictions in finger count only exist where necessary to avoid conflict. For example, having the index finger extended could be reserved for pointing at a 2D location, in which case it can’t also be used for panning.
Flexibility in Interpretation of Pose
Hands can be in poses similar to, but slightly different from, the poses described. For example, accurate panning can be accomplished with the fingers pressed together or fanned apart.
Rate Controlled or Absolute Controlled Rotation and Translation
You can use an absolute-controlled model or a rate-controlled model for gesture-adjusted parameters such as rotation, translation (of object or view), and zoom level. In an absolute model, the magnitude of the hand's rotation or translation maps directly onto the parameter being adjusted. For example, a 90-degree rotation of the input hand results in a 90-degree rotation of the virtual object. In a rate-controlled model, the magnitude of rotation/translation maps to the rate of change of the parameter, i.e., rotational velocity or linear velocity. For example, a 90-degree rotation could map to a rate of change of 10 degrees/second (or some other constant rate). With a rate-controlled model, users release the object or return their hands to the starting state to stop the change.
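The two models can be contrasted in a small sketch (illustrative only; the default gain constant is chosen to match the 10 degrees/second example above):

```cpp
// Sketch (not SDK code) contrasting the two control models.
// handAngleDeg is the rotation of the user's hand in degrees;
// dt is the frame time in seconds.

// Absolute control: hand rotation maps directly onto object rotation.
inline float absoluteRotation(float handAngleDeg) {
    return handAngleDeg;  // 90 degrees of hand -> 90 degrees of object
}

// Rate control: hand rotation sets a rotational velocity; the object
// keeps turning until the hand returns to the neutral pose.
inline float rateControlledRotation(float currentAngleDeg, float handAngleDeg,
                                    float dt,
                                    float gainPerSecond = 10.0f / 90.0f) {
    float velocityDegPerSec = handAngleDeg * gainPerSecond;
    return currentAngleDeg + velocityDegPerSec * dt;
}
```

Absolute control feels direct but limits the range to what the wrist can comfortably reach; rate control allows unbounded rotation at the cost of directness.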
How to Minimize Fatigue
Gestural input is naturally fatiguing as it relies on several large muscles to sustain the whole arm in the
air. It is a serious problem and should not be disregarded; otherwise, users may quickly abandon the
application. By carefully balancing the following guidelines, you can alleviate the issue of fatigue as
much as possible:
Allow users to interact with elbows rested on a surface. Perhaps the best way to alleviate arm
fatigue is by resting elbows on a chair’s arm rest. Support this kind of input when possible. This,
however, reduces the usable range of motion of the hand to an arc in the left and right
direction. Evaluate whether interaction can be designed around this type of motion.
Make gestures short-lasting. Long-lasting gestures, especially ones where the arms must be
held in a static pose, quickly induce fatigue in the user’s arm and shoulder (e.g., holding the arm
up for several seconds to make a selection).
Design for breaks. Users naturally, and often subconsciously, take quick breaks (e.g., professors
writing on the blackboard). Short, frequent breaks are better than long, infrequent ones.
Do not require precise input. Users naturally tense up their muscles when trying to perform
very precise actions (much like trying to reduce camera shake when taking a picture in the dark).
This, in turn, accelerates fatigue. Allow for gross gestures and make your interactive objects
large.
Do not require many repeating gestures. If you require users to constantly move their hands in
a certain way for a long period of time (e.g., while moving through a very long list of items by
panning right), they will become tired and frustrated very quickly.
Samples and API
In the /doc folder of the SDK, you can find a file called sdksamples.pdf. This gives you examples that
show finger tracking, pose/gesture recognition, and event notification (gesture_viewer,
gesture_viewer_simple) in both C++ and C#. You can run the applications and view the source code in
the Intel/PCSDK/sample folder.
Also, in sdkmanual-gesture.pdf, you can find the most current version of the gesture module, which consumes RGB, depth, or IR streams as input and returns blob information, geometric node tracking results, pose/gesture notification, and alert notification. For an example of 2D pan, zoom, and rotate, see: http://github.com/IntelPerceptual/PerceptualP5/tree/master/PanZoomRotate For an online tutorial on close-range hand/finger tracking, see: http://software.intel.com/en-us/sites/default/files/article/328725/perc-gesturerecognition-tutorial-final.pdf
Voice Design Guidelines
In this section we describe best practices for designing and implementing voice
command and control, dictation, and text to speech for your applications. As of
now, English is the only supported language.
High-Level Voice Design Recommendations
Test your application in noisy background and different environmental spaces to ensure
robustness of sound input.
Watch out for false positives. For example, don’t let a specific sound delete a file without
verification, as this sound could unexpectedly crop up as background noise.
Always show listening status of the system. Is your application listening? Not listening?
Processing sound?
People do not speak the way they write. Be aware of pauses and interjections such as “um”
and “uh”.
Teach the user how to use your system as they use it. Give more help initially, then fade it
away as the user gets more comfortable (or have it as a customizable option).
Voice Recognition
Command Mode Vs. Dictation Mode
Be aware of the different listening modes your application will be in. Once listening, your application can be in command mode or dictation mode. Command mode is for issuing commands (e.g., “Start computer”, “Email photo”, “Volume up”). In command mode, the SDK module recognizes only phrases from a predefined list of context phrases that you have set. The developer can use multiple command lists, which we will call grammars. Good command-application design creates multiple grammars and activates the one relevant to the current application state; this limits what the user can do at any given point in time based on the active grammar. To invoke command mode, provide a grammar.
Dictation mode is for general language from the user (e.g., entering the text for a Facebook status update). Dictation mode uses a large, generic predefined vocabulary containing 50k+ common words (with some common named entities); highly domain-specific terms (e.g., medical terminology) may not be well represented. Absence of a grammar invokes the SDK in dictation mode. Dictation is limited to 30 seconds. Currently, command mode and dictation mode cannot run at the same time.
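The per-state grammar idea can be sketched as follows. This is not the SDK's API; the states, phrases, and function names are hypothetical, and in a real application the active grammar would be handed to the SDK recognizer rather than matched by string lookup:

```cpp
#include <map>
#include <set>
#include <string>

// Illustrative sketch of per-state grammars: each application state
// activates only the command list that is relevant, so the recognizer
// never has to distinguish commands the user cannot issue right now.
enum class AppState { MainMenu, Playback };

using Grammar = std::set<std::string>;

const std::map<AppState, Grammar>& grammars() {
    static const std::map<AppState, Grammar> g = {
        {AppState::MainMenu, {"start game", "show settings", "quit"}},
        {AppState::Playback, {"pause", "volume up", "volume down", "stop"}},
    };
    return g;
}

// Only phrases in the active state's grammar are accepted.
bool acceptCommand(AppState state, const std::string& phrase) {
    const Grammar& active = grammars().at(state);
    return active.count(phrase) > 0;
}
```

Keeping each grammar small both reduces recognition errors and documents exactly what the user can say in each state.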
Constructing Grammars
Keep the following points in mind when constructing your grammars:
Don’t assume that your command phrasing is natural! The language you use is very important.
Ask other people (friends, family, people on forums, study participants) how they would want
to interact with your system or initiate certain events.
Provide many different options in your grammar to demand less effort from the user and make
interaction more natural. For example, instead of constraining the user to say “Program
start”, you could also accept “Start program”, “Start my program”, “Begin program”, etc.
Complicated words/names are not easily recognized, so build your grammar from commonly
used words. However, very short words can also be difficult to recognize because of sound
ambiguity with other words.
Be aware of the length of the phrases in your grammar. Longer phrases are easier to
distinguish between, but you also don’t want users to have to say long phrases too often.
Beware of easily confusable commands. For example, “Create playlist” and “Create a list” will
likely sound the same to your application. One would be used in a media player setting, and the
other could be in a word processor setting, but if they are all in one grammar the application
could have undesired responses.
Experiment with different lengths of “end of sentence detection.” Responsiveness is important
and the end of sentence parameter (endofSentence in PXCVoiceRecognition::ProfileInfo) can
help adjust the responsiveness of the application.
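As a rough first pass at catching confusable commands, you could flag grammar phrases that are textually very close (e.g., “Create playlist” vs. “Create a list”). True confusability is acoustic, so treat this sketch as a cheap screen only; the threshold is an arbitrary choice:

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Levenshtein edit distance between two phrases.
size_t editDistance(const std::string& a, const std::string& b) {
    std::vector<size_t> prev(b.size() + 1), cur(b.size() + 1);
    for (size_t j = 0; j <= b.size(); ++j) prev[j] = j;
    for (size_t i = 1; i <= a.size(); ++i) {
        cur[0] = i;
        for (size_t j = 1; j <= b.size(); ++j) {
            size_t sub = prev[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
            cur[j] = std::min({prev[j] + 1, cur[j - 1] + 1, sub});
        }
        std::swap(prev, cur);
    }
    return prev[b.size()];
}

// Flag pairs whose edit distance is within ~25% of the longer phrase.
// Hypothetical threshold; acoustic similarity would need phonetic models.
bool likelyConfusable(const std::string& a, const std::string& b) {
    size_t longer = std::max(a.size(), b.size());
    return editDistance(a, b) * 4 <= longer;
}
```

Running such a check over every pair of phrases in a grammar at build time catches obvious collisions before users hit them.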
User Feedback
Let the user know what commands are possible. It is not obvious to the user what your
application’s current grammar is.
Let the user know how to initiate listening mode. Make your application’s listening status clear.
Let the user know that their commands have been understood. The user needs to know this to
trust the system, and know which part is broken if something doesn’t go the way they planned.
One easy way to do this is to relay back a command. For example, the user could say “Program
start”, and the system could respond by saying “Starting program, please wait”.
Give users the ability to make the system stop listening.
Give users the ability to edit/redo/change their dictation quickly. At some point it might be
easier for the user to edit their dictation with the mouse, keyboard, or touchscreen.
If you give verbal feedback, make sure it is necessary, important, and concise! Don’t overuse
verbal feedback as it could get annoying to the user.
If sound is muted, provide visual components and feedback.
How to Minimize Fatigue
Remember, the user should not have to speak constantly. Speech can be socially awkward in
public, and background noise can easily get in the way of successful voice recognition, so be
aware of speech’s best uses: dictation, and triggering discrete actions. Speech works best as a
shortcut for a multi-menu action (something that requires more than a first-level menu and a
single mouse click). To scroll down a menu, it would make more sense to use a gesture rather
than have the user repeatedly say “Down”, “Down”, “Down”.
Speech Synthesis
You can also generate speech using the built-in Nuance speech synthesis that comes with our SDK.
Currently a female voice is used for TTS.
Make sure to use speech synthesis where it makes sense. Have an alternative for people who
cannot hear well, or if speakers are muted.
Listening to long synthesized speech will be tiresome. Synthesize and speak only short
sentences.
Samples and API
You can run and view the source code for the voice_recognition and voice_synthesis projects in
Intel/PCSDK/sample.
In sdksamples.pdf, you can check out the audio_recorder sample. You can also find more information in sdkmanual-core.pdf on audio abstraction with the PXCAudio, PXCAccelerator, and PXCCapture::AudioStream interfaces. An article on how to use the SDK for Voice Recognition can be found here: http://software.intel.com/en-us/sites/default/files/article/328725/voicerecognitionhowto.pdf
Face Tracking Design Guidelines
In a future release we will provide more guidelines
for designing interactions based on face tracking, face
detection, and face recognition. Stay tuned!
High-Level Recommendations
Smiles and winks can currently be detected; more expressions will be added to the SDK in the
future. Natural expressions in front of a computer are difficult to detect, so users should be
prompted to show exaggerated expressions.
Give feedback to the user to make sure they are at a typical working distance from the
computer, for optimal feature detection.
Give feedback to the user about any orientation or lighting issues; provide error messages or
recommendations.
For optimal tracking, have ambient light or light facing the user’s face (avoid shadows).
Try to make the interface background as close to white as possible (the screen can then serve
as a second light source to ensure a good reading of the face).
Notify the user if they are moving too fast to properly track facial features.
Samples and API
Check out the face_detection and landmark_detection samples (in Intel/PCSDK/sample and
discussed in sdksamples.pdf) to run the applications and see the source code.
You can also find more information in sdkmanual-face.pdf. An article on how to use the Face Detection module can be found here: http://software.intel.com/en-us/articles/intel-perceptual-computing-sdk-how-to-use-the-face-detection-module
Visual Feedback Guidelines
You’ll want your Perceptual Computing application to appear and behave very differently from a
traditional desktop PC style application. Familiar concepts, such as cursors, clicking, icons, menus, and
folders, don’t necessarily apply to an environment in which gesture and voice are the primary
interaction modalities. In this section we provide design guidelines for developing your application to
visually conform to the Perceptual Computing interaction model.
High-Level Recommendations
Don’t have a delay between the user’s input (whether it’s gesture, voice, or anything else) and
the visual feedback on the display.
Smooth movements. Apply a filter to the user’s movements if necessary to prevent jarring
visual movements.
Combine different kinds of feedback. This can convince the user that the interactions are more
realistic. Stay tuned to the next version of this manual for more advice on how to deal with
audio feedback.
Show what is actionable. You don’t want the user trying to interact with something that they
can’t interact with.
Show the current state of the system. Is the current object selected? If so, what can you do to
show this visually? Ideas include using different colors, tilting the object, orienting the object
differently, and changing object size.
Show recognition of commands or interactions. This will let the user know they are on the right
or wrong track.
Show progress. For example, you could show an animation or a timer for short timespans.
Consider physics. Think about the physics that you want to use to convey a more realistic and
satisfying experience to the user. You could simulate magnetic snapping to an object to make
selection easier, for example. While the user is panning through a list, you could accelerate the
list movement and slow it down after the user has finished panning.
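The smoothing advice above can be sketched as a simple exponential filter (illustrative; the alpha constant is a made-up tuning value). Note the trade-off with the no-delay guideline: heavier smoothing adds lag, so keep alpha as high as the jitter allows.

```cpp
// Minimal sketch of "apply a filter": an exponential moving average
// smooths jittery hand positions before they drive the on-screen cursor.
// alpha closer to 1 = more responsive; closer to 0 = smoother but laggier.
struct SmoothedCursor {
    float x = 0.0f, y = 0.0f;
    bool initialized = false;
    float alpha = 0.25f;  // hypothetical tuning value

    void update(float rawX, float rawY) {
        if (!initialized) { x = rawX; y = rawY; initialized = true; return; }
        // Move a fraction of the way toward the raw sample each frame.
        x += alpha * (rawX - x);
        y += alpha * (rawY - y);
    }
};
```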
Representing the User
A user must be represented in the virtual world. The user embodiment allows the user to interact with
elements in the scene. In traditional environments this embodiment is a mouse cursor. In a Perceptual
Computing environment, the representation of the user should reflect the modalities used to interact
and the nature of the application in question. Typically, where hand gestures are used, a representation
of the hands should be shown on the screen. The hand representation depends on the application. In a
magic game, the user may be represented as a glowing wand held by a hand. In a 3D modeling
application, the user may be represented by an articulated hand model. You could have the cursor be a
static object, or also have the cursor change orientation, size, or color depending on the movement or
depth of the user’s hands.
The hand representation should be neither very realistic nor very simplistic. A very realistic hand risks
the “uncanny valley” effect1, which would disturb users. An overly simplistic hand will be inadequate to
communicate the complex state of the hand and risks being too close to a cursor.
If head location is relevant to interaction (e.g. you are using face-tracking) a representation of the head
may need to be incorporated. Similar rules hold for other modalities.
Hand representation consisting of an articulated
hand model. This would be appropriate for applications involving direct object manipulation.
Hand representation consisting of a magic wand. This would be appropriate for a magic game.
Hand representation when tracking has failed. The user is told that tracking has failed, so they know to
act to fix tracking.
1 The uncanny valley is a hypothesis in the field of robotics and 3D computer animation which holds that when
human replicas look and act almost, but not perfectly, like actual human beings, it causes a response of revulsion among human observers. The valley refers to the dip in the graph of the comfort level of humans as a function of a robot’s human likeness [Wikipedia 2013].
Sensor limitations can result in cases where the user is not being tracked. For example, a user may be
too far from the camera, or may have moved to the side and is out of the view of the camera. Users
often don’t understand when this has happened. Your application should tell users when tracking has
failed, why tracking has failed, and what they can do to correct the situation. This feedback can be
incorporated into the design of the user representation (e.g., showing the relation between the user and
the interaction bounding box/camera field of view visually). Other measures can be taken when tracking
fails. In a game, where lost tracking can result in the user losing the game, the action can be dramatically
slowed down until tracking is re-established.
In general, you should recognize the limitations of the sensors and ensure that the experiences you are
trying to create work intelligently with the technology you currently have. For example, it would be
poor design to build an interaction that, in real life, would require fast, sensitive tracking when your
tracking only supports slower motion. You may want to modify the interaction and visual
representations to work within the current abilities of the technology.
Representing Objects
The ideal representation of objects in the scene is influenced greatly by the method in which we interact
with them. In a Perceptual environment, we are able to interact much more richly with objects. We can
push, grab, twist, or stretch them. This is much more than can be done with a mouse. On the other
hand, a hand has much less precision than a mouse. The representation of objects should reflect these
realities. Objects in your application should:
Take advantage of the rich manipulation abilities of the human hand
Convey visibly the interactive possibilities, so users can understand what can be done
Be of a size that can be manipulated easily
Not demand a degree of precise manipulation that results in a large number of errors or a large
amount of fatigue
Gestural Actions on Objects
Some action states to consider while interacting with objects in your application may include:
Targeting
Hovering
Selecting
Dragging
Releasing
Resizing
Rotating
2D vs. 3D
A Perceptual Computing graphical application can be shown either within a 2D or 3D interactional
environment. 2D environments are easier to understand and navigate on a 2D display, so should be
used when there isn’t a compelling need for a 3D environment. When using gesture to interact with a
2D environment, however, consider using subtle 3D cues to enhance interaction. For example, a
grabbed object can be made slightly larger with a drop shadow, to indicate it has been lifted off the 2D
surface. Full 3D environments play to many of the strengths of a Perceptual environment, and should be
used when the use case demands it. Some applications, especially games, benefit from operating in 3D.
Traditional UI Elements
The interactive elements in a primarily gesture-driven interface are different from those in a primarily
mouse-driven environment. This section suggests some of the more traditional UI elements for use with
mid-air gesture. These can be useful for clarity and efficiency, and many users are familiar with these
models. Of course, it isn’t good to just rely on what people are accustomed to if there are better
solutions, but don’t discount some of the UI elements people are already using.
Horizontal Lists
Horizontal lists can be good because they rely on the more natural left-right motion with the right hand.
A welcome improvement to linear lists is presenting choices on a slight arc, which allows the user to
make a choice while resting their elbow on a hard surface. Note, however, that this approach is
handedness-dependent. A left-handed user might not find it comfortable. Consider accommodating left-
handed users by optionally mirroring the interface.
Example of a horizontal list sweep.
Radial Lists
Radial lists (also known as pie menus) are useful, especially for gestural input: they are less error-prone
because the distance the user must traverse to reach any option is short, and the user doesn’t have to
aim precisely to select an option. They can also take up less space than linear lists. When constructing
radial lists, maximize the selectable area for each option by making the whole “slice” of the list
selectable.
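Slice selection can be sketched as a pure angle-to-index mapping, which is what makes the whole slice selectable regardless of how far the hand is from the menu center (illustrative code; the dead-zone radius is a made-up value):

```cpp
#include <cmath>

// Map the hand's offset (dx, dy) from the radial menu's center to a slice
// index. Distance from the center is ignored beyond a small dead zone, so
// the entire slice is selectable. Returns -1 when nothing is selected.
int selectedSlice(float dx, float dy, int sliceCount, float deadZone = 0.05f) {
    if (std::sqrt(dx * dx + dy * dy) < deadZone) return -1;
    float angle = std::atan2(dy, dx);              // -pi..pi
    const float pi = 3.14159265358979f;
    float normalized = (angle + pi) / (2.0f * pi); // 0..1 around the circle
    int index = static_cast<int>(normalized * sliceCount) % sliceCount;
    return index;
}
```

Because only the angle matters, a user can make a coarse, low-precision motion toward a slice and still select it reliably.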
Sliders
Typically, sliders are used for adjusting values within a given range. You may want to use a slider for
absolute panning instead of using relative panning depending on your application. Follow these
guidelines:
Create discrete sliders as opposed to continuous ones. Gestural input lacks the fidelity required
to make fine selections without inducing fatigue.
Try to keep sufficient distance between “steps” to avoid demanding too much precision on the
part of the user.
The top slider has fewer steps, allowing the user to easily select the one they want using mid-air gesture.
The numerous steps on the lower slider make it much harder to select the desired value.
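Discrete snapping can be sketched as follows (illustrative; the normalized 0..1 position would come from your hand-tracking code, and the step count is whatever your application needs):

```cpp
#include <algorithm>
#include <cmath>

// Snap a continuous slider position (0.0..1.0) to the nearest of a small
// number of discrete steps, so the user never has to hold a precise pose.
float snapToStep(float position, int steps) {
    position = std::clamp(position, 0.0f, 1.0f);
    float stepIndex = std::round(position * (steps - 1));
    return stepIndex / (steps - 1);
}
```

Fewer steps mean larger targets, matching the guideline above about keeping sufficient distance between steps.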
Integrating the Keyboard, Mouse, and Touch
Don’t ignore the mouse, keyboard, and touchpad or touchscreen. People are used to these form factors,
and each has their own specialized purpose. Often, it makes much more sense to type in information
using the keyboard, rather than using an onscreen keyboard (although in some situations, like when the
user only has to input a few letters, using gesture makes sense). Keys can still be used as failsafe
shortcuts or escapes. To find a very precise 2D location, the mouse and touchscreen can still be very
useful and efficient.
Example of a radial list with “paste” currently selected.
Questions or Suggestions?
This document provides guidelines that are rooted in many years of research in human-computer
interaction, user interface design, and multi-modal input. However, if you feel that certain guidelines do
not fit your use case or you have proposals for modifications or additions, please post to the forum
thread “Human Interaction Guidelines-Questions and Suggestions,” and we will be happy to discuss the
issues with you.
Other helpful information:
Our website:
http://intel.com/software/perceptual
For information and updates on the SDK, follow us on Twitter at:
@PerceptualSDK
All manuals mentioned in this document that were downloaded with the SDK are also available
online: http://software.intel.com/en-us/articles/intel-perceptual-computing-sdk-manual-page
Check out our tutorials!
http://software.intel.com/en-us/articles/intel-perceptual-computing-sdk-tutorials
Check out our github repository:
http://github.com/IntelPerceptual
We also have a social hub where you can find links to our videos and connect with us on
Facebook, Twitter, and Google+ :
http://about.me/IntelPerceptual
Frequently Asked Questions
http://software.intel.com/articles/perc-faq
And last but not least, participate in our Intel® Developer Zone Intel® Perceptual Computing SDK
forum to share information with fellow developers and ask questions.
http://software.intel.com/en-us/forums/intel-perceptual-computing-sdk